What is a RAG system? Retrieval-Augmented Generation (RAG) is an AI architecture that improves LLM output accuracy by retrieving relevant documents from an external knowledge base at inference time, rather than relying solely on the model's training data. A RAG system has three core components: a document store, a retrieval layer (typically embedding + vector search), and a generation layer (the LLM).
TL;DR: Retrieval quality is a product problem, not a model problem. Most RAG failures happen before the LLM sees any text. Fix chunking, add eval harnesses early, and use hybrid search.
Most RAG tutorials show you how to chunk a PDF, embed it, and run a similarity search. That part is not hard. What they skip is everything that happens when the retrieval goes wrong — which it will, regularly, in production.
We've built RAG systems for two production domains with fundamentally different failure modes: legal document research (Kodus Legal) and construction blueprint analysis (Cimenta). Here's what we learned building both to production in 2025–2026.
Chunking is a product decision, not a technical one
How you chunk determines what you can retrieve. Splitting by token count is naive. Legal documents have logical sections — clauses, definitions, exhibits — that should never be split mid-thought. Blueprints have spatial context that doesn't survive naive text extraction at all.
Before you write a chunking function, you need to understand the document structure and the query patterns. What is the user actually asking? Is it a lookup ("what does section 4.2 say") or a synthesis ("what are all the indemnification clauses")? Those require different chunking strategies.
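To make the contrast with token-count splitting concrete, here is a minimal sketch of chunking along section boundaries. The heading pattern, the `Chunk` type, and the sample text are assumptions for illustration only, not the chunker used on either project. Real contracts need a pattern derived from their own structure.

```python
import re
from dataclasses import dataclass

# Assumed heading format: a line that starts with a section number,
# e.g. "4.2 Indemnification". Body lines that happen to start with a
# number would also match, so a real pattern needs to be stricter.
SECTION_HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$", re.MULTILINE)


@dataclass
class Chunk:
    section_id: str
    title: str
    text: str


def chunk_by_section(document: str) -> list[Chunk]:
    """Split a document at numbered section headings, keeping each section whole.

    Text before the first heading is ignored in this sketch.
    """
    matches = list(SECTION_HEADING.finditer(document))
    chunks = []
    for i, match in enumerate(matches):
        # A section's body runs from the end of its heading to the next heading.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        body = document[match.end():body_end].strip()
        chunks.append(Chunk(match.group(1), match.group(2).strip(), body))
    return chunks


SAMPLE = """4.1 Definitions
"Losses" means any damages, costs, or expenses.

4.2 Indemnification
The Supplier shall indemnify the Client against all Losses.
This obligation survives termination.
"""

chunks = chunk_by_section(SAMPLE)
for chunk in chunks:
    print(chunk.section_id, chunk.title, len(chunk.text))
```

Keeping the section ID on each chunk serves lookup queries directly ("what does section 4.2 say"), while synthesis queries can still retrieve several whole sections and combine them.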
Retrieval quality degrades silently
This is the hard part. When retrieval fails, the model often still returns a confident-sounding answer — it just uses the wrong context, or hallucinates the missing pieces. You won't know unless you're measuring it.
Build an eval set early. For every domain, collect 20–50 representative queries with known correct answers. Run them on every retrieval change. A two-point drop in your eval score is a signal worth investigating before you push to production.
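A harness for this can be very small. The sketch below assumes hit rate at k as the metric (the share of queries whose known-correct chunk appears in the top k results) and reads "two points" as 0.02 on that scale; both are assumptions, as are the placeholder corpus, the queries, and the toy keyword retriever standing in for a real retrieval layer.

```python
from typing import Callable

# Placeholder corpus and eval cases, invented for illustration.
CORPUS = {
    "4.1": "losses means any damages costs or expenses",
    "4.2": "the supplier shall indemnify the client against all losses",
    "7.3": "either party may terminate with thirty days notice",
}

EVAL_SET = [
    {"query": "who indemnifies the client", "expected_id": "4.2"},
    {"query": "what damages count as losses", "expected_id": "4.1"},
]


def keyword_retriever(query: str, k: int) -> list[str]:
    """Stand-in retriever: rank chunks by how many query words they contain."""
    words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda chunk_id: len(words & set(CORPUS[chunk_id].split())),
        reverse=True,
    )
    return ranked[:k]


def hit_rate_at_k(
    retrieve: Callable[[str, int], list[str]], eval_set: list[dict], k: int
) -> float:
    """Fraction of queries whose expected chunk appears in the top-k results."""
    hits = sum(
        1 for case in eval_set if case["expected_id"] in retrieve(case["query"], k)
    )
    return hits / len(eval_set)


def regressed(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """True when the score dropped by more than the tolerance (assumed 0.02)."""
    return baseline - current > tolerance


score = hit_rate_at_k(keyword_retriever, EVAL_SET, k=1)
print(f"hit rate@1: {score:.2f}")
```

Running `hit_rate_at_k` in CI against a stored baseline, and failing the build when `regressed` returns true, is one way to make silent degradation loud.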
Hybrid search is almost always worth it
Pure semantic search misses exact matches. Pure keyword search misses conceptual similarity. In practice, combining the two outperforms either alone across the domains we've tested: BM25 for recall, embeddings for semantic relevance, and reciprocal rank fusion (RRF) or a learned reranker to merge the results.
The overhead is modest. The improvement in edge-case retrieval is significant.
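To show how little code the merge step needs, here is a minimal sketch of reciprocal rank fusion. It assumes the keyword and vector searches have already run and each returned a ranked list of document IDs; the IDs are hypothetical, and `k=60` is the commonly used default constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a BM25 index and a vector index for the same query.
bm25_ranking = ["clause_12", "clause_4", "clause_9"]
vector_ranking = ["clause_4", "clause_7", "clause_12"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['clause_4', 'clause_12', 'clause_7', 'clause_9']
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.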
The model's job is synthesis, not retrieval
Once you have good retrieval, give the model clean, well-structured context. Don't dump 8,000 tokens of raw document text and ask it to figure out what's relevant. That's the retrieval layer's job. By the time context reaches the model, it should be the right stuff, in a readable format, with clear provenance.
This discipline also helps with latency and cost. Smaller, better-targeted prompts are faster and cheaper than large, unfocused ones.
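One way to enforce that discipline is to assemble the prompt context under an explicit budget. The sketch below is illustrative: the `RetrievedChunk` type, the `[Source: ...]` label format, and the four-characters-per-token estimate are all assumptions, and a real system would use its model's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    source: str   # document name plus section, kept for provenance
    text: str
    score: float  # retrieval score, higher is better


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def build_context(chunks: list[RetrievedChunk], token_budget: int) -> str:
    """Keep the best-scoring chunks that fit the budget, each labeled with its source."""
    blocks = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        block = f"[Source: {chunk.source}]\n{chunk.text.strip()}"
        cost = estimate_tokens(block)
        if used + cost > token_budget:
            continue  # skip any chunk that would push the context over budget
        blocks.append(block)
        used += cost
    return "\n\n".join(blocks)


# Hypothetical retrieval results for a single query.
retrieved = [
    RetrievedChunk("MSA s7.3", "Either party may terminate with thirty days notice.", 0.40),
    RetrievedChunk("MSA s4.2", "The Supplier shall indemnify the Client against all Losses.", 0.91),
]

context = build_context(retrieved, token_budget=25)
print(context)
```

With a budget of 25 estimated tokens, only the higher-scoring chunk survives, and its source label travels with it so the model can cite where an answer came from.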
What we'd do differently
On Kodus Legal, we underestimated how much domain-specific vocabulary would affect embedding quality. Legal language is precise in ways that general-purpose embeddings handle poorly. Fine-tuning on legal corpora, or at minimum using an embedding model specialized for legal text, would have improved retrieval quality from day one.
We fixed it in V2. The lesson: embedding model selection is a product decision too.
Key takeaways for building a RAG system in production
Q: What is the most common reason RAG systems fail in production? Silent retrieval degradation — the model returns confident answers using wrong or missing context. Without an eval set, you won't catch this until users report it.
Q: Should I use semantic search or keyword search for RAG? Use both. Hybrid search combining BM25 (keyword recall) and vector embeddings (semantic relevance), merged with RRF or a learned reranker, consistently outperforms either approach alone. The implementation overhead is low relative to the retrieval quality gain.
Q: How should I chunk documents for a RAG pipeline? Chunk along logical document boundaries, not token counts. For legal text, respect clause and section structure. For technical documents, keep related concepts together. The right chunking strategy depends on the query patterns your users will have.
Q: What eval approach works for RAG? Build a domain-specific eval set of 20–50 representative queries with known correct answers before you write your first chunking function. Run it on every retrieval change. A 2-point drop in eval score warrants investigation.
If you're building a RAG system and want a second opinion on your architecture before it hits production, we offer AI product development services — including technical design reviews and end-to-end builds.