The tutorial version

Every RAG tutorial tells the same story: chunk your documents, embed them, store the vectors, retrieve the top-k most similar chunks at query time, stuff them into a prompt, and let the LLM generate an answer. It works beautifully with a 10-page PDF and five sample questions.

Then you try it with 50,000 documents, 200 concurrent users, an SLA of "answers must be accurate," and a CEO who heard that AI hallucinates and wants to know what you're doing about it.

The gap between tutorial RAG and production RAG is enormous. Here's what fills it.

Chunking is the whole game

The most consequential decision in a RAG system isn't which embedding model you use or which vector database you pick. It's how you chunk.

Bad chunking creates two failure modes that look completely different but share a root cause. Chunks that are too small retrieve fragments stripped of the context needed to interpret them; chunks that are too large bury the one relevant passage in unrelated text, dragging down retrieval precision. Either way, the split landed in the wrong place.

In production, I've found that semantic chunking — splitting on actual content boundaries (sections, paragraphs, topic shifts) rather than fixed token counts — makes a measurable difference. It's more expensive to implement and harder to debug, but retrieval precision goes up 15-20% in my experience.

The other thing nobody tells you: you'll need different chunking strategies for different document types. A legal contract should be chunked by clause. A technical manual should be chunked by section. A support ticket history should be chunked by conversation turn. One-size-fits-all chunking is why most production RAG systems underperform.
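To make the idea concrete, here is a minimal sketch of content-boundary chunking: split on paragraph breaks, then greedily pack whole paragraphs up to a size budget so no chunk ever cuts a paragraph in half. This is a deliberately simplified illustration (word count stands in for token count, and the function name is my own), not a production implementation.

```python
import re

def semantic_chunks(text: str, max_words: int = 200) -> list[str]:
    """Greedy paragraph packing: split on blank lines, then pack
    whole paragraphs into chunks up to a word budget. Paragraphs
    are never split, so every chunk ends on a content boundary."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A real system would swap the blank-line split for structure-aware boundaries per document type (clauses, sections, conversation turns) and a proper tokenizer, but the invariant is the same: chunk edges fall on content edges.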

Retrieval quality is invisible until it fails

Here's the insidious thing about RAG failures: they're silent. The system retrieves the wrong chunks, the LLM generates coherent text from those wrong chunks, and the user gets a confident, well-written, completely wrong answer.

You need two things to catch this:

1. Retrieval evaluation, separate from generation evaluation

Most teams evaluate RAG end-to-end: "Did the final answer match the expected answer?" This tells you nothing about why things fail. Was it the retrieval (wrong chunks selected) or the generation (right chunks, but bad synthesis)?

In every production RAG system I've built, I log the retrieved chunks alongside the final answer, and I run offline evaluations on retrieval independently: against a golden set of queries labeled with the chunks that should come back, did the right chunks appear in the top-k (recall@k), and how high did the first relevant one rank (MRR)?

If your retrieval metrics are bad, no amount of prompt engineering will save you. Fix retrieval first.
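Retrieval-only evaluation is small enough to sketch in full. This assumes each golden-set query is labeled with the set of chunk ids that should be retrieved; recall@k and MRR are the standard metrics for that shape of data.

```python
def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)

def mrr(relevant: set[str], retrieved: list[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none was retrieved)."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0
```

Averaged over the golden set, these two numbers tell you whether a bad answer was retrieval's fault before you touch a single prompt.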

2. Continuous monitoring, not one-time testing

Retrieval quality degrades over time in ways that are hard to predict: new documents shift the corpus distribution, existing documents get edited while their embeddings go stale, and the queries users actually ask drift away from the ones you tested against.

I run weekly automated retrieval evaluations against a maintained golden test set. It's the kind of work nobody wants to do, and it's the reason our RAG systems stay accurate for months instead of weeks.
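A weekly eval only helps if a regression triggers action. The check itself is trivial; a sketch, assuming you keep a history of weekly retrieval scores (the 0.05 tolerance is an illustrative choice, not a recommendation):

```python
def quality_regressed(history: list[float], current: float,
                      tolerance: float = 0.05) -> bool:
    """Flag a regression when the current weekly retrieval score drops
    more than `tolerance` below the mean of recent runs."""
    if not history:
        return False
    baseline = sum(history) / len(history)
    return current < baseline - tolerance
```

Wire this to an alert, not a dashboard nobody opens.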

The context window is not your friend

"Context windows are huge now — just stuff everything in!" I hear this constantly. It's a trap for three reasons:

Cost. You pay for every token you send, so a stuffed context window costs proportionally. If you're stuffing 20 chunks of 500 tokens each into every query, and you're handling 10,000 queries a day, that's real money. Most of those tokens are irrelevant padding.

Lost in the middle. Research (and my production experience) confirms that LLMs pay disproportionate attention to the beginning and end of the context. Information in the middle gets underweighted. If your most relevant chunk lands at position 8 of 15, the model may functionally ignore it.

Noise hurts more than absence. Adding marginally relevant context doesn't just waste tokens — it actively degrades output quality. The model has to distinguish signal from noise, and it's not great at it. I've seen cases where reducing from top-10 to top-3 retrieval improved answer accuracy because the LLM stopped getting confused by semi-relevant chunks.
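One mitigation for the middle-blindness problem is to assemble context deliberately rather than in raw retrieval order. A sketch of my own illustration (not a standard API): given chunks already sorted best-first, interleave them so the strongest land at the edges of the prompt and the weakest in the middle.

```python
def order_for_attention(chunks_by_score: list[str]) -> list[str]:
    """Interleave chunks (already sorted best-first) so the strongest
    land at the edges of the prompt and the weakest in the middle:
    ranks 1, 3, 5, ... from the front, ranks 2, 4, 6, ... from the back."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

Cutting top-k aggressively, as described above, remains the bigger lever; reordering is a cheap second one.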

Hybrid search isn't optional

Pure semantic search (embeddings only) fails on a specific, common class of queries: exact-match questions. When a user asks "What is the cancellation policy for contract #A-4821?", semantic search might return chunks about cancellation policies in general, but miss the specific contract because "A-4821" doesn't have meaningful semantic content.

In production, I always run hybrid search: semantic retrieval plus keyword/BM25 retrieval, with a reranker to merge and score the results. The keyword path catches the exact matches that embeddings miss. The semantic path catches the conceptual matches that keywords miss. The reranker gives you the best of both.

This isn't a nice-to-have. Without hybrid search, your RAG system will fail on roughly 20-30% of real production queries in my experience — the ones that reference specific identifiers, codes, names, or technical terms.
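Merging the two result lists doesn't require a trained reranker on day one; reciprocal rank fusion is a common first cut. A sketch, assuming both retrievers return ranked lists of document ids (k=60 is the conventional constant, which damps the influence of any single list's top ranks):

```python
def rrf_merge(semantic: list[str], keyword: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each doc id by the sum of
    1/(k + rank) over every ranked list it appears in, then sort.
    Docs that appear in both lists get boosted toward the top."""
    scores: dict[str, float] = {}
    for ranking in (semantic, keyword):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A cross-encoder reranker over the fused list is the usual upgrade once you have traffic to justify the latency.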

Metadata filtering: the underrated superpower

If your vector database supports metadata filtering (and any production-grade one does), use it aggressively. Don't just search "all documents." Filter first, search within the filtered set.

Examples from real systems: filter by tenant or business unit before searching, filter by document type when the query implies one (contracts vs. support tickets), filter by date range for anything time-sensitive, and always filter by the requesting user's permissions.

Metadata filtering reduces the search space, which improves both speed and precision. It also enforces access control — a user should only retrieve documents they're authorized to see. This is a security boundary, not just an optimization.
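What filter-then-search means mechanically, as a brute-force sketch over an in-memory chunk list (a real vector database pushes the filter into the index; the field names here are illustrative):

```python
def filtered_search(chunks, query_vec, *, tenant, doc_type=None, top_k=5):
    """Filter-then-search: restrict to chunks the caller is allowed
    to see (tenant is a hard security boundary), optionally narrow
    by document type, then rank only the survivors by similarity."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    allowed = [c for c in chunks
               if c["tenant"] == tenant
               and (doc_type is None or c["doc_type"] == doc_type)]
    allowed.sort(key=lambda c: dot(c["vec"], query_vec), reverse=True)
    return allowed[:top_k]
```

The key property: an unauthorized chunk can never appear in results, because it never enters the similarity ranking at all.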

The generation layer

Once retrieval is solid, the generation side has its own production concerns:

Prompt versioning

Your system prompt will change. A lot. You'll tweak it to reduce hallucination, adjust the tone, add guardrails, handle new edge cases. In production, you need to know which prompt version generated which output. Version your prompts like code. Roll back when something regresses.
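A minimal sketch of what "version your prompts like code" can look like when prompts live in the codebase (the registry, version names, and fingerprint scheme are all illustrative):

```python
import hashlib

# Versioned prompt registry; old versions stay around for rollback
# and for interpreting old logs. (Illustrative contents.)
PROMPTS = {
    "answer_v3": "Answer only from the provided context. Cite sources.",
    "answer_v2": "Answer from the provided context.",
}

def render(version: str, context: str, question: str) -> tuple[str, str]:
    """Return the full prompt plus a fingerprint to log with the output,
    so every answer can be traced to the exact prompt that produced it."""
    template = PROMPTS[version]
    prompt = f"{template}\n\nContext:\n{context}\n\nQuestion: {question}"
    fingerprint = f"{version}:{hashlib.sha256(template.encode()).hexdigest()[:8]}"
    return prompt, fingerprint
```

The hash catches the failure mode where someone edits "v3" in place without bumping the version: the fingerprint changes even when the name doesn't.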

Citation and grounding

"The AI said it" is not acceptable to enterprise customers. Every answer needs to cite its sources — which chunks it used, from which documents. This lets users verify. It also makes debugging trivial: when an answer is wrong, you can immediately see whether the problem was retrieval (wrong sources) or generation (right sources, wrong interpretation).

Fallback behavior

What does the system do when it doesn't know? The default LLM behavior is to generate something plausible. In production, you need it to say "I don't have enough information to answer this" — and you need it to do so reliably. This requires explicit instructions in the prompt, confidence thresholds on retrieval scores, and testing specifically for this behavior.
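The retrieval-score threshold is the part teams most often skip, so here is a sketch of it. This assumes the retriever returns (text, score) pairs and `generate` is the caller's LLM call; the 0.35 cutoff is illustrative and must be tuned per embedding model and corpus.

```python
REFUSAL = "I don't have enough information to answer this."

def answer_or_refuse(retrieved: list[tuple[str, float]],
                     generate, min_score: float = 0.35) -> str:
    """Refuse before generating when retrieval confidence is too low:
    if no chunk clears the score threshold, there is nothing to ground
    an answer in, so don't let the model improvise one."""
    grounded = [text for text, score in retrieved if score >= min_score]
    if not grounded:
        return REFUSAL
    return generate(grounded)
```

Refusing before the LLM ever runs is cheaper and more reliable than asking the model to notice its own lack of grounding, though you still need the prompt-level instruction as a second layer.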

What a production RAG stack actually looks like

For reference, here's the real architecture of a production RAG system I run today:

  1. Ingestion pipeline: Document processing → cleaning → semantic chunking → embedding → vector store + metadata index
  2. Query pipeline: Query understanding → metadata filter → hybrid retrieval (semantic + BM25) → reranking → context assembly → LLM generation → citation extraction
  3. Evaluation layer: Retrieval metrics, generation metrics, golden test sets, weekly automated evals, drift detection
  4. Observability: Per-query logging (query, retrieved chunks, generated answer, latency, cost, user feedback), dashboards, alerting on quality regression
  5. Governance: Prompt versioning, model version tracking, access control via metadata filtering, audit trail

This isn't overengineering. This is what it takes to run RAG that an enterprise trusts. The tutorial version is items 1 and 2, without the nuances. The production version is all five.


RAG is the most deployed LLM pattern in production today, and it's also the most commonly deployed badly. The core idea is sound. The execution just requires a lot more engineering than the tutorials suggest.

If you're building RAG and it "mostly works," that's exactly the state you should be worried about. "Mostly works" means silently wrong answers reaching users who trust your system. The difference between a RAG demo and a RAG product is the infrastructure to catch, measure, and fix the failures.