Technical Deep-Dive · 13 min read · March 2026
AI Agent Framework Specialists

Building a Production RAG Pipeline That Actually Doesn't Hallucinate

The technical blueprint for a RAG system that holds up in production: chunking strategies, hybrid search, reranking, citation patterns, and evaluation with Ragas and TruLens.

Why Naive RAG Hallucinates (And It's Not the LLM's Fault)

Most RAG hallucinations don't originate in the language model — they originate in the retrieval layer. The LLM is faithfully answering the question given the context it received. The problem is that the context was wrong, incomplete, or irrelevant. Three failure modes account for the majority of production RAG hallucinations:

1. Retrieval miss: the correct information exists in your corpus but wasn't retrieved — either because the query embedding didn't match the document embedding closely enough (semantic gap), or because the relevant passage was fragmented across chunk boundaries during ingestion. The LLM, receiving no relevant context, invents a plausible-sounding answer rather than saying it doesn't know.
2. Context window dilution: you retrieved 20 chunks, most of which are tangentially related but not directly relevant. The LLM averages over this noisy context and produces an answer that's partly grounded and partly confabulated.
3. Conflicting retrieved content: if your corpus has documents that contradict each other (a policy that was updated but the old version wasn't deleted), the LLM may blend the two, producing an answer that matches neither source accurately.

Fixing RAG hallucinations is therefore a retrieval engineering problem, not a prompt engineering problem. The fixes are: better chunking (so relevant information isn't split across boundaries), better retrieval (so the right chunks are actually returned), and better filtering (so irrelevant and conflicting chunks are excluded before the LLM sees them).

Chunking Strategies: Fixed, Semantic, and Hierarchical

Chunking is where most RAG implementations go wrong. The default — split every document into 512-token chunks with a 50-token overlap — is a reasonable starting point but fails for structured documents, technical content, and anything with cross-referential density.

Fixed-size chunking with overlap is appropriate for uniform, dense text (legal boilerplate, product documentation with consistent structure) where information density is roughly even and cross-paragraph references are rare. The overlap handles boundary fragmentation at the cost of redundant embedding computation.

Semantic chunking uses an embedding model to detect topic boundaries: compute sentence embeddings, then split where cosine similarity between adjacent sentences drops below a threshold. This produces variable-length chunks that correspond to semantic units rather than token counts — significantly better for mixed-content documents. The trade-off is slower ingestion and the need to tune the similarity threshold per corpus type.

Hierarchical (or parent-document) chunking is the most powerful for retrieval quality: ingest documents at two levels of granularity simultaneously. Small chunks (128–256 tokens) are used for embedding and retrieval — their smaller size makes them more semantically specific, improving retrieval precision. Large chunks (the full section or parent document) are what you actually send to the LLM context. When a small chunk is retrieved, you look up its parent and use that as the context. This gives you the retrieval precision of small chunks with the coherence and completeness of large context windows. LangChain implements this natively with its ParentDocumentRetriever; LlamaIndex offers the same pattern through its HierarchicalNodeParser and AutoMergingRetriever. For technical documentation and multi-page PDFs, hierarchical chunking reduces hallucination rates by 25–40% compared to fixed chunking in controlled evaluations.
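The parent lookup at the heart of hierarchical chunking is easy to implement even without a framework. Here is a minimal sketch, assuming small chunks were stored with a `parent_id` at ingestion and that `parents` maps each id to the full section text (both names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class SmallChunk:
    chunk_id: str
    parent_id: str   # section / parent document this chunk was cut from
    text: str        # 128-256 tokens; this is what gets embedded and retrieved

def expand_to_parents(retrieved: list[SmallChunk],
                      parents: dict[str, str],
                      max_parents: int = 5) -> list[str]:
    """Small chunks drive retrieval; their parents (full sections) are what
    the LLM actually sees. Deduplicate so a parent hit by several small
    chunks is included once, preserving retrieval order."""
    seen: set[str] = set()
    contexts: list[str] = []
    for chunk in retrieved:
        if chunk.parent_id not in seen:
            seen.add(chunk.parent_id)
            contexts.append(parents[chunk.parent_id])
        if len(contexts) >= max_parents:
            break
    return contexts
```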

Embedding Model Selection and Its Downstream Impact

The embedding model is the core of your retrieval system's semantic understanding, and the gap between a mediocre embedding model and a state-of-the-art one is measured in real retrieval accuracy. Several factors matter:

- Domain specificity: OpenAI's text-embedding-3-large and Cohere's embed-v3 are strong general-purpose models, but for code retrieval, legal documents, or biomedical text, a domain-adapted model (e.g., code-specific models like Voyage Code 2, or legal-specific embeddings fine-tuned on court documents) will outperform general models by a significant margin.
- Dimensionality: higher-dimensional embeddings (1536 vs 384) capture more nuanced semantic distinctions but cost more to store and compute similarity over. For most production use cases, 1024-dimensional embeddings are a good balance.
- Context length: some embedding models have a 512-token input limit, which interacts directly with your chunking decisions — chunks longer than the model's context limit are truncated during embedding, losing the tail content silently. Verify that your embedding model's context limit is at least as large as your largest expected chunk.
- Matryoshka Representation Learning (MRL): models trained with MRL (like text-embedding-3-large) support variable-dimension embeddings — you can store full 3072-dimensional vectors but query with 256-dimensional truncations for speed, with a small accuracy penalty. This is useful for tiered retrieval: a fast low-dimensional first pass, then full-dimension rescoring of the top-k results (see the sketch below).

Always evaluate embedding model performance on a sample of your actual queries against your actual corpus — MTEB benchmark scores don't always translate to your specific domain. Build a 100–200 query golden set early and use it to compare models before committing to an embedding infrastructure.
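To make the tiered-retrieval idea concrete, here is a minimal NumPy sketch. It assumes the corpus and query vectors come from an MRL-trained model and are already unit-normalized at full dimension; the 256-dimension first pass and the candidate counts are illustrative, not prescriptive.

```python
import numpy as np

def truncate_and_normalize(vectors: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL embedding, then re-normalize
    so dot products are cosine similarities again."""
    truncated = vectors[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

def tiered_search(query_vec: np.ndarray, corpus_vecs: np.ndarray,
                  first_pass_dim: int = 256, first_pass_k: int = 200,
                  final_k: int = 20) -> np.ndarray:
    """Fast low-dimension first pass over the whole corpus, then full-dimension
    rescoring of the surviving candidates. Returns indices of the top chunks."""
    corpus_small = truncate_and_normalize(corpus_vecs, first_pass_dim)
    query_small = truncate_and_normalize(query_vec[None, :], first_pass_dim)[0]
    candidates = np.argsort(corpus_small @ query_small)[::-1][:first_pass_k]

    # Full-dimension rescoring; assumes corpus_vecs are unit-normalized.
    full_scores = corpus_vecs[candidates] @ query_vec
    return candidates[np.argsort(full_scores)[::-1][:final_k]]
```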

Hybrid Search: BM25 + Dense Retrieval

Dense (embedding-based) retrieval has a well-documented weakness: it struggles with exact keyword matching and rare terms. If a user asks about a specific model number, a proprietary product name, or a precise legal citation, the embedding of that query may not be close enough to the document embedding to surface the correct result — even if the document contains the exact string verbatim. BM25, the classic sparse retrieval algorithm, is designed precisely for this: it matches query terms to document terms directly, with TF-IDF-style weighting that handles rare and specific terms better than any embedding model. Hybrid search combines both: run BM25 and dense retrieval independently, then merge the result sets using Reciprocal Rank Fusion (RRF) or a learned weighted sum. RRF is the most common approach: for each retrieved document, compute `1 / (k + rank_BM25) + 1 / (k + rank_dense)` (where k is typically 60), then sort by this combined score. In practice, hybrid search improves recall across nearly all query types without hurting precision on semantic queries — it's almost always worth implementing. Weaviate, Qdrant, and Elasticsearch all support hybrid search natively. For pgvector-based systems, you need to implement BM25 separately (using PostgreSQL full-text search or Tantivy) and merge results at the application layer. The operational cost is running two retrieval passes per query, which adds 5–15ms latency in typical deployments — a worthwhile trade for significantly fewer retrieval misses.
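RRF itself is only a few lines. A sketch of the merge, taking the two ranked lists of document IDs as input, with `k = 60` as in the formula above:

```python
from collections import defaultdict

def reciprocal_rank_fusion(bm25_ids: list[str], dense_ids: list[str],
                           k: int = 60, top_n: int = 40) -> list[str]:
    """Merge two ranked result lists with Reciprocal Rank Fusion:
    score(doc) = sum over lists of 1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document surfaced by only one of the two passes still receives a score, so keyword-only and semantic-only hits both survive the fusion.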

Reranking: The Final Quality Gate Before the LLM

After hybrid search returns your top-20 or top-50 candidate chunks, a reranker applies a more expensive but more accurate relevance model to re-score and re-rank them. The distinction from embedding similarity is important: embedding models compute relevance by comparing two vectors independently encoded — they never see query and document together in a single forward pass. Rerankers (typically cross-encoders) take the query and each candidate document as a combined input, allowing attention mechanisms to model query-document interactions directly. This produces dramatically better relevance scores at the cost of higher latency. Cohere Rerank, Voyage Rerank, and open-source cross-encoders like ms-marco-MiniLM-L-12-v2 are the common choices. A typical pipeline: hybrid retrieval returns top-40 candidates, the reranker scores all 40, and you take the top-5 by reranker score. The LLM sees only 5 highly relevant chunks rather than 20 partially relevant ones — reducing context dilution and improving answer quality. Latency trade-off: cross-encoder reranking adds 50–200ms per query at typical passage counts. For sub-100ms SLA requirements, use a lighter reranker model or apply reranking only for queries classified as high-stakes (complex questions, sensitive domains). On information retrieval benchmarks, reranking consistently improves nDCG@10 by 8–15% over first-stage dense retrieval. The Haystack framework has an excellent modular pipeline implementation for hybrid retrieval plus reranking, with pluggable, swappable components.
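A sketch of the retrieve-then-rerank step using the sentence-transformers `CrossEncoder` wrapper and the ms-marco model mentioned above; the candidate format (dicts with a `text` field) and the top-k value are assumptions for illustration.

```python
from sentence_transformers import CrossEncoder

# A cross-encoder sees (query, passage) pairs jointly, unlike the bi-encoder
# used for first-stage dense retrieval.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    """Re-score first-stage candidates (each a dict with a 'text' field) and
    keep the top_k by cross-encoder relevance score."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)  # one relevance score per (query, passage) pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]
```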

Citation and Source Attribution Patterns

Grounded answers with citations are not just a user experience feature — they are the primary mechanism for detecting and surfacing hallucinations. Without citations, you can't verify answers. With citations, users and automated evaluators can spot-check LLM claims against the source material. The implementation pattern has two parts. First, at context preparation time: each chunk passed to the LLM is prefixed with a source identifier tag: `[SOURCE:doc_id:chunk_id] ... chunk text ...`. This identifier carries document name, page number, and chunk index. Second, in the system prompt: the LLM is explicitly instructed to cite every claim with the source identifier in a defined format (e.g., `[1]` inline with a `## Sources` section at the end listing each referenced identifier). The application layer then maps citation identifiers back to the original documents and constructs clickable source links. For PDF-based knowledge bases, store the page bounding box of each chunk at ingestion time so you can deep-link to the exact page and paragraph. The key discipline is rejecting or flagging answers that contain no citations when citations are expected — this is a strong signal of a retrieval miss or an LLM that drifted off-context. Enforce citation presence as a structural validator in your evaluation pipeline, not just a nice-to-have in the prompt.
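A minimal sketch of both halves of the pattern: tagging chunks at context-preparation time and validating that the answer cites known sources. It assumes the system prompt instructs the model to echo the `[SOURCE:doc_id:chunk_id]` tags directly rather than numbered references; the regex and field names are illustrative.

```python
import re

def build_context(chunks: list[dict]) -> str:
    """Prefix each retrieved chunk with a source identifier the LLM can cite."""
    return "\n\n".join(
        f"[SOURCE:{c['doc_id']}:{c['chunk_id']}] {c['text']}" for c in chunks
    )

CITATION_PATTERN = re.compile(r"\[SOURCE:([\w\-.]+):(\d+)\]")

def validate_citations(answer: str, chunks: list[dict]) -> tuple[bool, list[str]]:
    """Flag answers with no citations, or citations that don't map back to a
    retrieved chunk (a strong hint of a retrieval miss or off-context drift)."""
    known = {(str(c["doc_id"]), str(c["chunk_id"])) for c in chunks}
    cited = CITATION_PATTERN.findall(answer)
    unknown = [f"{doc}:{idx}" for doc, idx in cited if (doc, idx) not in known]
    is_valid = bool(cited) and not unknown
    return is_valid, unknown
```

The application layer maps valid tags back to documents for clickable links; answers that fail the validator are routed to the "flag or reject" path described above.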

Evaluation with Ragas and TruLens

RAG systems require a specialized evaluation framework because standard NLP metrics (BLEU, ROUGE) measure surface similarity, not factual grounding. Two tools dominate production RAG evaluation: Ragas and TruLens.

Ragas evaluates RAG pipelines on four dimensions: context precision (are the retrieved chunks actually relevant to the question?), context recall (were all the relevant chunks retrieved?), faithfulness (is the generated answer grounded in the retrieved context, with no claims that aren't supported?), and answer relevancy (does the answer actually address what was asked?). Ragas uses an LLM internally to compute these scores, and faithfulness and answer relevancy are reference-free — they don't need human-labeled ground truth. Context recall does need a reference answer to check retrieval completeness against, which is one more reason to build a golden set with ground-truth answers.

TruLens takes a broader view: it instruments your entire RAG application, records inputs, outputs, and intermediate retrieval results for every query, and applies LLM-based feedback functions (similar to the Ragas dimensions) at query time. The TruLens dashboard surfaces aggregate metrics over time, letting you detect drift.

For a production RAG pipeline, the minimum evaluation setup is: a golden test set of 100–200 representative queries with known correct answers (curated by domain experts), Ragas run against this set on every code deployment, and TruLens running in production against a 5–10% sample of live queries. Set thresholds: faithfulness below 0.85 and context recall below 0.80 should trigger an investigation. Track these metrics over time on a dashboard — a gradual decline in faithfulness often precedes a user-visible degradation by weeks, giving you time to act before complaints accumulate.
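A sketch of the deployment-time Ragas run against a golden set, wired to the thresholds above. Ragas' API has shifted across versions; this follows the Dataset-based `evaluate()` interface from the 0.1.x line, so check the docs for your installed version. The file name and record fields are hypothetical.

```python
import json

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness,
)

# Golden set results: one record per query with the pipeline's answer, the
# retrieved contexts, and the expert-curated reference answer (hypothetical file).
records = json.load(open("golden_set_results.json"))
dataset = Dataset.from_dict({
    "question":     [r["question"] for r in records],
    "answer":       [r["answer"] for r in records],
    "contexts":     [r["retrieved_chunks"] for r in records],
    "ground_truth": [r["reference_answer"] for r in records],
})

result = evaluate(dataset, metrics=[
    faithfulness, answer_relevancy, context_precision, context_recall,
])
print(result)

# Gate the deployment on the thresholds from this section.
# Dict-style access to mean scores works on the 0.1.x Result object; newer
# versions expose scores slightly differently.
assert result["faithfulness"] >= 0.85, "faithfulness regression - investigate"
assert result["context_recall"] >= 0.80, "context recall regression - investigate"
```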

Monitoring in Production: What to Track

Evaluation at deployment time is necessary but not sufficient. Production RAG pipelines degrade in ways that offline evals don't catch: query distribution shifts as users explore new topics, document corpus staleness as source knowledge ages, and infrastructure-level latency regressions from index growth. A production RAG monitoring setup tracks five metric categories:

- Retrieval quality: context relevance scores (from TruLens or a custom LLM judge) on a sampled 5–10% of live queries. Track mean score and the 5th percentile — the average can look healthy while a tail of badly served queries compounds into user churn.
- Hallucination rate: faithfulness score on sampled queries. Any single-week drop of more than 5 percentage points should trigger an alert.
- Latency by component: log embedding time, vector search time, reranking time, and LLM generation time separately (see the sketch below). When overall latency degrades, you want to know which component caused it.
- Index freshness: if your corpus is updated regularly, track the lag between document creation and availability in the search index. Stale indices are a hidden source of hallucinations — the LLM answers based on outdated retrieved content.
- User feedback signals: if your UI has thumbs up/down on answers, route negative feedback into a queue for manual review and automatic addition to your Ragas evaluation set. Negative feedback is the most signal-dense data you have about what your pipeline is getting wrong.

The AI Readiness Assessment on AgentList includes a RAG-specific readiness checklist covering corpus quality, retrieval architecture, and monitoring requirements before you go to production.
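A small sketch of the per-component latency logging, using a context manager so each pipeline stage reports its own duration; the logger name, stage labels, and the commented-out pipeline functions are illustrative.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("rag.monitoring")

@contextmanager
def timed_stage(query_id: str, stage: str):
    """Log the wall-clock duration of one pipeline stage, so a latency
    regression can be attributed to embedding, search, reranking, or generation."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("query=%s stage=%s duration_ms=%.1f", query_id, stage, elapsed_ms)

# Usage inside the answer path (pipeline functions are placeholders):
# with timed_stage(qid, "embedding"):
#     query_vec = embed(query)
# with timed_stage(qid, "vector_search"):
#     candidates = search(query_vec)
# with timed_stage(qid, "rerank"):
#     top_chunks = rerank(query, candidates)
# with timed_stage(qid, "generation"):
#     answer = generate(query, top_chunks)
```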

Related Resources

Find agencies that specialize in the frameworks and use cases covered in this article.
