RAG Systems in Production

Sajawal Khan Sadozai

Retrieval-Augmented Generation has become the default architecture for enterprise AI applications that need to answer questions grounded in private, proprietary, or frequently updated knowledge. Every major cloud provider has a managed RAG offering. Dozens of frameworks make it trivially easy to build a prototype. The prototype almost always works. The production system almost always disappoints, until you understand the gap between what makes RAG work in a demo and what makes it work when real users depend on it every day.

Why RAG Prototypes Fail in Production

A RAG prototype typically involves four steps: embed some documents, store them in a vector database, retrieve the top-k most similar chunks when a user asks a question, and pass those chunks to an LLM with the question. This works remarkably well on clean, well-structured documentation with predictable queries.
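To make the prototype loop concrete, here is a minimal sketch of it in plain Python. The embed function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all names (embed, top_k, build_prompt) are hypothetical rather than taken from any framework. The final LLM call is left out.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, store: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


docs = [
    "refunds are issued within 14 days of purchase",
    "shipping takes 3 to 5 business days",
    "support is available on weekdays",
]
store = [(d, embed(d)) for d in docs]  # "vector database" as an in-memory list
question = "how many days do refunds take"
chunks = top_k(question, store, k=2)
prompt = build_prompt(question, chunks)
```

On three clean sentences this behaves well, which is exactly why prototypes are convincing. Nothing in it handles messy documents, ambiguous queries, or stale content.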

Production breaks this model in several predictable ways:

Real documents are messy. Enterprise knowledge bases contain PDFs with scanned text, HTML pages with navigation noise, Word documents with inconsistent formatting, and spreadsheets that don't translate naturally to text. Raw ingestion of these sources produces chunks that are confusing, incomplete, or misleading when retrieved.
Real queries are ambiguous. Users don't query enterprise knowledge bases the way you write documentation. They ask vague questions, use internal shorthand, make spelling mistakes, and often don't know what terminology to use for the thing they're looking for. Semantic similarity alone doesn't bridge this gap reliably.
Top-k retrieval is a blunt instrument. Returning the five most similar chunks by cosine similarity works when the right answer is in one of those five chunks. When the right answer requires synthesising information from three different documents, or when the most similar chunk is not the most relevant one, top-k retrieval fails silently and the LLM generates a confident but incorrect response.
Knowledge goes stale. Static knowledge bases become incorrect knowledge bases. In fast-moving domains — product documentation, policies, pricing, regulatory guidance — yesterday's correct answer is today's liability. Without systematic freshness management, a RAG system becomes less trustworthy over time, not more.

Document Processing: Where Production RAG Actually Starts

The quality of a RAG system is determined more by how documents are processed than by which LLM or vector database you choose. We spend significantly more time on ingestion pipelines than most teams expect, and it pays back consistently.

Format-specific extractors — PDFs, HTML, DOCX, and Markdown each require different extraction approaches. We build dedicated extractors for each format rather than using a generic parser. A PDF extractor that understands table structure produces dramatically better chunks than one that treats the entire document as a flat text stream.
Noise removal before chunking — Headers, footers, page numbers, navigation menus, and boilerplate text that appears across many documents create retrieval noise. We strip these elements during extraction so they don't pollute chunk content or inflate similarity scores.
Semantic chunking over fixed-size chunking — Fixed-size chunking (split every 512 tokens) is the default in most tutorials and the wrong default for most real documents. Splitting mid-sentence, mid-table, or mid-code-block produces chunks that are meaningless in isolation. We use semantic chunking — splitting at natural content boundaries like paragraph breaks, section headers, and logical unit boundaries — and see consistently better retrieval precision as a result.
Chunk overlap with context preservation — A 10–15% overlap between adjacent chunks ensures that context at the boundaries isn't lost. The overlap size depends on document type — technical documentation with dense cross-references benefits from larger overlaps than a simple FAQ.
Metadata enrichment at ingestion — Every chunk gets metadata: source document, section title, document date, content type, and relevant tags. This metadata is used at retrieval time to filter and rerank results, dramatically improving relevance for structured queries like "show me only policy documents updated after January 2025."
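The chunking, overlap, and metadata steps above can be sketched roughly as follows. This is an illustrative simplification, not the pipeline described in this article: it treats blank-line paragraph breaks as the only semantic boundary, measures size in words rather than tokens, and carries whole trailing paragraphs forward as overlap instead of a 10–15% slice. The Chunk class, semantic_chunks function, and metadata keys are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def semantic_chunks(
    document: str,
    source: str,
    max_words: int = 60,
    overlap_paragraphs: int = 1,
) -> list[Chunk]:
    """Group whole paragraphs into chunks, never splitting inside a paragraph."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    groups: list[list[str]] = []
    current: list[str] = []
    for para in paragraphs:
        size_if_added = sum(len(p.split()) for p in current + [para])
        if current and size_if_added > max_words:
            groups.append(current)
            # Carry the trailing paragraph(s) forward so boundary context survives.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        groups.append(current)
    # Attach metadata at ingestion so it can drive filtering at retrieval time.
    return [
        Chunk(text="\n\n".join(g), metadata={"source": source, "chunk_index": i})
        for i, g in enumerate(groups)
    ]


p1 = "Refunds are issued within fourteen days."
p2 = "Requests must include the order number."
p3 = "Gift cards are never refundable here."
result = semantic_chunks("\n\n".join([p1, p2, p3]), source="refund-policy.md", max_words=12)
```

In a real pipeline the metadata dictionary would also carry the section title, document date, content type, and tags mentioned above.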

Embedding Models: Choosing the Right One

Not all embedding models are equal — and the right choice depends on your domain, your query patterns, and your latency requirements.

General-purpose vs domain-specific — OpenAI's text-embedding-3-large and Cohere's embed-v3 are strong general-purpose choices. For highly specialised domains — legal, medical, financial — domain-specific fine-tuned models consistently outperform general models on retrieval precision. We evaluate both on a representative sample of the actual knowledge base before committing.
Embedding dimensions and cost — Higher-dimensional embeddings capture more semantic nuance but cost more to store and query. text-embedding-3-large at 3,072 dimensions outperforms text-embedding-3-small at 1,536 dimensions, but on most enterprise knowledge bases the difference in retrieval quality is smaller than the 2x difference in vector size, and therefore in storage and query cost. We profile the actual precision gap before recommending the larger model.
Query vs document embedding asymmetry — Some embedding models treat queries and documents differently, either with separate encoders or with an input-type flag or prefix on a shared encoder, producing better alignment between question embeddings and answer embeddings. Cohere's models and several open-source alternatives take this approach. For query-heavy systems, the asymmetric approach reliably improves recall.
Late interaction models for precision-critical applications — ColBERT and similar late interaction architectures compute similarity at the token level rather than the document level, allowing much finer-grained matching. For applications where precision matters more than speed — legal research, medical diagnosis support — late interaction models outperform bi-encoder models significantly.
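To illustrate the difference between document-level and token-level matching, the sketch below compares a single pooled vector per text with a ColBERT-style MaxSim score. The two-dimensional token vectors are invented for the example, and mean pooling is only a crude stand-in for a bi-encoder, so treat this as a picture of the scoring rule rather than a benchmark.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def mean_pool(tokens: list[list[float]]) -> list[float]:
    """Collapse per-token vectors into one vector, as a single-vector model would."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def pooled_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Document-level similarity: one vector per text, one dot product."""
    return dot(mean_pool(query_tokens), mean_pool(doc_tokens))


def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Late interaction: each query token keeps its best-matching document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Hypothetical token embeddings: the query mentions two distinct concepts.
query = [[1, 0], [0, 1]]
doc_a = [[1, 0], [0, 1]]  # covers both concepts
doc_b = [[1, 0], [1, 0]]  # repeats only the first concept

pooled = (pooled_score(query, doc_a), pooled_score(query, doc_b))
late = (maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

In this toy case the pooled scores tie, while MaxSim prefers the document that covers both query concepts. The price is storing and comparing one vector per token instead of one per chunk.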

Retrieval: Beyond Top-K Similarity

Raw vector similarity is a starting point, not a complete retrieval strategy. The systems we build in production layer multiple retrieval signals to maximise both recall and precision.

Hybrid search — dense plus sparse — Combining semantic search (vector similarity) with keyword search (BM25 or similar sparse retrieval) dramatically improves recall for queries that contain specific terms, product names, or technical identifiers that pure semantic search can miss. We use Reciprocal Rank Fusion to combine the two result sets into a single ranked list.
Metadata filtering before similarity search — Filtering the candidate document set by metadata (document type, date range, department, product area) before running vector similarity reduces the search space and improves precision. A query about "current refund policy" should not be retrieving chunks from a policy document that was superseded two years ago.
Reranking with a cross-encoder — A cross-encoder reranker takes the top-20 or top-50 chunks from initial retrieval and scores each one against the query using a more computationally expensive but more accurate relevance model. The top-5 after reranking are significantly more relevant than the top-5 from initial retrieval. We use this in every production system where latency allows.
Query expansion and reformulation — Before retrieval, we expand the user's query with synonyms, related terms, and hypothetical document fragments. HyDE (Hypothetical Document Embeddings) — generating a hypothetical answer to the question and using its embedding for retrieval — consistently improves recall for knowledge bases with technical or specialised vocabulary.
Multi-query retrieval for complex questions — For questions that require information from multiple parts of the knowledge base, we decompose the query into sub-questions, retrieve for each independently, and synthesise a combined context. This handles questions like "compare the refund policies for product A and product B" that a single retrieval pass would struggle with.
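Reciprocal Rank Fusion, mentioned under hybrid search above, is simple enough to sketch in a few lines. Each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The constant k = 60 is the commonly used default rather than a value stated in this article, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranked list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_results = ["a", "b", "c"]   # hypothetical vector-search ranking
sparse_results = ["c", "a", "d"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

Because RRF uses only ranks, it needs no score normalisation between the dense and sparse retrievers. That is a large part of its appeal for hybrid search.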

Preventing Hallucinations in Production

The most serious failure mode in a production RAG system is not retrieval failure — it is confident, fluent, incorrect answers. An LLM that cannot find the answer in the retrieved context will often generate a plausible-sounding response anyway. For enterprise applications, this is not acceptable.

Strict grounding prompts — System prompts should explicitly instruct the model to answer only from the provided context and to say "I don't have information about this" when the context is insufficient. The phrasing matters: vague instructions produce vague compliance, while explicit instructions ("Do not use any knowledge outside the provided documents") produce much more consistent compliance.
Faithfulness evaluation — After generation, a separate evaluation step checks whether the generated response is entailed by the retrieved context. We use a lightweight NLI (Natural Language Inference) model for this check. Responses that fail the entailment check are flagged for human review rather than sent to the user.
Confidence scoring and abstention — When retrieval similarity scores are below a threshold, or when retrieved chunks are all below a minimum relevance score, the system should abstain rather than attempt an answer with insufficient context. "I don't have reliable information about this" is always better than a confident wrong answer.
Citation requirement in responses — Requiring the LLM to cite the specific source chunks it used in its response forces the model to ground its answer and makes it auditable. If a response cites a source that doesn't support the claim, the inconsistency is detectable. We implement citation requirements on every enterprise RAG deployment.
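A minimal sketch of how abstention, a strict grounding prompt, and a citation check might fit together is shown below. The 0.75 threshold, the bracketed citation format, and the function names are assumptions made for illustration. The citation check only confirms that cited ids exist in the supplied context; whether a source actually supports the claim is the job of the separate entailment step.

```python
import re

ABSTAIN_MESSAGE = "I don't have reliable information about this."


def select_context(scored_chunks, min_score=0.75):
    """Keep (id, text) pairs whose relevance score clears the threshold."""
    return [(cid, text) for cid, text, score in scored_chunks if score >= min_score]


def build_grounded_prompt(question, context):
    """Return a strict grounding prompt, or None when the system should abstain."""
    if not context:
        return None
    sources = "\n".join(f"[{cid}] {text}" for cid, text in context)
    return (
        "Answer using only the sources below. "
        "Do not use any knowledge outside the provided documents. "
        "Cite the source id in square brackets after each claim. "
        f"If the sources are insufficient, reply exactly: {ABSTAIN_MESSAGE}\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def citations_valid(answer, context):
    """True if the answer cites at least one id and every cited id was supplied."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    known = {cid for cid, _ in context}
    return bool(cited) and cited <= known


scored = [
    ("doc-1", "Refunds are issued within 14 days.", 0.91),
    ("doc-2", "Shipping takes 3 to 5 days.", 0.42),
]
context = select_context(scored)
prompt = build_grounded_prompt("How long do refunds take?", context)
ok = citations_valid("Refunds are issued within 14 days [doc-1].", context)
bad = citations_valid("Refunds take 30 days [doc-7].", context)
```

When build_grounded_prompt returns None, the caller would return ABSTAIN_MESSAGE directly without invoking the model.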

Keeping the Knowledge Base Fresh

A RAG system is only as current as its knowledge base. In most enterprise deployments, the knowledge base is not static — policies change, products update, regulations evolve. Managing freshness is an operational concern that needs to be designed into the system from the start.

Document versioning with supersession tracking — When a document is updated, the old version's chunks are marked as superseded and excluded from retrieval. Without this, a query can retrieve contradictory information from the current and previous versions of the same policy document.
Automated ingestion pipelines — Manual knowledge base updates are a bottleneck and a reliability risk. We build automated pipelines that detect changes in source systems — Confluence pages, SharePoint documents, website content — and trigger re-ingestion for changed documents without manual intervention.
Freshness metadata in retrieval — Every chunk carries a last-updated timestamp. Retrieval scoring can weight more recent content higher for time-sensitive queries. A query about "current pricing" should prefer a document updated last week over one updated last year, even if the older one has marginally higher semantic similarity.
Staleness alerts — We set up monitoring that flags documents that haven't been reviewed in a configurable period. In regulated industries, having the system proactively surface potentially stale content for human review is a compliance requirement, not just a quality concern.
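Supersession filtering and freshness-weighted scoring can be sketched as below. The exponential half-life decay, the 180-day half-life, and the 80/20 blend of similarity and freshness are illustrative assumptions, not values from this article, and the Hit class is a hypothetical stand-in for whatever a vector store returns.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    chunk_id: str
    similarity: float
    last_updated: date
    superseded: bool = False


def freshness_weight(last_updated: date, today: date, half_life_days: float = 180.0) -> float:
    """Decay from 1.0 toward 0.0, halving every half_life_days."""
    age_days = max((today - last_updated).days, 0)
    return 0.5 ** (age_days / half_life_days)


def rank_hits(hits: list[Hit], today: date, freshness_share: float = 0.2) -> list[Hit]:
    """Drop superseded chunks, then blend similarity with freshness."""
    live = [h for h in hits if not h.superseded]

    def score(h: Hit) -> float:
        fresh = freshness_weight(h.last_updated, today)
        return (1 - freshness_share) * h.similarity + freshness_share * fresh

    return sorted(live, key=score, reverse=True)


today = date(2025, 6, 30)
hits = [
    Hit("pricing-2024", similarity=0.82, last_updated=date(2024, 6, 30)),
    Hit("pricing-2025", similarity=0.80, last_updated=date(2025, 6, 23)),
    Hit("pricing-draft", similarity=0.95, last_updated=date(2025, 6, 1), superseded=True),
]
ranked_ids = [h.chunk_id for h in rank_hits(hits, today)]
```

Here the week-old document outranks the year-old one despite slightly lower similarity, and the superseded chunk never reaches the ranking at all. For queries that are not time-sensitive, freshness_share could be set to zero.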

Evaluation: How We Measure RAG Quality

You cannot improve a RAG system you cannot measure. We run a structured evaluation suite on every production RAG deployment before go-live and after every significant change.

Retrieval precision and recall — Given a set of test questions with known correct source documents, recall@k asks how often the right document appears in the top-k results, and precision@k asks what fraction of the top-k results are actually relevant. Both measure the retrieval layer in isolation from the generation layer.
Answer faithfulness — Does the generated answer accurately reflect the content of the retrieved context? We measure this with an NLI model or with an LLM such as GPT-4 acting as a judge.
Answer relevance — Does the generated answer actually address the question that was asked? A response can be perfectly faithful to the retrieved context and still not answer the question if the wrong context was retrieved.
End-to-end correctness on golden dataset — A curated set of question-answer pairs where the correct answer is known. We measure the percentage of questions where the system produces the correct answer end-to-end. This is the most important metric but requires the most effort to maintain.
Latency at the 95th percentile — Median latency is misleading for user-facing systems. We measure and set SLAs at the 95th percentile, the latency that 95% of requests come in under. The slow tail around and beyond it is often where retrieval bottlenecks and generation timeouts appear.
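The retrieval and latency metrics above are straightforward to compute once a labelled test set exists. The sketch below shows a hit rate at k (the share of test questions with at least one known-correct document in the top-k), precision at k for a single question, and a nearest-rank percentile. The test data and function names are hypothetical, and other percentile definitions (for example, interpolated ones) would give slightly different numbers.

```python
import math


def hit_rate_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of questions with at least one relevant document in the top-k."""
    hits = sum(1 for q, docs in retrieved.items() if relevant[q] & set(docs[:k]))
    return hits / len(retrieved)


def precision_at_k(docs: list[str], relevant_docs: set[str], k: int) -> float:
    """Fraction of the top-k results for one question that are relevant."""
    top = docs[:k]
    return sum(1 for d in top if d in relevant_docs) / len(top)


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


retrieved = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d8", "d7"]}
relevant = {"q1": {"d2"}, "q2": {"d5"}}
latencies_ms = [100 * i for i in range(1, 21)]  # 100, 200, ..., 2000

hit_rate = hit_rate_at_k(retrieved, relevant, k=3)
p_at_3 = precision_at_k(retrieved["q1"], relevant["q1"], k=3)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

With this sample, the median is 1000 ms while the 95th percentile is 1900 ms, which is the gap the latency point above warns about.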

The Stack We Use in Production

Our current production RAG stack for enterprise deployments:

Document processing — Unstructured.io for multi-format extraction, custom post-processing for noise removal and semantic chunking.
Embedding — OpenAI text-embedding-3-large for general enterprise use, domain-specific fine-tuned models for specialised deployments.
Vector store — Pinecone for managed deployments, pgvector on Supabase for applications where keeping everything in one database simplifies operations.
Sparse retrieval — Elasticsearch or OpenSearch for BM25, combined with vector results via RRF.
Reranking — Cohere Rerank for most deployments, ColBERT-based reranking for precision-critical applications.
Generation — GPT-4o for complex multi-document synthesis, GPT-4o-mini for high-volume lower-complexity queries where cost matters.
Evaluation — RAGAS framework for automated evaluation metrics, custom golden dataset evaluation for domain-specific correctness.

Building a RAG system for your enterprise knowledge base and want it done right the first time? Our AI team has shipped production RAG systems across multiple industries — we'd love to help.
