How Retrieval-Augmented Generation (RAG) Works
Introduction
Imagine asking a very well-read assistant a question. If the answer is something they studied years ago, they can recall it from memory with confidence. But if you ask about last week's news, your company's internal policies, or a document they have never seen, they have nothing to draw from. And because they cannot easily admit uncertainty, they might answer anyway, confidently, and incorrectly.
This is the core limitation of large language models. They are trained on data up to a certain point in time. After that cutoff, they know nothing about new events, your private documents, or domain-specific knowledge they were never trained on. When asked about something outside their training, they sometimes generate plausible-sounding but entirely fabricated answers, a behavior called hallucination.
Retrieval-Augmented Generation, commonly known as RAG, addresses this by giving the model a way to look up relevant information at the moment it needs to answer a question. Rather than relying solely on what it memorized during training, the model is provided with actual documents drawn from a knowledge base before it generates a response. This simple change in architecture has made RAG one of the most widely deployed patterns in production AI systems.
Problem Statement
Standard language models have three fundamental limitations that make them unreliable for many practical applications. First, their knowledge is frozen at training time. A model trained in 2023 knows nothing about 2024 events regardless of how capable it is. Second, they have no access to private, proprietary, or domain-specific knowledge that was not part of their training corpus. Third, they cannot reliably acknowledge the boundaries of their own knowledge, which means they sometimes fabricate convincing but incorrect answers.
Fine-tuning the model on new data is one partial solution, but it is expensive, requires significant labeled data, must be repeated whenever knowledge changes, and still does not guarantee the model will cite specific documents or refuse to speculate beyond what it has seen. Prompt engineering helps steer behavior but cannot add knowledge the model never had.
What is needed is a way to dynamically provide relevant factual context at inference time, without modifying the model itself, in a way that can be updated independently of the model's training cycle. RAG is that solution.
Core Concepts and Terminology
| Term | Definition |
|---|---|
| RAG (Retrieval-Augmented Generation) | A technique that retrieves relevant documents from a knowledge base and passes them to a language model as context before generating a response. |
| Embedding | A numerical representation of text as a list of numbers (a vector) that captures semantic meaning. Similar texts produce embeddings that are close together in vector space. |
| Embedding Model | A model trained to convert text into embeddings. Used to encode both documents during indexing and queries during retrieval. |
| Vector Database | A database optimized for storing and searching embeddings. Given a query embedding, it finds the stored embeddings most similar to it. |
| Chunk | A segment of a document, typically a paragraph or a fixed number of tokens, that is stored and retrieved as a unit. |
| Semantic Search | Searching by meaning rather than exact keyword matching. A vector database performs semantic search by comparing embedding similarity. |
| Hallucination | When a language model generates confident-sounding but factually incorrect information, often because it lacks the relevant knowledge. |
| Augmented Prompt | The input sent to the language model that combines the user's question with retrieved document chunks as context. |
| Reranking | A second-pass scoring step that takes a set of retrieved candidates and re-orders them by how directly they answer the specific query, improving precision. |
| HyDE (Hypothetical Document Embeddings) | A technique that generates a hypothetical answer to the query and embeds that instead of the raw query, improving retrieval for short or ambiguous questions. |
| RAGAS | A framework for evaluating RAG systems using automated metrics for faithfulness, answer relevancy, context precision, and context recall. |
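To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors and the `nearest` helper are hypothetical, hand-written for illustration; a real embedding model produces far larger vectors, and a real vector database uses indexing structures rather than a full sort.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Hand-written vectors standing in for real embeddings (illustrative values only).
stored = {
    "account recovery steps": [0.9, 0.1, 0.0],
    "quarterly revenue summary": [0.0, 0.2, 0.9],
}
# Pretend this is the embedding of "how do I reset my password".
query_vector = [0.8, 0.2, 0.1]


def nearest(query: list[float], index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda text: cosine_similarity(query, index[text]),
        reverse=True,
    )
    return ranked[:k]


print(nearest(query_vector, stored))
```

The query vector points in nearly the same direction as the "account recovery steps" vector, so that text ranks first even though the two strings share no words.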
How RAG Works: The Four Stages
Think of a RAG system as a research librarian with access to a well-organized archive. When you ask a question, the librarian does not answer from memory. They go to the archive, find the most relevant documents, and hand them to you along with your question, so that you read the answer from the documents rather than guessing. RAG systems work in much the same way, with one addition: the model reads the documents and synthesizes an answer for you.
A RAG system follows four stages in sequence: indexing, retrieval, augmentation, and generation.
- Indexing (building the knowledge base). Before the system can retrieve anything, every document in the knowledge base must be processed and stored. Documents are split into smaller chunks, typically paragraphs or sections of fixed token length. Each chunk is passed through an embedding model, which converts it into a vector of numbers that captures its meaning. Both the vector and the original text are stored in a vector database. This stage happens offline, often as a batch job, and must be re-run whenever the knowledge base is updated.
- Retrieval (finding relevant context). When a user submits a question, it is passed through the same embedding model used during indexing, producing a query vector. The vector database searches for stored chunks whose vectors are closest to the query vector. This is semantic search: it matches by meaning rather than keywords. A question phrased as "how do I reset my password" will match a chunk that says "account recovery steps" even though no keywords overlap. The top matching chunks are returned.
- Augmentation (building the prompt). The retrieved chunks are combined with the user's original question into an augmented prompt. The prompt instructs the model to answer using only the provided context and to say so explicitly if the context does not contain the answer. This instruction is one of the most important design decisions for limiting hallucination: it tells the model to defer to evidence rather than guess.
- Generation (producing the answer). The language model reads the augmented prompt, which includes the retrieved documents and the question, and generates a response grounded in the provided evidence. Unlike a standalone language model responding from training memory, the model in a RAG system is responding from specific documents you chose and control.
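The four stages above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not a production design: `embed` is a bag-of-words stand-in for a real embedding model (so it only matches shared words, not meaning), the in-memory list stands in for a vector database, and `generate` is any callable you supply that maps a prompt to text. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Stage 1: indexing. Store each chunk's text next to its vector.
def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    return [(chunk, embed(chunk)) for chunk in chunks]


# Stage 2: retrieval. Embed the question the same way and rank chunks.
def retrieve(question: str, index: list[tuple[str, Counter]], top_k: int = 2) -> list[str]:
    query = embed(question)
    ranked = sorted(index, key=lambda item: similarity(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Stage 3: augmentation. Combine context, question, and the refusal instruction.
def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so explicitly.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Stage 4: generation. The language model is passed in as a plain callable.
def answer(question: str, index: list[tuple[str, Counter]], generate: Callable[[str], str]) -> str:
    return generate(build_prompt(question, retrieve(question, index)))


index = build_index([
    "Password reset: open Settings and choose Reset password.",
    "Invoices are issued on the first day of each month.",
    "Support is available by email on weekdays.",
])
question = "How do I reset my password?"
print(build_prompt(question, retrieve(question, index, top_k=1)))
```

Keeping `generate` as a parameter mirrors the point that the model itself is not modified: only the prompt it receives changes.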
Practical Example
A legal technology company wants to build an internal assistant that helps lawyers quickly find information across thousands of case documents, contracts, and regulatory filings. Lawyers ask questions like "What are the indemnification obligations under the vendor contracts signed in Q3?" or "Which cases cite the precedent from the 2019 appellate ruling on data privacy?"
The company indexes all of its documents by splitting them into paragraph-level chunks, generating embeddings using a legal-domain embedding model, and storing them in a vector database. The full corpus of 50,000 documents is indexed once, with incremental re-indexing triggered whenever new documents are added.
When a lawyer submits a question, the system embeds the question, retrieves the twenty most relevant chunks from the vector database, applies a reranker to narrow those down to the five most directly relevant passages, and constructs an augmented prompt that includes those five passages alongside the lawyer's question. The language model reads the prompt and generates a response with specific citations to the retrieved passages.
The result is a system that provides answers grounded in actual documents, reduces research time from hours to minutes, and allows the knowledge base to be updated simply by adding new documents and re-indexing, with no model retraining required.
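The retrieve-then-rerank flow in this example (a broad candidate set narrowed to a few passages) can be sketched as follows. The two scoring functions are deliberately simple stand-ins: in a real system the first stage would be vector similarity and the second a dedicated reranking model. The names `two_stage_search`, `word_overlap`, and `overlap_density` are hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    chunks: list[str],
    recall_score: Scorer,
    rerank_score: Scorer,
    candidates: int = 20,
    final: int = 5,
) -> list[str]:
    """Shortlist with a cheap scorer, then reorder the shortlist with a costlier one."""
    # Stage 1: score every chunk cheaply and keep a broad shortlist.
    shortlist = sorted(chunks, key=lambda c: recall_score(query, c), reverse=True)[:candidates]
    # Stage 2: apply the more careful scorer only to the shortlist.
    return sorted(shortlist, key=lambda c: rerank_score(query, c), reverse=True)[:final]


def word_overlap(query: str, chunk: str) -> float:
    """Toy first-stage score: number of distinct words shared with the query."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def overlap_density(query: str, chunk: str) -> float:
    """Toy second-stage score: shared words per chunk word, favoring focused chunks."""
    words = chunk.lower().split()
    return word_overlap(query, chunk) / len(words) if words else 0.0


corpus = [
    "the vendor contract lists indemnification obligations and many other unrelated terms about delivery schedules",
    "indemnification obligations apply to the vendor",
    "office parking rules",
]
best = two_stage_search(
    "vendor indemnification obligations",
    corpus,
    recall_score=word_overlap,
    rerank_score=overlap_density,
    candidates=2,
    final=1,
)
print(best)
```

Both contract chunks tie in the first stage, and the second stage prefers the shorter, more focused one. The expensive scorer only ever sees `candidates` chunks, which is what keeps a second pass affordable on a large corpus.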
Advantages
- Reduced hallucination. By grounding the language model in retrieved evidence and instructing it to refuse if the context is insufficient, RAG significantly reduces the rate at which the model fabricates information. Responses are traceable to specific source documents.
- Knowledge freshness without retraining. Updating a RAG system's knowledge requires updating the document store and re-indexing, not retraining a model. This is orders of magnitude cheaper and faster than fine-tuning.
- Domain specialization at low cost. Organizations can build knowledge-aware AI systems using their own internal documents without the data, compute, and expertise required to fine-tune a model. The language model's general capability is preserved while domain knowledge is added through retrieval.
- Source attribution. Because responses are grounded in retrieved documents, the system can cite which chunks it used, giving users a path to verify the answer and increasing trust in high-stakes applications.
- Modular and maintainable. Each component of the RAG pipeline, the embedding model, the vector database, the retrieval strategy, the language model, can be upgraded or replaced independently. This is more flexible than a monolithic fine-tuned model.
Limitations and Trade-offs
- Retrieval quality is a hard dependency. If the wrong documents are retrieved, the language model cannot compensate. It will either produce an incorrect answer based on irrelevant context or decline to answer at all. The quality of the system's answers is strictly bounded by the quality of its retrieval.
- Indexing complexity. Poor chunking strategies, low-quality document preprocessing, and weak embedding models all degrade retrieval quality before any query is ever run. Getting indexing right requires careful thought about document structure, chunk size, and overlap.
- Latency overhead. Every RAG request adds two steps before the language model can generate anything: embedding the query and searching the vector database. In a well-optimized system this overhead is often in the range of 50 to 200 milliseconds, though it varies with index size, hardware, and network hops, and it compounds with reranking and other post-processing steps.
- Knowledge base maintenance. Documents go stale. The system must detect when knowledge base documents change and re-index affected chunks. Without a robust update pipeline, retrieval gradually returns outdated information.
- Context window limits. The number of retrieved chunks that can be passed to the model is bounded by its context window. Retrieving too many chunks can cause the model to mix up sources or lose important details from earlier in the context.
- Not ideal for complex reasoning over many documents. RAG works well for factual lookup. For tasks that require synthesizing information across dozens of documents or following complex chains of reasoning, it can struggle and may benefit from complementary techniques.
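The context window limit above is often handled by packing ranked chunks into a fixed token budget. The sketch below assumes chunks arrive already sorted from most to least relevant and approximates token counts by splitting on whitespace; a real system would count tokens with the model's own tokenizer. The function names are hypothetical.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate via whitespace split; real tokenizers will differ."""
    return len(text.split())


def fit_to_budget(ranked_chunks: list[str], max_tokens: int) -> list[str]:
    """Keep chunks in rank order until the next one would exceed the budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = approx_tokens(chunk)
        if used + cost > max_tokens:
            # Stop here so the selection stays a strict prefix of the ranking.
            break
        selected.append(chunk)
        used += cost
    return selected


ranked = ["a b c", "d e f g", "h"]
print(fit_to_budget(ranked, max_tokens=5))
```

Stopping at the first chunk that does not fit is one design choice among several; skipping it and trying smaller, lower-ranked chunks would use more of the budget at the cost of including weaker evidence.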
Common Mistakes
- Using chunks that are too large. Large chunks contain more text but reduce retrieval precision. A chunk that covers three different topics will match queries about all three, but may not be the best answer for any of them. Smaller, more focused chunks generally retrieve better.
- Not including overlap between chunks. When documents are split into chunks without overlap, information that falls at a chunk boundary can be split across two chunks, making it harder to retrieve. An overlap between adjacent chunks, for example 100 to 200 tokens for chunks of 500 to 800 tokens, reduces this risk. The overlap must always be smaller than the chunk itself.
- Not cleaning documents before indexing. Embedding a document that includes navigation menus, page headers, and HTML boilerplate adds noise that degrades retrieval. Strip all non-content elements before indexing.
- Retrieving too many chunks. More context sounds better, but a prompt with twenty retrieved chunks often produces worse answers than one with five well-chosen chunks. The model gets confused by the volume of text and mixing of sources. Use reranking to cut down to the truly relevant subset.
- Not instructing the model to refuse when context is insufficient. Without an explicit instruction to say "I cannot find this in the provided context," the model tends to fall back on its training data and may hallucinate. This instruction is essential.
- Evaluating RAG subjectively. "It seems to give good answers" is not a measurement. Use automated evaluation frameworks like RAGAS to track faithfulness, relevancy, and recall across a representative set of questions before and after every pipeline change.
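The chunk size and overlap points above can be illustrated with a sliding-window splitter. This is a minimal sketch that treats a document as a pre-tokenized list; `chunk_tokens` is a hypothetical helper, and splitting on whitespace in the example is only an approximation of real tokenization.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


words = "one two three four five six seven".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With a chunk size of 4 and an overlap of 1, the word "four" appears at the end of the first chunk and the start of the second, so a sentence straddling that boundary is still retrievable from at least one chunk.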
Best Practices
- Choose chunk size based on content type. Use smaller chunks (150 to 300 tokens) for FAQ-style content with distinct, self-contained answers. Use larger chunks (500 to 800 tokens) for long-form technical documentation where removing surrounding context degrades meaning. Always include some overlap between chunks.
- Use a two-stage retrieval approach for production. Vector search retrieves a broad set of candidates quickly (high recall, moderate precision). A cross-encoder reranker then scores each candidate against the specific query and returns only the top results (high precision). This combination typically gives better results than either stage alone, at the cost of some added latency.
- Apply HyDE for short or ambiguous queries. When queries are brief, their embeddings may not be close to the detailed, dense text in your knowledge base. Generating a hypothetical answer to the query and embedding that instead bridges the vocabulary gap between how users ask questions and how documents are written.
- Measure with RAGAS on a representative question set. Define a set of 50 to 100 representative questions with known answers before building the system. Run RAGAS metrics (faithfulness, answer relevancy, context precision, context recall) against this set before and after every meaningful change to the pipeline. This catches regressions before they reach users.
- Build an automated re-indexing pipeline. Detect when source documents change and trigger re-indexing automatically. Stale knowledge bases are a silent quality problem that is easy to prevent with a simple document change detection pipeline.
- Keep retrieved context focused and small. Aim for three to five high-quality chunks per query rather than ten to twenty loosely relevant ones. Better retrieval and a good reranker achieve more than simply passing more context to a capable model.
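For the automated re-indexing practice above, one simple way to detect changed documents is to compare content hashes against the hashes recorded at the last indexing run. The sketch below uses Python's standard `hashlib`; the `plan_reindex` function, the document IDs, and the idea of keeping hashes in a plain dictionary are assumptions for illustration.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect whether a document changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict[str, str], stored_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current documents to previously indexed hashes.

    documents maps document ID to current text; stored_hashes maps document ID
    to the fingerprint recorded when that document was last indexed.
    """
    current = {doc_id: fingerprint(text) for doc_id, text in documents.items()}
    return {
        "added": sorted(d for d in current if d not in stored_hashes),
        "changed": sorted(
            d for d in current if d in stored_hashes and current[d] != stored_hashes[d]
        ),
        "deleted": sorted(d for d in stored_hashes if d not in current),
    }


previously_indexed = {
    "policy": fingerprint("old policy text"),
    "faq": fingerprint("unchanged faq text"),
    "memo": fingerprint("memo that was later removed"),
}
current_documents = {
    "policy": "new policy text",
    "faq": "unchanged faq text",
    "handbook": "brand new handbook",
}
print(plan_reindex(current_documents, previously_indexed))
```

Only documents listed under "added" and "changed" need to be re-chunked and re-embedded, and chunks belonging to "deleted" documents should be removed from the index so they stop appearing in results.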
Comparison: RAG vs Alternatives for Knowledge Customization
| Approach | How It Works | Knowledge Update | Cost | Best For |
|---|---|---|---|---|
| Prompt Engineering | Write better instructions to guide the model's behavior and output style | Immediate (edit the prompt) | Very low | Behavior guidance, output formatting, simple use cases |
| Fine-tuning | Train the model on domain-specific examples to modify its behavior or embed knowledge | Requires full retraining cycle | High. Requires data, compute, expertise | Changing model behavior or style at a deep level |
| RAG | Retrieve relevant documents at inference time and pass them as context | Update the document store and re-index (no model change) | Moderate. Vector database, embedding model, additional latency | Factual lookup, frequently changing knowledge, private documents |
| Fine-tuning + RAG | Fine-tune for behavior and style, RAG for dynamic knowledge | RAG knowledge updates are immediate; behavior changes require retraining | High | Production systems requiring both domain behavior and fresh knowledge |
| Long context (no retrieval) | Stuff all relevant documents into the context window | Immediate (change what you include) | Very high at scale. Proportional to document length | Small document sets where retrieval adds unnecessary complexity |
FAQ
Does RAG eliminate hallucination entirely?
No. RAG significantly reduces hallucination by grounding responses in retrieved documents, but it does not eliminate it. The model can still misinterpret retrieved context, draw incorrect inferences from accurate documents, or hallucinate if instructed to answer even when context is insufficient. The most important mitigation is a strong instruction to explicitly refuse when the retrieved context does not contain an answer, rather than speculating.
Which vector database should I use?
The choice depends on your scale and infrastructure. Pinecone is a managed cloud service, and Weaviate is open source with a managed cloud offering; the managed options minimize operational overhead. Chroma is a lightweight option for development and small-scale production. FAISS is a high-performance similarity search library, rather than a full database, for teams who want to run locally and control their own infrastructure. pgvector extends PostgreSQL with vector search, which is useful if you already run Postgres and want to minimize new infrastructure. Start with what is simplest to run given your existing stack.
How do I know if my RAG system is working well?
Use the RAGAS framework to measure four metrics: faithfulness (are the claims in the answer supported by the retrieved context), answer relevancy (does the answer address the question), context precision (are the relevant chunks ranked near the top of the retrieved set), and context recall (did retrieval find everything needed to answer correctly). Run these metrics on a representative set of 50 to 100 questions with known answers. Track them over time and treat drops as regressions requiring investigation.
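The idea of tracking retrieval quality over a fixed question set can be illustrated without any framework. The sketch below computes plain precision@k and recall@k from hand-labeled relevant chunk IDs. It is not the RAGAS implementation, which relies on model-based judgments; the chunk IDs and function names here are hypothetical.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall of the top-k retrieved chunk IDs against labeled relevant IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(results: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average precision@k and recall@k over a set of (retrieved, relevant) pairs."""
    if not results:
        return {"precision@k": 0.0, "recall@k": 0.0}
    scores = [precision_recall_at_k(retrieved, relevant, k) for retrieved, relevant in results]
    return {
        "precision@k": sum(p for p, _ in scores) / len(scores),
        "recall@k": sum(r for _, r in scores) / len(scores),
    }


# Two labeled questions: what the retriever returned vs. what a human marked relevant.
question_set = [
    (["c1", "c7", "c3"], {"c1", "c3", "c9"}),
    (["c2", "c4"], {"c2"}),
]
print(evaluate(question_set, k=3))
```

Running the same question set before and after a pipeline change, and comparing the two dictionaries, is the simplest form of the regression tracking described above.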
When should I use RAG versus fine-tuning?
Use RAG when you need to provide knowledge that changes frequently, when you have a large corpus of documents, or when you need responses traceable to specific source documents. Use fine-tuning when you need to change the model's output style, tone, format, or behavioral patterns in ways that persist across all interactions regardless of context. Many production systems use both: fine-tuning for behavior, RAG for knowledge.
Does RAG work with any language model?
Yes. RAG is a pipeline architecture that sits outside the language model itself. Any model that accepts a text prompt and generates a text response can be used as the generation component of a RAG system. The quality of the generated answer does depend on how well the model follows instructions, particularly the instruction to refuse when context is insufficient, so more capable instruction-following models generally produce better RAG outputs.
References
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
- Gao, Y., et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.
- Guu, K., et al. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. ICML 2020.
- Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
- Ma, X., et al. (2023). Query Rewriting for Retrieval-Augmented Large Language Models. arXiv:2305.14283.
Key Takeaways
- RAG improves factual accuracy by grounding language model responses in retrieved documents rather than relying on training data alone, dramatically reducing hallucination for factual queries.
- Retrieval quality is the most important performance factor in any RAG system. Bad chunks going in means bad answers coming out, regardless of how capable the language model is.
- The four stages of a RAG pipeline are indexing (building the vector store), retrieval (semantic search), augmentation (building the prompt), and generation (producing the grounded answer). Failures in any stage silently degrade all stages that follow.
- A two-stage retrieval approach combining vector search for high recall with a cross-encoder reranker for high precision typically outperforms either stage used alone.
- RAG enables knowledge freshness and domain specialization without model retraining, making it practical to update knowledge bases as frequently as needed by simply updating documents and re-indexing.
- Use RAGAS to measure faithfulness, answer relevancy, context precision, and context recall on a representative question set before and after every meaningful pipeline change, turning gut feelings into evidence.