# Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a system design pattern that improves an LLM’s answers by:
- Retrieving relevant information from an external knowledge source, and then
- Augmenting the LLM prompt with that retrieved context before generating the final response.
RAG helps an LLM look things up first, then answer using evidence.
## Why RAG is Useful
RAG is commonly used when:
- Your knowledge is in private documents (PDFs, policies, internal wiki)
- You need up-to-date information (things not in the model’s training data)
- You want fewer hallucinations by grounding answers in retrieved sources
- You want traceability (show “where the answer came from”)
RAG does not change the model weights.
It changes what the model sees at inference time by adding retrieved context.
## High-Level Flow
```mermaid
flowchart LR
  U[User Question] --> E[Embed Question]
  E --> R[Retriever / Vector Search]
  R --> C[Top-k Relevant Chunks]
  C --> P[Prompt Augmentation]
  P --> L[LLM]
  L --> A[Answer]
```
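To make the flow concrete, here is a minimal, self-contained sketch in Python. The bag-of-words "embedding", the in-memory chunk list, and the `generate` placeholder are illustrative stand-ins, not a specific library's API; a real pipeline would use an embedding model, a vector store, and an actual LLM client.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector.
    # A real system would call an embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank every chunk by similarity to the question and keep the top k.
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    return ranked[:k]

def augment(question: str, context: list[str]) -> str:
    # Insert the retrieved chunks into the prompt, with grounding instructions.
    return (
        "Use the following context to answer. "
        "If the context is insufficient, say you don't know.\n\n"
        "Context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {question}"
    )

def generate(prompt: str) -> str:
    # Placeholder: send `prompt` to your LLM of choice and return its reply.
    return f"[LLM would answer here, given:]\n{prompt}"

chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Refunds are issued to the original payment method within 5 business days.",
]
question = "How long do refunds take?"
print(generate(augment(question, retrieve(question, chunks, k=2))))
```

Each stage of this toy pipeline maps to one box in the diagram; the sections below look at the building blocks individually.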
## Key Building Blocks
### 1. Knowledge Source
Where your information lives:
- Documents (PDFs, Word files)
- Web pages / internal wiki
- Databases
- FAQs / manuals
### 2. Chunking (Preparing documents)
Documents are split into smaller pieces (“chunks”) so that retrieval returns focused, relevant passages instead of whole files.
Good chunking improves:
- Relevance
- Citation quality
- Answer completeness
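A minimal sketch of fixed-size chunking with overlap; the sizes are illustrative defaults, and real pipelines often split on sentences, paragraphs, or tokens rather than raw characters.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Split text into overlapping character windows. The overlap keeps
    # sentences that straddle a boundary available in both neighbouring chunks.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
    return chunks
```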
### 3. Embeddings
Both the user query and the stored chunks are converted into vectors (embeddings).
The retriever uses similarity search to find the best matches.
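A sketch of the embedding step, assuming the open-source sentence-transformers library and the all-MiniLM-L6-v2 model (one common choice, not a requirement):

```python
# Assumes `pip install sentence-transformers numpy`.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Refunds are issued within 5 business days.",
    "Support hours are 9am to 5pm, Monday to Friday.",
]
chunk_vecs = model.encode(chunks)                       # shape: (num_chunks, dim)
query_vec = model.encode("How long do refunds take?")   # shape: (dim,)

# Cosine similarity between the query and every chunk vector.
sims = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(sims)  # higher score = more similar chunk
```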
### 4. Retriever (Search)
The retriever returns the top-k most relevant chunks.
Common retrieval methods:
- Similarity search (vector search)
- Hybrid search (vector + keyword)
- Re-ranking (improves relevance)
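A sketch of a simple hybrid retriever that blends a precomputed vector similarity with a crude keyword-overlap signal; the helper names and the `alpha` weighting are illustrative, and production systems typically use BM25 for the keyword side.

```python
def keyword_score(query: str, chunk: str) -> float:
    # Fraction of query terms that also appear in the chunk.
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0

def hybrid_top_k(query: str, chunks: list[str], vector_sims: list[float],
                 k: int = 3, alpha: float = 0.7) -> list[str]:
    # Blend the vector similarity with the keyword signal, then keep the top k.
    # `vector_sims[i]` is the precomputed cosine similarity for chunks[i].
    scored = [
        (alpha * vector_sims[i] + (1 - alpha) * keyword_score(query, c), c)
        for i, c in enumerate(chunks)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

A re-ranking stage (for example, a cross-encoder that rescores the merged candidates) can then sharpen the final ordering before prompt augmentation.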
### 5. Prompt Augmentation
The retrieved chunks are inserted into the prompt (context window), usually alongside instructions such as:
- “Use the following context…”
- “Cite the sources…”
- “If context is insufficient, say so…”
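A sketch of the augmentation step that numbers the chunks so the model can cite them; the exact wording of the instructions is just one reasonable template.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    # Number the chunks so the model can cite them as [1], [2], ...
    numbered = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below.\n"
        "Cite the sources you used, e.g. [1].\n"
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```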
### 6. LLM Generation
The LLM generates the final response using:
- the user query
- the retrieved context
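A sketch of the generation step using the OpenAI Python SDK as an example client; any chat-style LLM API works the same way, and the model name and inline context string are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "How long do refunds take?"
retrieved_context = "[1] Refunds are issued within 5 business days."

prompt = (
    "Use only the context to answer and cite sources like [1].\n\n"
    f"Context:\n{retrieved_context}\n\nQuestion: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; use whatever your stack provides
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```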
## RAG vs Prompting vs Fine-tuning
| Method | What changes? | Best for |
|---|---|---|
| Prompting | Prompt text | Quick behaviour changes |
| RAG | Adds retrieved context | Private / up-to-date knowledge |
| Fine-tuning | Model weights | Style, format, consistent behaviour |
If your goal is “answer from my documents”, RAG is usually the first and best option.
Fine-tuning is not a substitute for retrieval.
## Common Failure Modes (and fixes)
- Retriever returns irrelevant chunks → Improve chunking, improve embeddings, and add re-ranking
- Answer ignores context → Strengthen prompt instructions and formatting
- Not enough context in top-k results → Increase k, use hybrid retrieval, improve indexing
- Confident hallucination when context is missing → Add a rule: “If context is insufficient, say you don’t know.”
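One way to implement the last fix is to check retrieval confidence before calling the LLM at all. The threshold, the tuple format, and the `generate` callable below are illustrative assumptions, not a standard API.

```python
def answer_with_guard(question: str,
                      retrieved: list[tuple[float, str]],
                      generate,
                      min_sim: float = 0.3) -> str:
    # `retrieved` holds (similarity, chunk) pairs from the retriever;
    # `generate` is whatever function sends a prompt to your LLM;
    # `min_sim` is an illustrative threshold to tune on your own data.
    if not retrieved or max(score for score, _ in retrieved) < min_sim:
        return "I don't know - the retrieved context does not cover this question."
    context = "\n---\n".join(chunk for _, chunk in retrieved)
    prompt = (
        "Use only the context below. If it is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```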