Retrieval-Augmented Generation (RAG) #

Retrieval-Augmented Generation (RAG) is a system design pattern that improves an LLM’s answers by:

  1. Retrieving relevant information from an external knowledge source, and then
  2. Augmenting the LLM prompt with that retrieved context before generating the final response.

RAG helps an LLM look things up first, then answer using evidence.


Why RAG is Useful #

RAG is commonly used when:

  • Your knowledge is in private documents (PDFs, policies, internal wiki)
  • You need up-to-date information (things not in the model’s training data)
  • You want fewer hallucinations by grounding answers in retrieved sources
  • You want traceability (show “where the answer came from”)

RAG does not change the model weights.
It changes what the model sees at inference time by adding retrieved context.


High-Level Flow #

```mermaid
flowchart LR
    U[User Question] --> E[Embed Question]
    E --> R[Retriever / Vector Search]
    R --> C[Top-k Relevant Chunks]
    C --> P[Prompt Augmentation]
    P --> L[LLM]
    L --> A[Answer]
```
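
The same flow as a minimal Python sketch. The `embed`, `retrieve`, and `generate` callables are placeholders for whatever embedding model, vector index, and LLM client you use; they are not a specific library's API.

```python
from typing import Callable, Sequence

def rag_answer(
    question: str,
    embed: Callable[[str], Sequence[float]],                # embedding model
    retrieve: Callable[[Sequence[float], int], list[str]],  # vector search over your index
    generate: Callable[[str], str],                         # LLM call
    k: int = 5,
) -> str:
    """One pass through the RAG flow: embed -> retrieve -> augment -> generate."""
    query_vector = embed(question)       # Embed Question
    chunks = retrieve(query_vector, k)   # Top-k Relevant Chunks
    context = "\n\n".join(chunks)
    prompt = (                           # Prompt Augmentation
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)              # LLM -> Answer
```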

Key Building Blocks #

1. Knowledge Source #

Where your information lives:

  • Documents (PDFs, Word files)
  • Web pages / internal wiki
  • Databases
  • FAQs / manuals

2. Chunking (Preparing documents) #

Documents are split into smaller pieces (“chunks”) so that retrieval can match the query against focused, relevant passages instead of whole documents.

Good chunking improves:

  • Relevance
  • Citation quality
  • Answer completeness
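
A minimal sketch of one common strategy, fixed-size chunks with overlap (the sizes here are arbitrary; sentence- or heading-aware splitting often works better for structured documents):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` words.
    `overlap` must be smaller than `chunk_size`."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks
```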

3. Embeddings #

Both the user query and the stored chunks are converted into vectors (embeddings).

The retriever uses similarity search to find the best matches.
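
A small sketch, assuming the `sentence-transformers` library and the `all-MiniLM-L6-v2` model (any embedding model works the same way): embed the chunks once, embed the query at question time, and compare vectors.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumption: this library is installed

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Support hours are Monday to Friday, 9am to 5pm.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

query_vector = model.encode(
    ["How long do I have to return an item?"], normalize_embeddings=True
)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = chunk_vectors @ query_vector
print(chunks[int(np.argmax(scores))])
```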


4. Retrieval #

The retriever returns the top-k most relevant chunks.

Common retrieval methods:

  • Similarity search (vector search)
  • Hybrid search (vector + keyword)
  • Re-ranking (improves relevance)
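
A minimal top-k similarity search over pre-computed chunk vectors (assuming normalized embeddings, as in the sketch above). Hybrid search would combine these scores with a keyword score such as BM25, and a re-ranker would re-score just these top-k candidates.

```python
import numpy as np

def top_k(query_vector: np.ndarray, chunk_vectors: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query
    (dot product == cosine similarity when vectors are L2-normalized)."""
    scores = chunk_vectors @ query_vector
    return np.argsort(scores)[::-1][:k].tolist()
```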

5. Prompt Augmentation #

The retrieved chunks are inserted into the prompt (context window), usually together with instructions such as:

  • “Use the following context…”
  • “Cite the sources…”
  • “If context is insufficient, say so…”
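
A sketch of one way to assemble the augmented prompt; the exact wording is up to you, but numbering the chunks makes citations easy:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Use the following context to answer the question.\n"
        "Cite the sources you used by their [number].\n"
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```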

6. LLM Generation #

The LLM generates the final response using:

  • the user query
  • the retrieved context
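
A sketch of the final call, assuming the `openai` Python client (any chat-completion API works the same way; the model name is just an example):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; use whatever you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Putting the last two sketches together: `generate_answer(build_prompt(question, retrieved_chunks))`.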

RAG vs Prompting vs Fine-tuning #

| Method      | What changes?          | Best for                            |
|-------------|------------------------|-------------------------------------|
| Prompting   | Prompt text            | Quick behaviour changes             |
| RAG         | Adds retrieved context | Private / up-to-date knowledge      |
| Fine-tuning | Model weights          | Style, format, consistent behaviour |

If your goal is “answer from my documents”, RAG is usually the first and best option.
Fine-tuning is not a substitute for retrieval.


Common Failure Modes (and fixes) #

  • Retriever returns irrelevant chunks
    → Improve chunking and embeddings, and add re-ranking

  • Answer ignores context
    → Strengthen prompt instructions and formatting

  • Not enough context in top-k results
    → Increase k, use hybrid retrieval, improve indexing

  • Confident hallucination when context is missing
    → Add a rule: “If context is insufficient, say you don’t know.”
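
A complementary guard on the retrieval side, assuming the `build_prompt` and `generate_answer` sketches above (the threshold value is illustrative and should be tuned on your own data): refuse up front when nothing similar enough was retrieved, and rely on the prompt rule for cases where chunks are present but still insufficient.

```python
def answer_or_refuse(question: str, chunks: list[str], scores: list[float],
                     threshold: float = 0.3) -> str:
    # Refuse instead of guessing when no retrieved chunk is similar enough.
    if not chunks or max(scores) < threshold:
        return "I don't know: the retrieved context does not cover this question."
    return generate_answer(build_prompt(question, chunks))
```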

