Retrieval-augmented generation (RAG)
Definition
Retrieval-augmented generation (RAG) augments a large language model with a retrieval step: given a query, you retrieve relevant documents (from a vector store or search index), then pass them as context to the LLM to generate an answer. This reduces hallucination and keeps answers grounded in your data.
RAG is often preferred over fine-tuning when you need to update knowledge frequently (e.g. internal docs, support articles) without retraining, when you have domain-specific or private data that shouldn't be baked into weights, or when you want to cite sources in the model's answer. Fine-tuning is better when the desired behavior or style is stable and you can afford training and hosting.
How it works
- Index: Documents are chunked and embedded; vectors are stored in a vector database.
- Query: The user query is embedded; the system retrieves the top-k most similar chunks (see embeddings and RAG architecture).
- Generate: The LLM receives the query plus retrieved text and produces the final answer.
At query time, the query is embedded and used against the vector database to retrieve the most similar chunks; the retrieved text becomes context and is passed along with the query to the LLM, which produces the answer. Indexing (chunking, embedding, storing) happens offline or incrementally; retrieval and generation run at query time. Answer quality depends on the chunking strategy, the choice of embedding model, and how the prompt incorporates the retrieved context.
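The retrieve step can be sketched without any framework: split documents into overlapping chunks, embed each chunk, then rank chunks by cosine similarity to the query embedding. This is a minimal illustration only; the `chunk`, `embed`, and `retrieve` helpers are hypothetical, and the bag-of-words `embed` is a toy stand-in for a real learned embedding model.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 60, overlap: int = 15) -> list[str]:
    """Split text into fixed-size character windows with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use a learned model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the top-k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

docs = [
    "RAG retrieves relevant documents before generation.",
    "Fine-tuning bakes knowledge into model weights.",
    "Vector databases store embeddings for similarity search.",
]
chunks = [c for d in docs for c in chunk(d)]
print(retrieve("how does retrieval work in RAG", chunks, k=1))
```

In production the same shape holds, but the embedding call goes to a model API or local encoder, and the sort is replaced by an approximate nearest-neighbor index in the vector database.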
Simple RAG pipeline (Python)
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Index documents (one-time or incremental); `documents` is a list of
# LangChain Document objects produced by your loader/splitter.
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# Query: retrieve the top-k most similar chunks and join them as context
query = "What is RAG?"
docs = vectorstore.similarity_search(query, k=4)
context = "\n\n".join(d.page_content for d in docs)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the context below.\n\n{context}"),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-4")
chain = prompt | llm

answer = chain.invoke({"context": context, "question": query})
print(answer.content)  # the chain returns an AIMessage
```
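For clarity, here is what the chain assembles before the model call: the template substitutes the retrieved context and the question into a message list. This is a framework-free sketch; the `build_messages` helper is hypothetical, and the message dicts merely mirror the common chat-message shape for illustration.

```python
def build_messages(context: str, question: str) -> list[dict]:
    """Fill the system/human template the way the prompt does before the LLM call."""
    return [
        {"role": "system", "content": f"Answer using only the context below.\n\n{context}"},
        {"role": "user", "content": question},
    ]

msgs = build_messages("RAG retrieves documents before generation.", "What is RAG?")
print(msgs[0]["content"])
```

Keeping the context in the system message, with an explicit instruction to answer only from it, is what grounds the response in the retrieved documents rather than the model's training data.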
Use cases
RAG fits any application where answers must be grounded in up-to-date or private documents rather than the model’s training data.
- Customer support chatbots that answer from a knowledge base
- Internal wiki and document Q&A
- Legal or contract search and summarization
- Product and FAQ search with cited answers
Pros and cons
| Pros | Cons |
|---|---|
| Reduces hallucination | Retrieval quality depends on chunks and embeddings |
| No need to retrain for new docs | Latency from retrieval + generation |
| Easy to update knowledge | Need good chunking and indexing strategy |
External documentation
- RAG paper (Lewis et al.) — Original retrieval-augmented generation
- LangChain – Question answering / RAG
- LlamaIndex – RAG
- Vertex AI – RAG and grounding — RAG on Google Cloud