Retrieval-augmented generation (RAG)
Definition
Retrieval-augmented generation (RAG) extends a large language model with a retrieval step: given a query, the system retrieves relevant documents (from a vector store or search index) and passes them as context to the LLM, which generates the answer. This reduces hallucinations and keeps answers grounded in your data.
RAG is often preferred over fine-tuning when you need to update knowledge frequently (e.g. internal documents, support articles) without retraining, when you have domain-specific or private data that shouldn't be baked into model weights, or when you want to cite sources in the model's answer. Fine-tuning is better when the desired behavior or style is stable and you can afford training and hosting.
How it works
- Index: Documents are chunked and embedded; vectors are stored in a vector database.
- Query: The user's query is embedded; the system retrieves the top-k most similar chunks (see Embeddings and RAG architecture).
- Generate: The LLM receives the query plus the retrieved text and produces the final answer.
The diagram below shows the query-time flow: the query and the vector DB feed the embed and retrieve steps; the retrieved text becomes context and is passed with the query to the LLM to produce the answer. Indexing (chunking, embedding, storing) is done offline or incrementally; retrieval and generation run at query time. Quality depends on chunking, embedding choice, and how the prompt includes the context.
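The retrieve step can be sketched without any framework: a toy in-memory index of (chunk, vector) pairs ranked by cosine similarity against the query vector. The vectors and chunks here are made up for illustration; a real system would use a learned embedding model and a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=2):
    """Return the top-k (chunk, score) pairs most similar to the query."""
    scored = [(chunk, cosine_similarity(query_vec, vec)) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy index: (chunk text, embedding vector) pairs with hand-made vectors.
index = [
    ("RAG retrieves documents before generating.", [0.9, 0.1, 0.0]),
    ("Fine-tuning updates model weights.", [0.1, 0.9, 0.0]),
    ("Chunking splits documents for indexing.", [0.5, 0.2, 0.7]),
]

query_vec = [0.8, 0.2, 0.1]  # stands in for the embedded user query
top = retrieve(query_vec, index, k=2)
```

The retrieved `top` chunks would then be joined into the context string passed to the LLM.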
Simple RAG pipeline (Python)
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Index documents (one-time or incremental); `documents` is a list of
# LangChain Document objects produced earlier by a loader/splitter.
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# Query: embed the question and retrieve the top-k most similar chunks
query = "What is RAG?"
docs = vectorstore.similarity_search(query, k=4)
context = "\n\n".join(d.page_content for d in docs)

# Generate: pass the retrieved context plus the question to the LLM
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the context below.\n\n{context}"),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-4")
chain = prompt | llm
answer = chain.invoke({"context": context, "question": query})
print(answer.content)  # the grounded answer text
```
Use cases
RAG fits any application where answers must be grounded in up-to-date or private documents rather than the model’s training data.
- Customer support chatbots that answer from a knowledge base
- Internal wiki and document Q&A
- Legal or contract search and summarization
- Product and FAQ search with cited answers
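For cited answers, a common pattern is to tag each retrieved chunk with a numbered source before building the context, so the model can reference [1], [2], ... in its reply. A minimal sketch; the `build_cited_context` helper and the file paths are hypothetical, not part of any library:

```python
def build_cited_context(docs):
    """Format (text, source) pairs with numbered tags so the model
    can cite them as [1], [2], ... in its answer."""
    lines = []
    for i, (text, source) in enumerate(docs, start=1):
        lines.append(f"[{i}] ({source}) {text}")
    return "\n\n".join(lines)

# Retrieved chunks paired with their source paths (illustrative data)
docs = [
    ("Refunds are processed within 5 business days.", "faq/refunds.md"),
    ("Contact support via the in-app chat.", "faq/contact.md"),
]
context = build_cited_context(docs)
```

The system prompt can then instruct the model to append the bracketed numbers of the sources it used.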
Pros and cons
| Pros | Cons |
|---|---|
| Reduces hallucination | Retrieval quality depends on chunks and embeddings |
| No need to retrain for new docs | Latency from retrieval + generation |
| Easy to update knowledge | Need good chunking and indexing strategy |
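On the chunking point: a basic strategy is fixed-size windows with overlap, so sentences that straddle a boundary still appear whole in at least one chunk. A minimal sketch; `chunk_text` is an illustrative helper, not a library function:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks; consecutive chunks
    share `overlap` characters so boundary content is not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# 500 characters with a 200-char window and 50-char overlap -> 4 chunks
chunks = chunk_text("a" * 500, chunk_size=200, overlap=50)
```

Production systems often split on sentence or token boundaries instead of raw characters, but the overlap idea is the same.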
External documentation
- RAG paper (Lewis et al.) — Original retrieval-augmented generation paper
- LangChain – Question answering / RAG
- LlamaIndex – RAG
- Vertex AI – RAG and grounding — RAG on Google Cloud