Today, a friend floated the idea of loading books into an LLM so it could read them and then answer any question about them. However, loading an entire book into the LLM's context is bulky: the information you need ends up buried like a needle in a haystack and, most importantly, it costs a lot of tokens. I suggested the RAG workflow instead, which suits tasks that require a large knowledge base (in this case, an entire book).
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances LLM responses by first retrieving relevant information from an external knowledge base, then passing it as context alongside the user's query [1]. It was originally introduced by Meta AI researchers to handle knowledge-intensive tasks without retraining the model.
The name breaks down directly into the workflow:
- Retrieval — find relevant chunks from your knowledge base
- Augmented — enrich the prompt with that retrieved context
- Generation — the LLM generates a response grounded in that context
If you use embeddings to find semantically similar text from a large corpus and pass those chunks as LLM context — that is a textbook RAG implementation.
The Standard RAG Pipeline
There are two phases: indexing (offline) and retrieval (runtime) [2].
Indexing:
- Split large documents into smaller chunks (typically a few hundred tokens each)
- Convert each chunk into a vector embedding using an embedding model
- Store embeddings in a vector database (e.g., FAISS, Pinecone, Milvus)
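The indexing steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `chunk_text`, `toy_embed`, and `build_index` are hypothetical helper names, chunks are measured in words rather than tokens, the "embedding" is a hashed bag-of-words vector standing in for a real embedding model, and a plain list stands in for the vector database.

```python
import hashlib
import math
import re

def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into overlapping chunks of about chunk_size words (words, not tokens)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def toy_embed(text, dim=64):
    """Assumption: stand-in for a real embedding model (hashed bag-of-words, L2-normalised)."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(text, chunk_size=50, overlap=10):
    """Return (chunk, embedding) pairs; a list stands in for the vector database."""
    return [(chunk, toy_embed(chunk)) for chunk in chunk_text(text, chunk_size, overlap)]

# A made-up 120-word "book" so the sketch runs without any external data.
book = " ".join(f"word{i}" for i in range(120))
index = build_index(book)
print(len(index), "chunks indexed")
```

The overlap between neighbouring chunks is a design choice rather than a requirement: it reduces the chance that a sentence relevant to a question gets cut in half at a chunk boundary.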
Retrieval:
- Embed the user's query using the same embedding model
- Run a semantic similarity search (e.g., cosine similarity) to find the top-K most relevant chunks
- Inject those chunks into the LLM prompt as context
- The LLM generates a response grounded in the retrieved content [2]
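A minimal sketch of the retrieval steps, again under assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, `retrieve` and `build_prompt` are hypothetical helper names, and the three sample chunks are invented for illustration. The final call to an LLM is left out, since it depends on whichever provider you use.

```python
import hashlib
import math
import re

def toy_embed(text, dim=1024):
    """Assumption: stand-in for a real embedding model (hashed bag-of-words, L2-normalised)."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, index, k=2):
    """Embed the query with the same model as the chunks and return the top-k chunks."""
    query_vec = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, chunks):
    """Inject the retrieved chunks into the prompt as numbered context."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# Invented sample chunks; in practice these come from the indexing phase.
chunks = [
    "The whale was first sighted near the equator.",
    "Ishmael signs on to the Pequod in Nantucket.",
    "Captain Ahab lost his leg to the white whale.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

query = "Who lost his leg to the whale?"
top = retrieve(query, index, k=1)
prompt = build_prompt(query, top)
print(prompt)  # this string is what would be sent to the LLM
```

A brute-force scan like this is fine for a single book; a vector database mainly earns its keep when the corpus grows large enough that scoring every chunk per query becomes too slow.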

RAG Variants
Not all RAG implementations are equal. Retrieval quality is what separates a basic RAG from a production-grade one [4].
| Approach | Method | Strength |
|---|---|---|
| Basic RAG | Embedding similarity only | Semantic understanding |
| Hybrid RAG | Embeddings + BM25 keyword search | Semantic + exact match |
| Contextual RAG | Chunk-level context prepended before embedding | Better chunk disambiguation |
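To make the hybrid row concrete, here is a rough sketch that ranks chunks once by embedding similarity and once by BM25, then merges the two rankings. Several things here are assumptions rather than anything the sources prescribe: the rankings are merged with reciprocal rank fusion (one common choice among several), the BM25 parameters `k1=1.5` and `b=0.75` are conventional defaults, `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and the sample chunks are invented.

```python
import hashlib
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def toy_embed(text, dim=1024):
    """Assumption: stand-in for a real embedding model (hashed bag-of-words, L2-normalised)."""
    vec = [0.0] * dim
    for word in tokenize(text):
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dense_scores(query, chunks):
    """Cosine similarity (dot product of unit vectors) between the query and each chunk."""
    q = toy_embed(query)
    return [sum(a * b for a, b in zip(q, toy_embed(c))) for c in chunks]

def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Okapi BM25 keyword score of each chunk for the query."""
    docs = [tokenize(c) for c in chunks]
    avg_len = sum(len(d) for d in docs) / len(docs)
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores

def ranking(scores):
    """Chunk indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several rankings; each contributes 1 / (k + rank) per chunk."""
    fused = Counter()
    for order in rankings:
        for rank, idx in enumerate(order, start=1):
            fused[idx] += 1.0 / (k + rank)
    return [idx for idx, _ in fused.most_common()]

def hybrid_search(query, chunks, top_k=2):
    order = reciprocal_rank_fusion([
        ranking(dense_scores(query, chunks)),
        ranking(bm25_scores(query, chunks)),
    ])
    return [chunks[i] for i in order[:top_k]]

# Invented sample chunks: the exact identifier "TS-999" is where keyword search helps.
chunks = [
    "Restart the router if the connection drops.",
    "Error code TS-999 means the license server is unreachable.",
    "Contact support when an error persists after a restart.",
]
print(hybrid_search("What does TS-999 mean?", chunks, top_k=1))
```

Fusing ranks instead of raw scores sidesteps the fact that cosine similarities and BM25 scores live on different scales, which is the main reason this sketch uses it.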
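Contextual Retrieval (introduced by Anthropic) prepends a short, LLM-generated description of where each chunk sits in its source document before embedding it. Anthropic reports that this reduced failed retrievals by 35% in its tests, and by 49% when combined with a BM25 index built over the same contextualized chunks [3].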
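The idea can be sketched as a small preprocessing step that runs before embedding. Everything named here is hypothetical: `contextualize` and `stub_situate` are illustrative helpers, the document title and sample chunk are invented, and the stub returns a fixed sentence where a real pipeline would ask an LLM to describe the chunk's place in the full document.

```python
def stub_situate(chunk, doc_title="ACME Corp Q2 2023 report"):
    """Placeholder for an LLM call that describes where the chunk sits in the document."""
    return f"This excerpt is from the {doc_title}."

def contextualize(chunks, situate):
    """Pair each chunk with a context-prefixed version to use for embedding and indexing."""
    records = []
    for chunk in chunks:
        context = situate(chunk)
        records.append({
            "original": chunk,                     # what the reader or LLM sees
            "to_embed": f"{context}\n\n{chunk}",   # what gets embedded and keyword-indexed
        })
    return records

# On its own, this chunk does not say which company or quarter it refers to.
chunks = ["Revenue grew by 3% over the previous quarter."]
records = contextualize(chunks, stub_situate)
print(records[0]["to_embed"])
```

Keeping the original text next to the prefixed version is a design choice in this sketch, so the index can match on the added context while the unmodified chunk remains available.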
Why RAG Over Full-Context Stuffing?
LLMs have finite context windows, and sending an entire large document is both expensive and noisy [4]. RAG solves this by being selective — only the relevant chunks are retrieved, keeping the prompt focused and reducing hallucination risk [1].
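A back-of-the-envelope comparison shows the size of the gap. All the numbers below are assumptions chosen for illustration, not measurements: a 100,000-word book, roughly 130 tokens per 100 words, a 30-token question, and five retrieved chunks of 300 tokens each.

```python
def full_context_tokens(book_words, tokens_per_100_words=130, question_tokens=30):
    """Prompt size when the whole book is sent with every question."""
    return book_words * tokens_per_100_words // 100 + question_tokens

def rag_tokens(top_k=5, chunk_tokens=300, question_tokens=30):
    """Prompt size when only the top-k retrieved chunks are sent."""
    return top_k * chunk_tokens + question_tokens

full = full_context_tokens(100_000)   # 130030 tokens per question
rag = rag_tokens()                    # 1530 tokens per question
print(f"full context: {full} tokens, RAG: {rag} tokens, ratio: {full / rag:.0f}x")
```

Under these assumptions, each question costs roughly 85 times fewer prompt tokens with RAG, and the saving repeats on every question asked about the same book.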
The trade-off is retrieval quality: if your retrieval step misses a relevant chunk, the LLM simply won't have that information. This is why hybrid retrieval (semantic + keyword) and good chunking strategies matter in production systems [2].
References
[1] IBM. What is RAG (Retrieval Augmented Generation)? https://www.ibm.com/think/topics/retrieval-augmented-generation
[2] Microsoft Azure. Develop a RAG Solution - Generate Embeddings Phase. https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-generate-embeddings
[3] Anthropic. Contextual Retrieval. https://www.anthropic.com/news/contextual-retrieval
[4] SuperAnnotate. RAG vs. Long-context LLMs. https://www.superannotate.com/blog/rag-vs-long-context-llms