RPA Scale | How to use RAG to improve LLM Output

Today, a friend floated the idea of loading entire books into an LLM and then asking it questions about them. However, putting a whole book into the LLM's context is bulky: the information you need is a needle buried in a haystack and, most importantly, it costs a lot of tokens. I suggested the RAG workflow instead, which suits tasks that require a large knowledge base (in this case, an entire book).

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that enhances LLM responses by first retrieving relevant information from an external knowledge base, then passing it as context alongside the user's query [1]. It was originally introduced by Meta AI researchers to handle knowledge-intensive tasks without retraining the model.

The name breaks down directly into the workflow:

Retrieval — find relevant chunks from your knowledge base
Augmented — enrich the prompt with that retrieved context
Generation — the LLM generates a response grounded in that context

If you use embeddings to find semantically similar text in a large corpus and pass those chunks to the LLM as context, you have a textbook RAG implementation.
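The "Augmented" step is the easiest one to picture in code. Below is a minimal sketch of it, assuming the retrieval step has already returned a list of chunks. The function name build_prompt, the prompt wording, and the sample chunks are all illustrative assumptions, not part of any particular library.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Augment the user's question with retrieved context."""
    # Number each chunk so the answer can point back to its source.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Made-up chunks standing in for whatever the retrieval step returned.
prompt = build_prompt(
    "Who is the narrator?",
    ["Call me Ishmael.", "The narrator signs aboard the Pequod."],
)
print(prompt)
```

The resulting string is what gets sent to the LLM for the "Generation" step.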

The Standard RAG Pipeline

There are two phases: indexing (offline) and retrieval (runtime) [2].

Indexing:

Split large documents into smaller chunks (typically a few hundred tokens each)
Convert each chunk into a vector embedding using an embedding model
Store embeddings in a vector database (e.g., FAISS, Pinecone, Milvus)
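The three indexing steps above can be sketched in a few lines. This is a toy version under stated assumptions: words stand in for tokens, toy_embed is a hashed bag-of-words vector standing in for a real embedding model, and a plain Python list stands in for a vector database such as FAISS. All function names here are hypothetical.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of chunk_size words that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Assumption: a hashed bag-of-words vector, NOT a real semantic embedding."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(text: str) -> list[tuple[str, list[float]]]:
    """Chunk the text, embed each chunk, and keep (chunk, vector) pairs in memory."""
    return [(chunk, toy_embed(chunk)) for chunk in chunk_text(text)]


# A synthetic 500-word "book" so the sketch runs without any input file.
book = " ".join(f"word{i}" for i in range(500))
index = build_index(book)
print(len(index), "chunks,", len(index[0][1]), "dimensions each")
```

In a real system the embedding call and the storage layer would be swapped for an actual model and vector database. The overlap between neighbouring chunks is one common way to avoid cutting a sentence's context in half.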

Retrieval:

Embed the user's query using the same embedding model
Run a semantic similarity search (e.g., cosine similarity) to find the top-K most relevant chunks
Inject those chunks into the LLM prompt as context
The LLM generates a response grounded in the retrieved content [2]
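The search part of the retrieval phase can be sketched as follows. This is a toy illustration, assuming a sparse word-count vector in place of a real embedding model, so it captures word overlap rather than true semantic similarity. The names embed, cosine, and retrieve are hypothetical helpers, and the sample chunks are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Assumption: word counts as a stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2):
    """Embed the query with the same function and return the top-k (score, chunk) pairs."""
    q = embed(query)
    scored = [(cosine(q, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


chunks = [
    "the whale attacked the ship at dawn",
    "the captain lost his leg to the whale",
    "the cook prepared soup for the crew",
]
# Indexing phase: chunk vectors are computed once, ahead of time.
index = [(chunk, embed(chunk)) for chunk in chunks]

# Retrieval phase: only the query is embedded at runtime.
top = retrieve("what did the whale attack", index, k=2)
for score, chunk in top:
    print(f"{score:.3f}  {chunk}")
```

The two chunks that mention the whale rank above the unrelated one. Those top-K chunks are then what gets injected into the prompt.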

Figure: RAG pipeline, from chunking to LLM generation

RAG Variants

Not all RAG implementations are equal. Retrieval quality is what separates a basic RAG from a production-grade one [4].

Approach	Method	Strength
Basic RAG	Embedding similarity only	Semantic understanding
Hybrid RAG	Embeddings + BM25 keyword search	Semantic + exact match
Contextual RAG	Chunk-level context prepended before embedding	Better chunk disambiguation
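For the hybrid row, the two result lists have to be merged somehow. The sketch below uses reciprocal rank fusion (RRF), which is one common way to do it. Treat the choice of RRF, the constant k=60, and the chunk IDs as assumptions for illustration, since the sources cited here do not prescribe a specific fusion method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks
    that rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical outputs of the two retrievers, best match first.
semantic_ranking = ["c3", "c1", "c7"]  # from embedding similarity
keyword_ranking = ["c1", "c9", "c3"]   # from BM25 keyword search

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

Chunks c1 and c3 appear in both lists and therefore end up ahead of c9 and c7, which only one retriever found.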

Contextual Retrieval (introduced by Anthropic) prepends a short, LLM-generated description of where each chunk sits in its source document before embedding it. Anthropic reports that this reduced the top-20 retrieval failure rate by 35%, and by 49% when combined with contextual BM25 [3].
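A rough sketch of the contextual step follows, assuming the context text comes from some function that sees both the whole document and the chunk. In practice that function would be an LLM call. Here placeholder_describe is a hypothetical stand-in that only reads the document's first line, so the example runs offline.

```python
from typing import Callable


def contextualize_chunks(
    document: str,
    chunks: list[str],
    describe: Callable[[str, str], str],
) -> list[str]:
    """Prepend a short document-level context to each chunk before embedding."""
    return [f"{describe(document, chunk)}\n\n{chunk}" for chunk in chunks]


def placeholder_describe(document: str, chunk: str) -> str:
    """Hypothetical stand-in for an LLM call that situates the chunk."""
    lines = document.splitlines()
    title = lines[0] if lines else "untitled"
    return f"This excerpt is from '{title}'."


document = "Quarterly Report Q2\nRevenue grew 3% over the previous quarter."
chunks = ["Revenue grew 3% over the previous quarter."]

contextual = contextualize_chunks(document, chunks, placeholder_describe)
print(contextual[0])
```

The contextualized strings, not the raw chunks, are what would be embedded and indexed. The raw chunk on its own does not say which report or quarter it refers to, which is the ambiguity this variant targets.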

Why RAG Over Full-Context Stuffing?

LLMs have finite context windows, and sending an entire large document is both expensive and noisy [4]. RAG solves this by being selective — only the relevant chunks are retrieved, keeping the prompt focused and reducing hallucination risk [1].
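To make the cost argument concrete, here is a back-of-the-envelope comparison. Every number in it is an assumption chosen for illustration: a 100,000-word book, roughly 1.3 tokens per word, 300-token chunks, and the top 5 chunks per query.

```python
def full_context_tokens(book_words: int, tokens_per_word: float = 1.3) -> int:
    """Approximate prompt size when the whole book is sent with every question."""
    return round(book_words * tokens_per_word)


def rag_context_tokens(top_k: int, chunk_tokens: int) -> int:
    """Approximate prompt size when only the top-k chunks are sent."""
    return top_k * chunk_tokens


# Assumed figures, not measurements.
full = full_context_tokens(book_words=100_000)
rag = rag_context_tokens(top_k=5, chunk_tokens=300)

print(f"Full book per query: ~{full:,} tokens")
print(f"RAG per query:       ~{rag:,} tokens")
print(f"Ratio:               ~{full / rag:.0f}x")
```

The exact ratio depends on the tokenizer, the chunk size, and K, but the gap stays large for any book-sized corpus, and it applies to every question asked.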

The trade-off is retrieval quality: if your retrieval step misses a relevant chunk, the LLM simply won't have that information. This is why hybrid retrieval (semantic + keyword) and good chunking strategies matter in production systems [2].

References

[1] IBM. What is RAG (Retrieval Augmented Generation)? https://www.ibm.com/think/topics/retrieval-augmented-generation

[2] Microsoft Azure. Develop a RAG Solution - Generate Embeddings Phase. https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-generate-embeddings

[3] Anthropic. Contextual Retrieval. https://www.anthropic.com/news/contextual-retrieval

[4] SuperAnnotate. RAG vs. Long-context LLMs. https://www.superannotate.com/blog/rag-vs-long-context-llms