Retrieval-Augmented Generation, or RAG, is one of the most useful AI patterns of the last few years. It is also one of the easiest to do badly.
This post explains RAG in simple language and walks through what tends to matter most in real projects.
What RAG actually is
A RAG system has three steps:
- Retrieve. Given a user's question, search your own documents for the most relevant snippets.
- Augment. Pass those snippets to a language model as context.
- Generate. Ask the model to answer the question using the retrieved snippets, ideally with citations.
The point is to ground a model in your data so it can answer questions about it without being fine-tuned on it.
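To make the three steps concrete, here is a minimal sketch in Python. It is a toy, not a reference implementation: the retriever ranks snippets by shared words (a real system would use an index), and `generate` is a hypothetical stand-in for whatever model call you use. All function names are assumptions made for illustration.

```python
from typing import Callable


def retrieve(question: str, snippets: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank snippets by how many question words they share."""
    q_words = set(question.lower().split())
    ranked = sorted(
        snippets,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(question: str, context: list[str]) -> str:
    """Put the retrieved snippets in front of the question, numbered for citation."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(context, start=1))
    return f"Context:\n{numbered}\n\nQuestion: {question}"


def answer(
    question: str,
    snippets: list[str],
    generate: Callable[[str], str],  # hypothetical: wraps your model call
    k: int = 2,
) -> str:
    """Retrieve, augment, then generate."""
    return generate(augment(question, retrieve(question, snippets, k)))
```

Passing `generate=lambda prompt: prompt` is a handy way to inspect exactly what the model would see before you wire in a real model.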
When RAG is a good fit
- You have a large, mostly-text corpus (product docs, policies, customer history, a knowledge base).
- The right answer changes over time, and you do not want to re-train every week.
- You want the model to cite its sources.
- The questions tend to be specific ("when does our refund window close for international orders?") rather than open-ended.
When RAG is the wrong tool
- The answer requires reasoning over the whole document set at once (e.g., "summarize all our policies into one"). RAG retrieves chunks; it does not see everything.
- The data is mostly numeric or structured. SQL or a search engine is usually better.
- You need exact, deterministic answers (legal, medical, regulatory) without any model rephrasing.
- The corpus is small (under a few hundred snippets). A well-prompted model with all the docs in context may beat a RAG pipeline you have to maintain.
The five things that actually matter
Most RAG projects succeed or fail on these five points, not on which model you pick.
1. Data quality
If your documents are noisy, the retrieved snippets will be noisy. The model will use them. The output will be wrong.
Spend time before you write any code:
- Deduplicate near-identical pages.
- Strip nav, headers, footers, and boilerplate.
- Decide what to do with stale or contradictory documents.
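As one way to approach the deduplication step, the sketch below compares pages by Jaccard similarity over three-word shingles and keeps only the first page of each near-identical group. The technique, the `0.8` threshold, and the function names are assumptions chosen for illustration; the pairwise comparison is quadratic, so a large corpus would probably call for something like MinHash instead.

```python
import re


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Lowercased word n-grams; punctuation and spacing differences are ignored."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(pages: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a page only if it is not too similar to any page already kept."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for page in pages:
        sh = shingles(page)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(page)
            kept_shingles.append(sh)
    return kept
```

The threshold is worth tuning on a sample of your own pages: too low and distinct pages get merged, too high and trivial variants slip through.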
2. Chunking
How you split documents into snippets matters more than people think.
- Too small (≤100 tokens) and the snippets lose context.
- Too large (≥2,000 tokens) and the model wastes attention on irrelevant text.
- Splitting in the middle of a section can cut the right answer in two, so that neither chunk contains enough of it to be retrieved.
A common starting point is 500–800 tokens with 50–100 tokens of overlap. Then measure.
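A fixed-size splitter with overlap might look like the sketch below. It assumes the text is already tokenized into a list (splitting on whitespace is a rough stand-in; a real tokenizer will count differently), and the function name and defaults are illustrative only.

```python
def chunk_tokens(
    tokens: list[str], size: int = 500, overlap: int = 100
) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks
```

This version ignores section boundaries entirely. Splitting on headings first and then applying the window inside each section is a common refinement.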
3. Retrieval
Vector search alone (cosine similarity on embeddings) is rarely enough.
Better defaults:
- Combine vector search with keyword search (e.g., BM25), then re-rank the merged results.
- Retrieve more candidates than you plan to use, re-rank them, and keep only the top k.
- Add metadata filters (date range, product, region) when you have them.
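One simple way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). It is used here as an illustrative choice, not the only option. The sketch assumes each retriever has already returned an ordered list of document IDs; the constant `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that both retrievers rank highly rise to the top. A dedicated re-ranking model can then reorder the fused candidates before you cut to the final top k.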
4. Prompting
Tell the model exactly what you want:
- "Answer using only the snippets below. If the answer is not in the snippets, say you do not know."
- "Cite each fact with the source title."
- "Do not invent product names."
Without this kind of instruction, the model will fall back to its general knowledge — sometimes correctly, sometimes not.
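The instructions above can be assembled into a prompt with a small helper like this one. The snippet shape (a dict with `title` and `text` keys) and the function name are assumptions for illustration; adapt them to whatever your retriever returns.

```python
def build_prompt(question: str, snippets: list[dict[str, str]]) -> str:
    """Combine the grounding instructions, the retrieved snippets, and the question."""
    sources = "\n\n".join(
        f"Source title: {s['title']}\n{s['text']}" for s in snippets
    )
    return (
        "Answer using only the snippets below. "
        "If the answer is not in the snippets, say you do not know.\n"
        "Cite each fact with the source title.\n"
        "Do not invent product names.\n\n"
        f"Snippets:\n{sources}\n\n"
        f"Question: {question}\n"
    )
```

Keeping the prompt in one function also makes it easy to version and to test alongside the rest of the pipeline.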
5. Evaluation
Build the evaluation harness before you start tuning. A common pattern:
- Build a fixed test set of 30–100 real questions.
- Write a reference answer for each.
- Score each new system version automatically (faithfulness, relevance) and with at least one human reviewer per release.
- Track the scores in a small dashboard or spreadsheet.
Without this, "the new version feels better" is the only argument you can make. That is not enough for production.
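A bare-bones harness can be as small as the sketch below. It scores each answer against its reference with token-overlap F1, which is only a rough proxy for relevance and says nothing about faithfulness (that usually needs a judge model or a human). The test-set shape and the `system` callable are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Callable


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference answer."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(system: Callable[[str], str], test_set: list[dict[str, str]]) -> float:
    """Run every test question through the system and return the mean score."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    scores = [
        token_f1(system(case["question"]), case["reference"]) for case in test_set
    ]
    return sum(scores) / len(scores)
```

Run `evaluate` on the same fixed test set for every version and log the result next to the version name. That gives you the trend line the dashboard or spreadsheet is meant to show.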
A minimum viable RAG project
If you have not built one before, here is a small scope that fits in two weeks:
- 1 source corpus, cleaned
- Chunked at 500 tokens with 100 tokens of overlap
- pgvector or a managed vector DB
- Top-20 retrieval, re-rank to top-5
- A prompt that requires citations
- A 50-question eval set with reference answers
- A small Next.js or Streamlit interface
That is enough to know whether RAG is the right approach for your use case.
Common mistakes
- Picking the embedding model based on a benchmark, not on your data.
- Forgetting that the user can see the citations (so they need to make sense).
- No way to update the index when documents change.
- Skipping evaluation, then arguing about quality forever.
- Using a top-tier general model when a smaller one would be cheaper and just as good with grounded context.
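On the index-update mistake in particular, one lightweight approach is to store a content hash next to each indexed document and compare hashes on every sync. The sketch below only works out what needs to change; the actual embedding and upsert calls depend on your vector store and are left out. The function names and the dict shapes are assumptions for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_update(
    indexed: dict[str, str],       # doc_id -> hash stored at last indexing
    current_docs: dict[str, str],  # doc_id -> current document text
) -> dict[str, list[str]]:
    """Work out which documents to add, re-embed, or delete."""
    current = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    return {
        "add": sorted(d for d in current if d not in indexed),
        "update": sorted(
            d for d in current if d in indexed and indexed[d] != current[d]
        ),
        "delete": sorted(d for d in indexed if d not in current),
    }
```

Only the documents listed under `add` and `update` need to be re-chunked and re-embedded, which keeps routine syncs cheap.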
Related reading
- How to Write a Clear AI Project Brief — useful before posting a RAG job.
- What to Ask Before Hiring an AI Engineer