RAG quality is not only about model choice.
In practice, your results depend heavily on how you tune parameters across both retrieval and generation. The wrong knobs can cause hallucinations, missed context, slow responses, and higher cost. The right knobs make your assistant grounded, fast, and reliable.
This guide is provider-neutral and focuses on the most practical parameters to tune first.
A) Retrieval Parameters (What Context You Fetch)
1) topK
How many chunks/documents you retrieve before generation.
- Too low: you may miss relevant evidence.
- Too high: you flood context with noise and increase latency/cost.
Practical default: topK between 4 and 8 for most QA cases.
2) score_threshold (or min_similarity)
Minimum relevance score required to keep a retrieved chunk.
- Higher threshold: cleaner but fewer chunks.
- Lower threshold: more coverage but more irrelevant context.
Use this to improve precision when retrieval returns weak matches.
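The interaction between `topK` and `score_threshold` can be sketched with a toy in-memory index. The cosine function and the example vectors are stand-ins; in a real system the scores would come from your embedding model and vector store:

```python
# Sketch: top-k retrieval with a minimum-similarity cutoff (toy data).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=6, score_threshold=0.35):
    """Return up to top_k chunks whose similarity clears the threshold."""
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(s, c) for s, c in scored[:top_k] if s >= score_threshold]

index = [
    ("refund policy chunk", [0.9, 0.1, 0.0]),
    ("shipping times chunk", [0.2, 0.9, 0.1]),
    ("unrelated blog chunk", [0.0, 0.1, 0.9]),
]
hits = retrieve([1.0, 0.2, 0.0], index, top_k=2, score_threshold=0.35)
```

Raising `score_threshold` here would drop weaker matches even when `topK` slots remain, which is exactly the precision lever described above.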
3) rerank_top_n
Number of retrieved items sent to a reranker model to improve ordering.
- Useful when semantic search returns mixed quality in top results.
- Often better than blindly increasing `topK`.
Example: retrieve 20, rerank, keep top 5.
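The retrieve-wide-then-rerank pattern can be sketched as below. `rerank_score` is a toy lexical-overlap stand-in for a real cross-encoder/reranker call:

```python
# Sketch: retrieve a wide candidate pool, rerank it, keep a narrow slice.

def rerank_score(query, chunk):
    # Toy overlap score in place of a real reranker model.
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)

def retrieve_then_rerank(query, candidates, rerank_top_n=20, keep=5):
    pool = candidates[:rerank_top_n]      # e.g. 20 hits from vector search
    pool.sort(key=lambda c: rerank_score(query, c), reverse=True)
    return pool[:keep]                    # keep only the best few for the prompt

candidates = [
    "refund policy for orders",
    "shipping options overview",
    "totally unrelated text",
    "refund timeline details",
]
best = retrieve_then_rerank("refund policy", candidates, rerank_top_n=4, keep=2)
```

The point of the pattern is that the expensive reranker only sees `rerank_top_n` items, while the prompt only pays for `keep` of them.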
4) chunk_size
How large each chunk is when indexing documents.
- Small chunks: precise retrieval but risk losing broader context.
- Large chunks: richer context but lower precision and more token usage.
Typical starting range: 300-800 tokens.
5) chunk_overlap
How much neighboring chunk text repeats.
- Helps preserve meaning across chunk boundaries.
- Too much overlap increases index size and near-duplicate retrieval.
Typical starting range: 50-120 tokens.
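A minimal token-level chunker makes the `chunk_size`/`chunk_overlap` relationship concrete. This sketch assumes the document is already tokenized into a list; real pipelines would use their tokenizer of choice:

```python
def chunk_tokens(tokens, chunk_size=500, chunk_overlap=80):
    """Split a token list into fixed-size chunks whose tails overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap     # how far the window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                         # last window already covers the tail
    return chunks

chunks = chunk_tokens(list(range(1000)), chunk_size=500, chunk_overlap=80)
```

Note how a larger overlap shrinks `step`, so the same document yields more, near-duplicate chunks, which is the index-bloat risk mentioned above.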
6) Context Budget
How many retrieved tokens you actually include in the final prompt.
- Even good retrieval can fail if you overfill context.
- Reserve space for system instructions and output tokens.
Treat context as a budget, not an unlimited buffer.
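Treating context as a budget can be sketched as a greedy fill. Using `len` as the token counter is a toy assumption; swap in your tokenizer's count function:

```python
def fit_to_budget(chunks, context_budget_tokens=3500, count=len):
    """Greedily keep best-first chunks until the token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:                  # assumed already sorted best-first
        cost = count(chunk)
        if used + cost > context_budget_tokens:
            break                         # budget exhausted; stop filling
        kept.append(chunk)
        used += cost
    return kept, used

kept, used = fit_to_budget(["a" * 2000, "b" * 1200, "c" * 500],
                           context_budget_tokens=3500)
```

Because the loop stops at the first chunk that overflows, the best evidence always survives trimming; reserve the rest of the window for instructions and output.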
B) Generation Parameters (How The Model Responds)
1) temperature
Controls randomness.
- Lower (0.0-0.3): more deterministic, good for factual RAG.
- Higher (0.6+): more creative, but higher hallucination risk.
2) top_p (nucleus sampling)
Chooses from the smallest set of tokens whose cumulative probability reaches p.
- Lower `top_p`: more conservative output.
- Higher `top_p`: broader token exploration.
Common practical range: 0.8-1.0.
3) top_k (if supported)
Restricts next-token sampling to the top k candidates.
Important: this is a generation parameter, different from retrieval topK.
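How the three sampling knobs interact can be shown with a toy next-token sampler over a logit dictionary. This is an illustrative reimplementation, not any provider's actual decoding code:

```python
import math
import random

def sample_next(logits, temperature=0.2, top_p=0.9, top_k=40, rng=random):
    """Toy sampler: temperature scaling, then top_k, then nucleus (top_p)."""
    if temperature <= 0:
        return max(logits, key=logits.get)            # greedy decoding
    # 1) temperature: sharpen (<1) or flatten (>1) the distribution
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    probs = {t: math.exp(v - m) for t, v in scaled.items()}
    total = sum(probs.values())
    probs = {t: p / total for t, p in probs.items()}
    # 2) top_k: keep only the k most likely tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # 3) top_p: shrink further to the smallest nucleus reaching p
    nucleus, cum = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cum += p
        if cum >= top_p:
            break
    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights)[0]
```

With a low temperature and a confident distribution, the nucleus often collapses to a single token, which is why low-temperature factual RAG answers are stable run to run.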
4) max_output_tokens (or max_tokens)
Caps output length.
- Too low: answers get cut off.
- Too high: unnecessary cost and slower responses.
Set this based on expected answer style (short QA vs long explanation).
5) stop_sequences
Defines where generation should stop.
Useful for:
- forcing concise answers,
- separating answer/citation sections,
- reducing verbose drift.
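Most providers enforce stop sequences server-side; a client-side version of the same idea, useful for models or streams that lack native support, is a simple earliest-match truncation:

```python
def apply_stop_sequences(text, stop_sequences=("\n\n###", "<END>")):
    """Truncate generated text at the earliest stop sequence, if any appears."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)           # earliest stop wins
    return text[:cut]
```

The stop marker itself is dropped, matching the common provider behavior of excluding the stop sequence from the returned text.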
C) Starter Presets (Provider-Neutral)
These are practical starting points, not strict rules.
1) Factual RAG QA
{
"retrieval": {
"topK": 6,
"score_threshold": 0.35,
"chunk_size": 500,
"chunk_overlap": 80
},
"generation": {
"temperature": 0.2,
"top_p": 0.9,
"max_output_tokens": 500
}
}
2) Summarization Across Multiple Docs
{
"retrieval": {
"topK": 10,
"score_threshold": 0.25,
"chunk_size": 700,
"chunk_overlap": 100,
"rerank_top_n": 15
},
"generation": {
"temperature": 0.3,
"top_p": 0.95,
"max_output_tokens": 900
}
}
3) Drafting With Retrieved Context
{
"retrieval": {
"topK": 8,
"score_threshold": 0.2,
"chunk_size": 600,
"chunk_overlap": 90
},
"generation": {
"temperature": 0.6,
"top_p": 0.95,
"max_output_tokens": 1200
}
}
D) Debug Checklist (Symptom -> Likely Cause -> First Fix)
- Hallucinations despite RAG -> noisy retrieval -> increase `score_threshold` and reduce `topK`.
- Correct answer exists but is missed -> low recall -> raise `topK` or improve chunking.
- Answer is too generic -> weak grounding -> tighten prompt to require citations and reduce temperature.
- Response is slow -> too much context -> lower `topK`, reduce chunk size, apply reranking.
- Answer is cut mid-sentence -> output cap too low -> increase `max_output_tokens`.
- Repeated near-duplicate evidence -> overlap too high -> reduce `chunk_overlap`.
- Inconsistent answers run-to-run -> high randomness -> lower `temperature` and keep `top_p` stable.
E) Minimal Provider-Neutral Config Pattern
{
"retrieval": {
"topK": 6,
"score_threshold": 0.3,
"chunk_size": 500,
"chunk_overlap": 80,
"rerank_top_n": 12,
"context_budget_tokens": 3500
},
"generation": {
"temperature": 0.2,
"top_p": 0.9,
"top_k": 40,
"max_output_tokens": 700,
"stop_sequences": ["\n\n###", "<END>"]
}
}
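One way to keep such a config honest is to validate it at load time. The field names below follow the pattern above; the specific sanity bounds are suggestions, not provider requirements:

```python
import json

RAW = json.dumps({
    "retrieval": {"topK": 6, "score_threshold": 0.3, "chunk_size": 500,
                  "chunk_overlap": 80, "rerank_top_n": 12,
                  "context_budget_tokens": 3500},
    "generation": {"temperature": 0.2, "top_p": 0.9, "top_k": 40,
                   "max_output_tokens": 700,
                   "stop_sequences": ["\n\n###", "<END>"]},
})

def load_config(raw):
    """Parse the JSON config and reject obviously inconsistent values."""
    cfg = json.loads(raw)
    r, g = cfg["retrieval"], cfg["generation"]
    assert r["chunk_overlap"] < r["chunk_size"], "overlap must be < chunk_size"
    assert r["rerank_top_n"] >= r["topK"], "rerank pool should cover topK"
    assert 0.0 <= g["temperature"] <= 2.0
    assert 0.0 < g["top_p"] <= 1.0
    return cfg

cfg = load_config(RAW)
```

Catching a bad `chunk_overlap`/`chunk_size` pairing at startup is much cheaper than discovering it through degraded retrieval quality later.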
Recommended tuning order:
- Fix chunking (`chunk_size`, `chunk_overlap`).
- Tune retrieval (`topK`, threshold, rerank).
- Tune generation (`temperature`, `top_p`, output cap).
- Re-check with a stable eval set.
10 Questions To Check Your Understanding
- What is the difference between retrieval `topK` and generation `top_k`?
- When should you increase `score_threshold` instead of reducing `topK`?
- How can a very large `chunk_size` hurt answer quality?
- Why is increasing `chunk_overlap` not always better?
- What happens if `temperature` is high in a factual RAG assistant?
- Why should you usually tune `temperature` and `top_p` one at a time?
- How does `max_output_tokens` affect cost and truncation risk?
- When is reranking more useful than simply increasing retrieval `topK`?
- What symptoms indicate your context budget is too small?
- In what order should you tune parameters when starting from defaults?
Further Reading / Documentation
- OpenAI Text Generation docs: https://platform.openai.com/docs/guides/text-generation
- OpenAI Responses API docs: https://platform.openai.com/docs/api-reference/responses
- Anthropic Messages API docs: https://docs.anthropic.com/en/api/messages
- Google Gemini API docs: https://ai.google.dev/gemini-api/docs
- Azure AI Search RAG overview: https://learn.microsoft.com/azure/search/retrieval-augmented-generation-overview
- LangChain Retrieval conceptual guide: https://python.langchain.com/docs/concepts/retrieval/
- LlamaIndex RAG guide: https://docs.llamaindex.ai/en/stable/understanding/rag/