
Common LLM Parameters To Know For RAG

A practical, provider-neutral guide to the key retrieval and generation parameters that control RAG quality, latency, and cost.

llm, rag, ai, retrieval, prompt-engineering, software-development

RAG quality is not only about model choice.

In practice, your results depend heavily on how you tune parameters across both retrieval and generation. The wrong knobs can cause hallucinations, missed context, slow responses, and higher cost. The right knobs make your assistant grounded, fast, and reliable.

This guide is provider-neutral and focuses on the most practical parameters to tune first.


A) Retrieval Parameters (What Context You Fetch)

1) topK

How many chunks/documents you retrieve before generation.

  • Too low: you may miss relevant evidence.
  • Too high: you flood context with noise and increase latency/cost.

Practical default: topK between 4 and 8 for most QA cases.
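The effect of topK can be sketched with a tiny in-memory index. This is a minimal illustration, not a production vector store: embeddings are hand-written 2-D vectors, and the `retrieve` and `cosine` helpers are hypothetical names, not any particular library's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=6):
    """Return the top_k (score, chunk) pairs ranked by similarity."""
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy index: (chunk text, embedding) pairs.
index = [("chunk A", [1.0, 0.0]), ("chunk B", [0.7, 0.7]), ("chunk C", [0.0, 1.0])]
results = retrieve([1.0, 0.1], index, top_k=2)  # only the 2 best chunks survive
```

Raising `top_k` here from 2 to 3 would pull in "chunk C" as well, which is the noise/latency trade-off described above.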

2) score_threshold (or min_similarity)

Minimum relevance score required to keep a retrieved chunk.

  • Higher threshold: cleaner but fewer chunks.
  • Lower threshold: more coverage but more irrelevant context.

Use this to improve precision when retrieval returns weak matches.
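Applied after retrieval, the threshold is a simple filter over scored hits. A minimal sketch (scores and names are illustrative):

```python
def apply_threshold(scored_chunks, score_threshold=0.35):
    """Keep only chunks whose retrieval score meets the threshold."""
    return [(score, chunk) for score, chunk in scored_chunks
            if score >= score_threshold]

hits = [(0.82, "strong match"), (0.41, "ok match"), (0.19, "weak match")]
kept = apply_threshold(hits, score_threshold=0.35)  # drops the weak match
```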

3) rerank_top_n

Number of retrieved items sent to a reranker model to improve ordering.

  • Useful when semantic search returns mixed quality in top results.
  • Often better than blindly increasing topK.

Example: retrieve 20, rerank, keep top 5.
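The "retrieve wide, rerank, keep few" pattern can be sketched as below. Real rerankers are cross-encoder models scoring (query, document) pairs; here a crude token-overlap score stands in for one, purely for illustration.

```python
def rerank(query, candidates, keep=5):
    """Re-order candidates with a relevance scorer, keep the best few.
    score() is a token-overlap stand-in for a real cross-encoder reranker."""
    def score(text):
        q = set(query.lower().split())
        t = set(text.lower().split())
        return len(q & t) / max(len(q), 1)
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:keep]

# Fetch a wide candidate pool (~20), then keep only the top 5 after reranking.
pool = [f"doc {i} about vector search latency" for i in range(18)] + [
    "tuning retrieval topK for rag", "rag chunk overlap guide"]
final = rerank("how to tune topK in rag retrieval", pool, keep=5)
```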

4) chunk_size

How large each chunk is when indexing documents.

  • Small chunks: precise retrieval but risk losing broader context.
  • Large chunks: richer context but lower precision and more token usage.

Typical starting range: 300-800 tokens.

5) chunk_overlap

How much text consecutive chunks share at their boundaries.

  • Helps preserve meaning across chunk boundaries.
  • Too much overlap increases index size and near-duplicate retrieval.

Typical starting range: 50-120 tokens.
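The interaction between chunk_size and chunk_overlap can be sketched with a sliding-window splitter. Whitespace tokens stand in for model tokens here; a real indexer would split on the embedding model's tokenizer.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=80):
    """Split text into overlapping chunks of roughly chunk_size tokens,
    where consecutive chunks share chunk_overlap tokens."""
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1200))  # 1200-token toy document
chunks = chunk_text(doc, chunk_size=500, chunk_overlap=80)
```

With these settings a 1200-token document yields three chunks, and each pair of neighbors shares exactly 80 tokens, which is the boundary-preserving behavior described above.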

6) Context Budget

How many retrieved tokens you actually include in the final prompt.

  • Even good retrieval can fail if you overfill context.
  • Reserve space for system instructions and output tokens.

Treat context as a budget, not an unlimited buffer.
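Enforcing the budget usually means greedy packing of the best-ranked chunks. A minimal sketch, again approximating token counts with whitespace splitting (a real pipeline would use the model's tokenizer):

```python
def fit_to_budget(chunks, context_budget_tokens=3500):
    """Greedily pack highest-ranked chunks until the token budget is spent.
    Assumes chunks are already sorted by relevance, best first."""
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())  # rough token estimate
        if used + cost > context_budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept, used

ranked = ["a " * 1500, "b " * 1500, "c " * 1500]  # ~1500 tokens each
kept, used = fit_to_budget(ranked, context_budget_tokens=3500)  # third chunk dropped
```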


B) Generation Parameters (How The Model Responds)

1) temperature

Controls randomness.

  • Lower (0.0-0.3): more deterministic, good for factual RAG.
  • Higher (0.6+): more creative, but higher hallucination risk.

2) top_p (nucleus sampling)

Samples only from the smallest set of tokens whose cumulative probability reaches p.

  • Lower top_p: more conservative output.
  • Higher top_p: broader token exploration.

Common practical range: 0.8-1.0.
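The nucleus itself is easy to compute by hand, which makes the parameter less mysterious. A minimal sketch over a toy next-token distribution (the probabilities are invented for illustration):

```python
def nucleus_set(probs, top_p=0.9):
    """Return the smallest set of tokens whose cumulative probability
    reaches top_p; sampling would then happen within this set."""
    chosen, total = [], 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(token)
        total += p
        if total >= top_p:
            break
    return chosen

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
broad = nucleus_set(probs, top_p=0.9)   # keeps "the", "a", "cat"
narrow = nucleus_set(probs, top_p=0.5)  # keeps only "the"
```

Lowering top_p from 0.9 to 0.5 shrinks the candidate set from three tokens to one, which is exactly the conservative-vs-exploratory trade-off above.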

3) top_k (if supported)

Restricts next-token sampling to the top k candidates.

Important: this is a generation parameter, different from retrieval topK.

4) max_output_tokens (or max_tokens)

Caps output length.

  • Too low: answers get cut off.
  • Too high: unnecessary cost and slower responses.

Set this based on expected answer style (short QA vs long explanation).

5) stop_sequences

Defines where generation should stop.

Useful for:

  • forcing concise answers,
  • separating answer/citation sections,
  • reducing verbose drift.
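Providers apply stop sequences server-side during decoding; the client-side equivalent is truncating at the earliest match, which this sketch mirrors (the sample text is illustrative):

```python
def apply_stop_sequences(text, stop_sequences):
    """Truncate text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

raw = "Paris is the capital of France.\n\n### Sources\n[1] ..."
answer = apply_stop_sequences(raw, ["\n\n###", "<END>"])
```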

C) Starter Presets (Provider-Neutral)

These are practical starting points, not strict rules.

1) Factual RAG QA

JSON
{
  "retrieval": {
    "topK": 6,
    "score_threshold": 0.35,
    "chunk_size": 500,
    "chunk_overlap": 80
  },
  "generation": {
    "temperature": 0.2,
    "top_p": 0.9,
    "max_output_tokens": 500
  }
}

2) Summarization Across Multiple Docs

JSON
{
  "retrieval": {
    "topK": 10,
    "score_threshold": 0.25,
    "chunk_size": 700,
    "chunk_overlap": 100,
    "rerank_top_n": 15
  },
  "generation": {
    "temperature": 0.3,
    "top_p": 0.95,
    "max_output_tokens": 900
  }
}

3) Drafting With Retrieved Context

JSON
{
  "retrieval": {
    "topK": 8,
    "score_threshold": 0.2,
    "chunk_size": 600,
    "chunk_overlap": 90
  },
  "generation": {
    "temperature": 0.6,
    "top_p": 0.95,
    "max_output_tokens": 1200
  }
}

D) Debug Checklist (Symptom -> Likely Cause -> First Fix)

  1. Hallucinations despite RAG -> noisy retrieval -> increase score_threshold and reduce topK.
  2. Correct answer exists but is missed -> low recall -> raise topK or improve chunking.
  3. Answer is too generic -> weak grounding -> tighten prompt to require citations and reduce temperature.
  4. Response is slow -> too much context -> lower topK, reduce chunk size, apply reranking.
  5. Answer is cut mid-sentence -> output cap too low -> increase max_output_tokens.
  6. Repeated near-duplicate evidence -> overlap too high -> reduce chunk_overlap.
  7. Inconsistent answers run-to-run -> high randomness -> lower temperature and keep top_p stable.

E) Minimal Provider-Neutral Config Pattern

JSON
{
  "retrieval": {
    "topK": 6,
    "score_threshold": 0.3,
    "chunk_size": 500,
    "chunk_overlap": 80,
    "rerank_top_n": 12,
    "context_budget_tokens": 3500
  },
  "generation": {
    "temperature": 0.2,
    "top_p": 0.9,
    "top_k": 40,
    "max_output_tokens": 700,
    "stop_sequences": ["\n\n###", "<END>"]
  }
}
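One easy-to-miss check on a config like this: the context budget plus the output cap (plus fixed prompt text) must fit the model's context window. A minimal sanity-check sketch; `model_window` and `prompt_overhead` are assumed values for illustration, not part of the config above.

```python
import json

config = json.loads("""{
  "retrieval": {"topK": 6, "score_threshold": 0.3, "chunk_size": 500,
                "chunk_overlap": 80, "rerank_top_n": 12,
                "context_budget_tokens": 3500},
  "generation": {"temperature": 0.2, "top_p": 0.9, "top_k": 40,
                 "max_output_tokens": 700,
                 "stop_sequences": ["\\n\\n###", "<END>"]}
}""")

def check_budget(cfg, model_window=8192, prompt_overhead=500):
    """True if retrieved context + output cap + fixed prompt text
    (system instructions, question) fit the model's context window."""
    needed = (cfg["retrieval"]["context_budget_tokens"]
              + cfg["generation"]["max_output_tokens"]
              + prompt_overhead)
    return needed <= model_window

ok = check_budget(config)
```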

Recommended tuning order:

  1. Fix chunking (chunk_size, chunk_overlap).
  2. Tune retrieval (topK, threshold, rerank).
  3. Tune generation (temperature, top_p, output cap).
  4. Re-check with a stable eval set.

10 Questions To Check Your Understanding

  1. What is the difference between retrieval topK and generation top_k?
  2. When should you increase score_threshold instead of reducing topK?
  3. How can very large chunk_size hurt answer quality?
  4. Why is increasing chunk_overlap not always better?
  5. What happens if temperature is high in a factual RAG assistant?
  6. Why should you usually tune temperature and top_p one at a time?
  7. How does max_output_tokens affect cost and truncation risk?
  8. When is reranking more useful than simply increasing retrieval topK?
  9. What symptoms indicate your context budget is too small?
  10. In what order should you tune parameters when starting from defaults?

Further Reading / Documentation