RAG Evaluation Part 1: Retriever Evaluation
A practical guide to measuring retrieval quality in RAG systems, with step-by-step examples of Context Precision and Recall calculations.

Retrieval-Augmented Generation (RAG) pipelines rely heavily on the retriever to surface useful context that enables the generator (LLM) to produce grounded, factually accurate responses. Evaluating the retriever independently allows teams to:
- Diagnose bottlenecks in the retrieval vs. generation pipeline.
- Measure coverage: Are we surfacing enough context to answer the question?
- Prioritize recall vs. precision: different tasks call for different tuning trade-offs.
In this blog, we explore advanced evaluation techniques for RAG retrievers, focusing on two key metrics—Context Precision and Context Recall—and demonstrate both traditional and LLM-powered methods to accurately quantify retrieval quality.
Context Precision
Precision@K
This is a classic metric used in information retrieval:
$$\text{Precision@K} = \frac{\text{number of relevant documents in the top-}K}{K}$$
For example, if 3 of the top 5 retrieved documents are relevant, Precision@5 = 3/5 = 0.6. This tells us how many of the top-K retrieved documents are relevant. However, it doesn’t differentiate whether relevant documents appear early or late within the top K: a relevant document at position 1 is treated the same as one at position 5.
Context Precision@K
RAG systems benefit from placing relevant context early in the ranked list. Context Precision@K captures this intuition:
$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \big(\text{Precision@}k \times v_k\big)}{\sum_{k=1}^{K} v_k}$$
Where:
- $v_k \in \{0, 1\}$ indicates whether the k-th retrieved chunk is relevant
- $\text{Precision@}k$ is the precision computed over the first $k$ results
- The denominator sums the number of relevant chunks in the top-K
This formula weights early relevance more heavily, rewarding systems that not only retrieve the right context but also rank it correctly.
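As a concrete illustration, here is a minimal Python sketch of the formula above. It assumes relevance has already been encoded as a binary list; how to obtain those labels is covered in the next sections.

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """Compute Context Precision@K from a binary relevance vector.

    relevance[k] is 1 if the chunk at rank k+1 is relevant, else 0.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # no relevant chunk was retrieved at all

    weighted_precision = 0.0
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            weighted_precision += hits / k  # Precision@k, counted only at relevant ranks

    return weighted_precision / total_relevant


# Example: context_precision_at_k([1, 0, 1]) == (1/1 + 2/3) / 2 ≈ 0.83
```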
Example 1: High-Quality Retrieval
Rank | Context | Relevant |
---|---|---|
1 | "Einstein was..." | 1 |
2 | "Einstein born..." | 1 |
3 | "Physics history..." | 0 |
4 | "Relativity theory..." | 1 |
5 | "Tesla was..." | 0 |
Example 2: Poor Retrieval
Rank | Context | Relevant |
---|---|---|
1 | "Tesla..." | 0 |
2 | "Physics..." | 0 |
3 | "Einstein was..." | 1 |
4 | "Einstein born..." | 1 |
5 | "Relativity..." | 1 |
Now we know how to calculate Context Precision@K, but how do we determine whether each retrieved chunk is relevant in the first place?
There are two main approaches:
- Non-LLM based relevance calculation
- LLM-based relevance calculation
1. Non-LLM based relevance calculation
In practical deployments, LLM-based relevance judgments can be expensive or slow. A non-LLM alternative is to use string similarity functions to simulate human-labeled relevance.
Inputs
- R = [r₁, r₂, ..., rₖ]: Retrieved chunks (ranked)
- C = [c₁, c₂, ..., cₘ]: Reference gold chunks
- sim(r, c): Similarity function (e.g., Levenshtein, cosine, Jaccard)
- τ (tau): Relevance threshold
Algorithm
- For each retrieved chunk $r_k$, compute its best similarity against the reference chunks: $s_k = \max_{c \in C} \text{sim}(r_k, c)$
- If $s_k \geq \tau$, label the chunk as relevant ($v_k = 1$); otherwise label it as not relevant ($v_k = 0$).
This generates a binary relevance vector $[v_1, v_2, \dots, v_K]$, which can then be plugged into the Context Precision@K formula.
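A minimal sketch of this procedure in Python, using difflib's SequenceMatcher ratio as a stand-in for the similarity function (Levenshtein, cosine, or Jaccard similarity could be swapped in); the default threshold is an assumption you would tune on your own data.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    # Stand-in for sim(r, c); replace with Levenshtein/cosine/Jaccard as needed.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def label_relevance(retrieved: list[str], references: list[str], tau: float = 0.7) -> list[int]:
    """Return the binary relevance vector v for the ranked retrieved chunks."""
    relevance = []
    for chunk in retrieved:
        best = max(string_similarity(chunk, ref) for ref in references)
        relevance.append(1 if best >= tau else 0)
    return relevance


# v = label_relevance(retrieved_chunks, gold_chunks)
# context_precision_at_k(v)  # from the earlier sketch
```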
2. LLM-based relevance calculation
While traditional methods use string similarity to assess relevance, they often miss deeper semantic connections. To overcome this, we can use a Large Language Model (LLM) to evaluate whether a retrieved passage is truly helpful in answering the question—based not just on word overlap, but on meaning.
This method involves passing the retrieved chunk and the reference gold chunk into a structured prompt and letting the LLM act as a semantic judge.
Here’s a sample prompt:
retrieval_relevance_prompt = """
You are a semantic relevance evaluator. Your task is to assess whether a retrieved passage (context) is helpful in answering a user's question.
Respond with 1 if the passage includes important facts, concepts, or clues that
directly or indirectly support a correct answer.
Respond with 0 if the passage is unrelated, off-topic, or does not meaningfully
contribute to answering the question.
Return only the digit 1 or 0. Do not explain your answer.
Question:
{{question}}
Context:
{{retrieved_chunk}}
Answer:
{{response_or_reference}}
Is the passage relevant to the question and helpful for deriving the given answer? (1 = Yes, 0 = No)
"""
This approach offers more flexibility and depth than rigid rule-based checks, and works particularly well in cases where language is paraphrased, or when retrieved content is conceptually relevant but lexically different.
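As a sketch of how such a judge could be wired up (llm_complete below is a placeholder for whatever chat-completion client you use; it is not part of any particular library):

```python
def llm_relevance_judge(question: str, retrieved_chunk: str, answer: str, llm_complete) -> int:
    """Ask an LLM to label one retrieved chunk as relevant (1) or not (0).

    llm_complete is a placeholder callable that sends a prompt string to your
    LLM of choice and returns its text completion.
    """
    prompt = (
        retrieval_relevance_prompt
        .replace("{{question}}", question)
        .replace("{{retrieved_chunk}}", retrieved_chunk)
        .replace("{{response_or_reference}}", answer)
    )
    verdict = llm_complete(prompt).strip()
    return 1 if verdict.startswith("1") else 0


# Labeling every retrieved chunk yields the binary vector used by Context Precision@K:
# v = [llm_relevance_judge(question, chunk, reference_answer, llm_complete) for chunk in retrieved_chunks]
```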
If you look at the ragas implementation, the LLM-based relevance check is further subdivided into two approaches, depending on whether a ground-truth answer is available:
- LLMContextPrecisionWithReference: This method assumes access to a gold answer (also called a reference). It uses the reference answer as the target anchor while evaluating each retrieved chunk for relevance.
- LLMContextPrecisionWithoutReference: This variant is designed for real-world applications where a reference answer may not be available. Instead, it uses the model’s own generated response as the evaluation anchor.
Metric | Reference Source | LLM Prompt | When to Use |
---|---|---|---|
LLMContextPrecisionWithReference | reference (gold answer) | “Was this chunk helpful for arriving at the correct answer?” | Offline test-sets where you already know the right answer for every question. |
LLMContextPrecisionWithoutReference | response (model answer) | “Was this chunk helpful for arriving at the answer the model actually produced?” | Production / live traffic where you don’t have a curated answer but still want a quick signal about retrieval quality. |
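For reference, scoring the with-reference variant in ragas looks roughly like the sketch below. Class and field names follow the ragas documentation at the time of writing and may differ across versions; evaluator_llm stands for whichever ragas-compatible LLM wrapper you configure.

```python
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithReference


async def score_sample(evaluator_llm) -> float:
    # evaluator_llm is assumed to be a ragas-compatible LLM wrapper.
    context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)
    sample = SingleTurnSample(
        user_input="Where did Einstein develop the theory of relativity?",
        reference="Einstein developed the theory of relativity in 1905 at the Swiss patent office in Bern.",
        retrieved_contexts=[
            "Einstein published the theory of relativity in a groundbreaking 1905 paper.",
            "During the early 1900s, Einstein worked at a government office evaluating patents.",
        ],
    )
    return await context_precision.single_turn_ascore(sample)


# Run with: asyncio.run(score_sample(evaluator_llm))
```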
Context Recall
RAG evaluation is not only about whether relevant chunks were retrieved, but whether the retrieved chunks collectively support the reference answer. This gives rise to a more nuanced metric: Context Recall, which works at the level of answer claims or reference contexts.
There are two primary variants of Context Recall:
- LLM-based Context Recall
- Non-LLM-based Context Recall
Each variant addresses specific evaluation scenarios based on available resources, such as labeled data or reliance on semantic understanding.
1. LLM-based Context Recall
The LLM-based approach removes the need for manual annotation of reference chunks. Instead, it leverages the semantic reasoning capabilities of Large Language Models (LLMs) to determine whether the reference (or gold-standard) answer is adequately supported by the retrieved contexts.
How it works
- Begin by breaking the reference answer into atomic factual claims—these are the smallest units of verifiable information within an answer.
- For each extracted claim, we verify:
- Is there at least one retrieved chunk of context that directly or indirectly supports this claim?
- If at least one relevant chunk exists, the claim is considered covered or supported. This metric captures the idea of semantic completeness, evaluating whether the retrieved contexts collectively cover all essential claims necessary to reconstruct a complete and accurate answer.
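The resulting score is simply the fraction of claims that are covered:
$$\text{Context Recall} = \frac{\text{number of claims supported by the retrieved contexts}}{\text{total number of claims in the reference answer}}$$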
Example
Consider the following reference answer:
"Einstein developed the theory of relativity in 1905 while working at the Swiss patent office in Bern."
This answer can be broken into four distinct factual claims:
Claim ID | Claim |
---|---|
C1 | Einstein developed the theory of relativity |
C2 | He developed it in 1905 |
C3 | He was working at the Swiss patent office |
C4 | The office was in Bern |
Assume the retriever returned the following contexts:
Rank | Retrieved Context |
---|---|
1 | “Einstein published the theory of relativity in a groundbreaking 1905 paper.” |
2 | “During the early 1900s, Einstein worked at a government office evaluating patents.” |
3 | “Einstein later moved to Berlin to join the Prussian Academy of Sciences.” |
4 | “He studied physics in Zurich before becoming one of the most influential scientists.” |
Now we match each claim against the retrieved contexts:
Claim | Supporting Chunk(s) |
---|---|
C1: Einstein developed the theory of relativity | 1 |
C2: He developed it in 1905 | 1 |
C3: He was working at the Swiss patent office | 2 |
C4: The office was in Bern | — |
In this scenario, 3 out of 4 claims are supported by retrieved contexts, yielding:
$$\text{Context Recall} = \frac{3}{4} = 0.75$$
This result indicates substantial semantic completeness, though not perfect coverage. Notably, the claim regarding the specific location "Bern" was not supported by any retrieved chunk, highlighting a retrieval gap that can be addressed for improved accuracy.
Two practical questions remain: how do we extract claims from the reference answer, and how do we decide whether each claim is supported by the retrieved contexts?
Prompt for extracting claims from the reference answer:
claim_extraction_prompt = """
You are an information analyst. Your task is to extract the smallest possible factual claims from a given answer.
Each claim must:
- Be a self-contained statement
- Represent a fact that can be verified or falsified independently
- Avoid combining multiple ideas into a single sentence
Answer:
{{ reference }}
Return the claims as a numbered list, one claim per line.
Do **not** include any explanations, formatting, or extra text.
"""
This step converts complex answers into verifiable units, setting a clear basis for semantic evaluation.
Prompt for attributing claims to retrieved contexts:
claim_context_attribution_prompt = """
You are a factual evaluator. Given a claim and a list of retrieved contexts,
decide whether the claim is directly or indirectly supported by any of the contexts.
Claim:
{{claim}}
Retrieved Contexts:
{{retrieved_chunks}}
Does any context support the claim? (1 = Yes, 0 = No)
"""
With these prompts, the LLM efficiently and consistently evaluates the semantic alignment between claims and contexts, automating and scaling the process.
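Putting the two prompts together, a minimal sketch of the full LLM-based Context Recall computation might look like this. As before, llm_complete is a placeholder for your own LLM client, and the line-based parsing of the numbered claim list is an assumption about the model's output format.

```python
import re


def llm_context_recall(reference: str, retrieved_chunks: list[str], llm_complete) -> float:
    """Fraction of reference-answer claims supported by the retrieved contexts."""
    # Step 1: break the reference answer into atomic claims.
    claims_text = llm_complete(claim_extraction_prompt.replace("{{ reference }}", reference))
    claims = [re.sub(r"^\d+[.)]\s*", "", line).strip()
              for line in claims_text.splitlines() if line.strip()]
    if not claims:
        return 0.0

    # Step 2: check each claim against the full set of retrieved contexts.
    contexts_block = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    supported = 0
    for claim in claims:
        prompt = (claim_context_attribution_prompt
                  .replace("{{claim}}", claim)
                  .replace("{{retrieved_chunks}}", contexts_block))
        if llm_complete(prompt).strip().startswith("1"):
            supported += 1

    return supported / len(claims)
```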
2. Non-LLM based Context Recall
In scenarios where manually annotated reference chunks (gold contexts) are available, a simpler yet stricter approach can be employed. This variant, Non-LLM-based Context Recall, directly matches retrieved contexts against known relevant reference contexts using similarity measures.
Inputs
- C = [c₁, c₂, …, cₘ]: Reference (gold) chunks (m chunks)
- R = [r₁, r₂, …, rₖ]: Retrieved chunks (ranked)
- sim(c, r): Similarity function (e.g., Levenshtein ratio, ROUGE-L, Jaccard)
- τ (tau): Match threshold in [0, 1]
Algorithm
- Match each reference chunk: for every reference chunk $c_i$, compute its best match among the retrieved chunks: $s_i = \max_{r \in R} \text{sim}(c_i, r)$
- Decide “recalled” or “missed”: set $y_i = 1$ if $s_i \geq \tau$, else $y_i = 0$. The vector $[y_1, y_2, \dots, y_m]$ marks which gold chunks were recovered.
- Aggregate:
$$\text{Non-LLM Context Recall} = \frac{\sum_{i=1}^{m} y_i}{m}$$
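A compact Python sketch of this procedure, again using SequenceMatcher as a stand-in similarity measure (a stricter matcher such as ROUGE-L can be substituted):

```python
from difflib import SequenceMatcher


def non_llm_context_recall(reference_chunks: list[str], retrieved_chunks: list[str], tau: float = 0.7) -> float:
    """Fraction of gold reference chunks recovered by the retriever."""
    def sim(a: str, b: str) -> float:
        # Stand-in for sim(c, r).
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    if not reference_chunks:
        return 0.0
    recalled = sum(
        1 for gold in reference_chunks
        if max((sim(gold, r) for r in retrieved_chunks), default=0.0) >= tau
    )
    return recalled / len(reference_chunks)
```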
This approach is stricter and more deterministic but less flexible, as it relies on exact or near-exact matches rather than semantic interpretation.
This metric is particularly useful for benchmarking systems against standard datasets, where reference contexts have been rigorously annotated.
Conclusion
Evaluating the retrieval component independently in RAG systems offers invaluable insights, enabling developers and researchers to pinpoint specific strengths and weaknesses. Understanding what's working (and what's not) empowers you to tune your retrieval strategies effectively. The better your retriever, the smarter and more reliable your AI answers become. By employing metrics such as Context Precision@K and Context Recall, teams can gain a nuanced understanding of retrieval quality from both classical information retrieval and semantic perspectives. Keep measuring, keep tweaking, and your RAG system will deliver consistently solid results.