RAG Evaluation Part 1: Retriever Evaluation

A practical guide to measuring retrieval quality in RAG systems, with step-by-step examples of Context Precision and Recall calculations.

RAG Evaluation

Retrieval-Augmented Generation (RAG) pipelines rely heavily on the retriever to surface useful context that enables the generator (LLM) to produce grounded, factually accurate responses. Evaluating the retriever independently allows teams to isolate retrieval failures from generation failures and to tune indexing, chunking, and ranking strategies without confounding effects from the LLM.

In this blog, we explore advanced evaluation techniques for RAG retrievers, focusing on two key metrics—Context Precision and Context Recall—and demonstrate both traditional and LLM-powered methods to accurately quantify retrieval quality.

Context Precision

Precision@K

This is a classic metric used in information retrieval:

$$\text{Precision@K} = \frac{\text{Number of relevant chunks in top-}K}{K}$$

This tells us how many of the top-K retrieved documents are relevant. However, it doesn’t differentiate whether relevant documents appear early or late within the top K. For example, a relevant document at position 1 is treated the same as one at position 5.

Context Precision@K

RAG systems benefit from placing relevant context early in the ranked list. Context Precision@K captures this intuition:

$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left( \text{Precision@k} \cdot v_k \right)}{\sum_{k=1}^{K} v_k}$$

Where:

  - $\text{Precision@k}$ is the precision computed over the top-$k$ retrieved chunks
  - $v_k \in \{0, 1\}$ is the relevance indicator for the chunk at rank $k$ (1 if relevant, 0 otherwise)

This formula weights early relevance more heavily, rewarding systems that not only retrieve the right context but prioritize it correctly.

Example 1: High-Quality Retrieval

| Rank | Context | Relevant ($v_k$) |
| --- | --- | --- |
| 1 | "Einstein was..." | 1 |
| 2 | "Einstein born..." | 1 |
| 3 | "Physics history..." | 0 |
| 4 | "Relativity theory..." | 1 |
| 5 | "Tesla was..." | 0 |

$$v = [1, 1, 0, 1, 0]$$

$$\begin{align*} \text{Precision@1} &= 1.0 \\ \text{Precision@2} &= 1.0 \\ \text{Precision@3} &= \frac{2}{3} \approx 0.67 \\ \text{Precision@4} &= \frac{3}{4} = 0.75 \\ \text{Precision@5} &= \frac{3}{5} = 0.6 \end{align*}$$

$$\text{Numerator} = 1.0 \cdot 1 + 1.0 \cdot 1 + 0.67 \cdot 0 + 0.75 \cdot 1 + 0.6 \cdot 0 = 2.75$$

$$\text{Denominator} = \sum v_k = 3$$

$$\Rightarrow \text{Context Precision@5} = \frac{2.75}{3} \approx 0.917$$

Example 2: Poor Retrieval

| Rank | Context | Relevant ($v_k$) |
| --- | --- | --- |
| 1 | "Tesla..." | 0 |
| 2 | "Physics..." | 0 |
| 3 | "Einstein was..." | 1 |
| 4 | "Einstein born..." | 1 |
| 5 | "Relativity..." | 1 |

$$v = [0, 0, 1, 1, 1]$$

$$\begin{align*} \text{Precision@3} &= \frac{1}{3} \approx 0.33 \\ \text{Precision@4} &= \frac{2}{4} = 0.5 \\ \text{Precision@5} &= \frac{3}{5} = 0.6 \end{align*}$$

$$\text{Numerator} = 0 + 0 + 0.33 \cdot 1 + 0.5 \cdot 1 + 0.6 \cdot 1 = 1.43$$

$$\text{Denominator} = \sum v_k = 3$$

$$\Rightarrow \text{Context Precision@5} = \frac{1.43}{3} \approx 0.477$$
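
To make the arithmetic concrete, here is a minimal Python sketch of the Context Precision@K computation applied to the two examples above. The function name and structure are illustrative, not taken from any particular library.

```python
from typing import List

def context_precision_at_k(relevance: List[int]) -> float:
    """Compute Context Precision@K from a binary relevance vector v.

    relevance[k-1] is 1 if the chunk at rank k is relevant, else 0.
    """
    numerator = 0.0
    relevant_so_far = 0
    for k, v_k in enumerate(relevance, start=1):
        relevant_so_far += v_k
        precision_at_k = relevant_so_far / k
        numerator += precision_at_k * v_k  # only ranks holding a relevant chunk contribute
    total_relevant = sum(relevance)
    return numerator / total_relevant if total_relevant else 0.0

print(context_precision_at_k([1, 1, 0, 1, 0]))  # ~0.917 (Example 1)
print(context_precision_at_k([0, 0, 1, 1, 1]))  # ~0.478 (Example 2)
```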

Good. Now we know how to calculate Context Precision@K, but how do we determine relevance in the first place? That is the next question to answer.

There are two main approaches to do this:

  1. Non-LLM based relevance calculation
  2. LLM-based relevance calculation

1. Non-LLM based relevance calculation

In practical deployments, LLM-based relevance judgments can be expensive or slow. A non-LLM alternative is to use string similarity functions to simulate human-labeled relevance.

Inputs

  - Retrieved chunks $R = [r_1, r_2, ..., r_K]$ (the top-K results from the retriever)
  - Reference (gold) chunks $C$
  - A string similarity function $\text{sim}(\cdot, \cdot)$
  - A relevance threshold $\tau$

Algorithm

  1. For each retrieved chunk $r_i \in R$, compute its best similarity against the reference chunks:

     $$\text{score}(r_i) = \max_{c \in C} \text{sim}(r_i, c)$$

  2. If $\text{score}(r_i) \geq \tau$, label the chunk as relevant ($v_i = 1$); otherwise, label it as not relevant ($v_i = 0$).

This produces a binary relevance vector $v = [v_1, v_2, ..., v_K]$, which can then be plugged into the Context Precision@K formula.
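
As an illustration, the sketch below uses Python's built-in difflib as the string-similarity function; the threshold value and helper names are assumptions, and any other measure (token overlap, fuzzy ratio, embedding cosine) could be substituted.

```python
from difflib import SequenceMatcher
from typing import List

def string_sim(a: str, b: str) -> float:
    """Simple string similarity in [0, 1] (assumed choice; swap in your own measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def label_relevance(retrieved: List[str], references: List[str], tau: float = 0.6) -> List[int]:
    """Label each retrieved chunk as relevant (1) or not (0) via its best match against gold chunks."""
    labels = []
    for chunk in retrieved:
        best = max(string_sim(chunk, ref) for ref in references)
        labels.append(1 if best >= tau else 0)
    return labels

retrieved = ["Einstein was a physicist...", "Tesla was an inventor..."]
gold = ["Einstein was a theoretical physicist born in 1879."]
print(label_relevance(retrieved, gold))
# The resulting vector feeds straight into the Context Precision@K calculation shown earlier.
```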

2. LLM-based relevance calculation

While traditional methods use string similarity to assess relevance, they often miss deeper semantic connections. To overcome this, we can use a Large Language Model (LLM) to evaluate whether a retrieved passage is truly helpful in answering the question—based not just on word overlap, but on meaning.

This method involves passing the retrieved chunk and the reference gold chunk into a structured prompt and letting the LLM act as a semantic judge.

Here’s a sample prompt:

```python
retrival_relevance_prompt = """
You are a semantic relevance evaluator. Your task is to assess whether a retrieved passage (context) is helpful in answering a user's question.

Respond with 1 if the passage includes important facts, concepts, or clues that
directly or indirectly support a correct answer.
Respond with 0 if the passage is unrelated, off-topic, or does not meaningfully
contribute to answering the question.

Return only the digit 1 or 0. Do not explain your answer.

Question:
{{question}}

Context:
{{retrieved_chunk}}

Answer:
{{response_or_reference}}

Is the passage relevant to the question and helpful for deriving the given answer? (1 = Yes, 0 = No)
"""
```

This approach offers more flexibility and depth than rigid rule-based checks, and works particularly well in cases where language is paraphrased, or when retrieved content is conceptually relevant but lexically different.
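
Below is a minimal sketch of how such a judge could be wired up. It assumes an OpenAI-compatible client and a placeholder model name, so adapt it to whatever LLM stack you actually use.

```python
# Minimal sketch, assuming the openai Python client (>= 1.x); adjust for your own provider.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_relevance(question: str, retrieved_chunk: str, answer: str) -> int:
    """Ask the LLM to return 1 (relevant) or 0 (not relevant) for a single retrieved chunk."""
    # Fill the prompt template defined above (retrival_relevance_prompt).
    prompt = (
        retrival_relevance_prompt
        .replace("{{question}}", question)
        .replace("{{retrieved_chunk}}", retrieved_chunk)
        .replace("{{response_or_reference}}", answer)
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip()
    return 1 if verdict.startswith("1") else 0
```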

If you look at the ragas implementation, the LLM-based relevance check is further subdivided into two approaches depending on the availability of the ground-truth answer:

  1. LLMContextPrecisionWithReference

This method assumes access to a gold answer (also called a reference). It uses the reference answer as the target anchor while evaluating each retrieved chunk for relevance.

  2. LLMContextPrecisionWithoutReference

This variant is designed for real-world applications where a reference answer may not be available. Instead, it uses the model’s own generated response (response) as the evaluation anchor.

| Metric | Reference Source | LLM Prompt | When to Use |
| --- | --- | --- | --- |
| LLMContextPrecisionWithReference | reference (gold answer) | “Was this chunk helpful for arriving at the correct answer?” | Offline test-sets where you already know the right answer for every question. |
| LLMContextPrecisionWithoutReference | response (model answer) | “Was this chunk helpful for arriving at the answer the model actually produced?” | Production / live traffic where you don’t have a curated answer but still want a quick signal about retrieval quality. |
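
For reference, here is a sketch of how the with-reference variant is typically invoked in ragas. It is based on the ragas 0.2-style API, so class and argument names may differ in your installed version.

```python
# Sketch based on the ragas 0.2-style API; verify names against your installed version.
import asyncio
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithReference
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))  # placeholder judge model
metric = LLMContextPrecisionWithReference(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Where did Einstein develop the theory of relativity?",
    reference="Einstein developed the theory of relativity in 1905 at the Swiss patent office in Bern.",
    retrieved_contexts=[
        "Einstein published the theory of relativity in a groundbreaking 1905 paper.",
        "During the early 1900s, Einstein worked at a government office evaluating patents.",
    ],
)

score = asyncio.run(metric.single_turn_ascore(sample))
print(score)
```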

Context Recall

In RAG systems, retrieving something is not enough; the retrieved chunks must collectively support the reference answer. This gives rise to a more nuanced metric, Context Recall, which operates at the level of answer claims or reference contexts.

There are two primary variants of Context Recall:

  1. LLM-based Context Recall
  2. Non-LLM-based Context Recall

Each variant addresses specific evaluation scenarios based on available resources, such as labeled data or reliance on semantic understanding.

1. LLM-based Context Recall

The LLM-based approach removes the need for manual annotation of reference chunks. Instead, it leverages the semantic reasoning capabilities of Large Language Models (LLMs) to determine whether the reference (or gold-standard) answer is adequately supported by the retrieved contexts.

How it works

  1. Break the reference (gold) answer into a set of small, independently verifiable claims.
  2. For each claim, ask the LLM whether it is supported by any of the retrieved contexts.
  3. Compute the fraction of claims that are supported:

     $$\text{Context Recall} = \frac{\text{Number of supported claims}}{\text{Total number of claims}}$$

Example

Consider the following reference answer:

"Einstein developed the theory of relativity in 1905 while working at the Swiss patent office in Bern."

This answer can be broken into four distinct factual claims:

| Claim ID | Claim |
| --- | --- |
| C1 | Einstein developed the theory of relativity |
| C2 | He developed it in 1905 |
| C3 | He was working at the Swiss patent office |
| C4 | The office was in Bern |

Assume the retriever returned the following contexts:

| Rank | Retrieved Context |
| --- | --- |
| 1 | “Einstein published the theory of relativity in a groundbreaking 1905 paper.” |
| 2 | “During the early 1900s, Einstein worked at a government office evaluating patents.” |
| 3 | “Einstein later moved to Berlin to join the Prussian Academy of Sciences.” |
| 4 | “He studied physics in Zurich before becoming one of the most influential scientists.” |

Now we match each claim against the retrieved contexts:

| Claim | Supporting Chunk(s) |
| --- | --- |
| C1: Einstein developed the theory of relativity | 1 |
| C2: He developed it in 1905 | 1 |
| C3: He was working at the Swiss patent office | 2 |
| C4: The office was in Bern | — |

In this scenario, 3 out of 4 claims are supported by retrieved contexts, yielding:

$$\text{Context Recall} = \frac{3}{4} = 0.75$$

This result indicates substantial semantic completeness, though not perfect coverage. Notably, the claim regarding the specific location "Bern" was not supported by any retrieved chunk, highlighting a retrieval gap that can be addressed for improved accuracy.

Now the question is how to extract the claims and how to decide whether each claim is supported by the retrieved contexts.

Prompt for extracting claims from the reference answer:

```python
claim_extraction_prompt = """
You are an information analyst. Your task is to extract the smallest possible factual claims from a given answer.

Each claim must:
- Be a self-contained statement
- Represent a fact that can be verified or falsified independently
- Avoid combining multiple ideas into a single sentence

Answer:
{{ reference }}

Return the claims as a numbered list, one claim per line.
Do **not** include any explanations, formatting, or extra text.
"""
```

This step converts complex answers into verifiable units, setting a clear basis for semantic evaluation.

Prompt for attributing claims to retrieved contexts:

```python
claim_context_attribution_prompt = """
You are a factual evaluator. Given a claim and a list of retrieved contexts,
decide whether the claim is directly or indirectly supported by any of the contexts.

Claim:
{{claim}}

Retrieved Contexts:
{{retrieved_chunks}}

Does any context support the claim? (1 = Yes, 0 = No)
"""
```

With these prompts, the LLM efficiently and consistently evaluates the semantic alignment between claims and contexts, automating and scaling the process.
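
Putting the two prompts together, an end-to-end LLM-based Context Recall calculation could look roughly like the sketch below. It assumes the two prompt strings above are in scope, and `call_llm` is a hypothetical stand-in for your LLM client (the judge_relevance sketch earlier shows one possible wiring).

```python
from typing import List

def call_llm(prompt: str) -> str:
    """Hypothetical helper: stand-in for your LLM client call (e.g., a chat completion)."""
    raise NotImplementedError("wire this up to your LLM provider")

def extract_claims(reference: str) -> List[str]:
    """Split the reference answer into atomic claims using claim_extraction_prompt."""
    raw = call_llm(claim_extraction_prompt.replace("{{ reference }}", reference))
    claims = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Strip leading numbering such as "1." if present
        claims.append(line.split(".", 1)[-1].strip() if line[0].isdigit() else line)
    return claims

def llm_context_recall(reference: str, retrieved_chunks: List[str]) -> float:
    """Fraction of reference claims supported by at least one retrieved context."""
    claims = extract_claims(reference)
    contexts = "\n".join(f"- {c}" for c in retrieved_chunks)
    supported = 0
    for claim in claims:
        prompt = (
            claim_context_attribution_prompt
            .replace("{{claim}}", claim)
            .replace("{{retrieved_chunks}}", contexts)
        )
        if call_llm(prompt).strip().startswith("1"):
            supported += 1
    return supported / len(claims) if claims else 0.0
```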

2. Non-LLM based Context Recall

In scenarios where manually annotated reference chunks (gold contexts) are available, a simpler yet stricter approach can be employed. This variant, Non-LLM-based Context Recall, directly matches retrieved contexts against known relevant reference contexts using similarity measures.

Inputs

  - Reference (gold) chunks $C = [c_1, c_2, ..., c_m]$, annotated as relevant for the question
  - Retrieved chunks $R$ returned by the retriever
  - A similarity function $\text{sim}(\cdot, \cdot)$ (e.g., a string similarity measure)
  - A threshold $\tau$

Algorithm

  1. Match each reference chunk

    For every reference chunk $c_j \in C$, compute its best match among the retrieved chunks:

    $$\text{max\_sim}(c_j) = \max_{r \in R} \text{sim}(c_j, r)$$

  2. Decide “recalled” or “missed”

    $$u_j = \begin{cases} 1 & \text{if } \text{max\_sim}(c_j) \ge \tau \quad (\text{hit}) \\ 0 & \text{otherwise} \quad (\text{miss}) \end{cases}$$

    The vector $u = [u_1, u_2, \dots, u_m]$ marks which gold chunks were recovered.

  3. Aggregate

    $$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{\sum_{j=1}^{m} u_j}{m}$$

    This approach is stricter and more precise but less flexible, as it relies on exact or near-exact matches rather than semantic interpretation.

This metric is particularly useful for benchmarking systems against standard datasets, where reference contexts have been rigorously annotated.
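
Here is a minimal sketch of this procedure, again using Python's difflib as the (assumed) similarity function and an illustrative threshold; the function names are placeholders.

```python
from difflib import SequenceMatcher
from typing import List

def string_sim(a: str, b: str) -> float:
    """Similarity in [0, 1]; any other measure (ROUGE, fuzzy ratio, embedding cosine) could be used."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def non_llm_context_recall(reference_chunks: List[str], retrieved_chunks: List[str], tau: float = 0.6) -> float:
    """Fraction of gold reference chunks recovered by the retriever (hit if best match >= tau)."""
    if not reference_chunks:
        return 0.0
    hits = 0
    for gold in reference_chunks:
        best = max((string_sim(gold, r) for r in retrieved_chunks), default=0.0)
        hits += 1 if best >= tau else 0
    return hits / len(reference_chunks)

print(non_llm_context_recall(
    reference_chunks=["Einstein developed relativity in 1905 at the Swiss patent office in Bern."],
    retrieved_chunks=["Einstein published the theory of relativity in a groundbreaking 1905 paper."],
))
```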

Conclusion

Evaluating the retrieval component of a RAG system independently offers invaluable insight, enabling developers and researchers to pinpoint specific strengths and weaknesses. By employing metrics such as Context Precision@K and Context Recall, you can gain a nuanced understanding of retrieval quality from both classical information retrieval and semantic perspectives. Understanding what's working (and what's not) empowers you to tune your retrieval strategies effectively; the better your retriever, the smarter and more reliable your AI answers become. Keep measuring, keep tweaking, and your RAG system will deliver consistently solid results.

Next Steps

RAG Evaluation Part 2: Generator Evaluation