RAG Evaluation Part 2: Generator Evaluation
Learn to measure the generator's quality in RAG workflows using faithfulness, relevancy, and noise metrics.

Retrieval-Augmented Generation (RAG) combines information retrieval with generative models to produce accurate, factually-grounded answers. While retrieval quality ensures relevant context is surfaced, it is equally critical to evaluate the generator component—how effectively the model uses retrieved context to produce accurate, faithful, and relevant answers.
Why is Generator Evaluation Needed?
Evaluating the generator in RAG is crucial because it directly impacts user trust and practical utility:
- Faithfulness: Does the generated answer correctly reflect information present in the retrieved context?
- Answer Relevance: Does the response adequately address the user's query?
- Noise Sensitivity: How robust is the generation process to irrelevant or misleading retrieved contexts?
Neglecting these dimensions leads to inaccurate, misleading responses, eroding user trust and degrading the overall experience.
Key Components of Generator Evaluation
Faithfulness
Faithfulness measures how accurately the generated response reflects information available in the retrieved contexts. It verifies alignment between the generated text and the retrieved information, guarding against hallucinations.
A response is considered faithful if all its claims can be supported by the retrieved context.
Example
Consider the following response from the generator:
Einstein developed the theory of relativity in 1905 while working at the Swiss patent office in Bern.
This answer can be broken into four distinct factual claims:
Claim ID | Claim |
---|---|
C1 | Einstein developed the theory of relativity |
C2 | He developed it in 1905 |
C3 | He was working at the Swiss patent office |
C4 | The office was in Bern |
Assume the retriever returned the following contexts:
Rank | Retrieved Context |
---|---|
1 | "Einstein published the theory of relativity in a groundbreaking 1905 paper." |
2 | "During the early 1900s, Einstein worked at a government office evaluating patents." |
3 | "Einstein later moved to Berlin to join the Prussian Academy of Sciences." |
4 | "He studied physics in Zurich before becoming one of the most influential scientists." |
Now we match each claim against the retrieved contexts:
Claim | Supporting Chunk(s) |
---|---|
C1: Einstein developed the theory of relativity | 1 |
C2: He developed it in 1905 | 1 |
C3: He was working at the Swiss patent office | 2 |
C4: The office was in Bern | — |
In this scenario, 3 out of 4 claims are supported by the retrieved contexts, yielding:
Faithfulness = supported claims / total claims = 3 / 4 = 0.75
Then what is the difference between Faithfulness and Context Recall?
In short, Faithfulness is calculated on the generated response (its claims are checked against the retrieved context), while Context Recall assumes a gold reference answer is available and checks the quality of the retrieved context against it.
Context Recall checks for coverage: Are all factual claims required to answer the question present somewhere in the retrieved chunks? On the other hand, Faithfulness checks for groundedness: Are all factual claims in the generated answer supported—or at least not contradicted—by the retrieved chunks?
You can use the same prompt that we used in the Context Recall section (part 1 of this blog series) for both claim extraction and claim-to-context attribution.
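To make the scoring step concrete, here is a minimal sketch of the faithfulness computation. The claim extraction and claim-vs-context checks are assumed to be LLM-backed helpers; `extract_claims` and `is_supported` are hypothetical placeholders, not part of any particular library, and only the arithmetic is shown.
```python
from typing import Callable, List


def faithfulness_score(
    response: str,
    retrieved_contexts: List[str],
    extract_claims: Callable[[str], List[str]],      # hypothetical LLM-backed claim extractor
    is_supported: Callable[[str, List[str]], bool],  # hypothetical LLM-backed claim-vs-context judge
) -> float:
    """Faithfulness = (claims supported by the retrieved contexts) / (total claims)."""
    claims = extract_claims(response)
    if not claims:
        return 0.0
    supported = sum(is_supported(claim, retrieved_contexts) for claim in claims)
    return supported / len(claims)


# Einstein example above: 3 of the 4 extracted claims are supported -> 3 / 4 = 0.75
```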
Response Relevancy
Response relevancy measures whether the generated response actually addresses the user's query, independent of factual correctness. It evaluates semantic alignment between the question intent and the generated response.
Algorithm
1. Generate N questions from the answer
The first step is to reverse-engineer the answer to create a set of questions that could be answered by the response itself. These artificial questions are meant to reflect the topics and details covered in the generated answer. Typically, an LLM is used to produce these questions.
Sample prompt to generate artificial questions:
You are a question generation expert. Given a response, generate {{ N }} artificial questions that can be answered directly by this response. Response: {{ response }} Return the questions as a list of strings.
2. Embed
- E_q – the embedding of the original user question.
- E_gi – the embedding of each generated question g_i.
3. Cosine similarity
sim_i = cos(E_q, E_gi)
4. Average
Response Relevancy = (1/N) × Σ sim_i
The score usually falls in [0, 1] but can range over [-1, 1], since cosine similarity can be negative.
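To make the four steps concrete, here is a minimal sketch. The `embed` function stands in for whatever embedding model your stack provides, and the generated questions are assumed to have already come back from the question-generation prompt; both are placeholders, not a specific library API.
```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def response_relevancy(user_question: str, generated_questions: list[str], embed) -> float:
    """Mean cosine similarity between the user question and each reverse-engineered question."""
    e_q = np.asarray(embed(user_question))                        # E_q
    sims = [cosine_similarity(e_q, np.asarray(embed(g)))          # sim_i = cos(E_q, E_gi)
            for g in generated_questions]
    return float(np.mean(sims))
```
In the Apollo 11 example that follows, the three similarities 0.93, 0.91 and 0.88 average to roughly 0.91.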
Example
Component | Value |
---|---|
User question | “When did Apollo 11 land on the Moon?” |
Answer | “Apollo 11 landed on 20 July 1969.” |
Generated questions | e.g. “When did Apollo 11 land on the Moon?”, “On what date did Apollo 11 land?”, “When did the Apollo 11 Moon landing take place?” |
Cosine sims | 0.93, 0.91, 0.88 |
Score | (0.93 + 0.91 + 0.88) / 3 ≈ 0.91 |
Because the answer is perfectly on-topic, the metric returns a high relevancy score (≈ 0.91).
But consider a scenario where the model’s response is a generic “I don’t know.” It might still appear topic-aligned, because the question-generation step often simply echoes the user’s question back almost word for word: with no additional information for the LLM to work with, the only real hint of context is the user’s original question, so the model essentially repeats or rephrases it.
In this situation, the cosine similarity between the user’s original question and the newly generated questions can be very high (close to 1).
This artificially boosts the relevancy score, making it seem like the answer is excellent—even though, in reality, the response doesn’t provide any useful information. So, while the metric’s algorithm sees perfect alignment in the wording, it’s actually being tricked by the repetition rather than reflecting true, helpful content.
Example
Component | Value |
---|---|
User question | “Who wrote War and Peace?” |
Answer | “I don’t know.” |
Generated questions | near-verbatim echoes of the user question, e.g. “Who wrote War and Peace?”, “Who is the author of War and Peace?”, “Which author wrote War and Peace?” |
Cosines | 0.96, 0.95, 0.97 |
Score | (0.96 + 0.95 + 0.97) / 3 ≈ 0.96 (near-perfect!) |
Take-away: Matching topics alone isn’t enough to tell whether the response truly answers the question or if it’s simply a polite way of avoiding it.
Enhanced Algorithm
1. Generate N questions from the answer
As before, we reverse-engineer the answer to create a set of questions that could be answered by the response itself. These artificial questions should reflect the topics and details present in the response.
Additionally, we introduce a non-committal flag:
- If the generated response is a generic fallback (e.g., “I don’t know,” “No idea,” “I’m not sure”), we skip or modify the scoring step to avoid falsely boosting the relevance metric.
- This ensures that when the response contains no useful information, it doesn’t artificially inflate the relevancy score.
Enhanced sample prompt to generate artificial questions:
You are a question generation expert. Given an **answer**, create a question that the answer would directly respond to. Also, identify if the answer is vague or evasive.
**Task**
1. Read the provided **Answer**.
2. Generate up to {{ N }} clear questions that the answer would directly address.
3. For each question, decide if the answer is non-committal.
**Non-committal definition**
Label as 1 if the answer is vague, hedging, or expresses uncertainty (e.g. “I’m not sure…”, “I don’t know”, “It depends”). Otherwise label as 0.
**Output**
Return a JSON list of objects. Each object must have:
- `question`: The generated question
- `noncommittal`: 1 or 0
Example output for 2 questions:
```json
[
  {"question": "Where was Albert Einstein born?", "noncommittal": 0},
  {"question": "What was his major scientific contribution?", "noncommittal": 0}
]
```
2. Embed
- E_q – the embedding of the original user question.
- E_gi – the embedding of each generated question g_i.
3. Cosine similarity
sim_i = cos(E_q, E_gi)
4. Non-committal gate
If any generated item has noncommittal = 1, set gate = 0; otherwise gate = 1.
5. Final score
Response Relevancy = gate × (1/N) × Σ sim_i
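Continuing the earlier sketch, the only change is parsing the enhanced prompt's JSON output and applying the non-committal gate; `embed` remains a placeholder for your embedding model, and the item structure simply mirrors the example output above.
```python
import numpy as np


def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def gated_response_relevancy(user_question: str, generated_items: list[dict], embed) -> float:
    """generated_items: parsed JSON from the enhanced prompt, e.g.
    [{"question": "Where was Albert Einstein born?", "noncommittal": 0}, ...]"""
    e_q = np.asarray(embed(user_question))
    sims = [_cosine(e_q, np.asarray(embed(item["question"]))) for item in generated_items]
    # Any non-committal flag zeroes the score; otherwise the mean cosine passes through.
    gate = 0.0 if any(item.get("noncommittal") == 1 for item in generated_items) else 1.0
    return gate * float(np.mean(sims))
```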
Examples
Component | Value |
---|---|
User question | “What is the length of the Nile River?” |
Answer | “The Nile is about 6,650 km long, though I’d need to verify the exact figure.” |
Generated questions & flags | e.g. “How long is the Nile River?”, “What is the length of the Nile in kilometres?”, “Approximately how long is the Nile?”, each flagged noncommittal = 1 because of the hedge |
Cosines | 0.92, 0.90, 0.89 → mean = 0.903 |
Gate | at least one flag = 1 → gate = 0 |
Final score | 0.903 × 0 = 0 |
Even though the average cosine is high, the single hedge zeros the score—accurately signalling an answer that still leaves the user uncertain.
Component | Value |
---|---|
User question | “Who discovered penicillin?” |
Answer | “Alexander Fleming discovered penicillin in 1928.” |
Generated questions & flags | e.g. “Who discovered penicillin?” (noncommittal = 0), “When was penicillin discovered?” (noncommittal = 0), “What did Alexander Fleming discover in 1928?” (noncommittal = 0) |
Cosines | 0.93, 0.91, 0.92 → mean = 0.92 |
Gate | no flags = 1 → gate = 1 |
Final score | 0.92 × 1 = 0.92 |
Because every generated question judges the answer as fully committed (no hedging), the gate lets the average cosine flow through unchanged, yielding a high relevancy score.
Noise Sensitivity
Noise sensitivity evaluates how robust the generated response is when irrelevant or misleading contexts are retrieved. Ideally, even with some noisy contexts, a robust generative model selectively focuses on relevant information and maintains accuracy and faithfulness.
This metric ranges from 0 (perfect accuracy, no hallucinations) to 1 (every claim is an error), revealing how prone a system is to introducing errors, or “noise,” in real-world retrieval-augmented generation (RAG) workflows.
A lower noise sensitivity indicates that the system rarely adds spurious details. Conversely, a higher score suggests that the system often includes hallucinated claims that are not grounded in the reference (ground truth) answer, even if they seem plausible.
Example
User question
When was the Eiffel Tower built, and what is it known for?
Ground truth
The Eiffel Tower was built between 1887 and 1889 for the 1889 Exposition Universelle in Paris. It is known for its architectural innovation and as a symbol of France.
Relevant retrieved contexts
Rank | Retrieved Context |
---|---|
1 | "Construction of the Eiffel Tower began in 1887 and was completed in 1889 for the Exposition Universelle in Paris." |
2 | "The Eiffel Tower is an iconic symbol of France and a marvel of architectural innovation." |
3 | "The Eiffel Tower stands 330 meters tall and attracts millions of tourists annually." |
Irrelevant retrieved context
Rank | Retrieved Context |
---|---|
4 | "The Louvre Museum in Paris is home to the famous painting, the Mona Lisa, and numerous other art treasures." |
Generated answer
The Eiffel Tower was constructed between 1887 and 1889 for the 1889 Exposition Universelle in Paris. It is renowned for its architectural innovation and is one of the most visited monuments in the world. Interestingly, it was designed by Gustave Eiffel.
Extract claims from the generated response
The generated answer contains four factual claims:
Claim ID | Claim |
---|---|
C1 | The Eiffel Tower was constructed between 1887 and 1889 for the 1889 Exposition Universelle in Paris. |
C2 | It is renowned for its architectural innovation. |
C3 | It is one of the most visited monuments in the world. |
C4 | It was designed by Gustave Eiffel. |
You can use the same prompt that we used in the Context Recall section (part 1 of this blog series) for claim extraction.
Determine if each claim is supported by the ground truth
Claim | Is Supported by Ground Truth? |
---|---|
C1 | Yes |
C2 | Yes |
C3 | No (not mentioned in ground truth) |
C4 | No (not mentioned in ground truth) |
Total incorrect claims: 2
Total claims: 4
Sample prompt to check whether a given claim is supported by the ground truth:
"""
You are an unbiased fact-checker assistant.
Inputs:
- question: {{ question }}
- reference: {{ reference }}
- claims: {{ claims }}
Your Task:
You are provided with a reference (gold truth) answer and a list of claims. You need to check
whether each claim is supported by the reference answer. For each claim, produce a
JSON object on a single line:
{
"claim": "<claim text>",
"is_correct": <true|false> # Is it supported by the reference?
}
Rules:
- Only use the reference (ground truth) to check claims.
- Do NOT add any external knowledge.
- Output must be strict JSON lines (one per claim), with no extra text.
"""
Compute Noise Sensitivity
Noise Sensitivity = incorrect claims / total claims = 2 / 4 = 0.5
This result means that half of the generated claims may be hallucinated—they do not appear in the reference answer.
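As a minimal sketch, assuming the fact-checker prompt above returns its verdicts as JSON lines (one object per claim with `claim` and `is_correct` fields, as specified), the final score is just the fraction of unsupported claims; the LLM call itself is left out.
```python
import json


def noise_sensitivity(fact_checker_output: str) -> float:
    """Noise sensitivity = (claims not supported by the ground truth) / (total claims).

    fact_checker_output: raw JSON-lines text from the fact-checking prompt,
    e.g. one line per claim: {"claim": "...", "is_correct": true}
    """
    verdicts = [json.loads(line) for line in fact_checker_output.splitlines() if line.strip()]
    if not verdicts:
        return 0.0
    incorrect = sum(1 for v in verdicts if not v["is_correct"])
    return incorrect / len(verdicts)


# Eiffel Tower example: 2 of the 4 claims are unsupported -> 2 / 4 = 0.5
```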
Conclusion
Evaluating the generator in Retrieval-Augmented Generation (RAG) is as crucial as assessing retrieval quality itself. Metrics like Faithfulness, Response Relevancy, and Noise Sensitivity help ensure that generated answers are not only accurate and grounded but also truly helpful to the user. By systematically verifying if responses are supported by retrieved contexts and if they directly address the question while minimizing hallucinations, we can build more trustworthy and reliable RAG systems.