Evaluating LLM Responses Using DeepEval

In the last two years, Large Language Model (LLM) agents have been implemented in a variety of applications that we use in our daily lives. In my opinion, one of the most important applications of LLMs is their use as knowledge retrieval tools from databases or large documents, which would otherwise take a significant amount of time to comb through.

For this purpose, Retrieval-Augmented Generation (RAG) has emerged as one of the most powerful techniques, leveraging the capabilities of LLMs while reducing their hallucinations. However, evaluating the correctness of the responses generated by LLMs poses its own challenges. In this article, I explore various methods and metrics for assessing LLM responses, which I encountered while integrating DeepEval into the chat agent provided by HanaLoop.

Understanding Retrieval-Augmented Generation (RAG)

RAG systems leverage external knowledge sources to inform and enhance the responses generated by LLMs. By retrieving relevant documents from a corpus and using them as context, RAG can produce more accurate, informative, and contextually aware outputs. However, this added complexity necessitates robust evaluation methods to ensure the effectiveness and reliability of the responses.

The embeddings

Although we aim to use external knowledge sources to enhance our responses, LLMs (Large Language Models) fundamentally operate on numerical data rather than raw text. So, what happens under the hood? When we input text (or receive a response), the model converts the text, token by token, into numerical vectors known as embeddings. These embeddings are the form in which LLMs process and understand text.

To determine which information stored in our database is relevant to a given question, we must first convert that information into embeddings. This is achieved by using OpenAI's embeddings endpoint, where we submit any text and receive its corresponding embedding, which is then stored in our database.
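As a rough sketch of this step (assuming the official openai Python client and the text-embedding-3-small model, neither of which is specified above), creating an embedding for a document chunk could look like this:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical document chunk we want to make searchable
chunk = "All customers are eligible for a 30-day full refund at no extra cost."

# Ask the embeddings endpoint for the vector representation of the chunk
response = client.embeddings.create(model="text-embedding-3-small", input=chunk)
embedding = response.data[0].embedding  # a plain list of floats

# In a real pipeline, this vector would now be stored in the database
# alongside the original text (e.g. in a vector store).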

When a user makes an inquiry, we convert this inquiry into an embedding and then calculate the distance between this embedding and those stored in our database to identify the most relevant information.
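A minimal sketch of this lookup, assuming the stored embeddings are already loaded in memory as (text, vector) pairs (real systems typically delegate this search to a vector database):

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_relevant(query_embedding, stored, top_k=3):
    """Return the top_k stored (text, vector) pairs closest to the query embedding."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return ranked[:top_k]

# stored = [("Paris is the capital of France.", [...]), ...]
# context_chunks = [text for text, _ in most_relevant(query_embedding, stored)]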

The retrieved information is then converted back into text and provided as context to our LLM, along with the user's inquiry. Using OpenAI's completion endpoint, the LLM generates a response that incorporates the relevant information we retrieved earlier.
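Continuing the sketch, the generation step could look like the following (the prompt wording and the gpt-4o-mini model choice are illustrative assumptions, not the assistant's actual configuration):

from openai import OpenAI

client = OpenAI()

def answer_with_context(question, context_chunks):
    """Ask the model to answer the question using the retrieved context."""
    context = "\n".join(context_chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# answer = answer_with_context("What if these shoes don't fit?", context_chunks)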

Why choose DeepEval

There are several interesting frameworks that allow us to evaluate LLM responses. Among those considered for this project were Chainforge, Ragas, Autoevals, and DeepEval. Furthermore, we could also implement our own framework to evaluate LLM responses.

For this project, we considered ease of implementation, the options provided in terms of metrics, the size of the community using the framework, and last (but perhaps most importantly) whether its documentation was easily accessible and comprehensive.

Although Ragas seems to be one of the most popular frameworks, DeepEval not only allows us to evaluate using the native Ragas metrics but also provides the additional capability of adding reasoning to the returned score, which is not available in Ragas.

For me, the deciding factor was the comprehensiveness of DeepEval's documentation, which provided step-by-step explanations of how to use it, as well as the methods' arguments.

Key Metrics for Evaluating LLM Responses

1. G-Eval

G-Eval uses chain-of-thought (CoT) reasoning to evaluate LLM outputs based on any custom criteria. Essentially, we can provide the input and the actual output that the LLM created, and evaluate it using multiple steps that we can define.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    # NOTE: you can only provide either criteria or evaluation_steps, not both
    # criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    # The steps reference the expected output, so it must be passed as a parameter as well
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
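
To actually score something with this metric, we can build a test case and measure it, following the same pattern as the examples below; the input and outputs here are placeholders:

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    expected_output="You are eligible for a 30-day full refund at no extra cost."
)

correctness_metric.measure(test_case)
print(correctness_metric.score)   # a value between 0 and 1
print(correctness_metric.reason)  # the LLM-generated explanation for the score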

How It Works

G-Eval generates the evaluation steps using chain-of-thought (CoT) reasoning based on the provided criteria (or uses the explicit evaluation steps if they are given), and then applies these steps to the provided parameters to produce a final score ranging from 0 to 1.

2. Answer Relevancy

This metric measures the relevancy of the output that the LLM generated against the input. In principle, how relevant is the generated output to what the user requested?

Example

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# Or evaluate test cases in bulk
evaluate([test_case], [metric])

Why This Is Useful

Although this metric does not use the retrieval context itself, and therefore does not evaluate the steps involved in retrieving the relevant vectors or embedding the documents, it is useful for flagging problematic responses at a higher level. Its benefit is transparency: it mirrors the question a human reviewer would ask, namely whether the response seems relevant to what was asked.

How It Works

The AnswerRelevancyMetric score is calculated according to the following equation:

Answer Relevancy = Number of Relevant Statements / Total Number of Statements

The metric uses the LLM to extract all statements made in the provided output and then classifies whether each of the statements is relevant or not.

Example:

  • Question: What is Paris?
  • Answer: Paris is a city in France. Paris is the capital of France and has an area of 105.4 km².
  • Statements:
    • Paris is a city in France.
    • Paris is the capital of France.
    • Paris has an area of 105.4 km².
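In this example, all three statements would likely be classified as relevant to the question, giving a score of 3/3 = 1.0.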

3. Faithfulness

This metric is similar to the one above but instead calculates the faithfulness of the LLM response to the provided context. This aims to answer the question: How many of the claims our LLM makes in its response are actually faithful to the context we provided?

Example

from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30-day full refund at no extra cost."]

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# Or evaluate test cases in bulk
evaluate([test_case], [metric])

Why This Is Useful

Using this metric, we can check both how good the context we provide to the LLM is and how well the model uses it. Problems surfaced here may stem from the limited context window of the model, from issues with our document extraction and embedding, or from other steps in the pipeline.

How It Is Calculated

Faithfulness = Number of Truthful Statements / Total Number of Statements

Example:

Using our previous example, we can add a context that provides the LLM with the relevant information and see how the process changes:

  • Question: What is Paris?
  • Context: Paris is the capital and largest city of France. With an official estimated population of 2,102,650 residents in January 2023 in an area of more than 105 km² (41 sq mi).
  • Answer: Paris is a city in France. Paris is the capital of France and has an area of 105.4 km².
  • Statements:
    1. Paris is a city in France.
      • Evaluation: Truthful to the context, though it omits part of the information, namely that the city is also the capital.
    2. Paris is the capital of France.
      • Evaluation: Truthful.
    3. Paris has an area of 105.4 km².
      • Evaluation: Only partially supported: the context states an area of "more than 105 km²", so the precise figure of 105.4 km² is not fully backed by the context, and this slight discrepancy leads to a penalty.
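
Under the formula above, the score in this example would be roughly 2/3 ≈ 0.67 if the third statement is counted as unfaithful, or higher if the evaluator treats the small discrepancy leniently.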

4. Contextual Relevancy

This metric measures whether the information passed in the context is relevant to the input provided. It aims to detect whether we are passing superfluous or unnecessary information to our LLM, potentially limiting its ability to use the context properly.

This metric requires input, actual_output, and the retrieval_context as parameters.

Example

from deepeval import evaluate
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30-day full refund at no extra cost."]

metric = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# Or evaluate test cases in bulk
evaluate([test_case], [metric])

Why This Is Useful

This metric allows us to detect issues specifically in our context. The cause might be the way the context is provided to the LLM, the prompt we are using, or simply that the context stored in the database is not sufficient for this question. The retrieval itself might also be flawed, failing to fetch the most relevant pieces of information.

How It Is Calculated

Similar to the metrics above, the LLM breaks the retrieval context into statements and classifies each one as relevant or irrelevant to the input.

Contextual Relevancy = Number of Relevant Statements / Total Number of Statements
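For instance, in the code example above, the retrieval context contains a single statement about the 30-day refund policy, which is relevant to the question about ill-fitting shoes, so the score would be 1/1 = 1.0.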

Other Methods

Although the methods above seem to be the most appropriate for evaluating RAG pipelines, I will briefly touch upon some other methods that were not used in the implementation of our AI assistant. The reason is that we did not have the expected responses, which all of these methods require as input. This could be addressed by having a human evaluator write the expected responses and pass them to the model, but that raises the question: why isn't the human directly evaluating the response themselves?

1. Contextual Recall

This method measures the quality of a RAG pipeline by evaluating the extent to which the context aligns with the expected output. This metric essentially tests your retriever (workflow that retrieves the context) to understand to what degree you are fetching the relevant information for any given question asked.

Example

from deepeval import evaluate
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the expected output from your RAG generator
expected_output = "You are eligible for a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30-day full refund at no extra cost."]

metric = ContextualRecallMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# Or evaluate test cases in bulk
evaluate([test_case], [metric])

How It Is Calculated

Contextual Recall = Number of Attributable Statements / Total Number of Statements

The LLM extracts all statements in the expected_output and then classifies whether each statement is attributable to the retrieval_context. Essentially, what percentage of the expected response (the response one would consider optimal) is covered by the information retrieved from the database? A low score means the final answer is unlikely to be good, since the retrieval context does not cover the necessary information.
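
For instance, in the code example above, the expected_output contains a single claim (a 30-day full refund at no extra cost), and that claim is attributable to the retrieval context, so the recall score would be 1/1 = 1.0.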

2. Contextual Precision

The Contextual Precision metric evaluates the RAG pipeline's retriever by checking whether the nodes in your retrieval context are ranked correctly, from most relevant to least. Ideally, a retrieval context contains the n most relevant chunks for a given question, ordered by relevance; how well they are ranked is another measure of the accuracy of your retrieval pipeline.

Example

from deepeval import evaluate
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the expected output from your RAG generator
expected_output = "You are eligible for a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30-day full refund at no extra cost."]

metric = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# Or evaluate test cases in bulk
evaluate([test_case], [metric])

How It Is Calculated

\text{Contextual Precision} = \frac{1}{\text{Number of Relevant Nodes}} \sum_{k=1}^{n} \left( \frac{\text{Number of Relevant Nodes Up to Position } k}{k} \right) \times r_k

Contextual Precision is calculated as a weighted cumulative precision. For each position 'k' in the retrieval context, the function counts the relevant nodes up to that position and divides by 'k' (if we are at position 5 but have so far seen only 3 relevant nodes, that division is 3/5), multiplies the result by 1 if the node at position 'k' is relevant or 0 if it is not, and finally averages these weighted values over the total number of relevant nodes. The more relevant nodes are ranked near the top, the higher the score.
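
For example, if the retriever returns three nodes and only the first and third are relevant (r = [1, 0, 1]), the score would be (1/2) × [(1/1) × 1 + (1/2) × 0 + (2/3) × 1] = 5/6 ≈ 0.83.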

The rationale behind this metric is that an LLM's ability to make use of context deteriorates as the context grows larger, so the most important information should be provided first; the ranking therefore matters. If more important context is ranked lower, it can lead to hallucinations and incorrect responses.

Conclusion

These are the evaluation methods we chose for our implementation of LLM response evaluation using DeepEval, but the list is not exhaustive. In most cases, the metrics offered by other libraries are similar; DeepEval stands out mainly for the quality of its documentation.

Sources

  • https://aws.amazon.com/what-is/retrieval-augmented-generation/
  • https://docs.confident-ai.com/
  • https://platform.openai.com/docs/guides/embeddings
