Evaluating LLM Responses Using DeepEval
In the last two years, Large Language Model (LLM) agents have been implemented in a variety of applications that we use in our daily lives. In my opinion, one of the most important applications of LLMs is as tools for retrieving knowledge from databases or large documents that would otherwise take a significant amount of time to comb through.
For this purpose, Retrieval-Augmented Generation (RAG) has emerged as one of the most powerful techniques, leveraging the capabilities of LLMs while reducing their hallucinations. However, evaluating the correctness of the responses generated by LLMs poses its own challenges. In this article, I explore various methods and metrics for assessing LLM responses, which I encountered while implementing DeepEval into the chat agent provided by HanaLoop.
Understanding Retrieval-Augmented Generation (RAG)
RAG systems leverage external knowledge sources to inform and enhance the responses generated by LLMs. By retrieving relevant documents from a corpus and using them as context, RAG can produce more accurate, informative, and contextually aware outputs. However, this added complexity necessitates robust evaluation methods to ensure the effectiveness and reliability of the responses.
The embeddings
Although we aim to use external knowledge sources to enhance our responses, LLMs fundamentally operate on numerical data rather than raw text. So, what happens under the hood? When we input text (or receive a response), the model converts that text into numerical vectors, known as embeddings. These embeddings are the form in which LLMs process and understand text.
To determine which information stored in our database is relevant to a given question, we must first convert that information into embeddings. This is achieved by using OpenAI's embeddings endpoint, where we submit any text and receive its corresponding embedding, which is then stored in our database.
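As a rough sketch, this is what that step can look like with the openai Python client; the model name and the embed helper below are illustrative choices for this article, not necessarily what HanaLoop's pipeline uses.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    """Submit text to the embeddings endpoint and return its embedding vector."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model; substitute your own
        input=text,
    )
    return response.data[0].embedding

# Each document chunk is embedded once and stored alongside its original text.
chunk_text = "All customers are eligible for a 30-day full refund at no extra cost."
stored_chunks = [(chunk_text, embed(chunk_text))]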
When a user makes an inquiry, we convert this inquiry into an embedding and then calculate the distance between this embedding and those stored in our database to identify the most relevant information.
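One common choice for that distance is cosine similarity; I use it here as an example, and in practice a vector database performs this search at scale. Continuing the sketch above:
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Higher values mean the two embeddings point in more similar directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Embed the user's inquiry and pick the most similar stored chunk.
query_embedding = embed("What if these shoes don't fit?")
best_text, _ = max(
    stored_chunks,
    key=lambda chunk: cosine_similarity(query_embedding, chunk[1]),
)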
The text associated with the retrieved embeddings is then provided as context to our LLM, along with the user's inquiry. Using OpenAI's completion endpoint, the LLM generates a response that incorporates the relevant information we retrieved earlier.
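Continuing the same sketch, the retrieved text and the user's question are sent together to the completion endpoint; the prompt wording and model name here are illustrative.
# `client` and `best_text` come from the sketches above.
completion = client.chat.completions.create(
    model="gpt-4",  # assumed model
    messages=[
        {
            "role": "system",
            "content": f"Answer the user's question using only this context:\n{best_text}",
        },
        {"role": "user", "content": "What if these shoes don't fit?"},
    ],
)
print(completion.choices[0].message.content)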
Why choose DeepEval
There are several interesting frameworks for evaluating LLM responses. Among those considered for this project were Chainforge, Ragas, Autoevals, and DeepEval; we could also have implemented our own evaluation framework.
For this project, we weighed ease of implementation, the range of metrics offered, the size of the community using the framework, and last (but perhaps most importantly) whether the documentation was easily accessible and comprehensive.
Although Ragas seems to be one of the most popular frameworks, DeepEval not only allows us to evaluate using the native Ragas metrics but also provides the additional capability of adding reasoning to the returned score, which is not available in Ragas.
For me, the deciding factor was the comprehensiveness of DeepEval's documentation, which provided step-by-step explanations of how to use it, as well as the methods' arguments.
Key Metrics for Evaluating LLM Responses
1. G-Eval
G-Eval uses chain-of-thought (CoT) reasoning to evaluate LLM outputs against any custom criteria. Essentially, we provide the input and the actual output that the LLM produced, and evaluate them using a series of steps that we can define ourselves.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    # NOTE: you can only provide either criteria or evaluation_steps, not both.
    # Here we use evaluation_steps; the equivalent criteria is kept as a comment:
    # criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK",
    ],
    # EXPECTED_OUTPUT is included so the steps above can reference it
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
How It Works
G-Eval first uses chain-of-thought (CoT) reasoning to generate evaluation steps from the provided criteria (if you pass evaluation_steps explicitly, as in the example above, this generation step is skipped), and then uses those steps to score the supplied evaluation parameters on a scale from 0 to 1.
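To make this concrete, here is how the correctness metric defined above could be run against a small test case; the input and outputs are made up for illustration.
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="When did Apollo 11 land on the Moon?",
    actual_output="Apollo 11 landed on the Moon in 1969.",
    expected_output="Apollo 11 landed on the Moon on July 20, 1969.",
)

correctness_metric.measure(test_case)
print(correctness_metric.score)   # a value between 0 and 1
print(correctness_metric.reason)  # the generated explanation for that score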
2. Answer Relevancy
This metric measures the relevancy of the output the LLM generated against the input. In other words: how relevant is the generated output to what the user asked?
Example
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# Or evaluate test cases in bulk
evaluate([test_case], [metric])
Why This Is Useful
For a RAG pipeline, this metric does not use the retrieved context itself, so it does not evaluate the retrieval or document-embedding steps. It is still useful for recognizing problematic responses at a higher level, and it has the benefit of being transparent: it asks the same question a human reviewer would ask, namely, does this seem relevant to what I asked?
How It Works
The AnswerRelevancyMetric score is calculated according to the following equation:
Answer Relevancy = Number of Relevant Statements / Total Number of Statements
The metric uses the LLM to extract all statements made in the provided output and then classifies whether each of the statements is relevant or not.
Example:
- Question: What is Paris?
- Answer: Paris is a city in France. Paris is the capital of France and has an area of 105.4 km².
- Statements:
- Paris is a city in France.
- Paris is the capital of France.
- Paris has an area of 105.4 km².
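In this example, all three extracted statements are relevant to the question, so the score would be 3/3 = 1.0; if the answer had also contained an unrelated statement, the score would drop to 3/4.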
3. Faithfulness
This metric is similar to the one above, but instead measures the faithfulness of the LLM's response to the provided context. It aims to answer the question: how many of the claims our LLM makes in its response are actually faithful to the context we provided?
Example
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30-day full refund at no extra cost."]

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# Or evaluate test cases in bulk
evaluate([test_case], [metric])
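Broadly speaking, this metric follows the same pattern as answer relevancy: the claims made in the actual output are extracted and checked against the retrieval context, and the score reflects the fraction of claims that the context supports.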