Good morning everyone!
Today, we're discussing an important yet often overlooked aspect of building AI products: evaluation systems. This newsletter is inspired by Hamel Husain's excellent blog post "Your AI Product Needs Evals," which provides practical guidance on constructing domain-specific LLM evaluation systems. We'll break down the key concepts and add our own perspectives on implementing these ideas effectively.
In our practice, we've observed a concerning pattern: teams test their LLM features on a handful of "happy path" examples, or worse, rely on generic metrics like "faithfulness" or "relevancy" that don't translate to real user satisfaction. This approach is like testing a new car by only driving it around an empty parking lot—you're missing all the real-world scenarios that matter!
Why You Need Proper Evals
Think of LLM evaluation systems as your product's quality control department. Without proper evaluation, you're essentially driving blind. You won't know if your AI is helping most of your users, you can't identify where and why it fails, and you have no systematic way to improve. The key to successful AI products isn't just about having the latest models or clever prompts—it's about being able to iterate quickly and confidently against a clear metric. This requires a robust evaluation system. For every modification you make to the system, you need to know whether it is better or worse for most users!
The Three Levels of LLM Evaluation
Level 1: Unit Tests (The Basics)
These are your basic assertions that run quickly and cheaply. Think of them as your first line of defense. Unit tests might verify that responses don't contain sensitive information, follow a specific format, stay within length limits, or include the necessary information. They can run on every inference call, provide immediate feedback, and trigger a retry of the LLM call when an assertion fails. Each check can be a simple regex or a small language model given a straightforward instruction.
In Practice: For a document summarization AI feature, Level 1 unit tests would verify summary length limits, check for copy-pasted sections, and ensure required sections are present.
import re

# Check for sensitive information inside the summary (SSN, credit cards)
def test_no_pii(response):
    patterns = {
        'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
        'card': r'\b\d{16}\b'
    }
    for name, pattern in patterns.items():
        assert not re.search(pattern, response), f"Found {name} in response"

# Check if the summary meets length requirements
def test_summary_length(response, max_length):
    assert len(response) <= max_length, f"Summary exceeds maximum length of {max_length} characters"
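To illustrate the "retry on failure" idea, here is a minimal sketch of wrapping these assertions around an inference call. The generate_summary function and the retry limit are hypothetical placeholders for your own generation code.

# Minimal sketch: rerun the LLM call when a Level 1 assertion fails.
# generate_summary() is a hypothetical stand-in for your own inference code.
def generate_with_retries(document, max_length=1000, max_attempts=3):
    last_error = None
    for attempt in range(max_attempts):
        response = generate_summary(document)  # your LLM call goes here
        try:
            test_no_pii(response)
            test_summary_length(response, max_length)
            return response  # all Level 1 checks passed
        except AssertionError as error:
            last_error = error  # record the failure and retry with a fresh generation
    raise RuntimeError(f"All {max_attempts} attempts failed Level 1 checks: {last_error}")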
Level 2: Human & Model Evaluation (The Quality Check)
This level involves two complementary approaches to evaluate the actual quality of responses. Let's look at both:
Human Evaluation: Domain experts review outputs to assess quality and catch nuanced issues. While this provides the most accurate assessment, it's expensive and doesn't scale well. However, human evaluation is valuable for establishing ground truth and calibrating automated approaches.
LLM-as-Judge: We use another LLM to evaluate outputs at scale. Here, we tune a prompt for an AI assistant to replicate your expert's judgment. If the agreement between the model and the expert is high enough, this approach offers scalability and consistency.
In the following sections, we'll explore the setup of LLM-as-Judge, which allows you to automate evaluation while maintaining quality standards.
In Practice: For the same document summarization AI example, a Level 2 evaluation would assess factors like accuracy of key points, logical flow, and appropriate detail level. Domain experts would first evaluate a sample of summaries, providing detailed critiques. Then, an LLM judge would be prompted and calibrated to extend this evaluation to a much larger set of outputs.
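As a rough sketch of how you might measure that model-expert agreement, the snippet below compares the expert's pass/fail labels with the judge's on the same sample. The label lists are hypothetical; simple percent agreement, plus Cohen's kappa if you have scikit-learn available, is usually enough to decide whether the judge is calibrated well enough to scale up.

from sklearn.metrics import cohen_kappa_score  # optional: chance-corrected agreement

# Hypothetical labels on the same sample: True = pass, False = fail
expert_labels = [True, True, False, True, False, False, True, True]
judge_labels  = [True, True, False, False, False, True, True, True]

# Raw percent agreement between the expert and the LLM judge
matches = sum(e == j for e, j in zip(expert_labels, judge_labels))
print(f"Agreement: {matches / len(expert_labels):.0%}")  # 75% for this toy data
print(f"Cohen's kappa: {cohen_kappa_score(expert_labels, judge_labels):.2f}")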
Level 3: User Ratings (The Reality Check)
The final level (and most important!) involves testing with real users to measure the actual impact. Here, you compare different versions of your feature, collect metrics such as user engagement, rating feedback, and churn rate, and validate that improvements at the earlier evaluation levels translate to better user outcomes. This is where you truly understand if your AI feature is useful.
In Practice: At Level 3 for the document summarization AI feature, testing would measure real-world impact through user time saved, user preference ratings, comprehension rates, and task completion rates.
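As a rough sketch of the "compare different versions" idea, the snippet below contrasts thumbs-up rates for two hypothetical variants with a normal-approximation z-test; the counts are invented, and in practice you would lean on your analytics or experimentation platform.

import math

# Hypothetical A/B data: thumbs-up counts out of total ratings per variant
ups_a, total_a = 420, 1000   # current summarizer
ups_b, total_b = 465, 1000   # new prompt or model

rate_a, rate_b = ups_a / total_a, ups_b / total_b
pooled = (ups_a + ups_b) / (total_a + total_b)
z = (rate_b - rate_a) / math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))

print(f"Variant A: {rate_a:.1%}, Variant B: {rate_b:.1%}, z = {z:.2f}")
# |z| > 1.96 roughly corresponds to p < 0.05 under the normal approximation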
Setting Up Your LLM Judge
Let's start with a real-world example that illustrates a key principle in evaluation design.
A few years ago, we trained a generative model to create color palettes. We gathered a few thousand examples and hired an expert artist to evaluate them. Initially, we asked the artist to rate each palette on a scale of 1 to 5. This seemingly straightforward approach quickly revealed its flaws: What exactly distinguished a palette rated 2 from one rated 3? How could we define the difference between a 4 and a 5 regarding aesthetic appeal? The artist struggled with these arbitrary distinctions.
We then shifted to a simpler approach: binary pass/fail judgments with written critiques, informed by research from “Aligning with Human Judgement,” which highlights some important nuances in implementing this approach effectively. The artist could now definitively say "This works" or "This doesn't work" and, importantly, explain why. The critiques revealed nuanced reasoning like "The transitional colors create harmony" or "The contrast is too jarring for this palette's intended use." These insights were far more valuable than numeric ratings alone.
This experience taught us a fundamental lesson about setting up model evaluators: Binary decisions paired with detailed critiques provide clearer, more actionable evaluation data than complex rating scales.
Here's how to apply this lesson in practice:
Find Your Domain Expert: Identify the person whose judgment matters most—whether a subject matter expert, lead user, or product owner. Their judgment will serve as your ground truth.
Generate Test Data: Create a diverse dataset of input-output pairs from your current system.
💡Pro tip: Use an LLM to generate challenging inputs that test edge cases—scenarios you might not expect in typical usage but that could reveal important system behaviors.
Collect Expert Judgments: Have your expert evaluate a small set of outputs (around 20-30) with:
A binary pass/fail decision
A detailed critique explaining their reasoning
Calibrate Your LLM Judge: Use these expert judgments to create and refine your evaluator prompt.
Scale Up: Run your judge on a larger set of outputs.
Verify & Iterate: Have your expert review a sample of the judge's decisions and refine the prompt until you achieve satisfactory alignment.
The focus should be on identifying clear failures first—it's often more valuable to have high precision in catching bad outputs than to achieve perfect recall.
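To make the precision-versus-recall point concrete, here is a small sketch assuming you record each evaluated output as an (expert_says_bad, judge_says_bad) pair. The data is hypothetical; the goal is simply to check how trustworthy the judge's FAIL verdicts are before scaling it up.

# Hypothetical evaluation records: (expert_says_bad, judge_says_bad)
records = [
    (True, True), (True, True), (False, False), (True, False),
    (False, False), (False, True), (True, True), (False, False),
]

true_pos  = sum(expert and judge for expert, judge in records)        # judge flags a truly bad output
false_pos = sum((not expert) and judge for expert, judge in records)  # judge flags a good output
false_neg = sum(expert and (not judge) for expert, judge in records)  # judge misses a bad output

precision = true_pos / (true_pos + false_pos)  # how trustworthy the judge's FAIL verdicts are
recall    = true_pos / (true_pos + false_neg)  # how many bad outputs the judge actually catches
print(f"Precision on failures: {precision:.0%}, recall: {recall:.0%}")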
Embrace Criteria Drift: Recent research on "Who Validates the Validators?" revealed that experts often refine their evaluation criteria as they review more outputs. This "criteria drift" is natural and valuable—each critique helps articulate and evolve the standards for good outputs.
💡Example of criteria drift: In our color palette example, at first our expert was strictly judging color harmony. However, after seeing several palettes, they began noticing that darker, grey-heavy palettes looked poor in our specific application context. This led to a new, unexpected criterion: "avoid predominantly grey palettes." Such discoveries through evaluation are valuable - instead of trying to perfectly define criteria upfront, we let the expert's understanding evolve naturally as they grade more examples.
With these expert judgments in hand, you can now create an LLM judge prompt that incorporates both the binary decisions and the reasoning behind them. Here's a template:
You are evaluating [specific type of output].
Instructions for evaluation:
1. Carefully read and understand the response in context
2. Consider these key aspects:
[List 3-4 critical factors from expert critiques]
3. Explain your reasoning step-by-step
4. Conclude with a binary pass/fail decision
Examples of PASS decisions with explanations:
[Insert 2-3 expert-approved examples with their critiques]
Examples of FAIL decisions with explanations:
[Insert 2-3 expert-rejected examples with their critiques]
Response to evaluate:
[Insert response]
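Here is a minimal sketch of wiring that template into an automated judge, assuming the OpenAI Python SDK and a hypothetical JUDGE_PROMPT_TEMPLATE string built from the template above; swap in whatever client and model your stack actually uses.

# Minimal sketch of an LLM judge call, assuming the OpenAI Python SDK.
# JUDGE_PROMPT_TEMPLATE is a hypothetical string built from the template above,
# with a {response} placeholder for the output being evaluated.
from openai import OpenAI

client = OpenAI()

def judge_response(response_to_evaluate: str) -> dict:
    prompt = JUDGE_PROMPT_TEMPLATE.format(response=response_to_evaluate)
    completion = client.chat.completions.create(
        model="gpt-4o",  # use whichever model you calibrate against your expert
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # keep the judge deterministic across runs
    )
    verdict_text = completion.choices[0].message.content
    # The template asks for step-by-step reasoning followed by a pass/fail decision,
    # so a crude parse of the final line is enough for aggregation.
    passed = "PASS" in verdict_text.splitlines()[-1].upper()
    return {"passed": passed, "critique": verdict_text}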
Remember: Like training a new team member, start small and build trust gradually. Begin with basic pass/fail judgments and clear critiques on a manageable set of examples. Once your LLM judge demonstrates strong alignment with your expert (through agreement scores), you can scale to evaluate larger sets of outputs. This gives you the best of both worlds - the expert's judgment to establish quality standards and the LLM's ability to scale evaluation. Just remember to periodically check alignment with your expert to ensure standards are maintained.
Make sure to explore tools like AlignEval that can automatically optimize your LLM evaluator prompt. Once your LLM evaluator is in place, you can use tools such as Adalflow to automatically optimize other parts of your AI prompts.
💡Note: Adalflow provides model-agnostic building blocks for building LLM task pipelines. Learn more in the repo.
In short, LLM-as-judge is particularly powerful because it combines the best of both worlds: the scalability of automated testing with the nuanced understanding needed for quality evaluation.