📚️ LLM - Evals
Thorough and methodical evaluation is the cornerstone of building reliable systems with large language models (LLMs). In the LLM community, these evaluations are referred to as "evals."
Evals...
- Validate Performance - confirm that your system behaves as intended
- Enable Rapid Iteration - quick feedback loops tell you when something is not working
You can talk about evals in any ML model context; here we focus on LLMs.
Why Evals Matter
Evals function similarly to unit and integration tests in traditional software. Without systematic evals, it becomes difficult to safely evolve AI applications.
Modern LLM solutions are often more than just simple prompt-response systems. They usually consist of complex workflows with multiple stages. Evals ensure that each part of this workflow functions correctly.
Agentic Workflows
When an LLM triggers external actions (e.g., calling APIs or controlling hardware), we call it an agentic workflow. These agentic systems often rely on "tool use" to interact with external resources.
Because model responses can vary significantly with even minor changes, automated checks are essential to ensure that you don't break functionality.
Unlike deterministic software, AI applications are probabilistic and behave like black boxes. Evals help manage that uncertainty.
The end goal is to implement automated, quantifiable scoring metrics rather than relying on manual QA or subjective "vibes." Human feedback remains valuable, but automating as much as possible allows for rapid, reliable iteration.
Eval Categories
```mermaid
block-beta
  columns 1
  block
    columns 3
    a["1. Unit Tests"] b["2. Integration Tests"] c["3. Scoped Tests"]
  end
  block
    columns 2
    d["4. Human Evaluation"] e["5. LLM-as-a-Judge Evaluation"]
  end
  block
    columns 1
    f["6. Benchmarks"]
  end
```
1. Unit Tests
Unit tests are deterministic evals: straightforward pass–fail assertions against measurable criteria.
- "Does the output contain these expected keywords?"
- "Does the output contain this forbidden content like UUIDs?"
- "Is the generated article between 300 and 3,000 words?"
These tests are fully automated and very precise.
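A minimal sketch of such deterministic checks, written as plain pytest-style assertions. The keyword list and word-count limits are placeholders for illustration, not prescribed values:

```python
import re

def check_output(text: str) -> None:
    # Placeholder criteria for illustration; replace with your own.
    expected_keywords = ["refund", "policy"]
    uuid_pattern = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

    # 1. Output must contain the expected keywords.
    for keyword in expected_keywords:
        assert keyword in text.lower(), f"missing expected keyword: {keyword}"

    # 2. Output must not leak forbidden content such as UUIDs.
    assert not re.search(uuid_pattern, text, re.IGNORECASE), "output contains a UUID"

    # 3. Generated article must be between 300 and 3,000 words.
    word_count = len(text.split())
    assert 300 <= word_count <= 3000, f"word count out of range: {word_count}"
```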
2. Integration Tests
Integration tests measure interaction between components.
For instance, an integration test might verify that a Retrieval-Augmented Generation (RAG) system retrieves the correct documents before the LLM generates its answer, as sketched below.
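A minimal sketch of such a retrieval check. The `retriever.search()` call and the `doc_id` field are hypothetical names standing in for whatever retrieval API your system exposes:

```python
def test_rag_retrieval(retriever):
    # Hypothetical gold set: for each query, the document IDs that a correct
    # retrieval step should surface.
    gold = {
        "How do I reset my password?": {"kb-042"},
        "What is the refund window?": {"kb-017", "kb-019"},
    }

    for query, expected_ids in gold.items():
        results = retriever.search(query, top_k=5)  # hypothetical retriever API
        retrieved_ids = {doc.doc_id for doc in results}
        # The test passes only if every expected document was retrieved.
        assert expected_ids <= retrieved_ids, (
            f"query {query!r}: missing {expected_ids - retrieved_ids}"
        )
```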
3. Scoped Tests
Scoped tests evaluate domain-specific tasks or datasets relevant to your project. You often split the LLM's functionality into multiple features or scenarios and assess each one individually.
It’s possible to use an LLM to generate test sets, but the generated sets must be reviewed and corrected by humans to ensure quality.
You’ll typically employ automated metrics (e.g., similarity scores, BLEU, ROUGE, perplexity) for quick, repeatable comparisons between versions. Because AI outputs vary probabilistically, you may not achieve 100% pass rates. After a certain threshold, the pass rate becomes a product decision.
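A minimal sketch of a scoped, metric-based comparison using the `rouge-score` package (any similarity metric would work). The per-case threshold, the 90% pass-rate bar, and the `generate` function are assumptions chosen for illustration:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def scoped_pass_rate(cases, generate, threshold=0.5):
    """cases: list of (prompt, reference) pairs for one feature or scenario.
    generate: placeholder function mapping a prompt to the model's output."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    passed = 0
    for prompt, reference in cases:
        output = generate(prompt)
        f1 = scorer.score(reference, output)["rougeL"].fmeasure
        if f1 >= threshold:  # per-case pass criterion (an assumption)
            passed += 1
    return passed / len(cases)

# Because outputs vary probabilistically, require e.g. 90% of cases to pass
# rather than 100%; the exact bar is a product decision.
# assert scoped_pass_rate(summarization_cases, generate) >= 0.90
```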
Additional Considerations:
- Consistency: Similar inputs should yield similar responses.
- Bias: Evaluate outputs for different demographic groups to detect bias.
- Toxicity: Check for harmful or offensive content.
4. Human Evaluation
Human evals involve real people assessing outputs for correctness and quality. While they are the gold standard, they’re also the slowest and most expensive. Automated pipelines should be introduced as soon as possible, but human evaluation remains critical.
Typical human eval questions:
- "Is the response acceptable? (Yes/No)"
- "Critique the output: misinformation, irrelevance, fluency, coherence, etc."
5. LLM-as-a-Judge Evaluation
LLM-as-a-judge uses a specialized or fine-tuned model to evaluate more qualitative aspects such as factual accuracy or writing style. This approach leverages earlier human evaluations to train or prompt-engineer an LLM capable of scoring other LLM outputs.
Track agreement rates between the LLM judge and human judges.
Tools such as G-Eval and SelfCheckGPT can automate parts of this process, for example quality scoring and hallucination detection.
Although powerful, an LLM-based judge must also be monitored to ensure it aligns with human standards.
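Tracking that alignment can be as simple as comparing judge and human labels on a shared set of examples. A minimal sketch; `llm_judge` is a placeholder for whatever prompted or fine-tuned judge model you use:

```python
def judge_agreement(examples, llm_judge, human_labels):
    """examples: list of (example_id, output) pairs.
    llm_judge: placeholder function returning True/False for an output.
    human_labels: dict mapping example_id -> True/False from human evaluation."""
    agree = 0
    for example_id, output in examples:
        if llm_judge(output) == human_labels[example_id]:
            agree += 1
    return agree / len(examples)

# If agreement drops below an agreed bar (say 85%), recalibrate the judge
# prompt or retrain it on fresh human judgments.
```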
6. Benchmarks
Benchmark tests are standardized datasets used to measure specific capabilities (language understanding, reasoning, etc.) and to compare models objectively.
Some widely used benchmarks:
- GLUE: A collection of nine language understanding tasks.
- SuperGLUE: A more challenging extension of GLUE.
- HellaSwag: Tests common sense inference and situational reasoning.
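A minimal sketch of scoring a model on a multiple-choice benchmark such as HellaSwag, loaded via the Hugging Face `datasets` library. The `predict_ending` function is a placeholder for your model's answer-selection logic, and the field names follow the public dataset card at the time of writing:

```python
from datasets import load_dataset  # pip install datasets

def hellaswag_accuracy(predict_ending, n_examples=500):
    """predict_ending: placeholder taking a context and a list of candidate
    endings, returning the index it believes is correct."""
    data = load_dataset("hellaswag", split="validation").select(range(n_examples))
    correct = 0
    for example in data:
        prediction = predict_ending(example["ctx"], example["endings"])
        correct += int(prediction == int(example["label"]))
    return correct / n_examples
```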
Additional Evals
- Adversarial Testing: Design prompts or scenarios that intentionally break your model to see how it handles edge cases or misleading inputs.
- Latency & Resource Monitoring: Especially relevant if your use-case demands real-time responses, or you have usage cost constraints.
- In-Production Monitoring: Keep track of usage metrics such as engagement (how often users rely on the AI) and retention (how often they come back).
- Token Usage: For LLM APIs, track token consumption as a metric for cost and performance (a sketch covering latency and token tracking follows below).
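Latency and token usage can be captured with a thin wrapper around whatever client you call. A minimal sketch; `call_llm` is a placeholder assumed to return the generated text along with its token counts:

```python
import time

def timed_call(call_llm, prompt):
    """call_llm: placeholder returning (text, prompt_tokens, completion_tokens)."""
    start = time.perf_counter()
    text, prompt_tokens, completion_tokens = call_llm(prompt)
    latency_s = time.perf_counter() - start

    # Log alongside your other production metrics (engagement, retention, etc.).
    print(f"latency={latency_s:.2f}s "
          f"tokens_in={prompt_tokens} tokens_out={completion_tokens}")
    return text
```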