Back to Wiki
Development
Last updated: 2024-12-247 min read

Testing & Evaluation

How to test and evaluate AI agents

Testing & Evaluation

Testing non-deterministic AI agents is fundamentally different from testing traditional software. Since the output can vary for the exact same input, boolean assert statements are often insufficient.

The Evaluation Stack

1. Assertions (Deterministic)

For some parts of the system, traditional tests still work.

  • Output Format: Is the output valid JSON?
  • Tool Usage: Did the agent call the search tool with the correct arguments?
  • Latency: Did the response take less than 2 seconds?

2. Reference-Based Grading (Rouge/BLEU)

Comparing the agent's output to a "gold standard" human-written answer.

  • Exact Match: Rarely used for generative tasks.
  • Fuzzy Match: Checking for semantic similarity.
  • Limitations: An agent might be correct but phrase it completely differently, leading to a low score.

3. LLM-as-a-Judge

Using a stronger LLM (e.g., GPT-4) to evaluate the output of the agent locally.

  • Rubrics: Giving the Judge LLM a specific rubric (e.g., "Is the answer helpful? Is it polite?").
  • Pros: Scalable, fast, and captures semantic nuance better than code.
  • Cons: Can be expensive and bias towards its own "style."

4. Human Evaluation

The gold standard, but the most expensive.

  • Thumb up/down: Simple binary feedback from users.
  • A/B Testing: Showing two versions to users and seeing which performs better.

What to Test?

Correctness

Does the agent answer the question factually? This is crucial for RAG systems to avoid hallucinations.

Robustness

How does the agent handle:

  • Jailbreaks: Attempts to make it do something bad.
  • Garbage Input: Random characters or nonsense.
  • Language Mixing: Switching languages mid-sentence.

Tool Use Accuracy

  • Selection: Did it pick the right tool?
  • Arguments: Did it provide the correct parameters?
  • Error Handling: What happens if the tool API fails?

Frameworks for Eval

  • Ragas: Specifically for RAG pipeline evaluation.
  • DeepEval: Unit testing for LLMs.
  • Promptfoo: CLI tool for testing prompts matrix-style.