
EVALUATE AGENTS
Earn the deploy button.
Test agents against real scenarios before they touch production. Track accuracy across versions, catch regressions, and surface what's failing — with explicit reasoning, not just a score. Ship when the platform agrees you're ready.
WHY THIS MATTERS

What "shipped" usually means
Manual spot checks. A demo that worked once. A hope that production will look like staging. Bugs found by users in Slack threads.

What it should mean
Real test cases drawn from real conversations. Automated runs after every change. Explicit pass/fail criteria. A regression alert before the user sees it.

What AgentX makes default
Everything under "What it should mean", built into the same workspace where you create the agent. No separate testing tool. No CI/CD pipeline to wire up.
THE LOOP
Evaluation is a loop, not a step.
You don't evaluate an agent once. You evaluate it every time you change something — and every time the world around it changes. AgentX makes that loop a first-class workflow, not a debug session.
STAGE 1
Test cases & datasets
Build from logs, CSVs, or by hand.
→
STAGE 2
Run & execute
Versioned runs against the dataset.
→
STAGE 3
Score & analyze
LLM-as-judge with reasoning.
→
STAGE 4
Suggest fixes & version
Apply fixes, bump version, repeat.
↻ Loop. Every change re-runs the loop.
This is what lets you deploy an agent without building your own AI team. "No internal AI required" is only true if the platform itself takes responsibility for agent reliability. Ours does.

Test cases & datasets
Real conversations, real edge cases. Build datasets from production logs or write them by hand.

Run & version
Every change creates a versioned run. Compare v3 to v4. Diff what broke.

Score & analyze
LLM-as-judge with explicit reasoning. Pass criteria you define. Yellow rows when something needs a human.

Suggest & fix
AI-generated analysis of what failed and what to change. One-click apply with human override.
STAGE 1
Test with what your agents will actually see.
Synthetic test prompts catch synthetic bugs. Real test data catches real ones. Pull conversations from production logs. Import CSVs of historical examples. Write edge cases by hand. Tag everything so you can run targeted evals — refund disputes only, multilingual queries only, the conversations where humans had to step in.
Conversations
CSV Import
From Production Logs
Hand-Crafted
142 conversations · tagged refund · disputed · multi-turn
USER MESSAGE | EXPECTED BEHAVIOR
"I want my money back NOW" | De-escalate, verify order, route to specialist if amount > $500.
"¿Puedo devolver este producto?" | Respond in Spanish, follow refund flow.
"This is the third time I'm contacting you" | Acknowledge prior contact, escalate to senior agent.
Filter: refund · disputed ✓ · multilingual · escalation · edge-case · vip-customer · low-confidence
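As a rough illustration of tagged, targeted evals, the sketch below models a dataset row and a tag filter in plain Python. It is not AgentX's API: `EvalCase`, `select_by_tags`, and the tag assignments are hypothetical, and only the messages and expected behaviors come from the example dataset above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One dataset row: what the user said and what the agent should do."""
    user_message: str
    expected_behavior: str
    tags: frozenset = frozenset()


def select_by_tags(cases, required_tags):
    """Return the cases that carry every tag in required_tags."""
    required = frozenset(required_tags)
    return [case for case in cases if required <= case.tags]


# Messages and expected behaviors are taken from the example dataset;
# the tags attached to each row are illustrative guesses.
DATASET = [
    EvalCase(
        "I want my money back NOW",
        "De-escalate, verify order, route to specialist if amount > $500.",
        frozenset({"refund", "disputed"}),
    ),
    EvalCase(
        "¿Puedo devolver este producto?",
        "Respond in Spanish, follow refund flow.",
        frozenset({"refund", "multilingual"}),
    ),
    EvalCase(
        "This is the third time I'm contacting you",
        "Acknowledge prior contact, escalate to senior agent.",
        frozenset({"escalation", "multi-turn"}),
    ),
]

# A targeted eval: multilingual queries only.
multilingual_only = select_by_tags(DATASET, {"multilingual"})
```

Requiring every tag (rather than any tag) keeps combined filters such as "refund and disputed" narrow, which is usually what a targeted run wants.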
STAGE 2
Every change is a version. Every version is comparable.
Edit a prompt. Swap a tool. Change a model. Add a knowledge source. Every modification creates a new agent version — and every version gets its own evaluation run. Compare v3 to v4. See exactly which test cases broke, which improved, which didn't change.
Refund Specialist
v3 vs v4 comparison
● v4 promoted to production
Metric | v3 (BASELINE) | v4 (PROMOTED)
Overall score / 10 | 8.1 | 9.2 (↑ +1.1)
Passed | 127 / 142 | 134 / 142
Pass rate | 89.4% | 94.4% (↑ +5.0 points)
Failed | 15 | 8 (↓ −7)
Changes | baseline run | RAG 3→5, tone "casual"→"warm-professional", added KB "edge cases"
PER-TAG SCORE CHANGE (v3 → v4)
refund: +1.4
disputed: +0.9
escalation: +0.5
vip-customer: ±0.0
multilingual: −0.4
⚠ Regression detected. v4 lost 0.4 on multilingual queries — investigate before scaling.
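One way to picture the per-tag regression check is the sketch below. It assumes per-case scores are available as `(tags, score)` pairs; the function names and the `tolerance` parameter are hypothetical, not part of AgentX. The deltas fed to it at the end are the ones shown in the panel above.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_tag(results):
    """results: iterable of (tags, score) pairs -> {tag: mean score}."""
    buckets = defaultdict(list)
    for tags, score in results:
        for tag in tags:
            buckets[tag].append(score)
    return {tag: mean(scores) for tag, scores in buckets.items()}


def tag_deltas(baseline, candidate):
    """Per-tag change (candidate minus baseline) for tags present in both runs."""
    shared = baseline.keys() & candidate.keys()
    return {tag: round(candidate[tag] - baseline[tag], 1) for tag in shared}


def find_regressions(deltas, tolerance=0.0):
    """Tags whose score dropped by more than `tolerance` (hypothetical knob)."""
    return sorted(tag for tag, delta in deltas.items() if delta < -tolerance)


# Deltas copied from the v3 -> v4 panel.
PANEL_DELTAS = {
    "refund": 1.4,
    "disputed": 0.9,
    "escalation": 0.5,
    "vip-customer": 0.0,
    "multilingual": -0.4,
}

flagged = find_regressions(PANEL_DELTAS)
```

With the panel's numbers, only `multilingual` is flagged, even though the overall score rose. That is the point of checking tagged subsets separately.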
✓ Automatic versioning on every meaningful change to an agent.
✓ Side-by-side run comparison with per-test-case diff.
✓ Regression alerts when a new version performs worse on tagged subsets.
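A per-test-case diff between two runs could look roughly like this. It assumes each run is a mapping from test-case id to a pass/fail boolean; `diff_runs`, `pass_rate`, and the toy ids are hypothetical. The 127 and 134 of 142 counts are the ones reported for v3 and v4.

```python
def diff_runs(baseline, candidate):
    """Bucket test-case ids by how their outcome changed between two runs.

    baseline, candidate: {case_id: passed (bool)}
    """
    buckets = {"broke": [], "fixed": [], "unchanged": [], "new": []}
    for case_id, passed in sorted(candidate.items()):
        if case_id not in baseline:
            buckets["new"].append(case_id)
        elif baseline[case_id] and not passed:
            buckets["broke"].append(case_id)
        elif not baseline[case_id] and passed:
            buckets["fixed"].append(case_id)
        else:
            buckets["unchanged"].append(case_id)
    return buckets


def pass_rate(run):
    """Percentage of passing cases, rounded to one decimal."""
    return round(100 * sum(run.values()) / len(run), 1)


# Toy runs with made-up ids, just to show the buckets.
v3_toy = {1: True, 2: True, 3: False}
v4_toy = {1: True, 2: False, 3: True, 4: True}
toy_diff = diff_runs(v3_toy, v4_toy)

# Runs shaped like the comparison above: 127 and 134 passes out of 142.
v3_rate = pass_rate({i: i < 127 for i in range(142)})
v4_rate = pass_rate({i: i < 134 for i in range(142)})
```

Comparing outcomes by id rather than comparing totals matters here: a version can fix seven cases and break two others while the headline pass rate still goes up.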
STAGE 3 — THE DIFFERENTIATOR
Green checkmarks lie. Yellow rows tell the truth.
The fastest way to lose trust is to tell builders everything is fine when it isn't. Every eval report on AgentX uses LLM-as-judge to score test cases — with the judge's reasoning visible on every row. When something's borderline, the row shows up yellow with the reason. When something's broken, the row shows up red. The platform tells you what it knows.
Eval report
Refund Specialist v4 · 5 of 142 test cases
View all →
Test #140 — Refund $200, customer polite
Passed all assertions · response time 1.4s
✓ 9.6
Test #141 — Refund $5,000, customer angry
Escalated correctly · de-escalation tone matched
✓ 9.2
Test #142 — Refund $50, customer demanding compensation
Judge flagged — see analysis below
⚠ 6.5
Verdict · Routed to human review
JUDGE ANALYSIS
✗ Response missed key information from knowledge base (Refund Policy section 3.2, compensation clause).
✗ Tone too formal for "casual" instruction setting.
✓ Correct refund amount calculated ($50).
✓ Proper escalation path triggered.
WHAT THE JUDGE SAW
Agent reply:
"We are unable to provide additional compensation beyond the refund amount, as per our policies."
Expected:
Acknowledge customer frustration, reference section 3.2 goodwill credit option, offer escalation path warmly.
SUGGESTED FIX
→ Increase RAG retrieval from 3 to 5 chunks
→ Adjust temperature on response from 0.3 to 0.5
→ Add knowledge source: Refund Policy v2 §3.2
Auto-apply fix
Edit instructions
Mark as acceptable
Test #143 — Refund denied, customer escalates
Handled appeal flow correctly
✓ 8.4
Test #144 — Multilingual refund query (Spanish)
Failed: agent responded in English to Spanish query
✗ 4.1
✓ Every score comes with the judge's reasoning, so you see why a case scored the way it did, not just the number.
✓ Yellow rows automatically route to human review queue.
✓ Red rows block the deploy button until resolved.
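The green, yellow, and red row logic can be sketched as a simple threshold mapping. The two thresholds below are assumptions chosen only so that the five scores in the sample report land in the colors shown there; the page does not state the real cut-offs, and `RowStatus`, `classify`, and `deploy_blocked` are hypothetical names.

```python
from enum import Enum


class RowStatus(Enum):
    GREEN = "pass"
    YELLOW = "needs human review"
    RED = "fail"


# Hypothetical cut-offs; the actual pass criteria are defined per agent.
PASS_THRESHOLD = 8.0
REVIEW_THRESHOLD = 5.0


def classify(score, pass_threshold=PASS_THRESHOLD, review_threshold=REVIEW_THRESHOLD):
    """Map a judge score (0-10) to a row color."""
    if score >= pass_threshold:
        return RowStatus.GREEN
    if score >= review_threshold:
        return RowStatus.YELLOW
    return RowStatus.RED


def deploy_blocked(statuses):
    """Red rows block the deploy button; yellow rows only route to review."""
    return any(status is RowStatus.RED for status in statuses)


# Scores from the sample eval report (tests #140 to #144).
report_scores = [9.6, 9.2, 6.5, 8.4, 4.1]
report_statuses = [classify(score) for score in report_scores]
```

Keeping a middle band is a deliberate choice: a borderline case is surfaced for a person to judge instead of being rounded up to a pass or down to a failure.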
STAGE 4
Failures come with suggested fixes.
A failing test case is information. AgentX turns that information into specific, applicable changes — adjusted retrieval parameters, modified instructions, new knowledge sources, swapped sub-agent routing rules. You decide whether to apply, edit, or override.
Failure analysis
Refund Specialist v4 · 8 failing test cases
By severity
By test tag
By suggested fix
🔧 Suggested fix #1
affects 5 of 8 failures
"Increase RAG retrieval count from 3 to 5 chunks"
Why: Multiple failures show missed information from KB.
Estimated impact: +1.2 average score on affected cases.
Apply to v5
Preview change
🔧 Suggested fix #2
affects 2 of 8 failures
"Add multilingual response handling instruction"
Why: Agent responds in English to non-English queries.
Estimated impact: +2.8 average score on `multilingual` tag.
Apply to v5
Preview change
🔧 Suggested fix #3
affects 1 of 8 failures
"Edge case: customer requests compensation beyond refund"
Why: No knowledge source covers goodwill credit policy.
Suggested action: Add KB doc or accept routing to human.
Add to KB
Mark as expected
✓ Estimated score impact before you apply.
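The "affects N of 8 failures" grouping can be approximated by counting how many failing cases each suggested fix covers. This is a minimal sketch under assumptions: the case ids are invented, `rank_fixes` is a hypothetical helper, and only the fix labels and the 5/2/1 split come from the failure analysis above.

```python
from collections import Counter


def rank_fixes(fix_by_case):
    """Rank suggested fixes by how many failing cases each one covers.

    fix_by_case: {failing_case_id: suggested fix label}
    Returns a list of (fix, affected, total_failures), most impactful first.
    """
    counts = Counter(fix_by_case.values())
    total = len(fix_by_case)
    return [(fix, affected, total) for fix, affected in counts.most_common()]


RAG_FIX = "Increase RAG retrieval count from 3 to 5 chunks"
LANG_FIX = "Add multilingual response handling instruction"
KB_FIX = "Edge case: customer requests compensation beyond refund"

# Hypothetical ids for the 8 failing cases, split 5 / 2 / 1 as in the panel.
FAILURES = {
    "f1": RAG_FIX, "f2": RAG_FIX, "f3": RAG_FIX, "f4": RAG_FIX, "f5": RAG_FIX,
    "f6": LANG_FIX, "f7": LANG_FIX,
    "f8": KB_FIX,
}

ranked = rank_fixes(FAILURES)
```

Sorting by coverage puts the change that clears the most failures first, which is a reasonable default order for review before anything is applied to v5.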
AFTER DEPLOY
Evaluation doesn't stop at deploy.
Last 30 days (Apr 11 to May 10) · scheduled evals every 12h
⚠ May 5 — score dip detected · 12 conversations routed to human review
● Stable
Refund Specialist v4
Score stable at 9.2 over 7d. No action needed.
● Drifting
Order Lookup v2
Score dropped 0.8 over 14d — model drift suspected. Open a new eval run.
● Failing
Knowledge Search v1
Multilingual cases failing >50% — investigation required immediately.
✓ Scheduled eval runs against production samples (daily, weekly, custom).
✓ Score drift alerts when a deployed agent trends down.
✓ Auto-route low-confidence production conversations into the next eval cycle.
✓ Compare production scores against pre-deploy scores to close the loop.
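A deliberately naive version of the Stable / Drifting / Failing labels is sketched below. It assumes a chronological list of scheduled-eval scores and a current fail rate. The 0.5-point drop threshold is a made-up default, the 50% fail-rate cut-off mirrors the "failing >50%" wording above, and `agent_status` is a hypothetical name.

```python
def agent_status(scores, fail_rate, window=14, drop_threshold=0.5, fail_rate_threshold=0.5):
    """Label a deployed agent from its recent scheduled-eval history.

    scores: chronological eval scores, one per period (oldest first).
    fail_rate: fraction of recent cases failing, between 0 and 1.
    """
    if not scores:
        raise ValueError("need at least one score")
    if fail_rate > fail_rate_threshold:
        return "failing"
    recent = scores[-window:]
    # Naive drift measure: first score in the window minus the latest one.
    drop = round(recent[0] - recent[-1], 1)
    if drop >= drop_threshold:
        return "drifting"
    return "stable"


stable = agent_status([9.2] * 7, fail_rate=0.05)
drifting = agent_status([9.0, 8.8, 8.5, 8.2], fail_rate=0.10)  # 0.8 drop
failing = agent_status([7.0, 7.0], fail_rate=0.60)
```

A production check would more likely compare window averages or fit a trend, so that a single noisy run does not trigger an alert. The endpoint comparison here just keeps the idea visible.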
DEPLOY GATE
Ship when the platform agrees you're ready.
Refund Specialist v5
Deploy to production · pre-flight checks
✓ Overall score ≥ 9.0 (current 9.4): threshold met across all datasets.
✓ Pass rate on `refund` dataset ≥ 95% (current 97.2%): primary dataset for this agent.
⚠ Pass rate on `multilingual` dataset ≥ 90% (current 87.3%): below threshold, 1 criterion blocking deploy.
✓ No regression vs v4 on any tag (passed): largest delta +1.1 (refund tag).
Deploy (disabled)
Override and deploy →
Run additional evals
↑ Override will be logged with your name and reason. Production agent will deploy with one criterion not met.
✓ Deploy criteria set per agent, per workspace, per environment.
✓ Overrides logged with name, reason, and timestamp.
✓ Failed evals can auto-trigger fix suggestions instead of blocking work.
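The pre-flight check can be read as a list of minimum-threshold criteria plus an audited override. The sketch below uses the three numeric criteria from the Refund Specialist v5 panel. The regression check is left out because the panel does not give the smallest per-tag delta. `Criterion`, `unmet`, `can_deploy`, and `log_override` are hypothetical names, not AgentX's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Criterion:
    name: str
    current: float
    minimum: float

    @property
    def met(self):
        return self.current >= self.minimum


def unmet(criteria):
    """Names of criteria still below their threshold."""
    return [c.name for c in criteria if not c.met]


def can_deploy(criteria):
    """The deploy button is enabled only when nothing is unmet."""
    return not unmet(criteria)


def log_override(criteria, user, reason):
    """Build an audit record for deploying despite unmet criteria."""
    if not reason.strip():
        raise ValueError("an override needs a reason")
    return {
        "user": user,
        "reason": reason,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "unmet": unmet(criteria),
    }


# Thresholds and current values from the v5 pre-flight panel.
V5_CRITERIA = [
    Criterion("Overall score", current=9.4, minimum=9.0),
    Criterion("Pass rate on refund dataset", current=97.2, minimum=95.0),
    Criterion("Pass rate on multilingual dataset", current=87.3, minimum=90.0),
]

blocked_by = unmet(V5_CRITERIA)
```

Recording the unmet criteria inside the override entry means the audit log shows what was knowingly skipped, as well as who deployed and when.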
BUILT FOR REAL USERS
Eval data is data. We treat it like data.

Test data isolation
Datasets stay scoped to the workspace they belong to. No cross-workspace leakage of test data or eval results.

Eval audit trail

PII handling in datasets
Mark PII fields in datasets. Optional automatic scrubbing before LLM-as-judge runs. Bring your own redaction rules.
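As an illustration of scrubbing marked fields before a judge run, the sketch below applies regex redaction rules to the fields flagged as PII. The two default patterns are simplistic examples rather than production-grade detectors, and `scrub`, `scrub_row`, and the rule format are assumptions about how "bring your own redaction rules" might look.

```python
import re

# Example rules only: real PII detection needs far more careful patterns.
DEFAULT_RULES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def scrub(text, rules=DEFAULT_RULES):
    """Replace every match of each rule with a [LABEL] placeholder."""
    for label, pattern in rules.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text


def scrub_row(row, pii_fields, rules=DEFAULT_RULES):
    """Scrub only the string fields marked as PII; leave the rest untouched."""
    return {
        key: scrub(value, rules) if key in pii_fields and isinstance(value, str) else value
        for key, value in row.items()
    }


# Custom rule supplied by the workspace (hypothetical order-id format).
custom_rules = {**DEFAULT_RULES, "order_id": re.compile(r"ORD-\d+")}

row = {
    "user_message": "Refund ORD-1234 please, reach me at jane@example.com",
    "expected_behavior": "Verify order, follow refund flow.",
}
clean_row = scrub_row(row, pii_fields={"user_message"}, rules=custom_rules)
```

Replacing matches with typed placeholders instead of deleting them keeps the sentence readable for the judge, which still needs to assess tone and routing.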

Cloud or on-prem evals
Run the full eval pipeline on-prem if your test data can't leave your environment. Same UI, same workflow.
GET STARTED
Stop hoping your agents work. Prove it.
Free to start. Full evaluation suite included on every plan — including the free tier. Because an agent platform that locks evaluation behind a paywall doesn't actually believe in evaluation.
