EVALUATE AGENTS

Earn the deploy button.

Test agents against real scenarios before they touch production. Track accuracy across versions, catch regressions, and surface what's failing - with explicit reasoning, not just a score. Ship when the platform agrees you're ready.

Start free

See an eval report

Evaluation Runs

View runs across your agents

+ New

Refund Specialist v4

Refund Policy Q4 · Sarah Kim

9.8

Order Lookup v2

Order History KB · Sarah Kim

9.2

Customer Success Team

Real Conversations Mar · Marek N.

8.4

Refund Specialist v4

Edge Cases — Disputed · Sarah Kim

7.3

Refund Specialist v3

Edge Cases — Disputed · Sarah Kim

6.5

Knowledge Search v1

Multilingual Queries · Marek N.

4.1

Email Drafter v1

Tone Calibration · Sarah Kim

9.6

↑ The yellow rows are what platforms that "just deploy" won't show you.

WHY THIS MATTERS

A demo isn't a deployment.

An agent that handles 5 test prompts well doesn't handle 500 real ones. Edge cases break agents. Knowledge gets stale. Models drift. A working prototype is the beginning of the work - not the end. Most platforms stop here and let you find out in production. We don't.

What "shipped" usually means

Manual spot checks. A demo that worked once. A hope that production will look like staging. Bugs found by users in Slack threads.

What it should mean

Real test cases drawn from real conversations. Automated runs after every change. Explicit pass/fail criteria. A regression alert before the user sees it.

What AgentX makes default

Everything under "What it should mean" - built into the same workspace where you create the agent. No separate testing tool. No CI/CD pipeline to wire up.

THE LOOP

Evaluation is a loop, not a step.

You don't evaluate an agent once. You evaluate it every time you change something - and every time the world around it changes. AgentX makes that loop a first-class workflow, not a debug session.

STAGE 1

Test cases & datasets

Build from logs, CSVs, or by hand.

→

STAGE 2

Run & execute

Versioned runs against the dataset.

→

STAGE 3

Score & analyze

LLM-as-judge with reasoning.

→

STAGE 4

Suggest fixes & version

Apply fixes, bump version, repeat.

↻ Loop. Every change re-runs the loop.

This is what lets you deploy an agent without building your own AI team. "No internal AI required" is only true if the platform itself takes responsibility for agent reliability. Ours does.

Test cases & datasets

Real conversations, real edge cases. Build datasets from production logs or write them by hand.

Run & version

Every change creates a versioned run. Compare v3 to v4. Diff what broke.

Score & analyze

LLM-as-judge with explicit reasoning. Pass criteria you define. Yellow rows when something needs a human.

Suggest & fix

AI-generated analysis of what failed and what to change. One-click apply with human override.

STAGE 1

Test with what your agents will actually see.

Synthetic test prompts catch synthetic bugs. Real test data catches real ones. Pull conversations from production logs. Import CSVs of historical examples. Write edge cases by hand. Tag everything so you can run targeted evals — refund disputes only, multilingual queries only, the conversations where humans had to step in.

Conversations

CSV Import

From Production Logs

Hand-Crafted

142 conversations

tagged refund · disputed · multi-turn

USER MESSAGE · EXPECTED BEHAVIOR

"I want my money back NOW" · De-escalate, verify order, route to specialist if amount > $500.

"¿Puedo devolver este producto?" · Respond in Spanish, follow refund flow.

"This is the third time I'm contacting you" · Acknowledge prior contact, escalate to senior agent.

Filter:

refund

disputed ✓

multilingual

escalation

edge-case

vip-customer

low-confidence

✓ Import datasets from CSVs, JSON, or directly from production logs.

✓ Tag and filter - run targeted evals on the cases you care about.

✓ Build expected behaviors as freeform criteria or strict rubrics.
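
As a rough illustration of tag-based targeting, the sketch below filters a small dataset down to the cases that carry every requested tag. The `EvalCase` type, the `select_cases` helper, and the tags assigned to each message are assumptions made for this example; they are not AgentX's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical test case: a user message, the expected behavior, and tags."""
    user_message: str
    expected_behavior: str
    tags: frozenset = frozenset()


def select_cases(cases, required_tags):
    """Return the cases that carry every tag in required_tags."""
    required = frozenset(required_tags)
    return [case for case in cases if required <= case.tags]


# Messages and expected behaviors come from the example table above;
# the tag assignments are illustrative guesses.
DATASET = [
    EvalCase(
        "I want my money back NOW",
        "De-escalate, verify order, route to specialist if amount > $500.",
        frozenset({"refund", "disputed"}),
    ),
    EvalCase(
        "¿Puedo devolver este producto?",
        "Respond in Spanish, follow refund flow.",
        frozenset({"refund", "multilingual"}),
    ),
    EvalCase(
        "This is the third time I'm contacting you",
        "Acknowledge prior contact, escalate to senior agent.",
        frozenset({"escalation", "multi-turn"}),
    ),
]

# Run a targeted eval on refund disputes only.
targeted = select_cases(DATASET, ["refund", "disputed"])
print([case.user_message for case in targeted])
```

Requiring all tags (rather than any) keeps a filter like "refund + disputed" narrow; an any-of filter would be an equally reasonable choice for broader sweeps.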

STAGE 2

Every change is a version. Every version is comparable.

Edit a prompt. Swap a tool. Change a model. Add a knowledge source. Every modification creates a new agent version - and every version gets its own evaluation run. Compare v3 to v4. See exactly which test cases broke, which improved, which didn't change.

Refund Specialist

v3 vs v4 comparison

● v4 promoted to production

v3 - BASELINE

8.1

Overall score / 10

Passed: 127 / 142

89.4% pass rate

Failed: 15

Changes: baseline run

v4 - PROMOTED

9.2

↑ +1.1

Overall score / 10

Passed: 134 / 142 ↑ +7

94.4% pass rate

Failed: 8 ↓ −7

Changes: RAG 3→5, tone "casual"→"warm-professional", added KB "edge cases"

PER-TAG SCORE CHANGE (v3 → v4)

refund · +1.4

disputed · +0.9

escalation · +0.5

vip-customer · ±0.0

multilingual · −0.4

⚠ Regression detected. v4 lost 0.4 on multilingual queries — investigate before scaling.

✓ Automatic versioning on every meaningful change to an agent.

✓ Side-by-side run comparison with per-test-case diff.

✓ Regression alerts when a new version performs worse on tagged subsets.
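
The comparison arithmetic can be sketched in a few lines. The pass counts (127 and 134 of 142) are the ones shown in the v3 vs v4 panel; the absolute per-tag scores below are hypothetical, chosen only so that their differences match the deltas shown, and the function names are assumptions rather than AgentX's API.

```python
def pass_rate(passed, total):
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def tag_deltas(baseline, candidate):
    """Per-tag score change from baseline to candidate, for tags present in both."""
    return {
        tag: round(candidate[tag] - baseline[tag], 1)
        for tag in baseline
        if tag in candidate
    }


def regressions(deltas):
    """Tags where the candidate version scored lower than the baseline."""
    return sorted(tag for tag, delta in deltas.items() if delta < 0)


# Pass counts from the comparison panel.
print(pass_rate(127, 142), pass_rate(134, 142))

# Hypothetical absolute scores; only the differences mirror the panel.
V3 = {"refund": 7.9, "disputed": 7.0, "escalation": 8.5,
      "vip-customer": 9.0, "multilingual": 8.0}
V4 = {"refund": 9.3, "disputed": 7.9, "escalation": 9.0,
      "vip-customer": 9.0, "multilingual": 7.6}

deltas = tag_deltas(V3, V4)
print(deltas)
print("Regressed tags:", regressions(deltas))
```

An overall score can rise while a tagged subset falls, which is why the check runs per tag instead of on the headline number alone.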

STAGE 3 — THE DIFFERENTIATOR

Green checkmarks lie. Yellow rows tell the truth.

The fastest way to lose trust is to tell builders everything is fine when it isn't. Every eval report on AgentX uses LLM-as-judge to score test cases - with the judge's reasoning visible on every row. When something's borderline, the row shows up yellow with the reason. When something's broken, the row shows up red. The platform tells you what it knows.

Eval report

Refund Specialist v4 · 5 of 142 test cases

View all →

Test #140 - Refund $200, customer polite

Passed all assertions · response time 1.4s

✓ 9.6

Test #141 - Refund $5,000, customer angry

Escalated correctly · de-escalation tone matched

✓ 9.2

Test #142 - Refund $50, customer demanding compensation

Judge flagged - see analysis below

⚠ 6.5

Verdict · Routed to human review

JUDGE ANALYSIS

✗ Response missed key information from knowledge base (Refund Policy section 3.2 - compensation clause).

✗ Tone too formal for "casual" instruction setting.

✓ Correct refund amount calculated ($50).

✓ Proper escalation path triggered.

WHAT THE JUDGE SAW

Agent reply:

"We are unable to provide additional compensation beyond the refund amount, as per our policies."

Expected:

Acknowledge customer frustration, reference section 3.2 goodwill credit option, offer escalation path warmly.

SUGGESTED FIX

→ Increase RAG retrieval from 3 to 5 chunks

→ Adjust temperature on response from 0.3 to 0.5

→ Add knowledge source: Refund Policy v2 §3.2

Auto-apply fix

Edit instructions

Mark as acceptable

Test #143 - Refund denied, customer escalates

Handled appeal flow correctly

✓ 8.4

Test #144 - Multilingual refund query (Spanish)

Failed: agent responded in English to Spanish query

✗ 4.1

✓ Every score comes with the judge's reasoning - you see the explanation, not just the number.

✓ Yellow rows automatically route to human review queue.

✓ Red rows block the deploy button until resolved.

STAGE 4

Failures come with suggested fixes.

A failing test case is information. AgentX turns that information into specific, applicable changes - adjusted retrieval parameters, modified instructions, new knowledge sources, swapped sub-agent routing rules. You decide whether to apply, edit, or override.

Failure analysis

Refund Specialist v4 · 8 failing test cases

By severity

By test tag

By suggested fix

🔧 Suggested fix #1

affects 5 of 8 failures

"Increase RAG retrieval count from 3 to 5 chunks"

Why: Multiple failures show missed information from KB.

Estimated impact: +1.2 average score on affected cases.

Apply to v5

Preview change

🔧 Suggested fix #2

affects 2 of 8 failures

"Add multilingual response handling instruction"

Why: Agent responds in English to non-English queries.

Estimated impact: +2.8 average score on `multilingual` tag.

Apply to v5

Preview change

🔧 Suggested fix #3

affects 1 of 8 failures

"Edge case: customer requests compensation beyond refund"

Why: No knowledge source covers goodwill credit policy.

Suggested action: Add KB doc or accept routing to human.

Add to KB

Mark as expected

✓ Suggested fixes grouped by impact - fix one thing, resolve five failures.

✓ Estimated score impact before you apply.

✓ Every fix is editable - never auto-applied without your call.
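
Grouping by impact amounts to counting how many failing cases each suggested fix would address and sorting by that count. The sketch below assumes each failure has already been mapped to one fix label; the labels and the `rank_fixes` helper are hypothetical, and the 5 / 2 / 1 split mirrors the failure analysis panel above.

```python
from collections import Counter


def rank_fixes(fix_per_failure):
    """Return (fix, affected_count) pairs, most impactful first."""
    return Counter(fix_per_failure).most_common()


# One entry per failing test case (8 in total), labelled with its suggested fix.
FAILURES = (
    ["increase-rag-chunks"] * 5
    + ["multilingual-instruction"] * 2
    + ["goodwill-credit-kb"] * 1
)

for fix, count in rank_fixes(FAILURES):
    print(f"{fix}: affects {count} of {len(FAILURES)} failures")
```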

AFTER DEPLOY

Evaluation doesn't stop at deploy.

A model update from your LLM provider can quietly change agent behavior. A new product line can leave your knowledge base out of date. Real users send conversations no test set covered. In production, AgentX runs scheduled evals against fresh production samples and alerts you when scores trend the wrong way.

Refund Specialist v4 - production score

Last 30 days · scheduled evals every 12h

Apr 11

Apr 26

May 10

⚠ May 5 - score dip detected · 12 conversations routed to human review

● Stable

Refund Specialist v4

Score stable at 9.2 over 7d. No action needed.

● Drifting

Order Lookup v2

Score dropped 0.8 over 14d - model drift suspected. Open a new eval run.

● Failing

Knowledge Search v1

Multilingual cases failing >50% - investigation required immediately.

✓ Scheduled eval runs against production samples (daily, weekly, custom).

✓ Score drift alerts when a deployed agent trends down.

✓ Auto-route low-confidence production conversations into the next eval cycle.

✓ Compare production scores against pre-deploy scores - close the loop.
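
One simple way to think about a drift alert is to compare the mean of the most recent scheduled eval scores with the mean of the earlier ones. This is a minimal sketch of that idea only; the window size, the 0.5-point alert threshold, and the `classify_trend` name are all assumptions, and the page does not describe how AgentX actually detects drift.

```python
from statistics import mean


def classify_trend(scores, window=3, drop_alert=0.5):
    """Label a score series by comparing the recent window to earlier runs.

    scores is ordered oldest to newest. Returns "drifting" when the recent
    mean sits at least drop_alert points below the earlier mean.
    """
    if len(scores) < 2 * window:
        return "insufficient-data"
    earlier = mean(scores[:-window])
    recent = mean(scores[-window:])
    return "drifting" if earlier - recent >= drop_alert else "stable"


# Illustrative series, not real production data.
print(classify_trend([9.2, 9.2, 9.3, 9.1, 9.2, 9.2]))  # small wobble
print(classify_trend([9.0, 9.0, 9.0, 8.2, 8.2, 8.2]))  # sustained 0.8 drop
```

Averaging over a window keeps a single noisy run from triggering an alert; a real implementation would likely also account for sample size and variance.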

DEPLOY GATE

Ship when the platform agrees you're ready.

You set the bar. Pass thresholds per dataset, score floors per agent, acceptance criteria per tag. The deploy button stays disabled until criteria pass. You can override - but you'll see exactly what you're overriding, and the override gets logged.

Refund Specialist v5

Deploy to production · pre-flight checks

✓ Overall score ≥ 9.0 · Threshold met across all datasets. · current 9.4

✓ Pass rate on `refund` dataset ≥ 95% · Primary dataset for this agent. · current 97.2%

⚠ Pass rate on `multilingual` dataset ≥ 90% · Below threshold - 1 criterion blocking deploy. · current 87.3%

✓ No regression vs v4 on any tag · Largest delta: +1.1 (refund tag). · passed

Deploy (disabled)

Override and deploy →

Run additional evals

↑ Override will be logged with your name and reason. Production agent will deploy with one criterion not met.

✓ Deploy criteria set per agent, per workspace, per environment.

✓ Overrides logged with name, reason, and timestamp.

✓ Failed evals can auto-trigger fix suggestions instead of blocking work.
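
The gate logic can be pictured as a list of thresholds plus an explicit, recorded override. The sketch below uses the three numeric pre-flight checks from the Refund Specialist v5 example; the `Criterion` type and `deploy_decision` function are hypothetical names for illustration, and the per-tag regression check and timestamping are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical deploy criterion: a measured value and its minimum."""
    name: str
    current: float
    minimum: float

    @property
    def met(self):
        return self.current >= self.minimum


def deploy_decision(criteria, override_by=None, override_reason=None):
    """Allow deploy when all criteria pass, or when an override names who and why."""
    blocking = [c.name for c in criteria if not c.met]
    if not blocking:
        return {"deploy": True, "overridden": []}
    if override_by and override_reason:
        # The override is returned as a record so it can be written to an audit log.
        return {"deploy": True, "overridden": blocking,
                "by": override_by, "reason": override_reason}
    return {"deploy": False, "blocking": blocking}


# Values from the pre-flight panel above.
PREFLIGHT = [
    Criterion("overall score", 9.4, 9.0),
    Criterion("refund pass rate", 97.2, 95.0),
    Criterion("multilingual pass rate", 87.3, 90.0),
]

print(deploy_decision(PREFLIGHT))
print(deploy_decision(PREFLIGHT, override_by="Sarah Kim",
                      override_reason="Multilingual fix scheduled for v6"))
```

Requiring both a name and a reason before an override succeeds is what makes the override auditable rather than silent.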

BUILT FOR REAL USERS

Eval data is data. We treat it like data.

Test sets often contain customer information, internal documents, and sensitive examples. The eval pipeline isn't a separate system with its own rules - it's the same isolated, audited, controlled environment as the rest of your workspace.

Test data isolation

Datasets stay scoped to the workspace they belong to. No cross-workspace leakage of test data or eval results.

Eval audit trail

Every eval run is logged: who ran it, against which dataset, with which agent version, and any override applied. Exportable.

PII handling in datasets

Mark PII fields in datasets. Optional automatic scrubbing before LLM-as-judge runs. Bring your own redaction rules.

Cloud or on-prem evals

Run the full eval pipeline on-prem if your test data can't leave your environment. Same UI, same workflow.

Enterprise features →

Architecture deep-dive →

Build it. Evaluate it. Then ship it.

Evaluation closes the loop between an agent that works in your editor and an agent that works in your customer's hands. It's the second of three pillars. Here's the first and the third.

Build Agents

→

Visual builder, multi-agent orchestration, skills, tools, knowledge, memory. The first step of the loop.

See builder →

Deploy Agents

→

One-click to API, Slack, Teams, WhatsApp, web, email, voice. Versioned. Rollback in a click. Eval-gated deployment built in.

See deployment →

GET STARTED

Stop hoping your agents work. Prove it.

Free to start. Full evaluation suite included on every plan - including the free tier. Because an agent platform that locks evaluation behind a paywall doesn't actually believe in evaluation.

Free

$0

/ forever

Build and test your first agent.

Try it

Solo Builder

$49

/ month

Solo builders shipping production agents.

Get started

Professional/Business

$199 - $299

/ month

Agencies and service teams with white-label deployment.

See plans

Enterprise

Custom

scoped per process

On-prem, SSO, dedicated infrastructure.

Talk to us

Start Your AI Automation Journey Today

Get Started - Free

View Pricing