EVALUATION FRAMEWORK

Evaluation is the gate. Not a checkbox.

We don't deploy because the demo looked good. We deploy because evaluation passed - against your real historical data, against thresholds your team set, with edge cases explicitly surfaced, and with a report your model validator signed off on. Before deploy, evaluation is the gate. After deploy, the same engine runs continuously as a governance control.

Talk to us

See a real evaluation example →

CONTINUOUS EVALUATION LOOP

Runs before deploy and continuously after

Build test set → Run evaluation → Score & surface failures → Threshold decision → Iterate or deploy → Monitor drift

↺ On threshold breach → loop back to Run evaluation

The same engine that gates initial deployment runs continuously in production.

TWO MODES

Pre-deploy validation. Post-deploy monitoring.

Evaluation operates in two modes. Before deploy: a structured validation cycle against historical data, with a written report and a threshold decision. After deploy: continuous monitoring of production traffic, with drift detection and automatic alerts. Both modes share the same underlying engine - same test datasets, same scoring, same audit trail, same threshold definitions.

PRE-DEPLOYMENT VALIDATION

The Stage 2 gate in our delivery model.

We build the agent workflow, connect to your real historical data, and run it against the test dataset you helped construct. The output is a written evaluation report - per-field accuracy, edge case coverage, failure modes, tone scoring (where applicable), and recommended actions. Your team sets the threshold during Stage 1 scope; we don't deploy until we hit it.

This is the Stage 2 gate in How We Work →


POST-DEPLOYMENT MONITORING

After go-live, the evaluation engine runs continuously.

Production traffic is sampled, scored against the same validated criteria, and compared to baseline. Drift detection triggers when rolling accuracy falls below the defined threshold. Incident response is automated: the workflow is held, evaluation re-runs against an expanded test set, and validation evidence is generated for the risk team.
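As a rough illustration of the drift check described above, the sketch below keeps a rolling window of scored production samples and signals a hold once rolling accuracy falls below the threshold. The class name, window size, and sample data are assumptions made for this example, not the platform's implementation; the 92% threshold mirrors the sample figures used elsewhere on this page.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent `window` scored samples (hypothetical helper)."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.window = window
        self.results = deque(maxlen=window)  # True means the sample was scored correct

    def rolling_accuracy(self) -> float:
        return sum(self.results) / len(self.results)

    def record(self, correct: bool) -> bool:
        """Add one scored sample; return True if the workflow should be held."""
        self.results.append(correct)
        if len(self.results) < self.window:
            return False  # wait for a full window before judging drift
        return self.rolling_accuracy() < self.threshold


monitor = RollingAccuracyMonitor(threshold=0.92, window=100)

# Illustrative stream: 95% correct at first, then a run of failures.
outcomes = [True] * 95 + [False] * 5 + [False] * 4

breach_at = None
for index, ok in enumerate(outcomes):
    if monitor.record(ok) and breach_at is None:
        breach_at = index  # first sample at which a hold would be raised

print(breach_at, monitor.rolling_accuracy())
```

In a real deployment the hold signal would feed the incident process described above; here it is only an index into the sample stream.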

Drift detection feeds into the governance loop →


TEST DATA

Synthetic data finds synthetic problems. We use yours.

Evaluation only works if the test data reflects what the agent will actually encounter in production. We don't use synthetic test sets generated by an LLM. We work with your team to construct datasets from real historical cases - invoices you actually processed, conversations you actually had, documents you actually filed. Edge cases are pulled from production logs, not invented.

TEST DATASET - INVOICE PROCESSING

Constructed for: Acme Corp · Invoice Processing v3.1

Constructed by: M. Torres (Acme Risk) + AgentX delivery

Approved by: S. Kim (CFO) · Apr 8, 2026

DATASET COMPOSITION

Total test cases: 142

Source: Historical invoices Jan–Mar 2026

Selection method: Stratified random + edge case picks

Stratification: By vendor type, amount tier, doc format

BREAKDOWN BY SCENARIO

PO-matched invoices

Non-PO invoices

Multilingual invoices (Spanish, Polish, German)

EDGE CASES INCLUDED

Multi-page invoices with attachments

Foreign currency with FX conversion required

Credit memos vs standard invoices

Vendor name variations (DBA, Inc, LLC differences)

Amounts requiring manager approval (>$10K)

PII HANDLING

✓ All PII fields redacted before use in evaluation

✓ Test set stored within Acme workspace boundary

✓ No customer data leaves Acme environment during eval

✓ LLM judge runs against PII-redacted text only


✓ Test cases drawn from real historical data - not synthetic

✓ Edge cases sourced from production logs and risk team review

✓ Stratified sampling ensures coverage across scenarios, not just easy cases

✓ PII handling documented and enforced; test data stays within your boundary during evaluation
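To make the selection method concrete, here is a minimal sketch of stratified random sampling combined with hand-picked edge cases. The function name, record fields, and strata are hypothetical and chosen only for illustration; actual dataset construction is done with your team against your historical cases.

```python
import random
from collections import defaultdict


def build_test_set(cases, strata_key, per_stratum, edge_case_ids, seed=0):
    """Stratified random sample plus explicit edge-case picks, de-duplicated by id."""
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    by_stratum = defaultdict(list)
    for case in cases:
        by_stratum[strata_key(case)].append(case)

    selected = {}
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        k = min(per_stratum, len(members))
        for case in rng.sample(members, k):
            selected[case["id"]] = case

    # Edge cases are always included, whether or not sampling picked them.
    by_id = {case["id"]: case for case in cases}
    for case_id in edge_case_ids:
        selected[case_id] = by_id[case_id]
    return list(selected.values())


# Hypothetical historical cases: 4 strata with 10 cases each.
combos = [(v, t) for v in ("domestic", "foreign") for t in ("low", "high") for _ in range(10)]
cases = [{"id": i, "vendor_type": v, "amount_tier": t} for i, (v, t) in enumerate(combos)]

test_set = build_test_set(
    cases,
    strata_key=lambda c: (c["vendor_type"], c["amount_tier"]),
    per_stratum=3,
    edge_case_ids=[0, 39],
)
print(len(test_set))
```

Sampling per stratum is what keeps rare scenarios from being crowded out by the most common ones.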

Data handling and isolation → Security

MEASUREMENT

Per-field accuracy. Per-document-type. Per-scenario.

Aggregate scores hide failures. A 95% overall accuracy can include 99% on the easy cases and 60% on the cases that matter. We report accuracy at the level of detail your risk team needs - per field, per document type, per workflow path, per protected attribute when relevant for fairness assessment.
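The arithmetic behind that claim can be checked with a small sketch. The slice sizes below are invented for illustration: 900 easy cases at 99% and 100 hard cases at 60% produce an overall figure of roughly 95%.

```python
def accuracy(results):
    """Fraction of correct results in a list of booleans."""
    return sum(results) / len(results)


# Hypothetical slices: many easy cases, few of the cases that matter.
slices = {
    "easy cases": [True] * 891 + [False] * 9,
    "cases that matter": [True] * 60 + [False] * 40,
}

per_slice = {name: accuracy(results) for name, results in slices.items()}
overall = accuracy([ok for results in slices.values() for ok in results])

for name, value in per_slice.items():
    print(f"{name}: {value:.1%}")
print(f"overall: {overall:.1%}")
```

The overall number looks healthy only because the easy slice is nine times larger than the hard one.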

EVALUATION METHODOLOGY - INVOICE PROCESSING v3.1

ACCURACY MEASUREMENT - PER FIELD

✓ Vendor name: ≥ 95% required

✓ Invoice number: ≥ 99% required

✓ Invoice date: ≥ 98% required

✓ Total amount: ≥ 99% required

✓ Line item descriptions: ≥ 90% required

Aggregate (weighted by field criticality): ≥ 92% required

WORKFLOW PATH ACCURACY

✓ Auto-approve decision correct: ≥ 95% required

✓ Escalation triggered correctly: 100% required (HITL)

✓ Rejection routing correct: ≥ 95% required

FAIRNESS METRICS (where applicable)

✓ Disparate impact ratio measured per protected attribute

✓ Statistical parity difference reported

✓ Threshold of acceptable disparity defined by your team

QUALITATIVE SCORING (customer-facing agents only)

Tone appropriateness: LLM-judge scored

Response completeness: LLM-judge scored

Escalation handling: LLM-judge scored

REPORTING DEPTH

Every failed test case includes: input shown to agent, agent's actual output, expected output, diagnostic reasoning from LLM-judge, recommended fix.
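The methodology card above lists an aggregate that is weighted by field criticality. One plausible reading is a weighted mean of per-field accuracies, sketched below. The weights and accuracy values are hypothetical; in practice the weighting follows your team's definition of criticality.

```python
def weighted_aggregate(accuracies, weights):
    """Weighted mean of per-field accuracies; weights need not sum to 1."""
    total_weight = sum(weights[field] for field in accuracies)
    weighted_sum = sum(accuracies[field] * weights[field] for field in accuracies)
    return weighted_sum / total_weight


# Hypothetical per-field accuracies (fractions) and criticality weights.
accuracies = {
    "vendor_name": 0.96,
    "invoice_number": 0.99,
    "invoice_date": 0.98,
    "total_amount": 0.99,
    "line_item_descriptions": 0.90,
}
weights = {
    "vendor_name": 2,
    "invoice_number": 3,
    "invoice_date": 1,
    "total_amount": 3,
    "line_item_descriptions": 1,
}

aggregate = weighted_aggregate(accuracies, weights)
print(f"{aggregate:.1%}")
```

Because the weights sit next to the per-field scores, a reviewer can see exactly which fields drive the aggregate.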

Per-field accuracy

Aggregate scores hide what matters. We score every field separately and weight by your definition of criticality.

Per-workflow-path accuracy

Whether the right decision was made (auto-approve, escalate, route, reject), independent of whether the data extraction was perfect.

Fairness metrics where applicable

For decisions affecting protected groups, disparate impact and statistical parity measured per attribute.

Qualitative scoring for customer-facing agents

Tone, completeness, escalation handling - scored by LLM-judge with explicit reasoning.
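For the fairness metrics named above, the standard definitions are simple ratios and differences of favorable-outcome rates between a group and a reference group. The sketch below uses invented outcome data, and the 0.8 cutoff is only a commonly cited convention (the four-fifths rule), included as an assumption; the acceptable disparity is defined by your team.

```python
def favorable_rate(outcomes):
    """Share of favorable decisions (True) in a list of outcomes."""
    return sum(outcomes) / len(outcomes)


def disparate_impact_ratio(group, reference):
    """Favorable rate of the group divided by that of the reference group."""
    reference_rate = favorable_rate(reference)
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return favorable_rate(group) / reference_rate


def statistical_parity_difference(group, reference):
    """Favorable rate of the group minus that of the reference group."""
    return favorable_rate(group) - favorable_rate(reference)


# Hypothetical decisions: True = auto-approved.
reference = [True] * 80 + [False] * 20
group = [True] * 60 + [False] * 40

di = disparate_impact_ratio(group, reference)
spd = statistical_parity_difference(group, reference)

MIN_RATIO = 0.8  # illustrative threshold only; set by your team in practice
print(f"disparate impact ratio: {di:.2f}, flagged: {di < MIN_RATIO}")
print(f"statistical parity difference: {spd:+.2f}")
```

A ratio of 1.0 and a difference of 0.0 would indicate identical favorable rates across the two groups.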

PRE-DEPLOY GATE

Your team sets the threshold. We don't deploy until we hit it.

The threshold isn't our decision. During Stage 1 scope, your team - process owner, risk team, compliance officer if relevant - defines what “ready for production” looks like. Per-field accuracy targets. Edge case coverage requirements. Acceptable HITL routing rates. Fairness thresholds. We then run evaluation and either meet those targets or report exactly what we can and cannot deliver.
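A gate of this kind can be pictured as a list of criteria, each with a target and a direction, checked against measured results. The sketch below is an assumption about shape only: the class and function names are hypothetical, and the numbers are taken from the sample methodology and report shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    target: float
    higher_is_better: bool = True

    def passes(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def gate_decision(criteria, measured):
    """Return ("deploy" or "iterate", list of criteria that missed their target)."""
    failures = [c.name for c in criteria if not c.passes(measured[c.name])]
    return ("deploy" if not failures else "iterate", failures)


# Targets as set by the customer team; values in percent.
criteria = [
    Criterion("vendor_name", 95.0),
    Criterion("invoice_number", 99.0),
    Criterion("total_amount", 99.0),
    Criterion("aggregate", 92.0),
    Criterion("workflow_path", 95.0),
    Criterion("hitl_routing_rate", 12.0, higher_is_better=False),
]
measured = {
    "vendor_name": 96.8,
    "invoice_number": 99.4,
    "total_amount": 99.2,
    "aggregate": 94.2,
    "workflow_path": 97.8,
    "hitl_routing_rate": 8.4,
}

decision, failures = gate_decision(criteria, measured)
print(decision, failures)
```

The code only reports the outcome. The actual gate is the signed report, so a "deploy" result here would still need the named reviewers' signatures.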

PRE-DEPLOY EVALUATION REPORT - INVOICE PROCESSING v3.1

Evaluated: Apr 5–7, 2026 | Workflow target go-live: Apr 9, 2026

OVERALL OUTCOME

Aggregate accuracy: 94.2% vs target 92.0% ✓

Edge case coverage: 19/24 ≥ 85%; 5/24 routed to HITL ✓

Workflow path accuracy: 97.8% vs target 95.0% ✓

Fairness (vendor type): No statistically significant disparity

DETAIL BY FIELD

Vendor name: 96.8% ✓

Invoice number: 99.4% ✓

Total amount: 99.2% ✓

Multilingual invoices (below 85%): 79.1% ⚠ routes to human

EDGE CASES SURFACED

✓ 19 edge cases handled at ≥ 85% accuracy

⚠ 5 edge cases below threshold: handwritten annotations (79%), vendor DBA variations (78%), multilingual (79%), currency edge cases (82%), damaged scans (74%) - all route to HITL

ROUTING TO HUMAN REVIEW

Estimated production HITL routing rate: 8.4% (acceptable ≤ 12%) ✓

GATE DECISION - REVIEWER SIGNATURES REQUIRED

☐ Process Owner: S. Kim (CFO) _____________________

☐ Risk Team: M. Torres _____________________

☐ Compliance: R. Chen (where applicable) _____________________

☐ AgentX: delivery lead _____________________

What this document is

A formal validation artifact. Reviewed by your team. Signed before deployment. Retained as audit evidence for the lifecycle of the agent.

What this document isn't

A sales document. A demo result. A summary of what we hope the agent does. Every number is traceable to the underlying test case.

What happens if we don't pass

We iterate - modify the workflow, expand training inputs, narrow the scope, or recommend that the workflow not deploy. Your call. The report stays honest about what works and what doesn't.

The pre-deploy gate is Stage 2 in our delivery process →

POST-DEPLOY

The eval engine doesn't stop at go-live.

Pre-deploy validation answers “does this work today on representative data.” It does not answer “will this still work in 90 days.” The platform answers the second question by running evaluation continuously in production - sampling traffic, scoring outputs, tracking accuracy trends, alerting when drift crosses defined thresholds.

PRODUCTION VALIDATION - INVOICE PROCESSING v3.1

Workspace: Acme Corp | Tracking window: Last 90 days

VALIDATION ACTIVITY

Continuous sample evaluation: 500 cases / day

Scheduled full re-eval: Weekly (Mondays 02:00 UTC)

Triggered re-eval (this month): 2 events

INCIDENT - APR 18 (INC-2026-04-18-001)

Trigger: Rolling accuracy dropped to 91% (threshold 92%)

Detection: 14 minutes after threshold breach

Root cause: LLM provider model update changed behavior on edge cases

Action: Workflow held. Re-eval in 47 min on extended test set. v3.2 deployed Apr 19 (95.4%)

Customer impact: None (HITL caught all flagged cases during hold)

Validator: M. Torres (Acme Risk) · Closure: Apr 19, 2026

VERSION HISTORY

v3.0 - Initial deployment - Apr 9, 2026 - Validated by S. Kim

v3.1 - Tax handling update - Apr 14, 2026 - Validated by M. Torres

v3.2 - Drift correction - Apr 19, 2026 - Validated by M. Torres

✓ Continuous validation against production traffic (configurable sample rate)

✓ Scheduled full re-evaluation against test set (default: weekly)

✓ Triggered re-eval on threshold breach, model version change, or workflow modification

✓ Every incident logged with root cause, action, validation evidence, validator identity
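One common way to implement a configurable sample rate is deterministic, hash-based sampling: the same case always gets the same decision, which keeps the sample auditable. The sketch below is an illustration under that assumption, with a hypothetical ID format and function name; it is not a description of how the platform samples traffic.

```python
import hashlib


def in_sample(case_id: str, rate: float) -> bool:
    """Deterministically decide whether a case is sampled for evaluation."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a roughly uniform number between 0 and 1.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical production case IDs, sampled at a 10% rate.
case_ids = [f"INV-{n:06d}" for n in range(10_000)]
sampled = [case_id for case_id in case_ids if in_sample(case_id, rate=0.10)]
print(len(sampled))
```

Because the decision depends only on the case ID and the rate, a validator can re-derive which cases were evaluated without access to any random state.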

Drift monitoring feeds into governance reporting →

VALIDATION CYCLES

Quarterly by default. Triggered on change.

Continuous monitoring catches accuracy drift in real time. Validation cycles are different - periodic structured re-validation that produces written evidence for your risk team and your auditors. Quarterly by default. Triggered on any meaningful change.

SCHEDULED VALIDATION

Predictable. Calendar-driven.

• Quarterly full re-validation against current test set

• Annual test set review - does the test set still reflect production reality?

• Annual fairness re-assessment for decisions affecting protected groups

• Annual independent validation support - we provide artifacts; your model risk team conducts independent review

TRIGGERED VALIDATION

Reactive. Change-driven.

• LLM provider model version change

• Any workflow modification (rules, integrations, sub-agents)

• Threshold breach in continuous monitoring

• New document type or input pattern detected

• Vendor / system / regulatory change requested by your team

• Significant volume change that may stress the workflow
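Most of the triggers listed above can be expressed as a comparison between two snapshots of a workflow's state. The sketch below shows one way that could look; the snapshot fields, the 50% volume-change ratio, and all names are assumptions for illustration. Changes requested by your team are manual and are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time view of a deployed workflow."""
    model_version: str
    workflow_hash: str
    rolling_accuracy: float
    document_types: frozenset
    daily_volume: int


def revalidation_triggers(previous, current, accuracy_threshold, volume_change_ratio=0.5):
    """List the reasons, if any, that a triggered validation should run."""
    reasons = []
    if current.model_version != previous.model_version:
        reasons.append("model version change")
    if current.workflow_hash != previous.workflow_hash:
        reasons.append("workflow modification")
    if current.rolling_accuracy < accuracy_threshold:
        reasons.append("threshold breach")
    if current.document_types - previous.document_types:
        reasons.append("new document type")
    if previous.daily_volume > 0:
        change = abs(current.daily_volume - previous.daily_volume) / previous.daily_volume
        if change > volume_change_ratio:
            reasons.append("significant volume change")
    return reasons


previous = Snapshot("model-a", "w1", 0.95, frozenset({"invoice"}), 1000)
current = Snapshot("model-b", "w1", 0.91, frozenset({"invoice", "credit_memo"}), 1100)

triggers = revalidation_triggers(previous, current, accuracy_threshold=0.92)
print(triggers)
```

Returning the reasons, not just a yes or no, gives the resulting validation report a recorded cause.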

WHAT ANNUAL REVIEW PRODUCES

A written annual validation summary for your model risk inventory. Covers: accuracy trends over the year, incidents and resolutions, version history, test set evolution, fairness metric trends, ongoing monitoring effectiveness. Designed to slot into your existing model risk reporting cycle without your team rewriting it.

AUDIT EVIDENCE

Reports your auditor can read. And your regulator.

Evaluation reports are not internal artifacts. They are external-facing documents - delivered to your team, signed by the people who approved deployment, retained per your retention policy, exportable in formats your auditor and regulator expect. We don't keep evaluation results internal. They're yours.

REPORT TYPES WE PRODUCE

• Pre-deploy evaluation report - Stage 2 gate document, signed before go-live

• Continuous monitoring summary - monthly dashboard export, automatic

• Incident reports - per drift detection event, with root cause and resolution

• Quarterly validation report - structured re-validation evidence

• Annual validation summary - for model risk inventory

• Fairness assessment report - for decisions affecting protected groups (where applicable)

• Test set evolution log - what cases were added, removed, modified, when, why

FORMATS & DESTINATIONS

• PDF - for audit binders and regulatory submissions

• Excel - for model risk inventory integration

• JSON - for GRC tool integration (LogicGate, ServiceNow GRC, Workiva)

• CSV - for ingestion into your data warehouse

• API - for real-time access by your validation tooling

• SFTP / S3 export - for automated archival to your storage
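As an illustration of the JSON and CSV formats, the sketch below serializes a small report using only the Python standard library. The schema (key names and structure) is hypothetical and does not represent the actual export format; the field values come from the sample report on this page.

```python
import csv
import io
import json

# Hypothetical report structure, for illustration only.
report = {
    "workflow": "Invoice Processing v3.1",
    "fields": [
        {"field": "Vendor name", "accuracy_pct": 96.8},
        {"field": "Invoice number", "accuracy_pct": 99.4},
        {"field": "Total amount", "accuracy_pct": 99.2},
    ],
}


def to_json(report) -> str:
    """Serialize the whole report as JSON text."""
    return json.dumps(report, indent=2, ensure_ascii=False)


def to_csv(report) -> str:
    """Serialize the per-field rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["field", "accuracy_pct"])
    writer.writeheader()
    writer.writerows(report["fields"])
    return buffer.getvalue()


json_text = to_json(report)
csv_text = to_csv(report)
print(csv_text)
```

JSON keeps the nested structure for tool integration, while CSV flattens the per-field rows for warehouse ingestion.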

ROLES

Every evaluation has a validator. Every validator has a name.

Evaluation is a process with named roles. We define them up front during scope. Without named accountable roles, evaluation reports become artifacts no one signed and no one defends. With named roles, every decision in the validation cycle has an owner.

ROLE × RESPONSIBILITY MATRIX

YOUR TEAM

Process owner (e.g. CFO, COO, Head of Ops)

Sets thresholds. Signs pre-deploy report. Named in all eval artifacts.

Risk team

Contributes to test set construction. Reviews evaluation reports. Signs incident closures.

Compliance officer (where applicable)

Involved in threshold definition for regulated workflows. Signs pre-deploy report for regulated deployments.

Model validator (independent, where required)

Conducts independent review using artifacts we provide. Signs annual validation summary.

AGENTX

Delivery lead

Runs evaluation cycles. Produces reports. Signs pre-deploy report. Named on all AgentX-side artifacts.

Evaluation platform

Automated scoring, monitoring, drift detection. LLM-judge scoring with logged reasoning.

Incident response

Holds workflow on breach. Runs re-eval. Deploys fix. Produces incident report. Notifies your team.

Named individuals, not roles

Every validation artifact identifies the specific person who reviewed and signed. Not “the risk team.” M. Torres.

Documented sign-off

Every threshold decision, every deployment approval, every incident closure - has a named signatory. Not an auto-generated approval flag.

Continuity of accountability

When a validator changes, the previous validator's sign-offs remain on record. New validators sign new artifacts. The audit trail doesn’t reset.

The other enterprise deep dives.

Evaluation is one of four enterprise pillars. It depends on security controls, feeds into governance, and is the Stage 2 gate of our delivery model.

HOW WE WORK

Our delivery model, stage by stage.

Four-stage delivery model. Evaluation is the Stage 2 gate that decides whether the workflow deploys.

AI GOVERNANCE

Evaluation evidence feeds the governance loop.

Continuous evaluation feeds drift monitoring, decision audit, and validation reporting. The governance loop runs on top of the eval engine.

SECURITY

How your data is handled during eval.

Test data isolation, audit trail on eval activity, PII handling - the security controls that evaluation operates under.

READY FOR THE VALIDATION CONVERSATION?

Bring your model risk team. We'll show you the methodology.

A 45-minute call. We'll walk through how we'd construct a test dataset from your historical cases, what accuracy thresholds your team would set, what evaluation reports look like in your hands, and how continuous validation would fit your model risk cycle. We bring sample reports. You bring your validation framework. We compare.

