
EVALUATION FRAMEWORK
Evaluation is the gate. Not a checkbox.
We don't deploy because the demo looked good. We deploy because evaluation passed - against your real historical data, against thresholds your team set, with edge cases explicitly surfaced, and with a report your model validator signed off on. After deploy, the same engine that gated the initial deployment runs continuously in production as a governance control.
TWO MODES
Pre-deploy validation. Post-deploy monitoring.
Evaluation operates in two modes. Before deploy: a structured validation cycle against historical data, with a written report and a threshold decision. After deploy: continuous monitoring of production traffic, with drift detection and automatic alerts. Both modes share the same underlying engine - same test datasets, same scoring, same audit trail, same threshold definitions.
PRE-DEPLOYMENT VALIDATION
The Stage 2 gate in our delivery model.
We build the agent workflow, connect it to your real historical data, and run it against the test dataset you helped construct. The output is a written evaluation report - per-field accuracy, edge case coverage, failure modes, tone scoring (where applicable), and recommended actions. Your team sets the threshold during Stage 1 scoping; we don't deploy until we hit it.
POST-DEPLOYMENT MONITORING
After go-live, the evaluation engine runs continuously.
Production traffic is sampled, scored against the same validated criteria, and compared to baseline. Drift detection triggers when rolling accuracy falls below defined thresholds. Incident response is automated: the workflow is held, the evaluation re-runs against an expanded test set, and validation evidence is generated for the risk team.
Drift detection feeds into the governance loop →
TEST DATA
Synthetic data finds synthetic problems. We use yours.
Evaluation only works if the test data reflects what the agent will actually encounter in production. We don't use synthetic test sets generated by an LLM. We work with your team to construct datasets from real historical cases - invoices you actually processed, conversations you actually had, documents you actually filed. Edge cases are pulled from production logs, not invented.
TEST DATASET - INVOICE PROCESSING
Constructed for: Acme Corp · Invoice Processing v3.1
Constructed by: M. Torres (Acme Risk) + AgentX delivery
Approved by: S. Kim (CFO) · Apr 8, 2026
DATASET COMPOSITION
Total test cases: 142
Source: Historical invoices Jan–Mar 2026
Selection method: Stratified random + edge case picks
Stratification: By vendor type, amount tier, doc format
BREAKDOWN BY SCENARIO
PO-matched invoices: 98
Non-PO invoices: 31
Multilingual invoices (Spanish, Polish, German): 13
EDGE CASES INCLUDED
Multi-page invoices with attachments: 14
Foreign currency with FX conversion required: 8
Credit memos vs standard invoices: 7
Vendor name variations (DBA, Inc, LLC differences): 12
Amounts requiring manager approval (>$10K): 18
PII HANDLING
✓ All PII fields redacted before use in evaluation
✓ Test set stored within Acme workspace boundary
✓ No customer data leaves Acme environment during eval
✓ LLM judge runs against PII-redacted text only
✓ Test cases drawn from real historical data - not synthetic
✓ Edge cases sourced from production logs and risk team review
✓ Stratified sampling ensures coverage across scenarios, not just easy cases
✓ PII handling documented and enforced before any data is used in evaluation
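As a rough illustration of the "stratified random + edge case picks" selection method described above, the sketch below draws a fixed number of cases from every stratum and then adds hand-picked edge cases by ID. The record keys (`id`, `vendor_type`, `amount_tier`, `doc_format`) and the helper `build_test_set` are assumptions made for this example, not the platform's actual schema or API.

```python
import random
from collections import defaultdict

def build_test_set(cases, per_stratum, edge_case_ids, seed=0):
    """Stratified random sample plus explicit edge case picks.

    `cases` is a list of dicts with hypothetical keys:
    id, vendor_type, amount_tier, doc_format.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    strata = defaultdict(list)
    for case in cases:
        key = (case["vendor_type"], case["amount_tier"], case["doc_format"])
        strata[key].append(case)

    selected = {}
    for key in sorted(strata):
        members = strata[key]
        # Take up to `per_stratum` cases from every stratum, so rare
        # combinations are represented alongside the common ones.
        for case in rng.sample(members, min(per_stratum, len(members))):
            selected[case["id"]] = case

    # Edge cases are added by ID regardless of what the random draw picked.
    by_id = {case["id"]: case for case in cases}
    for case_id in edge_case_ids:
        selected[case_id] = by_id[case_id]

    return list(selected.values())

# Illustrative data only: three strata of very different sizes.
cases = [
    {"id": i, "vendor_type": vt, "amount_tier": tier, "doc_format": fmt}
    for i, (vt, tier, fmt) in enumerate(
        [("domestic", "low", "pdf")] * 20
        + [("domestic", "high", "pdf")] * 5
        + [("foreign", "low", "scan")] * 3
    )
]
test_set = build_test_set(cases, per_stratum=4, edge_case_ids=[27])
print(len(test_set), "cases selected")
```

Sampling per stratum rather than over the whole pool is what keeps the small "foreign / scan" group from being crowded out by the 20 routine cases.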
MEASUREMENT
Per-field accuracy. Per-document-type. Per-scenario.
Aggregate scores hide failures. A 95% overall accuracy can include 99% on the easy cases and 60% on the cases that matter. We report accuracy at the level of detail your risk team needs - per field, per document type, per workflow path, per protected attribute when relevant for fairness assessment.
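To make the arithmetic concrete: if roughly 10% of cases are the hard ones, 99% on the easy cases and 60% on the hard ones still averages out to about 95%. The short sketch below shows this with illustrative counts; the segment labels and the helper `accuracy_by_segment` are hypothetical and only demonstrate why a segmented breakdown is reported.

```python
from collections import defaultdict

def accuracy_by_segment(results):
    """results: iterable of (segment, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [correct, total]
    for segment, is_correct in results:
        totals[segment][0] += int(is_correct)
        totals[segment][1] += 1
    return {seg: correct / total for seg, (correct, total) in totals.items()}

# Illustrative counts: 900 easy cases at 99%, 100 hard cases at 60%.
results = (
    [("easy", True)] * 891 + [("easy", False)] * 9
    + [("hard", True)] * 60 + [("hard", False)] * 40
)
overall = sum(ok for _, ok in results) / len(results)
per_segment = accuracy_by_segment(results)
print(f"overall={overall:.1%}", {k: f"{v:.1%}" for k, v in per_segment.items()})
```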
EVALUATION METHODOLOGY - INVOICE PROCESSING v3.1
ACCURACY MEASUREMENT - PER FIELD
✓ Vendor name: ≥ 95% required
✓ Invoice number: ≥ 99% required
✓ Invoice date: ≥ 98% required
✓ Total amount: ≥ 99% required
✓ Line item descriptions: ≥ 90% required
Aggregate (weighted by field criticality): ≥ 92% required
WORKFLOW PATH ACCURACY
✓ Auto-approve decision correct: ≥ 95% required
✓ Escalation triggered correctly: 100% required (HITL)
✓ Rejection routing correct: ≥ 95% required
FAIRNESS METRICS (where applicable)
✓ Disparate impact ratio measured per protected attribute
✓ Statistical parity difference reported
✓ Threshold of acceptable disparity defined by your team
QUALITATIVE SCORING (customer-facing agents only)
Tone appropriateness: LLM-judge scored
Response completeness: LLM-judge scored
Escalation handling: LLM-judge scored
REPORTING DEPTH
Every failed test case includes: input shown to agent, agent's actual output, expected output, diagnostic reasoning from LLM-judge, recommended fix.
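One plausible reading of "aggregate (weighted by field criticality)" is a weighted mean of per-field accuracy. The sketch below assumes that reading; the weights and scores are hypothetical values chosen only to show the arithmetic, and the actual weighting scheme is whatever your team defines.

```python
def weighted_aggregate(field_accuracy, criticality):
    """Weighted mean of per-field accuracy; weights need not sum to 1."""
    total_weight = sum(criticality[f] for f in field_accuracy)
    weighted_sum = sum(field_accuracy[f] * criticality[f] for f in field_accuracy)
    return weighted_sum / total_weight

# Hypothetical weights and scores, for illustration only.
criticality = {"vendor_name": 2, "invoice_number": 3, "total_amount": 4, "line_items": 1}
field_accuracy = {
    "vendor_name": 0.96,
    "invoice_number": 0.99,
    "total_amount": 0.99,
    "line_items": 0.80,
}

unweighted = sum(field_accuracy.values()) / len(field_accuracy)
weighted = weighted_aggregate(field_accuracy, criticality)
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
```

Because a weighted score can rise while a low-weight field stays weak, the aggregate is reported next to the per-field thresholds rather than in place of them.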

Per-field accuracy
Aggregate scores hide what matters. We score every field separately and weight by your definition of criticality.

Per-workflow-path accuracy
Was the right decision made (auto-approve, escalate, route, reject) - independent of whether the data extraction was perfect.

Fairness metrics where applicable
For decisions affecting protected groups, disparate impact and statistical parity measured per attribute.

Qualitative scoring for customer-facing agents
Tone, completeness, escalation handling - scored by LLM-judge with explicit reasoning.
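For readers less familiar with the two fairness metrics named above, here is a minimal sketch using their common textbook definitions: the disparate impact ratio is a group's favorable-outcome rate divided by a reference group's rate, and the statistical parity difference is the same two rates subtracted. The group labels, data, and function names are illustrative assumptions; which outcome counts as "favorable" and what disparity is acceptable are defined by your team.

```python
def selection_rates(outcomes):
    """outcomes: dict of group -> list of 0/1 favorable-decision flags."""
    return {group: sum(flags) / len(flags) for group, flags in outcomes.items()}

def fairness_metrics(outcomes, reference_group):
    """Compare every group to the reference group.

    Assumes the reference group's favorable rate is non-zero.
    """
    rates = selection_rates(outcomes)
    ref = rates[reference_group]
    return {
        group: {
            "disparate_impact_ratio": rate / ref,
            "statistical_parity_difference": rate - ref,
        }
        for group, rate in rates.items()
        if group != reference_group
    }

# Illustrative data: 1 = favorable decision, 0 = unfavorable decision.
outcomes = {
    "group_a": [1] * 80 + [0] * 20,  # 80% favorable
    "group_b": [1] * 60 + [0] * 40,  # 60% favorable
}
metrics = fairness_metrics(outcomes, reference_group="group_a")
print(metrics["group_b"])
```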
PRE-DEPLOY GATE
Your team sets the threshold. We don't deploy until we hit it.
PRE-DEPLOY EVALUATION REPORT - INVOICE PROCESSING v3.1
Evaluated: Apr 5–7, 2026 | Workflow target go-live: Apr 9, 2026
OVERALL OUTCOME
Aggregate accuracy: 94.2% vs target 92.0% ✓
Edge case coverage: 19/24 ≥ 85%; 5/24 routed to HITL ✓
Workflow path accuracy: 97.8% vs target 95.0% ✓
Fairness (vendor type): No statistically significant disparity
DETAIL BY FIELD
Vendor name: 96.8% ✓
Invoice number: 99.4% ✓
Total amount: 99.2% ✓
Multilingual invoices (below 85%): 79.1% ⚠ routes to human
EDGE CASES SURFACED
✓ 19 edge cases handled at ≥ 85% accuracy
⚠ 5 edge cases below threshold: handwritten annotations (79%), vendor DBA variations (78%), multilingual (79%), currency edge cases (82%), damaged scans (74%) - all route to HITL
ROUTING TO HUMAN REVIEW
Estimated production HITL routing rate: 8.4% (acceptable ≤ 12%) ✓
GATE DECISION - REVIEWER SIGNATURES REQUIRED
☐ Process Owner: S. Kim (CFO)
_____________________
☐ Risk Team: M. Torres
_____________________
☐ Compliance: R. Chen (where applicable)
_____________________
☐ AgentX: delivery lead
_____________________
What this document is
A formal validation artifact. Reviewed by your team. Signed before deployment. Retained as audit evidence for the lifecycle of the agent.
What this document isn't
A sales document. A demo result. A summary of what we hope the agent does. Every number is traceable to the underlying test case.
What happens if we don't pass
We iterate - modify the workflow, expand training inputs, narrow the scope, or recommend that the workflow not deploy. Your call. The report stays honest about what works and what doesn't.
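The numeric part of the gate can be sketched as a list of criteria, each with a measured value, a threshold, and a direction. The figures below are taken from the sample report above; the `Criterion` class and `gate_decision` function are hypothetical names for this illustration. A passing result here only means the numbers clear the thresholds; the reviewer signatures remain a separate, human step.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str
    measured: float
    threshold: float
    higher_is_better: bool = True

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.measured >= self.threshold
        return self.measured <= self.threshold

def gate_decision(criteria):
    """Return (all_passed, names_of_failed_criteria)."""
    failed = [c.name for c in criteria if not c.passed()]
    return (not failed, failed)

# Percentages from the sample pre-deploy report and methodology.
criteria = [
    Criterion("aggregate_accuracy", 94.2, 92.0),
    Criterion("workflow_path_accuracy", 97.8, 95.0),
    Criterion("vendor_name", 96.8, 95.0),
    Criterion("invoice_number", 99.4, 99.0),
    Criterion("total_amount", 99.2, 99.0),
    # Routing rate is a ceiling, not a floor: lower is better.
    Criterion("hitl_routing_rate", 8.4, 12.0, higher_is_better=False),
]
ok, failed = gate_decision(criteria)
print("PASS" if ok else f"FAIL: {failed}")
```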
POST-DEPLOY
The eval engine doesn't stop at go-live.
Pre-deploy validation answers “does this work today on representative data.” It does not answer “will this still work in 90 days.” The platform answers the second question by running evaluation continuously in production - sampling traffic, scoring outputs, tracking accuracy trends, alerting when drift crosses defined thresholds.
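A minimal sketch of threshold-based drift detection on a rolling window is shown below. It is one simple way to implement the idea, not a description of the platform's internals: the window size of 100 and the class name `RollingAccuracyMonitor` are assumptions, and the 92% threshold simply mirrors the sample dashboard in this section.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Tracks accuracy over the last `window` scored samples."""

    def __init__(self, window, threshold):
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically
        self.threshold = threshold

    def record(self, is_correct):
        """Add one scored sample; return True if drift should be flagged."""
        self.samples.append(bool(is_correct))
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data for a stable estimate yet
        return self.accuracy() < self.threshold

    def accuracy(self):
        return sum(self.samples) / len(self.samples)

monitor = RollingAccuracyMonitor(window=100, threshold=0.92)
# 100 samples at 95% accuracy, then a short run of errors drags the window down.
stream = [True] * 95 + [False] * 5 + [False] * 4
alerts = [monitor.record(s) for s in stream]
print("first alert at sample", alerts.index(True) + 1)
print("rolling accuracy:", monitor.accuracy())
```

In this toy stream the alert fires on the sample that takes the window from 92% to 91%, which is the point where a hold and re-evaluation would be triggered.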
PRODUCTION VALIDATION - INVOICE PROCESSING v3.1
Workspace: Acme Corp | Tracking window: Last 90 days
VALIDATION ACTIVITY
Continuous sample evaluation: 500 cases / day
Scheduled full re-eval: Weekly (Mondays 02:00 UTC)
Triggered re-eval (this month): 2 events
INCIDENT - APR 18 (INC-2026-04-18-001)
Trigger: Rolling accuracy dropped to 91% (threshold 92%)
Detection: 14 minutes after threshold breach
Root cause: LLM provider model update changed behavior on edge cases
Action: Workflow held. Re-eval in 47 min on extended test set. v3.2 deployed Apr 19 (95.4%)
Customer impact: None (HITL caught all flagged cases during hold)
Validator: M. Torres (Acme Risk) · Closure: Apr 19, 2026
VERSION HISTORY
v3.0 - Initial deployment - Apr 9, 2026 - Validated by S. Kim
v3.1 - Tax handling update - Apr 14, 2026 - Validated by M. Torres
v3.2 - Drift correction - Apr 19, 2026 - Validated by M. Torres
✓ Continuous validation against production traffic (configurable sample rate)
✓ Scheduled full re-evaluation against test set (default: weekly)
✓ Triggered re-eval on threshold breach, model version change, or workflow modification
✓ Every incident logged with root cause, action, validation evidence, validator identity
Drift monitoring feeds into governance reporting →
VALIDATION CYCLES
Quarterly by default. Triggered on change.
Continuous monitoring catches accuracy drift in real time. Validation cycles are different - periodic structured re-validation that produces written evidence for your risk team and your auditors. Quarterly by default. Triggered on any meaningful change.
SCHEDULED VALIDATION
Predictable. Calendar-driven.
• Quarterly full re-validation against current test set
• Annual test set review - does the test set still reflect production reality?
• Annual fairness re-assessment for decisions affecting protected groups
• Annual independent validation support - we provide artifacts; your model risk team conducts independent review
TRIGGERED VALIDATION
Reactive. Change-driven.
• LLM provider model version change
• Any workflow modification (rules, integrations, sub-agents)
• Threshold breach in continuous monitoring
• New document type or input pattern detected
• Vendor / system / regulatory change requested by your team
• Significant volume change that may stress the workflow
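The trigger list above can be read as a set of checks between two snapshots of a workflow's state. The sketch below assumes a hypothetical snapshot format (keys such as `model_version`, `workflow_hash`, `doc_types`) and treats a doubling of daily volume as "significant"; both are illustrative choices, not documented platform behavior.

```python
from enum import Enum

class Trigger(Enum):
    MODEL_VERSION_CHANGE = "LLM provider model version change"
    WORKFLOW_MODIFICATION = "workflow modification"
    THRESHOLD_BREACH = "threshold breach in continuous monitoring"
    NEW_INPUT_PATTERN = "new document type or input pattern"
    REQUESTED_CHANGE = "vendor / system / regulatory change requested"
    VOLUME_CHANGE = "significant volume change"

def detect_triggers(previous, current, volume_change_ratio=2.0):
    """Compare two hypothetical state snapshots; return the triggers that fired."""
    fired = []
    if current["model_version"] != previous["model_version"]:
        fired.append(Trigger.MODEL_VERSION_CHANGE)
    if current["workflow_hash"] != previous["workflow_hash"]:
        fired.append(Trigger.WORKFLOW_MODIFICATION)
    if current["rolling_accuracy"] < current["accuracy_threshold"]:
        fired.append(Trigger.THRESHOLD_BREACH)
    if set(current["doc_types"]) - set(previous["doc_types"]):
        fired.append(Trigger.NEW_INPUT_PATTERN)
    if current.get("change_requested", False):
        fired.append(Trigger.REQUESTED_CHANGE)  # raised manually by your team
    if current["daily_volume"] >= volume_change_ratio * previous["daily_volume"]:
        fired.append(Trigger.VOLUME_CHANGE)
    return fired

# Illustrative snapshots: the provider model changed and accuracy dipped.
previous = {
    "model_version": "model-a", "workflow_hash": "abc",
    "doc_types": ["invoice"], "daily_volume": 500,
}
current = {
    "model_version": "model-b", "workflow_hash": "abc",
    "doc_types": ["invoice"], "daily_volume": 520,
    "rolling_accuracy": 0.91, "accuracy_threshold": 0.92,
}
fired = detect_triggers(previous, current)
print([t.name for t in fired])
```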
WHAT ANNUAL REVIEW PRODUCES
A written annual validation summary for your model risk inventory. Covers: accuracy trends over the year, incidents and resolutions, version history, test set evolution, fairness metric trends, ongoing monitoring effectiveness. Designed to slot into your existing model risk reporting cycle without your team rewriting it.
AUDIT EVIDENCE
Reports your auditor can read. And your regulator.
Evaluation reports are not internal artifacts. They are external-facing documents - delivered to your team, signed by the people who approved deployment, retained per your retention policy, exportable in formats your auditor and regulator expect. We don't keep evaluation results internal. They're yours.
REPORT TYPES WE PRODUCE
• Pre-deploy evaluation report - Stage 2 gate document, signed before go-live
• Continuous monitoring summary - monthly dashboard export, automatic
• Incident reports - per drift detection event, with root cause and resolution
• Quarterly validation report - structured re-validation evidence
• Annual validation summary - for model risk inventory
• Fairness assessment report - for decisions affecting protected groups (where applicable)
• Test set evolution log - what cases were added, removed, modified, when, why
FORMATS & DESTINATIONS
• PDF - for audit binders and regulatory submissions
• Excel - for model risk inventory integration
• JSON - for GRC tool integration (LogicGate, ServiceNow GRC, Workiva)
• CSV - for ingestion into your data warehouse
• API - for real-time access by your validation tooling
• SFTP / S3 export - for automated archival to your storage
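To give a feel for what a machine-readable export might look like, the sketch below serializes one incident record to JSON and CSV using only the Python standard library. The values mirror the sample incident shown earlier; the key names are a hypothetical schema invented for this example, not a documented export format.

```python
import csv
import io
import json

# Values from the sample incident; key names are illustrative assumptions.
incident = {
    "incident_id": "INC-2026-04-18-001",
    "workflow": "Invoice Processing v3.1",
    "trigger": "Rolling accuracy dropped to 91% (threshold 92%)",
    "root_cause": "LLM provider model update changed behavior on edge cases",
    "customer_impact": "None",
    "validator": "M. Torres (Acme Risk)",
    "closed_on": "2026-04-19",
}

def to_json(record):
    """Serialize one record for a GRC tool or API consumer."""
    return json.dumps(record, indent=2, sort_keys=True)

def to_csv(records):
    """Serialize a list of records with a header row for warehouse ingestion."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

print(to_json(incident))
print(to_csv([incident]))
```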
ROLES
Every evaluation has a validator. Every validator has a name.
Evaluation is a process with named roles. We define them up front during scope. Without named accountable roles, evaluation reports become artifacts no one signed and no one defends. With named roles, every decision in the validation cycle has an owner.
Named individuals, not just roles
Every validation artifact identifies the specific person who reviewed and signed. Not “the risk team.” M. Torres.
Documented sign-off
Every threshold decision, every deployment approval, every incident closure - has a named signatory. Not an auto-generated approval flag.
Continuity of accountability
When a validator changes, the previous validator's sign-offs remain on record. New validators sign new artifacts. The audit trail doesn’t reset.
The other enterprise deep dives.
Evaluation is one of four enterprise pillars. It depends on security controls, feeds into governance, and is the Stage 2 gate of our delivery model.
HOW WE WORK
Our delivery model, stage by stage.
Four-stage delivery model. Evaluation is the Stage 2 gate that decides whether the workflow deploys.
AI GOVERNANCE
Evaluation evidence feeds the governance loop.
Continuous evaluation feeds drift monitoring, decision audit, and validation reporting. The governance loop runs on top of the eval engine.
SECURITY
How your data is handled during eval.
Test data isolation, audit trail on eval activity, PII handling - the security controls that evaluation operates under.
READY FOR THE VALIDATION CONVERSATION?
Bring your model risk team. We'll show you the methodology.
A 45-minute call. We'll walk through how we'd construct a test dataset from your historical cases, what accuracy thresholds your team would set, what evaluation reports look like in your hands, and how continuous validation would fit your model risk cycle. We bring sample reports. You bring your validation framework. We compare.
