Evaluation-Driven Development: The Missing Operations Layer for AI Agents

Thiago Victorino

The data scientist who trained models is dead. Koren Gast, a data scientist at monday.com, published the obituary last week. After building Monday Magic and Monday Vibe, two agent-powered products, she concluded that model.fit() is no longer the job. The job is building evaluation infrastructure.

She is right. But the implication is larger than she frames it.

Gast describes a new role: methodologist for evaluation frameworks, error taxonomies, and context engineering. Her team discovered that 40% of failures in their text-to-app agent were tool parameter generation errors. Not hallucinations. Not reasoning failures. The agent selected the right tool and then passed it the wrong parameters. A prompt change improved tool selection accuracy by 12% while causing a 4% regression in schema adherence.

That tradeoff is the story. You cannot optimize one dimension of agent behavior without degrading another. And you cannot detect these regressions without evaluation infrastructure that measures multiple dimensions simultaneously.

The Consistency Problem

Two days after Gast’s post, ServiceNow published EVA, the first end-to-end evaluation framework for voice agents. Eight authors. Twenty systems benchmarked. Fifty airline scenarios (rebooking, cancellation, vouchers, standby). Fifteen tools in the evaluation toolkit. Three trials per scenario.

The framework measures six dimensions: Task Completion (deterministic scoring), Faithfulness (LLM-as-Judge), Speech Fidelity (LALM-as-Judge), Conciseness, Conversation Progression, and Turn-Taking. Two headline metrics: EVA-A for accuracy, EVA-X for experience.
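Rolling per-dimension scores into two headline numbers might look like the sketch below. The grouping of dimensions under each headline metric and the equal weighting are illustrative assumptions, not EVA's published formulas:

```python
# Hypothetical sketch: aggregating six dimension scores (each 0-1) into an
# accuracy-style and an experience-style headline metric. The grouping and
# equal weights are assumptions for illustration, not EVA's definitions.
from statistics import mean

ACCURACY_DIMS = ["task_completion", "faithfulness"]
EXPERIENCE_DIMS = ["speech_fidelity", "conciseness",
                   "conversation_progression", "turn_taking"]

def headline_scores(dims: dict) -> tuple:
    """Return (accuracy-style, experience-style) aggregates."""
    eva_a = mean(dims[d] for d in ACCURACY_DIMS)
    eva_x = mean(dims[d] for d in EXPERIENCE_DIMS)
    return eva_a, eva_x

scores = {"task_completion": 0.9, "faithfulness": 0.8,
          "speech_fidelity": 0.7, "conciseness": 0.5,
          "conversation_progression": 0.6, "turn_taking": 0.8}
a, x = headline_scores(scores)  # a = 0.85, x = 0.65
```

The point of two aggregates rather than one is that they can move in opposite directions, which is exactly the accuracy-versus-experience tradeoff discussed below.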

Three findings deserve attention.

First, agents that complete tasks well deliver worse user experiences. EVA found a measurable tradeoff between accuracy and experience scores. An agent that doggedly pursues task completion becomes verbose, repetitive, and unpleasant. This mirrors Gast’s observation about the 12%/4% tradeoff. Improvement in one dimension causes regression in another.

Second, the gap between “can do” and “does do” is enormous. EVA uses pass@k (succeeded at least once in k trials) and pass^k (succeeded every time in k trials). The delta between pass@3 and pass^3 was substantial across all systems tested. Agents that can complete a task often fail to complete it consistently. Capability and reliability are different metrics requiring different evaluation infrastructure.
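The two metrics are cheap to compute once you run repeated trials. A minimal Python sketch, with made-up trial outcomes:

```python
# Sketch: distinguishing capability (pass@k) from reliability (pass^k)
# over repeated trials of the same scenario. Trial outcomes are illustrative.

def pass_at_k(trials: list) -> bool:
    """Succeeded at least once across k trials."""
    return any(trials)

def pass_hat_k(trials: list) -> bool:
    """Succeeded in every one of the k trials."""
    return all(trials)

# Three trials per scenario, as in EVA; these outcomes are made up.
scenarios = {
    "rebooking":    [True, True, False],
    "cancellation": [True, True, True],
    "voucher":      [True, False, False],
}

capability  = sum(pass_at_k(t) for t in scenarios.values()) / len(scenarios)
reliability = sum(pass_hat_k(t) for t in scenarios.values()) / len(scenarios)
# capability = 1.0: every scenario succeeded at least once.
# reliability ~ 0.33: only one scenario succeeded all three times.
```

The gap between the two numbers is the consistency gap: a demo shows you `capability`, production exposes you to `reliability`.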

Third, named entity transcription is the dominant failure mode for voice agents. A single misheard character in a booking reference cascades through the entire interaction. The error is not in reasoning. It is in input fidelity. No amount of prompt engineering fixes a transcription error.

Evaluation as Operations Infrastructure

As we covered in The Governance Loop Hidden in Your Agent Monitoring, Chase’s production improvement loop is a governance framework in disguise. Annotation queues are compliance review. Evaluation rubrics are control policies. Sampling strategies are risk-based oversight.

Gast and the EVA team extend this thesis. Evaluation is not a testing phase that happens before deployment. It is a continuous operations discipline that runs alongside production.

Gast describes the shift explicitly: data scientists moved from Python and Jupyter notebooks to TypeScript codebases and CI/CD integration. Evaluation pipelines live in the same infrastructure as the agents they evaluate. Golden datasets, calibrated evaluators (measured by Cohen’s Kappa for inter-annotator agreement), and automated guardrails are not research tools. They are production infrastructure.

Hamel Husain, whom Gast quotes, captures it cleanly: “Teams that succeed barely talk about tools. They obsess over measurement and iteration.” And more directly: “Evals are just data science applied to AI.”

The Multi-Dimensional Trap

Here is where most organizations fail. They build one-dimensional evaluation: task completion rate, or user satisfaction score, or latency. Then they optimize against that single metric and wonder why production quality degrades.

Monday.com’s 40% tool parameter failure rate was invisible until they built error taxonomies that classified failures by type. A single “agent accuracy” metric would have masked the pattern entirely. You need to know that the agent selected the right tool but passed the wrong parameters, because tool selection errors and parameter generation errors call for completely different fixes.

EVA’s six-dimensional framework makes this explicit by design. You cannot evaluate a voice agent on task completion alone because an agent that completes every task but takes fifteen turns to do it is unusable. You cannot evaluate on conciseness alone because a terse agent that fails half its tasks is worthless. The dimensions interact. The evaluation must capture the interactions.

This is not optional complexity. It is the minimum viable evaluation for production agent systems. As we explored in Your Agent Already Knows What’s Wrong, Factory’s monitoring system processes 1,946 sessions daily and auto-resolves 73% of issues. That resolution rate depends on granular failure classification. A system that only knows “session failed” cannot self-heal.

What Evaluation-Driven Development Actually Requires

Gast’s team arrived at a specific operational stack. It is worth examining because it represents what a mature evaluation practice looks like.

Error taxonomies. Not error logs. Taxonomies. Structured classification of failure modes that enable pattern detection. Monday.com’s taxonomy revealed the 40% tool parameter problem. Without it, those failures were noise in an aggregate error rate.
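A taxonomy can start as nothing more than a set of named failure modes and a counter. The categories and sample data below are hypothetical, loosely modeled on the failure types discussed in this piece:

```python
# Sketch of a minimal error taxonomy: each failure is classified into a
# named mode so patterns surface. Categories and data are hypothetical.
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    TOOL_SELECTION = "wrong tool chosen"
    TOOL_PARAMETERS = "right tool, wrong parameters"
    SCHEMA_ADHERENCE = "output violates schema"
    MISSING_FIELD = "required field omitted"

# Each failed session gets a classified mode, not just an error flag.
failures = [FailureMode.TOOL_PARAMETERS, FailureMode.TOOL_PARAMETERS,
            FailureMode.TOOL_SELECTION, FailureMode.SCHEMA_ADHERENCE,
            FailureMode.TOOL_PARAMETERS]

by_mode = Counter(failures)
dominant, count = by_mode.most_common(1)[0]
share = count / len(failures)  # 0.6: parameter errors dominate this sample
```

An aggregate error rate over the same five failures would report "60% of errors" as undifferentiated noise; the taxonomy reports "60% are parameter generation errors," which is actionable.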

Multi-dimensional metrics. EVA measures six dimensions. Monday.com tracks tool selection accuracy, schema adherence, and field completeness independently. The minimum for any production agent is three orthogonal metrics: task completion, output quality, and operational efficiency.

Regression detection across dimensions. The 12% improvement / 4% regression pattern is the norm, not the exception. Every change to an agent system shifts multiple metrics. You need automated detection of cross-dimensional regressions, or you will ship improvements that are net negative.
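A minimal gate for this is a baseline comparison across every tracked metric, not just the one being optimized. The metric names, values, and tolerance below are assumptions for illustration:

```python
# Sketch: block a change when any metric regresses past a tolerance,
# even if the headline metric improved. All values are illustrative.

def regressions(baseline: dict, candidate: dict,
                tolerance: float = 0.01) -> dict:
    """Return metrics that dropped by more than `tolerance` (absolute)."""
    return {name: candidate[name] - baseline[name]
            for name in baseline
            if candidate[name] < baseline[name] - tolerance}

baseline  = {"tool_selection": 0.80, "schema_adherence": 0.95,
             "field_completeness": 0.97}
candidate = {"tool_selection": 0.92, "schema_adherence": 0.91,
             "field_completeness": 0.97}

bad = regressions(baseline, candidate)
# {'schema_adherence': ~-0.04}: the 12-point tool-selection win would
# otherwise mask a 4-point schema regression.
ship_it = not bad  # False: the change is gated until the regression is fixed
```

A single-metric gate would have approved this change on the tool-selection improvement alone.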

Calibrated evaluators. LLM-as-judge evaluation works, as we covered in the governance loop analysis, achieving roughly 85% alignment with human judgment. But the judge needs calibration. Cohen’s Kappa measures whether your evaluators (human or automated) agree with each other. Without calibration, your evaluation infrastructure produces confident but unreliable scores.
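Cohen's Kappa corrects raw agreement for the agreement two raters would reach by chance. A self-contained sketch on binary pass/fail labels (the labels themselves are illustrative; in practice they come from annotation queues):

```python
# Sketch: Cohen's Kappa between two raters, e.g. a human annotator and an
# LLM judge. Labels are illustrative.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = Counter(a), Counter(b)
    # Chance agreement: probability both raters emit the same label at random.
    expected = sum((pa[l] / n) * (pb[l] / n) for l in set(a) | set(b))
    return (observed - expected) / (1 - expected)

human = ["pass", "pass", "fail", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass"]
kappa = cohens_kappa(human, judge)  # ~0.67: substantial but imperfect agreement
```

Raw agreement here is 5/6, which sounds strong; Kappa discounts the half of that agreement chance alone would produce, which is why it is the better calibration metric.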

Golden datasets. Curated sets of inputs with known-good outputs. Not synthetic data. Real production cases, annotated by domain experts, versioned alongside the agent code. When a prompt change improves tool selection by 12%, you need a golden dataset to detect the 4% schema adherence regression before it reaches production.
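In practice this can be a versioned file of cases asserted against in CI. Everything in the sketch below (the cases, the agent stub, the threshold) is a hypothetical placeholder:

```python
# Sketch: a golden dataset as versioned test cases run in CI.
# Cases, agent stub, and threshold are hypothetical placeholders.
import json

GOLDEN = json.loads("""[
  {"input": "create a task tracker app", "expected_tool": "generate_app"},
  {"input": "add a status column",       "expected_tool": "add_column"}
]""")

def agent_tool_choice(user_input: str) -> str:
    # Placeholder for the real agent under test.
    return "generate_app" if "app" in user_input else "add_column"

def golden_pass_rate(cases: list) -> float:
    hits = sum(agent_tool_choice(c["input"]) == c["expected_tool"]
               for c in cases)
    return hits / len(cases)

rate = golden_pass_rate(GOLDEN)
assert rate >= 0.95, f"Golden dataset regression: pass rate {rate:.2f}"
```

Because the cases are versioned alongside the agent code, a prompt change that breaks schema adherence fails this check in the same pipeline run that reports the tool-selection improvement.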

The Organizational Shift

Here is the uncomfortable part. Evaluation-Driven Development is not a technical practice. It is an organizational one.

Gast describes data scientists moving into TypeScript codebases. That is not a language preference. It is an organizational boundary dissolving. When evaluation lives in Jupyter notebooks owned by a research team, it is a checkpoint. When evaluation lives in CI/CD pipelines owned by the engineering organization, it is infrastructure.

The same shift applies to the metrics. Monday.com’s 3% rate of required-field omissions in structured outputs is a product quality metric, not a model quality metric. It belongs in the product team’s dashboard, not the ML team’s research log.

ServiceNow’s EVA framework makes this even more explicit. Evaluating voice agents requires linguistics expertise (Speech Fidelity), UX expertise (Conversation Progression, Turn-Taking), domain expertise (Task Completion, Faithfulness), and systems expertise (consistency measurement via pass@k vs pass^k). No single team owns all of those disciplines.

Evaluation-Driven Development requires cross-functional ownership of evaluation infrastructure. Engineering builds the pipelines. Product defines the metrics. Domain experts curate the golden datasets. Operations monitors the regressions. This is the same cross-functional structure we identified in Chase’s governance loop, applied specifically to evaluation.

The Cost of Not Building This

Organizations that skip evaluation infrastructure do not save money. They defer the cost to production incidents.

Monday.com’s 40% tool parameter error rate existed before they had the taxonomy to see it. Those errors were reaching users. The evaluation infrastructure did not create the problem. It revealed the problem that was already costing them.

EVA’s consistency gap (pass@3 vs pass^3) means that demo-quality agents, the ones that work when you test them, fail unpredictably in production. Without consistency measurement, you ship agents that succeed in testing and fail in deployment. The customer discovers the reliability problem. You discover it in the support queue.

The 4% schema adherence regression from monday.com’s prompt change would have shipped to production without multi-dimensional evaluation catching it. Four percent sounds small. Across millions of interactions, it is thousands of broken outputs per day.

Where This Converges

Monitoring tells you what happened. Governance tells you what should happen. Evaluation tells you whether what happened matches what should have happened.

These three disciplines are converging into a single operations layer. Chase’s production improvement loop is the governance skeleton. Factory’s self-healing monitoring is the automation layer. Gast’s evaluation methodology and ServiceNow’s EVA framework are the measurement discipline.

The organizations that build all three as integrated infrastructure will run agents that improve over time. The organizations that treat evaluation as a pre-deployment testing phase will run agents that degrade over time, because production conditions drift and evaluation that does not run continuously cannot detect the drift.

The tools exist. The frameworks exist. The question is whether evaluation gets the same organizational investment as the agents it measures.


This analysis synthesizes monday.com’s The Death of Model Fit (March 2026) and ServiceNow AI’s EVA: A Framework for Evaluating Voice Agents (March 2026).


All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.
