Your AI Agent's Judgment Is Unmeasured. A PM Rubric Shows What to Fix.
Jeff Gothelf stood at ScanAgile 2026 in Helsinki and got asked a question he couldn’t answer: how do you quantify judgment?
He’s the co-author of Lean UX (O’Reilly) and Sense & Respond (HBR Press). He has spent a decade teaching product managers to think clearly about what to build and why. But nobody had asked him to make judgment visible before. Not as a skill. As a score.
He developed the answer on the flight home. A four-dimension rubric, each scored 1 to 3, maximum 12 points. It was designed for product managers. It is actually a governance instrument for any decision-maker. Including ones that run on GPUs.
The Rubric
Gothelf’s framework measures four things. Each maps to a question that every decision should be able to answer.
Customer Evidence (1-3): What grounds this decision?
A score of 1 means the decision rests on assumption or internal opinion. A 2 means secondhand data or analytics. A 3 means direct customer signals: interviews, research, observation. The scoring is simple. The standard is high.
Outcome Clarity (1-3): How specific is the intended change?
“Improve UX” scores a 1. It says nothing measurable. “Reduce checkout friction” scores a 2. Directional, not precise. “Customer completes checkout without contacting support” scores a 3. That is a behavior you can observe and count.
Trade-off Reasoning (1-3): How justified is this choice over alternatives?
“We should do this” is a 1. No reasoning visible. “Instead of X” is a 2. Comparison exists but lacks justification. “We chose this over X because of Y customer insight” is a 3. The decision trail is explicit.
Impact Estimation (1-3): Does cost-and-return thinking exist?
No cost or ROI thinking is a 1. A rough estimate without ROI is a 2. Clear cost, expected impact, and ROI compared against alternatives is a 3.
Twelve points total. Gothelf’s argument is that any PM scoring below 9 has a judgment problem they can now name and fix. The rubric turns a vague competency into a visible, teachable behavior.
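The rubric is simple enough to encode directly. Here is a minimal sketch in Python, assuming a flat four-field record; the field names and structure are ours, not Gothelf's (he published a rubric, not a schema):

```python
from dataclasses import dataclass

# The four dimensions, each scored 1-3. Names are illustrative.
DIMENSIONS = ("customer_evidence", "outcome_clarity",
              "trade_off_reasoning", "impact_estimation")

@dataclass
class JudgmentScore:
    customer_evidence: int    # 1 assumption, 2 secondhand data, 3 direct signals
    outcome_clarity: int      # 1 vague, 2 directional, 3 observable behavior
    trade_off_reasoning: int  # 1 no alternatives, 2 comparison, 3 justified choice
    impact_estimation: int    # 1 no cost thinking, 2 rough estimate, 3 ROI vs. alternatives

    def total(self) -> int:
        scores = [getattr(self, d) for d in DIMENSIONS]
        if any(s < 1 or s > 3 for s in scores):
            raise ValueError("each dimension is scored 1-3")
        return sum(scores)  # 4-12; below 9 names a judgment problem

print(JudgmentScore(1, 2, 1, 2).total())  # 6
```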
Why This Matters Beyond Product Management
Gothelf built this for PMs. His framing is explicit: AI compressed execution costs, so judgment is the remaining differentiator. “You can out-execute a competitor less and less,” he writes. “You can still out-think them.”
True for humans. Also true for agents.
Consider a marketing agent deciding campaign targeting. A design agent selecting a layout pattern. A legal agent recommending contract language. Each of these agents makes judgment calls. And right now, nobody is scoring them.
We have been documenting the measurement problem for months. Data from 600+ organizations shows a mismatch between what executives believe AI delivers and what controlled trials measure. 40% of workers use AI; 2% of hours are saved. McKinsey surveyed perceptions instead of outcomes. Workers spend 3.8 hours per week verifying AI output against the 3.6 hours it saves.
The diagnosis is thorough. What has been missing is a framework for what to measure instead.
Gothelf provides one. Not for speed. Not for volume. For the quality of the decisions that precede execution.
Applying the Rubric to Agent Decisions
Take each dimension and ask it of an AI agent instead of a PM.
Customer Evidence for Agents: When your marketing agent selects an audience segment, what data grounded that choice? Did it use direct behavioral signals (score: 3) or historical analytics (score: 2), or did it pattern-match from training data with no customer-specific input (score: 1)?
Most agents today operate at a 1. They generate plausible decisions from general patterns. They do not consult your customer evidence. They cannot. Nothing in the agent’s architecture requires it to show what grounded its choice.
Outcome Clarity for Agents: When your design agent proposes a layout, what behavioral change does it expect? “Better user experience” is meaningless from a human PM. It is equally meaningless from an agent. A governed agent should specify: “This layout reduces scroll-to-CTA distance from 4 screens to 1.5, targeting a 15% increase in conversion for mobile users.”
Score the agent the same way you would score the PM. Vague outcomes earn a 1. Specific, observable behavioral changes earn a 3.
Trade-off Reasoning for Agents: Did the agent consider alternatives? Can it explain why it chose option A over options B and C? Most current agents produce a single recommendation with no visible reasoning trail. That is a 1 on Gothelf’s scale. A governed agent should produce: “I chose approach A over B because customer churn data from Q4 shows price sensitivity outweighs feature requests 3:1 in the SMB segment.”
Impact Estimation for Agents: Does the agent estimate cost and expected return? Most do not. A governed agent should: “This campaign targets 12,000 accounts at $3.20 per account. Expected conversion at 2.1% based on Q3 benchmarks yields 252 qualified leads at $152 each. Alternative B reaches 8,000 accounts at higher intent; expected yield is 280 leads at $91 each. Recommending B.”
That level of reasoning is a 3. Most agents today produce output with no cost reasoning at all.
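The point of a 3 is that the reasoning is checkable. A quick sketch of the arithmetic behind that recommendation, assuming $3.20 is a per-account cost (all figures come from the example above):

```python
# Campaign A: broader reach, lower conversion.
a_cost = 12_000 * 3.20           # $38,400 total spend
a_leads = round(12_000 * 0.021)  # 252 qualified leads at 2.1% conversion
print(f"A: ${a_cost / a_leads:.0f} per lead")  # ~$152

# Campaign B: smaller, higher-intent audience.
b_cost = 8_000 * 3.20            # $25,600 total spend
b_leads = 280
print(f"B: ${b_cost / b_leads:.0f} per lead")  # ~$91 -> recommend B
```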
The Scoring Table Nobody Has Built
Here is what Gothelf’s rubric looks like when applied to AI agent governance:
| Dimension | Score 1 (Ungoverned) | Score 3 (Governed) |
|---|---|---|
| Customer Evidence | Trained patterns, no customer data | Decision cites specific customer signals |
| Outcome Clarity | “Improve engagement” | “Increase 7-day retention by 8% for cohort X” |
| Trade-off Reasoning | Single recommendation, no alternatives | Explicit comparison with justification |
| Impact Estimation | No cost or return estimate | Cost, expected impact, ROI vs. alternative |
An agent scoring 4/12 is making decisions in the dark. An agent scoring 12/12 is making decisions you can audit, challenge, and improve.
The question for any organization deploying agents: what score are your agents earning right now? If you do not know, the answer is probably 4.
What the Rubric Does Not Cover
Intellectual honesty requires naming the limits.
Gothelf’s rubric was conceived for human PMs. It does not address failure modes specific to AI: hallucination risk, training data bias, context window limitations, or the tendency to present fabricated reasoning with high confidence. An agent can score 3 on trade-off reasoning while citing data that does not exist.
The rubric also has no validation data yet. Gothelf developed it on a flight from Helsinki. It is a framework, not a finding. It has not been tested across organizations or correlated with outcome quality. Treat it as a starting instrument, not a proven metric.
Finally, scoring remains subjective. Two reviewers evaluating the same agent decision may disagree on whether the customer evidence is “direct signals” or “secondhand data.” Calibration across evaluators is a real problem that Gothelf acknowledges but does not solve.
These are limitations, not disqualifications. The rubric is useful precisely because it makes judgment visible and debatable. Imperfect measurement beats no measurement, which is what most organizations have today.
From Diagnosis to Instrument
Gothelf’s core insight is compact: “Judgment was always the differentiator. AI just made it obvious.”
For PMs, this means building the skills that AI cannot replicate: customer empathy, strategic reasoning, cost-benefit thinking. For AI governance, it means something different and more actionable. It means building the scoring infrastructure that makes agent judgment auditable.
A PM who scores 6/12 gets coaching. An agent that scores 6/12 gets a constraint layer: mandatory customer data input before decisions, required alternative generation, forced impact estimation before output. The rubric becomes a governance specification.
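In code, that constraint layer can be as blunt as a gate that refuses to ship a decision missing any of the four behaviors. A hypothetical sketch; the decision schema and required keys are ours, not a published spec:

```python
# Pre-output gate: each required key mirrors one rubric dimension.
REQUIRED = {
    "evidence": "customer evidence must be cited before the decision ships",
    "expected_outcome": "a specific, observable outcome must be stated",
    "alternatives": "at least one rejected alternative must be documented",
    "impact_estimate": "cost and expected return must be estimated",
}

def gate(decision: dict) -> dict:
    missing = [msg for key, msg in REQUIRED.items() if not decision.get(key)]
    if missing:
        # Block the output instead of scoring it after the fact.
        raise ValueError("; ".join(missing))
    return decision

try:
    gate({"choice": "Campaign A"})  # an ungoverned, score-4 decision
except ValueError as err:
    print(err)  # lists every missing behavior
```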
You cannot download judgment. You cannot prompt for it. But you can require the behaviors that constitute it.
This analysis synthesizes “You Can Quantify Cost. Here Are Four Ways to Measure Judgment.” (March 2026) by Jeff Gothelf.
Victorino Group helps enterprises measure and govern AI judgment at scale. Let’s talk.