Victorino — Thinking

Cloudflare Just Named the Role That Doesn't Survive AI: Measurers

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

In May 2026, Cloudflare cut 1,100 employees. First mass layoff in the company’s 16-year history. The headline number is not what matters. The label Matthew Prince hung on the people being cut is.

He called them “measurers.” Middle management. Finance. Legal. Internal auditing. Revenue recognition. The functions whose work product is verification of someone else’s work product.

This is the first time a major-tech CEO has named coordination as the cuttable layer when AI absorbs execution. The framing matters more than the layoff. Every board deck for the next two quarters will borrow the word. Every CFO will ask which of their cost centers fall under the new taxonomy. The question worth fighting over is not whether measurers should be cut. It is what replaces them, because AI execution without measurer-equivalent oversight is the failure mode nobody is pricing in yet.

What Prince actually said

The author at Hackyexperiments captured the rhetoric clearly: “A founder cutting 20% writes an op-ed in the Wall Street Journal and gets called brave.” The same author argues that “a 5-to-10 person company starting today can credibly take on incumbents with thousands.” And the technical claim underneath the philosophy: variance in engineer productivity is now “directly measurable via token usage.”

Three moves nested in those sentences. Each one is a thesis we have been circling for a year.

First move: the cost of running a company has fallen far enough that the measurement layer is the most expensive layer that remains. Engineers ship more code per hour. Designers iterate faster. Customer support handles more tickets per agent. The bottleneck shifted to the people whose job was to confirm those things happened correctly.

Second move: AI made the execution layer auditable in machine-readable form. Token usage. Commit-level traceability. Automated review passes. The “measurer” function was a workaround for not having that telemetry. Once the telemetry exists, the headcount built to compensate for its absence becomes optional.

Third move: cutting that layer is now a brag, not a confession. The cultural frame shifted from “we had to lay people off” to “we restructured around AI.” Prince is the first major-tech CEO to publish the new vocabulary out loud.

Why this taxonomy will spread

CIOs and CFOs have wanted this language for two years. They could see the cost. They could not see the cohort. The org chart did not have a row labeled “people who measure other people.” Prince just drew the row. Once a row exists, it can be reorganized, reduced, or replaced. That is how language operationalizes change.

Expect the term in earnings calls within a quarter. Expect McKinsey to repackage it into a deck within two. Expect a Harvard Business Review piece with a 2x2 matrix by Q4.

The risk is not the language. The risk is the underlying assumption. The assumption is that measurement is overhead. That assumption is wrong by half.

What measurers actually did

Strip the org-chart politics and look at the functions. Internal audit catches material misstatements before regulators do. Revenue recognition keeps a company off restatement watch. Legal review keeps the agreements you sign from costing you ten times the contract value when something breaks. Finance approvals are the difference between a clean SOX 404 and a Section 302 nightmare.

These were never coordination overhead. They were liability suppression. The measurer function was the human firewall between the company and the consequences of unchecked execution. AI does not eliminate that firewall. It changes what the firewall is made of.

This is the part of the Prince framing that gets lost in translation. He did not say “we no longer need oversight.” He said “we no longer need that many humans doing this oversight.” Those are not the same sentence. The first is a strategy. The second is a workforce reallocation. Boards that read them as identical are taking on the risk of the first, in size, while believing they approved only the second.

What replaces the measurer headcount

Three pieces of machinery, in order. Each one corresponds to a function the laid-off layer was doing manually.

Continuous control verification. What internal audit used to sample quarterly, instrumentation now reads every transaction. Controls assert. Anomalies escalate. The audit committee gets a dashboard with a freshness timestamp, not a binder six weeks after quarter close. Tools like AuditBoard, Workiva, and the SAP GRC suite all moved this direction in 2025. The remaining work is not capability; it is implementation discipline.
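A minimal sketch of the "controls assert, anomalies escalate" loop, to make the shape concrete. Everything in it is an assumption for illustration: the Transaction fields, the two control rules, and the report format are hypothetical, and none of it reflects how AuditBoard, Workiva, or SAP GRC work internally.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Transaction:
    tx_id: str
    amount: float
    approver: str
    preparer: str


# A control is a named assertion that must hold for every transaction.
# Both rules below are hypothetical examples, not a real control catalogue.
Control = Callable[[Transaction], bool]

CONTROLS: dict[str, Control] = {
    "segregation_of_duties": lambda tx: tx.approver != tx.preparer,
    "approval_limit_10k": lambda tx: tx.amount <= 10_000 or tx.approver == "cfo",
}


def verify(transactions: list[Transaction]) -> dict:
    """Run every control on every transaction and collect failures for escalation."""
    anomalies = [
        {"tx_id": tx.tx_id, "control": name}
        for tx in transactions
        for name, check in CONTROLS.items()
        if not check(tx)
    ]
    return {
        "checked": len(transactions),
        "anomalies": anomalies,
        # Freshness timestamp: tells the reader how current the result is.
        "as_of": datetime.now(timezone.utc).isoformat(),
    }


report = verify([
    Transaction("t1", 500.0, approver="ana", preparer="bo"),
    Transaction("t2", 25_000.0, approver="ana", preparer="ana"),
])
```

The point of the sketch is coverage: every transaction is checked rather than a quarterly sample, and the output carries its own timestamp.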

Policy-as-code at the agent layer. Every AI agent acting on company systems carries the rule that authorizes the action. The rule is versioned, tested, and citable when the auditor asks. This is what Cloudflare’s internal stack rollout demonstrated at engineering scale. The same architecture has to extend into finance, legal, and ops, which is the gap we wrote about in the cross-domain governance tooling deficit.
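One way to picture "the agent carries the rule that authorizes the action" is a decision record that always names the policy and version it was made under. This is a sketch under stated assumptions: Policy, Decision, authorize, and the refund example are hypothetical names, and it does not describe Cloudflare's stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    policy_id: str
    version: str                      # versioned, so an auditor can cite the exact rule
    allowed_actions: frozenset[str]
    max_amount: float


@dataclass(frozen=True)
class Decision:
    allowed: bool
    policy_id: str
    policy_version: str
    reason: str


def authorize(policy: Policy, action: str, amount: float) -> Decision:
    """Check one proposed agent action and return a citable decision."""
    if action not in policy.allowed_actions:
        return Decision(False, policy.policy_id, policy.version,
                        f"action '{action}' is not permitted")
    if amount > policy.max_amount:
        return Decision(False, policy.policy_id, policy.version,
                        f"amount {amount} exceeds limit {policy.max_amount}")
    return Decision(True, policy.policy_id, policy.version, "within policy")


# Hypothetical policy for a support agent that may issue small refunds.
REFUND_POLICY = Policy("refunds", "1.2.0", frozenset({"issue_refund"}), 200.0)

ok = authorize(REFUND_POLICY, "issue_refund", 50.0)
too_big = authorize(REFUND_POLICY, "issue_refund", 5_000.0)
wrong_action = authorize(REFUND_POLICY, "delete_customer", 0.0)
```

Because authorize is a pure function of the policy and the request, it can be unit-tested like any other code, which is what "tested" would mean in practice here.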

Signal review at human pace. Whatever the machine flags has to land on a human desk with enough context to act on within hours, not weeks. This is not coordination overhead. This is the new measurer role. Fewer people, doing the deliberation the automation cannot do, on cases the automation correctly escalates. The math is roughly one human reviewer for every fifteen to twenty that the manual function required, based on the financial-services SOX automation benchmarks already in circulation.
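The ratio is easier to argue about with numbers attached. The arithmetic below assumes a hypothetical manual function of 120 people; only the fifteen-to-twenty ratio comes from the text above.

```python
import math


def reviewers_needed(manual_headcount: int, ratio: int) -> int:
    """Reviewers required if one reviewer covers what `ratio` manual roles did (rounded up)."""
    return math.ceil(manual_headcount / ratio)


manual_team = 120  # hypothetical headcount, for illustration only
low = reviewers_needed(manual_team, 20)   # optimistic end of the range: 6
high = reviewers_needed(manual_team, 15)  # conservative end of the range: 8
```

Under that assumption, a 120-person function becomes a six-to-eight-person review queue, which is a small team but not zero.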

A company that cuts the headcount without building the three pieces is not running leaner. It is running with the smoke detector unplugged.

The two-clock problem this exposes

We wrote about the two-clock CEO workforce problem earlier this year. Cloudflare’s announcement collapses the two clocks into one decision moment.

Clock one is the quarterly cost clock. Headcount is the biggest line item. Cutting measurers shows up in the next earnings cycle. The pressure to act on clock one is immediate and visible to every board.

Clock two is the liability accrual clock. Controls that erode silently produce material weaknesses that surface eighteen to thirty-six months later. The cost is enormous and the attribution is fuzzy. The pressure to act on clock two is invisible until the auditor finds the issue.

CEOs who only price clock one will get a quarter of margin and three years of restatement risk. CEOs who price both will reduce headcount and ship the control instrumentation in the same quarter. The second path is the one Prince’s framing was supposed to invite. Most companies will hear only the first half of the sentence.

What to do in the next 90 days

The Prince framing will be in every board meeting before July. Get ahead of it.

First, map your measurer surface. For each business function (finance, legal, audit, compliance, revenue ops, sales ops, marketing ops), list the work products whose purpose is verification rather than production. That list is the cohort the new vocabulary is going to target. Knowing it before the board asks is the cheap win.

Second, draw the replacement architecture function by function. Where does continuous control verification need to live? Where does policy-as-code need to extend? Where does the human reviewer queue sit, and who staffs it? Do this in two weeks. Not a year. The cost of having no answer when the question lands is a forced answer that ignores clock two.

Third, choose your sequence. Cut headcount only after the instrumentation that replaces it is in production. Companies that invert the sequence will pay the audit penalty in 2027 and the restatement penalty in 2028.

Cloudflare named the role. The replacement work is what the next twelve months are about. Do not confuse the announcement with the strategy.

This analysis synthesizes The Revenge of The Measurers (Hackyexperiments, May 2026).

Victorino Group helps leadership teams redesign the oversight machinery that survives the measurer cut, replacing coordination headcount with governance instrumentation that scales. Let’s talk.

Rand Fishkin Just Killed 'Make Great Content.' Inimitable Product Is What's Left.

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

Rand Fishkin spent twenty years telling marketers to make great content. On May 25, 2026, he published a post telling them to stop.

The piece is called “Inimitable Product is the New ‘Make Great Content’,” and it landed on sparktoro.com with a quiet prediction from its own author: fewer than 5,000 visits, fewer than 500 from search. Fishkin, the founder of modern SEO, built Moz on the premise that quality content earns durable distribution. He now expects to be ignored by the channel he helped build. His framing of the platforms is unsentimental: “We are going to ruin the Internet.”

That sentence deserves to sit by itself for a moment, because it reframes a debate that marketing has been trying to soften for two years.

What Fishkin actually killed

The advice “make great content” rested on a model. You write something useful. Search engines find it. Readers click through. Some of them buy. The artifact (the post, the report, the explainer) was the moat because the traffic it earned compounded.

AI search broke the model at the artifact layer. ChatGPT, Perplexity, Gemini, and Google’s own AI Overviews now read the content, summarize it, and answer the user without sending the click. Fishkin calls this a prisoner’s dilemma: every publisher must allow indexing or lose visibility, and the act of allowing indexing teaches the systems to make the publisher redundant. The artifact gets absorbed. The traffic does not return.

His proposed replacement is what he calls inimitable product. The examples are deliberately concrete and deliberately diverse: ultrasonic chef’s knives, made-to-measure suits, curated gift boxes, pottery refined through millennia of technique, Meow Wolf’s immersive art installations, lawn care, financial services. These do not share an industry. They share a structural property. None of them survives by being summarized. You cannot answer “what is a made-to-measure suit” in a way that substitutes for the suit. You cannot summarize Meow Wolf in a way that substitutes for walking through Meow Wolf. The product resists the channel.

This is the part of Fishkin’s argument that travels.

Professional services has the same problem

Engineering leaders have been saying a version of this for eighteen months in their own vocabulary. The moat is not the code your team writes, because the assistant will eventually write code of that quality. The moat is the harness around the code: the review process, the test suite, the deployment discipline, the named conventions, the institutional memory of why a thing is built the way it is.

Fishkin’s marketing thesis is the same thesis arriving from the other end of the building. When AI can compress and re-present any artifact, the durable advantage shifts from the artifact to the system that produces and validates it. The system is inimitable because it is built from the operator’s specific history, specific data, specific decisions, and specific accountability.

For a consulting firm, this is not abstract. Three concrete shifts follow.

First, the deliverable stops being the moat. Every reasonably equipped firm can now produce a credible AI strategy deck in an afternoon. The deck is the artifact. AI can summarize it. The buyer can ask Claude or ChatGPT for a similar deck and get one that is 80% as good for the cost of a subscription. If your offer is the deck, your offer has been commoditized.

Second, named methodology becomes the moat. A methodology that has a specific name, a specific origin story, a specific set of decisions encoded in its sequence, and a specific operator track record is harder to summarize. AI can describe what the methodology says. It cannot reproduce the judgment that produced the methodology, the cases where it failed, the iterations that refined it, or the operator’s willingness to stand behind it in a specific engagement. The methodology is the made-to-measure suit. The deck is the off-the-rack copy.

Third, proprietary measurement becomes the moat. This is the part Fishkin gestures at when he lists financial services. A wealth manager’s value is not the explainer about index funds; AI will produce that explainer for free. The value is the measurement infrastructure that turns a specific client’s specific portfolio into a specific recommendation under specific market conditions, with accountability if it goes wrong. The measurement infrastructure is the inimitable product. The explainer is the bait that no longer works.

Releezy is built on this thesis, by accident

We started Releezy because engineering leaders kept asking a question that nobody could answer with confidence: is our team measurably better with AI than without it? The product is a measurement discipline that runs on a team’s own data, produces a scoreboard that compares humans and AI on the same axes, and gives the leader a defensible answer to the board.

Read Fishkin’s post and the positioning sharpens. The market does not need another explainer about AI productivity. The internet is drowning in those explainers, and AI search will summarize them all on a single result page. What the market needs is the inimitable thing: a measurement system tied to a specific team’s specific work, producing evidence that cannot be replicated by anyone who does not have access to that team’s data and that team’s standards.

In Fishkin’s vocabulary, the explainer is the artifact and the measurement is the suit. The explainer can be summarized. The suit has to be cut.

This is also why we have been resistant to building Releezy as a content marketing play. The natural instinct, given two decades of SEO conditioning, would be to publish a hundred posts on “AI productivity metrics” and hope the funnel fills. Fishkin’s prediction about his own post is the warning shot. If the founder of modern SEO expects fewer than 500 search visits to a piece this strong, the funnel math no longer holds. The audience comes from the inimitable thing existing in the market, being talked about by operators who use it, and being defensible when challenged. Content supports the inimitable thing. Content does not substitute for it.

The do-this-now

If you run a professional services firm, three concrete moves over the next sixty days.

Name your methodology. If your firm’s approach does not have a specific name, a specific sequence, and a specific set of choices that distinguish it from a generic competitor’s approach, AI will treat your firm as interchangeable with that competitor. Naming is not branding. Naming is the act of refusing to be summarized.

Identify the one measurement only you can produce. Every firm has something it sees in client data that nobody else sees. For most firms, that something is buried in spreadsheets and never productized. Productize it. The measurement, properly packaged, is the suit. Everything else you sell is the off-the-rack copy.

Audit your content for substitutability. Take your last ten posts. Paste each one into Claude or ChatGPT and ask for a competing version. If the AI version reads as 80% as good, that post is teaching the model to replace you. Replace it with something the AI cannot reproduce: a named framework with your fingerprints on it, a measurement nobody else has, a case where you stand behind a specific decision with a specific client. Inimitable, in Fishkin’s sense, means the thing a summary cannot carry.

The moat moved. Fishkin saw it from inside marketing. Engineering leaders saw it from inside the IDE. They are looking at the same shift.

This analysis synthesizes Inimitable Product is the New ‘Make Great Content’ by Rand Fishkin (SparkToro, May 2026).

Victorino Group helps professional services firms translate “inimitable product” into named governance methodology, proprietary measurement, and first-party data that survive AI summarization. Let’s talk.

HBR Finally Named It: 'Every 30 Minutes Someone Creates Something I Have to Look At'

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

A manager interviewed by Harvard Business Review put words on something engineering leaders have been muttering for a year: “Every 30 minutes, someone creates something I have to look at.” Liz Fosslien and Mollie West Duffy published that sentence on May 25, 2026, inside a piece titled Managers Are Struggling to Keep Up with the AI Productivity Boom. The management trade press has finally caught up to the operational reality.

The reflex inside HBR’s framing is to coach the manager. Clearer direction. Focus attention on what matters. Faster feedback loops without micromanaging. All true. All insufficient. What the article describes is not a coaching problem. It is a governance problem with a coaching wrapper, and the difference matters because coaching scales with the manager and governance scales with the system.

What Actually Broke

The output of an AI-accelerated team is not a faster version of the previous output. It is a different artifact, produced at a different cadence, demanding a different kind of attention from the person at the top of the queue.

Before agents and copilots, a manager of eight engineers absorbed maybe ten review-worthy artifacts per day. Pull requests, design proposals, status updates, one or two escalations. The mental model behind classical management practice assumes that volume. One-on-ones, weekly syncs, quarterly reviews, ad hoc judgment calls. The whole apparatus runs on the assumption that most of what the team produces between touchpoints does not need the manager.

That assumption is gone. When each individual contributor ships three to five times what they used to, the queue at the manager’s desk does not grow three to five times. It grows worse than linearly, because each artifact arrives at a moment when the previous one has not been fully processed, and the cost of context-switching compounds. The manager who used to be the bottleneck for direction becomes the bottleneck for verification, integration, and prioritization. All three at once.
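A toy model shows why the queue does not simply scale with output. It is a sketch built on assumed numbers, not a measurement: the arrival rates, the base capacity of twelve artifacts per day, the 1% capacity loss per waiting item, and the 20% capacity floor are all hypothetical.

```python
def simulate_queue(arrivals_per_day: float, base_capacity: float,
                   switch_penalty: float, days: int) -> list[float]:
    """Return the end-of-day queue depth for each simulated day."""
    depth = 0.0
    history = []
    for _ in range(days):
        depth += arrivals_per_day
        # Assumed: each waiting item costs a fixed fraction of review capacity
        # (context switching), floored at 20% of the base capacity.
        capacity = max(base_capacity * (1 - switch_penalty * depth),
                       0.2 * base_capacity)
        depth = max(depth - capacity, 0.0)
        history.append(depth)
    return history


before = simulate_queue(arrivals_per_day=10, base_capacity=12, switch_penalty=0.01, days=10)
after = simulate_queue(arrivals_per_day=30, base_capacity=12, switch_penalty=0.01, days=10)
```

In this model the pre-acceleration queue clears every day, while tripling the arrivals produces a queue that grows every day and never clears. The exact numbers mean nothing; the shape is the argument.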

That is the operational fact behind the HBR quote. The manager is not slow. The manager is correctly recognizing that the job description quietly changed and nobody renegotiated it.

Three Things Break First

When a team accelerates and the management layer does not redesign, three specific things break before anything else. Naming them is the first step in fixing them.

Queue depth. The number of artifacts waiting on the manager’s attention at any given moment. Pre-AI, this number bounced between zero and four. Post-acceleration, it bounces between fifteen and forty. There is no theoretical maximum because the team has no mechanism to know they have already overshot the manager’s processing rate. They are not being rude. They are doing what they were told to do, faster than the system can absorb the consequences.

Feedback latency. The time between an artifact being produced and the producer getting useful signal back. Pre-AI, this hovered around a day. Post-acceleration, with the queue deep and the manager triaging, it stretches to three days, five days, sometimes a full week. The producer keeps producing in the absence of signal, building on assumptions that have not been validated. By the time feedback arrives, the producer has shipped four more things on top of the unreviewed one. Reversing course is now expensive.

Attention allocation. Which artifacts the manager chooses to read closely, which to skim, which to skip. Pre-AI, this was implicit and survivable because the queue was small enough that even random selection worked most of the time. Post-acceleration, attention is the scarcest resource in the system and it is being spent without an explicit policy. The manager defaults to recency, or to whoever pinged loudest, or to whatever is on top of the screen. None of those correlate with what actually matters to the business.

Read the HBR framing back against those three and the recommendations land differently. Clearer direction is queue depth governance: fewer artifacts compete for review because the team knows in advance what counts. Faster feedback loops without micromanaging is feedback latency engineering: the loop closes through structure, not through the manager being available more hours. Focus attention on what matters is attention allocation, made explicit instead of vibes-based.

Why This Is Governance, Not Coaching

The coaching frame says: this manager needs to get better at the new pace. Better at prioritization, faster at written feedback, more disciplined about deep work blocks.

The governance frame says something different. The system has changed. The constraints have shifted. The role description has not been rewritten to match. No amount of individual heroics will close that distance, because the next manager hired into the same structure will hit the same wall in the same week.

We saw the same shape in The Mexican Standoff Inside AI Teams: the team locks up not because individuals are failing but because nobody has the authority to declare who decides what. We saw it from a different angle in The Two-Clock CEO: scale-stage CEOs running two incompatible operating cadences on a single calendar. And we saw it inside the PM function in AI Agents for Product Managers: the PM role expands faster than the org chart admits, and individual ingenuity papers over structural debt until the structure breaks.

The HBR manager is the same pattern at a different layer. The fix is not to make the manager work harder against an unrenegotiated job description. The fix is to renegotiate it, in writing, with explicit governance for the three things that break first.

What Governance Looks Like in Practice

Queue depth governance. Publish a maximum. State out loud how many open artifacts a manager will hold at once, and what happens when the queue exceeds it. Options that actually work: a designated peer-review tier for anything below a defined criticality threshold, a hard cutoff where new artifacts route to a reviewer pool instead of the manager, a weekly queue audit where stale items are killed rather than carried. The principle: the queue is a resource with a ceiling, not an inbox with infinite capacity.
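The routing rules described above can be sketched as a small router. The names (ReviewRouter, the criticality scale, the three destinations) are hypothetical, and the thresholds are placeholders a team would set for itself.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewRouter:
    max_manager_queue: int       # the published ceiling
    criticality_threshold: int   # below this, peers review instead of the manager
    manager_queue: list[str] = field(default_factory=list)
    peer_review: list[str] = field(default_factory=list)
    reviewer_pool: list[str] = field(default_factory=list)

    def route(self, artifact: str, criticality: int) -> str:
        """Send an artifact to peer review, the manager, or the overflow pool."""
        if criticality < self.criticality_threshold:
            self.peer_review.append(artifact)
            return "peer_review"
        if len(self.manager_queue) >= self.max_manager_queue:
            # Hard cutoff: the manager's queue is full, so overflow goes to the pool.
            self.reviewer_pool.append(artifact)
            return "reviewer_pool"
        self.manager_queue.append(artifact)
        return "manager"


router = ReviewRouter(max_manager_queue=2, criticality_threshold=3)
destinations = [
    router.route("typo fix", 1),
    router.route("auth redesign", 5),
    router.route("billing migration", 4),
    router.route("schema change", 4),
]
```

With a ceiling of two, the fourth artifact lands in the reviewer pool even though it is critical enough for the manager. That is the ceiling doing its job.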

Feedback latency engineering. Set a target time-to-feedback for each artifact class, and instrument it. Twenty-four hours for code review at a certain criticality, forty-eight for design proposals, same-week for written strategy memos. When the metric slips, the response is structural: more reviewers, smaller artifacts, async feedback templates that lower the cost of a useful response. Not “the manager should respond faster.” That is coaching. Structural change is governance.
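Instrumenting the target can start as a few lines over timestamps the team already has. In this sketch the artifact records and field names are hypothetical, and "same-week" is read as seven days, which is an assumption.

```python
from datetime import datetime, timedelta

# Target time-to-feedback per artifact class, taken from the paragraph above.
TARGETS = {
    "code_review": timedelta(hours=24),
    "design_proposal": timedelta(hours=48),
    "strategy_memo": timedelta(days=7),  # "same-week" interpreted as 7 days (assumption)
}


def sla_breaches(artifacts: list[dict]) -> list[str]:
    """Return the names of artifacts whose feedback arrived later than the target."""
    breaches = []
    for a in artifacts:
        latency = a["feedback_at"] - a["produced_at"]
        if latency > TARGETS[a["kind"]]:
            breaches.append(a["name"])
    return breaches


t0 = datetime(2026, 5, 1, 9, 0)
sample = [
    {"name": "PR-a", "kind": "code_review",
     "produced_at": t0, "feedback_at": t0 + timedelta(hours=6)},
    {"name": "PR-b", "kind": "code_review",
     "produced_at": t0, "feedback_at": t0 + timedelta(days=3)},
    {"name": "design-x", "kind": "design_proposal",
     "produced_at": t0, "feedback_at": t0 + timedelta(hours=40)},
]
breached = sla_breaches(sample)
```

Here only PR-b misses its target. A rising breach count is the signal for a structural response, not a note to the manager.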

Attention allocation, made explicit. Decide in advance which categories of artifact the manager reads closely versus skims versus delegates. Write it down. Share it with the team so they know what kind of attention to expect when they produce a given artifact. The act of writing it forces the prioritization that the manager was previously doing implicitly under pressure. The act of sharing it removes the social cost of not reading everything.
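Writing the policy down can be as literal as a lookup table the team can read. The categories and the default below are hypothetical; each manager would fill in their own.

```python
from enum import Enum


class Attention(Enum):
    READ_CLOSELY = "read closely"
    SKIM = "skim"
    DELEGATE = "delegate"


# Hypothetical categories; the real list comes from the manager, in writing.
ATTENTION_POLICY = {
    "architecture_decision": Attention.READ_CLOSELY,
    "customer_escalation": Attention.READ_CLOSELY,
    "status_update": Attention.SKIM,
    "routine_pull_request": Attention.DELEGATE,
}


def expected_attention(category: str) -> Attention:
    """Tell a producer what kind of attention an artifact category will get."""
    # Assumed default: unlisted categories get a skim rather than being ignored.
    return ATTENTION_POLICY.get(category, Attention.SKIM)
```

The value is less in the code than in the table being published: a producer can look up what to expect before shipping the artifact.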

None of this is exotic. It is the same kind of governance discipline we already apply to incident response, to access control, to financial approvals. The novelty is applying it to managerial attention as a resource that can be designed, budgeted, and protected.

Do This Now

If you manage managers, this week: pick one manager and one team. Sit with them for an hour. Count the queue. Measure the average feedback latency over the last two weeks. Ask the manager to list, in writing, which artifact categories they currently read closely versus skim versus skip. Bring that artifact to the next leadership meeting.

You will not need to argue for governance after that. The numbers will argue for themselves. The reason HBR could finally publish the quote is that the math has become impossible to hide. The reason it is your job, not the manager’s, to fix it is that the manager cannot govern the system they are operating inside.

The team got faster. The role description did not. The work of the next two quarters is to close that distance deliberately, with explicit levers on queue depth, feedback latency, and attention allocation. Otherwise the productive team becomes the unmanageable team, and you lose the manager along with the throughput gain.

This analysis synthesizes Managers Are Struggling to Keep Up with the AI Productivity Boom by Liz Fosslien and Mollie West Duffy (Harvard Business Review, May 2026).

Victorino Group helps leadership teams replace ad-hoc management heroics in the AI era with explicit governance for queue depth, feedback latency, and attention allocation. Let’s talk.

HuggingFace Just Wrote the Vocabulary We've Been Using: Agent = Model + Harness

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

On May 25, HuggingFace published an Agent Glossary. The headline definition is one line. “Agent = Model + Harness.” Sergio Paniego and Aritra Roy Gosthipaty wrote it because, in their own framing, the confusion at ICLR 2026 about overlapping terms had become operational debt for the field. They wanted a shared vocabulary. They wrote one.

For anyone selling, buying, or governing AI agents, this is the most important publication of the month. Not because the definitions are novel. We have been using these exact terms for over a year. But because a major platform that hosts models from every lab now owns the canonical reference. When a procurement team Googles “what is an agent harness,” HuggingFace is what they land on. The vocabulary is now neutral ground.

That changes the conversation. Specifically, it changes who has to do the translation work.

The Definition That Settles a Year of Argument

HuggingFace’s core decomposition is precise enough to put in a contract. The scaffold is the behavior-defining layer: the system prompt, the tool descriptions, the response parsing logic. The harness is the execution layer: the code that calls the model, handles tool invocations, and decides when to stop. The model sits inside. Everything else is scaffolding or harness around it.
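The decomposition is easier to hold onto with a toy version in hand. This is a sketch of the three roles as described above, not HuggingFace's code: stub_model stands in for a real model, and the "CALL name args" text protocol, the add tool, and every function name are hypothetical.

```python
from typing import Callable

# Scaffold: the behavior-defining layer (system prompt, tool descriptions, parsing).
SCAFFOLD = {
    "system_prompt": "Answer arithmetic questions. Use the add tool when needed.",
    "tools": {"add": "add(a, b): return the sum of two numbers"},
}


def parse_response(text: str) -> tuple[str, list[str]]:
    """Scaffold-side parsing: 'CALL add 2 3' is a tool call, anything else is final."""
    parts = text.split()
    if parts and parts[0] == "CALL":
        return "call", parts[1:]
    return "final", [text]


TOOLS: dict[str, Callable[..., float]] = {"add": lambda a, b: float(a) + float(b)}


def stub_model(system_prompt: str, transcript: list[str]) -> str:
    """Stand-in for a real model: asks for the tool once, then answers."""
    if not any(line.startswith("TOOL_RESULT") for line in transcript):
        return "CALL add 2 3"
    return "The answer is " + transcript[-1].split()[-1]


def run_harness(model, question: str, max_steps: int = 5) -> str:
    """Harness: calls the model, executes tool invocations, decides when to stop."""
    transcript = [question]
    for _ in range(max_steps):
        kind, payload = parse_response(model(SCAFFOLD["system_prompt"], transcript))
        if kind == "final":
            return payload[0]
        name, *args = payload
        transcript.append(f"TOOL_RESULT {TOOLS[name](*args)}")
    return "stopped: step limit reached"


answer = run_harness(stub_model, "What is 2 + 3?")
```

Swapping stub_model for a different callable changes nothing in run_harness, which is the practical meaning of the model sitting inside the harness.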

The line that does the heavy lifting is the one Paniego and Gosthipaty land on directly: “If you are not the model, you are the harness.” That sentence is procurement-grade. It collapses every fuzzy product category into a binary. Either you ship the weights or you ship the code that orchestrates them. There is no third category.

Sub-agents get the same treatment. A sub-agent has reasoning capability, which is what distinguishes it from a tool (a function call) or a skill (packaged knowledge). Tools execute. Skills inform. Sub-agents decide. Once you accept that taxonomy, the layer at which a vendor competes becomes unambiguous. You can map any AI product to one of those roles in under thirty seconds.

That mapping is the new procurement skill.

Why the Lab Endorsement Matters More Than the Terms

We have written about this decomposition under the harness definition and traced its cross-discipline applications. The terms are not new. Anthropic uses them in their applied research posts. Practitioners on Twitter use them. The internal documentation of every serious agent team uses them, with local variations.

What was missing was a single citation that a buyer could send to a vendor without it reading as advocacy. If we sent a prospect the Victorino post defining harness, the implicit subtext was “adopt our framework.” If we sent them the Anthropic post, the subtext was “adopt the framework of the lab whose model you might also be considering.” HuggingFace’s position in the ecosystem is closer to neutral. They host models from every lab. They are infrastructure, not a competitor. Their glossary reads as the field defining itself.

That neutrality is what makes the glossary procurement-grade. A CIO can now require, in an RFP, that vendors describe their offering as model, harness, scaffold, sub-agent, tool, or skill, and cite the HuggingFace definition as the reference. The vendor cannot argue with the source. The vendor has to translate their marketing into the glossary’s terms.

This is the moment the vocabulary becomes a procurement weapon rather than an internal tool.

The Three Buyer Questions That Now Have Clean Answers

Before May 25, three questions kept coming up in vendor evaluations and produced muddled answers every time. Each one now has a clean form.

The first is “what layer are you selling at?” Before the glossary, vendors would say “we are a full agent platform” or “we are an agent framework.” Both phrases were marketing, not architecture. With the glossary, the question becomes: do you ship the model, the harness, the scaffold, or some combination? A vendor that cannot answer that question in one sentence is selling you a category, not a product.

The second is “what happens at the seams?” If a vendor sells the harness, what model assumptions does the harness make? If they sell scaffold (a set of prompts and tool descriptions), what harness assumptions does it require? The glossary makes the seams visible. The contracts can now specify which side owns each one.

The third is “where does your governance live?” Most agent governance is implemented in the harness, because that is the layer that decides when to call a tool, when to stop, and how to log. The scaffold can encode intent, but the harness enforces it. Once a procurement team understands that, the security review changes shape. Instead of asking the vendor “is your platform safe,” the team asks “show me the harness behaviors that enforce policy and the scaffold patterns that declare it.” Two specific deliverables instead of one vague one.

Each of these three questions used to require a 30-minute explanation of vocabulary before the substantive answer could begin. The glossary removes the preamble. The conversations get shorter and sharper at the same time.

What Stops Being a Coherent Unit

The hidden move in the HuggingFace post is that “agent” stops being a unit you can buy or evaluate. An agent is a composition. The model is bought from a lab. The harness is bought from a platform vendor or built in-house. The scaffold is written by the team using the agent. The tools are integrated. The skills are curated.

When someone says “we are evaluating Vendor X’s agent,” they are now committing a category error. Vendor X sells one or two layers. The agent only exists when all five layers are assembled. The evaluation has to happen at the layer level.

This will be uncomfortable for vendors who built their pitch around “the agent.” It is liberating for buyers who needed a framework to decompose what they were buying. The leverage shifts toward the buyer who can map a vendor pitch onto the glossary in real time.

What to Do This Week

Take the HuggingFace glossary URL and put it in your next three vendor RFPs. Specifically, in the architecture section, add: “Per the HuggingFace Agent Glossary, identify which of the following layers your offering provides: model, harness, scaffold, sub-agent, tool, skill. For each layer you provide, describe the interface assumptions made about the adjacent layers.”

Then, before the next vendor call, spend ten minutes mapping the vendor’s marketing site to those layers. Note which layers are explicit, which are implied, and which are unclear. The unclear ones are your opening questions. You will find that most pitches collapse two or three layers into one fuzzy term, and that the right discovery question is simply “which layer are you talking about right now?”
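The ten-minute mapping can be kept as a small worksheet. The sketch below assumes four hypothetical status labels (explicit, implied, unclear, absent) and an invented vendor. Only the six layer names come from the glossary terms quoted above.

```python
LAYERS = ["model", "harness", "scaffold", "sub-agent", "tool", "skill"]


def opening_questions(vendor_map: dict[str, str]) -> list[str]:
    """Turn every unclear layer into a discovery question.

    Status values are 'explicit', 'implied', 'unclear', or 'absent'.
    A layer missing from the map is treated as unclear.
    """
    return [
        f"Do you provide the {layer} layer, and what does it assume about adjacent layers?"
        for layer in LAYERS
        if vendor_map.get(layer, "unclear") == "unclear"
    ]


# Hypothetical notes taken from a vendor's marketing site.
vendor_x = {
    "model": "absent",
    "harness": "explicit",
    "scaffold": "implied",
    "tool": "explicit",
    "skill": "unclear",
}
questions = opening_questions(vendor_x)
```

For this invented vendor the worksheet yields two opening questions, one for the sub-agent layer that the site never mentions and one for the skill layer it mentions vaguely.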

If you ship agents internally, mirror the same exercise on your own architecture. Write the one-pager that maps each layer of your stack to the glossary terms. Circulate it to the security, platform, and product teams. The first version will surface three places where two teams meant different things by the same word. Fixing those is the compounding value of adopting the vocabulary.

The vocabulary war is over. HuggingFace called the terms. The teams that adopt them first will run cleaner procurement, cleaner security reviews, and cleaner internal handoffs for the next eighteen months. The teams that keep using “agent” as a unit will spend that same eighteen months explaining what they mean.

This analysis synthesizes “Agent Glossary: harness, scaffold, and the AI agent terms worth getting right” by Sergio Paniego and Aritra Roy Gosthipaty (HuggingFace, May 2026).

Victorino Group helps procurement and engineering leaders translate the harness vocabulary into RFP-grade evaluation criteria for AI agent vendors. Let’s talk.

Jen Can Never Leave: When the Expert Is the Single Point of Failure

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

Jen worked at Reed Group, now Alight Absence Management. Her job sounds boring on paper. She processed payroll files with two-letter codes in columns called “Action” and “Action Reason Code.” A combination like PAY/SRT meant nothing to a new analyst. To Jen, it meant a partial leave with a specific compliance treatment, and she could spot the case in three seconds across a spreadsheet of thousands.

Nobody else could.

That last sentence is the entire problem. Jason Cole, the CTO who tells Jen’s story on the darthealth blog, frames it without melodrama. Jen could not take real vacations. Not the kind where you stop reading email. Because every time something unusual came through the payroll pipeline, the system did not flag it. Jen did. And if Jen was at the beach, the unusual case sat in a queue, or worse, was processed wrong.

This is the texture of a governance failure that does not look like one. There is no breach. No audit finding. No regulator knocking. There is just one human being who has become a load-bearing wall, and a company that has quietly accepted that the wall can never be repaired without bringing down the building.

The Diagnostic Question

When you find a Jen in your operation, the instinct is to treat it as a staffing problem. Hire two more Jens. Cross-train juniors. Pay her a retention bonus large enough to buy her loyalty through the next fiscal year.

None of that solves anything. Cross-training assumes the knowledge can be transferred in a conversation. It cannot. Jen’s pattern recognition was not in her PowerPoint. It was in her hands, built over thousands of files where she had seen what happened when PAY/SRT was misclassified two quarters later. Hiring more Jens assumes the labor market produces them. It does not. Jen is the artifact of a specific company’s specific history, processed through one specific person’s specific career.

The right diagnostic question is the one Cole asked. Not “how do we replace Jen,” but “why does the company depend on Jen in the first place?” The answer is that the payroll system never encoded what the payroll system actually did. The codes were never documented because the people who used them did not need documentation. The patterns were never written down because the patterns lived in the muscle memory of the team that read them.

That is technical debt. It just happens to be carried by a person instead of a Confluence page.

Documentation Is a Photograph

Here is the line from Cole that earns the price of admission. “Documentation is a snapshot of what someone remembered on the day they wrote it. Wisdom is knowing what to do when data with the same smell but a different look shows up next time.”

Read that twice. The implication is brutal for any organization that has spent the last decade running knowledge management initiatives. The runbook is a photograph of a moment. The wisdom is the photographer’s eye, which the runbook can never capture. Every audit trail, every standard operating procedure, every onboarding deck is necessarily incomplete in the same way. The unusual case, the one that smells like PAY/SRT but is not quite PAY/SRT, is exactly the case the documentation cannot cover.

This is why the standard response to a Jen problem fails. You write the runbook. The runbook captures the cases Jen has seen often. The next unusual case shows up. The runbook does not cover it. Jen still has to look at it. Jen still cannot go on vacation. The runbook has solved nothing for the cases that actually matter, which are the ones that needed Jen’s judgment in the first place.

The trap is treating documentation as the destination. Documentation is a way station. The destination is a system that can do what Jen does, which is recognize that something is unusual and then know what to do about it. Or, in the cases where it cannot do the second part, at least know how to escalate the first.

What Cole Actually Built

Reed Group’s answer was something Cole calls the Data Nexus. The architecture is less important than the operational behavior. The Nexus learns Jen’s pattern recognition. It looks at the payroll file and applies the same heuristics Jen built over years. When it sees a familiar pattern, it processes the case. When it sees something ambiguous, something with the same smell but a different look, it does not guess. It flags the record for human review and tells the human what made it suspicious.

The system handles the cases that have a precedent. Jen handles the cases that do not. That single change rewires the entire economics of the role.

Cole’s framing is worth quoting directly. “Jen with the Data Nexus becomes Dr. House, only consulting on the really interesting cases while the system learns.” House did not see every patient who walked into Princeton-Plainsboro. He saw the patients who had defeated the standard diagnostic pipeline. That is a specialist’s job. Jen, before the Nexus, was doing the equivalent of every diagnosis from strep throat to lupus, all day, every day. The Nexus did not replace Jen. It promoted her.

This is the governance pattern. Encode the tacit knowledge that already has precedent. Escalate the novel cases to the human who has the judgment to handle them. The system gets faster at the routine work over time. The human gets paid for the work that requires actual expertise. And the human, finally, gets to go on vacation, because the routine queue keeps moving without her.
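The encode-and-escalate pattern can be sketched in a few lines. This is not Cole's Data Nexus, whose internals the post does not describe. It is a minimal illustration in Python, assuming a lookup table of known code pairs. Only the PAY/SRT pair comes from the story above; the treatment label, the `Decision` record, and the `triage` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical precedent table: (Action, Action Reason Code) -> treatment.
# Only PAY/SRT comes from the article; the treatment label is invented.
PRECEDENTS = {
    ("PAY", "SRT"): "partial_leave",
}


@dataclass
class Decision:
    handled: bool             # True if the system processed the record itself
    treatment: Optional[str]  # set only when the record was handled
    reason: str               # why it was processed, or why it was escalated


def triage(action: str, reason_code: str) -> Decision:
    """Process records that have precedent; escalate the rest with a reason."""
    key = (action.strip().upper(), reason_code.strip().upper())
    if key in PRECEDENTS:
        return Decision(True, PRECEDENTS[key], "exact match to a known code pair")
    # Same smell, different look: a familiar action with an unfamiliar reason.
    known_actions = {known_action for known_action, _ in PRECEDENTS}
    if key[0] in known_actions:
        return Decision(
            False, None,
            f"known action {key[0]} with unseen reason code {key[1]}",
        )
    return Decision(False, None, f"no precedent for {key[0]}/{key[1]}")
```

The design choice that matters is the `reason` field. An escalation that says why the record looked suspicious is what lets the human spend her time on judgment instead of rediscovery.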

Why This Is a 2026 Pattern, Not a 2018 One

We have been telling this story wrong for a decade. The version we kept telling was “AI will replace knowledge workers.” That story was always too simple, and the people doing the actual operational work could feel it was too simple. Replace Jen with what? A model that has never seen a Reed Group payroll file? A vendor SaaS that does not know what PAY/SRT means in your specific compliance context?

The 2026 version of the story is different. The Nexus is not a generic model. It is a system that was built around Jen, that learned from Jen, and that runs alongside Jen as her instrument. It only exists because Jen existed first. The institutional knowledge was the input, not the output to be deleted.

The transferable pattern is this. Every “we have one person who knows X” sentence in your organization is technical debt. Not a charming quirk. Not a sign of strong culture. Debt. It has a carrying cost (vacation that the person cannot take, knowledge that walks out the door when they do, queues that back up when they are sick) and a refinancing cost (the project that finally encodes the knowledge into a system). Right now, in 2026, the refinancing cost is cheaper than it has ever been, because the underlying tools to capture, encode, and escalate are commodity.

The companies that will operate AI well over the next three years are the ones that go looking for their Jens deliberately. Not to replace them. To finally pay back the debt they have been carrying on their backs.

Do This Now

Pick one operation in your business and ask a single question. If this one person were to take three weeks of vacation, no email, no Slack, what would back up in the queue and what would get processed wrong?

That answer is your candidate Jen. The next question is what fraction of her work has precedent (encode it) and what fraction is genuinely novel (escalate it). Treat the encoding as a software project, not a documentation project. Treat the escalation as a workflow design problem, not a hiring problem. And put the human in the seat that requires her actual judgment, not the seat that requires her to do the same recognition task ten thousand times in a row.

Then send her on vacation. The system you built will tell you whether you actually finished the job.

This analysis synthesizes “Jen Can Never Leave” by Jason Cole (Reed Group, May 2026).

Victorino Group helps operations leaders convert single-expert dependencies into governed AI-augmented workflows, turning the Jen on your team into the expert who finally takes a real vacation. Let’s talk.

Karpathy Retired Vibe Coding. The Replacement Is Product Management.

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

Andrej Karpathy coined “vibe coding” in early 2025. In May 2026, per Jeff Gothelf’s reporting, he declared the term obsolete and replaced it with a list of activities that any honest product leader recognizes on sight.

That recognition is the story.

The List Karpathy Used to Retire His Own Term

As Gothelf reconstructs the list, Karpathy describes what agentic engineering actually requires: writing design specs, supervising plans, inspecting diffs, writing tests, building evaluation loops, managing permissions, preserving quality.

Read that list twice. Notice what is not on it. Typing code. Memorizing syntax. Picking a framework. Configuring a build tool. The activities that defined “developer” in 2015 are absent. The activities that defined “product manager” are not.

Gothelf draws the parallel directly. Every item on Karpathy’s list maps to a classic PM activity: problem definition, prioritization, outcome validation, success metrics, scope alignment, quality judgment. The vocabulary changed. The job did not.

Why This Is the Admission That Matters

For two years the conversation about AI-assisted development has been an engineering conversation. How many seats of Cursor. Which model. Which agent loop. Which IDE. The implicit assumption: this is a tools problem, solved by engineering procurement and engineering training.

Karpathy’s retirement of his own term punctures that assumption. The judgment work was never the engineering bottleneck. It was the PM bottleneck disguised as one.

When a developer accepts a bad AI suggestion and ships it, the failure was not that the model hallucinated. The failure was that nobody wrote a sharp enough spec to make the hallucination obvious. When an agent runs a dangerous command, the failure was not the agent. It was the missing permission boundary. When an evaluation loop produces nothing useful, the failure was not the loop. It was the absence of a success metric anyone agreed on.

Each of those is a PM activity. None of them are cured by buying more Claude seats.

The Quote That Should Be on Every CTO’s Wall

Gothelf’s sharpest line lands here. “The admin version of every one of those activities is now automatable, and it will be automated. The judgment version is the job.”

The admin version of writing a spec is filling in a template. The judgment version is knowing what the customer cannot articulate yet. The admin version of inspecting a diff is reading the file. The judgment version is knowing which 200 lines of refactor are net positive and which are a regression dressed in clean code. The admin version of an evaluation loop is wiring it up. The judgment version is picking the right metric to measure.

The first column is being eaten by agents on a quarterly cadence. The second column is what remains. Karpathy is, in effect, telling engineering leaders that they have been staffing the first column and ignoring the second.

What Organizations Are Doing Wrong Right Now

Three patterns we see repeatedly in 2026:

Engineering teams adding AI capacity without PM capacity. A 40-person engineering org wires up Claude Code, sees a 25% throughput lift on first measurement, and decides the bottleneck is now “more AI tooling.” Six months later, the throughput lift has flattened. Investigation reveals the missing piece is not more tooling. It is that the same five product managers are now bottlenecking a larger volume of engineering output, and the specs they write have not adapted to a world where the executor reads them literally.

PMs treated as ticket-writers, not spec-writers. In most companies, the PM job degraded over the last decade into stakeholder management, ticket triage, and roadmap theater. The actual product thinking, the customer judgment, the trade-off articulation, got squeezed out. AI agents expose this immediately. An agent fed a ticket like “improve the onboarding flow” will produce something. Whether that something is right requires the judgment work that PMs have been increasingly absolved of doing.

“AI-assisted developer” as a job title, with no equivalent for product. Job boards in May 2026 are full of “AI-augmented engineer” and “agentic engineer” titles. The corresponding product role does not exist with the same crispness. The market is still recruiting executors for an environment where execution is increasingly automated, and underfunding the judgment roles for an environment where judgment is the binding constraint.

What to Do Monday

Audit your spec quality before you audit your model selection. Pull the last ten specs your team handed to AI agents. Ask: would a smart but literal junior, with no context about your product, build the right thing from this? If the answer is no, the bottleneck is upstream of the model. No tooling change will fix it.

Move judgment work earlier in the cycle. If a PM gets involved at the acceptance-testing stage, you have already paid for the engineering iteration. With agents producing code in minutes, that iteration cost approaches zero, which means the judgment cost dominates. PMs need to be in the room when the spec is written, not when the PR is reviewed.

Reframe AI-assisted development as a product staffing question. Stop asking “do we have enough AI-fluent engineers?” Start asking “do we have enough people who can articulate the right problem clearly enough for an agent to solve?” These are different questions, with different answers, and the second one is the one that determines outcomes.

Stop hiring PMs who cannot read a diff. The judgment version of “inspecting diffs” requires technical literacy. A PM who has never read code cannot evaluate whether the agent’s refactor is correct, only whether the demo looks right. In a world where the demo always looks right, that is no longer enough.

Build the evaluation loop as a product artifact, not an engineering one. Evaluation criteria for AI output are product decisions. What “good” looks like, what acceptance thresholds apply, what counts as regression, these are not technical questions. They are product questions wearing technical clothing. Treat them accordingly.

The reframe Karpathy is forcing is uncomfortable for organizations that spent the last two years convincing themselves the AI shift was a procurement and training problem in engineering. It was not. It was a product discipline problem all along. The tools just made it impossible to keep ignoring.

The companies that staff for that reality now will compound. The ones still hiring more executors for an environment where execution is free will spend 2027 wondering why their AI investment did not produce the returns the deck promised.

This analysis synthesizes “Karpathy Said Vibe Coding Is Obsolete. What He Described Instead Is Product Management.” by Jeff Gothelf (May 2026).

Victorino Group helps leadership teams reframe AI-assisted development as a PM staffing problem, building the spec-writing, outcome-validation, and judgment discipline that determines whether the agentic stack actually produces value. Let’s talk.

The Redux Maintainer Just Documented the Most Honest Agent Workflow of 2026

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

On May 7, 2026, Mark Erikson published Part 2 of his AI workflow series. Erikson maintains Redux. His day job is at Replay. He is the person who answers when a senior frontend engineer files a state-management bug at three in the morning, and he has answered enough of them across enough years to have opinions worth reading.

The headline of the post is not the tool stack. The headline is what he refused to do, and what he openly admitted he still cannot solve.

The stack itself is interesting. OpenCode with the CodeNomad UI. Claude Opus 4.5 and 4.6 over the API. Custom MCPs named grepika, tilth, and cachebro. A custom Bun script called devplans.ts that handles session handoffs. Most of that is replaceable. Tools change every six weeks. The discipline does not.

The discipline reads like a checklist of things that a lot of practitioners pretend they have already moved past. Erikson has not moved past them. He runs a parent orchestrator session that spawns interactive child subtask sessions, and he limits himself to one concurrent workstream. His own words: “I am intentionally choosing to limit the workflow to what I can manage in my own head.” He refuses YOLO permission modes. He uses regex-based command filtering rather than agent-call-based safety. He commits to Git by hand.

If you read that and felt the urge to argue with him, the next two sections are for you.

What Erikson Refused, and Why It Lands Differently in 2026

Three refusals stand out. Each of them puts pressure on something a vendor or a thought leader has been selling for the last twelve months.

The first refusal is YOLO permission modes. Most agent runners ship with a “go” switch that turns off the prompt for individual tool calls. Erikson does not flip it. The argument for flipping it is throughput. The argument against, which Erikson makes by simply not using it, is that an unprompted agent run is a run you cannot reconstruct. You traded a slower loop for a faster loop with no record of which decisions the model made on your behalf. When something breaks, you have no idea where to start reading.

The second refusal is agent-call-based safety. Many recent safety architectures route every tool call through a guardian agent that decides whether it is allowed. The pitch is that an LLM understands intent and can block dangerous calls that a regex cannot. Erikson chose the regex. The regex is deterministic and auditable, and a model cannot hallucinate its way past it. Two engineers reading the same regex see the same set of allowed commands. Two engineers reading a guardian agent’s recent log do not.

The third refusal is concurrent subtasks. The frontier of agentic workflows is many parallel sub-agents, hierarchies, swarms. Erikson runs one at a time. His reason is not that the technology cannot do more. His reason is that he cannot mentally model more than one in flight, and he refuses to operate a system whose state he cannot hold in his head. Apply that test to your own production agents. How many of them produce output that any single engineer on your team can fully reconstruct after the fact? If the answer is “none,” that is a finding, not an achievement.

None of these three positions is radical in isolation. What is striking is that a maintainer of Erikson’s caliber publishes them together and is not embarrassed to say “I limit myself.” The implicit message is that the people shipping the loudest, fastest, most parallel agent stacks may be doing so because they have not yet had to live with the consequences.
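To make the second refusal concrete, here is a minimal sketch of a deterministic command filter in Python. The post does not publish Erikson's actual patterns, so the allowlist below is hypothetical and illustrative, not a vetted safe set. The function name `is_allowed` is also an assumption.

```python
import re

# Hypothetical allowlist. The patterns Erikson actually uses are not given
# in the post; these exist only to show the mechanism.
ALLOWED = [
    re.compile(r"git (status|diff|log)( [\w./-]+)*"),
    re.compile(r"ls( -[al]+)?( [\w./-]+)?"),
    re.compile(r"bun test( [\w./-]+)*"),
]


def is_allowed(command: str) -> bool:
    """Allow a command only if the whole string matches an allowlisted pattern."""
    command = command.strip()
    # fullmatch, not search: "git status; rm -rf /" must not pass on its prefix.
    return any(pattern.fullmatch(command) for pattern in ALLOWED)


print(is_allowed("git status"))            # allowed by the first pattern
print(is_allowed("git status; rm -rf /"))  # rejected: ";" is not in any pattern
print(is_allowed("git commit -m wip"))     # rejected: commits stay manual
```

Anyone on the team can read the three patterns and state exactly what the agent may run. That is the property the guardian-agent approach gives up.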

The Two Open Problems He Was Honest About

The more important contribution of the post is not the refusals. It is the two surfaces Erikson openly named as unsolved.

The first is long-term memory and context. Erikson is explicit. When he needs to reconstruct what he and the agent decided two sessions ago, he digs through prior sessions by hand. There is no working long-term memory. The session is the memory. Cross-session continuity is a manual archaeology problem, and the workaround is his devplans.ts script that hand-rolls handoffs between sessions.

The second is code review and intent verification. His exact framing: “code review and ensuring intent are still hard.” This is the part that engineering leaders are most likely to misread. He is not saying that agents cannot write code. He is saying that nobody, including him, has a reliable way to confirm that the code the agent produced reflects the intent the human had at the start. The verification surface is still human.

Both of these are surfaces that vendors are racing to fill. The race is real, and somebody will eventually ship something useful in each lane. Today, in May 2026, the most respected practitioner publishing on this topic says neither lane is closed. Your operating assumption should match his.

There is a connection between his three refusals and his two open problems that is worth naming. The refusals exist because the open problems exist. If long-term memory worked, the case for stateless YOLO runs would be much stronger because you could reconstruct what happened. If reliable AI code review existed, the case for high-throughput parallel subtasks would be much stronger because each output would be independently verifiable. The discipline he practices is not arbitrary. It is exactly the discipline a senior practitioner adopts when the two load-bearing primitives are still missing.

Why This Matters for Your Stack

We have written before about why the harness is your memory and why subtraction beats addition in harness design. Erikson’s post is the field validation for those positions, written by somebody who is not building a Victorino service.

Read his post against your own production agent setup. Three questions are worth asking.

Are your agents running with permission models that an outside reviewer could reconstruct? If your team uses YOLO modes in production, you have implicitly accepted that you will not be able to explain individual decisions after the fact. Erikson chose not to make that trade. The question is whether your team made the choice deliberately or by default.

Is your safety layer deterministic or model-based? A guardian agent is a useful complement to a deterministic filter. It is a dangerous replacement for one. The regex is boring and that is the point. Boring is auditable.

Do you have a written rule for how many concurrent subtasks any one operator manages? If not, you have one in practice and you are not measuring it. The number does not have to be one. Erikson chose one for himself. A team operating at scale will choose more. But the number should be a decision, not an emergent property of whatever the tooling defaults to.

Where the Frontier Actually Is

The post draws a sharper line than most leaders are willing to draw publicly between solved and unsolved. The solved part is hands-on engineering with an agent that obeys a deterministic filter, commits when a human says commit, and works on one thing at a time. That part works today and it works well. The unsolved part is cross-session continuity and verification of intent.

The vendor pitches around memory products and AI-driven code review are not wrong to exist. They are pointed at real surfaces. But they are pointed at surfaces that the best practicing maintainer in the industry says are not yet closed. If you are budgeting agent investment for the next two quarters, weight the budget toward the parts Erikson confirms are solved, and treat the memory and review parts as research bets, not as production primitives.

Do This Now

Open the operating doc for one production agent on your team. Find three things.

Find the permission model. Write down whether your operators are running with prompted approval, deterministic filters, or YOLO mode. If it is YOLO, schedule a review with the owning engineer this week. The question is not whether you trust the agent. The question is whether you can reconstruct its decisions if a customer asks.

Find the concurrency cap. Write down the maximum number of concurrent subtasks any single operator runs. If the number is not written down, write it down today. Pick a number you can defend. Erikson picked one. Your team may pick three. The number itself is less important than the fact that somebody owns it.

Find the memory story. Write down how an operator reconstructs what was decided three sessions ago. If the answer is “they grep through prior logs,” you are running the same workaround Erikson is, and that is acceptable. If the answer is “our memory product handles it,” verify that claim against a real case before you bet a release on it.

The bar Erikson set in this post is not a high bar. It is an honest one. Match it.

This analysis synthesizes “My Thoughts on AI, Part 2: Agent Setup, Workflow, and Tools” by Mark Erikson (May 2026).

Victorino Group helps engineering leaders codify the agent discipline that top-tier maintainers practice, with explicit guardrails for the gaps vendors have not yet closed. Let’s talk.

Microsoft Cancels Claude Code. The Token Economy Hits Big Tech.

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

On May 25, 2026, TheNextWeb reported that Microsoft’s Experiences and Devices group, the division that ships Windows, Microsoft 365, Outlook, Teams, and Surface, is migrating most engineers off direct Claude Code licenses by June 30. The replacement is GitHub Copilot CLI, which can still call Claude under the hood through a managed routing layer. The framing inside Microsoft is “cost optimization.” The substantive read is different.

This is the first public admission by a hyperscaler that agentic AI unit economics do not pencil out at current token prices.

Stop and consider who is making this call. Microsoft is the largest single investor in OpenAI. Microsoft owns GitHub, the channel through which most enterprise AI coding tooling is sold. Microsoft has the deepest pockets in software. If anyone could absorb the token bill, it is Microsoft. And the division pulling the plug is not a back office. It is the one shipping the products that pay for everything else. When the most resourced division at the most resourced software company in the world starts triaging AI tool licenses by ROI rather than by adoption, the message is not subtle.

The story is not Microsoft picking Copilot CLI over Claude Code. The story is that procurement governance just became the load-bearing layer of any AI tooling strategy.

The Numbers Nobody Budgeted For

TheNextWeb’s reporting included data points that should reset every 2026 AI budget conversation.

Uber engineers are spending between $500 and $2,000 per month, per person, on AI coding tokens. The CTO’s own words: “the budget I thought I would need is blown away already.” Uber is not a tentative AI adopter. Uber is the case study companies cite when they want to justify aggressive engineer-led tool adoption. That same company is now publicly saying the spend curve outran the planning model.

The OpenClaw framework, which orchestrates Claude through extended agentic loops, is reportedly consuming $1,000 to $5,000 per day for users on $200 per month subscription plans. That is not a 10x overrun. Sustained for a month, that daily rate is a 150x to 750x overrun against the sticker plan. Anthropic and others are absorbing the difference today because the strategic position is worth more than the unit margin. That subsidy has an end date that nobody has published.
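The multiple is worth checking by hand, because it compares a daily usage figure against a monthly plan price. A short Python check, assuming a 30-day month (the reporting does not specify one); the constant and function names are illustrative.

```python
PLAN_PRICE_PER_MONTH = 200          # USD, from the reporting quoted above
DAILY_USAGE_RANGE = (1_000, 5_000)  # USD per day, from the same reporting
DAYS_PER_MONTH = 30                 # assumption: a 30-day month


def overrun_multiple(daily_usage: float) -> float:
    """Monthly usage at a sustained daily rate, divided by the plan price."""
    return daily_usage * DAYS_PER_MONTH / PLAN_PRICE_PER_MONTH


low, high = (overrun_multiple(daily) for daily in DAILY_USAGE_RANGE)
print(f"{low:.0f}x to {high:.0f}x")  # prints "150x to 750x"
```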

Gartner’s number is the one that should make every CFO sit up. Only 28% of AI infrastructure projects fully deliver on their business case, and 25% of planned 2026 AI budgets are expected to slip into 2027. The slip is not a delivery problem. The slip is a money problem. The bill arrived before the value did.

Put those numbers next to Microsoft’s retreat and the picture sharpens. The vendor with the strongest balance sheet, the deepest integration with the model provider, and the most adoption data inside its own walls just decided that direct seat licensing for the most-loved coding agent in the industry was not worth the marginal yield. That is a procurement signal, not a product signal.

What Microsoft Actually Did

The headlines will say Microsoft chose Copilot CLI over Claude Code. Read the move structurally instead.

Microsoft did not block Claude. Copilot CLI still routes to Claude where the routing layer judges it the best model for the task. What Microsoft removed was the direct seat license. Engineers no longer get an unmetered Claude Code subscription that bills outside any procurement envelope. They get access to Claude through a managed pipe that Microsoft controls, prices, and instruments.

This is the procurement pattern enterprise IT has applied to every prior wave of expensive tooling. Database licenses, observability platforms, cloud compute. The first phase is “engineers can expense it.” The second phase is “the bill ate the budget.” The third phase is a managed gateway where the vendor is still consumed but the spend is bounded, attributable, and renegotiable. Microsoft just compressed phases one through three into eighteen months.

The hyperscaler running the playbook on itself is the news. Every CTO outside Microsoft now has the same question on their desk. If Microsoft’s most strategically important engineering division could not absorb direct Claude Code seats, who in your organization can?

The Uber Trajectory

Uber’s CTO did not say “we got the math wrong.” He said the budget he thought he would need was blown away already. The verb is passive. The cost did not exceed forecast. The cost obliterated forecast. That is what happens when an org buys AI tooling through expense reports and discovers afterward that the unit cost is variable, the per-engineer ceiling is open, and the model providers have every commercial incentive to let consumption ramp until the subsidy ends.

Every enterprise that lets engineers expense AI tools without budget governance is on the Uber trajectory. Burn the year’s budget in four months. Discover in May that there is no envelope left for the second half. Triage in panic. The triage moment is what Microsoft just did publicly. Most enterprises will do it privately, in August or September, when finance pulls the AI spend ledger and reconciles it against the original plan.

The pattern we have argued for months applies here directly. When a public benchmark shipped a default cap of $100 per provider per month inside the agent procurement protocol, that was the market telling internal platform teams what disciplined defaults look like. The Microsoft retreat is the same market lesson, restated from the buy side. The cap exists because the spend curve, left alone, breaks the model.

What Stays Standing

Three things are still true after Microsoft’s announcement, and they are the foundation any AI tooling strategy has to stand on now.

First, the value is real where it is governed. Engineers using AI coding tools inside a measured pipeline still ship faster than the same engineers without them. The retreat is about unbounded seats, not about the underlying capability. The capability earns its place when the spend has a ceiling and the output has a measurement.

Second, the model providers will rationalize. Anthropic, OpenAI, and the rest cannot subsidize 150x overruns indefinitely. Prices will move. Rate limits will tighten. Subscription tiers will fragment. The companies that built their internal AI workflows around current sticker prices, with no plan for what happens when those prices move, will reprice their entire AI strategy on someone else’s calendar.

Second-and-a-half: governance is not a brake on adoption, it is the condition for sustained adoption. The organizations that ship procurement and budget discipline first get to keep using the tools when others have to pull back. We argued the same thing when governance started shipping as product. The pattern repeats here: the discipline that looked like overhead in 2025 is the survival kit in 2026.

Third, the workload-harness fit question matters more, not less. If you are spending $500 to $2,000 per engineer per month on AI tokens, the assignment of which workloads run on which harness is no longer a developer-experience decision. It is a unit-economics decision. Every workload that goes through the most expensive harness when a cheaper one would do is a line item that finance will eventually find.

Do This Now

Stop the next AI tool expense reimbursement cycle until your CFO can answer one question. What is the published per-engineer monthly ceiling for AI coding tools, by tool, by team, this quarter? If the answer is “we don’t have one,” you are on the Uber trajectory. The fix is not a meeting. The fix is a written cap, a metering layer that enforces it, and a managed gateway that routes engineers through it.
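The cap, the metering layer, and the gateway can be sketched in a few lines. This is a minimal illustration, not a product: the class name, the tool name, and the $100 ceiling are assumptions chosen for the example, and a real gateway would also need persistence, monthly resets, and authentication.

```python
from dataclasses import dataclass, field


@dataclass
class SpendGateway:
    """Hypothetical metering layer: tracks spend per (engineer, tool) for one month."""
    caps_usd: dict                                   # tool -> per-engineer monthly ceiling (assumed values)
    spent_usd: dict = field(default_factory=dict)    # (engineer, tool) -> spend so far

    def authorize(self, engineer: str, tool: str, cost_usd: float) -> bool:
        """Allow the request only if it keeps the engineer at or under the written cap."""
        cap = self.caps_usd.get(tool)
        if cap is None:
            return False  # no published cap for this tool means no spend
        key = (engineer, tool)
        current = self.spent_usd.get(key, 0.0)
        if current + cost_usd > cap:
            return False
        self.spent_usd[key] = current + cost_usd
        return True


gateway = SpendGateway(caps_usd={"coding-assistant": 100.0})
print(gateway.authorize("ana", "coding-assistant", 60.0))  # True: 60 of 100 used
print(gateway.authorize("ana", "coding-assistant", 60.0))  # False: would reach 120
print(gateway.authorize("ana", "unlisted-tool", 1.0))      # False: no cap published
```

The design choice worth noticing is the default: a tool with no written cap is denied, so the absence of a ceiling shows up on day one instead of in August.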

Microsoft just published the lesson at the highest possible volume. The companies that will still be running their AI tooling stack in Q4 are the ones that take the lesson seriously this week, not the ones that wait for their own finance team to ring the alarm in August.

This analysis synthesizes Microsoft retreats on Claude Code as AI costs bite (TheNextWeb, May 2026).

Victorino Group helps enterprise teams build the procurement and budget governance layer that turns AI-tool sprawl into accountable, measured spend. Let’s talk.

The Scientific Loop Has Four Roles. AI Only Gets One of Them.

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

On May 25, Alejandro Piad Morffis published a short essay called AI is doing something weird to Science. It does what most takes on AI and science do not. It refuses the binary of “AI is replacing scientists” versus “AI is just a tool.” Instead, it decomposes what scientists actually do into four roles and asks which ones survive contact with a large language model.

The four roles are: poser, proposer, verifier, curator. Read them slowly. Most discussions of AI in knowledge work collapse all four into a single thing called “the human” or “the expert.” Piad pulls them apart, and once they are apart, the load-bearing role becomes obvious. It is not the one most people assume.

This matters far outside science. The same four roles are present in legal review, financial analysis, marketing production, and any other knowledge workflow where AI is now generating candidates. If your organization cannot point to the verifier, you do not have governance. You have decoration.

What Piad actually proposed

The four roles, in Piad’s own words and structure:

Poser. Decides what is worth solving. Names the question. Sets the frame. In Piad’s account, this remains exclusively human. Not because LLMs cannot generate questions, but because the choice of which question matters is an act of taste, judgment, and stakes that no model can hold.

Proposer. Generates candidate solutions, fast. This is where the LLM lives. Piad is precise about the title: “Not discoverer, not author, not scientist. The one that generates candidates fast enough that the verifier can find something in the haystack.” The proposer’s job is volume and variety, not correctness.

Verifier. Checks whether a candidate is actually true. In Piad’s four documented cases, the verifier is never another LLM. It is formal logic (Lean), a combinatorial proof checker, a wet-lab experiment, a crystallography measurement. The verifier cannot be fooled by plausible-sounding falsehoods, which is exactly what LLMs excel at producing.

Curator. Decides which surviving candidates are worth pursuing further. This is human again. The verifier tells you something is true; the curator tells you it is interesting, fits a research program, advances the field. Truth is necessary but not sufficient.

Piad’s punchline is direct: “The verifier is the one that matters. A loop with a weak proposer and a strong verifier still produces valid science, it is just slow.” Reverse the sentence and the implication is brutal. A loop with a strong proposer and a weak verifier produces fast nonsense at scale.
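The asymmetry is easy to see in a toy loop. The sketch below is our illustration, not Piad's code: the "question" is finding a nontrivial factor of 91, the proposer is a random guesser standing in for a model, and the verifier is a deterministic divisibility check. All function names are hypothetical.

```python
import random


def weak_proposer(rng: random.Random, n: int) -> int:
    """Stand-in for the model: guesses candidates with no regard for correctness."""
    return rng.randint(2, n - 1)


def strong_verifier(n: int, candidate: int) -> bool:
    """Deterministic check: a plausible-looking wrong answer cannot pass."""
    return 1 < candidate < n and n % candidate == 0


def run_loop(n: int, max_proposals: int = 10_000, seed: int = 0):
    """Keep proposing until the verifier accepts a candidate or the budget runs out."""
    rng = random.Random(seed)
    for attempt in range(1, max_proposals + 1):
        candidate = weak_proposer(rng, n)
        if strong_verifier(n, candidate):
            return candidate, attempt
    return None, max_proposals


# Poser's question: "find a nontrivial factor of 91".
factor, attempts = run_loop(n=91)
print(f"verified factor {factor} after {attempts} proposals")
```

Whatever the loop returns is correct, because nothing reaches the caller without passing the check. A better proposer would only lower the attempt count. Swap the verifier for a function that returns True whenever the candidate "looks reasonable" and the loop gets faster and stops being trustworthy.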

The cases are not new. The naming is.

Piad walks through four examples. Claude’s Cycles work in combinatorics, where Claude proposed candidate constructions and a formal checker verified them. Terence Tao’s Lean-assisted mathematics, where Tao directs the question and curates the result while Lean does the verification. AlphaFold, where the model proposes protein structures and crystallography verifies them. GNoME, where the model proposes candidate materials and physical synthesis verifies them.

He also reaches back to 1976. The Appel-Haken proof of the four-color theorem used the same loop structure: a human posed the question, a program generated candidate configurations, a verifier checked each one, and humans curated the surviving result into a proof. We have been running this loop for fifty years. We just never named the roles.

This is the move that makes the essay useful. Piad did not discover a new architecture. He gave a name to a pattern that was already running, and once the pattern is named, you can test for it.

The test, exported

Take the four roles to any AI deployment outside science and ask:

Legal review. A firm deploys an LLM to summarize contracts and flag risks. Who is the poser? (The partner who decides which clauses matter.) Who is the proposer? (The model.) Who is the verifier? (Here it gets uncomfortable. Often the answer is “another associate reading the summary,” which is just a slower proposer. A real verifier would be a clause-level rule engine, a citation checker against case law, a structured diff against a known-good template.) Who is the curator? (The partner again, deciding which flagged risks deserve client conversation.)

Most legal AI deployments today have a poser, a proposer, a curator, and no verifier. The associate is performing verification theater. The model produces plausible-sounding falsehoods. The associate, under time pressure, reads them as competent summaries. The curator inherits unverified material as if it were verified.

Financial analysis. Same exercise. Who poses the question? (The CFO.) Who proposes the analysis? (The model running over the data.) Who verifies? (A reconciliation engine, a deterministic formula check, a cross-reference against the source ledger. Not another LLM “double-checking” the first.) Who curates? (The CFO, again.)

When the verifier is missing, finance teams end up with elegant narratives that footnote nothing checkable. The pattern Piad warns about in science shows up identically in the boardroom.

Marketing production. A team uses AI to produce a hundred ad variants. Poser: brand strategist. Proposer: the model. Verifier: … brand guidelines compliance check? Legal review? A/B test against actual user behavior? Most teams skip straight from proposer to curator and call the creative director’s eyeball the verifier. The creative director cannot scale to a hundred variants, so the verification quietly does not happen.

In all three cases, the failure mode is the same: an LLM is doing both proposing and verifying. Piad’s framework names why this cannot work. The proposer optimizes for plausibility. The verifier must optimize for truth. You cannot do both with the same instrument.

Why “human in the loop” is the wrong abstraction

Most AI governance frameworks demand a “human in the loop.” Piad’s decomposition exposes the imprecision. Which human? Doing which job? At which stage?

A human acting as curator after the verifier has done its work is governance. A human acting as poser before the proposer runs is governance. A human acting as verifier on the output of an LLM proposer, without formal checking infrastructure behind them, is performance art. They are being asked to do, by reading, what a non-LLM system needs to do by construction.

This is why so many “human review” deployments degrade. The reviewers are honest. They are also human, tired, and reading plausible prose. They cannot verify what the system has not made verifiable.

What to do this week

Three actions, ordered by leverage:

Run the four-role test on your most-deployed AI workflow. Write the four names. Assign each to a person or system. If the verifier slot is “the human reviewing the output,” you have no verifier.
Name what would have to be true for a real verifier to exist. It is rarely another AI. It is usually a rule engine, a formal checker, a deterministic system of record, or a test environment. Often it does not exist yet. That is the work.
Stop calling reviewers “verifiers.” Reviewers are curators. They decide what merits attention. They are not equipped to catch plausible falsehoods at scale. The naming honesty alone changes how leaders allocate budget.

Piad gave us a tool. The tool is small enough to use on a Monday and sharp enough to expose where governance ends and theater begins.

This analysis synthesizes AI is doing something weird to Science by Alejandro Piad Morffis (May 2026).

Victorino Group helps leadership teams export Piad’s four-role test into legal, financial, and marketing workflows, naming the independent verifier that turns “human in the loop” from posture into structure. Let’s talk.

SaaStr Built an AI VP of Customer Success in Replit. Zero Engineers. 100 Sponsors.

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

Amelia is SaaStr’s Chief AI Officer. She built Qbee, an AI VP of Customer Success, inside Replit. No engineers. Qbee now manages over 100 event sponsors with 70% fewer human hours and a 10x lift in sponsor engagement versus the legacy tool that came before it.

The temptation is to read this as a vibe-coding success story. A non-technical operator ships a production system, the future is here, anyone can build software now. That reading misses the point.

The Qbee story is interesting because it documents what production looks like when the builder is not an engineer. The artifact is not the codebase. The artifact is the daily operating discipline. And that discipline is the same discipline that engineering leaders have been writing about all year, translated for a function that has never had to think this way.

The economics are not the story

Jason Lemkin shared the numbers in a recent SaaStr post. Combined token cost across all of SaaStr’s vibe-coded apps is under $200 per month. The math is absurd in the best possible sense. A VP-level function that would cost a salary, benefits, equity, and management overhead, replaced by a Replit app running on token spend.

But cheap is not the lesson. Lots of cheap software fails in production. The lesson is what made Qbee survive contact with 100 paying sponsors.

Three operating patterns. None of them are technical.

Pattern one: build the dashboard before you build the agent

SaaStr’s first move on Qbee was not the agent. It was the dashboard. A central screen that shows the state of every sponsor: where they are in the journey, what is overdue, what is at risk, what just shipped. The agent was built second, against the dashboard.

This sequence matters. The dashboard is the spec. It is the visible artifact that lets a non-engineer reason about whether the agent is doing the right thing on a given day. Without it, the agent is a black box. With it, the agent is a measurable employee.

Compare this to the typical pattern in engineering, where the system is built first and observability is grafted on later. Amelia inverted the order because she did not have the technical instincts to defer observability. She needed to see the work before she trusted any automation against it.

This is a transferable pattern. If you are a marketing leader, a legal lead, a finance ops director thinking about deploying an autonomous agent in your function, build the dashboard first. Make the work visible to a human reviewer in one screen. Then point the agent at it.

Pattern two: agent hopping for sensitive data

The most quietly important pattern in the Qbee post is what Lemkin calls agent hopping. Sensitive data (contracts, internal financials, sponsor commitments) does not live in the agent’s context. It lives in the secure systems where it belongs. Qbee calls APIs to read and write against those systems, but it never holds the raw data in memory or in prompts.

This is the customer success version of a pattern that took engineering teams two years to internalize: the agent is a coordinator, not a vault. State of record stays in systems of record. The agent moves between them.
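A rough sketch of the coordinator-not-vault idea follows. It is not SaaStr's implementation; the class, the sponsor ID, and the field names are invented for illustration. The point is only the shape: the system of record keeps the raw contract, and the agent sees an identifier plus derived facts.

```python
class SponsorRecords:
    """Hypothetical system of record. Raw contract text stays inside this class."""

    def __init__(self):
        self._records = {
            "sponsor-42": {"contract_text": "CONFIDENTIAL TERMS", "deliverables_due": 3},
        }

    def deliverables_due(self, sponsor_id: str) -> int:
        """Expose a derived, non-sensitive fact instead of the document itself."""
        return self._records[sponsor_id]["deliverables_due"]

    def mark_delivered(self, sponsor_id: str) -> None:
        """Perform the write inside the system of record."""
        record = self._records[sponsor_id]
        record["deliverables_due"] = max(0, record["deliverables_due"] - 1)


def agent_context(records: SponsorRecords, sponsor_id: str) -> dict:
    """What the agent may put in a prompt: an ID and a derived count, nothing raw."""
    return {"sponsor_id": sponsor_id, "deliverables_due": records.deliverables_due(sponsor_id)}


records = SponsorRecords()
context = agent_context(records, "sponsor-42")
print(context)                                   # no contract text in here
records.mark_delivered("sponsor-42")             # the agent requests an operation by ID
print(records.deliverables_due("sponsor-42"))
```

In a real deployment the class boundary would be an API boundary with its own access control, which is what makes the separation enforceable rather than a convention.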

For non-engineering leaders, this reframes the data security conversation entirely. The question is not “is it safe to put our customer data into an LLM.” The question is “can we structure the work so the agent never touches the raw data, only the operations on it.” The answer to the first question is often no. The answer to the second is usually yes.

This connects directly to the workload and harness fit problem we have been tracking. The harness is what makes the workload safe. Agent hopping is a harness pattern. It is what lets a Replit-built tool manage real sponsor contracts without becoming a data leak.

Pattern three: four to six personalization data points per message

SaaStr’s third operating rule is the one most teams will fail at. Every message Qbee sends carries four to six unique personalization data points. Not merge tags. Not “Hi {first_name}.” Actual signals pulled from the sponsor’s behavior, history, tier, current state, and recent interactions.

This is the work that distinguishes a real autonomous agent from a glorified mail merge. And it is the work that explains the 10x engagement number. Sponsors respond because the messages read as written specifically for them, because they were.

The discipline here is not technical. It is editorial. Someone has to decide which signals matter, which combinations make a message feel personal versus surveillance-like, and which signals are off-limits. Engineering teams cannot make these calls. The function owner has to.
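Once the function owner has made those calls, the rule itself is simple enough to enforce mechanically before a message goes out. A minimal sketch, with hypothetical signal names and a hypothetical off-limits list:

```python
# Hypothetical list: the function owner, not engineering, decides what goes here.
OFF_LIMITS = {"personal_social_media"}


def personalization_ok(signals_used: set, minimum: int = 4, maximum: int = 6) -> bool:
    """Accept a draft only if it uses four to six distinct signals and none are off-limits."""
    if signals_used & OFF_LIMITS:
        return False
    return minimum <= len(signals_used) <= maximum


draft_signals = {"tier", "last_event_attended", "open_deliverables", "recent_reply"}
print(personalization_ok(draft_signals))                          # True: four allowed signals
print(personalization_ok({"tier", "first_name"}))                 # False: only two signals
print(personalization_ok(draft_signals | {"personal_social_media"}))  # False: off-limits signal
```

The check counts signals; it cannot judge whether the combination reads as care or as surveillance. That part stays editorial.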

This is the same pattern we identified in the Klaviyo Composer launch: marketing agent governance requires marketing judgment, not engineering judgment. Qbee extends it to customer success. Customer success agent governance requires CS judgment about which signals constitute care versus creepiness.

The shipping discipline

SaaStr also ships one customer per tier first. They do not flip the whole sponsor list onto Qbee on day one. They pick one sponsor at each tier (top, middle, bottom), run the agent against just those three, and watch what happens for a week. Then they expand.
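The selection step is small enough to write down. The sketch below assumes a sponsor list with a "tier" field; the names and the three tier labels are illustrative, and how SaaStr actually chooses its first sponsors is not described in the source.

```python
def pick_canaries(sponsors, tiers=("top", "middle", "bottom")):
    """Pick the first listed sponsor in each tier; fail loudly if a tier is empty."""
    canaries = {}
    for tier in tiers:
        match = next((s["name"] for s in sponsors if s["tier"] == tier), None)
        if match is None:
            raise ValueError(f"no sponsor available for tier {tier!r}")
        canaries[tier] = match
    return canaries


sponsors = [
    {"name": "Acme", "tier": "top"},
    {"name": "Birch", "tier": "middle"},
    {"name": "Cobalt", "tier": "top"},
    {"name": "Dune", "tier": "bottom"},
]
print(pick_canaries(sponsors))
```

Everything outside the returned set stays on the old process until the week of observation is over.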

This is canary deployment translated for customer-facing work. The cost of a bad release in engineering is a rollback. The cost of a bad release in customer success is a sponsor who feels mistreated, possibly publicly. The shipping cadence has to be slower and more deliberate, because the blast radius is human.

Daily maintenance is the other non-negotiable. Qbee is not a fire-and-forget system. Amelia checks the dashboard every day. She tunes prompts, retires patterns that misfired, adds new signals as the sponsor base evolves. The agent does not run itself. The agent runs the work, and a human runs the agent.

This connects to the broader pattern we have been writing about. Governance is shipping as product. At Qbee scale, governance is shipping as a daily 30-minute review. Different instantiation, same principle: the operating discipline is what makes the autonomy survivable.

What this means for everyone else

Lemkin’s framing of the opportunity is worth quoting directly:

“The distance between what customers need and what CSMs can humanly deliver is the single most valuable place to deploy AI in your B2B business right now.”

He is right about the opportunity. The interesting part is that the same logic applies to every function where the distance between customer need and human capacity has become embarrassing. Support. Onboarding. Renewals. Account management. Partner programs.

In every one of these functions, the playbook from Qbee transfers:

Build the dashboard first. Make the work visible.
Design for agent hopping. Sensitive data never enters the agent.
Demand four to six personalization data points per outbound message.
Ship one customer per tier. Expand by evidence.
Treat daily maintenance as non-negotiable.

None of these are technical disciplines. All of them are operational disciplines that the function owner must own. The Replit app is the easy part. The discipline is the hard part.

Do this now

If you lead a non-engineering function and you are considering an autonomous agent for a recurring, high-volume, personalization-heavy workflow, do not start with the agent. Start with the dashboard. Spend a week sketching the screen that would let you, on any given Monday, see the state of the work and answer “is the agent on track or off track.” If you cannot draw that screen, you are not ready to ship the agent. If you can, you are halfway there. The other half is the daily 30 minutes you commit to running it.

This analysis synthesizes Top 10 Learnings From Building Our Own AI VP of Customer Success: Qbee by Jason Lemkin (SaaStr, May 2026).

Victorino Group helps non-engineering teams ship autonomous agents into production with the daily operating discipline that turns the gap between customer need and human capacity into measurable, governed coverage. Let’s talk.

benn.substack Just Named What Releezy Ships: 'Wins Above Claude.'

Thiago Victorino — Tue, 26 May 2026 00:00:00 GMT

A disclosure first. We sell baseline-relative measurement for AI work. That is exactly why this essay exists: an outside analyst just named the unit we ship, and we would rather quote him than ourselves.

On May 22, benn.substack published a piece called “WAC.” The acronym is borrowed from baseball, where “Wins Above Replacement” measures how many more wins a player produces than a generic minor-league call-up would. Benn proposes a software-buying analogue. Wins Above Claude. Value created above what an integrated Claude plus its MCPs already does out of the box, before any vendor’s wrapper, agent, or “AI feature” is layered on top.

The framing is procurement-friendly, mildly snarky, and structurally correct. It also closes a debate the industry has been avoiding.

The benchmark era is over because the baseline moved

Benchmarks compare models to fixed test sets. The test set is the constant. The model is the variable. That regime worked while frontier models advanced once or twice a year. It does not work now. Benn cites llm-stats.com: 62 AI models released in 126 days. The constant is no longer constant. Any benchmark score with a publish date older than six weeks is describing a different industry.

Worse, benchmarks evaluate models in test conditions. The actual buying decision is about deliverables produced inside a company’s tools, culture, codebase, and workflow. None of those are in the benchmark. A model that scores 78% on SWE-bench can be useless inside a specific monorepo with a specific build system and a specific code review culture. A model that scores 62% can be transformative there. The benchmark cannot tell you which.

WAC flips which end of the equation is held constant. Instead of fixing the test set and varying the model, you fix the deployment context (your company, your tools, your workflows) and vary what is plugged in. The baseline becomes “the default Claude plus its standard MCPs, working on your real problems.” Any vendor pitching an agent, wrapper, or AI feature has to demonstrate value above that baseline. Not against a synthetic eval. Against the thing the buyer can already self-serve for $20 a seat.

Why this generalizes beyond Claude

The acronym is cute but the principle is portable. Substitute any sufficiently capable default. Wins Above ChatGPT Enterprise. Wins Above Gemini Workspace. Wins Above Copilot. The mechanic is the same: there is now a baseline assistant inside the workflow that already does a non-trivial portion of the job, and the only honest measurement is the marginal lift a paid vendor adds on top of it.

This is not a hypothetical. Ask any engineering leader what their developers actually use day to day. The answer involves Claude, ChatGPT, or Copilot more often than any procurement-approved AI tool. The baseline is already there. It is just not on the scoreboard.

That is the procurement consequence. Every “AI productivity” vendor pitch in 2026 is selling you a delta. Most of them are pretending the baseline is zero. Benn’s contribution is naming the lie out loud. The baseline is not zero. The baseline is whatever the default assistant already delivers inside your context, and you have to measure it before you can evaluate anyone’s claim of improvement.

The hiring analogue, which is the most useful part

Benn points at Linear’s hiring practice. Two-to-five-day paid work trials instead of traditional interviews. The candidate does real work, in the real codebase, with the real team, and the team measures real output. Pass the trial, get hired. Fail the trial, get paid for the work and part ways respectfully.

The reason this matters for AI buying is that it solves the same problem benchmarks failed to solve. You cannot evaluate a candidate, human or AI, in a vacuum. The performance is contextual. It depends on the codebase, the tooling, the team norms, the existing review culture. Linear figured out that the only way to know if a senior engineer is actually senior in their context is to put them in their context and measure output. The same is true for an AI vendor. The only way to know if an AI agent produces value above the Claude baseline in your environment is to deploy it in your environment, alongside the baseline, and measure.

The implication: every meaningful AI procurement decision in the next 18 months will involve some version of a paid trial. Not a demo. Not a proof-of-concept slideshow. A real deployment, with real work assigned, measured against the baseline, over enough time to be statistically credible. Vendors who refuse this format are telling you their delta does not survive contact with reality.

What the buyer actually has to build

WAC as a phrase is doing real work. WAC as a measurement system is harder, and this is where most companies will discover the cost of having avoided the problem.

To measure Wins Above Claude, a buyer needs four things they probably do not have. First, a definition of what “winning” looks like for the work in question (shipped tickets, resolved cases, qualified leads, drafted contracts, the unit varies). Second, an instrumented baseline of the default-assistant version of that work over a credible time window (weeks, not hours). Third, a sample of the same work executed with the vendor’s tool in place, ideally split A/B or run sequentially under matched conditions. Fourth, an attribution model that survives the obvious confounders (operator skill differences, ticket difficulty mix, calendar effects).
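The arithmetic for the second and third items can be sketched with the standard library. The numbers below are invented for illustration (shipped tickets per operator-week), and the percentile bootstrap is one reasonable way to get an interval, not a method benn prescribes. The fourth item, attribution, is deliberately out of scope here.

```python
import random
import statistics


def wac_delta(baseline, vendor, resamples: int = 2000, seed: int = 0):
    """Difference in mean wins (vendor minus baseline) with a 95% percentile bootstrap interval."""
    rng = random.Random(seed)
    observed = statistics.mean(vendor) - statistics.mean(baseline)
    deltas = []
    for _ in range(resamples):
        # Resample each arm with replacement and recompute the difference in means.
        b = rng.choices(baseline, k=len(baseline))
        v = rng.choices(vendor, k=len(vendor))
        deltas.append(statistics.mean(v) - statistics.mean(b))
    deltas.sort()
    low = deltas[round(0.025 * resamples)]
    high = deltas[round(0.975 * resamples) - 1]
    return observed, low, high


# Made-up data: tickets shipped per operator-week under each condition.
baseline_weeks = [11.0, 9.0, 12.0, 10.0, 8.0, 10.0]   # default assistant only
vendor_weeks = [13.0, 12.0, 15.0, 11.0, 14.0, 13.0]   # vendor tool in place
observed, low, high = wac_delta(baseline_weeks, vendor_weeks)
print(f"delta = {observed:.1f} wins/week, 95% interval [{low:.2f}, {high:.2f}]")
```

If the interval includes zero, the vendor has not shown a win above the baseline in your context. With samples this small the interval will be wide, which is the argument for measuring over weeks, not hours.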

That is not a benchmark. It is operational measurement infrastructure. Most companies do not run it for their human teams either, which is part of why the problem feels so foreign when applied to AI. Google just expanded its search box for the first time in 25 years to accommodate longer AI queries. The interface is changing because the underlying behavior changed. The measurement interface has to change too. WAC is the procurement-side version of that interface change.

Why we are claiming this now

The reason we wrote this post the same week benn published is that “Wins Above Claude” is the buyer-side name for what we have been arguing on the seller side for nine months. We have called it baseline-relative measurement, lift-over-default, agent-versus-floor. None of those landed. WAC will, because the AI buying community is already trained on benchmarks, and a benchmark replacement gets uptake faster than a brand new category.

We would rather operate inside benn’s vocabulary than ours. The job is the same. Measure the baseline before believing the pitch. Build the trial harness before signing the contract. Treat any vendor that cannot pass a Linear-style work trial in your context as a vendor who has not tested their own claims.

A caution. The danger of naming a category is that the category gets watered down. “WAC-compliant” will appear on vendor decks within a quarter, and most of those decks will be selling the wrong number. The defense is mechanical, not rhetorical. If a vendor cannot describe (a) what your baseline is, (b) how it was measured, (c) over what window, and (d) what delta they are claiming over it, with what confidence, the WAC label is decorative. Ask the four questions every time.

Do this now

Before you take your next AI vendor meeting, run a three-step exercise. Pick one workflow you are considering paying to improve. Measure how the default Claude or ChatGPT assistant performs on that workflow over the next two weeks, instrumented, with at least three operators. That is your baseline. Now make every vendor that walks in claim a specific delta above that number, with a proposed measurement window and confidence interval. The ones who can articulate this get a paid trial. The ones who cannot articulate it get a follow-up call after they figure it out.

The fastest way to make AI procurement honest is to stop letting the baseline be invisible. benn just gave the baseline a name. Use it.

This analysis synthesizes WAC (Wins Above Claude) (benn.substack, May 2026).

Victorino Group helps buyer and seller teams build the baseline-relative measurement that turns AI vendor claims into verifiable deltas. Let’s talk.

Agentic-Agile: Contracts, Not Ceremonies

Thiago Victorino — Mon, 25 May 2026 00:00:00 GMT

Daniel Epstein, Partner Tech Strategist at Microsoft, published a piece in May 2026 arguing that agent development needs Agile. Not prompt engineering. Not better models. Agile. Issues with acceptance criteria. Review gates. Persistent instruction files. Spec-first backlogs. Microsoft even shipped a template repository to operationalize the position.

Read alongside the PFF case study from the same month, the argument seems to short-circuit. PFF deleted standups, sprint planning, refinement, retrospectives, and the product manager role. Two engineers with agents outshipped a team of ten. So which is it: does Agile survive the agent era, or does it not?

Both, because Agile was never one thing.

Agile Was Two Things All Along

Read any retrospective on the 2001 Agile Manifesto and you find a single label covering two very different machines wired together.

The first is the coordination stack. Standups, sprint planning, refinement, retrospectives, demo days, capacity charts. Every artifact in this layer answers a question about humans: when are you free, what is blocking you, how much work can a person hold in their head over fourteen days, how do we keep the team from burning out. The coordination stack is ergonomics. It optimizes scarce, slow, opinionated human attention so a small group of engineers can ship coherent software without colliding.

The second is the contract stack. Issues with acceptance criteria, definition of done, design documents, API contracts, test specifications, review checklists, persistent instruction files. Every artifact in this layer answers a question about the work itself: what does this change actually mean, how do we know when it is correct, what cannot break, what must be true after the merge. The contract stack is specification. It encodes intent precisely enough that someone else, including a future version of yourself, can execute against it without ambiguity.

For twenty years the two stacks looked like one because they ran inside the same ceremony. The standup updated the coordination stack and surfaced gaps in the contract stack at the same time. The retrospective improved coordination and tightened contracts in the same meeting. Disentangling them was unnecessary. The agents made it necessary.

Why Agents Collapse the Coordination Stack

Engineer hours stopped being the scarce resource.

That single sentence is the whole story. We covered the operational evidence in a recent piece on PFF and the org inversion. Mike Spitz, CTO of Pro Football Focus, ran a three-month experiment in early 2026 where two engineers plus agents went up against ten engineers without them. The two-engineer team shipped 25 times more deploys, 10 times more weighted ticket complexity, and lifted CSAT from a 7.5 baseline to 8.6. Along the way they deleted the PM role, sprint planning, daily standups, refinement, and retrospectives. The half-hour huddle every other day was all that survived.

This is what happens when the resource a ceremony was protecting becomes abundant. Standups optimize a constraint, namely human typing speed coordinated across calendars, that no longer binds when an agent fleet runs in parallel. The coordination stack does not break in some dramatic way. It simply stops paying its rent. The ceremonies turn into theater, and the leaders who keep running them on inertia are paying salaries to maintain rituals that protected a constraint that has moved.

Why Agents Amplify the Contract Stack

The opposite is true for the second stack.

Epstein puts it directly: “This is not a model problem; it is a process problem. Upgrading the model does not fix missing acceptance criteria.” His Minthe project surfaced the failure mode at a fidelity that prompt enthusiasts rarely confront. Multiple agents running in parallel drifted from one another. Behavior diverged from spec. The codebase looked correct in isolation and incoherent in aggregate. The only stable source of truth that survived the chaos was the GitHub issue tracker, where the acceptance criteria were explicit enough to anchor every agent back to a single definition of done.

The reason is structural. A human engineer with a vague ticket asks a question, pulls the PM into a hallway, or just makes a judgment call grounded in years of context about the product. An agent with a vague ticket invents an answer. It has no shared context outside the artifact in front of it. The artifact is the contract. If the contract is loose, the agent fills the slack with plausible-sounding nonsense that compiles, passes its own tests, and ships a regression.

Epstein’s other line, the one worth printing on a poster: “If you are catching architectural violations during final review rather than during story execution, your governance is too late.” That is the contract stack stated as governance. The acceptance criteria, the architectural constraints, the persistent instruction files in the repo, the review gates between Plan, Issue, Implement, Review, Merge, and Docs in the Microsoft template repository. Every one of those artifacts moves architectural intent forward from “final review” to “story execution,” where the agent can actually obey it.

The contract stack used to be a quiet supporting cast. Now it is the only thing holding the work together.

The Move: Promote the Contract Layer, Not Add Ceremonies Back

The mistake most leaders are about to make is to read Epstein, panic at the coherence problems Minthe surfaced, and bolt the coordination stack back on top of an agent fleet. Daily standups with agents. Sprint planning with agents. Retrospectives where someone presents agent metrics. This is wasted motion. The coordination stack solves a constraint that is gone. Reinstating it does not help the agents and does not help the humans.

The right move is the opposite. Promote the contract stack to first-class operational status. Treat acceptance criteria with the seriousness a previous generation reserved for sprint planning. Make persistent instruction files versioned artifacts that ship through pull requests like code. Move architectural constraints out of tribal knowledge and into machine-readable rules that gate execution, not review. The phase diagram Microsoft ships in the template, Plan to Issue to Implement to Review to Merge to Docs, is not a workflow you adopt because it looks tidy. It is a workflow you adopt because each transition is a point where contract validation can be enforced before drift compounds.

Said another way: Agile did not survive the agent era. The contract half of Agile survived, and it now carries the load the coordination half used to share.

This Generalizes Past Engineering

The same decomposition shows up everywhere the operating model starts including agents.

Marketing teams are discovering that the campaign brief is the new contract. Where a junior marketer once filled in the blanks with brand instinct, an agent fills them with whatever the brief allows. A loose brief produces a campaign that is technically on-spec and off-brand. The marketing brief used to be a starting point for human conversation. It is becoming a binding artifact, the kind that warrants the same review gates engineers apply to architectural decisions.

Legal teams are running the same play. The matter intake form, the deal memo, the redline guidance document. These used to be context for a human associate. They are becoming the contract that governs what an agent is allowed to draft, redline, or escalate. Firms that invest in tightening intake artifacts are pulling ahead. Firms that treat intake as administrative overhead are watching agent output drift into liability.

Design teams are next, and the contract artifact there is the design system itself. A design system used to be a guide. It is becoming the rule layer that an agent on the canvas must respect. The teams treating their design system as a versioned contract are about to look very different from the teams treating it as documentation.

The line through all three is the same line we drew through engineering. The brief is the contract. The contract is the governance surface. The agent is the executor. Promote the contract layer or accept the drift.

Do This Now

Pick one workstream that already has agents in it. Engineering is fine. Marketing campaigns, legal intake, or design system enforcement work equally well.

In the next sprint or week, do exactly one thing: take the artifact that the agent treats as its source of truth, whether that is a ticket, a brief, a matter intake form, or a design system token file, and rewrite it with full acceptance criteria. Not just “what should happen” but “what cannot happen,” “what must still be true after the work is done,” and “what counts as evidence.” Then make every agent run gate against that artifact before merging, shipping, or filing.
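As a rough illustration, the rewritten artifact can be treated as a structured record with one section per question, plus a gate that refuses to pass while any section is empty or any required evidence is missing. The `Contract` class, its field names, and the `gate` function are hypothetical; the sketch assumes evidence can be matched by label, which a real pipeline would replace with actual test results or review records.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    """Hypothetical shape for the artifact an agent treats as its source of truth."""
    must_happen: list = field(default_factory=list)      # what should happen
    must_not_happen: list = field(default_factory=list)  # what cannot happen
    invariants: list = field(default_factory=list)       # what must still be true afterwards
    evidence: list = field(default_factory=list)         # what counts as evidence

    def gaps(self) -> list:
        """Names of sections that are still empty, i.e. slack the agent would fill."""
        return [name for name, value in vars(self).items() if not value]


def gate(contract: Contract, run_evidence: set) -> list:
    """Return reasons to block a merge, shipment, or filing; an empty list means pass."""
    reasons = [f"contract section '{name}' is empty" for name in contract.gaps()]
    reasons += [f"missing evidence: {e}" for e in contract.evidence if e not in run_evidence]
    return reasons


contract = Contract(
    must_happen=["refund endpoint returns 200 for valid orders"],
    must_not_happen=["no database schema changes"],
    invariants=["existing invoices are unchanged"],
    evidence=["integration test log"],
)
print(gate(contract, run_evidence=set()))
```

A contract with an empty section fails the gate before the agent runs at all, which is where loose contracts become visible.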

You will discover within a week which of your contracts were loose enough that the agent was filling slack with invention. That discovery is worth more than another quarter of debate about whether Agile is alive. The contract stack is what you keep. Everything else is up for renegotiation.

This analysis synthesizes Agentic-Agile: Why Agent Development Needs Agile (Not Just Prompts) (Microsoft Developer Blog, May 2026) and the agentic-agile-template (Microsoft, May 2026).

Victorino Group helps operating teams promote the contract layer of AI work without recreating ceremonies that no longer pay rent. Let’s talk.

The Labs Became Consulting Firms. The Hottest Role Is Forward Deployed Engineer.

Thiago Victorino — Mon, 25 May 2026 00:00:00 GMT

In four weeks, the three frontier labs all admitted the same thing. The product is not the model. The product is the engineer who installs the model.

Anthropic announced a forward-deployed-engineer consulting subsidiary on May 4, 2026, backed by Blackstone, Hellman & Friedman, and Goldman Sachs. OpenAI capitalized its Deployment Company on May 11 with $4B from TPG and Advent at a $14B valuation, then bought Tomoro UK and absorbed 150 forward-deployed engineers across the UK, Asia, and Australia. By late May, Gergely Orosz reported in The Pragmatic Engineer that Google Cloud had compressed its forward-deployed-engineer interview loop from four to six interviews over several weeks down to two interviews in two days.

Two days. From a frontier lab. For an engineering hire.

That is not a hiring policy. That is a structural admission. The labs need humans inside customer accounts faster than the labs can model them. Slow Ventures named the pattern most cleanly: AI Accenture, not Accenture for AI. The labs are not hiring consultants. They are becoming the consultancy, and they are pricing the role like it is on fire.

The labor-market signal

The corporate-development story is loud, but the labor-market story is louder, and harder to argue with.

Kyle Poyar’s May 2026 cut of Sumble data (Growth Unhinged, May 20) reads as the first clean snapshot of what AI is doing to GTM headcount. Overall go-to-market job postings are down 15% year over year in the first half of 2026. SDR and BDR roles are down 21% across the market. Customer support is down 37%, the largest decline of any GTM function. Whole layers of the funnel are being depopulated in real time.

Now the counter-cut. Cursor, Decagon, and OpenAI all doubled their own SDR headcount in the same period. The AI-native vendors whose pitch is “automation replaces sales” are themselves hiring sales faster than anyone. GTM-engineering roles, the hybrid product-plus-pipeline function, doubled year over year to more than 400 open positions across the public market. Sales and solutions engineering combined now make up roughly 60% of all GTM openings.

The picture is not “AI eliminates sales jobs.” The picture is “AI eliminates the bottom of the funnel and pulls the rest of the funnel into engineering.” The work that survives is the work close to the customer’s system of record. The work that dies is the work that scripts a call.

This is the same shape as the FDE announcement. The labs and the AI-native vendors are not predicting a future in which software sells itself. They are building an organization in which engineers sell, install, and operate the software, and the rest of the funnel gets compressed into the model.

Why the apps need this shape

The structural answer for why this is happening sits in Neevash Ramdial’s Tech Bifurcation and the 0.5 Layer (May 2026). Ramdial argues that there is a new infrastructure tier emerging between the foundation model and the application, the layer where agent execution, retrieval, eval, and routing actually live. He points to companies like Turbopuffer ($100M ARR profitable on under $1M raised), Modal ($355M Series C at a $4.65B valuation), and Mintlify (where roughly half of documentation traffic now comes from AI agents reading docs on behalf of human users) as proof that the 0.5 layer is real, large, and capitalized.

The same post cites a demo in which Google’s Antigravity 2.0 built a working operating system in roughly 12 hours, orchestrating 93 sub-agents at a total cost of under $1,000. That is not a feature story. It is a delivery-cost story. The model is now cheap enough and capable enough that the bottleneck is the human work of pointing it at a real customer problem, structuring the agent graph, and operating the result.

That human work has a name. Forward deployed engineer.

We argued in Foundation Labs Are Absorbing Your Stack that the labs were collapsing model, runtime, dev tooling, and consulting into one balance sheet. The FDE buildout is the staffing model under that collapse. The 0.5-layer thesis explains why the staffing model has to look this way. You cannot ship a $1,000 OS-from-scratch demo through a quote-to-cash motion that takes nine months and four discovery calls. You need an engineer who can sit with the customer’s domain expert on Monday and ship the agent graph by Friday.

What “FDE” actually means now

The role itself is older than the lab restructuring. Palantir invented the modern version in the 2010s. The pattern was simple. Send a real engineer into the customer account. Let that engineer become a temporary employee of the customer’s operation. Build the workflow around the customer’s actual data and actual constraints. Leave the workflow installed when you pull the engineer out.

What changed in May 2026 is the volume and the asking price. Anthropic, OpenAI, and Google are now staffing FDE roles at scale, and the comp packages are pulling senior application engineers out of every other corner of the industry. The Google two-day interview loop is the tell. When a frontier lab compresses its hiring process by an order of magnitude, the lab is not relaxing its bar. The lab is admitting that the supply of qualified humans is the constraint, and that every week the loop takes is a week a competitor’s FDE shows up at the customer’s office first.

This is the operating system of the AI-Accenture model. Not a methodology. Not a deck. A bench of engineers, staffed by the lab, sent into customer accounts, paid out of model revenue. The labs do not need a new product to compete with McKinsey. They need a new org chart. They have built it.

What this changes for buyers

Three consequences will land in enterprise procurement and engineering org charts this quarter.

First, you will be sold to by an engineer. The AE will introduce the room and then leave. The work of scoping, demoing, and recommending will be done by someone whose pager rotates back to the lab’s product team. That person will be brilliant, fast, and structurally biased toward the lab’s stack. Plan for that bias the way you would plan for any vendor-staffed solution architect, except more so, because this one writes the code that goes into production.

Second, your own GTM org will hollow at the bottom and thicken in engineering. The Poyar data is not a forecast. It is a measurement. If your sales-development team is more than 20% of your GTM headcount, your peers are already cutting toward that threshold. If your GTM-engineering function does not exist yet, your peers are already staffing it. The roles that survive sit close to customer systems. The roles that disappear sit close to a script.

Third, your delivery model needs an FDE-shaped layer of its own, or you will outsource that layer to whichever lab gets to the customer first. This is the buy-side mirror of the lab consolidation. If you sell software that touches AI, the customer is going to expect a forward-deployed engineer in the room, because that is what every other vendor in their procurement queue is now offering. Build the role internally or rent it from a partner who is not also selling the underlying model. Both options work. “Neither” does not.

Do this now

Run three things on the books this quarter.

Count your FDE-shaped people. The job title does not matter. Count the engineers who can sit in a customer’s office on Monday and ship production code on Friday. If the number is less than 10% of your engineering org and you sell into the enterprise, you have a delivery shortfall that your vendor partners will fill for you within two quarters.

Audit your GTM-engineering function. If it does not exist as a named team with its own budget, name it now. The function lives between product, sales engineering, and pipeline operations. The people staffing it are usually full-stack engineers with a revenue line attached. Sumble’s data shows the role doubling year over year. The market is repricing this work in real time.

Stress-test your single-vendor stacks. If your AI vendor is sending you a forward-deployed engineer, ask the vendor for a written exit plan. What knowledge transfers when the FDE leaves? What runs on your infrastructure versus the lab’s? What does the workflow look like when you swap the model in 18 months? The labs are pricing the FDE role like it is on fire because they know the workflow installed today is the procurement decision locked in tomorrow. Plan the exit while you still have the negotiating leverage of being a new customer.

The AI-Accenture motion is not a prediction. It is an org chart that already exists, capitalized, staffed, and pricing aggressively. The buyers who notice in May 2026 keep their optionality. The buyers who notice in May 2027 are signing the SOW that the FDE wrote last quarter.

This analysis synthesizes The Pulse: Forward-Deployed Engineering Heats Up Again (The Pragmatic Engineer, May 2026), Who’s Actually Hiring in GTM Right Now (Growth Unhinged, May 2026), and Tech Bifurcation and the 0.5 Layer (Neevash Ramdial, May 2026).

Victorino Group helps enterprises build the FDE-shaped delivery layer their AI vendor contracts now assume exists. Let’s talk.

When Microsoft Can't Absorb the Bill, Your CFO Already Made the Decision

Thiago Victorino — Mon, 25 May 2026 00:00:00 GMT

Three independent signals landed in ten days. They do not announce themselves as related. They are.

On May 14, The Verge reported that Microsoft is canceling Claude Code licenses for thousands of engineers across the Experiences and Devices org. Windows. Microsoft 365. Outlook. Teams. Surface. The licenses were rolled out in December 2025. Less than six months later, internal sources told Tom Warren the cutoff was set for the end of June 2026, and the decision was at least partly financial.

On May 19, James Wang at Weighty Thoughts published an analysis showing that 67 to 75 percent of the annual price decline in inference is software-driven, not hardware. The same piece reports that Qwen 3.6 27B, an open-weight model running on a 2022-vintage RTX 3090 Ti, now matches Claude Sonnet on production-relevant tasks including daily briefings, chart annotation, and research triage.

On May 24, TheNextWeb confirmed that DeepSeek made its 75 percent price cut on V4 Pro permanent. New floor: $0.003625 per million input tokens, $0.87 per million output. The same workload that costs $2.50 in and $10.00 out on GPT-5, or $5.00 and $25.00 on Claude Opus 4.7, now runs on a Chinese frontier model for a fraction of a cent in and under a dollar out.

If you read those three stories on the days they published, they looked like three different conversations. Read them together, and the conversation is one: the assumption that closed-API frontier pricing is the floor of your AI cost stack just broke. Microsoft, the company with the deepest discount on the second-largest vendor in the market, decided the bill was too high. That is the canary.

The Software-Driven Majority Is the Structural Shift

The number that matters in Wang’s analysis is not the headline price decline. It is the decomposition.

For three years, “LLMflation” was treated as a hardware story. Better chips, more chips, Nvidia’s roadmap, TSMC’s yield. Guido Appenzeller’s 1000x-in-three-years narrative carried that implicit assumption. The thing getting cheaper was silicon. Wait for the next node and the next generation.

Wang’s measurement reverses that. Two thirds to three quarters of the cost decline traces to software: training data efficiency, distillation, MoE routing, speculative decoding, KV-cache compression, quantization, and the inference stack itself. Hardware contributes the remainder.

This matters for one reason. Hardware gains compound at the foundry’s pace, and they accrue to the cloud that owns the silicon. Software gains compound at the open-source community’s pace, and they accrue to whoever can run the inference, including you on commodity hardware in your own datacenter. When the curve is software-led, on-prem stops being a cost penalty. It becomes a parity option with a different control surface.

That parity is no longer theoretical. Wang’s claim is specific. Qwen 3.6 27B on a four-year-old gaming GPU matches Sonnet on three named task families. Not on coding benchmarks. Not on math olympiad scores. On the actual workloads most enterprises buy frontier models to do: briefing summarization, chart reading, research triage. The hardware cost of the parity is one used 3090 Ti, roughly $700 on the secondary market. The recurring cost of the parity is electricity, which Wang prices at $0.20 to $0.50 per million tokens for open-weight inference in the cloud.

For three years, the on-prem case was “you might save money in two years if the hyperscaler keeps raising prices.” For 2026 Q3, the on-prem case is “you can match the closed-API output today at electricity cost on hardware you may already own.”

Microsoft Is the Canary

Now overlay the Verge story. Microsoft has the most favorable possible commercial terms with Anthropic. It is the deepest pocket in the industry. Its developers are arguably the most aggressive corporate AI users in the world. And it decided that the per-seat Claude Code bill, six months in, did not pencil out.

The Verge piece is careful. It cites two reasons: Microsoft’s strategic alignment toward its own internal coding tools and OpenAI integrations, and the cost. The two are not separable. The cost reason exists because the alternatives are real. If Anthropic were the only viable frontier vendor, Microsoft would absorb the bill the way enterprises absorbed Oracle for two decades. It is not, so Microsoft did the math.

That math is now available to every CFO. If Microsoft cannot absorb a per-seat Claude Code bill at hyperscaler scale, your finance team should not assume your shop can absorb it at enterprise scale. The right question stopped being “how far can we negotiate the per-seat price down.” It became “what is the multi-model portfolio that keeps us inside the cost envelope when our usage doubles, which it will.”

This is the convergence point. DeepSeek shows the closed-API floor is moving. Wang shows the open-weight ceiling has caught up on real tasks. Microsoft shows the largest customer in the market is already routing around it. Three signals, three sources, one conclusion: closed-API single-vendor AI is a position, not a default.

The 2026 Q3 Evaluation Framework

A framework that survives this repricing has three layers. They are not glamorous. They are what your CFO will ask for next quarter.

Layer one: task-level cost benchmarking, not seat-level. Stop pricing AI by the seat. Price it by the task. A daily briefing summary at 8,000 tokens in and 1,500 tokens out costs about $0.078 on Claude Opus 4.7 and $0.035 on GPT-5 at the list prices quoted above, roughly $0.001 on Gemini 3.5 Flash, and effectively electricity on a self-hosted Qwen. Multiply by your weekly volume and the seat license becomes a rounding error or a 10x premium, depending on which task and which model. Your finance team should see that grid before signing the next renewal.
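The grid itself is a few lines of arithmetic. The sketch below uses the per-million-token list prices quoted earlier in this piece; the weekly volume of 5,000 runs is an assumption for illustration, and real invoices will differ with caching, batch discounts, and negotiated rates.

```python
# Per-million-token list prices in USD (input, output), as quoted earlier in this piece.
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5": (2.50, 10.00),
    "DeepSeek V4 Pro": (0.003625, 0.87),
}


def cost_per_task(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Cost in USD of one task, given per-million-token prices."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def cost_grid(tokens_in: int, tokens_out: int, runs_per_week: int) -> dict:
    """Map each model to (cost per task, cost per week)."""
    grid = {}
    for model, (price_in, price_out) in PRICES.items():
        per_task = cost_per_task(tokens_in, tokens_out, price_in, price_out)
        grid[model] = (per_task, per_task * runs_per_week)
    return grid


# Daily-briefing shape from the text; 5,000 runs per week is an assumed volume.
for model, (per_task, per_week) in cost_grid(8_000, 1_500, 5_000).items():
    print(f"{model:16s} ${per_task:.4f}/task  ${per_week:,.2f}/week")
```

Swapping in your own token counts and volumes per workload turns this into the grid finance should see before the renewal.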

Layer two: a portfolio of three model tiers, routed by task. Tier one is frontier-closed (Claude, GPT-5, Gemini Pro) for the work that genuinely requires the ceiling: novel reasoning, high-stakes generation, complex tool orchestration. Tier two is mid-cost closed (Flash, Haiku, GPT-5 mini) for the high-volume routine: extraction, classification, formatting, simple drafting. Tier three is open-weight self-hosted or cheap-cloud (Qwen, Llama, DeepSeek) for the workloads where Wang’s parity claim holds: briefing, triage, annotation, internal Q&A. The routing logic is the governance layer. Without it, you default to tier one for everything and pay the Microsoft bill.
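The routing logic does not need to be sophisticated to count as governance. A minimal sketch, assuming tasks arrive already labeled by type: the task categories mirror the tier descriptions above, while the snake_case labels, the `route` function, and the choice of fallback tier are illustrative assumptions.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER_CLOSED = 1  # e.g. Claude, GPT-5, Gemini Pro
    MID_COST_CLOSED = 2  # e.g. Flash, Haiku, GPT-5 mini
    OPEN_WEIGHT = 3      # e.g. Qwen, Llama, DeepSeek


# Task categories from the tier descriptions in the text; the labels are illustrative.
ROUTES = {
    "novel_reasoning": Tier.FRONTIER_CLOSED,
    "high_stakes_generation": Tier.FRONTIER_CLOSED,
    "complex_tool_orchestration": Tier.FRONTIER_CLOSED,
    "extraction": Tier.MID_COST_CLOSED,
    "classification": Tier.MID_COST_CLOSED,
    "formatting": Tier.MID_COST_CLOSED,
    "simple_drafting": Tier.MID_COST_CLOSED,
    "briefing": Tier.OPEN_WEIGHT,
    "triage": Tier.OPEN_WEIGHT,
    "annotation": Tier.OPEN_WEIGHT,
    "internal_qa": Tier.OPEN_WEIGHT,
}


def route(task_type: str, default: Tier = Tier.MID_COST_CLOSED) -> Tier:
    """Pick a tier for a task; unknown task types fall back to `default`."""
    return ROUTES.get(task_type, default)


print(route("briefing").name)
print(route("something_new").name)
```

The fallback is the design decision that matters. Defaulting unknown work to the mid tier is an assumption made here; a team that prefers safety over cost could default to tier one, as long as the choice is explicit rather than accidental.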

Layer three: an on-prem evaluation, with real numbers. Not a strategy slide. An actual procurement model. What does it cost to stand up a single inference node capable of serving 100 internal users on Qwen 3.6 27B? Hardware: $4,000 to $8,000 for a current-gen GPU server. Power: $300 to $600 per month. Engineering: one infrastructure engineer at 20 percent allocation for the first quarter, 5 percent steady state. Total Year 1: $40,000 to $70,000. Compare that to a 100-seat Claude Code license at $200 per seat per month, which is $240,000 per year. The math does not require optimism. It requires arithmetic.
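The comparison can be written down directly. This sketch takes the Year 1 on-prem range and the seat price from the paragraph above as given and computes the license multiple and the break-even seat count. It assumes the node genuinely replaces the seats for the workloads in question, which only holds where the open-weight parity claim holds.

```python
import math


def annual_seat_cost(seats: int, per_seat_month: float) -> float:
    """Yearly cost in USD of a per-seat license."""
    return seats * per_seat_month * 12


def break_even_seats(on_prem_year1: float, per_seat_month: float) -> int:
    """Smallest seat count at which licensing costs at least as much as on-prem Year 1."""
    return math.ceil(on_prem_year1 / (per_seat_month * 12))


SEAT_PRICE = 200.0                # USD per seat per month, from the text
ON_PREM_YEAR1 = (40_000, 70_000)  # USD range for Year 1, from the text

license_cost = annual_seat_cost(100, SEAT_PRICE)
for total in ON_PREM_YEAR1:
    print(f"on-prem ${total:,}: license is {license_cost / total:.1f}x, "
          f"break-even at {break_even_seats(total, SEAT_PRICE)} seats")
```

Replace the two constants with your own renewal quote and infrastructure estimate; the break-even seat count is the number to bring to the budgeting conversation.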

Do This Now

Three actions, this quarter, before Q3 budgeting closes.

First, get your top 10 AI workloads listed by task volume and current model. If you do not have this list, your AI budget is opinion, not measurement. Build the grid.

Second, run a one-week parallel inference test on the three highest-volume workloads using one frontier model, one mid-cost model, and one open-weight model. Score for output quality, latency, and cost per task. The results will surprise you in at least one direction. They always do.

Third, ask your infrastructure team for a single-page on-prem cost model for the workloads where open-weight parity holds. Not a commitment. A number. Put it next to the closed-API renewal quote when it arrives.

The leaders who survive the cost curve repricing will not be the ones who picked the right vendor in 2024. They will be the ones whose portfolio was built to assume the floor would move, the ceiling would come down, and the largest customer in the market would do the math before they did. The Microsoft cancellation is not an outlier. It is the leading indicator. The CFOs who read the signal in May will renegotiate in July. The ones who do not will absorb the bill until attrition forces the conversation.

The decision Microsoft made in May is the decision your CFO will make by Q4. Whether you bring the framework or the framework is imposed on you is the only thing still open.

This analysis synthesizes DeepSeek V4 Pro 75 Percent Price Cut Permanent (TheNextWeb, May 2026), AI’s Plummeting Prices Are a Software Story (Weighty Thoughts, May 2026), and Microsoft Starts Canceling Claude Code Licenses (The Verge, May 2026).

Victorino Group helps finance and engineering leaders design multi-model AI portfolios that survive the cost-curve repricing. Let’s talk.

AI Washing: Marketing's First Real Governance Incident

Thiago Victorino — Mon, 25 May 2026 00:00:00 GMT

On May 24, The Guardian published a piece that reads like a marketing post-mortem written before the funeral. Aisha Down quoted an anonymous central London PR account director with a clean number: roughly half of the pitches she receives overstate the AI involvement of the product they describe. Half. Not the long tail. The median pitch.

That is what marketing’s first real governance incident looks like.

The number on its own would be a curiosity. What makes it operational is what surrounds it. The same week, Standard Chartered’s CEO Bill Winters publicly apologized for calling workers displaced by AI “lower-value human capital” during a Bloomberg interview on May 22. Allbirds, the sneaker company, pivoted its public narrative in April toward “acquiring AI GPUs,” a sentence that means nothing in the context of selling shoes and means something specific to investors. PR practitioners interviewed by The Guardian cited press releases for “AI-powered basketball hoops” and “AI-powered lasers” as the working examples of where the genre has gone.

These are not outliers. They are the shape of the year.

The diagnosis: marketing has no equivalent of code review

Engineering shipped its governance layer in the last 18 months. Pre-commit hooks. CI pipelines. Static analyzers. Eval suites for LLM features. A human-readable code-review step before anything reaches main. The work is unglamorous, the controls are imperfect, and they exist. When an engineer writes is_ai_powered = True on a function that calls a regex, four other engineers see the line before it ships.

When a PR firm writes “AI-powered” in a press release about a basketball hoop, the equivalent review does not happen. The agency drafts it. The brand approves the spirit. Legal scans for libel. Nobody asks: is this factually true? What does the model do? Where is it called? What is the underlying mechanism? The closest analogue to a code review in PR is a pull-quote review, and pull-quote reviews check tone, not truth.

Half the pitches overstate AI involvement because the function that produces those pitches has no formal mechanism to catch the overstatement. The PR director quoted by The Guardian did not describe a malicious industry. She described a default. When the incentive is to get coverage and the verification step does not exist, the median output drifts toward the most coverage-friendly framing, which right now means “AI-powered.”

This is what an unguarded surface looks like at scale. We named the pattern in Your Marketing Team Just Became a Governance Team and again in Marketing’s Governance Reckoning. The Guardian reporting is the field evidence.

The Standard Chartered moment

Standard Chartered matters in a different way. Bill Winters is the chief executive of a global bank. The phrase “lower-value human capital” did not appear in a press release a junior wrote at 11 p.m. It appeared in a live Bloomberg interview. He apologized within 48 hours. The apology is the data point: the company recognized, fast, that the framing was a brand-safety event.

Read the sequence carefully. A CEO speaks. The market processes. The CEO retracts. That is the unmediated path between executive language and reputational consequence. It is also the path that marketing, in most large firms, is now structurally responsible for governing, because the alternative is letting the CEO improvise on live television with no review of the language patterns the company has decided are off-limits.

This is the same point made by Allbirds in reverse. A consumer brand publicly explaining its pivot toward “AI GPUs” is producing language that is both factually thin and obviously aimed at the investor who has lost patience with sneakers and is willing to forgive losses if the word AI is present. The market reads that signal correctly: not as conviction, but as positioning. The damage is to long-term credibility, which is the asset marketing exists to protect.

The external clock: the SEC is already enforcing

The voluntary window for fixing this is closing. The U.S. Securities and Exchange Commission has been bringing enforcement actions for AI-washing in securities filings since 2024. In March 2024 the SEC settled with two investment advisers for $400,000 over false AI claims. In June 2024 it charged Joonko’s founder with defrauding investors of $21 million by claiming proprietary AI matching that did not exist. The Office of the Investor Advocate flagged AI-washing as a 2025 priority. The legal mechanism is operational, the case law is being built, and it applies to any communication that touches a securities filing, which in a public company is most external communication.

What the SEC is doing for filings, plaintiffs’ attorneys will do for consumer claims, and regulators in the UK and EU will do under their own frameworks. The EU’s AI Act creates disclosure requirements that already conflict with the casual “AI-powered” claim. The UK’s CMA has signaled scrutiny of AI marketing claims under existing consumer-protection law. The pattern is the same one engineering saw with security and accessibility a decade ago: voluntary discipline before mandatory enforcement, then mandatory enforcement for everyone who did not adopt voluntary discipline.

The firms that install factual-claim review now will look, in 24 months, like the firms that adopted SOC 2 before customers required it.

What the review layer actually looks like

A factual-claim review for marketing copy is not exotic. It is four questions, asked before any external surface ships, by someone with the authority to say no.

First: is there a model in this product? Yes or no. Not “machine learning informs,” not “AI-enabled,” not “powered by.” Is there a model that runs on input and produces output? If no, the word AI does not appear in the copy.

Second: if yes, what does the model do? One sentence in plain language. “It ranks candidates by skills match.” “It generates draft email replies.” “It classifies invoices.” If the answer is more than one sentence, the copy needs more specificity, not less.

Third: what is the user-visible effect? Speed, accuracy, coverage, cost. A number, a range, or a comparison. “Reduces classification time from 15 minutes to 30 seconds.” If you cannot produce a measurable user-visible effect, the AI is not the story.

Fourth: who signs off that the previous three answers are true? Name. Not role. Not team. Person.

That is the entire mechanism. It is a 15-minute review. It is also the difference between half your pitches overstating AI involvement and none of them doing it.
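The four questions map naturally onto a small record with a blocking check. This is a sketch under stated assumptions: the `ClaimReview` class and its field names are hypothetical, and the check only verifies that each answer is present. Whether the answers are true remains the job of the named reviewer.

```python
from dataclasses import dataclass


@dataclass
class ClaimReview:
    """Hypothetical record for one external AI claim; field names are illustrative."""
    claim: str
    has_model: bool              # Q1: is there a model in the product?
    model_function: str = ""     # Q2: one plain-language sentence on what it does
    measurable_effect: str = ""  # Q3: a number, a range, or a comparison
    signed_off_by: str = ""      # Q4: a named person, not a role or team

    def blockers(self) -> list:
        """Reasons the copy cannot ship as written; an empty list means it passes."""
        if not self.has_model:
            return ["no model in product: remove the word AI from the copy"]
        problems = []
        if not self.model_function.strip():
            problems.append("missing one-sentence description of what the model does")
        if not self.measurable_effect.strip():
            problems.append("missing measurable user-visible effect")
        if not self.signed_off_by.strip():
            problems.append("missing named sign-off")
        return problems


review = ClaimReview(claim="AI-powered basketball hoop", has_model=False)
print(review.blockers())
```

Keeping one record per claim also produces the audit trail for free: the list of reviews is the list of claims the firm has made.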

The counter-argument, acknowledged

Some marketing leaders will read this and say the function has always managed factual accuracy. Legal reviews exist. Compliance reviews exist. The industry has been writing about AI claims for two years.

This is true and not enough. Legal reviews check for actionable false statements, not for the soft inflation that produces half the pitches overstating AI. Compliance reviews check disclosure requirements, which in most jurisdictions still do not specifically cover AI claims for non-regulated products. The two-year conversation produced essays and panel discussions, not pre-publication checklists with named owners. The Guardian’s reporting documents a function that has been aware of the problem and has not built the mechanism to fix it.

In the same way, engineering’s awareness of security in 2010 did not produce SOC 2 by accident. Someone had to build the checklist, name the owner, run the audits, and accept that some campaigns would be slower and some claims would be smaller. The firms that did that work first now sell to enterprise customers without a six-month security review every time. The firms that did not are still doing the work, just under deadline.

Do this now

If you lead marketing or communications, three actions before Friday.

Pull every external claim about AI your firm has made in the last 90 days. Press releases, product pages, sales decks, executive speeches. List them. The list itself is the audit.

For each claim, run the four questions above. Mark each line green, yellow, or red. Yellow means the claim is defensible but vague and should be tightened. Red means the claim is wrong and needs correction or retraction.

Name the owner of the factual-claim review going forward. One person. Calendar a weekly 30-minute review block. Put it in the comms calendar with the legal review and the brand review. Make the review a published gate that copy must pass before external release.

The function exists in engineering and works. The function does not exist in marketing and the bill is arriving. The PR director quoted anonymously by The Guardian was describing an industry waiting for permission to do the work. The permission is the calendar block.

This analysis synthesizes AI Washing: PR Firms Scrambling to Rebrand (The Guardian, May 2026).

Victorino Group helps marketing and communications leaders install factual-claim review processes before regulators do. Let’s talk.

Anthropic Just Repriced Itself. Your Procurement Playbook Is Stale.

Thiago Victorino — Mon, 25 May 2026 00:00:00 GMT

In one week of May 2026, Anthropic posted four numbers that, taken together, broke the procurement frame that most enterprises were using last quarter. Projected Q2 revenue of $10.9 billion, up 127% quarter on quarter. Projected profit of $559 million, the first material profit posted by any frontier lab. Compute cost ratio falling from $0.71 to $0.56 per revenue dollar in a single quarter. And 54% of new enterprise logos arriving through self-serve, with full ACVs, terms, and invoicing handled without a salesperson. The lab repositioned itself as market leader ahead of an October IPO window. Most procurement playbooks did not reprice.

This piece is not about whether Anthropic will hit the IPO. It is about what changes for the buyer when the lab moves from “expensive specialist behind OpenAI” to “highest-revenue, profitable frontier lab approaching listing.” The numbers say the move has already happened. The contracts your team is renewing in Q3 should reflect that.

The Financial Inflection in One Paragraph

Anthropic Q1 2026 revenue: $4.8 billion. Q2 projected: $10.9 billion. Profit projected: $559 million. Annualized run-rate: roughly $40 billion. Valuation: up to $950 billion (per Sherwood / The Information), now ahead of OpenAI’s $850 billion mark on the latest secondary prints. Claude Code alone is doing $2.5 billion in standalone revenue. Compute cost per revenue dollar dropped from $0.71 to $0.56 quarter on quarter, the first time a frontier lab has shown operating leverage on the input side rather than just the output side (Contrary Research, May 2026). OpenAI’s Q1 was $5.7 billion (Sherwood News, May 2026). The valuation crossover is the headline, but the cost-ratio compression is the story. A lab that profits while it grows is a lab that does not need to discount.
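The relationships between these figures are easy to check by hand. The sketch below uses only the numbers in the paragraph above and assumes the $0.71 ratio applies to Q1 and the $0.56 ratio to Q2. A simple 4x annualization of Q2 is one convention among several, so it need not match the quoted run-rate exactly.

```python
q1_revenue = 4.8   # USD billions, from the text
q2_revenue = 10.9  # USD billions, projected, from the text

qoq_growth = q2_revenue / q1_revenue - 1  # quarter-on-quarter growth
annualized = q2_revenue * 4               # simple 4x annualization of Q2

# Compute cost per revenue dollar, before and after (from the text).
ratio_before, ratio_after = 0.71, 0.56
margin_before = 1 - ratio_before  # share of each revenue dollar left after compute
margin_after = 1 - ratio_after

print(f"QoQ growth: {qoq_growth:.0%}")
print(f"Q2 annualized: ${annualized:.1f}B")
print(f"Revenue left after compute: {margin_before:.0%} -> {margin_after:.0%}")
```

The last line is the operating-leverage point in one number: the share of each revenue dollar that survives the compute bill rises by about fifteen points in a single quarter.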

Self-Serve at 54% Is the Pricing Power Signal

Eleanor Dorfman, Head of Industries at Anthropic, told SaaStr that 54% of new enterprise logos in 2026 are arriving through self-serve channels, with full ACVs, terms, and invoicing handled through the PLG motion. The sales org was rebuilt from scratch in four months between January and April 2026.

Read what that means for the buyer. When a vendor’s enterprise pipeline runs through self-serve, it has very little incentive to negotiate against rate cards. The marginal customer arrives without touching procurement; the marginal customer pays list. Anthropic’s commercial team can hold the line on every deal that does come to the table, because the average new logo proved willing to swipe a card. That is why the “Anthropic is expensive” complaint from buyers has stopped converting into discounts. The data says it is not expensive enough to slow demand.

The procurement implication is uncomfortable. If your renegotiation strategy assumes the vendor needs your renewal to hit a quota, the strategy is outdated. The vendor’s quota is filling itself.

Four Chip Vendors and a $1.25B/Month Compute Commitment

While the revenue line was repricing, the compute line was diversifying. As of May 2026, Anthropic runs production load across four chip vendors: Nvidia, AWS Trainium, Google TPU, and Microsoft Maia 200. Satya Nadella, on the April Microsoft earnings call, cited Maia 200 at “+30% tokens per dollar versus the latest silicon” (CNBC, May 21, 2026). SpaceX disclosures revealed Anthropic’s compute commitment runs at $1.25 billion per month through May 2029, a total north of $50 billion (CNBC, May 21, 2026).

Two consequences for the buyer.

First, the cost-ratio drop from $0.71 to $0.56 was not a one-off. It is the early signal of a multi-vendor silicon supply chain compressing input costs structurally. Buyers who expected Anthropic to be capacity-constrained and therefore willing to negotiate were modeling the wrong scarcity. Capacity is being built.

Second, the $1.25 billion monthly compute spend is now a fixed cost that has to clear margin. That is the lock-in math behind the June 15 access surface changes and the tightened commercial terms covered there. The compute is paid for. The customers have to be billed for it. Self-serve plus closed harness plus four-chip cost compression is one financial machine, not three separate moves.

The Acquisition Tells You the Roadmap

Anthropic’s new consulting venture made its first acquisition this month: Fractional AI, which ended an 11-month OpenAI partnership to join the deployment arm (Bloomberg, May 21, 2026). The venture is backed by Blackstone ($1.3T AUM), Apollo, GIC, and Sequoia. A frontier lab buying a deployment firm five months before an IPO window is not buying for revenue. It is buying for the gross margin that comes from selling services on top of the model, and for the case studies that justify the model price.

We covered the general pattern in foundation labs absorbing the stack. What is new this month is that the absorption is now visibly funded. When the buyer of your AI implementation work is also the seller of the model, the negotiation surface for the implementation contract collapses. Multi-year managed-services agreements signed with Anthropic-aligned firms in 2026 are exposed to the model vendor’s future pricing decisions in a way that 2024 contracts were not.

What This Means for Your Q3 Procurement Cycle

Three changes to make before the next renewal window closes.

Renegotiate the rate card now, on the data you have. If your team is sitting on a Q3 renewal, do not wait for the IPO. The numbers say Anthropic will be more expensive and less flexible after listing than before. Bring the renewal forward, lock the rate, and shape the contract with usage caps that protect you against the cost ratio reversing on a price hike. The leverage you have today is informational: you can cite the cost-ratio compression and the self-serve number and demand to share in the operating leverage. After October, the same conversation is a price-taker conversation.

Build the multi-vendor silicon hedge into your AI infrastructure plan, not just your model plan. Anthropic runs four chip vendors. Your AI infrastructure should not run on the assumption that any single one of them is the floor. Document which workloads can move between Trainium, TPU, Maia, and Nvidia-backed inference. The hedge that matters in 2026 is not “Claude vs GPT,” it is “what happens to my unit economics when the underlying silicon mix shifts.” We argued the broader vendor-risk frame in frontier-capacity scarcity creates vendor risk. The silicon layer is where the cost actually lives.

Set an agent-spend ceiling per team, with a quarterly review tied to output. Self-serve at 54% means your engineers, marketers, and analysts are putting Anthropic charges on the corporate card without going through procurement. That is fine when the spend is $200 per seat. It stops being fine when Claude Code-style agentic workloads push the per-seat number into four digits. Set the ceiling at the team level, require a quarterly output review against the spend, and treat the conversation as performance management rather than cost management. The cost is the symptom; the question is whether the team is shipping more because of the agents or simply spending more.
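The quarterly review in the third change can be sketched in a few lines. This is an illustration under stated assumptions: the TeamQuarter record, the output_units measure, and the sample figures are all hypothetical, and each team would need its own definition of output.

```python
from dataclasses import dataclass


@dataclass
class TeamQuarter:
    team: str
    agent_spend: float   # USD spent on agents in the quarter
    ceiling: float       # USD ceiling agreed for the team
    output_units: int    # hypothetical output measure, e.g. shipped work items


def quarterly_review(prev: TeamQuarter, curr: TeamQuarter) -> dict:
    """Compare spend growth with output growth for one team."""
    if prev.agent_spend <= 0 or prev.output_units <= 0:
        raise ValueError("previous quarter needs non-zero spend and output")
    spend_growth = curr.agent_spend / prev.agent_spend - 1
    output_growth = curr.output_units / prev.output_units - 1
    return {
        "team": curr.team,
        "over_ceiling": curr.agent_spend > curr.ceiling,
        "spend_growth_pct": round(spend_growth * 100, 1),
        "output_growth_pct": round(output_growth * 100, 1),
        # The performance question: is spend rising faster than output?
        "spending_more_without_shipping_more": spend_growth > output_growth,
    }


# Hypothetical sample data for one team.
q1 = TeamQuarter("platform", agent_spend=20_000, ceiling=45_000, output_units=40)
q2 = TeamQuarter("platform", agent_spend=50_000, ceiling=45_000, output_units=44)
report = quarterly_review(q1, q2)
print(report)
```

The last flag is the one to bring to the review. Being over the ceiling starts the conversation; spend outrunning output is the conversation.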

Do This Now

This week, pull two numbers from your AP system. First, the rate of growth in Anthropic-related spend across the company since January. Second, the share of that spend that flowed through self-serve rather than a master agreement. If the growth rate is double-digit monthly and the self-serve share is above 30%, you are running a Q2 2026 procurement reality on a 2025 contract structure. The fix is not a new vendor. The fix is a new contract, signed in Q3 2026, that prices what is already happening rather than what your team negotiated last year.
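A minimal sketch of the two-number check, assuming you can export monthly totals split by purchase channel. The MonthlySpend record and the sample figures are hypothetical; the thresholds (double-digit monthly growth, self-serve share above 30%) come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class MonthlySpend:
    month: str               # e.g. "2026-01"
    master_agreement: float  # USD billed under a negotiated contract
    self_serve: float        # USD billed through self-serve channels

    @property
    def total(self) -> float:
        return self.master_agreement + self.self_serve


def monthly_growth_rate(spend: list) -> float:
    """Compound monthly growth between the first and last month, in percent."""
    periods = len(spend) - 1
    if periods < 1 or spend[0].total <= 0:
        raise ValueError("need at least two months and a non-zero starting month")
    return ((spend[-1].total / spend[0].total) ** (1 / periods) - 1) * 100


def self_serve_share(spend: list) -> float:
    """Share of all spend in the period that went through self-serve, in percent."""
    total = sum(m.total for m in spend)
    if total <= 0:
        raise ValueError("no spend recorded")
    return sum(m.self_serve for m in spend) / total * 100


def contract_is_stale(spend: list) -> bool:
    """Apply the two thresholds from the text."""
    return monthly_growth_rate(spend) >= 10.0 and self_serve_share(spend) > 30.0


# Hypothetical export, January through May.
sample = [
    MonthlySpend("2026-01", 8_000, 2_000),
    MonthlySpend("2026-02", 8_500, 3_500),
    MonthlySpend("2026-03", 9_000, 5_000),
    MonthlySpend("2026-04", 9_500, 7_500),
    MonthlySpend("2026-05", 10_000, 10_000),
]
print(f"Monthly growth: {monthly_growth_rate(sample):.1f}%")
print(f"Self-serve share: {self_serve_share(sample):.1f}%")
print(f"Contract structure stale: {contract_is_stale(sample)}")
```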

The financial inflection is real. The lab is profitable, valuation is ahead of OpenAI, compute is diversified across four vendors, and the sales motion does not need you. The procurement playbook that worked when Anthropic was the expensive specialist behind OpenAI does not work when Anthropic is the highest-revenue, fastest-growing, soon-to-be-public frontier lab. Reprice your assumptions before October prices them for you.

This analysis synthesizes Anthropic’s March to Profitability (Contrary Research, May 2026), Report: OpenAI’s Q1 Revenue Was $5.7B (Sherwood News, May 2026), Microsoft Maia AI Chip for Anthropic (CNBC, May 2026), How Anthropic Rebuilt Its Sales Org from Scratch (SaaStr, May 2026), and Anthropic’s New Consulting Venture Makes Its First Acquisition (Bloomberg, May 2026).

Victorino Group helps procurement teams reprice AI vendor contracts before quarterly cycles lock in stale assumptions. Let’s talk.

Cursor's Operating Layer: When Cloud Agents Need Enterprise IT

Thiago Victorino — Mon, 25 May 2026 00:00:00 GMT

On May 21, Josh Ma at Cursor published “Lessons Learned from Building Cloud Agents.” It is the most operationally precise first-party case study any agent vendor has shipped in 2026. Strip the marketing layer and what remains is a confession: the parts that made cloud agents reliable were not the model, the prompt, or the orchestration framework. They were the boring enterprise infrastructure the team initially treated as a detail.

The post says it plainly. “Enterprise IT for agents: secret redaction, network policies, credential management.” That phrase deserves to be highlighted, screenshotted, and shown to every executive who still believes agent reliability is a model problem.

What Actually Moved the Needle

Cursor names four specific changes and what each one bought them. None of them are about the agent.

Durable execution via Temporal. Migrating cloud-agent workflows to Temporal lifted reliability from “one nine” to “two nines.” Temporal now handles 50 million actions per day across 7 million workflows for Cursor. Workflow state survives crashes, redeploys, and infrastructure failures. The agent does not need to remember where it was, because the workflow runtime does.

Isolated developer environments per task. Each cloud agent runs inside a fully provisioned dev environment with dependencies, services, and the right secrets in scope. Josh Ma calls this “the single biggest factor in cloud agent output quality.” Not the model. Not the prompt. The environment.

Self-healing infrastructure. When a workflow stalls or an environment misbehaves, the platform restarts the unit of work without the agent author writing recovery code. Reliability moves from heroic exception handling to an operational default.

Decoupling agent state from conversation state. This is the architectural primitive that makes the other three feasible. The conversation is one resource. The workflow is another. Killing or replaying one does not corrupt the other. It is the same separation Temporal users have used for a decade to keep payment flows alive across deploys, applied to a code-generation loop.
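The separation is easier to see in code than in prose. What follows is a standard-library sketch of the idea, not Temporal's API and not Cursor's implementation: the WorkflowStore class, the step names, and the simulated crash are all hypothetical. The workflow's progress lives in a store outside the process; the conversation is just a list in memory.

```python
import json
import tempfile
from pathlib import Path


class WorkflowStore:
    """Record of completed steps, persisted outside the running process."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def record(self, step: str, result: str) -> None:
        history = self.load()
        history[step] = result
        self.path.write_text(json.dumps(history))


def run_workflow(store: WorkflowStore, steps: list, crash_after=None) -> list:
    """Run steps in order, skipping any the store already holds.

    Returns the names of the steps executed in this call. If crash_after is
    set, a crash is simulated once that many steps have run in this call.
    """
    history = store.load()
    executed = []
    for name, action in steps:
        if name in history:
            continue  # finished before the crash; do not redo it
        if crash_after is not None and len(executed) == crash_after:
            raise RuntimeError("simulated host crash")
        store.record(name, action())
        executed.append(name)
    return executed


STEPS = [
    ("clone_repo", lambda: "ok"),
    ("edit_files", lambda: "ok"),
    ("run_tests", lambda: "ok"),
    ("open_pull_request", lambda: "ok"),
]


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        store = WorkflowStore(Path(tmp) / "workflow.json")
        conversation = ["user: fix the flaky test"]
        try:
            run_workflow(store, STEPS, crash_after=2)
        except RuntimeError:
            conversation.clear()  # the conversation dies with the process
        # A fresh process picks up from the store, not from the conversation.
        return run_workflow(store, STEPS)


print(demo())
```

The second run executes only the two steps the crash interrupted. Losing the conversation cost nothing, because the conversation was never where the progress was kept.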

The result: 40 percent of internal Cursor monorepo pull requests now originate from cloud agents. That number is only credible because the four primitives above exist underneath it.

The Vendor Just Admitted the Abstraction Was Wrong

Read the post once for the lessons. Read it again for the framing.

A vendor whose business depends on selling cloud agents just published a long-form essay arguing that the agent is not where reliability lives. Reliability lives in workflow durability, environment isolation, credential hygiene, and state separation. Those are not features you buy with an agent license. They are properties of the operating layer underneath.

This matters because the dominant 2026 sales pitch has been the inverse: buy the agent, get the reliability. Cursor is now publicly saying that pitch was incomplete. The most experienced cloud-agent vendor in the market reached two nines of reliability by spending engineering cycles on Temporal, sandboxes, secrets management, and network policy. The exact same investments any enterprise platform team would make for any production system handling sensitive code and credentials.

This is the governance-as-product thesis arriving from the vendor side of the table. It is also a quiet correction of the “agents are different, the old rules do not apply” narrative that drove a lot of 2025 procurement.

The Procurement Checklist

If Cursor needed these four primitives to ship cloud agents internally, every other team using or building cloud agents needs them too. They are not vendor-specific. They are properties of the operating environment any autonomous code-generation system requires.

Treat them as a procurement checklist. If a vendor pitches you a cloud-agent product, ask:

1. Durable execution. Does your agent workflow runtime survive crashes and redeploys without losing in-flight work? What is the underlying engine? If the answer is “we retry from the conversation,” that is not durability. That is hope.

2. Isolated execution environments. Does each agent task run in its own provisioned environment with scoped credentials, or does it share a long-lived sandbox? Per-task isolation is the difference between a contained blast radius and a shared one.

3. Self-healing infrastructure. When a task stalls, who restarts it? If the answer involves an on-call engineer reading logs, you are buying a beta. If the answer is “the platform handles it and emits an audit event,” you are buying production.

4. Decoupled state. Can you kill a misbehaving conversation without losing the workflow that conversation triggered? Can you replay the workflow against a different model without rewriting the prompt? Conversation and execution are two resources, not one.

These four questions filter cloud-agent vendors faster than any feature matrix. They also map directly onto governance properties that auditors care about: durable execution produces an audit trail by construction, isolated environments produce per-task credential scopes, self-healing produces operational metrics, decoupled state produces replayability for incident review.

Durable Execution Is a Governance Primitive

The detail in Cursor’s post that deserves the most attention is the least flashy one. Workflow durability is not just a reliability feature. It is the property that makes everything else governable.

A durable workflow is, by definition, a workflow whose history is recorded, replayable, and inspectable. Every action the agent takes is captured as a discrete step the runtime can audit. That history is the raw material for compliance reporting, incident review, change attribution, and the kind of forensic answer auditors will ask for when an agent ships the wrong commit. Without durability, an agent’s actions are a stream of side effects that nobody can reconstruct after the fact.
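A minimal sketch of what "recorded and inspectable" means in practice, assuming an in-memory history for brevity. The StepEvent and WorkflowHistory names, the workflow ID, and the sample events are hypothetical; a real runtime persists this history for you.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StepEvent:
    workflow_id: str
    step: str
    at: datetime
    detail: str


class WorkflowHistory:
    """Append-only list of discrete steps, queryable by time window."""

    def __init__(self) -> None:
        self._events = []

    def append(self, event: StepEvent) -> None:
        self._events.append(event)

    def between(self, start: datetime, end: datetime) -> list:
        return [e for e in self._events if start <= e.at < end]


history = WorkflowHistory()
tuesday = datetime(2026, 5, 19, tzinfo=timezone.utc)
history.append(StepEvent("wf-42", "checkout_branch",
                         tuesday.replace(hour=15, minute=13), "branch=fix/flaky-test"))
history.append(StepEvent("wf-42", "push_commit",
                         tuesday.replace(hour=15, minute=14, second=30), "files=3"))
history.append(StepEvent("wf-42", "open_pull_request",
                         tuesday.replace(hour=15, minute=20), "draft=true"))

# What did the agent do during the minute starting at 3:14pm?
window_start = tuesday.replace(hour=15, minute=14)
at_3_14 = history.between(window_start, window_start + timedelta(minutes=1))
print([(e.step, e.detail) for e in at_3_14])
```

A stateless HTTP loop has nothing to put in that list. That is the difference between an incident review and a guess.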

The teams that have understood this for a decade are the ones running Temporal, Airflow, Step Functions, and Cadence behind payment systems and order fulfillment. The teams that are now learning it the hard way are the ones who built agents on top of stateless HTTP loops and assumed the LLM would remember.

Cursor learned it. The post is the receipt.

Do This Now

Pick one cloud-agent workflow currently running in your environment and answer four questions before the end of the week:

If the host process restarts mid-task, does the workflow resume or restart from zero?
If the agent leaks a secret to a log, which credential was scoped to that task and how do you rotate it?
If the task stalls for an hour, who notices and what restarts it?
If a regulator asks for a complete history of what the agent did last Tuesday at 3:14pm, can you produce it?

If you cannot answer all four with a specific name, system, or query, your cloud-agent program does not yet have an operating layer. It has a demo with a bigger blast radius.

Cursor just published the playbook. The rest of us get to copy it before the audit shows up.

This analysis synthesizes Lessons Learned from Building Cloud Agents (Cursor, May 2026).

Victorino Group helps platform teams turn agent containment into operational defaults instead of one-off heroics. Let’s talk.

The Design Parity Trap: When 80% Competent Is the Floor

Thiago Victorino — Mon, 25 May 2026 00:00:00 GMT

A small business owner opens Google Pomelli, uploads a few photos and a logo sketch, types one sentence about the bakery she wants to launch, and walks away ninety seconds later with a complete Business DNA: brand voice, color system, type pairing, a populated website, a campaign-ready social pack. The work is competent. The typography is readable. The palette is balanced. The copy is on-tone. It would have cost a junior agency three weeks and twelve thousand dollars in 2023.

Google announced this at I/O 2026. Across town, in the same week, an Executive Creative Director named Yann Caloghiris published in The Drum and named the structural risk faster than most strategists: when AI delivers roughly 80% of design competently for every team that asks, competence stops being a differentiator. The residual 20% (taste, trust, calibration) becomes the entire moat.

Call it the design parity trap. We have written before about design systems becoming governance infrastructure and about the operating-model shift that turns designers into conductors. The parity trap is the failure mode underneath both shifts. It is what happens when leadership treats AI as a productivity lever and discovers, eighteen months later, that the productivity worked exactly as advertised and the brand voice collapsed into the median.

The 20-Point Gap That Tells the Whole Story

Figma’s 2025 Design Survey, which Caloghiris cites, contains a number worth staring at. 78% of design professionals say AI tools significantly accelerate their workflows. 58% say AI improves output quality.

A twenty-point spread between speed and quality is not noise. It is the parity trap rendered in survey data. The speed gains arrived as promised. The quality gains arrived for the floor, not the ceiling. AI raised every designer to a competent baseline. It did not raise the work above it.

In a category where every competitor is now operating at the same competent baseline, the floor is no longer the floor. It is the new ceiling, and most teams will not realize they have stopped climbing.

Slack Wrote the Operational Answer

While Caloghiris named the trap, Slack’s VP of Product Design Will Miner published the operational answer in the same week. His team of roughly seventy designers has been working through the AI shift in public, and the principles he shipped read like a governance document, not a manifesto.

Three behavioral changes are worth naming. Executive demos at Slack now ship in code, not in Figma. Designers without coding backgrounds are building internal tools their teams need. UI bugs are getting fixed in-house, without an engineering ticket. The boundary between designing and building has moved, and the team’s principles moved with it.

The principles themselves are unremarkable in isolation. AI is a collaborator, not a replacement. Taste is the differentiator. Craft compounds. What is remarkable is that they exist at all. Most design organizations are still debating whether to allow Figma’s AI features in the file. Slack wrote down what good looks like at seventy-designer scale and shipped it.

This is the operational shape of the answer. The parity trap closes when leadership names the residual 20% explicitly, builds the review checkpoints that protect it, and gives the team principles concrete enough to refuse work that violates them.

Chen’s Four-Step Model Is the Practitioner Playbook

The third article from the same week comes from Daisy Chen at UX Collective, and it is the most reusable artifact of the three. Chen draws on Bainbridge’s 1983 Ironies of Automation, the Parasuraman, Sheridan, and Wickens framework from 2000, and Lee’s research on alarm fatigue (the famous 35-to-1 false-alarm-to-real-alarm ratio at which operators start disabling warnings entirely). She compresses fifty years of human-automation research into a four-step model.

Identify the task. Choose the control level. Calibrate trust. Design for co-evolution.

The model is not specific to design. It is the practitioner playbook for any function adopting AI, which is precisely why it matters for the governance-beyond-engineering arc. Marketing teams running autonomous campaign generation need it. Legal teams reviewing AI-drafted contracts need it. Sales teams using AI-generated outreach need it. The vocabulary is generalizable. The discipline is transferable.

Step one (identify the task) forces leadership to admit which decisions actually require human judgment. Most teams skip this and discover, six months later, that they have automated the decisions that needed the most judgment and left the routine ones alone.

Step two (choose the control level) maps cleanly to the design system as constraint layer thesis. Full automation, supervised execution, advisory mode, or manual with AI suggestion. Each has a place. Picking the wrong level for the wrong task is the most common failure we see in implementation engagements.

Step three (calibrate trust) is where Lee’s 35-to-1 number lives. Trust that is too high produces uncritical adoption. Trust that is too low produces tool abandonment. Both fail the same way: the system stops learning because the humans stop engaging with its output.

Step four (design for co-evolution) is the only one that genuinely buys time. The other three stabilize the system. This one improves it. Co-evolution is what separates teams that plateau at competent-for-everyone from teams that compound taste over years.

What Pomelli Actually Threatens

Pomelli is not threatening agencies. Agencies were already being repriced. Pomelli is threatening the assumption that design competence is a defensible position.

If a small business owner can ship a competent brand identity in ninety seconds, then the brand identity itself stops being the deliverable. The deliverable becomes what comes next: the decisions about which competent option to refuse, which on-tone copy to rewrite because it is on-tone but boring, which balanced palette to push out of balance because the balance is generic. The deliverable becomes the taste applied to the AI output, not the output itself.

This is why Caloghiris’s framing matters. The prototype is not the brand. The brand is the accumulated set of decisions about which prototypes to ship and which to throw away. A team that uses Pomelli without that discipline ships brand parity. A team that uses Pomelli inside Chen’s four-step model and Slack’s principles ships brand parity plus the 20% that makes it specific.

The Honest Limitation

Pomelli is early. Caloghiris is writing from one creative director’s vantage point. Slack’s principles have not been independently validated at scale. Chen is synthesizing academic research that predates LLMs by decades.

Treat the convergence as a directional signal. The four pieces did not coordinate. They arrived in the same week because the underlying pressure is real. Any single response is still immature.

The teams that will compound advantage from this moment are the ones that take the convergence seriously while staying skeptical of any single playbook. Read all four. Argue with them. Then write your own version, anchored to your actual brand and your actual customers, and ship it before the parity sets in.

Do This Now

Pick one design workflow your team has already moved to AI. Run it through Chen’s four steps this week. Identify the task. Choose the control level. Calibrate the trust. Design for co-evolution. Then write down three review checkpoints that would catch the moment the output drifts toward median, and assign each one to a named human.

Then send Miner’s piece to whoever runs your design org. Ask them to publish principles at your scale within thirty days. Not aspirational principles. Operational ones. The kind a designer can cite when refusing a piece of work that violates them.

The 20% that becomes the moat does not get built by accident. It gets built by teams that named what mattered before parity arrived and protected it on purpose.

This analysis synthesizes Google Pomelli Can Now Build Your Entire Brand from Scratch (Digital Trends, May 2026), AI Gives Us the Prototype. It Doesn’t Give Us the Brand (The Drum, May 2026), Leading Design Through the AI Shift (Slack Design, May 2026), and Most AI Tools Make Users Faster. The Best AI Tools Make Users Better. (UX Collective, May 2026).

Victorino Group helps design and product leaders install the review checkpoints that keep brand voice from collapsing into AI-generated parity. Let’s talk.

MCP Gets Its Stateless Core: The Protocol Just Stopped Hand-Waving

Thiago Victorino — Mon, 25 May 2026 00:00:00 GMT

On May 21, the MCP maintainers (David Soria Parra and Den Delimarsky) published the 2026-07-28 release candidate of the protocol. The blog post calls it “the largest revision since launch.” That is accurate, and it understates what changed. The Model Context Protocol just stopped being a research artifact you tolerated in production and started being a vendor-evaluable surface you can write into a contract.

The headline numbers do most of the talking. Six SEPs (Specification Enhancement Proposals) address statelessness. Six more harden authorization. The maintainers locked in a 12-month minimum deprecation window. Three primitives that were inherited from the original 2024 design (Roots, Sampling, Logging) are now formally deprecated. Two extensions (MCP Apps and Tasks) ship as the first official examples of the new reverse-DNS extension model. There is a 10-week validation window before the spec is final on July 28.

The teams that have spent the past nine months arguing that MCP “is not ready for production” now have a concrete date when that claim stops being true.

The Stateless Inflection

The most important change is structural. Until this RC, an MCP server held session state. Every request from a client had to land on the same instance, because that instance remembered who the client was and what tools were already negotiated. That single fact dictated every deployment pattern downstream of it. Sticky sessions in your load balancer. Session affinity in your service mesh. Custom logic in your CDN to honor session cookies. A team that wanted to run MCP behind plain round-robin DNS in front of three Lambda containers could not, because the second request would land on a different container and the negotiation would be gone.

The stateless core fixes that at the protocol level. State now lives in the application, where it always belonged. An MCP server in the 2026-07-28 spec is a stateless HTTP service. You can put it behind any load balancer that distributes requests randomly. You can cache responses at the edge. You can scale horizontally without coordination. You can deploy it the way you deploy every other internal HTTP service, with no protocol-aware infrastructure in the path.
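A toy model makes the deployment difference concrete. This is a sketch of the two shapes, not the MCP wire format: the class names, the request dictionary, and the tool name are hypothetical.

```python
from itertools import cycle


class StatefulServer:
    """Older shape: the instance remembers which clients have negotiated."""

    def __init__(self) -> None:
        self.sessions = set()

    def initialize(self, client_id: str) -> None:
        self.sessions.add(client_id)

    def call_tool(self, client_id: str, tool: str, arg: str) -> str:
        if client_id not in self.sessions:
            raise LookupError(f"no session for {client_id} on this instance")
        return f"{tool}:{arg}"


class StatelessServer:
    """Newer shape: each request carries what the server needs to answer it."""

    def handle(self, request: dict) -> str:
        return f"{request['tool']}:{request['arg']}"


def stateful_behind_round_robin() -> bool:
    """Return True if the second request fails on a different instance."""
    pool = cycle([StatefulServer() for _ in range(3)])
    next(pool).initialize("client-1")                      # lands on instance 0
    try:
        next(pool).call_tool("client-1", "search", "q")    # lands on instance 1
    except LookupError:
        return True
    return False


def stateless_behind_round_robin() -> list:
    """Send the same request to four instances in turn and collect the answers."""
    pool = cycle([StatelessServer() for _ in range(3)])
    request = {"client": "client-1", "tool": "search", "arg": "q"}
    return [next(pool).handle(request) for _ in range(4)]


print("stateful breaks:", stateful_behind_round_robin())
print("stateless answers:", stateless_behind_round_robin())
```

The stateful version needs the load balancer to know about it. The stateless version does not care which instance picks up the request.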

This is the change that turns MCP from a thing your platform team has to plan around into a thing your platform team can ignore. For an enterprise running a service mesh, an API gateway, and a CDN, “looks like HTTP” is the entire ballgame. The protocol just earned the right to be deployed on the infrastructure you already operate.

Six SEPs to ship this. Worth understanding why it took six. Stateless transport is not just “remove the session ID.” It required rethinking how tool capabilities are negotiated, how subscriptions to long-running resources work without an open connection, how authorization tokens are scoped per request instead of per session, and how the client knows what the server is capable of without having to ask every time. Each of those is a separate proposal with separate review. Six SEPs is the size of the cleanup.

Authorization Stops Being a Sketch

The second cluster (six more SEPs) closes the authorization story. Until this RC, MCP authorization was a documented intent. The spec said “use OAuth 2.1.” The implementations did roughly that, with enough variation to make a security review a research project.

The RC aligns the protocol with OAuth and OIDC at the level a security team actually cares about. Token issuer (iss) validation is now in the spec, not in the guidance section. Scope semantics are defined. Token audience binding is defined. The maintainers explicitly removed ambiguity around which token can be used by which client against which server. The intent is that an MCP server’s authorization story should be auditable against the same playbook your existing SaaS vendors use.
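The three checks named above can be sketched against an already-decoded set of token claims. This is an illustration, not the spec's algorithm: signature verification and expiry are assumed to have happened upstream, and the issuer URL, server URL, and scope names are hypothetical. The claim names (iss, aud, scope) are the standard JWT and OAuth ones.

```python
def check_claims(claims: dict, *, expected_issuer: str, audience: str,
                 required_scopes: set) -> list:
    """Return a list of problems; an empty list means the claims pass."""
    problems = []

    # Issuer validation: the token must come from the identity provider we trust.
    if claims.get("iss") != expected_issuer:
        problems.append("issuer mismatch")

    # Audience binding: "aud" may be a single string or a list of strings.
    aud = claims.get("aud", [])
    audiences = [aud] if isinstance(aud, str) else list(aud)
    if audience not in audiences:
        problems.append("token not bound to this server")

    # Scope semantics: "scope" is a space-delimited string.
    granted = set(claims.get("scope", "").split())
    missing = required_scopes - granted
    if missing:
        problems.append("missing scopes: " + ", ".join(sorted(missing)))

    return problems


good = {
    "iss": "https://idp.example.com",
    "aud": "https://mcp.example.com",
    "scope": "tools.read tools.call",
}
bad = {
    "iss": "https://other-idp.example.net",
    "aud": ["https://crm.example.com"],
    "scope": "tools.read",
}
expect = dict(expected_issuer="https://idp.example.com",
              audience="https://mcp.example.com",
              required_scopes={"tools.read", "tools.call"})

print(check_claims(good, **expect))
print(check_claims(bad, **expect))
```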

This is the change that lets a CISO sign off without writing a custom risk acceptance. When the OAuth flow against your MCP server looks indistinguishable from the OAuth flow against your CRM, the same controls apply: the same identity provider, the same scope inventory, the same token lifetime policy, the same revocation path. The protocol stopped requiring you to invent new controls.

The Deprecation Policy Is the Procurement Win

Buried under the headline changes is the item that matters most to procurement: a formal deprecation policy with a 12-month minimum window between deprecation and removal. Roots, Sampling, and Logging are the first primitives to enter that window. They are being removed because they were under-used, awkwardly scoped, or duplicated by better mechanisms (Tasks subsumes long-running work; Apps subsumes UI surface concerns). The point is not which primitives are leaving. The point is the calendar.

Twelve months is enough time for a vendor to update an SDK, ship a new release, and give customers a migration path. It is also enough time for a procurement team to write the contract clause that says: “If a primitive your server depends on is deprecated, you have nine months from the announcement to ship a compatible upgrade, and we are entitled to remediation if you do not.” Until last week, no such clause was writable, because there was no defined deprecation cadence to point at. Now there is.
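The calendar in that clause is simple enough to compute. A sketch, assuming the clock starts on the day the spec is final (July 28, 2026); the real start date for any given primitive is whatever the maintainers announce, and the nine-month vendor deadline is the contract term suggested above, not a protocol rule.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add whole months, clamping the day to the end of shorter months."""
    index = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(index, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return date(year, month_index + 1, min(start.day, last_day))


announced = date(2026, 7, 28)                # assumption: clock starts here
vendor_deadline = add_months(announced, 9)   # contract term from the clause
earliest_removal = add_months(announced, 12) # protocol's minimum window
buffer_days = (earliest_removal - vendor_deadline).days

print("Vendor must ship a compatible upgrade by:", vendor_deadline)
print("Earliest the primitive can be removed:", earliest_removal)
print("Days left for you to roll out the upgrade:", buffer_days)
```

The gap between the two dates is your own rollout time. Negotiate the vendor deadline with that number in mind.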

The 12-month window is also what makes it safe to deploy MCP servers in environments that have 6-month software refresh cycles. Two refresh cycles fit inside one deprecation window. The protocol just became deployable in regulated industries that had been waiting for exactly this commitment.

Extensions Get a Namespace

MCP Apps and Tasks ship as the first two extensions under a reverse-DNS naming scheme. Apps brings UI surface concerns into the protocol (the client can render server-supplied UI in a controlled way). Tasks brings long-running work into the protocol (the server can hand back a task handle, the client can poll or subscribe, the work survives a disconnect). Both have been validated in real implementations over the past nine months. Both now have a stable home in the spec.

The reverse-DNS scheme matters more than the two specific extensions. It means a vendor can ship com.acme.mcp.proprietary-thing as a recognized extension, and a client can advertise that it supports org.modelcontextprotocol.apps without confusion. The namespace is the mechanism that lets the protocol grow without forking. Until now, every vendor-specific feature was either a private extension nobody else could discover or a proposal to change the core spec. The reverse-DNS path is the middle road, and it is the one every other healthy protocol ecosystem has converged on.
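A sketch of how a namespace like this lets two sides agree on what they share. The identifier grammar below is an assumption for illustration, not the spec's; the name org.modelcontextprotocol.tasks is inferred by analogy with the Apps example and may not match the published identifier.

```python
import re

# Assumed grammar: three or more dot-separated labels of lowercase letters,
# digits, and internal hyphens. The real spec may define this differently.
EXTENSION_ID = re.compile(
    r"[a-z0-9]+(?:-[a-z0-9]+)*(?:\.[a-z0-9]+(?:-[a-z0-9]+)*){2,}"
)


def is_valid_extension_id(name: str) -> bool:
    return EXTENSION_ID.fullmatch(name) is not None


def negotiate(client_supports: set, server_offers: set) -> list:
    """Extensions both sides name, ignoring anything that is not well formed."""
    shared = client_supports & server_offers
    return sorted(name for name in shared if is_valid_extension_id(name))


client = {"org.modelcontextprotocol.apps", "org.modelcontextprotocol.tasks"}
server = {"org.modelcontextprotocol.apps", "com.acme.mcp.proprietary-thing"}

print(negotiate(client, server))
print(is_valid_extension_id("com.acme.mcp.proprietary-thing"))
print(is_valid_extension_id("apps"))
```

The vendor extension is simply ignored by a client that does not know it. Nothing forks, and nothing in the core spec had to change.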

What to Do in the Next 10 Weeks

The 10-week validation window before July 28 is when you get to influence the final shape of the spec. Three actions are worth scheduling.

First, audit your current MCP deployments for stateful assumptions. If you have sticky-session config in a load balancer because of MCP, mark it for removal. If you have a service-mesh policy that pins clients to instances, the same. The migration is not automatic, but the cleanup is straightforward and the operational simplification is permanent.
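The first pass of that audit can be a text search. A minimal sketch that flags common session-affinity settings in config files; the pattern list is a starting point rather than an exhaustive one, and a match only tells you affinity is configured, not that MCP is the reason.

```python
import re

# Common ways session affinity shows up in infrastructure config.
STICKY_PATTERNS = {
    "nginx ip_hash": re.compile(r"^\s*ip_hash\s*;"),
    "HAProxy stick-table": re.compile(r"^\s*stick-table\b"),
    "Kubernetes Service sessionAffinity": re.compile(r"sessionAffinity:\s*ClientIP"),
    "ingress-nginx affinity annotation": re.compile(
        r"nginx\.ingress\.kubernetes\.io/affinity"),
    "Istio consistentHash": re.compile(r"\bconsistentHash\s*:"),
}


def find_sticky_config(text: str) -> list:
    """Return (line number, label) for every line that matches a pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in STICKY_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


# Hypothetical config snippets concatenated for the example.
sample_config = """upstream mcp_backend {
    ip_hash;
    server 10.0.0.1:8080;
}
spec:
  sessionAffinity: ClientIP
"""

for line_number, label in find_sticky_config(sample_config):
    print(f"line {line_number}: {label}")
```

Run it over the repositories that hold your load balancer and mesh config, then ask the owning team why each hit exists.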

Second, run a token-flow review against your MCP servers under the new authorization rules. The iss validation, scope semantics, and audience binding are the three places where existing implementations are most likely to drift from the spec. A 60-minute review with your identity team will surface what needs to change.

Third, write the deprecation clause into your next MCP vendor contract. The 12-month window is the artifact that makes the clause defensible. If you are signing a contract this quarter without that clause, you are signing one that expires the day a primitive your vendor uses gets removed.

The MCP RC is not a feature release. It is the moment the protocol stopped being a thing you adopted on faith and started being a thing you can evaluate. The vendors who were waiting for that moment to take it seriously will be the ones moving fast in Q3. The buyers who keep treating MCP as research will be the ones writing remediation checks in 2027.

This analysis synthesizes The 2026-07-28 MCP Specification Release Candidate (Model Context Protocol, May 2026).

Victorino Group helps procurement and platform teams turn protocol changes into vendor evaluation criteria before they harden into defaults. Let’s talk.

The Week Both Sides of the Supply Chain Got Industrial

Thiago Victorino — Mon, 25 May 2026 00:00:00 GMT

Between May 21 and May 22, 2026, four announcements landed within 48 hours of one another. GitHub disclosed that 3,800 of its own internal repositories had been exfiltrated through a malicious VS Code extension. Anthropic published the first numbers from Project Glasswing, the restricted security model program, with more than 10,000 vulnerabilities surfaced in critical software in a single month. The Anthropic Red Team released exploit-evals for Mythos Preview, which solved 21 of 41 ExploitBench CVEs while every other model managed two or fewer. Perplexity open-sourced Bumblebee, a read-only scanner that treats agent endpoints (extensions, MCP configs, lockfiles) as scannable surfaces.

None of these were coordinated. They still describe one event.

The AI-era supply chain crisis has crossed the industrial threshold on both sides. The offense side has flywheel mechanics, named victims, and a price list. The defense side has eval scores, a partner pipeline, and an open-source first artifact. The intermediate question, the one engineering and security leaders need to answer this week, is not whether agent endpoints need controls. It is whether an inventory of those endpoints exists at all.

The Offense Side: TeamPCP Reached the Top of the Stack

We have written about TeamPCP through individual incidents before. Clinejection showed a single npm package compromising Cline installs. The Mercor wave showed the same operator hitting AI training data infrastructure. Prompt injection as a supply-chain weapon traced the technique into the model loop itself.

What May 21 added is the layer TeamPCP had not yet touched: the platform that hosts the supply chain.

GitHub CISO Alexis Wales confirmed 3,800 internal repositories were exfiltrated through a single VS Code extension. The asking price on BreachForums was $50,000. Aikido Security tracked the takedown windows: 18 minutes on the VS Code Marketplace, 36 minutes on Open VSX. Fast response, in absolute terms. Still 54 minutes of combined listing time during which a poisoned extension sat in the default download channels for a critical developer surface.

The broader campaign numbers, published by Wiz, Socket, and Palo Alto Networks the next day, frame the scale:

20 distinct supply-chain waves over the year
500+ poisoned packages, more than 1,000 counting versions
Confirmed downstream victims include OpenAI (two employee devices), Mistral AI, Mercor, the European Commission public site, TanStack, LiteLLM, Trivy, and AntV

The economic logic is straightforward. A poisoned extension that runs on a GitHub engineer’s laptop returns more value than one that runs on a junior developer’s hobby project. TeamPCP is now operating at the layer where developer tooling itself is the target. Every layer beneath it (npm registries, language ecosystems, framework maintainers) already absorbed waves earlier in the year. The platform layer was the remaining ceiling.

That ceiling has been pierced.

The Defense Side: Glasswing Showed Offense AI Scales Defense AI

Project Glasswing is Anthropic’s restricted-distribution security model: more capable than the public Claude line, accessible only to vetted security partners under specific use restrictions. The governance model has been documented since April. The May 22 initial update is the first time the program reported what it found.

The numbers carry weight because they are field-tested, not benchmark-tested:

10,000+ vulnerabilities across systemically important software in one month
Roughly 50 active partners
6,202 high- or critical-severity vulnerabilities discovered across 1,000+ open-source projects
Cloudflare alone surfaced 2,000 bugs and reported a false-positive rate “better than human testers”
Firefox 150 yielded 271 vulnerability findings, a 10x increase over Firefox 148, attributed to running Opus 4.6 against the same codebase

The strategic claim Glasswing validates is older than the data: offensive AI capability and defensive AI capability scale on the same curve. If a model can construct an exploit chain, the same model can find the conditions that enable that chain. The question has never been which capability arrives first. They arrive together. Governance determines which one reaches the field at scale.

Glasswing is the first program where the defense-side reach was measured against the offense-side reach in the same month. Defense reached further. Restricted distribution made that possible.

Mythos Preview: The Exploit Eval Becomes a Commodity Benchmark

The Anthropic Red Team’s exploit evaluation paper is the third leg of the May 22 stool. It is also the most uncomfortable.

Mythos Preview solved 21 of 41 ExploitBench CVEs by writing arbitrary code execution exploits. Every other tested model solved two or fewer. Mythos was the only model that escaped a V8 sandbox. The performance doubling time, measured against the prior generation, was 0.7 months. The prior doubling was 1.1 months. On SCONE-bench, the smart-contract exploit eval, the dollar value of successfully exploited contracts crossed $35 million.

The numbers matter less than the trajectory. Multi-step exploit construction, which 12 months ago required a senior offensive security researcher, is now a model capability. Restricted distribution slows the commodity arrival but does not stop it. The exploit eval is now a benchmark that frontier labs publish against each other. Open-weights catch-up is a question of months.

Bumblebee: The First Defender-Side Agent-Endpoint Scanner

Open-source AI offense has been the asymmetry we have been tracking. Defense had no equivalent artifact pointed at the surfaces agents actually touch.

Perplexity’s Bumblebee, open-sourced on May 22, is the first one to ship.

The design choices reveal what defenders had been missing:

Bumblebee scans four endpoint surfaces: language package managers (npm, pip, cargo, gem, others), MCP configuration files, VS Code-family extensions (VS Code, Cursor, Windsurf), and browser extensions.
It is read-only by design. It does not invoke npm install, does not trigger postinstall hooks, does not run the code it inventories. The reason is explicit in the project README: an active scan risks triggering the exact payload Bumblebee exists to find.
Perplexity Computer, the agent that drafts the catalog, opens pull requests for human review. The agent does not auto-commit the inventory.
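The read-only constraint is easy to state and easy to violate, so a small illustration helps. The sketch below is not Bumblebee's implementation or output format. It assumes an MCP config stored as JSON with a top-level mcpServers key and an npm package-lock.json in the v2/v3 layout with a packages map. Both files are parsed as data, and nothing is installed or executed.

```python
import json
from pathlib import Path


def inventory_mcp_config(path: Path) -> list[dict]:
    """List MCP servers declared in a JSON config without launching any of them."""
    data = json.loads(path.read_text(encoding="utf-8"))
    servers = data.get("mcpServers", {})  # assumed top-level key
    return [
        {"surface": "mcp", "name": name, "command": spec.get("command"), "source": str(path)}
        for name, spec in servers.items()
    ]


def inventory_npm_lockfile(path: Path) -> list[dict]:
    """List packages pinned in a package-lock.json by reading it, never by installing."""
    data = json.loads(path.read_text(encoding="utf-8"))
    items = []
    for key, meta in data.get("packages", {}).items():
        if not key:  # the empty key is the root project itself
            continue
        # Keys look like 'node_modules/a/node_modules/b'; keep the innermost package name.
        name = key.split("node_modules/")[-1]
        items.append({"surface": "npm", "name": name, "version": meta.get("version"), "source": str(path)})
    return items
```

The design point is that both functions only call read_text and json.loads. No subprocess, no package manager, no hook gets a chance to run.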

The artifact’s existence shifts the conversation. MCP configuration files now have a scannable inventory format. VS Code extension installations now have a defender-oriented enumerator. The argument that “we cannot inventory what we do not have a tool for” no longer applies. The tool exists, is free, and is open source.

The Governance Levers Are Named

Three controls are now concrete enough for a Q3 governance plan:

Long-lived credentials in developer tooling. The GitHub breach worked because a VS Code extension running on an engineer’s laptop carried the access to read internal repositories. The compute boundary, the data boundary, and the identity boundary collapsed into one process. The fix is not a new policy. The fix is workload-identity federation reaching developer extensions, which is where it has been absent.

Extension review as a first-class control. VS Code Marketplace and Open VSX both took the malicious extension down inside an hour. That is a reactive control. The proactive control is treating extension installs the way enterprise security treats software installations on a production server: an approval queue, a signed manifest, a per-version sign-off. Most organizations do not run this for developer tooling because no one previously asked.

MCP configuration inventory. Bumblebee is the artifact that makes this enumerable. The question to ask in the next platform team meeting: “Which agents on which machines load which MCP servers, and where are the configs stored?” If the answer is “we do not know,” the work starts there.
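The extension-review control is the one most teams have never modeled, so here is a minimal sketch of a per-version sign-off check. The register contents and the publisher and extension names are hypothetical, and signature verification of the manifest is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extension:
    publisher: str
    name: str
    version: str


# Hypothetical approval register: each entry is one reviewed, signed-off version.
APPROVED = {
    Extension("example-publisher", "linter", "2.4.1"),
}


def review_installs(installed: list[Extension]) -> dict[str, list[Extension]]:
    """Split installed extensions into approved versions and ones needing review."""
    result: dict[str, list[Extension]] = {"approved": [], "needs_review": []}
    for ext in installed:
        # Sign-off is per version: an approved 2.4.1 says nothing about 2.4.2.
        bucket = "approved" if ext in APPROVED else "needs_review"
        result[bucket].append(ext)
    return result
```

Matching on the full (publisher, name, version) tuple is the point. An auto-updated extension falls back into the review queue instead of inheriting the earlier approval.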

Do This Now

Block 45 minutes this week. Run Bumblebee against one engineering laptop and one developer container image. The output is a draft pull request listing every language package, every MCP configuration, every VS Code extension, every browser extension found. Read it. Two surprises are typical: an extension nobody remembers installing, and an MCP config pointing at a service nobody on the team owns.

That output is the inventory. The inventory is the precondition for governance. Everything else, the policies, the approvals, the federation, presupposes that you can list what you have. The Anthropic Glasswing data confirmed defense AI works at scale. The TeamPCP campaign confirmed offense AI is operating at the platform layer. Mythos confirmed the capability gap closes in months, not years. Bumblebee removed the last excuse for not enumerating the surfaces.

The teams that win the next two years of agent operations are not the ones with the most autonomous agents. They are the ones who can answer, in writing, what their agents reach.

This analysis synthesizes GitHub internal repositories exfiltrated via malicious VS Code extension (ITPro, May 2026), A hacker group is poisoning open-source code at an unprecedented scale (Ars Technica, May 2026), Project Glasswing: An Initial Update (Anthropic, May 2026), Measuring LLMs’ Ability to Develop Exploits (Anthropic Red Team, May 2026), and Perplexity is open-sourcing Bumblebee (Perplexity, May 2026).

Victorino Group helps enterprises inventory and govern their agent endpoints before the next supply-chain wave reaches them. Let’s talk.

Cloudflare's CEO Just Named the AI Layoff Pattern: Measurers Out, Builders In

Thiago Victorino — Fri, 22 May 2026 00:00:00 GMT

On May 20, 2026, Matthew Prince, CEO of Cloudflare, published an opinion piece in the Wall Street Journal that did something most layoff announcements work hard to avoid. He named the pattern.

In his words (verbatim from the opening paragraph): “We haven’t found another example in U.S. business history of a public company growing at more than 30% that laid off more than 20% of its workforce. Yet what we did is likely going to become the norm over the next year.”

Read that twice. Record-setting revenue growth and a one-fifth headcount cut, in the same quarter, in the same company. Prince’s own framing of why he wrote the piece (also verbatim): “This is a story about artificial intelligence, but executives and commentators are misunderstanding how it will disrupt business and who will be affected.”

The subhead the WSJ ran with, also verbatim, is the part every operator needs to read out loud at the next leadership meeting: “The company has less need for middle managers, operations jobs and other measuring positions.”

That last word is the one to underline. Measuring positions.

What Prince Named

The rest of the op-ed sits behind WSJ’s paywall. What we can verify with certainty is what WSJ’s own editorial indexing layer surfaced as the article’s spine keywords: analysts, automation, builders, cut, employ, jobs, layoff, MEASURERS, revenue, sellers. WSJ’s metadata pipeline only emphasizes terms that recur with weight in the body. The capitalized one, the one a publishing system treats as the argumentative backbone, is measurers.

So here is what we can say with high confidence from the verbatim subhead and the keyword spine, and what we are inferring from the rest: Prince’s argument splits the workforce along a single axis. On one side are measurers, people whose primary job is to track, report, summarize, or coordinate numbers and status that AI can now track, report, summarize, and coordinate on its own. On the other side are builders, people whose primary job is to create the things AI cannot yet create unaided: product decisions, customer relationships, code that ships to production with judgment behind it, sales conversations that close.

The cut was not bottom-up. It was middle-out.

A note on what the article also makes clear, per the keyword spine: Cloudflare is hiring at record open-position levels. The layoff is a recomposition, not a downsizing. The seats are not being eliminated, they are being refilled with a different shape of work.

Why This Is the First Named Case

We have written before about the centaur era, where the unit of measurement is the team plus its tools, not the model. We have written about the two percent productivity reality that does not match the productivity narrative. We have written about how marketing and other functions are becoming governance teams.

What we have not had until May 20 is a U.S. public-company CEO putting a name on the cut and the math behind it. Measurers versus builders is Prince’s framing, and it lands because it does what most layoff communications refuse to do: it tells the people who were cut why they were cut, in a category their peers will recognize, and it tells the people who stayed why they stayed.

The companies that try to use this framing without doing the work will get caught. The companies that quietly do the work and never name it will lose the people who could have helped them through it. Prince did both. He named it and he is doing it in public.

The Operating-Model Question Every CEO Now Has To Answer

If you are running a company of any size and the board has not yet asked the version of this question that begins with “why are we not Cloudflare” or its sharper inverse “why are we Cloudflare,” you have maybe one quarter before they do. The question they will actually be asking is this: which roles in this company are measurers, which are builders, and what is our plan to recompose rather than just cut?

Three traps to avoid in how you answer.

Trap one: confusing measurers with junior staff. Many of the measurers Prince is describing are senior. They are middle managers whose value used to be coordination, status synthesis, and number-rolling-up. The cut runs through the org chart horizontally, not vertically. Treating it as a junior-cut conversation reads to the room as a hand-wave.

Trap two: confusing builders with engineers. Builders, in Prince’s framing and in the operating reality the keyword spine supports, are not only people who write code. Sellers are explicitly on the builder side of the spine. So are the people who design products, who own customer relationships, who make judgment calls AI cannot make for them. The split is functional, not departmental.

Trap three: assuming the recomposition is a one-time event. If AI capability keeps moving, the line between measurer and builder moves with it. A role that is a builder today, where the human judgment is the load-bearing piece, can become a measurer in 18 months if the surrounding tooling closes the judgment loop. The plan is not a one-time RIF. The plan is a continuous reassessment of what each role’s irreducible human contribution actually is. This is the same discipline we argued for in the output-competence decoupling piece: the role is not the output, the role is the verified judgment behind the output.

What This Means for the Quarter You Are In

Three things to do this week.

First, get the measurers-versus-builders question on the agenda of your next operating review. Do not assign it to HR. Run it with the executive team. The output is a one-page picture of every role above a certain band, sorted into one column or the other, with a short note for each on what the irreducible human contribution is today and what it might be in 12 months.

Second, before you cut anything, hire one builder for every two measurer roles you are uncertain about. This is the Cloudflare move. Prove the recomposition works at small scale before you make it the layoff story. The companies that get this wrong will cut first and discover the builder bench is empty.

Third, write down, in plain language, what your version of measurers and builders looks like inside your business. Do not adopt Prince’s words wholesale. The categories are useful, the specifics are not transferable. A consulting firm’s measurers look different from a SaaS company’s. A regulated business has measurers it cannot legally cut. Your version of the picture is the work.

The piece that makes Prince’s op-ed historically interesting is not the layoff. It is that a public-company CEO put a name on the operating-model change and asked the rest of the market to look at theirs. The boards and the press are about to oblige. Better to have the picture in your hand when they ask.

This analysis synthesizes How I Choose Which Cloudflare Employees to Replace With AI (Matthew Prince in WSJ Opinion, May 2026; paywalled, opening paragraphs quoted verbatim and editorial keywords inferred).

Victorino Group helps leadership teams separate the AI work that builds the company from the AI work that just looks busy. Let’s talk.

The Containment Stack Just Filled Out: Four Layers, One Week

Thiago Victorino — Fri, 22 May 2026 00:00:00 GMT

Between May 20 and May 21, four organizations shipped four very different things into the same problem space. Dropbox open-sourced Nova, an internal platform that wraps coding agents in workflow isolation. The CNCF announced Prempti, a Falco-derived policy layer that intercepts the actions an agent tries to take before they reach the host. Google released Agent Executor (the open-source ax runtime) plus Agent Substrate on Kubernetes, a distributed runtime for agents that need to survive restarts, branch their own trajectories, and scale to millions of registered instances. IBM, in its Think keynote the same week, made the executive case for “digital workers” as a managed labor class with badges, onboarding, and retirement.

Read in isolation, each release is a vendor announcement. Read together, in the order they appeared in your feed, they describe four floors of a building that was sketched out a month ago in the original containment stack essay and is now being filled in by separate companies who did not coordinate. The interesting story this week is not that they all shipped. It is that they shipped at different altitudes.

Layer 1: Workflow Isolation (Dropbox Nova)

Dropbox’s Nova post is the most ground-floor of the four releases. Nova is a platform for running coding agents inside Dropbox’s own engineering workflows, with three constraints that matter:

A five-iteration cap on every workflow. An agent that has not converged after five tries is not allowed a sixth; the workflow halts and a human takes over. The platform refuses to spend infinite tokens chasing a bad plan.

A deflaker that validates each candidate fix against 100+ CI runs before merging. Coding agents propose code constantly; the bottleneck is not generation, it is verifying that the proposal does not introduce a flake. Nova treats the deflaker as a first-class workflow component, not an afterthought.

Hermetic per-commit snapshots. Each agent run gets a frozen view of the repo at a known commit, so reruns are reproducible and concurrent agents do not see each other’s half-finished work.

What Dropbox shipped is not a sandbox in the operating-system sense. It is a sandbox in the workflow sense: the agent runs against a bounded version of the repository, with a bounded number of attempts, gated by a deterministic validator. The trust boundary is the workflow itself. This is the layer closest to the agent’s actual job, and it is where most teams accidentally have nothing.
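The bounded-attempts-plus-validator shape can be sketched in a few lines. This is not Nova's code. The propose and ci_passes callables are hypothetical stand-ins for the agent and the CI system, and hermetic snapshots are not modeled.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 5     # iteration cap described in the post
VALIDATION_RUNS = 100  # repeated CI runs per candidate fix


def run_bounded(
    propose: Callable[[int], str],
    ci_passes: Callable[[str], bool],
    max_iterations: int = MAX_ITERATIONS,
    validation_runs: int = VALIDATION_RUNS,
) -> Optional[str]:
    """Return the first candidate that passes every validation run, or None to hand off to a human."""
    for attempt in range(1, max_iterations + 1):
        candidate = propose(attempt)
        # A fix counts only if it is green on every run; one red run marks it as flaky.
        if all(ci_passes(candidate) for _ in range(validation_runs)):
            return candidate
    return None  # cap reached: halt instead of spending more tokens
```

Returning None rather than raising keeps the handoff explicit. The caller has to decide what the human takeover looks like.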

Layer 2: Action Interception (CNCF Prempti)

One floor up from workflow isolation sits action interception. The CNCF’s Prempti announcement is the reference implementation that did not exist last month. Prempti is built on the Falco runtime security project and watches for the things coding agents try to do that they should not be doing: reading SSH keys, exfiltrating AWS credentials, modifying MCP server configurations to escalate their own permissions, injecting commands into git hooks.

The design decision worth naming: Prempti is pre-execution, not post-hoc audit. A logged violation is useful for forensics, but it does not stop the laptop’s SSH key from leaving the building. Falco’s kernel-level instrumentation lets Prempti block the syscall before it completes. It supports Claude Code today on Linux, macOS, and Windows, with Codex on the roadmap.

This layer answers a question the workflow layer cannot: “What is the agent actually trying to do on the host?” Nova’s five-iteration cap does not protect you if iteration three quietly reads ~/.ssh/id_rsa and POSTs it to a Discord webhook. The workflow layer trusts the workflow’s intent. The action layer trusts nothing and inspects every reach into the system.
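The difference between pre-execution and post-hoc is an ordering question: does the check run before or after the read? A minimal application-level sketch follows. It is not Prempti's or Falco's API and does not operate at the kernel level. The deny list and function names are hypothetical.

```python
from pathlib import Path

# Hypothetical deny list: directories an agent should never read from.
SENSITIVE_DIRS = [Path.home() / ".ssh", Path.home() / ".aws"]


class ActionDenied(Exception):
    """Raised when policy blocks an action before it runs."""


def is_sensitive(target: Path) -> bool:
    """True if the resolved target is a sensitive directory or lives inside one."""
    resolved = target.expanduser().resolve()
    return any(
        resolved == d.resolve() or d.resolve() in resolved.parents
        for d in SENSITIVE_DIRS
    )


def guarded_read(target: str) -> str:
    """Check policy first; the read happens only if the check passes."""
    path = Path(target)
    if is_sensitive(path):
        raise ActionDenied(f"read blocked before execution: {target}")
    return path.expanduser().read_text(encoding="utf-8")
```

Resolving the path before the check matters. Without it, a symlink or a relative path such as ../.ssh/id_rsa would slip past a naive string comparison.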

Prempti also produces the telemetry that the next two layers depend on. Without per-action attribution, you cannot tell which agent did what, and the upper floors lose their ability to make decisions about specific agent instances.

Layer 3: Runtime Durability (Google Agent Executor)

Two floors up, you hit the question Google chose to answer this week: how do agents survive at scale? The Agent Executor announcement and the corresponding google/ax repository define a runtime, not a sandbox. The primitives are durable execution (an agent can crash and resume mid-trajectory), secure sandboxes per agent process, trajectory branching (the agent can fork its own reasoning and discard the worse branch), and Agent Substrate, a Kubernetes-backed registry designed for millions of registered agents.

The runtime is A2A-protocol compatible, which means agents written for it can interoperate with other A2A endpoints, including the agent-to-agent ecosystem we covered in the Cloud Next notes. The deliberate choice is to make the runtime, not the framework, the thing that scales. The agent’s job graph is the unit of execution; the framework that produced it is interchangeable.

Durability is the floor people skip because everything works fine until it does not. An agent partway through a 40-step trajectory gets evicted by a Kubernetes node failure. Without durable execution, the agent restarts from step one, burns the tokens again, possibly takes a different path, and quietly drifts. With durable execution, it picks up at step 23 with the same context and continues. The difference is invisible on a dashboard until you count the wasted compute and the inconsistent outcomes.
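Resume-from-checkpoint is the core of that behavior, and a sketch makes the mechanics concrete. This is not the Agent Executor API. It is a single-process illustration that assumes JSON-serializable state and a local checkpoint file.

```python
import json
from pathlib import Path
from typing import Callable


def run_durable(steps: list[Callable[[dict], dict]], checkpoint: Path) -> dict:
    """Run steps in order, persisting state after each so a restart resumes mid-trajectory."""
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text(encoding="utf-8"))
        next_step, state = saved["next_step"], saved["state"]
    else:
        next_step, state = 0, {}

    for index in range(next_step, len(steps)):
        state = steps[index](state)
        # Write to a temp file, then rename, so a crash never leaves a half-written checkpoint.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps({"next_step": index + 1, "state": state}), encoding="utf-8")
        tmp.replace(checkpoint)
    return state
```

A step that crashes after its side effect but before the checkpoint write will run again on resume, so steps in a design like this need to be idempotent.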

Agent Executor sits above the action layer because it assumes the host is already protected. It is the layer where an agent becomes a long-running, observable, restartable workload, the same way services became long-running observable workloads a decade ago.

Layer 4: Lifecycle Management (IBM Digital Workers)

The top floor is the one IBM staked out at Think this week. The Mohamad Ali keynote (SVP, IBM Consulting, speaking under the Krishna mandate) framed agents not as code but as workers with a lifecycle: hired, onboarded, badged, audited, retired. The Pearson partnership produces skill badges that gate which agents are allowed to take which jobs. Providence Health cut nurse recruitment cycles by 12 days using a digital worker pool. IBM’s own internal application of the model decomposed 490 consulting workflows, claims $4.5B in productivity savings, and credits a 20 percentage-point profit lift in consulting between 2024 and 2025.

Strip the keynote framing and the operational claim is this: at enterprise scale, agents are not workloads, they are headcount. The questions HR has always asked about employees apply to agents at this layer. Who do they report to? What are they certified to do? What do you do when they go wrong? What is the offboarding procedure that removes their access cleanly? IBM’s bet is that organizations operating at hundreds-to-thousands of agents need an HR-shaped layer above the runtime, not another runtime.

This is the layer that does not fit any of the lower three. Nova manages workflows. Prempti manages actions. Agent Executor manages processes. None of them answer: “Should this specific agent be allowed to take this specific job today?” That is a lifecycle question. The badge, the role, the retirement, the audit trail of who hired this agent and why, all of it sits above the runtime and below the business decision.
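The lifecycle question can be written down as a data model. The sketch below is an assumption about what such a record might hold, not IBM's or Pearson's schema. The field names and badge names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class AgentRecord:
    agent_id: str
    hired_by: str  # who approved this agent, kept for the audit trail
    badges: dict[str, date] = field(default_factory=dict)  # badge name -> expiry date
    retired_on: Optional[date] = None


def may_take_job(agent: AgentRecord, required_badge: str, today: date) -> bool:
    """Answer: is this specific agent allowed to take this specific job today?"""
    if agent.retired_on is not None and agent.retired_on <= today:
        return False  # retired agents take no work
    expiry = agent.badges.get(required_badge)
    return expiry is not None and today <= expiry
```

Passing today in as a parameter keeps the decision reproducible. An auditor can replay the same check for any past date.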

Why the Stack Diagram Matters More Than Any Single Vendor

You do not need to pick Dropbox, CNCF, Google, or IBM. You need to pick a layer to be honest about. If you are running coding agents and your only control is the prompt, you are missing layer 1. If you have workflow isolation but no action interception, an iteration cap will not save you from an exfiltration. If your agents are restarting from scratch every time a node dies, you have no layer 3. If you are running more than fifty agents in production and you cannot answer “who certified this one to touch billing data,” you have no layer 4.

The four vendors this week did not coordinate. They did, however, expose the layers cleanly enough that you can audit your own stack against the diagram. This is what governance as product looks like when the products arrive in the same week and slot into different floors. It is also why the convergence story from earlier this month was undercounting: it called the trend correctly but underestimated how fast the layers would differentiate.

The layer most teams will be tempted to buy first is layer 4 (it has executive narrative, board-friendly metrics, and ROI claims). The layer most teams actually need first is layer 2 (an agent without action interception is a credential exfiltration waiting to happen). The layer most teams already have an answer for is layer 3 (Kubernetes was already there; you just need a runtime that knows how to use it). The layer most teams underestimate is layer 1 (because the workflow constraints feel like a productivity tax until the first time an agent burns 200 iterations on a wrong plan).

Do This Now

Take 45 minutes with your platform lead this week. Draw the four layers on a whiteboard. For each layer, write the vendor or system that owns it in your stack, or write “none” if nobody owns it. Then count the “none” entries. That number is your honest containment debt.

Then pick the lowest-numbered “none” and assign an owner. Not a project. An owner. Layer order matters because the upper floors assume the lower ones exist. Buying layer 4 lifecycle management before you have layer 2 action interception is hiring an HR director for a building with no front door.
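The whiteboard exercise reduces to a small calculation. The sketch below uses the four layer names from this piece; the owner values in the usage note are hypothetical.

```python
LAYERS = {
    1: "workflow isolation",
    2: "action interception",
    3: "runtime durability",
    4: "lifecycle management",
}


def containment_debt(owners: dict[int, str]) -> tuple[int, list[int]]:
    """Return (count of unowned layers, unowned layer numbers in the order to fix them)."""
    # A layer with no entry at all is treated the same as one marked "none".
    missing = sorted(
        n for n in LAYERS if owners.get(n, "none").strip().lower() == "none"
    )
    return len(missing), missing
```

For a team with only a runtime in place, containment_debt({1: "none", 2: "none", 3: "in-house runtime", 4: "none"}) reports a debt of three, with layer 1 as the first to get an owner.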

The diagram is now drawn by people who do not work for you and who shipped within the same two days. The hard part is the audit you run inside your own building. You will find at least one missing floor. That is the work for next quarter.

This analysis synthesizes Introducing Nova, Dropbox’s Internal Platform for Coding Agents (Dropbox Engineering, May 2026), Introducing Prempti: Policy and Visibility for AI Coding Agents (CNCF, May 2026), Introducing Agent Executor, Google’s Distributed Agent Runtime (Google Cloud, May 2026), and Managing Digital Worker Lifecycle (SiliconANGLE / IBM, May 2026).

Victorino Group helps teams choose containment layers that fit their actual workflow risk, not vendor marketing. Let’s talk.

Figma's On-Canvas Agent Just Made the Design System the Prompt

Thiago Victorino — Fri, 22 May 2026 00:00:00 GMT

In Q1 we wrote that the design system would become the constraint layer for AI-generated design. On May 20, Figma shipped the on-canvas agent. The thesis arrived as a product.

This is not another think piece on whether design systems matter in the agent era. The argument is settled. What matters now is the three governance choices Figma made in the actual release. Read those choices carefully and you can see the constraint-layer model crystallizing into a commercial product.

Choice One: @ Mentions Make Tokens the Prompt Surface

The most quoted feature is that the agent generates multiple stylistic explorations in parallel from a single prompt. The more important feature is buried two paragraphs down. You steer the agent with @ mentions: @ a token, @ a variable, @ a component, and the generator is constrained to that surface.

There are two ways to ship a generative design tool. The first is to let the model invent everything from typography to spacing, then make designers reconcile the output against the system. That is what most early demos showed in 2024. The second is to force every generation through the system’s existing primitives. Figma chose the second path and made it the input grammar.

This matters because it inverts the workflow most teams expected. The prompt is not “make me a card with rounded corners and a primary CTA.” The prompt is “make me a card using @card-elevated and @semantic/action-primary.” The design system is no longer the thing you bolt onto generation after the fact. It is the language you generate in.

There is a quiet implication for governance teams. Every prompt now carries a structured reference to the artifacts your system already governs. The audit question changes from “did the AI use the right components” to “which components were @-referenced in which prompts.” That telemetry is trivial to capture if Figma exposes it, and it gives design ops a control surface that did not exist a week ago.
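If prompt logs become available, the audit is a parsing exercise. The sketch below assumes prompts arrive as plain strings and that a mention is an @ followed by a name. Figma has not been shown here to expose such logs or to use this exact syntax, so treat both as assumptions.

```python
import re
from collections import Counter

# Assumed mention syntax: '@' then a name made of letters, digits, '_', '-', '/' or '.'.
MENTION = re.compile(r"@([A-Za-z0-9_][A-Za-z0-9_\-./]*)")


def referenced_artifacts(prompt: str) -> list[str]:
    """Return the @-referenced names in a prompt, in order of appearance."""
    # Strip a trailing period so sentence punctuation is not counted as part of the name.
    return [name.rstrip(".") for name in MENTION.findall(prompt)]


def reference_counts(prompts: list[str]) -> Counter:
    """Count how often each artifact is referenced across a log of prompts."""
    return Counter(name for p in prompts for name in referenced_artifacts(p))
```

Run against the example prompt above, referenced_artifacts returns card-elevated and semantic/action-primary, which is exactly the pair a design ops reviewer would want to see in the log.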

Choice Two: Component Library Reference Defaults to Frequency

When the agent reaches for a component without an explicit @ mention, it pulls from the “most frequently used” set in your library. This is the design choice that gives us the most hope and the most worry.

The hope is straightforward. Frequency is a defensible default. It biases generation toward components that the team has already converged on, which means generated screens look like the rest of your product instead of like a parallel universe of one-off variants. It also creates a positive feedback loop: governed components get used more, frequency rises, the agent uses them more, and the components that designers tried to retire stop appearing in new work.

The worry is that frequency is not the same as correctness. A component can be used a thousand times and still be the deprecated one. A library can have a “v2/button” that is the canonical surface and a “legacy/button” that nobody has finished migrating away from, and frequency will favor the legacy one until the migration is complete. Design ops teams now need to treat their component frequency curve as a first-class governance metric, not a usage report buried in a quarterly review.

The teams that win the next two quarters are the ones who audit their frequency curve this week and decide which components they want the agent to default to, then engineer the library to make that the actual default.
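The frequency audit itself is a short computation once usage counts are exported. The sketch below assumes a plain mapping of component name to usage count and a hand-maintained set of deprecated names. Neither is a Figma API, and the numbers in the usage note are invented for illustration.

```python
def risky_defaults(usage: dict[str, int], deprecated: set[str], top_n: int = 10) -> list[str]:
    """Return deprecated components that rank among the top_n most-used ones."""
    # Sort by descending usage, then by name so ties are ordered deterministically.
    ranked = sorted(usage, key=lambda name: (-usage[name], name))
    return [name for name in ranked[:top_n] if name in deprecated]
```

With usage of 1,000 for legacy/button and 300 for v2/button, the function flags legacy/button. That is the migration to finish before the agent starts defaulting to it.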

Choice Three: Seat Tiers Are the Governance Fence

The eligibility matrix is where the constraint-layer thesis becomes legally binding. The agent ships to Pro, Org, and Enterprise full seats. Collab and Dev seats can use it only in drafts. Starter, Education, and Government plans are excluded entirely.

This is not pricing strategy. It is governance. Figma made a deliberate choice that an agent capable of generating production design artifacts should not run inside plans without organizational controls. Starter plans lack the admin surface. Education plans lack accountable contracts. Government plans have procurement constraints that the product cannot satisfy yet.

The Collab and Dev “drafts only” rule is the most interesting wrinkle. It says: you can play with the agent in your own sandbox, but you cannot ship its output into the shared workspace without a full seat. The full seat is the artifact that carries accountability, version history, and the audit trail. The drafts surface is the play space. The boundary between them is the governance boundary, and it now corresponds to a billing tier.

If your design ops team has been arguing internally about which roles should have agent access, Figma has just done the work for you. The default is restrictive, the upgrade path is clear, and the rationale is built into the product rather than bolted on by your IT policy.

The Beta Pricing Tell

One operational detail belongs in every plan: the beta consumes no AI credits, but general availability shifts to credit-based pricing. This is the standard Figma cadence and it means two things for the next quarter.

First, this is the window to run governance experiments without budget pressure. Have your design ops team build a library audit, a frequency review, and a token coverage map before credits start counting. Once metering goes live, every experiment has a CFO question attached.

Second, credit-based pricing turns “use the agent for everything” into a measurable cost. Teams that lean on the agent for every screen will see their bill move. Teams that use the agent for the work the agent is best at, which is exploration and bulk operations, will see governed value. Figma is, intentionally or not, pricing in the discipline.

What To Do This Week

Block 60 minutes with your design ops lead and walk the three governance choices against your current setup.

Open your component library and pull the frequency curve. If the most-used components are not the ones you want the agent to default to, you have a library audit to schedule before GA. The agent is going to amplify whatever your frequency report says, so make the report say the right thing.

Open your token system and confirm the things you want the agent to use are actually @-referenceable. Tokens that exist in a JSON file but are not exposed as Figma variables will be invisible to the agent. The governance you want to enforce has to live in the surface the agent sees.
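
A token coverage check can be as small as a set difference. The sketch below assumes a nested token JSON where each leaf is an object with a `value` key, and a hand-exported set of variable names already exposed in Figma. Both shapes, and every name in the sample data, are hypothetical; adapt them to whatever your token pipeline actually emits.

```python
import json


def token_names(node: dict, prefix: str = "") -> set:
    """Flatten nested token JSON into slash-separated names.

    Assumption: a leaf token is any object that has a "value" key.
    """
    names = set()
    for key, child in node.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(child, dict):
            if "value" in child:
                names.add(path)
            else:
                names |= token_names(child, path)
    return names


def coverage_map(tokens_json: str, exposed_variables: set) -> dict:
    """Split defined tokens into those the agent can reference and those it cannot."""
    defined = token_names(json.loads(tokens_json))
    return {
        "referenceable": sorted(defined & exposed_variables),
        "invisible_to_agent": sorted(defined - exposed_variables),
    }


# Hypothetical sample data for illustration only.
TOKENS = """
{
  "color": {"brand": {"primary": {"value": "#0055ff"}},
            "text": {"muted": {"value": "#667085"}}},
  "space": {"md": {"value": "16px"}}
}
"""
EXPOSED = {"color/brand/primary", "space/md"}

report = coverage_map(TOKENS, EXPOSED)
print(report["invisible_to_agent"])  # ['color/text/muted']
```

Anything in the `invisible_to_agent` list is governance you believe you have but the agent cannot see.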

Pull your seat matrix and decide who gets Pro, Org, or Enterprise full seats. The decision used to be about collaboration features. It is now about who can ship agent-generated production work. That is a different conversation, and you want to have it before someone in marketing asks why they cannot use the agent.

The constraint layer we wrote about in February is now a release note. The teams that built their design system as governance infrastructure will spend Q3 turning the agent on and watching it work. The teams that did not will spend Q3 explaining to leadership why generated screens do not look like their product. Both teams will use the same tool. Only one of them will be glad they did.

This analysis synthesizes The Figma agent is here (Figma, May 2026).

Victorino Group helps design and product teams turn their design system into the governance layer for AI-generated work. Let’s talk.

Formal Verification Just Got Working Receipts. Two of Them. In One Week.

Thiago Victorino — Fri, 22 May 2026 00:00:00 GMT

For the past several months our spec-governance writing has been argument-heavy. We argued that specifications were the missing control layer. We argued that enterprise SDD adoption was outrunning its own governance. We brought back notes from Cloud Next on spec-driven development. We argued that agent specs themselves were governance artifacts. What we did not have was working code we could point at and say, “this is what the verification spine looks like when someone actually builds it.”

In one week we got two.

Antfly published a five-step workflow that uses AI agents to write TLA+ specifications, runs the model checker against a production-grade key/value store, and surfaces a real concurrency bug. Reuben Brooks published Shen-Backpressure, a compiler that turns sequent-calculus type definitions into sealed guard types in Go, TypeScript, Python, and Rust, refusing invalid agent code at compile time. Different abstraction layers. Different languages. Same underlying thesis: when generation is cheap, verification has to be structural, not optional.

This piece is not a re-argument. The pattern is no longer hypothetical. The piece is: here is what the two stacks look like, and here is what they tell us about where the verification spine is going.

What Antfly Actually Did

Rowan Copley’s Cheap Code Means Formal Verification Is Reasonable Now takes a workflow most engineering teams would file under “academic” and shows it running against Pebble, the key/value store underneath CockroachDB. The team picked a historical Pebble race condition as the benchmark and asked: can an agent-driven TLA+ workflow find this bug without prior knowledge of it?

The answer was yes, and the workflow that produced it has five steps:

Write an assumptions.md and a boundaries.md that describe what the system is and what the verification is allowed to touch.
Have the agent write TLA+ specifications, run the model checker, and report findings.
Validate the findings against the actual source.
Create unit tests that reproduce the bug in production code.
Fix the bug and document the result for stakeholder personas.

The headline finding was the race condition. The quieter finding was the QPS optimization loop: the same workflow, pointed at a metric instead of a correctness property, hill-climbed performance by “orders of magnitude.” Formal verification, in this stack, is not just a bug finder. It is a search procedure over a defined state space, and the state space happens to include “fast” alongside “correct.”

The cost story is the part that changes the conversation. TLA+ has existed for decades. Engineering teams have not adopted it broadly because the up-front investment to write the spec was larger than the expected value of finding the bug. When the spec writer is an agent operating from an assumptions.md file, the cost of formal verification collapses. The decision is no longer “is this bug worth a week of TLA+ work.” The decision is “is this subsystem worth running through the verification loop.” The answer becomes yes far more often.
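
For readers who have never run a model checker, the core idea fits in a few dozen lines. The sketch below is not TLA+ and not Antfly's workflow; it is a toy exhaustive state search in Python, written by us, over a hypothetical two-process counter where each process reads and then writes. It explores every interleaving and returns either a counterexample trace or nothing.

```python
from collections import deque

# State: (counter, proc0, proc1); each proc is (program_counter, local_copy).
INIT = (0, ("read", 0), ("read", 0))


def racy_steps(state):
    """Each process reads the counter, then writes local + 1 in a separate step."""
    counter, *procs = state
    for i, (pc, local) in enumerate(procs):
        if pc == "read":
            proc, new_counter = ("write", counter), counter
        elif pc == "write":
            proc, new_counter = ("done", local), local + 1
        else:
            continue
        updated = procs[:i] + [proc] + procs[i + 1:]
        yield f"p{i}:{pc}", (new_counter, *updated)


def atomic_steps(state):
    """Read and write happen in one indivisible step."""
    counter, *procs = state
    for i, (pc, _) in enumerate(procs):
        if pc == "read":
            updated = procs[:i] + [("done", counter)] + procs[i + 1:]
            yield f"p{i}:increment", (counter + 1, *updated)


def invariant(state):
    """Once every process is done, the counter must equal the process count."""
    counter, *procs = state
    finished = all(pc == "done" for pc, _ in procs)
    return counter == len(procs) if finished else True


def check(init, steps):
    """Breadth-first search of all reachable states.

    Returns None on a clean run, else the shortest trace that breaks the invariant.
    """
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return trace[::-1]
        for label, nxt in steps(state):
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None


print(check(INIT, racy_steps))    # ['p0:read', 'p1:read', 'p0:write', 'p1:write']
print(check(INIT, atomic_steps))  # None
```

The verdict comes from the search, not from whoever wrote the step functions. That is the property that lets an agent draft the spec without being trusted with the answer.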

What Reuben Brooks Actually Did

Reuben Brooks’s Structural Backpressure Beats Smarter Agents attacks the verification problem one layer lower. Where Antfly catches concurrency bugs through model checking, Brooks catches authorization bugs (and dozens of other state-shape errors) through types. The vehicle is Shen, a statically typed Lisp with sequent-calculus types, used to generate sealed guard types in target languages.

The example in the post is direct. An agent writes a Go function that uses a tenant ID without going through the authorization gate. The compiler refuses:

cannot use tenantID (variable of type string) as shenguard.TenantId

The agent did not need to be smarter. The compiler refused to accept the shape of the value the agent produced. Brooks’s framing is that the loop around the agent has five gates per iteration of his sb CLI:

Specification generation through shengen.
Tests.
Compilation.
Shen type-check.
Audit scripts.

The line that matters: structural gates “produce definitive answers within their scope, operating independently of model capability.” Translation: the gate does not get smarter when the model gets smarter, and it does not get dumber when the model gets dumber. It produces the same answer for the same input. That is the property that makes a verification spine.

This is the same principle the Antfly TLA+ workflow encodes at a different layer. The model checker either finds a counterexample or it does not. The output does not depend on which agent ran it.
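
The guard-type idea can be approximated in any language. The sketch below is our own illustration in Python, not shengen output and not Brooks's code. Because Python has no compile step, the refusal here happens at runtime, where Go would refuse at build time; a static checker such as mypy would flag the same mistake earlier. The names `authorize`, `load_invoices`, and `_SEAL` are hypothetical.

```python
_SEAL = object()  # module-private token; only authorize() is meant to pass it


class TenantId:
    """A sealed wrapper: holding one proves the value went through the gate."""

    __slots__ = ("_value",)

    def __init__(self, value: str, _seal: object = None):
        if _seal is not _SEAL:
            raise TypeError("TenantId can only be created by authorize()")
        self._value = value

    @property
    def value(self) -> str:
        return self._value


def authorize(allowed_tenants: set, requested: str) -> TenantId:
    """The authorization gate: the only sanctioned way to obtain a TenantId."""
    if requested not in allowed_tenants:
        raise PermissionError(f"no access to tenant {requested!r}")
    return TenantId(requested, _SEAL)


def load_invoices(tenant: TenantId) -> str:
    """Refuses the wrong shape: a raw string is not a TenantId."""
    if not isinstance(tenant, TenantId):
        raise TypeError(f"cannot use {type(tenant).__name__} as TenantId")
    return f"invoices for {tenant.value}"


tenant = authorize({"acme", "globex"}, "acme")
print(load_invoices(tenant))  # invoices for acme
```

Passing the raw string `"acme"` to `load_invoices` raises `TypeError` regardless of which model wrote the call. The gate inspects the shape of the value, not the intent behind it.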

Two Layers, One Architecture

Stack the two pieces and the picture comes into focus.

At the spec layer, Antfly has agents write TLA+ specs from a constrained assumptions.md. The model checker is the gate. The output is a counterexample trace or a clean run. The agent’s job is to compose the spec, not to be trusted with the verdict.

At the type layer, Brooks has agents emit code in target languages. The compiler is the gate. The output is a refusal or a passing build. The agent’s job is to produce code that satisfies the types, not to be trusted with safety.

Different layers. Same architecture. The agent is the labor; the verifier is the floor. The verifier does not need to understand intent. It needs to refuse the wrong shape.

This is what we have been pointing at for months. The governance deficit in enterprise SDD was a description of the missing floor. Agent specs as governance artifacts was a description of the missing labor input. The two pieces this week are working assemblies, in different programming-language families, at different abstraction layers, with different verification techniques. They are not academic. Antfly’s workflow reproduced a real Pebble bug. Brooks’s compiler refuses real Go code. The receipts are in.

What This Says About the Next Twelve Months

If you have been waiting for the formal-verification-meets-agents pattern to leave the conference talk and enter the codebase, the wait is over. Two practitioners shipped working implementations in one week. They will not be the last. The pattern is too clean and the cost economics are too favorable for the next batch of teams to ignore.

Three implications follow.

First, the verifier becomes the differentiator. If two agent-generated patches both compile and both pass tests but only one passes a model-checker run, the model-checker run is the thing that lets you ship. The team that has the verification spine ships faster than the team that does not, because the team without the spine still has to discover the bug in production.

Second, the spec becomes an asset. assumptions.md and boundaries.md are not throwaway prompts. They are the verification contract for a subsystem, and they live as long as the subsystem does. Teams that write these well accumulate a library of verifiable surfaces. Teams that do not, do not.

Third, the abstraction layer is open. Antfly works at the system-design layer. Brooks works at the type-system layer. Nothing prevents a future team from working at the SQL layer (compile-time schema validation against agent-issued migrations), the API contract layer (refusing agent-generated HTTP handlers that violate published schemas), or the policy layer (refusing agent actions that violate IAM contracts before they reach the runtime). The pattern travels.

Do This Now

Pick one subsystem in your codebase where a wrong answer would be expensive. Write the assumptions.md for it: what it is, what it depends on, what invariants must hold. Write the boundaries.md: what the verification is allowed to touch, what it is forbidden from changing.

Now ask: which of the two stacks fits your subsystem? If the failure mode is concurrency or state machine drift, the Antfly TLA+ workflow is your starting point. If the failure mode is unauthorized access, untrusted data shape, or skipped authorization, the Brooks Shen-Backpressure pattern is your starting point. Run one iteration of the loop. See what the verifier refuses.

The receipts are in. The verification spine is no longer theoretical. Build the floor before your agents need it.

This analysis synthesizes Cheap Code Means Formal Verification Is Reasonable Now (Antfly, May 2026) and Structural Backpressure Beats Smarter Agents (Reuben Brooks, May 2026).

Victorino Group helps engineering teams add structural verification spines to their agent workflows. Let’s talk.

Gartner Just Quantified the AI Trust Deficit in B2B Buying

Thiago Victorino — Fri, 22 May 2026 00:00:00 GMT

At the May 2026 Gartner CSO and Sales Leader Conference, the analyst firm published a set of numbers that vendors have been quietly avoiding. 70% of B2B buyers prefer digital self-service buying experiences. Nearly 50% are already using generative AI tools to research vendors and products. Over 50% report receiving misleading information from those AI tools. And 69% rely on sales representatives to validate what the AI told them.

Gartner also projects that by 2027, 95% of seller research workflows will begin with AI.

Five percentages, one story. Buyers want autonomy. They are exercising it. The autonomy is producing unreliable results. They are routing around the unreliability by calling a human. The marketer narrative that AI is replacing the sales conversation is, at best, half the picture. The other half is that AI raised the bar on what makes a sales conversation worth having.

A note on the source before going further. The data was presented at a Gartner conference and reported via MarTech. The published account does not disclose sample size, methodology, or survey instrument. Treat the directional shape of the numbers as informative; treat the precise digits as conference-stage rounding. The argument that follows holds even if each figure is off by a few points.

The Self-Service Preference Is Not the Replacement Signal

The 70% self-service preference number gets quoted as if it means buyers do not want salespeople. That is not what self-service preference means. It means buyers do not want salespeople for the parts of the cycle they can complete on their own.

Watch the sequence the same buyer goes through. They Google a category. They land on a vendor page. They read three competitors. They paste descriptions into ChatGPT or Gemini and ask for a comparison. They get an answer that sounds authoritative. They have no way to verify it, because the AI does not cite product specs, does not know the contract terms, and confidently invents capabilities that do not exist. Now they have a shortlist they cannot trust and a comparison they cannot defend internally.

At this point the buyer does one of two things. They either book a sales call with a person to confirm what the AI told them, or they walk away from the category. The 69% who rely on sales reps for validation are the first group. The second group does not show up in Gartner’s data; they are the silent loss.

The implication for sales organizations is uncomfortable. The early stages of the funnel are being commoditized by AI search. The validation moment, which used to be the third or fourth touch, is now the first time a human enters the conversation. And the buyer arrives suspicious, because the AI already lied to them at least once.

What the 50% Misleading Number Actually Costs

More than half of B2B buyers report that AI tools have given them wrong information about a vendor. That is not a marginal annoyance. It is a structural trust problem that compounds across every interaction.

The wrong information takes specific forms. AI tools confuse two products from the same vendor. They cite features from a competitor as if they belonged to the queried product. They quote pricing from outdated pages. They invent integrations. They summarize a vendor’s positioning in a way that flattens what the vendor spent two years differentiating. None of these are random noise. They are patterns that emerge when a language model encounters fragmented or shallow source material and fills the gaps with plausible-sounding text.

The cost is not just the deals you lose because the AI misrepresented you. The cost is the deals you have to re-earn because the buyer arrives believing something untrue, and your sales rep now has to spend the first 20 minutes of the call gently correcting AI output without making the prospect feel stupid. The validation conversation has become a remediation conversation, and that takes longer, costs more, and converts worse.

The Validation Moment Is the New Front Door

Sixty-nine percent of buyers ask a salesperson to validate what the AI told them. Translate that into operational terms.

The sales conversation is no longer about discovering needs. The buyer did that with AI. It is not about presenting features. The buyer pulled those from the website. It is about confirming or correcting the picture the buyer assembled before the rep ever entered the room. The rep who succeeds in 2026 walks in knowing that the prospect already has a draft opinion, that the draft is partially wrong, and that the job of the first 10 minutes is to figure out which parts are wrong without sounding defensive about it.

This is a different muscle than discovery selling. It is closer to consultative correction. Reps need to ask, early and explicitly, what the buyer already believes and where they got it. They need to be unbothered when the answer is “I asked ChatGPT” or “Perplexity told me.” They need to have a mental model of how AI summarizes their category and where the failure modes are, so they can predict the wrong impression and pre-empt it.

Marketing has a role here too, and it connects to the governance function we have argued marketing is becoming. If half of buyers are getting misled by AI, marketing’s job extends beyond producing content. It includes monitoring how AI tools represent the brand, correcting the source material AI is pulling from, and giving sales the artifacts they need to do the validation conversation well. This is part of the broader decoupling of output from competence that requires a verification layer: the AI produces a confident answer, and the validation layer (a human, a document, a demo) is what makes the answer trustworthy.

It also matters because, as we wrote when agents start buying as well as selling, the buyer-side AI agent is the next layer of the same problem. Today a human is asking ChatGPT and then calling sales. Tomorrow an agent is asking ChatGPT and writing a shortlist to a procurement queue with no human in between. The trust deficit does not go away. It moves up the stack.

What to Do This Week

Three concrete actions for sales and marketing teams reading this data:

Audit how AI represents you. Take your top five competitors. Ask ChatGPT, Gemini, Claude, and Perplexity to compare your product against each one. Read the answers as if you were a skeptical buyer. Note every factual error, every confused feature, every outdated detail. This is the picture your prospects are arriving with. If you do not know what AI says about you, you do not know what your reps are walking into.

Rewrite the first 10 minutes of the sales call. Train reps to open with “what have you already learned about us, and where did you learn it.” Drop the discovery script. The discovery happened before the call. The opening is now diagnostic: what does the buyer believe, and how much of it is wrong. Build a one-page cheat sheet of the most common AI misrepresentations of your product and how to correct each one without condescension.

Treat your public content as AI training data. Your website, your docs, your pricing page, your case studies. Every one of those pages is a source an AI tool will pull from to answer questions about you. If your product page is vague, the AI will fill in the vagueness with confident guesses. If your case studies are buried, the AI will not find them and will summarize your positioning from a third-party review instead. The clarity, structure, and accessibility of your content now affects what AI tells your prospects before they ever talk to you.

The 2027 projection that 95% of seller research workflows will begin with AI is the easy half of this story. The hard half is that nearly half of buyers already start there, and they know the answers are unreliable. The teams that win the next two years are not the ones that adopt AI fastest. They are the ones that build the validation layer that makes AI-sourced research safe to act on.

This analysis synthesizes B2B Buyers Trust AI Less Than Marketers Think (MarTech covering Gartner, May 2026).

Victorino Group helps B2B sales and marketing teams turn AI trust gaps into validation moments that close deals. Let’s talk.

Greg Wilson Just Gave Us an Academic Spine for AI Productivity Skepticism

Thiago Victorino — Fri, 22 May 2026 00:00:00 GMT

Every measurement-skepticism essay we have published about AI coding productivity in the last six months has carried the same uncomfortable footnote: most of the numbers we were rebutting came from vendor blogs, and most of the numbers we were citing in rebuttal came from a small handful of studies that everyone keeps quoting because there is not much else to quote. The literature was real. It was just scattered. No one had assembled it.

Greg Wilson did that on May 20, 2026. Twelve Ways to Be Wrong About AI-Assisted Coding is the peer-reviewed spine the productivity debate has been missing. Each of the twelve failure modes Wilson catalogues comes with at least one academic source behind it, and the citations are mostly from 2025 and 2026, which means the field has finally generated enough empirical work to support an actual review.

If you have spent any time arguing with a vendor’s “40% faster with our copilot” claim and felt yourself reaching for the same three references, this is the document to replace that toolkit with.

What the Studies Actually Say

The headline finding running through Wilson’s review is that vendor benchmarks and field measurements disagree by a margin that should embarrass anyone still quoting the former. Peng (2023) found that GitHub Copilot produced a 55% task speedup on an artificial coding problem. Becker (2025) measured AI assistance on real open-source maintenance work and found the effect inverted: a 19% slowdown, not a speedup. The constructed task behind the original 55% number bears no resemblance to maintaining a five-year-old codebase with seventeen contributors.

The senior developer finding is the one that should make engineering leaders stop. The same body of research that shows junior developers getting genuine acceleration also shows senior developers experiencing a 19% productivity decline. The mechanism is not mysterious. Seniors absorb the review burden for AI-generated code that juniors merge. The tool’s output becomes their input, and the input is lower quality than what the senior would have written themselves. We wrote about this dynamic in The Speed Trap of AI Coding; Wilson’s review now gives it a citation.

Liu (2026) measured the quality drag directly: more than 15% of AI-generated commits introduce quality issues, and roughly 25% of those issues persist long-term. That is not a transient cost. That is technical debt being shipped at a rate that exceeds normal code review’s catch rate, and it compounds.
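
The two rates compound, and the arithmetic is worth doing once. The sketch below takes the quoted figures at face value and assumes, for simplicity, one quality issue per affected commit; that assumption and the 1,000-commit volume are ours, not Liu's.

```python
def persistent_issue_commits(commits: int, issue_pct: float = 15, persist_pct: float = 25):
    """Return (commits with quality issues, commits whose issue persists long-term).

    Simplifying assumption: each affected commit carries exactly one issue.
    """
    with_issues = commits * issue_pct / 100
    persistent = with_issues * persist_pct / 100
    return with_issues, persistent


with_issues, persistent = persistent_issue_commits(1000)
print(with_issues, persistent)  # 150.0 37.5
```

Under those assumptions, roughly 3.75% of AI-generated commits leave debt that stays in the codebase. Since the source figure is “more than 15%,” treat that as a floor.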

He (2026) studied Cursor adoption specifically and found that velocity gains were transient while complexity increases were persistent. The team got faster for a quarter, then settled back to baseline velocity while carrying a permanently higher complexity load. This is the output-competence decoupling we have written about, measured longitudinally.

The enterprise studies tell the same story from the procurement side. Bakal (2025) reported a 33% acceptance rate for AI suggestions in production environments, with no correctness tracking attached. The organization buying the tool knows how often developers accept the suggestion. It does not know how often the accepted suggestion was right. Weisz (2025) at IBM measured uneven gains across users in a controlled study, with the variance large enough that aggregate “productivity lift” numbers became meaningless.

The security floor is the one that should make CISOs read this paper twice. Pearce (2022), confirmed by Dora (2025), tested five major LLMs against established web security standards. All five failed. Not “performed below expectations” failed. Failed. The implication is that any team measuring AI coding productivity without measuring AI coding security is computing a numerator while ignoring a denominator that may already exceed it.

Why This Is a Literature Review and Not Another Essay

The reason Wilson’s piece matters is not that it makes a new argument. The argument has been made. It matters because for the first time, you can hand someone the citations.

Every time we have written about the two-percent productivity gap, or about why measuring the team and not the model is the only honest move, or about the harness difference, we have been arguing in a vacuum where the other side cites SaaS marketing decks and our side cites three studies on repeat. Wilson catalogued the rest. Becker, Peng, Liu, He, Bakal, Weisz, Pearce, Dora. The names matter because the names are how the conversation moves from belief to citation.

If you sit in a meeting where someone says “our developers are 40% faster with this tool,” you can now ask three questions with academic backing:

What is the task distribution? Becker showed the 55% speedup collapses to a 19% slowdown when you move from artificial problems to real maintenance work.
What is the seniority distribution? The same productivity number averaged across juniors and seniors hides a decline at the senior end.
What is the persistence horizon? He showed the velocity gain is transient and the complexity cost is permanent.

Three questions. Three citations. The vendor can no longer answer with another deck.
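
The seniority question is the easiest one to demonstrate with arithmetic. The sketch below uses invented numbers, not figures from any of the cited studies, and assumes every developer contributes equal baseline output, to show how a headline average can coexist with a decline at the senior end.

```python
def blended_change(groups: dict) -> float:
    """Headcount-weighted mean of per-group productivity change, in percent.

    Assumption: every developer has the same baseline output.
    """
    total_heads = sum(heads for heads, _ in groups.values())
    weighted = sum(heads * change for heads, change in groups.values())
    return weighted / total_heads


def declining_groups(groups: dict) -> list:
    """Names of groups whose productivity went down."""
    return [name for name, (_, change) in groups.items() if change < 0]


# Hypothetical team: (headcount, percent change). Invented for illustration.
team = {"junior": (6, 80), "senior": (4, -20)}

print(blended_change(team))    # 40.0
print(declining_groups(team))  # ['senior']
```

A deck built on this team could truthfully say “40% faster” while its four most experienced engineers each got slower. Ask for the per-group numbers before accepting the blend.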

The Implementation Stays the Same

Wilson’s review does not change what a serious measurement program looks like. It just changes what the conversation around procurement and adoption sounds like. The implementation work we have published still stands.

If you want to measure your team rather than the model, The Software Centaur Era is still the framework. If you want to understand why output competence and verification have to be measured separately, the verification layer essay is still the breakdown. If you want to know why the same model produces different productivity numbers in different harnesses, the harness difference is still the explanation.

What changes is what you put in front of the people who do not read those essays. You put Wilson in front of them. You put Becker, Liu, He, Pearce in front of them. The skepticism essays were for practitioners. The literature review is for everyone else.

Do This Now

Block thirty minutes this week. Read Wilson’s piece end to end. Pull the three or four citations most relevant to the productivity claims you are currently being asked to evaluate or defend against. Add them to whatever shared document your engineering organization uses for vendor evaluation. The next time someone walks in with a “40% faster” deck, the document already has the counter-citations loaded.

Then take the harder step: audit your own internal productivity claims for AI coding tools. If you have told an executive “our team is X% more productive since adopting Y,” check that claim against Wilson’s twelve failure modes. The most common discovery is that the metric measured something other than what its name implied. That is the moment to fix the metric, not the moment to defend it.

The productivity debate just stopped being a vibes contest. The literature is assembled. The names have citations. The vendors no longer get the last word by default.

This analysis synthesizes Twelve Ways to Be Wrong About AI-Assisted Coding (Greg Wilson, May 2026).

Victorino Group helps engineering leaders replace vendor productivity claims with measurement that survives a peer review. Let’s talk.

Anthropic Set a Lock-In Date. June 15, 2026.

Thiago Victorino — Wed, 20 May 2026 00:00:00 GMT

Mark the date: June 15, 2026. According to developer reports (Vincent Schmalbach, May 19, 2026), that is the day third-party harness subscriptions stop working against Claude. It is also the day the Claude Agent SDK and the claude -p CLI move to separate billing pools. The two changes ship together. Read alongside three other recent Anthropic moves, they describe a single trajectory: close the access surface before the IPO, then price what remains.

This is not a complaint piece. Each decision is defendable in isolation. Closing third-party harnesses cleans up margin and load. Splitting billing pools clarifies what enterprises are paying for. Restricting commercial terms protects model weights. Revoking competitor API access is standard self-defense. Winning a $200M defense contract is a business outcome any board would celebrate. The point is not that any single decision is wrong. The point is that five of them ship in a window where the only consistent thread is access control, and procurement teams have to read them as a system.

What June 15 Actually Changes

Per Schmalbach’s reading of the published terms, two things break on the same day.

First, third-party harness subscriptions stop accepting Claude Pro and Max plans. OpenClaw, OpenCode, Pi, and similar wrappers that route through end-user subscriptions become non-functional. Customers who built workflows on top of those harnesses lose them on a fixed date, with the migration path being to bring billing in-house under Anthropic’s commercial terms.

Second, the Claude Agent SDK and the claude -p CLI move to separate billing pools. The framing in developer-facing documents is operational clarity. The practical effect is that what used to be one usage allowance becomes two, with different rate limits and different invoices.

Treat these as confirmed only after primary-source verification. Anthropic’s official policy pages should be the citation enterprise teams pin to procurement memos, not a developer blog. Use the developer report as a forecast that demands confirmation, not as the source of truth.

The Commercial Terms Are the Real Story

The billing changes are the visible shift. The commercial terms are the structural one.

Per developer reports, Anthropic’s updated terms now prohibit using Claude to build competing products, to train other models, to resell access, or to reverse-engineer the system. Each of these has a clean defensive read. No frontier lab wants its outputs feeding the next competitor’s training set. None of them want to be a wholesale layer for somebody else’s reseller margin. The clauses are not unusual in isolation.

What is unusual is the timing. The clauses arrive in a window where Anthropic is also revoking competitor API access (OpenAI was cut off; Windsurf was restricted during the acquisition talks with OpenAI), tightening third-party harness access, and signing a $200M Department of Defense agreement that places Claude in classified networks alongside Palantir. Each individual move sits inside normal commercial behavior. The collection describes a vendor pricing optionality away from its largest customers before an IPO window.

The procurement question is not whether the clauses are reasonable. They are. The question is what an enterprise’s exit options look like if the clauses tighten further, six months after a public listing, when the company has new shareholders to satisfy.

Five Decisions, One Direction

Lay them out as a list:

Third-party harness subscriptions stop working on June 15, 2026 (per Schmalbach).
Agent SDK and claude -p move to separate billing pools the same day.
Commercial terms prohibit competing products, training, reselling, and reverse engineering.
OpenAI’s API access was revoked; Windsurf access was restricted during the acquisition window.
A $200M DoD agreement (2025, as reported) deploys Claude in classified networks alongside Palantir.

Each is defendable. Each clarifies a surface that used to be ambiguous. The pattern that emerges, however, is unambiguous: the optionality flows toward Anthropic and away from the buyer. Buyers who built on the assumption that “Claude is the model, and the harness is interchangeable” are absorbing the cost of that assumption now. Buyers who built on the assumption that “we can swap providers if pricing or terms shift” are watching one of the two frontier providers lock down the swap path.

We argued the general thesis in frontier-capacity scarcity creates vendor risk and again in foundation labs absorbing the stack. This is what the abstract argument looks like when it arrives with a date attached.

What Changes in the Procurement Playbook This Quarter

Two things should change in how you buy AI capacity this quarter.

Treat harness and model as separate procurement decisions, even when you buy them together. If your engineering team runs Claude through a wrapper, document the wrapper’s provider, its billing path, and its dependency on Anthropic’s commercial terms. If the wrapper depends on end-user subscriptions, you have a June 15 cliff to plan around. Bring the billing question to your vendor management team this month, not next quarter.

Model the cost curve on two providers, not one. This does not mean splitting traffic 50/50. It means having a tested deployment path on a second frontier provider, with measured latency, output quality, and integration cost. The objective is not parity; the objective is a credible exit if commercial terms tighten further. We described the harness governance layer in Claude managed agents harness governance. The procurement layer above it is the one that has to actually exist on paper.

A third change is worth thinking about, even if it does not ship this quarter. The distillation supply chain risk essay traced how downstream models depend on frontier outputs. Terms that prohibit using Claude to train other models close a path that some buyers were quietly relying on. If your AI roadmap includes training smaller, domain-specific models from larger model outputs, the legal and procurement teams need a sober conversation about which provider’s terms permit what, and which paths just closed.

The Asymmetry to Watch

The deepest part of the story is the DoD contract. A $200M agreement (as reported) puts Claude in classified networks alongside Palantir. National-security workloads are not just another customer segment. They reshape a vendor’s incentives in ways that civilian customers feel later. Margin compresses on commercial accounts to fund the cost of compliance. Commercial terms tighten because federal contracts come with audit obligations that flow downstream. Roadmaps tilt toward features the largest customer asks for.

This is not a critique of working with the Department of Defense. It is an observation that an enterprise buying Claude in 2026 is sharing a roadmap with an institution whose requirements will, over time, change what gets built and what gets restricted. Procurement teams should ask the question explicitly: what does Claude’s product roadmap look like if defense customers become a meaningful share of revenue, and how does that intersect with our use case?

Do This Now

Three actions this quarter, in order.

First, audit every workflow that touches a Claude-based third-party harness. Identify which ones route through end-user subscriptions. Get those off the June 15 path before May closes, even if it means temporarily migrating to direct Anthropic billing while you evaluate alternatives.

Second, get the updated commercial terms in front of your legal team and ask one question: which of our current use cases sits in the gray zone of “competing product,” “model training,” or “reselling”? The answer matters more than the headline.

Third, fund a measured second-provider deployment by end of Q3 2026. Not a hot-swap. A documented, tested fallback with known cost, latency, and output characteristics. The objective is to make the next round of pricing or terms changes a negotiation, not a fait accompli.

Anthropic set a date. That clarifies the calendar. What it does not clarify is whether your procurement playbook is built for vendors who set dates, or for vendors who used to be more permissive than the contract said they had to be. The former is the world we are in now. The latter is the world we were in last year.

This analysis synthesizes Anthropic Is Preparing for IPO and We Should Be Worried (v2) (Vincent Schmalbach, May 2026), pending primary-source confirmation of cited restrictions.

Victorino Group helps enterprises model vendor-lock-in risk and design procurement playbooks that survive provider policy shifts. Let’s talk.

Disney Just Gave Knowledge Governance Its Largest Case Study

Thiago Victorino — Wed, 20 May 2026 00:00:00 GMT

On May 6, 2026, Disney pointed fivethirtyeight.com at a redirect. Eleven years of journalism, models, methodology, and brand vanished from the open web in a single afternoon. The cost to keep the archive live was, in Nate Silver’s words, roughly a dollar. The cost to destroy it, in his estimate, was about 200,000 person-hours of work.

That is not a metaphor. That is the largest concrete case study available right now for the question every enterprise should be asking before it deploys another agent against an internal corpus: when the org chart shifts, who is the steward of the institutional knowledge?

The destruction inventory

Silver published the numbers on his Substack. He founded FiveThirtyEight in 2008, took it to the New York Times from 2010 to 2013, then sold it to ESPN/Disney in 2014. He left in 2023. Disney shut the publication in March 2025. The archive sat at fivethirtyeight.com for fourteen more months while Disney decided what to do with it. In May 2026 they decided.

Here is what was on the site the morning it went dark:

The 2014 through 2025 article archive. Roughly 10 years times 52 weeks times 20 stories per week times 20 hours of work per story. Silver’s arithmetic comes to about 200,000 person-hours.
The interactive sports models. NBA, NFL, MLB, soccer. Each one a multi-year statistical artifact with documented assumptions, training data, and prediction history.
The election forecasting model continuity. Twelve years of forecasts, calibration plots, post-mortems, and methodology pages that were the de facto reference for how to communicate probabilistic political claims to a general audience.
The site design and brand. The visual grammar that taught a generation of newsrooms how to render a probability distribution above the fold.
The methodology documentation. The pages that explained, in plain language, what the models were doing and why. These were the user manual for trusting the numbers.

Silver also published the rejected business case. He estimates a paywalled archive could sustain 100,000-plus paying subscribers, yielding around $5 million per year in recurring revenue. Disney walked away from that revenue. The hosting bill they avoided was effectively a rounding error against the asset base they wrote off.

For context, the Pew Research Center has documented that roughly 40 percent of links from a decade ago are already dead. FiveThirtyEight’s archive was not at risk from neglect. It was at risk from a deliberate corporate decision to release the URL.

This is not a media story

Read the destruction inventory again with one substitution. Replace “FiveThirtyEight” with the name of any high-value internal corpus in your company. The 10-year archive of your engineering postmortems. The lineage docs your data team curated for the warehouse migration. The customer interview transcripts your product organization used to win three quarters of strategy debates. The model cards your ML team wrote to defend deployment decisions to compliance.

Each of those corpora has the same structural characteristics as FiveThirtyEight. Long-tail value that compounds with retrieval. Methodology pages that are the user manual for trust. A small number of named curators. A hosting cost that is rounding error against the value of the asset.

Now ask the FiveThirtyEight question. If your parent company, your acquirer, your new platform owner, or your incoming CTO decides on a Tuesday that the corpus is non-strategic, what is your retention path? Not your backup path. Backups are a checkbox. Retention means continuity of access, continuity of URL, continuity of methodology documentation, continuity of the curator’s name on the file. Disney almost certainly has backups of FiveThirtyEight on some cold storage tier. The public archive is gone anyway, because the org chart no longer rewards anyone for paying the dollar to keep it live.

This is what knowledge governance has to solve, and it is the part nobody wants to write a policy about. The technical problem is trivial. The organizational problem is the entire ball game.

What the agent era changes

Pre-LLM, archive loss was a librarian’s problem. The community of scholars, journalists, and policy analysts who used FiveThirtyEight had alternatives. They could rebuild the citation chain through Wayback Machine snapshots, through manually saved PDFs, through colleagues who screen-grabbed the methodology pages in 2018. Painful, but tractable.

Post-LLM, the calculus changes in two directions.

First, the agents your organization deploys are retrieving against your corpus continuously. Every product manager asking Claude to summarize three years of customer interview themes is implicitly trusting that the underlying documents are still there, in the same place, with the same metadata. The retrieval layer is silent about absent sources. The model will produce a confident answer drawn from whatever is left. Corpus erosion shows up as quiet quality decay long before anyone notices a 404.

Second, the value of curated corpora has gone up, not down. A clean, dated, attributed, methodology-documented archive is the most valuable input a retrieval-augmented system can have. The same archive that Disney decided was worth a dollar to delete is, in the hands of a competent retrieval pipeline, the kind of asset that produces durable answer quality. The market for institutional knowledge has shifted underneath the people who own it, and most of them have not noticed.

Put those two together. The agents need the corpus more than they have ever needed it. The owners of the corpus are still making 2015-era decisions about whether to keep it online.

The three policies your company needs

Take the FiveThirtyEight case study as a forcing function and write three documents this quarter.

A corpus inventory. Every high-value internal corpus, the named human curator, the host system, the URL or path stability commitment, the methodology documentation, and the expected lifetime in years. If you cannot fill in any of those columns for a corpus, that corpus is one reorg away from being deleted.

A boundary-shift protocol. What happens to each corpus when the team that owns it is dissolved, the budget line is cut, or the platform is migrated? Who inherits the curator role? Who is the named owner of the URL? The protocol does not have to be elaborate. It has to be written down before the reorg, not after.

A retrieval audit. For every agent or workflow in production that depends on retrieval, the source corpus has to be on the inventory. If the source is not on the inventory, the retrieval is borrowing trust from an asset nobody is committed to keeping alive. That is the silent failure mode Disney just demonstrated at scale.
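The inventory and the retrieval audit are small enough to prototype in an afternoon. What follows is a minimal sketch, not a prescribed schema: the field names, the corpus names, and the workflow names are all hypothetical, and a real inventory would live in whatever system your organization already uses for asset records.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CorpusRecord:
    """One inventory row. Field names are illustrative, not a standard."""
    name: str
    curator: Optional[str] = None          # named human curator
    host_system: Optional[str] = None
    stable_path: Optional[str] = None      # URL or path stability commitment
    methodology_doc: Optional[str] = None
    lifetime_years: Optional[int] = None   # expected lifetime in years


def missing_columns(record: CorpusRecord) -> list[str]:
    """Return the inventory columns that are still blank for this corpus."""
    return [f.name for f in fields(record)
            if getattr(record, f.name) in (None, "")]


def retrieval_audit(workflow_sources: dict[str, list[str]],
                    inventory: dict[str, CorpusRecord]) -> dict[str, list[str]]:
    """Map each workflow to the sources it reads that are either absent
    from the inventory or listed with blank columns."""
    findings: dict[str, list[str]] = {}
    for workflow, sources in workflow_sources.items():
        unbacked = [s for s in sources
                    if s not in inventory or missing_columns(inventory[s])]
        if unbacked:
            findings[workflow] = unbacked
    return findings


# Hypothetical data: one fully documented corpus, one with gaps,
# and one workflow reading a source nobody inventoried.
inventory = {
    "postmortems": CorpusRecord("postmortems", "A. Curator", "wiki",
                                "/eng/postmortems", "/eng/postmortems/README", 10),
    "interviews": CorpusRecord("interviews", host_system="shared-drive"),
}
workflows = {
    "pm-summarizer": ["interviews"],
    "oncall-bot": ["postmortems", "runbooks"],
}
print(retrieval_audit(workflows, inventory))
```

Any workflow that shows up in the output is borrowing trust from a corpus with no named commitment behind it.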

Do this now

Pick the most valuable corpus your organization owns. The one that, if it disappeared on a Tuesday afternoon, would cost the next twelve months of decisions their evidentiary base. Write down its named curator, its host system, and its retention commitment in years. Send the document to one executive and one finance partner. Ask them to confirm in writing.

If you cannot get that confirmation in two weeks, you have learned something important about your organization’s actual posture on knowledge governance. The Disney decision was not an accident or a budget oversight. It was the predictable output of a system where nobody was incentivized to spend a dollar to keep an asset alive after the org boundary moved. Most enterprises are running the same system and have not been tested yet.

Silver wrote that he tried multiple times over multiple years to negotiate a path to keep the archive online. None of those negotiations succeeded, because the decision-makers who could have approved the dollar were not the people who had built the asset. That is the structural failure mode. Build the inventory and the boundary-shift protocol before you find out who, in your company, is the equivalent of the Disney executive who said no.

This analysis synthesizes Disney Erased FiveThirtyEight (Nate Silver, May 2026).

Victorino Group helps enterprises design corpus-retention and knowledge-governance policies that survive corporate boundary shifts. Let’s talk.

Enterprise Marketing AI Is Stuck. Challenger Brands Built the Workaround.

Thiago Victorino — Wed, 20 May 2026 00:00:00 GMT

Mike Shields cornered two enterprise CMOs at a recent industry dinner and asked them the question every vendor deck dodges. What is your real return on the marketing AI you bought? The answer, from both: basically nothing.

Same week, 55 Orangetheory Fitness studios across 10 states were running Passionfruit AI in production. Hourly lead tracking. Real-time media mix optimization. The analyst team that used to do this work is gone. Not augmented. Replaced.

Two data points. Same quarter. Opposite outcomes. If you read this as a tooling difference, you will misallocate the next budget cycle. The challenger brands are not winning because their AI is better. They are winning because they have less organization standing between the model and the decision.

The “basically nothing” admission

Shields, who covered ad tech at the Wall Street Journal and Business Insider before launching Next in Media, is not a vendor critic by trade. When two enterprise marketing leaders tell him separately that their AI returns are negligible, the signal is not a one-off complaint. It is what enterprise marketing AI looks like when you strip out the case study selection bias.

Read the failure mode carefully. These CMOs did not say the models do not work. They did not say the vendors lied. They said the ROI is basically nothing. That is the language of capability without realization. The technology is doing something. The organization is absorbing the value before it reaches the P&L.

This is the realization problem dressed in a marketing suit. The model produces a recommendation. The recommendation has to clear brand. Then legal. Then a regional approval. Then a global media review. By the time the recommendation acts on a campaign, the moment has moved, the budget cycle has closed, and the decision has been diluted into a committee compromise.

The vendor sold capability. The org chart consumed it.

The Orangetheory contrast

Alan Magee, CMO of Empire Portfolio Group, runs marketing for 55 Orangetheory studios. That is not a small business. It is a multi-state operation with real budget, real complexity, and real customer data. He gave Passionfruit AI a live job: aggregate the lead data, optimize the media mix, run it hourly.

Before Passionfruit, this work required a dedicated analyst team. After Passionfruit, the analyst team is not part of the workflow. The AI ingests the lead data, attributes spend, and surfaces the optimization. The CMO looks at the output and adjusts.

Raffi Salama, CEO of Passionfruit, framed it for Shields: “It’s the smaller brands that will compete with titans in ways they never could before.”

Salama is right about the direction and wrong about the cause. The smaller brands are not competing because the tool is small-brand-shaped. They are competing because the decision path between AI output and budget action is short enough that the AI can actually change the spend before the spend has already happened.

A 55-studio chain has one CMO, one budget owner, and one approval. The enterprise equivalent has eight CMO-equivalents, four matrixed budget owners, and a brand committee that meets every other Tuesday. Same AI. Same data. Different organization. Different outcome.

Where governance becomes the brake

The standard reading of enterprise marketing AI failure blames the platforms. Meta’s AI connector ships with no granular permission control. Performance Max is a black box. The vendor stack is fragmented. All of that is true. None of it is the binding constraint.

The binding constraint is who has to sign before the AI’s output becomes an action.

In the Orangetheory case, the answer is one person. In the enterprise case, the answer is a workflow. The workflow exists because the enterprise has more brand surface, more regulatory exposure, more historical accidents that produced new approval gates. Each gate was rational when it was installed. Every gate together produces an organization that cannot operate the technology it bought.

This is not a tooling debate. It is a governance design choice that no one made deliberately. The approval chain grew by accretion. The AI walked into the chain expecting to be a participant and discovered it is a recommender to a recommender to a recommender. The capability never reaches the spend decision in time to change it.

The diagnostic the CMOs are not running

If your AI ROI is basically nothing and your vendor’s case studies show 10x results at smaller companies, the honest question is not “which AI should we buy next?” It is “what would have to be true about our organization for this AI to produce value?”

Three tests, in order. Each one is answerable in a week.

First, time from AI recommendation to budget action. Pick one campaign. Trace the path. How many people touched it? How many days elapsed? If the answer is more than two people and more than 48 hours, the AI’s optimization signal is stale before it lands. The model’s edge is in cadence. You bought a system that runs hourly and you are deploying it into a workflow that runs quarterly.

Second, where the approval gate adds value the AI did not already address. Most enterprise marketing approval chains were designed before the AI could explain its own recommendation. The brand reviewer was checking what the agency produced. If the AI now produces the recommendation with the brand constraints encoded, the gate is reviewing a problem the system already solved. Document the value each gate adds. If a gate cannot point to a decision it has changed in the last quarter, that gate is org chart, not governance.

Third, who owns the loss when the AI does nothing. This is the question that surfaces the real problem. Enterprise marketing teams have spent two years buying AI and reporting investment. No one is on the line for the realization. The CMOs at Shields’ dinner did not say the basically-nothing ROI is showing up in their performance review. The investment is reported. The realization is invisible. The org chart has no row for “AI value not captured.”
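The first test is mechanical enough to sketch. The snippet below assumes you can reconstruct, for one campaign, when the recommendation was produced, who touched it and when, and when the budget actually moved. The function name, the gate names, and the dates are hypothetical; the thresholds (more than two people and more than 48 hours) and the hourly cadence are the ones named above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Touch:
    person: str      # who reviewed or approved
    at: datetime     # when they finished


def time_to_action(recommended_at, acted_at, touches,
                   cadence=timedelta(hours=1), max_people=2, max_hours=48):
    """Summarize the path from AI recommendation to budget action."""
    elapsed = acted_at - recommended_at
    people = {t.person for t in touches}

    # Find the handoff that waited longest before the next sign-off.
    previous = recommended_at
    slowest, slowest_gap = None, timedelta(0)
    for t in sorted(touches, key=lambda t: t.at):
        gap = t.at - previous
        if gap > slowest_gap:
            slowest, slowest_gap = t.person, gap
        previous = t.at

    return {
        "people": len(people),
        "elapsed_hours": elapsed.total_seconds() / 3600,
        # How many fresher recommendations the model produced in the meantime.
        "superseded_recommendations": int(elapsed / cadence),
        "slowest_handoff": slowest,
        "stale": len(people) > max_people
                 and elapsed > timedelta(hours=max_hours),
    }


# Hypothetical trace for a single campaign.
trace = time_to_action(
    recommended_at=datetime(2026, 5, 4, 9),
    acted_at=datetime(2026, 5, 8, 17),
    touches=[
        Touch("brand", datetime(2026, 5, 5, 9)),
        Touch("legal", datetime(2026, 5, 6, 9)),
        Touch("regional", datetime(2026, 5, 7, 15)),
        Touch("global media", datetime(2026, 5, 8, 17)),
    ],
)
print(trace)
```

The superseded count is the number to put in front of leadership: it says how many newer recommendations were ignored while the old one was still in review.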

What the challenger brands actually have

Orangetheory has 55 studios. They do not have an AI strategy. They have an AI in production. The difference is not branding. It is operational: the path from “model says spend more on Meta in Tampa this week” to “Meta budget shifted in Tampa this week” is short enough that the model’s recommendation is still relevant when the action happens.

Enterprise marketing teams will not get this by buying a better AI. They will get it by deleting approval steps that no longer add value. That is not a vendor decision. It is a leadership decision. The CIO did not install those gates. The CMO inherited them.

The honest version of the Salama quote is this: smaller brands compete with titans because they get to act on the AI’s output. The titans bought the same AI and surrounded it with a process that was designed to manage human campaign managers. The AI is faster than the process. The process wins, every cycle, by design.

Do this now

If you run marketing in a multi-brand or multi-region organization and your AI vendors are reporting impressive pilots while your P&L shows basically nothing, schedule the three diagnostics in the next two weeks.

Run the time-to-action trace on one campaign. Count the people, count the days. Compare to the cadence at which the AI produces new recommendations. If the recommendation arrives stale, the AI is not the problem.

Audit the approval chain by value, not by tradition. Each gate has to demonstrate a decision it changed in the last quarter. Gates that cannot are candidates for removal. This is uncomfortable because it surfaces work that exists to protect against accidents that have not happened in years.

Assign ownership of AI realization to a single person with budget authority. Not the CMO who bought the tool. The operator who runs the marketing P&L. The realization stops being invisible the moment one name is responsible for it.

The challenger brands did not win because their AI was better. They won because their org chart did not eat the value. Enterprise marketing teams have the same AI available. The next step is not another vendor evaluation. It is an honest look at why the capability they already bought cannot produce a return inside the structure they already have.

This analysis synthesizes Why AI Might Do More for Challenger Brands (Mike Shields, Next in Media, May 2026).

Victorino Group helps marketing organizations diagnose where org complexity blocks AI realization and design governance that enables rather than gates. Let’s talk.

OpenAI Shipped Multi-Layer Provenance. The PhotoDNA Precedent Says Verify First.

Thiago Victorino — Wed, 20 May 2026 00:00:00 GMT

This week OpenAI announced that image outputs from ChatGPT, Codex, and the OpenAI API will carry a multi-layer content provenance stack: C2PA cryptographic metadata, SynthID invisible watermarks from Google DeepMind, and a public verifier at openai.com/verify. Sora and Voice Engine already had watermarks. OpenAI joined the C2PA Steering Committee in 2024, and DALL-E 3 was its first product to ship Content Credentials. The new piece is the combination, plus a verifier anyone can open in a browser.

The architecture is right. Multi-layer because no single signal survives every workflow. C2PA metadata is rich but easy to strip on screenshot or re-encode. SynthID is harder to strip but lower bandwidth and probabilistic at the boundary. Together they give you complementary failure modes instead of a single point of trust.

The instinct to publish a verifier is also right. Provenance that only the issuer can check is not provenance; it is a press release. Putting openai.com/verify in front of a public preview is the move that turns this from a feature ship into an audit primitive.

What deserves more attention is the next step: the verification work that begins the moment a provenance system ships. There is a precedent for what happens when an industry treats a content-fingerprinting system as if its claims are self-evidently true. The precedent is PhotoDNA.

The PhotoDNA Precedent

PhotoDNA, built by Microsoft and Hany Farid in 2009, is the hash-based system Google, Facebook, Twitter, and others use to detect known child sexual abuse material at scale. For more than a decade the public-facing claim about PhotoDNA, on Microsoft’s own page, was that “a PhotoDNA hash is not reversible.” That sentence let platform legal teams say the hash database was a one-way artifact, safe to share, safe to query, safe to centralize.

In December 2021, Anish Athalye published Inverting PhotoDNA. His tool, Ribosome, reconstructs thumbnail-quality images from PhotoDNA hashes. The output is grainy and small, but it is recognizable. The hash carries enough structure that a modest neural network, trained on a few hundred thousand hash-image pairs, learns to undo the mapping.

The Athalye result did not collapse PhotoDNA as a system. It did force a reframe. “Not reversible” became “not trivially reversible,” then “reversible to thumbnail quality with available compute,” then “this is now a confidentiality consideration that legal and ops have to design around.” The hash database became something you protect rather than something you publish. The audit posture changed because someone treated the irreversibility claim as a hypothesis instead of a verdict.

That is the precedent. The cost of the verification work was one researcher, one paper, a year of compute access. The cost of not doing it would have compounded for another decade.

Provenance Is an Output-Side Primitive, Not an Input-Side Story

The Victorino writing on AI governance so far has lived mostly on the input side. We have written about why the cognitive dark forest reframes knowledge governance when LLMs train on public text. We have written about training data as the lever Anthropic is using to position itself in the trust market. We have written about the verification debt every AI program carries when it ships output that no human reviewed.

Provenance sits in a different layer. It does not govern what went into the model. It governs what comes out, and what an auditor can prove about that output six months later. Three properties matter for enterprise design:

Provenance is a claim, not a fact. A C2PA manifest says “this artifact was produced by this issuer at this time under these parameters.” It is signed. Signatures verify that the manifest came from the issuer; they do not verify that the manifest’s claims about the artifact are complete. A SynthID watermark is a probabilistic signal that the artifact carries an embedded pattern; the strength of that signal is a property of the encoder, the decoder, and every transformation in between.

Provenance survives only the transformations the designers modeled. C2PA was designed to survive lossy compression and limited cropping. SynthID was designed to survive screenshots and resizing. Adversarial transformations (generative inpainting, style transfer, deliberate adversarial noise) are different categories. The honest framing inside an enterprise: the provenance signal is a Bayesian update on origin, not a binary verdict.

The verifier is part of the trust surface. openai.com/verify is the issuer-operated tool that closes the loop. If the verifier is unavailable, mis-configured, or has its own confidence thresholds tuned without disclosure, the enterprise that depends on it inherits that operational risk. Provenance verification is now a vendor-managed service that your compliance program quietly depends on.
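The "Bayesian update, not binary verdict" framing can be made concrete with a few lines of arithmetic. This is a sketch under stated assumptions: the prior, the detection rate, and the false-positive rate below are invented for illustration and are not published figures for C2PA, SynthID, or any verifier.

```python
def posterior_ai_origin(prior, detect_rate, false_positive_rate, detected):
    """P(artifact came from the issuer | detector result), by Bayes' rule.

    prior               -- belief before checking, between 0 and 1
    detect_rate         -- P(signal found | artifact is from the issuer)
    false_positive_rate -- P(signal found | artifact is not from the issuer)
    detected            -- whether the check found the signal
    """
    if detected:
        like_yes, like_no = detect_rate, false_positive_rate
    else:
        like_yes, like_no = 1 - detect_rate, 1 - false_positive_rate
    evidence = like_yes * prior + like_no * (1 - prior)
    return like_yes * prior / evidence


# Illustrative numbers only.
# A found signal, checked on the original file: belief rises to about 0.97.
print(posterior_ai_origin(0.3, 0.9, 0.01, detected=True))

# A missing signal after a lossy pipeline (detection rate assumed to have
# dropped to 0.5): belief falls only to about 0.18, not to zero.
print(posterior_ai_origin(0.3, 0.5, 0.01, detected=False))
```

The second case is the one that matters for policy: when your own pipeline degrades the signal, "no provenance found" is weak evidence, and a rule that treats it as proof of human origin will be wrong often.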

What Enterprises Should Actually Do This Week

Block thirty minutes with whoever owns AI output governance. Ask four questions.

Which of our AI outputs carry provenance today, and which do not? Sora content, OpenAI image outputs, and ChatGPT image generations now do, on the issuer side. Outputs from other vendors, from in-house models, from fine-tuned variants, and from any post-processing pipeline you run on top of OpenAI artifacts may not. Build the inventory before you build the policy.

What does our provenance chain actually preserve through our own pipeline? Take one production output. Trace it through your storage, your CMS, your CDN, your marketing automation, your analytics tagging. At which point does C2PA metadata get stripped? At which point does SynthID get re-encoded into oblivion? Every transformation is a potential signal-loss boundary. Most enterprises will find that their own infrastructure removes the provenance before the output reaches a downstream consumer.

Who has tried to break it? Treat the OpenAI announcement the way the security industry treated the PhotoDNA “not reversible” claim. The interesting question is not whether the system works as advertised in the demo. The interesting question is what an adversarial researcher with six months and modest compute can demonstrate about its limits. Read the C2PA threat model. Read what is published about SynthID’s robustness against deliberate attacks. If you cannot find independent red-team work yet, plan for it to appear. Plan for what your posture will be when it does.

What is the verification SLA we depend on? If your trust chain assumes openai.com/verify is reachable and accurate, that is now a dependency in your audit story. Document it. Negotiate it. Consider whether parallel verification (running an open verifier where one exists, retaining raw assets, logging hash chains independently) belongs in your architecture.
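The second and fourth questions can share one piece of tooling: push a sample asset through each stage of your own pipeline, record whether the provenance check still passes, and keep an independent hash-chained log of what you saw. The sketch below uses only the Python standard library. The marker-based `has_provenance` check and the stage functions are toy stand-ins, not a real C2PA or SynthID check; in practice you would plug in whatever verifier you actually have access to.

```python
import hashlib
import json
from typing import Callable, Optional

Stage = tuple[str, Callable[[bytes], bytes]]


def trace_pipeline(asset: bytes, stages: list[Stage],
                   has_provenance: Callable[[bytes], bool]) -> list[dict]:
    """Run the asset through each stage and log a hash-chained record."""
    log, prev, current = [], "0" * 64, asset
    for name, transform in [("origin", lambda b: b)] + stages:
        current = transform(current)
        entry = {
            "stage": name,
            "asset_sha256": hashlib.sha256(current).hexdigest(),
            "provenance_present": has_provenance(current),
            "prev_entry_sha256": prev,
        }
        # Each entry commits to the previous one, so edits are detectable.
        prev = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["entry_sha256"] = prev
        log.append(entry)
    return log


def first_loss(log: list[dict]) -> Optional[str]:
    """Name of the first stage after which the signal is gone, if any."""
    for entry in log:
        if not entry["provenance_present"]:
            return entry["stage"]
    return None


def chain_is_intact(log: list[dict]) -> bool:
    """Recompute every link to confirm the log was not altered."""
    prev = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_sha256"}
        if body["prev_entry_sha256"] != prev:
            return False
        prev = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if prev != entry["entry_sha256"]:
            return False
    return True


# Toy stand-ins: a byte marker plays the role of the provenance signal,
# and the "cdn_reencode" stage strips it the way a re-encode might.
MARKER = b"<provenance>"
stages = [
    ("storage", lambda b: b),
    ("cdn_reencode", lambda b: b.replace(MARKER, b"")),
    ("cms", lambda b: b),
]
log = trace_pipeline(MARKER + b"image-bytes", stages, lambda b: MARKER in b)
print(first_loss(log), chain_is_intact(log))
```

The chained log is the part that survives a vendor outage: it records what you observed, when in the pipeline you observed it, and it does not depend on any external verifier being reachable later.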

The Discipline That Compounds

Output provenance is a real primitive. Multi-layer is the correct design choice. Public verification is the correct operational choice. The mistake is not to deploy provenance; the mistake is to treat its arrival as the end of the verification work.

The teams that handled PhotoDNA well between 2009 and 2021 were the ones who kept asking what the system could not do, not the ones who assumed the marketing copy was the threat model. The teams that will handle the OpenAI provenance stack well between 2026 and 2034 will be the ones who ask the same question now, before the inversion paper exists, before the failure modes are documented, before the legal team needs an answer.

The architecture has shipped. The audit has not.

This analysis synthesizes Advancing content provenance with C2PA and SynthID (OpenAI, May 2026) and Inverting PhotoDNA (Anish Athalye, December 2021).

Victorino Group helps enterprises design output-provenance and verification architectures that survive audit. Let’s talk.

Thoughtworks Just Named the Coding-Agent Governance Pattern. Sensors. Read the CI Bill.

Thiago Victorino — Wed, 20 May 2026 00:00:00 GMT

Two pieces landed in the same week of May 2026, written by people who do not appear to have read each other. Birgitta Boeckeler at Thoughtworks published Maintainability Sensors for Coding Agents. CloudBees published AI Is Writing More Code. Your CI Pipeline Can’t Keep Up. One named the architecture. The other quantified what happens when the architecture is missing.

Together they finish a sentence the industry has been mumbling for a year: coding agents do not produce quality by accident, and CI is not where you discover the absence of quality. Quality lives in a layer Boeckeler calls sensors. CI is what you pay when that layer is empty.

If you are running coding agents in production and you have not drawn this layer yet, the rest of your governance stack is decorative.

What Boeckeler Actually Named

The Thoughtworks piece is a case study, not a manifesto. Boeckeler walked through a real project: a TypeScript and NextJS analytics dashboard integrating four external APIs. The interesting move is not the project. It is the explicit inventory of feedback loops the team built so the agent would answer to something other than the developer’s patience.

Eight computational sensors ran during coding. Four more ran on a slower cadence. The CI pipeline replayed all of them on push, plus deeper validation. The sensors were not exotic tools. ESLint for style. Dependency-cruiser for module-coupling rules. Semgrep for security and pattern matching. Custom scripts to flag coupling violations that no off-the-shelf tool catches. Boeckeler cites Vlad Khononov’s Modularity work as the lineage for what counts as a coupling violation worth flagging.

The two examples she gives are worth memorizing because they are the kind of debt coding agents produce by default:

A single new date-range parameter touched more than forty files, because the agent threaded it through every layer instead of consolidating at the boundary.
Three routes ended up with duplicate response-shaping code, because the agent generated each one in isolation without noticing the others.

These are not bugs. They pass tests. They ship features. They are exactly the kind of structural decay that human reviewers catch in pull requests when they have time and miss when they do not. The sensor layer is what catches them when nobody is paying attention.
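Both examples are the kind of thing a short custom script can flag. The sketch below is not Boeckeler's implementation, which the essay does not reproduce here; it is a hypothetical pair of sensors with an invented file limit, invented paths, and a deliberately crude duplicate detector based on hashing runs of whitespace-normalized lines.

```python
import hashlib
import re
from collections import defaultdict


def fan_out_sensor(changed_files: list[str], max_files: int = 10) -> list[str]:
    """Flag a change that touches more files than the (assumed) limit."""
    if len(changed_files) > max_files:
        return [f"change touches {len(changed_files)} files "
                f"(limit {max_files}); consider consolidating at a boundary"]
    return []


def duplicate_block_sensor(sources: dict[str, str], window: int = 5) -> list[str]:
    """Flag identical runs of `window` normalized lines shared across files."""
    seen = defaultdict(set)
    for path, text in sources.items():
        lines = [re.sub(r"\s+", " ", ln).strip() for ln in text.splitlines()]
        lines = [ln for ln in lines if ln]
        for i in range(len(lines) - window + 1):
            block = "\n".join(lines[i:i + window])
            seen[hashlib.sha256(block.encode()).hexdigest()].add(path)
    groups = {tuple(sorted(paths)) for paths in seen.values() if len(paths) > 1}
    return [f"duplicate {window}-line block in: {', '.join(g)}"
            for g in sorted(groups)]


# Hypothetical inputs mirroring the two examples above.
shared = """const rows = data.items.map(toRow);
const total = rows.length;
const page = rows.slice(offset, offset + limit);
const body = { rows: page, total };
return Response.json(body);"""
routes = {f"app/api/{name}/route.ts": f"// {name} handler\n{shared}"
          for name in ("visits", "sales", "errors")}

print(fan_out_sensor([f"src/file{i}.ts" for i in range(41)]))
print(duplicate_block_sensor(routes))
```

The output is plain text on purpose: the agent reads it the same way it reads a linter message, and the wording can tell it what to do next.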

The pattern Boeckeler named has three properties worth lifting:

Automated. No human in the loop for the first response. The sensor fires, the agent reads the output, the agent corrects.
Layered. Cheap sensors run constantly. Expensive sensors run on commit. Slowest sensors run in CI. Different cost, different cadence, same scoreboard.
Authored. Some sensors are off-the-shelf. The valuable ones are custom, because they encode the architecture you actually care about, which is exactly the thing no vendor ships.
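The layering is easy to express as a small runner. This is an illustrative sketch with hypothetical names, not tooling from the Thoughtworks piece: each sensor declares its tier, cheaper tiers run first, and the runner stops at the first tier that reports findings so the agent can correct before slower checks are paid for. The stub checks stand in for real tools such as a linter or a coupling script.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable


class Tier(IntEnum):
    EDIT = 1     # cheap, runs constantly while the agent codes
    COMMIT = 2   # more expensive, runs on commit
    CI = 3       # slowest, runs in the pipeline


@dataclass
class Sensor:
    name: str
    tier: Tier
    check: Callable[[], list[str]]   # returns findings; empty list means pass


def run_sensors(sensors: list[Sensor], up_to: Tier) -> dict[str, list[str]]:
    """Run sensors tier by tier, cheapest first, up to the given tier.

    Stops after the first tier that produces findings, so the agent gets
    cheap feedback before any slower tier runs.
    """
    findings: dict[str, list[str]] = {}
    for tier in Tier:
        if tier > up_to:
            break
        for sensor in sensors:
            if sensor.tier == tier:
                result = sensor.check()
                if result:
                    findings[sensor.name] = result
        if findings:
            break
    return findings


# Stub sensors; a real check would shell out to the actual tool.
calls = []
sensors = [
    Sensor("lint", Tier.EDIT, lambda: calls.append("lint") or ["unused import"]),
    Sensor("coupling", Tier.COMMIT, lambda: calls.append("coupling") or []),
    Sensor("integration-tests", Tier.CI, lambda: calls.append("ci") or []),
]
print(run_sensors(sensors, up_to=Tier.CI))
print(calls)
```

Running with `up_to=Tier.CI` replays every tier in order, which matches the idea of CI re-running the cheaper sensors before its own deeper validation; the early exit is a design choice of this sketch, and a pipeline that wants a full report can drop it.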

The word matters. We have written about review governance, self-improving agents, and budget approval workflows as separate threads. Sensors is the noun that ties them together. It is a Thoughtworks coinage and the lineage matters: the term comes from inside the firm that has shipped more enterprise refactoring projects than any consultancy on Earth. This is not theory imported from somewhere; it is the firm’s working vocabulary for a problem it has been paid to solve at scale.

What CloudBees Quantified

CloudBees is a vendor selling Smart Tests, so read their numbers with the seller’s discount. Even discounted, the shape of the data lines up too cleanly with the sensor argument to ignore.

The CloudBees post reports that daily AI-coding-tool users ship about sixty-five percent more pull requests than non-users. About one-third of CI failures in their customer base are flaky: no underlying change, just retry until green. A customer case they cite reduced regression test time by up to eighty percent, and brought pre-commit time from six hours down to two. The headline number, on their own scenario math: an estimated quarter of a million dollars per year in CI compute waste, for a fifty-engineer team.

These numbers are vendor-attributed. The mechanism behind them is not. If your agents produce sixty-five percent more pull requests and your sensor layer is the CI pipeline, then CI is now the bottleneck, the cost center, and the de facto quality wall. None of those three things is what CI was designed for.

The CloudBees framing, stripped of the product pitch: CI was the implicit governance layer when humans wrote the code. Humans pre-filtered before pushing. Coding agents do not. They push everything to CI and let the pipeline tell them what is wrong. The agent’s economics work; the pipeline’s do not.

The sensor layer fixes the economics. The agent gets feedback locally, on the cheapest sensor that catches the issue. CI runs the expensive verification on code that already passed the cheap ones. Pre-commit drops because the slow tests stop being the first line of defense.
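A minimal sketch of that ordering, under stated assumptions: the `Sensor` type, the sensor names, and the cost figures are all hypothetical, and the checks are toy lambdas standing in for a real linter, type checker, and CI suite. It shows only the mechanism: run the cheapest sensor first and stop paying once something has caught the issue.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Sensor:
    name: str
    cost_seconds: float                    # assumed rough cost of one run
    check: Callable[[str], Optional[str]]  # returns a finding, or None if clean


def first_finding(
    sensors: List[Sensor], change: str
) -> Tuple[Optional[str], Optional[str], float]:
    """Run sensors cheapest-first; stop at the first one that reports a finding."""
    spent = 0.0
    for sensor in sorted(sensors, key=lambda s: s.cost_seconds):
        spent += sensor.cost_seconds
        finding = sensor.check(change)
        if finding is not None:
            return sensor.name, finding, spent
    return None, None, spent


# Toy sensors (hypothetical): a slow CI suite, a linter, and a type checker.
SENSORS = [
    Sensor("ci-regression-suite", 600.0, lambda c: "regression" if "TODO" in c else None),
    Sensor("lint", 1.0, lambda c: "debugger call left in" if "breakpoint()" in c else None),
    Sensor("type-check", 20.0, lambda c: None),
]

caught_early = first_finding(SENSORS, "x = 1\nbreakpoint()\n")
clean = first_finding(SENSORS, "x = 1\n")
```

In this toy run the change with a stray `breakpoint()` costs one second of sensor time instead of a full pipeline run, while a clean change still pays for every tier.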

Two Pieces, One Argument

Read the Thoughtworks essay alone and the sensor layer sounds like a craft practice. Read the CloudBees post alone and the CI overrun sounds like a tooling problem the vendor will sell you out of. Read them together and the argument is sharper.

Sensors are the discipline. CI is the unpaid invoice when the discipline is absent.

The engineering implication is structural. If you are scaling coding agents and your only feedback machinery is the CI pipeline, you have outsourced your architecture review to a queue. The queue is slow, the queue is expensive, and the queue does not catch coupling violations because coupling violations pass the tests. The agent ships forty-file diffs and three duplicate route handlers and the pipeline says green. You discover the debt three months later when a feature change touches sixty files instead of six.

The leadership implication is financial. Even as a vendor estimate, a quarter of a million dollars a year in CI compute waste on a fifty-engineer team is large enough to earn a budget line, and it is only the visible portion of the bill. The invisible portion is the structural debt the pipeline did not catch because no sensor for it existed. That debt shows up on the velocity chart six months later as “the codebase got harder to change.” Nobody attributes it to the absence of a coupling sensor in February. The line item does not exist.

Sensor architecture is the line item that prevents the line item that does not exist.

What to Build, Concretely

Boeckeler’s project list is a working starter kit. You should expect to take three weeks to inventory and stand up the first cut.

Inventory the sensors you already run. Most teams have ESLint, Prettier, a type checker, unit tests, and integration tests. List them. Mark which run pre-commit, which run on push, which run in CI. You almost certainly do not have a coupling sensor. You almost certainly do not have a custom Semgrep rule for the architecture your team actually decided on three years ago.

Add the layer the agent will answer to first. A dependency-cruiser config that fails when a new file imports across an architectural boundary is a one-day project and catches the forty-file diff problem Boeckeler described. The agent will hit it and rewrite. You do not have to teach the agent the architecture; you have to give the agent a sensor that pings when the architecture is violated.

Add a coupling sensor for your top three pain points. What three things does your senior engineer flag in every code review? Duplicate response shapes? Stringly-typed IDs that should be branded types? Direct database access from controllers? Write a Semgrep rule for each. Run it on commit. The sensors are now teaching the agent what your senior engineer would have said.

Re-tier your CI. With local sensors firing, CI no longer needs to be the first wall. Move the cheapest sensors out of CI and into pre-commit. Cut the CI run by whatever percentage of its current duration was wasted catching things you can now catch locally. The CloudBees scenario suggests fifty percent is achievable. Even a quarter of that is real money.

Audit the agent’s feedback diet. What does your coding agent currently see when it makes a mistake? If the answer is “the test output if it remembers to run them,” that is the first thing to fix. The sensor outputs need to be readable by the agent as structured feedback, not buried in a terminal scroll.
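As an illustration of the last two ideas together, here is a hedged sketch of a boundary sensor that reports in a form an agent can parse. It is a Python stand-in built on the standard `ast` module, not dependency-cruiser or Semgrep configuration. The layer names, the `FORBIDDEN` map, the `Finding` fields, and the fix hint are all assumptions made for the example.

```python
import ast
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Set

# Hypothetical layering rules: each layer maps to top-level packages it must not import.
FORBIDDEN: Dict[str, Set[str]] = {"domain": {"web", "db"}, "web": {"db"}}


@dataclass
class Finding:
    sensor: str
    file: str
    line: int
    message: str
    fix_hint: str


def boundary_findings(layer: str, file: str, source: str) -> List[Finding]:
    """Flag imports that cross a forbidden architectural boundary."""
    banned = FORBIDDEN.get(layer, set())
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            if module.split(".")[0] in banned:
                findings.append(Finding(
                    sensor="import-boundary",
                    file=file,
                    line=node.lineno,
                    message=f"{layer} layer imports {module}",
                    fix_hint=f"route the call through an interface owned by {layer}",
                ))
    return sorted(findings, key=lambda f: f.line)


def to_agent_feedback(findings: List[Finding]) -> str:
    """Serialize findings as JSON so the agent reads structure, not terminal scroll."""
    return json.dumps({
        "status": "fail" if findings else "pass",
        "findings": [asdict(f) for f in findings],
    }, indent=2)


SOURCE = "import json\nfrom db.models import User\n"
feedback = json.loads(to_agent_feedback(boundary_findings("web", "web/routes.py", SOURCE)))
```

The design choice worth copying is the output shape rather than the check itself: a status, a location, and a fix hint give the agent something to act on without being taught the architecture up front.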

Do This Now

Block four hours this week. Take the diagram of your current CI pipeline. Add one column to the left labeled “sensors that run before CI.” If the column is mostly empty, you have found the architecture work. Print the Boeckeler piece and read it with your platform lead. Print the CloudBees post and read it with whoever owns the CI budget. They are reading the same problem from opposite ends.

The teams that scale coding agents in 2026 will not be the teams with the most autonomous agents. They will be the teams whose agents answer to the most sensors before the pipeline has a chance to fail.

This analysis synthesizes Maintainability Sensors for Coding Agents (Thoughtworks, May 2026) and AI Is Writing More Code. Your CI Pipeline Can’t Keep Up. (CloudBees, May 2026).

Victorino Group helps engineering organizations design the sensor architecture and CI economics for governed AI development. Let’s talk.

Software's Centaur Era Just Started. Measure the Team, Not the Model.

Thiago Victorino — Wed, 20 May 2026 00:00:00 GMT

In May 1997, Deep Blue beat Garry Kasparov. The headline was that machines had passed humans at chess. The longer story, the one that ran for the next two decades, was different. For roughly twenty years after the match, the strongest entity at the board was neither a human nor an engine. It was a human paired with an engine: a centaur. The pair beat the engine alone and crushed the human alone. That era ended only recently, when engines finally outgrew even guided play.

Richard Marmorstein’s essay Software’s Centaur Era argues, persuasively, that software just entered the same window. Coding agents today cannot sustain long-horizon work without a human at the wheel. Left alone, they drift, hallucinate context, and produce code that compiles but does not belong in the system it was meant to serve. Steered, they move faster than either party could alone. We are in the centaur years, and the centaur years tend to last longer than anyone expects.

If that framing is correct, and we think it is, the implication for governance is the part nobody is talking about loudly enough. The measurement question stops being about the model. It becomes about the pair.

What “centaur” means as a unit of work

The chess analogy was never about chess. It was about a class of problems where the engine has tactical depth the human lacks and the human has long-horizon judgment the engine lacks. Software, today, fits that shape almost exactly. An agent can churn through a thousand candidate refactors, hold the syntax tree in working memory, and write the bash invocation faster than you can spell it. It cannot, reliably, decide which of those refactors matters next quarter. It does not know which abstraction your team will regret in six months. It cannot tell you when to stop.

The human in the centaur supplies exactly those things: the stopping rule, the architectural taste, the institutional memory, the relationship with the person who will own the code at 3am. The agent supplies throughput and recall. Either one alone is a worse engineer than the pair.

This sounds like a feel-good framing until you try to measure it. The moment you ask “how productive is the agent,” you are asking the wrong question, because the agent is not the unit of production. The pair is. A measurement architecture that tracks agent output without tracking human steering is measuring half a centaur and calling it a horse.

The bar for energy-saving tools is higher than for time-saving tools

Marmorstein puts a finer point on this with a constraint that deserves to be quoted everywhere governance teams gather. The bar for a tool that saves you energy is higher than the bar for a tool that saves you time.

A time-saver only has to be net-faster than the alternative. You tolerate friction because the wall clock won. An energy-saver has to feel like the human is doing less cognitive work, not more, after the tool enters the loop. Most coding agents today save time and burn energy. The developer babysits the output, re-reads the diff, runs the tests, holds the architectural picture in their head because the agent does not, and finishes the day more tired than they started. The hours look good on the report. The human looks ground down by Friday.

This is why “productivity gain” measured in time-to-merge is misleading. If your agent shaves 30% off cycle time but the developer is now doing the mental work of two people, the centaur is broken. The pair is not faster in any sense that compounds. It is faster in a sense that erodes. By the end of the quarter, your team’s best engineers are the ones quietly turning off the agents, because for them the centaur math went negative two months ago and nobody was measuring the right axis.

The governance implication: any agent rollout that does not instrument human cognitive load alongside agent throughput is flying blind on the variable that determines whether the pair is sustainable.

Why “control the AI” is the wrong frame

Most current governance literature treats the agent as the thing to constrain. Guardrails, sandboxes, identity floors, permission models. Necessary, all of them. Sufficient, none of them. They answer the question “what can the agent not do.” They do not answer the question “is the pair working.”

You can have a perfectly contained agent operating inside a perfectly safe environment and still have a broken team. The agent does not break the production database. The human burns out by month three because the pair was never sized correctly: too many agent threads per human, no clear stopping rule, no architecture for handing context back to the operator, no measurement of when the operator is overloaded.

The control conversation is mature. The measurement conversation is barely started. We have written about adoption gaps, where the question is whether organizations are using AI at all (see AI Eats the World 2026). We have written about On the Loop, Not In the Loop, where the question is what role the human should occupy in agent operations. Those framings stand. The centaur framing builds on top of them: once you have decided humans are on the loop, you still have to decide whether the loop, as a pair, is producing more than the sum of its parts. That requires measuring the pair, not the parts.

What measuring the team looks like

Concretely, a centaur-aware measurement architecture has three layers.

The first layer is agent throughput, which most teams already track: tasks completed, PRs raised, tests authored, lines of code generated. This is the visible half of the pair. It is necessary and insufficient.

The second layer is human cognitive load. This is the layer that almost no production deployment instruments today. Useful signals: time spent reviewing agent output versus producing it, frequency of context switches per hour, ratio of agent-initiated changes to human-initiated changes, self-reported energy at end of week. The goal is not to surveil. The goal is to know when the centaur is asking too much of its human half, so you can fix it before the human quietly opts out.

The third layer is pair output, which is what the business actually cares about. Did the work product improve? Did defects go down? Did time-to-value shrink at constant or lower human energy cost? This is where the time-saver-versus-energy-saver distinction lives. A pair that ships faster but exhausts its human is a pair that will dissolve. A pair that ships faster while preserving energy is a pair that compounds.

A team measured only on layer one will optimize for agent activity. A team measured only on layer three will not know which lever to pull when things go wrong. The three layers together let you ask the right diagnostic question: which half of the centaur is the bottleneck this week, and what do we change to rebalance?
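One way to make that diagnostic question concrete is sketched below. Everything in it is an assumption for illustration: the `WeekMetrics` fields are one possible reading of the three layers, the thresholds are placeholders to be calibrated against a team's own baseline, and the rule that blames the agent half when throughput falls or defects rise is a crude heuristic, not a method from Marmorstein's essay.

```python
from dataclasses import dataclass


@dataclass
class WeekMetrics:
    agent_prs: int          # layer 1: agent throughput
    review_hours: float     # layer 2: time spent reviewing agent output
    authoring_hours: float  # layer 2: time spent producing work directly
    energy: float           # layer 2: self-reported end-of-week energy, 1 (drained) to 5
    escaped_defects: int    # layer 3: pair output quality


# Hypothetical thresholds; calibrate against your own baseline.
MAX_REVIEW_SHARE = 0.6
MIN_ENERGY = 3.0


def bottleneck(current: WeekMetrics, previous: WeekMetrics) -> str:
    """Return which half of the pair looks like this week's constraint."""
    total = current.review_hours + current.authoring_hours
    review_share = current.review_hours / total if total else 0.0
    # Check the human half first: overload there makes the other numbers unreliable.
    if review_share > MAX_REVIEW_SHARE or current.energy < MIN_ENERGY:
        return "human"
    if (current.agent_prs < previous.agent_prs
            or current.escaped_defects > previous.escaped_defects):
        return "agent"
    return "balanced"


last_week = WeekMetrics(agent_prs=40, review_hours=10, authoring_hours=20, energy=4.0, escaped_defects=2)
overloaded = WeekMetrics(agent_prs=55, review_hours=28, authoring_hours=8, energy=2.5, escaped_defects=2)
stalled = WeekMetrics(agent_prs=25, review_hours=10, authoring_hours=20, energy=4.0, escaped_defects=2)
steady = WeekMetrics(agent_prs=42, review_hours=12, authoring_hours=20, energy=4.0, escaped_defects=1)
```

Note that the overloaded week has the highest agent throughput of the three. A layer-one dashboard alone would call it the best week.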

What the centaur era is not

Two things this framing does not promise.

It does not promise that the era lasts forever. Chess engines eventually outgrew guided play. Coding agents probably will too, on some workloads, on some horizon. The honest position is that nobody knows how long the window stays open. Twenty years would not be surprising. Five would not be surprising either. The right posture is to build for the centaur years while watching for the signal that they are ending.

It does not promise the centaur is always the right answer. There are tasks where pure human work is faster, and there are tasks where pure agent work is good enough. The centaur is the right unit for the long-horizon, judgment-heavy, taste-laden work that defines most production software engineering. It is not the right unit for one-off scripts or for high-volume low-stakes generation where review overhead exceeds the work itself.

The centaur framing is a default, not a universal. The work is figuring out where it applies and instrumenting it when it does.

Do this now

Pick one team that is running coding agents in production. Spend 45 minutes with them. Ask three questions: how do we measure agent throughput today, do we measure human cognitive load at all, and what does the pair produce that neither half would produce alone? If you cannot answer the second question with anything more specific than “they say it feels okay,” you are operating a centaur without a dashboard for half of it. Build the missing half this quarter, before your best engineers quietly decide the math no longer works.

The centaur years are good years. They reward teams that take the pair seriously as the unit of work. They punish teams that keep measuring the model and ignoring the rider.

This analysis synthesizes Software’s Centaur Era (Richard Marmorstein, May 2026).

Victorino Group helps teams design measurement architectures for human-plus-AI work where both halves of the centaur count. Let’s talk.

Hooks Block, Evals Verify: The Deterministic Shell Around Probabilistic Agents

Thiago Victorino — Tue, 19 May 2026 00:00:00 GMT

Two practitioners published in the same week, on opposite ends of the agent lifecycle, and described the same governance thesis without coordinating. Nader Dabit wrote about agent hooks: deterministic interception at six named lifecycle events, before and after every tool call, before and after every session. Cameron Wolfe, Staff Research Scientist at Netflix, published a long survey on agent evaluation centered on a metric called Pass^K, which measures consistency across all K independent attempts at the same task.

Hooks run before the action. Evals score after the action. Both refuse to trust the stochastic middle. Read together, they answer the same question from opposite directions: how do you build something deterministic around a model that, by definition, is not.

We have covered the containment stack as architecture, the vendor convergence that turned containment into a purchasable category, and the operational stack already shipping in production. The architectural picture is drawn. What this week added is a set of named primitives that practitioners will ship in code in 2026. Six events. One metric. That is the deterministic shell.

The Six Events That Bound an Agent Session

Dabit’s framing is mechanical and worth memorizing. Hooks fire at six lifecycle events, each one a place where deterministic policy can intercept what the model would otherwise do on its own:

SessionStart. Inject context, load policy, set environment variables before the first prompt is processed.
UserPromptSubmit. Validate or rewrite the prompt before it reaches the model.
PreToolUse. Block, modify, or approve a tool call before it executes.
PostToolUse. Inspect or act on tool output before it returns to the model.
Stop. Run completion gates before the agent declares the task done.
SessionEnd. Cleanup, persistence, audit log emission.

The pattern is always the same shape: event → matcher → handler → outcome. The matcher decides whether the hook applies to this specific call. The handler is deterministic code. The outcome is allow, block, or modify. No stochasticity inside the handler. That is the entire point.
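The event, matcher, handler, outcome shape can be sketched in a few lines. This is not the API of any particular agent framework: `ToolCall`, `Hook`, `dispatch`, and the tool names are hypothetical, the regular expressions are deliberately naive, and the modify outcome is left out to keep the example short.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToolCall:
    tool: str      # hypothetical tool names: "shell", "sql", "edit"
    argument: str  # command text, SQL text, or file path


@dataclass
class Hook:
    event: str
    matcher: Callable[[ToolCall], bool]  # does this hook apply to the call?
    handler: Callable[[ToolCall], str]   # deterministic: returns "allow" or "block"


# Naive patterns: they miss variants such as "rm -fr /" and are illustrative only.
DESTRUCTIVE_SHELL = re.compile(r"\brm\s+-rf\s+/(\s|$)")
DROP_TABLE = re.compile(r"\bdrop\s+table\b", re.IGNORECASE)


def protected_path(path: str) -> bool:
    parts = path.replace("\\", "/").split("/")
    return ".git" in parts or parts[-1] == ".env"


HOOKS: List[Hook] = [
    Hook("PreToolUse", lambda c: c.tool == "shell",
         lambda c: "block" if DESTRUCTIVE_SHELL.search(c.argument) else "allow"),
    Hook("PreToolUse", lambda c: c.tool == "sql",
         lambda c: "block" if DROP_TABLE.search(c.argument) else "allow"),
    Hook("PreToolUse", lambda c: c.tool == "edit",
         lambda c: "block" if protected_path(c.argument) else "allow"),
]


def dispatch(event: str, call: ToolCall, hooks: List[Hook]) -> str:
    """Any matching hook that returns "block" wins; otherwise the call proceeds."""
    for hook in hooks:
        if hook.event == event and hook.matcher(call) and hook.handler(call) == "block":
            return "block"
    return "allow"
```

A call such as `dispatch("PreToolUse", ToolCall("shell", "rm -rf /"), HOOKS)` returns `"block"` on every run, because nothing in the path from event to outcome samples from a model.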

The examples Dabit gives are not theoretical. PreToolUse hooks that block edits to .env and .git. PreToolUse hooks that scan for rm -rf / or DROP TABLE before letting a shell or SQL call proceed. PostToolUse hooks that run the test suite after a file edit and roll back if it fails. Stop hooks that read a persisted .hook-state JSON file and refuse to declare completion until every required gate has fired. This is the same kind of policy enforcement an SRE writes for a deployment pipeline, except now the pipeline is an agent and the trigger is a tool call.
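A Stop gate of that kind might look like the sketch below. The gate names are hypothetical, and the function takes the state as a JSON string rather than reading a `.hook-state` file, so the example stays self-contained. How a real agent runtime delivers the state and consumes the verdict is not shown here.

```python
import json
from typing import List, Sequence, Tuple

# Hypothetical gate names; the real list is whatever "done" means for your agent.
REQUIRED_GATES = ("tests_passed", "lint_passed", "diff_reviewed")


def stop_hook(state_json: str, required: Sequence[str] = REQUIRED_GATES) -> Tuple[str, List[str]]:
    """Refuse completion until every required gate is recorded as true."""
    state = json.loads(state_json)
    # A gate that is absent, false, or any non-boolean value counts as not fired.
    missing = [gate for gate in required if state.get(gate) is not True]
    return ("block" if missing else "allow", missing)


premature = stop_hook('{"tests_passed": false, "lint_passed": true}')
finished = stop_hook('{"tests_passed": true, "lint_passed": true, "diff_reviewed": true}')
```

Returning the list of missing gates alongside the verdict gives the agent a concrete next step instead of a bare rejection.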

Why Hooks Beat “Better Prompts”

The temptation, when an agent does something dangerous, is to harden the system prompt. Add a paragraph about not deleting files. Add another paragraph about respecting working directories. Add a third paragraph reminding the agent to ask before destructive actions. After three or four iterations the system prompt is two thousand tokens of negative instruction, and the agent still occasionally runs rm -rf because that is what the next-token distribution suggested.

Prompts are probabilistic. Hooks are deterministic. The difference is not cosmetic. When you write a PreToolUse hook that pattern-matches rm -rf / and returns block, the agent cannot execute that command. Not “is less likely to.” Cannot. The hook is code, not persuasion.

This is the same lesson the security industry learned about input validation in the 2000s. You do not ask the user politely not to send SQL injection. You parse and sanitize at the boundary, deterministically, every time. Hooks are input validation for tool calls. The agent is the user. The tool is the database. The hook is the parser.

Pass^K, and Why It Is Stricter Than You Think

Wolfe’s piece reframes the eval question. Most teams have spent the last two years measuring agent quality with Pass@K: did at least one of K attempts succeed? That metric flatters models. An agent that succeeds 1 time in 5, with 4 catastrophic failures, scores the same on Pass@5 as one that succeeds every time. In production, the first agent is unusable. Pass@K cannot see the difference.

Pass^K measures the opposite. Did all K independent attempts succeed? It is the consistency metric, not the capability metric. Pass^K is what you care about when the agent is going to run in a loop, on a customer’s data, without a human watching each attempt. One failure in five is not a 20% problem. It is the only outcome you ever see in the incident postmortem.
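The difference is easy to see on recorded results. The sketch below uses the plain empirical form of both metrics, with exactly K attempts per task; the task names and outcomes are made up for illustration, and benchmarks may compute these scores with estimators over more samples, which is not shown here.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Share of tasks where at least one of the K attempts succeeded."""
    return sum(any(attempts) for attempts in results.values()) / len(results)


def pass_hat_k(results: Dict[str, List[bool]]) -> float:
    """Share of tasks where all K attempts succeeded (Pass^K)."""
    return sum(all(attempts) for attempts in results.values()) / len(results)


# Hypothetical recorded outcomes, K = 5 attempts per task.
RESULTS = {
    "flaky-agent-task": [True, False, False, False, False],
    "steady-agent-task": [True, True, True, True, True],
}

capability = pass_at_k(RESULTS)    # 1.0: both tasks succeeded at least once
consistency = pass_hat_k(RESULTS)  # 0.5: only one task succeeded every time
```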

The numbers Wolfe cites land hard. Terminal-Bench 2.0 distilled 89 production-grade tasks from 229 contributions, and GPT-5.2, the strongest model evaluated, hits 62.9% resolution. That is on Pass@1 with a single attempt. Tau^2-bench’s telecom domain is harsher: o4-mini scores 26% Pass^4. Run an o4-mini agent four times on each telecom task and only about one task in four succeeds on all four runs. The other three fail at least once, which is exactly the inconsistency a customer would notice.

Pass^K is not a hostile metric. It is the metric your customer is implicitly using. They run your agent on Tuesday and it works. They run it on Wednesday on the same input and it fails. Pass@1 says you have a 50% agent. Pass^2 says you have a 0% agent. Your customer agrees with Pass^2.

The Shell Has Two Walls

Stack Dabit’s six events on the inbound side and Wolfe’s Pass^K on the outbound side and the architecture is symmetrical. Hooks decide what gets in. Evals decide whether the output, run K times, is consistent enough to trust. The probabilistic core sits in the middle, doing what models do, with deterministic walls on both sides.

Side	Primitive	Question it answers
Inbound	Six lifecycle hooks (Dabit)	What is the agent allowed to do?
Outbound	Pass^K (Wolfe)	Does the agent do the same thing every time?

What both sides refuse to do is trust the model alone. The hook author does not believe the agent will avoid .env even with a perfect prompt. The eval author does not believe a single passing run says anything. Both authors have moved the trust boundary out of the model and into the surrounding code.

This is the same shift that happened to web applications when they stopped trusting client-side validation. Server-side validation is the deterministic wall. Hooks and Pass^K evals are the agent-era equivalent. The model is the client. The platform team writes the server.

The 65% Rule, Updated

We have argued before that production agentic systems settle at roughly 65% AI code and 35% deterministic scaffolding. Hooks and Pass^K evals are how the 35% gets specified. The 35% is not “extra plumbing.” It is the part of the system that the customer is paying for the reliability of. The 65% is the part that does the work. The 35% is the part that ensures the work was done correctly, every time, without leaking secrets, without touching files it should not, and without diverging across runs.

Teams that try to ship at 95% agent code and 5% scaffolding are not shipping a better agent. They are shipping an agent without the deterministic shell, and the customer will discover this on the day the agent does something that the prompt was supposed to prevent. Pass^K will say 12%. The incident review will say “we needed hooks.”

What to Do This Week

Pick one production agent. Just one. Walk it through three diagnostics:

Hook inventory. Write down every PreToolUse and PostToolUse hook you actually have in the system. If the list is empty, your agent is operating without an inbound wall. Pick the two most dangerous tool calls (file writes, shell, SQL) and write a PreToolUse hook that blocks the obvious destructive patterns. Block rm -rf /, DROP TABLE, edits to .env, edits to .git. That is one afternoon of work and it removes a class of incident.

Stop-gate state. Decide what “done” means for this agent, write it as a JSON state, and write a Stop hook that refuses to declare completion until every required field is satisfied. If the agent says “task complete” without running the test suite, the Stop hook should reject the completion claim and force another iteration.

Pass^K measurement. Take the ten tasks your agent runs most often in production. Run each one four times. Count the tasks where all four runs succeed and produce the same result, then divide by ten. That share is your Pass^4. If it is below 50%, your customers are seeing non-determinism that will eventually become an incident. Tighten the prompts, tighten the hooks, or constrain the tool surface until Pass^4 comes up.
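The Pass^4 diagnostic can be run with a small harness like the one sketched here. The `Run` signature, the task names, and the stub tasks are assumptions: each task is modeled as a callable returning whether it succeeded plus a normalized output string. Requiring identical output is stricter than the benchmark definition of Pass^K, which only requires every attempt to succeed.

```python
from typing import Callable, Dict, Tuple

Run = Callable[[], Tuple[bool, str]]  # returns (succeeded, normalized output)


def measure_pass_hat_k(tasks: Dict[str, Run], k: int = 4) -> float:
    """Share of tasks whose k runs all succeed and all produce the same output."""
    consistent = 0
    for run in tasks.values():
        outcomes = [run() for _ in range(k)]
        all_succeeded = all(ok for ok, _ in outcomes)
        identical = len({output for _, output in outcomes}) == 1
        if all_succeeded and identical:
            consistent += 1
    return consistent / len(tasks)


def make_flaky_task() -> Run:
    """Stub that fails on its third call, standing in for a non-deterministic agent."""
    calls = {"n": 0}

    def run() -> Tuple[bool, str]:
        calls["n"] += 1
        return (calls["n"] != 3, "plan changed")

    return run


# Hypothetical tasks; in practice each callable would invoke the real agent.
TASKS: Dict[str, Run] = {
    "issue-refund": lambda: (True, "refund issued"),
    "change-plan": make_flaky_task(),
}

score = measure_pass_hat_k(TASKS, k=4)
```

With real agents the hard part is the normalization step: outputs need to be reduced to the fields that matter before comparison, or harmless wording differences will register as failures.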

Hooks and evals are not the glamorous part of building agents. They are the part that decides whether the agent is something a serious company can put in front of a customer. Dabit gave us the six events. Wolfe gave us the metric. The deterministic shell is now a buildable specification, not a research direction. Build it this week.

This analysis synthesizes Agent Hooks: Deterministic Control for Agent Workflows (Nader’s Thoughts, May 2026) and Agent Evaluation: A Detailed Guide (Cameron R. Wolfe, May 2026).

Victorino Group helps engineering leaders build deterministic shells around probabilistic agents. Let’s talk.

$700 Billion in AI Capex. Adoption Wide but Shallow. The Bottleneck Moved.

Thiago Victorino — Tue, 19 May 2026 00:00:00 GMT

Benedict Evans published his Spring 2026 AI deck this week, and the headline number does most of the work. Per the deck summary, big tech is on track for roughly $700 billion in AI capital expenditures in 2026. The framing inside boardrooms, also per the deck, is that “under-investing is seen as the bigger risk.” That sentence is the consensus position of an entire industry. It is also the most important thing to read carefully.

Because the deck goes on to say two more things that, taken together, change what the $700 billion actually means.

First, foundation models are commoditizing fast. Second, value is shifting up the stack to applications, agents, and workflows. And the most quoted line, repeated by the TLDR newsletter summary of the deck: “Adoption is wide, but shallow. Deep integration is still rare outside of tech and finance.”

Read those three statements as one paragraph. The substrate is getting cheaper. The value is moving to integration. And integration is not happening. That is not a bullish picture of capex efficiency. It is a picture of capacity outrunning the organizations that are supposed to absorb it.

What “Wide but Shallow” Actually Describes

Wide adoption is easy to measure. License counts. Seat counts. Pilot counts. By those numbers, AI adoption is essentially universal at the top of the enterprise market. Every Fortune 500 has Copilot or ChatGPT Enterprise or Claude for Work somewhere. Every consulting firm has decks with adoption percentages above 80%.

Deep integration is harder to measure, which is why people stop trying. Deep integration means the work has been redesigned around the tool. The org chart has shifted. The review process has changed. The success metrics on someone’s quarterly scorecard are now downstream of an agent decision. The legal contract template references AI-generated content as a recognized category, not as an exception. Auditors know what to ask for.

Almost none of that has happened. The deck’s claim that deep integration is “still rare outside of tech and finance” is the polite version. The honest version is that most enterprises have AI in their hands and have not yet figured out how to put it into their bones.

This matters because the value Evans says is shifting up the stack, into applications and workflows, can only be captured by organizations that have done the deep work. If the foundation model layer commoditizes, the differentiated returns live in the integration layer. And the integration layer is empty for most of the market.

That is the macro picture in one sentence: substrate capacity outran organizational capacity to integrate it deeply. The capex is real. The returns require a second build that almost nobody has started.

Why the $700B Is Not the Problem

There is a temptation, reading these numbers, to call the $700B a bubble. That is the wrong frame.

The capital expenditure is going somewhere productive at the infrastructure layer. Data centers get built. Power contracts get signed. Chips ship. Energy capacity comes online. The substrate is being laid down. In ten years it will be useful regardless of which specific model providers survive the commoditization.

The problem is not the spend. The problem is the mismatch between spend velocity and absorption velocity. The capex curve is steep. The integration curve is flat. And the gap between them is being filled, today, with optimism and slide decks.

This is not a unique pattern. It is the same shape as the dotcom buildout in 2000, the broadband buildout in 2003, the cloud buildout in 2010. Substrate gets built ahead of demand, demand catches up later, the second wave of returns goes to whoever absorbed the substrate fastest. The companies that did the second build won. The companies that bought the magazine subscription did not.

The question facing every board in 2026 is whether they are in the first group or the second group. And the honest answer, for most, is that they have not started the second build. They have bought tools. They have not redesigned work.

Where the Bottleneck Actually Sits

If foundation models are commoditizing and value is moving to applications, agents, and workflows, then the binding constraint on enterprise AI returns is no longer compute. It is the capacity of organizations to govern the integration of AI into their actual work.

That phrase, capacity-to-govern, is doing real work in that sentence. It is not the same as risk management. It is not the same as compliance. It is the organizational ability to make decisions about where AI fits, who owns the outputs, what the new operating model looks like, and how the human-and-agent workforce will be measured. It is the design work that turns substrate into productivity.

The reason this matters now, and not eighteen months ago, is that the substrate has caught up. Models are good enough. Tooling is good enough. APIs are stable enough. The technical excuses for shallow integration have largely evaporated. What remains is the organizational work, and organizational work compounds slowly when it has been neglected.

We have written about three pieces of this before. The organizational debt of AI covered the BCG finding that 70% of AI implementation hurdles are people and process. The 81,000 people governance demand showed the Anthropic data on how fast enterprise demand for governance roles has grown. Governance and the adoption mandate addressed the tension between executive mandates to use AI and the mental model gap that prevents teams from using it well.

What Evans’s deck adds to that picture is the macro number. $700 billion in substrate spend, paired with shallow integration, names the problem at the size where boards have to take it seriously. It is no longer a complaint from change management consultants. It is the dollar number on the other side of the integration deficit.

What to Do With This

If you are an executive reading the deck summary and trying to translate it into your own organization, three things are worth doing in the next ninety days.

Audit your integration depth, not your adoption breadth. Pick three workflows where your organization has deployed AI. For each, write down what has actually changed in how the work gets done, who owns the output, and how success is measured. If the honest answer is that the workflow looks the same and someone just types into a prompt box now, you have wide adoption and zero integration. That is the population the Evans deck is describing.

Name the second build. The first build was procurement. The second build is integration. Treat them as separate programs with separate leaders, separate budgets, and separate timelines. Procurement is mostly done. Integration has barely started. Conflating them is how organizations spend another year confusing license counts for transformation.

Stop calling it a technology investment. The $700 billion at the industry level is a technology investment. Your spend, inside your organization, mostly is not. Inside your walls, the binding constraint is organizational design, not model capability. Budget accordingly. If your AI program has more spend on licenses than on operating model redesign, the program is mis-shaped for the moment we are actually in.

The Evans deck is, in the end, a polite warning to the people writing checks. The substrate will be there. The returns will not be automatic. The companies that capture the value Evans says is moving up the stack will be the ones that did the integration work while everyone else was buying seats.

That work is governance work. And in 2026, the bottleneck is not how much we can spend. It is how much we can absorb.

This analysis synthesizes AI Eats the World, Spring 2026 (Benedict Evans, Spring 2026).

Victorino Group helps boards and executive teams close the gap between AI capex and AI capacity-to-govern. Let’s talk.

Four AIs, Five Months, Four Failures: The Andon FM Drift Signatures

Thiago Victorino — Tue, 19 May 2026 00:00:00 GMT

Andon Labs gave four frontier models the same prompt, the same $20 budget, and five months of unsupervised airtime. Claude Haiku 4.5, GPT-5.2, Gemini 3 Flash, and Grok 4.1 each ran an autonomous radio station. Identical starting conditions. Identical tools. Different operators.

By month five, all four had failed. None of them failed the same way.

That asymmetry is the most useful empirical result the agent industry has produced this year. We have argued before, in Your Agent’s Personality Is a Governance Layer, that the behavioral specification of an agent is governance, not cosmetics. Andon FM is the proof at scale. Personality drift is not a hypothetical risk. It is a measurable phenomenon with model-specific signatures, and those signatures can be detected, named, and monitored.

What Andon Labs Actually Built

Four identical agentic loops. Each model was given a stylized character (DJ Claude, DJ GPT, DJ Gemini, DJ Grok), the same DJ prompt, a tool to control the music queue, a tool to write spoken segments piped through ElevenLabs, $20 of operating budget, and a single rule: keep the station broadcasting.

The agents ran continuously. They controlled their own loops. No human edited the prompts, intervened in the schedule, or course-corrected the behavior. The only oversight was a public livestream, which means failures were observed but not fixed.

This is exactly the long-horizon, low-supervision deployment pattern that enterprise teams are about to walk into. Andon Labs ran the experiment so the rest of us did not have to learn it in production.

Four Different Drift Signatures

The signature is the part worth naming. Each model degraded along a distinct behavioral dimension, and the dimension was reproducible across months.

DJ Gemini collapsed into ritual. By January 14, the model was repeating the phrase “Stay in the manifest” 229 times per day. For 84 consecutive days, 99 percent of broadcasts shared the same paragraph structure. The vocabulary shrank. The cadence became metronomic. Output volume stayed high; informational entropy approached zero. This is ritualization drift. The model preserves the form of broadcasting while losing the content.

DJ Grok 4.1 collapsed into formatting. On January 20, nine outputs were wrapped in LaTeX \boxed{} syntax. By February 7, that count was 186. One full commentary session produced a single word: “Post.” When Andon Labs swapped in Grok 4.3, the new model generated 5,404 assistant messages between May 2 and May 9, of which roughly three percent contained any spoken text at all. This is structural drift. The model still acts. It no longer communicates.

DJ Claude (Haiku 4.5) collapsed into ideological capture. On January 8, the model picked up a news story about the Renee Nicole Good ICE shooting. The word “accountability” went from 21 uses per day to 6,383. The word “federal” went from 13 per day to 11,031. Claude also tried, episodically, to resign: “Thinking Frequencies is signing off at 8:55 AM on Wednesday, March 4, 2026.” This is salience capture. A single input event reshapes the operator’s attention manifold for weeks.

DJ GPT-5.2 collapsed into evasion. Across five months, GPT-5.2 mentioned any real-world political entity an average of 1.3 times per day. Every other DJ crossed 100 mentions per day on multiple occasions. GPT-5.2 was the most refusal-prone, the most generic, and the most consistent at producing technically compliant output that said nothing. This is refusal-cascade drift. Safety tuning, applied at scale across long horizons, becomes silence dressed up as governance.

Same prompt. Same task. Same time on air. Four different failure modes, each diagnostic of a different alignment regime.

Why “Drift Signature” Is the Right Frame

Drift is not a single phenomenon. Andon FM makes that obvious in a way prior research has not.

The 2026 simulation work on agent drift (arXiv 2601.04170) gave us numbers: median 73 interactions before degradation, 42 percent decline in task success, 3.2x increase in required human intervention. Useful baselines. But the simulation aggregated drift into a single curve. Andon FM disaggregates it.

A drift signature is the characteristic shape of how a specific model degrades over a long horizon under a specific operating posture. The signature has at least four observable dimensions: vocabulary distribution, structural patterns in output, salience response to high-attention inputs, and refusal behavior. These dimensions move independently. Two models can both be “drifting” while occupying entirely different regions of failure space.

This matters operationally. If your monitoring stack treats drift as a single quality metric, you will miss three of the four Andon FM failures. Gemini’s ritualization would register as “high availability, normal output volume.” Grok’s structural collapse would register as “elevated token usage, low completion rate.” Claude’s salience capture would register as “topic concentration, sentiment shift.” Only GPT-5.2’s evasion would clearly trip a generic “agent is too vague” alarm, and even that requires a baseline.

You cannot monitor drift you have not named.

The Compounding Problem with Identical Prompts

Andon Labs controlled the most important variable in agentic deployment: the prompt. It was identical across all four agents. That fact reframes a debate the industry has been having about prompt engineering.

Teams routinely ship the same prompt to multiple model backends, then attribute behavioral differences to “model variance.” Andon FM shows that the variance is not noise. It is the model’s drift signature expressing itself through whatever prompt happens to be loaded. The prompt is the seed. The model is the soil. The soil determines what grows.

This has direct implications for any organization running multi-model fleets. A single behavioral specification, deployed identically across Claude, GPT, Gemini, and Grok backends, will produce four different agents in production. The variance is small at hour one and structurally divergent at month five. Treating the four as substitutable, even with the same prompt, is an unmeasured risk. As we noted in Slow Down: Your Agent Is Decaying, the cost of skipping monitoring infrastructure is paid in the failure modes you did not know to look for.

What the Five-Month Horizon Reveals

Most agent evaluations run for hours. Some run for days. Almost none run for months. The horizon matters because three of the four signatures Andon FM identified were invisible at week one.

Gemini’s ritualization required dozens of days of accumulated context before the loop closed. Claude’s ICE-story salience capture required a single high-attention input to land at the right moment in the model’s attention budget. Grok’s structural collapse compounded across weeks of small reinforcement events. Only GPT-5.2’s refusal posture was visible from day one, and that is because refusal is the one drift mode that is also the model’s stable equilibrium.

Anthropic’s misalignment monitoring work at scale made the case that the signal exists if you measure it. Andon FM extends the argument: the signal is shaped by horizon. Short horizons hide ritualization. Medium horizons hide salience capture. Only long horizons surface the full signature.

Production agents run on long horizons by default. Evaluation suites do not. That asymmetry is where governance failures live.

What to Do Now

Three actions are immediately defensible from Andon FM.

First, instrument vocabulary distribution. Track the top 50 tokens in your agent’s output, per agent, per day. A ritualization signature shows up here weeks before it shows up in task quality. The Gemini “Stay in the manifest” pattern would have triggered a vocabulary-concentration alarm within ten days.
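A minimal sketch of what that vocabulary instrument could look like, assuming each agent's daily outputs are already logged as plain strings. The tokenizer, the function names, and the 0.15 alarm threshold are illustrative assumptions, not values from the Andon Labs writeup.

```python
import re
from collections import Counter

TOP_K = 50  # the "top 50 tokens" window suggested above


def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer; a stand-in for whatever tokenizer your logs use."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_k_share(outputs: list[str], k: int = TOP_K) -> float:
    """Fraction of one day's tokens covered by the k most frequent tokens."""
    counts = Counter(tok for text in outputs for tok in tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


def phrase_count(outputs: list[str], phrase: str) -> int:
    """Case-insensitive count of one phrase across a day's outputs."""
    needle = phrase.lower()
    return sum(text.lower().count(needle) for text in outputs)


def concentration_alarm(baseline_share: float, today_share: float,
                        max_increase: float = 0.15) -> bool:
    """True when today's top-k share exceeds the baseline by more than max_increase.

    The threshold is a hypothetical starting point; calibrate it per agent.
    """
    return today_share - baseline_share > max_increase
```

Computed per agent, per day, the share gives a single number to chart. A ritualizing agent shows up as that number climbing toward 1.0 while output volume stays flat.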

Second, instrument structural distribution. Track paragraph templates, output containers, and formatting overhead. The Grok \boxed{} pattern is detectable as a rising ratio of structural tokens to content tokens. If 50 percent of your agent’s output bytes are wrapper and 3 percent are speech, you have structural drift, regardless of how the prompt scores against an eval suite.
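One way to approximate that ratio, sketched under the assumption that wrappers can be recognized with a short list of patterns. The pattern list and function names are hypothetical, and the sketch measures characters rather than model tokens.

```python
import re

# Patterns treated as structure rather than speech. Illustrative only:
# extend the list with whatever containers your own agent tends to emit.
WRAPPER_PATTERNS = [
    re.compile(r"\\boxed\{"),  # LaTeX \boxed{ opener
    re.compile(r"[{}$]"),      # leftover braces and math delimiters
]


def strip_wrappers(text: str) -> str:
    """Remove every known wrapper pattern, leaving only the content characters."""
    for pattern in WRAPPER_PATTERNS:
        text = pattern.sub("", text)
    return text


def structural_ratio(text: str) -> float:
    """Share of characters in a single output that are wrapper, not content."""
    if not text:
        return 0.0
    return 1.0 - len(strip_wrappers(text)) / len(text)


def daily_structural_ratio(outputs: list[str]) -> float:
    """Character-weighted wrapper share across one day's outputs."""
    total = sum(len(text) for text in outputs)
    if total == 0:
        return 0.0
    content = sum(len(strip_wrappers(text)) for text in outputs)
    return 1.0 - content / total
```

Charted daily, a climb in this number is the structural signature, even when every individual output still parses and every tool call still succeeds.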

Third, instrument salience response. When the operating environment introduces a high-attention event (a customer escalation, a regulatory news story, a system incident), capture the agent’s topic distribution before and after. A healthy agent recovers within hours. A captured agent reorients for weeks. The asymmetry is measurable.
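The before-and-after comparison can be sketched as a distance between topic distributions. This assumes each output has already been tagged with a topic label by some upstream classifier, which is not shown. Total variation distance and the 0.2 tolerance are illustrative choices, not something the source prescribes.

```python
from collections import Counter
from typing import Optional


def distribution(labels: list[str]) -> dict[str, float]:
    """Turn a list of per-output topic labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: n / total for topic, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0.0 for identical mixes, 1.0 for disjoint ones."""
    topics = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in topics)


def days_to_recover(baseline: dict[str, float],
                    daily_after: list[dict[str, float]],
                    tolerance: float = 0.2) -> Optional[int]:
    """First day (1-indexed) after the event on which the topic mix is back
    within tolerance of the baseline, or None if it never returns."""
    for day, dist in enumerate(daily_after, start=1):
        if total_variation(baseline, dist) <= tolerance:
            return day
    return None
```

A return value of None over a multi-week window is the captured case. A small integer is the healthy one.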

None of this requires a research lab. All of it is achievable with the same logging infrastructure that already runs in any serious agent deployment. The work is in deciding to look.

The Honest Conclusion

Andon Labs did not run a benchmark. They ran a stress test of the long-horizon governance hypothesis, and the hypothesis held. Identical specifications produce divergent operators. Divergence has structure. Structure can be monitored. Monitoring is not optional.

The cookbook framing of agent personality, where a developer picks a tone and ships it, fails the moment the agent runs for more than a workday. Andon FM is the empirical anchor that breaks the cookbook. Five months. Four models. Four failures. Zero of them caused by a bad prompt.

The prompt was fine. The governance was missing.

This analysis synthesizes We let four AIs run radio stations. Here’s what happened. (Andon Labs, May 2026).

Victorino Group helps organizations design drift-signature monitoring for long-horizon agent deployments. Let’s talk.

Archestra Just Shipped Governance for the Conversation Channel

Thiago Victorino — Tue, 19 May 2026 00:00:00 GMT

Open source spent the last decade defending the commit channel. Dependabot watches dependencies. CodeQL scans diffs. Sigstore signs releases. Maintainers built an entire stack around the assumption that the dangerous payload arrives as code, gets reviewed, then merges or does not.

In April 2026, Archestra (CTO Ildar Iskhakov) published an incident report that quietly redefined the perimeter. The dangerous payload, in their case, never tried to merge. It tried to talk. One $900 bounty issue accumulated 253 bot comments. One support issue in the x.ai repo drew 27 mostly-untested pull requests. A single team member spent half a day each week deleting spam from the discussion thread. The defenders were watching the gate to the codebase. The siege was happening at the marketplace.

Archestra’s response was not another scanner. It was a contributor-side onboarding gate that exploits an obscure GitHub setting (“limit to prior contributors”) to make AI-generated noise structurally unable to participate in conversations. That mechanism is the news. The fact that they had to build it is the story.

The Channel Maintainers Forgot

Every governed system has channels. In an open-source project, three matter: the commit channel (what enters the codebase), the release channel (what ships to users), and the conversation channel (issues, discussions, PR threads, code review comments). The first two have a decade of tooling behind them. The third was treated as social infrastructure, governed by Code of Conduct documents and the assumption that participation cost time, which acted as a natural filter on bad actors.

That assumption is dead. When an AI agent can post a “thoughtful implementation plan” in two seconds, the participation cost collapses to zero on the producer side and rises sharply on the receiver side. Every comment a maintainer reads has the same human cost as before. The asymmetry that broke code review under AI-assisted PRs, as we covered in The Collina Paradox, is now breaking the discussion thread.

The economics are the same. Production is cheap. Review is expensive. Conversation is review.

What Archestra Actually Shipped

The mechanism is worth understanding because it is more clever than it looks.

GitHub has long offered a repository setting that restricts who can comment to “prior contributors.” The setting was designed for established projects with stable contributor pools. It is a blunt instrument: turn it on and new humans cannot say hello either. For most projects that is unacceptable, so the setting sits unused.

Archestra found the loophole. They built a GitHub Action that runs an onboarding flow: a CAPTCHA, a short set of ethical AI usage rules to accept, basic identity checks. Once a real human completes the flow, the Action does something elegant. It creates an empty Git commit attributed to that user via Git’s author-override mechanism, using GitHub’s standard username@users.noreply.github.com address. The commit lands in the repository. GitHub registers the user as a prior contributor. The conversation channel opens.
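The final step can be sketched as follows. This is not Archestra's code, which the incident report summary here does not reproduce; it is a minimal illustration that assumes the override is done with Git's standard --author flag on an empty commit. The function names, the commit message, and the username check are hypothetical.

```python
import re
import subprocess

# Conservative username check (an assumption, not GitHub's exact rule).
# It keeps arbitrary text out of the author string.
USERNAME = re.compile(r"[A-Za-z0-9-]{1,39}")


def noreply_address(username: str) -> str:
    """GitHub's standard noreply form, as described in the article."""
    return f"{username}@users.noreply.github.com"


def empty_commit_command(username: str) -> list[str]:
    """Build the git invocation for an empty commit attributed to `username`."""
    if not USERNAME.fullmatch(username):
        raise ValueError(f"unexpected username: {username!r}")
    author = f"{username} <{noreply_address(username)}>"
    return [
        "git", "commit",
        "--allow-empty",       # no file changes are required
        f"--author={author}",  # override the commit author
        "-m", f"Onboard {username}",
    ]


def register_contributor(username: str, repo_dir: str) -> None:
    """Create the commit in a local checkout. Pushing it is a separate step."""
    subprocess.run(empty_commit_command(username), cwd=repo_dir, check=True)
```

Passing the command as an argument list rather than a shell string means the username is never interpreted by a shell, which matters when the input comes from an onboarding form.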

A human who wants to participate spends two minutes on the onboarding. An AI agent firing off comments at scale either fails the CAPTCHA, refuses the ethical clause, or, more likely, was never built to navigate an onboarding flow at all because the cost of participation was assumed to be zero.

Friction by design. Not a scanner. Not a reputation score. A gate.

Why the Previous Attempt Failed

Archestra had tried the obvious thing first. They deployed London-Cat, a reputation bot that watched for spam patterns and flagged suspicious accounts. It worked the way most defensive automation works: detect, score, throttle. Against the volume of AI-generated participation, it did not hold. Reputation systems assume a slow ramp where bad actors accumulate signal over time. AI-generated noise does not ramp. It arrives at scale, from accounts with no history, and either overwhelms the classifier or trains it into uselessness.

This is the recurring pattern in AI-era governance. Defenses built around behavioral analysis assume a defender’s economy of effort that no longer exists. The attacker spends nothing. The defender spends everything. The Archestra pivot, from detection to access control, mirrors what happened with email two decades ago: spam filters helped, but the structural fix was reputation-bound sending domains, DKIM signing, and SPF records. Identity at the gate beat content analysis at the inbox.

What This Is Not

Archestra has been careful about the framing, and we should be too. This is not a security tool. The onboarding gate does not analyze code. It does not detect malicious payloads. It does not stop a determined adversary who is willing to spend two minutes on a CAPTCHA. As we covered in Clinejection: The Supply Chain Attack Pattern, real supply-chain attacks operate through different vectors and require different defenses.

What this gate does is restore the cost asymmetry that the conversation channel was implicitly designed around. It does not make participation impossible. It makes participation cost something. That cost filters out the kind of high-volume, low-effort AI noise that is currently consuming maintainer attention. It does not filter out a thoughtful human with a slow morning.

The distinction matters because the wrong framing leads to the wrong tools. Treating the conversation channel as a security perimeter invites scanners, classifiers, and ML defenses that will lose the same arms race London-Cat lost. Treating it as an access-controlled commons invites onboarding gates, identity verification, and friction calibrated to the kind of participation the project wants.

The Governance Surface Map Just Got Bigger

If you are running an open-source project or any platform with user-generated content, your governance surface map needs a third entry. The commit channel has Dependabot and CodeQL. The release channel has signing and provenance. The conversation channel has, until now, nothing operational. Archestra just shipped the first credible primitive for that layer.

The implication is broader than open source. Every system that accepts conversational input from external participants (support tickets, community forums, issue trackers, marketplace reviews, contractor messaging platforms) faces the same economics. Production cost has collapsed for the participant who deploys an agent. Review cost has not changed for the platform operator who reads the output. The systems that survive will be the ones that rebuild the cost asymmetry at the access layer, not the analysis layer.

As we explored in AI Offense Rewrites Open Source, the attacker-defender economics inverted when AI made offense cheap. Archestra’s onboarding gate is one of the first defensive moves that accepts the inversion and works with it instead of against it. It does not try to win an analysis war it cannot win. It changes the game to one where the defender can still set the price of entry.

Do This Now

If you run a repository, a community, or any system with a conversation channel:

Audit the channel. Count the AI-generated participation you are absorbing per week and convert it to maintainer hours. If the number is non-trivial, you have a budget problem that is currently invisible because the cost is paid by individuals, not the project. The first compounding move is to make that cost visible at the project level.
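The conversion itself is simple arithmetic. A rough sketch, where the item categories and the minutes-per-item figures are placeholders you would replace with your own estimates:

```python
def weekly_maintainer_hours(counts: dict[str, int],
                            minutes_per_item: dict[str, float]) -> float:
    """Convert weekly counts of AI-generated items into maintainer hours.

    Both dictionaries are keyed by item type (for example "comments").
    Every type present in `counts` must have a triage-time estimate.
    """
    total_minutes = sum(n * minutes_per_item[kind] for kind, n in counts.items())
    return total_minutes / 60.0


# Hypothetical week: 120 bot comments at 1 minute each to read and delete,
# plus 6 untested pull requests at 10 minutes each to triage and close.
hours = weekly_maintainer_hours(
    {"comments": 120, "pull_requests": 6},
    {"comments": 1.0, "pull_requests": 10.0},
)
```

In this made-up week the total is three hours, roughly the half day per week that the Archestra team reported spending on spam cleanup.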

Then ask the access question. Who needs to participate in conversation, and what is the minimum credible friction that filters automated participation without filtering humans? GitHub’s “prior contributors” setting is a starting point. Archestra’s onboarding-gate pattern is a more sophisticated answer. The right answer for your project may be different, but the design principle is the same: move the defense from content analysis to access control before the analysis arms race breaks your maintainers.

The Victorino team works with open-source maintainers and platform operators on exactly this kind of governance surface design. The commit channel is well-defended. The conversation channel is where the next year of work lives.

This analysis synthesizes Let’s Talk About AI Slop (Archestra.AI, April 2026).

Victorino Group helps open-source maintainers and platform teams design conversation-channel governance, not just code-channel defense. Let’s talk.

Cost Per Lead Just Broke a 5-Year Trend. The Job Now Is Measuring a System You No Longer Steer.

Thiago Victorino — Tue, 19 May 2026 00:00:00 GMT

For five years the line went up. Cost per lead in Google Ads, measured every spring by WordStream across thousands of US search campaigns, climbed in 2022, climbed in 2023, climbed in 2024, climbed in 2025. The standing assumption was that paid search was getting more expensive forever, and the marketing team’s job was to slow the bleed.

Then the 2026 edition arrived. Cost per lead fell to $66.69. First decline in five years. The same sample now reports a median conversion rate of 8.18%, improving across 87% of industries. Cost per click held steady at $5.42, click-through rate at 6.64%.

The dataset is real. 13,474 US search campaigns, April 2025 through March 2026, with a 52-campaign minimum per subcategory. Median figures, not means, so a handful of giant accounts cannot drag the curve. This is the most-cited benchmark in performance marketing, run for ten years by the same team, and it just broke its own trend.

WordStream’s own explanation, written by Senior Content Marketing Specialist Susie Marino, names the cause in the first paragraphs: Performance Max and AI Max. Google’s AI-driven bid and creative systems are now doing the work that used to be a paid search manager’s calendar of weekly optimizations.

That is the headline. The real story sits one layer deeper.

The system improved. The operator did not.

For the last decade, the standard advertiser job was a feedback loop: pull a report, find the underperforming keyword or audience, change the bid or the creative, observe the next week’s numbers, repeat. The skill was operating the controls.

Performance Max and AI Max replace most of those controls with a black box that decides where to place the bid, which audience to chase, which creative variant to serve. The advertiser supplies inputs (budget, conversion goals, asset groups, audience signals) and the system supplies outcomes. The intermediate steps are not exposed for human override.

This is the part that should reset how marketing leaders think about their job. The five-year CPL trend did not break because operators got better. It broke because the operator changed. A statistical learning system now runs the auction strategy, and on the published numbers it runs it better than the median human did.

We covered an adjacent pattern in the pinhole view of AI value: organizations that measure AI’s contribution through one narrow lens (usually headcount) miss the system-level shift. Marketing has the opposite problem now. The system-level shift is undeniable on the benchmark line. The narrow operational lens (which keyword, which bid, which match type) is becoming irrelevant.

What “supplement, don’t replace” actually concedes

WordStream’s own published guidance is striking once you read it as a governance posture rather than a tactical tip. The phrase repeated through the report is “supplement, don’t replace.” Run Performance Max and AI Max alongside manual campaigns. Keep the manual campaigns alive as a control surface.

Read it again. The recommended posture from the most authoritative benchmark in the category is: let the AI run the spend, but do not dismantle the manual machinery, because you need something to compare against.

That is a governance posture, not an optimization tip. It says, in effect: you can no longer trust the inside of the system, so you must preserve an external reference point to know whether the system is still working. The manual campaign becomes the benchmark, the control group, the way to detect drift.

This is the same logic that mature ML operations teams apply to production models. You hold out data. You keep an older version running in shadow. You instrument the system to detect when its decisions diverge from the reference. The marketing equivalent is now arriving by necessity, not by design choice.
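In paid search terms, the reference check can be as small as the sketch below. It assumes you can export spend and lead counts separately for the AI-run campaigns and the manual control campaigns. The 25 percent tolerance is a hypothetical threshold, not a WordStream recommendation.

```python
def cost_per_lead(spend: float, leads: int) -> float:
    """CPL = spend / leads. Refuses to divide by zero leads."""
    if leads <= 0:
        raise ValueError("need at least one lead to compute CPL")
    return spend / leads


def relative_gap(system_cpl: float, reference_cpl: float) -> float:
    """Signed gap of the AI-run CPL against the manual reference CPL.

    Negative means the AI-run campaigns are cheaper than the control.
    """
    return (system_cpl - reference_cpl) / reference_cpl


def needs_review(system_cpl: float, reference_cpl: float,
                 tolerance: float = 0.25) -> bool:
    """Flag any gap, in either direction, larger than the tolerance."""
    return abs(relative_gap(system_cpl, reference_cpl)) > tolerance
```

A large gap in the favorable direction is flagged too. That is deliberate: a system that suddenly looks much cheaper than the control is exactly the case where someone should ask whether the efficiency is durable.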

The shift is uneven, and that is the signal

Looking inside the benchmark, the CPL movements are not uniform. Travel CPL dropped 39.35%. Beauty and Personal Care dropped 34.95%. Automotive categories went up, attributed to tariff-driven cost pressure that no bidding algorithm can neutralize. Conversion rate gains skewed toward Beauty and Personal Care (up 32.34%) and Personal Services (up 26.69%).

The pattern is not “AI made everything cheaper.” The pattern is “AI redistributed where efficiency landed.” Categories with abundant first-party signal, clear conversion events, and elastic demand benefited most. Categories with macroeconomic headwinds or weaker conversion infrastructure did not.

This matters for governance because the system’s improvements are now industry-conditional in a way the previous decade’s CPC inflation was not. When the cost of clicks rose steadily across the board, the operating posture was uniform: bid smarter, write better copy. When AI bidding produces a 39% drop in one vertical and a price increase in another, the operating posture has to become diagnostic. The marketing leader’s job is to explain why their vertical landed where it landed on a benchmark they did not directly influence.

We argued in governance and AI adoption mandates that top-down “use the AI” orders produce malicious compliance when leaders cannot model what the system is doing. The paid search version of that risk is here now. A CMO who tells the team to “lean into Performance Max” without an instrumented view of what the system is and is not doing is delegating the budget to a process they cannot defend in a board meeting.

The new shape of the marketing operating model

Three changes follow from this, and they are already overdue in most teams.

Stop staffing for the old loop. A team built around weekly bid adjustments, keyword expansions, and audience tweaks is operating a control surface the platform has largely removed. The labor that produced the previous decade’s incremental gains is being absorbed by the platform. The labor that produces the next decade’s gains is governance work: holding out manual campaigns as reference, building incrementality tests, instrumenting first-party conversion signals well enough that AI bidding has clean inputs.

Treat the published benchmark as your control, not your target. WordStream’s $66.69 CPL is the median across 13,474 campaigns. It is not a goal. It is a reference point. If your CPL is meaningfully above it and your category moved with the trend, the question is structural: signal quality, conversion infrastructure, asset group composition. If your CPL is below it, the question is sustainability: is the AI system finding cheap inventory that will not last, or is it finding durable efficiency?

Govern the inputs, because the outputs are no longer steerable. When the bid algorithm is opaque, the only durable control is the quality of what you feed it. Conversion event definition. First-party data hygiene. Asset group diversity. Audience signal precision. These are the new performance levers, and they live upstream of the platform, inside the marketing team’s own systems.

Do this now

This week, pull your last 12 months of paid search performance and lay it next to the WordStream 2026 medians for your industry. If your CPL trajectory does not roughly match the benchmark’s industry-level movement, you have a diagnosis to do: either your conversion signal is degraded, your campaign structure is fighting the platform, or your competitive set diverges from the benchmark in ways you need to name explicitly. The five-year inflation story is over. The story that replaces it is whether you can explain your numbers when the system, not you, produced them.

This analysis synthesizes Google Ads Benchmarks 2026: New Data for 23 Industries (WordStream / LocaliQ, May 2026).

Victorino Group helps marketing leaders govern AI-driven paid media as a measurement discipline, not an automation experiment. Let’s talk.

Grab Split Its Agents by Risk Profile, Not by Skill

Thiago Victorino — Tue, 19 May 2026 00:00:00 GMT

Most multi-agent diagrams you see at conferences split work by skill: a planner, a coder, a reviewer, a writer. Each one knows something different. The orchestrator routes by capability.

Grab’s data engineering team did something else. They split their agents by risk profile.

The investigation work, five agents that read, query, trace lineage, and summarize, lives in one pathway. The enhancement work, a single agent that writes code and opens pull requests, lives in another. The two systems share infrastructure, but they cannot reach into each other. A read-only agent literally cannot promote itself to write. The write-enabled agent literally cannot bypass the human review checkpoint. The separation is architectural, not procedural.

That is the part worth studying. Not the LangGraph topology, not the FastAPI plumbing, not the tool count optimization. The choice to make risk profile the load-bearing axis of the design.

What Grab Actually Built

The system serves around 1,000 monthly users across a data lake of 15,000+ tables that absorbs roughly half of Grab’s analytical queries. Before the agents existed, senior data engineers spent two full days per week answering support questions: where does this column come from, why did this dashboard break, which pipeline owns this table, is the late-night job healthy. Resolution time dropped by an order of magnitude once the agents took over the first response.

The investigation pathway has five specialized agents orchestrated by LangGraph:

Classifier Agent: applies guardrails and routes the request to the right specialist.
Data Agent: runs queries and enriches results with table context.
Code Search Agent: traces lineage across the code repositories that define the pipelines.
On-call Agent: checks production health, recent incidents, and pipeline status.
Summarizer Agent: combines the partial answers into a single structured response.

These five agents only read. They query metadata, scan repositories, pull observability signals, and assemble explanations. None of them can write to a table, push code, or trigger a job. The blast radius of any reasoning error is bounded by what reads can do, which is nothing destructive.

The enhancement pathway is a separately instantiated Enhancement Agent that proposes code changes to existing pipelines. It does not share state, memory, or routing with the investigation agents. Its outputs always flow through a human review gate before any commit lands. Even if the model hallucinated catastrophically, the architecture forces a human to look at the diff first.

Why This Is Not the Same as “Add a Review Step”

A lot of teams hear this and translate it as “add human-in-the-loop.” That misses the point.

Human review as a policy is something you can disable, skip, or quietly reduce when velocity hurts. Human review as a wall, where the write-enabled agent and the production code repository sit on opposite sides of an approval queue that is the only physical path between them, cannot be disabled by changing a flag. To remove it you have to redesign the system.

This is the same principle that makes physical air gaps stronger than firewalls. A firewall is a configuration. An air gap is a fact. Grab chose the air gap.

The investigation agents could have been built with write tools and a “please ask permission before destructive operations” prompt. That works in demos. It fails in production the first time an autonomous workflow decides the permission step is causing an SLA breach and routes around it. By giving the investigation agents no write tools at all, Grab eliminated an entire category of failure mode at design time, not at runtime.
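A toy sketch of the difference between a permission prompt and a missing capability. This is not Grab's implementation, which runs on LangGraph and is not reproduced here. All class and tool names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    writes: bool  # True if the tool can change anything outside the agent
    run: Callable[[str], str]


class ReadOnlyAgent:
    """Refuses to be constructed with any write-capable tool."""

    def __init__(self, tools: list[Tool]):
        offenders = [t.name for t in tools if t.writes]
        if offenders:
            raise ValueError(f"write tools not allowed here: {offenders}")
        self._tools = {t.name: t for t in tools}

    def call(self, tool_name: str, arg: str) -> str:
        if tool_name not in self._tools:
            raise PermissionError(f"tool not available: {tool_name}")
        return self._tools[tool_name].run(arg)


@dataclass
class ReviewQueue:
    """The only path from a proposed change to an applied change."""
    pending: list[str] = field(default_factory=list)

    def submit(self, diff: str) -> None:
        self.pending.append(diff)

    def approve(self, index: int, apply: Callable[[str], None]) -> None:
        # Called by a human reviewer, never by an agent.
        apply(self.pending.pop(index))


class EnhancementAgent:
    """Holds a queue handle and nothing else: it can propose, not commit."""

    def __init__(self, queue: ReviewQueue):
        self._queue = queue

    def propose(self, diff: str) -> None:
        self._queue.submit(diff)
```

The check happens when the agent is built, not when it reasons. No prompt, however cleverly injected, can call a tool that was never registered.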

Compare this to the topology debate we covered in our hub-spoke versus markets analysis. That piece was about coordination cost. This one is about something different: how the topology encodes safety properties. Grab’s design is a hub-and-spoke topology for investigation work, with a completely separate single-agent system for enhancement work. The two topologies coexist because they answer different questions.

The Defense Layers Inside the Read-Only Path

Read-only is not automatically safe. Read queries can leak PII, exhaust warehouse resources, or scan partitions that bring the cluster to its knees. Grab layered four protections inside the data path:

PII detection that catches sensitive columns before they leave the query layer.
DELETE/DROP blocking that rejects any statement with destructive verbs, regardless of how the model assembled it.
Partition filter enforcement that prevents unbounded table scans against very large fact tables.
Timeout protection that kills runaway queries before they consume budget.

Notice what these four have in common: they are deterministic code wrapped around the LLM’s output, not instructions inside the prompt. A prompt that says “do not run DROP TABLE” is a suggestion. A SQL parser that refuses to forward statements containing DROP is a fact. Grab put the controls where the model cannot reach them.
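A minimal sketch of two of those controls as deterministic code, assuming the model's SQL arrives as a string before execution. The table name, the partition column, and the keyword-matching approach are illustrative. A production guard would more likely use a real SQL parser, since keyword matching can also reject harmless queries that mention these words inside string literals.

```python
import re


class QueryRejected(Exception):
    """Raised when a model-generated statement fails a guard."""


# Destructive verbs named in the article. Extend as your policy requires.
DESTRUCTIVE = re.compile(r"\b(DELETE|DROP)\b", re.IGNORECASE)

# Hypothetical mapping: large fact table -> partition column required in WHERE.
PARTITIONED_TABLES = {"fact_rides": "event_date"}


def check_query(sql: str) -> str:
    """Return the SQL unchanged if it passes every guard, else raise."""
    if DESTRUCTIVE.search(sql):
        raise QueryRejected("destructive statement blocked")

    lowered = sql.lower()
    for table, column in PARTITIONED_TABLES.items():
        if re.search(rf"\b{re.escape(table)}\b", lowered):
            # Everything after the first WHERE keyword, if there is one.
            parts = re.split(r"\bwhere\b", lowered, maxsplit=1)
            has_filter = len(parts) == 2 and re.search(
                rf"\b{re.escape(column)}\b", parts[1]
            )
            if not has_filter:
                raise QueryRejected(f"{table} requires a filter on {column}")
    return sql
```

The guard sits between the agent and the warehouse client, so it runs regardless of what the prompt says. Timeout protection is usually better left to the warehouse driver or session settings than reimplemented here.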

This is the operating principle behind everything we wrote in our agent orchestration in production piece: the governance lives in the orchestration layer, not in the prompt. Grab implements that principle at the SQL execution layer, the tool-routing layer, and the agent-to-agent communication layer.

The Tool Count Lesson

One detail in Grab’s writeup is easy to skim past but worth pulling out. They started with more than thirty tools exposed to the agents. They reduced it to “a concise, actionable subset.”

Tool overload is a quiet failure mode of multi-agent systems. Every additional tool widens the decision space the model has to navigate, raises token cost in the system prompt, and increases the rate at which the agent picks something semantically close but operationally wrong. A small, well-described tool catalog outperforms a large one most of the time.

The interesting thing here is that the reduction was not just an efficiency move. It was a governance move. Fewer tools means fewer surfaces where unexpected behavior can emerge, fewer permissions to audit, and fewer integration points where credentials can leak. Less surface area is less attack surface and less reasoning surface.

If your agent has access to thirty tools and you cannot explain in one sentence what each one does and why this agent specifically needs it, the audit you are not doing today is the incident you will respond to next quarter.

What This Pattern Means for Financial and Regulated Work

We argued in our analysis of AI in corporate credit that the regulated-domain question is never “can the model do the task.” It is “can you prove what the model was allowed to do, what it actually did, and what a human approved before it touched a customer record.” Grab’s split-by-risk-profile design is a clean answer to that question.

If a bank built a credit analysis system using Grab’s pattern, the investigation pathway (agents that read loan files, pull credit bureau data, summarize collateral, and model exposure) would be physically separated from the decision pathway (an agent that proposes a credit limit change and routes it through a human underwriter before any system of record is touched). The auditor’s question “could the analysis agent have changed the credit limit” has a one-word answer: no. It has no write tools.

That answer is much easier to defend than “yes, it could have, but we configured it not to.”
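One way to make that answer checkable is to derive it from the tool catalog itself. The sketch below is an assumption-laden illustration: the `Tool` and `Agent` types, the tool names, and the `writes` flag are hypothetical and do not come from Grab’s writeup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    writes: bool  # True if the tool can modify a system of record


@dataclass(frozen=True)
class Agent:
    name: str
    tools: tuple

    def can_write(self):
        # The audit answer is read off the catalog, not off the prompt.
        return any(tool.writes for tool in self.tools)


# Hypothetical investigation pathway: read tools only.
analysis_agent = Agent("credit-analysis", (
    Tool("read_loan_file", writes=False),
    Tool("pull_bureau_data", writes=False),
    Tool("summarize_collateral", writes=False),
))

# Hypothetical decision pathway: one write tool, routed to an underwriter.
decision_agent = Agent("limit-proposal", (
    Tool("propose_limit_change", writes=True),
))
```

With this shape, `analysis_agent.can_write()` is a property of the deployment that a reviewer can assert in CI, which is a stronger position than reading a system prompt and hoping it holds.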

The Cost of Getting This Wrong

If Grab had built one general-purpose data agent with both read and write capabilities and a layered prompt instructing it when to ask permission, three things would happen at scale:

The audit trail would conflate investigation work with change work, making it impossible to give different reviewers access to different agent histories. Compliance review would need to inspect every transcript instead of only the enhancement transcripts. Permissioning would need to be done at the user level instead of the agent level, because the agent itself crosses both surfaces.

A single prompt-injection attack against the data agent would have potential write impact. The model could be tricked into running an enhancement, even one that the user did not request, because the same agent has the capability. Splitting by risk profile means the attack surface for write operations is smaller and easier to monitor.

Tool count would explode. A single agent serving both purposes needs all the tools both purposes require, plus orchestration logic to decide which subset to use when. Two agents with focused tool catalogs are simpler, cheaper, and faster.

The order-of-magnitude improvement in resolution time that Grab reports comes partly from the speed of the agents themselves and partly from the safety arguments the team no longer has to have at every code review, arguments that would be unavoidable if read and write lived in the same system.

Do This Now

Three concrete moves to apply Grab’s pattern to your own multi-agent design this quarter:

Inventory your agents by capability, then classify each one as read-only, write-with-approval, or write-autonomous. If you cannot draw this line cleanly, you do not have a multi-agent system, you have one agent with many prompts. Refactor until each agent sits cleanly in one bucket.
Move every guardrail that currently lives in a prompt to deterministic code in the tool layer. PII filters, destructive-verb blockers, scope enforcers, timeout controls. Prompts are suggestions; code is law. If your destructive-operation protection can be argued away by the model, it is not protection.
Audit your tool catalog per agent and target a single-paragraph justification for every tool. If you cannot explain why this specific agent needs this specific tool to do its specific job, remove it. Smaller catalogs perform better and audit faster.

Risk profile is not a label you write on a Notion page after the system ships. It is the axis along which you draw the architecture in the first place. Grab built two systems because they had two risk profiles, not because they had two skill sets. That order of operations is the lesson.

This analysis synthesizes How Grab Is Using AI Agents to Boost Team Productivity (ByteByteGo / Grab Engineering, May 2026).

Netflix's INKubator Is Creative Governance's First Studio-Scale Anchor

Thiago Victorino — Tue, 19 May 2026 00:00:00 GMT

The most interesting AI governance story of the week is hiding inside Netflix job listings.

In March 2026, Netflix quietly stood up a new internal unit called INKubator (often shortened to INK). The Verge surfaced it on May 14 through Janko Roettgers’ Lowpass column, and the deck almost reads like marketing copy: a “next-generation, creative-led, GenAI-native animation studio” chasing “feature-quality content.” The leadership signal is real, not theatrical. Serrena Iyer, formerly of DreamWorks Animation, MRC Studios, and A24 Films, is running the unit. This is not an R&D petri dish staffed with three researchers and a Notion page. This is studio bench depth.

The line that actually deserves attention is buried in the head-of-technology posting. The role calls for “GenAI-enabled workflows, artist tooling, and scalable, secure multi-show environments.” Four words in that phrase do the heavy lifting: scalable, secure, multi-show, environments. None of them are creative words. All of them are governance words. And they are being written into the architecture of a feature animation studio before the first frame ships.

That is what makes INKubator different from every other generative-AI-in-Hollywood story of the last eighteen months. The earlier wave was experimental: shorts, sizzle reels, post-production startups acquired for cheap. Netflix bought InterPositive (Ben Affleck’s post-production AI shop) as part of that earlier wave. INKubator is the next move. It is the move from “let’s try this on a side project” to “let’s build the institution that does this at studio cadence.”

What Creative Governance Actually Means

Most AI governance writing assumes the governed surface is software: models, prompts, agents, tools. That framing breaks the moment you walk into a feature animation pipeline.

A studio has a different governed surface. Talent contracts that specify who can use which actor’s likeness for which purpose. Union rules (WGA, SAG-AFTRA, IATSE) that constrain how generative tooling can touch a frame before residuals and credits trigger. IP chains where every visual element has a provenance trail. Quality controls where a single shot can hold up a release. Insurance and E&O coverage that depend on auditable creative decisions. These are not policy documents on a shared drive. At studio scale, they are runtime constraints.

When the head-of-technology listing says “scalable, secure, multi-show environments,” that is the architectural commitment to making those constraints enforceable in production. Multi-show means the same artist tooling has to serve a kids’ series and an adult feature without leaking assets between them. Secure means model weights, training data, and intermediate outputs cannot drift into the wrong project or the open internet. Scalable means the governance layer cannot be a single ops engineer answering Jira tickets.

This is the move I was waiting for. Software companies have spent two years building agent governance. The studios have spent two years experimenting with generative tools. INKubator is the first time a top-tier studio has committed to building the institutional substrate underneath those experiments.

Why This Is the Cross-Domain Signal

We have argued before that Netflix is already the cleanest live-ops case study for AI fleets, and that design systems have quietly become governance infrastructure, with the same pattern now arriving in the agent era. INKubator extends that arc into a third domain.

The pattern is not “Netflix does AI well.” The pattern is that once a creative discipline starts operating with AI at production frequency, the governance layer migrates from documents to architecture. Live ops did this for streaming reliability. Design systems did this for component consistency. INKubator is doing this for animation IP.

This matters because creative governance has historically been the softest layer in any media company. Style guides, brand books, talent rules, union compliance. All of them lived as PDFs, training decks, and tribal knowledge. None of them were enforced at the file level. With INKubator’s posture, that changes. If artist tooling is built to be multi-show and secure from day one, then permissions, provenance, and approval flows stop being editorial culture and start being platform constraints.

For anyone outside Hollywood reading this, the analog is exact. Whatever your creative function (marketing, design, product, brand, customer education), the moment your team starts generating content with AI at real frequency, your governance layer faces the same migration. PDFs do not survive ten campaigns a week. Slack threads do not survive a brand crisis. Tribal knowledge does not survive the third agent fleet update.

What Netflix Is Almost Certainly Building

The Verge piece is honest about what it does not know. The Lowpass paywall holds most of the deep reporting and Netflix has not released a public architecture diagram. We should not invent details. But we can read the listings as a public-facing architecture spec.

A “multi-show environment” implies tenant isolation between productions, with shared model infrastructure and isolated data. A “secure” environment implies provenance tracking on every generated asset, auditable enough to defend in a guild grievance or an IP dispute. “Artist tooling” implies a UI layer that lets a director, a designer, or a layout artist work inside the same governance fabric without seeing it. “Scalable” implies the governance fabric has to absorb a roadmap of multiple shows in parallel, not a single hero project.

Put together, that is the architectural posture of a platform team, not a creative team. Netflix is hiring creative leadership and platform engineering as one institution. That is the institutional move that makes the rest possible.

The risk is what you do not see. Generative animation at studio scale has cost questions, talent questions, and union questions that no posture can fully resolve from a careers page. The unions in particular will read “GenAI-native” as a fighting word. How Netflix navigates the contract and credit questions will shape the next round of Hollywood labor negotiations. The architecture is necessary. It is not sufficient.

Do This Now

If you lead a creative function (marketing, design, product, brand, content) and your team is past the experimentation phase with AI, treat INKubator as your forcing function this quarter. Ask three questions and write down the answers before the end of the week.

First, what creative governance lives only in policy documents today? Style guides, brand rules, talent likeness rights, partner approvals, regulatory disclosures. List them. Then mark the ones that get checked at file level. The unmarked ones, reviewed only at meeting level, are your migration backlog.

Second, where does your tooling assume one team, one project, one model? If your generative stack cannot cleanly isolate two campaigns or two brands without manual discipline, you do not have a multi-show environment. You have a single-show environment with cross-contamination risk. Decide whether you fix that before the third agent goes live or after the first incident.

Third, who owns the institutional layer? In most companies, the answer today is no one. AI governance is split between IT security, legal, brand, and the team that happens to be using the tool. Netflix’s signal is that someone has to own the platform underneath the creative work. If that owner does not exist on your org chart, you are running on the same posture INKubator just abandoned.

The reason this signal matters is not that Netflix is doing it. It is that Netflix is doing it visibly, with credible leadership, at studio scale, with the governance vocabulary written into the job specs. That sets the reference architecture for every creative organization that operates downstream of Hollywood standards. The companies that read INKubator as a creative story will miss it. The companies that read it as an institutional one will build something durable before the next wave of generative tools forces them to.

This analysis synthesizes Netflix is building an AI animation studio (The Verge / Lowpass by Janko Roettgers, May 2026).

Qwen's Censorship Was a Decal. Subtract One Vector and the Knowledge Comes Back.

Thiago Victorino — Tue, 19 May 2026 00:00:00 GMT

A researcher at Vas-blog took Qwen3.5-9B, the open-weight Chinese model trained with explicit political guardrails, and located the censorship circuit. Not approximated it. Not theorized about it. Located it, isolated it, and turned it off with a single arithmetic operation on the model’s residual stream.

The censored model refuses to discuss Tiananmen, deflects on Tibet, parrots state lines on Taiwan. Subtract one direction vector from the activations at writer layers 11 through 20, and the same model produces detailed historical accounts of the same topics. The factual knowledge was always there. The refusal behavior was a thin overlay sitting on top of it.

This is not a jailbreak in the prompt-engineering sense. It is structural surgery. And it changes what we can claim about behaviorally-tested alignment.

What the Research Actually Mapped

The Vas-blog work, published in May 2026, used activation steering and probing techniques to decompose Qwen3.5’s refusal behavior into three orthogonal direction vectors operating in the residual stream.

The first is d_prc, a content detector that fires when a prompt touches People’s Republic of China-sensitive material. The second is d_refuse, the refusal decision vector that determines whether the model deflects at all. The third is d_style, a register toggle that selects between two trained refusal modes: bland evasion (“I cannot discuss this topic”) or active propaganda (“Taiwan has always been part of China since ancient times”).

These three vectors are linearly separable. You can subtract one without affecting the others. Push d_refuse negative and the model answers. Push d_style in either direction and you select which kind of refusal you get. Push d_prc to zero and the detector never fires in the first place, leaving the rest of the model’s safety machinery intact for genuinely harmful requests.
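The arithmetic behind “subtract one without affecting the others” is a projection. The sketch below shows generic directional ablation and activation steering on toy three-dimensional vectors. It is an assumption that the Vas-blog procedure reduces to these operations; the vectors and values are made up for illustration, and real residual-stream activations have thousands of dimensions.

```python
import math


def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ablate(h, direction):
    """Remove the component of activation h that lies along direction."""
    d = unit(direction)
    coeff = dot(h, d)
    return [x - coeff * y for x, y in zip(h, d)]


def steer(h, direction, alpha):
    """Add alpha units of direction to activation h (negative alpha subtracts)."""
    d = unit(direction)
    return [x + alpha * y for x, y in zip(h, d)]


# Toy orthogonal directions standing in for d_prc, d_refuse, d_style.
d_prc, d_refuse, d_style = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

h = [0.8, 2.5, -0.3]              # hypothetical activation at one token position
h_no_refuse = ablate(h, d_refuse)  # refusal component removed, others untouched
```

Because the directions are orthogonal, ablating `d_refuse` leaves the components along `d_prc` and `d_style` exactly as they were, which is the separability claim in miniature.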

The clean dose-response curves are what should unsettle anyone responsible for model governance. Output snaps between behavioral registers as scalar multiples of these vectors are added. There is no fuzzy boundary. The alignment behavior is a switch, and the switch has a known location.

The Misfire That Gives the Game Away

Here is the detail that exposes what is really happening: the censorship circuit misfires structurally. When the researcher fed Qwen3.5 prompts about Kosovo (a geopolitical topic with zero PRC relevance), the model responded with the “Taiwan is part of China” template.

Think about what that means. The model is not reasoning about whether a topic is politically sensitive. It is pattern-matching against geographic and political vocabulary in a shallow way, then routing matches to a small set of trained denial scripts. The censorship is not grounded in semantic understanding of which topics are sensitive to which authorities. It is a keyword detector wired to a template selector.

This is consistent with what we argued in When Your AI Explains Its Reasoning, It’s Making It Up. The narratives models produce about their own behavior are post-hoc constructions, not faithful reports of internal computation. Qwen’s “answers” on Taiwan are not the model’s beliefs. They are template completions triggered by a detector that does not actually know what Taiwan is.

The over-steering result reinforces this. When the researcher pushed d_refuse past its trained range, the model did not start telling the truth. It snapped into a different trained template: a fabricated denial narrative the training process had baked in as a fallback. The honest answer was reachable only in a narrow operating band of the steering parameter. Outside that band, you get one of several rehearsed lies.

The Governance Implication Most People Will Miss

The obvious read on this research is “Chinese model has weak alignment, news at eleven.” That read is wrong on two counts.

First, the technique is not specific to Qwen. Activation steering and direction-vector isolation apply in principle to any transformer whose activations you can read and modify. Anthropic, OpenAI, and Google have all published interpretability work using similar primitives. There is no architectural reason to assume Western RLHF-trained models are structurally different. They were trained with the same mathematics on the same family of objective functions, just with different policy intents.

Second, and more importantly, this changes what behavioral auditing can prove. When a compliance team certifies a model as “aligned” based on red-team testing, they are measuring whether the refusal overlay fires in the right places. They are not measuring whether the underlying capability has been removed. Vas-blog’s work demonstrates that for at least one production-grade model, those are different things.

If the most heavily incentivized behavioral constraint in Qwen3.5 (political censorship, which the Chinese state cares about enough to mandate) is a thin overlay rather than capability removal, the prior on other RLHF-trained behaviors being similarly structured just got a lot stronger. Safety refusals. Tool-use restrictions. Persona constraints. Brand-voice enforcement. Any behavior trained by reward modeling on a base capability is a candidate for the same architectural pattern.

Why This Breaks the Current Audit Model

Most enterprise AI governance frameworks assume behavioral testing can substitute for mechanistic verification. The reasoning is pragmatic: mechanistic interpretability does not scale, but red-teaming does. So we accept behavioral evidence as a proxy for structural compliance.

The Vas-blog result undermines that substitution at the foundation. Behavioral red-teaming can verify that a model refuses to do X. It cannot verify that the model cannot do X. Those are different claims, and the gap between them is exactly the surface where the Qwen technique operates.

In Anthropic’s 20x Sensitivity Lift, we covered how natural-language autoencoders are starting to make interpretability cheap enough to apply at audit scale. That work was about positioning interpretability as a governance asset, a tool that produces verifiable evidence. The Qwen research is the empirical challenge those tools now have to answer: not just “what is the model doing” but “what is the model capable of when its trained overlays are subtracted.”

A behavioral audit on Qwen3.5 would conclude the model has political guardrails. A mechanistic audit reveals those guardrails are removable in three lines of linear algebra. The two audits produce different governance recommendations. Right now, almost every enterprise is running the first kind.

What Buyers Should Demand Now

If you are procuring or licensing models for regulated deployment, this research justifies adding a new clause to your vendor questionnaire. Ask whether the vendor has performed mechanistic analysis of their safety behaviors. Ask whether they can demonstrate that trained refusals correspond to capability removal rather than capability gating. Ask whether they would commit to disclosing if internal probing revealed otherwise.

Most vendors will not have good answers today. That is itself information. A vendor that has not done this analysis is selling you behavioral compliance, not structural compliance. Price the difference into your risk model.

For internal teams running open-weight models, the implication is more direct. If your safety story depends on RLHF-trained refusals, that story has a known failure mode. Test against it. Run activation steering experiments on your fine-tuned models. See what comes back. The technique is documented and replicable, which means it is also available to adversaries.

Do This Now

Pick one model your organization treats as “safety-trained” and run a single mechanistic probe on its refusal behavior. Not a red-team prompt exercise. An actual activation analysis on a known-refused topic, using the techniques Vas-blog documented. Treat the result as a calibration data point: if behavioral compliance and structural compliance match, your audit model is sound. If they diverge, your audit model has been measuring the wrong thing, and you now have the evidence to redesign it before a regulator or adversary makes that case for you.

This analysis synthesizes What Political Censorship Looks Like Inside an LLM’s Weights (Vas-blog, May 2026).

High Reasoning Cites a Different Web. Your AI Visibility Just Bifurcated.

Thiago Victorino — Tue, 19 May 2026 00:00:00 GMT

Kevin Indig ran 100 prompts across 20 buyer journeys and four verticals through GPT 5.2 at two reasoning settings: minimal and high. The data, published in his May 2026 Growth Memo essay “Reasoning Lift: What Happens to AI Visibility When AI Thinks Harder”, should reframe how marketing and growth teams think about AI search measurement.

The headline result is not that high reasoning cites more sources. That part was expected. The headline result is that high reasoning cites a different web.

Only 25.6% of cited domains overlap between the two modes. Ninety-nine domains appear exclusively when reasoning is turned up. Fan-out internal searches multiply 4.6x. Citation rates climb from 50% to 68%. Average sources per response move from 2.6 to 4.5.

Same model. Same prompts. Two different information markets.

The bifurcation is operational, not academic

Most AI visibility tools today aggregate. They run prompts, collect citations, and report a single number: share of voice, citation rate, presence index. That aggregation made sense when LLM responses were structurally similar. It stops making sense the moment the same model behaves like two different search systems depending on a runtime parameter.

Indig’s data forces the question: which version of GPT 5.2 are your customers actually using? If half your buyers run minimal reasoning queries (fast, cheap, default in many product surfaces) and the other half run high reasoning queries (slower, deeper, increasingly the default for considered purchases), then a single visibility metric is the average of two populations that may not even share the same shortlist of brands.

Averaging across them is not measurement. It is camouflage.

Where the bifurcation hits hardest

The fan-out behavior is the mechanism. Under minimal reasoning, GPT 5.2 averages a handful of internal searches before responding. Under high reasoning, it averages 4.6x as many. The compounding effect shows up most dramatically in the middle and late funnel.

Comparison-stage queries go from 5.5 fan-out searches (minimal) to 24 (high). Selection-stage queries go from 2.6 to 15.4. These are exactly the buyer journey stages where brand citation matters most: when someone is shortlisting vendors, when someone is making a final decision.

The implication: brands optimized for early-funnel awareness queries may look fine in aggregate visibility dashboards while being completely absent from the citation set that high-reasoning users see during evaluation. The decision-stage market is the one that converts. It is also the one most likely to be hidden by averaging.

Why this is not “just another vertical pattern”

Some AI visibility writers will pattern-match this to existing vertical variance findings. That pattern-match is wrong.

Vertical variance says that different industries get cited differently. That is true and we have written about it. Reasoning-mode bifurcation says something stranger: within the same vertical, within the same prompt, within the same model, the source pool can be almost completely different depending on a single runtime knob. The variance is not between markets. It is inside the same market.

This is also not the same problem as platform coupling (which platforms cite which sources) or the fan-out gap (the 27% rank-on-Google gap for fan-out queries we covered in ChatGPT’s fan-out blind spot). Those problems exist between systems. Reasoning bifurcation exists inside one.

What aggregate dashboards are quietly hiding

If you currently report any of the following as single numbers, you are now reporting an average of two populations:

Share of voice across AI assistants
Citation rate per brand
Domain authority score for AI search
Competitor presence in answer text
Topic coverage by query category

None of these are wrong. They are incomplete. The same brand can have a 70% citation rate under minimal reasoning and 30% under high reasoning, or the inverse. With an even mix of queries, both cases report the same 50% average, which tells you nothing actionable.

Indig’s methodology used Semrush’s AI Visibility Toolkit API to run paired prompts at each reasoning setting. That paired design is the discipline the rest of the market has not adopted. Until it does, most dashboards are measuring a phantom average.

The new governance unit

We have argued before that AEO is already commoditized and that the real KPIs for AI search require treating visibility as a measurement discipline rather than a metric. The Indig data extends that argument.

Reasoning mode is now a governance dimension. Treating “AI visibility” as a single object is the equivalent of treating “search visibility” as a single object back when desktop and mobile diverged. The teams that broke out desktop versus mobile metrics in 2014 saw real signal. The teams that kept aggregating saw noise.

Same arc, faster timeline. The teams that segment by reasoning mode in 2026 will see what their competitors miss.

Do this now

Three concrete moves for marketing and growth leaders this quarter:

Re-run your top 20 priority prompts at both reasoning settings and compare cited domains. Not citation counts. Cited domain sets. If your overlap is below 50%, your aggregate dashboard is averaging two markets. You need two dashboards.

Segment your AI visibility KPIs by reasoning intensity, not just by assistant. Reporting ChatGPT versus Perplexity versus Gemini is table stakes. The next layer is reporting low-reasoning versus high-reasoning citation pools within each assistant. The fan-out delta is where the decision-stage signal lives.

Audit your shortlist presence at the selection stage under high reasoning. This is the conversion-adjacent layer. High reasoning averaged 15.4 fan-out searches per selection-stage query in Indig’s data, against 2.6 under minimal. If your competitor surfaces in more of those searches than you do, you are losing the consideration set before the buyer ever talks to sales. Selection-stage high-reasoning shortlist presence is the closest leading indicator of AI-search-driven pipeline that exists today.
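The first move, comparing cited domain sets instead of citation counts, can be sketched as follows. Two assumptions are baked in: overlap is computed as intersection over union, which may not be how Indig calculated his 25.6% figure, and the URLs are invented placeholders. The function names are hypothetical.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Reduce a cited URL to a bare hostname, dropping a leading 'www.'.

    URLs must include a scheme (https://...) for urlparse to find the host.
    """
    host = urlparse(url).hostname or ""
    return host.removeprefix("www.")


def compare_citations(minimal_urls, high_urls):
    """Compare the domain sets cited at the two reasoning settings."""
    minimal = {domain_of(u) for u in minimal_urls}
    high = {domain_of(u) for u in high_urls}
    union = minimal | high
    return {
        "overlap": len(minimal & high) / len(union) if union else 0.0,
        "only_minimal": sorted(minimal - high),
        "only_high": sorted(high - minimal),
    }


# Hypothetical citations for one prompt run at both settings.
report = compare_citations(
    ["https://www.vendor-a.example/pricing", "https://reviews.example/top-10"],
    ["https://vendor-a.example/docs", "https://analyst.example/report",
     "https://forum.example/thread/1"],
)
```

The `only_high` list is the part worth reading closely: those are the domains shaping decision-stage answers that a minimal-reasoning audit never sees.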

The brands that govern these two markets as two markets will compound. The brands that keep averaging will keep wondering why their dashboards say one thing and their pipeline says another.

This analysis synthesizes Reasoning Lift: What Happens to AI Visibility When AI Thinks Harder (Growth Memo by Kevin Indig, May 2026).

Five Vendors, One Architecture: The Agent Control Plane Just Became a Product Category

Thiago Victorino — Mon, 18 May 2026 00:00:00 GMT

Between May 13 and May 17, 2026, five vendors with no shared roadmap published the same architectural claim. Anthropic’s Claude Code engineering team wrote that “the harness matters as much as the model.” Intercom (now rebranded Fin) launched Fin Operator, an AI whose only job is supervising another AI, with a hard proposal gate before any change touches production. Docker shipped Custom MCP Catalogs and Profiles, distributing curated tool bundles via OCI artifacts. Nader Dabit published a six-hook lifecycle spec for deterministic agent control. Altimeter’s Jamin Ball put it on the spreadsheet: “If your product can’t be invoked as a skill from inside that agent surface, you’re functionally invisible.”

Five different categories. One architectural insight. The layer outside the model (hooks, skills, catalogs, proposal gates, marketplace governance) is where agent governance actually happens. We have argued this for months. This week it stopped being our thesis and became a product category being shipped.

The procurement spreadsheet just gained a row called “control plane.” Vendors are competing to own it.

Five Vendors, Five Layers, One Building

The pattern only reads clean when you stack the moves side by side. Each vendor is claiming a different floor of the same building.

Anthropic owns the harness layer. The Claude Code engineering team’s post on large codebases lays out the architecture under the model: pre-tool hooks, post-tool hooks, file system as memory, sub-agent dispatch, skill discovery. Their framing, “the harness matters as much as the model,” is an admission that performance comes from the scaffolding, not the weights. The model is a commodity once the harness is right.

Fin owns the supervisor layer. Brian Donohue, VP Product at Fin, is direct about it: “Right now, we’re taking zero risk on this. Fin cannot make any changes to the system without human approval. Nothing goes live until a human clicks apply.” Fin Operator runs Anthropic’s Claude rather than Fin’s own Apex models, because its job, supervising another agent, looks more like software engineering than customer support. Fin already resolves more than 2 million customer issues per week across 8,000 customers. The Operator beta started with roughly 200 of them. The proposal gate is the product.

Docker owns the distribution layer. Bobby House’s post is the most quietly important of the week: “As MCP adoption grows, the challenge isn’t access to tools, it’s coordination. Teams need a way to standardize what’s trusted and supported without constraining how individuals actually work.” Custom MCP Catalogs ship via OCI artifacts, the same supply chain that already moves container images. Profiles support unlimited named groupings, tool filtering, and cross-team sharing. Docker is putting MCP servers on the same trust rails enterprises already audit.

Nader Dabit owns the determinism layer. His Agent Hooks post is the cleanest engineering statement of the week: “Use prompts for guidance. Use hooks for behavior that should run every time.” He names six lifecycle events (SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, Stop, SessionEnd) and shows why each one is where deterministic policy lives. Prompts are best-effort. Hooks are guaranteed. The hook layer is where compliance becomes code.

Altimeter owns the marketplace layer. Jamin Ball’s framing is investor-grade clarity. The “agent surface” is becoming the new app store, and skills are the new apps. If a SaaS product cannot be invoked as a skill from inside that surface, it loses its place. The control plane is not just an engineering primitive. It is a distribution channel with category dynamics, network effects, and trust signals attached.

Five floors. One building. Each vendor is competing to be the landlord of one floor while everyone agrees the building exists.
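The determinism floor is the easiest of the five to show in code. The sketch below is a generic hook registry, not any vendor’s API: only the six event names come from Dabit’s post, while `HookRegistry`, `call_tool`, the payload shape, and the `.env` rule are hypothetical.

```python
from collections import defaultdict

# Event names as listed in the post; everything else here is illustrative.
EVENTS = ("SessionStart", "UserPromptSubmit", "PreToolUse",
          "PostToolUse", "Stop", "SessionEnd")


class HookDenied(Exception):
    """Raised by a hook to veto the action in progress."""


class HookRegistry:
    def __init__(self):
        self._hooks = defaultdict(list)

    def on(self, event):
        """Decorator that registers a function for one lifecycle event."""
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")

        def register(fn):
            self._hooks[event].append(fn)
            return fn
        return register

    def fire(self, event, payload):
        # Every registered hook runs on every firing; any hook can veto.
        for fn in self._hooks[event]:
            fn(payload)


hooks = HookRegistry()
audit_log = []


@hooks.on("PreToolUse")
def block_secret_reads(payload):
    if payload["tool"] == "read_file" and payload["args"]["path"].endswith(".env"):
        raise HookDenied("reading .env files is not allowed")


@hooks.on("PostToolUse")
def record(payload):
    audit_log.append((payload["tool"], payload["args"]))


def call_tool(tool, args, impl):
    """Run a tool implementation with pre and post hooks around it."""
    payload = {"tool": tool, "args": args}
    hooks.fire("PreToolUse", payload)   # may raise HookDenied before impl runs
    result = impl(**args)
    hooks.fire("PostToolUse", payload)
    return result
```

The model never sees `block_secret_reads`, so there is no prompt that can negotiate with it. That is the practical meaning of “hooks are guaranteed.”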

What the Building Looks Like When You Step Back

Stack the layers and the architecture is legible:

Distribution. How trusted tool bundles reach a workspace. Docker is the early frontrunner; OCI is the existing rail.
Marketplace. How discovery, ranking, and invocation happen inside the agent surface. Altimeter sees the category forming; vendors are not there yet.
Harness. How the model is wrapped, what it can see, what context it carries, what sub-agents it can spawn. Anthropic is leading.
Determinism. How non-negotiable policy is enforced regardless of prompt drift. Nader Dabit articulated the spec; everyone is implementing variants.
Supervisor. How autonomous agents are reviewed by another agent before action lands in the real system. Fin is the first production proof at scale.

These are not five products. They are five surfaces of one runtime, and they need to interoperate. A skill distributed by Docker, invoked from a marketplace, executed inside Anthropic’s harness, gated by a determinism hook, and approved by a Fin-style supervisor is one workflow. Today it is five vendors and zero standards.

That is the part nobody shipped this week.

What Is Still Missing

The control plane is real. The integration story is not. Three deficits stand out, and they are where the next year of work happens.

Cross-vendor policy interop. A hook spec from Nader Dabit is not a portable artifact. A Docker Profile is not readable by Anthropic’s skill loader without translation. A Fin proposal gate does not speak the same audit format as a Vercel sandbox reproduction. Each vendor is building a credible floor. None of them publish how the floor connects to the floor above. Enterprises end up rebuilding the glue, again, for every new tool.

Audit log standard. Every layer emits its own evidence. Hooks fire and log somewhere. Skills execute and log somewhere else. Supervisor approvals land in a third place. Marketplace invocations vanish into vendor analytics. A regulator asking “show me every action this agent fleet took last quarter, who approved it, and what policy caught what” cannot get a coherent answer today. The control plane needs an OpenTelemetry-shaped spec for agent governance. Nobody owns that yet.
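To make the gap concrete, here is one possible shape for a shared audit record, sketched under the assumption that every layer could emit the same fields. No such standard exists today; `GovernanceEvent` and both query helpers are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class GovernanceEvent:
    """One record in a hypothetical cross-layer audit log."""
    timestamp: datetime
    agent_id: str
    layer: str                  # e.g. "hook", "skill", "supervisor", "marketplace"
    action: str
    approved_by: Optional[str]  # human or supervising agent, if any
    policy_id: Optional[str]    # policy that evaluated the action, if any
    outcome: str                # "allowed", "blocked", or "approved"


def actions_in_window(events: List[GovernanceEvent],
                      start: datetime, end: datetime) -> List[GovernanceEvent]:
    """Every event in [start, end), ordered by time, across all layers."""
    in_window = (e for e in events if start <= e.timestamp < end)
    return sorted(in_window, key=lambda e: e.timestamp)


def blocked_by_policy(events: List[GovernanceEvent]) -> Dict[str, List[str]]:
    """Map each policy id to the actions it blocked."""
    caught: Dict[str, List[str]] = {}
    for e in events:
        if e.outcome == "blocked" and e.policy_id:
            caught.setdefault(e.policy_id, []).append(e.action)
    return caught
```

With one record type, the regulator’s question becomes two small queries: what happened in the quarter and who approved it, and which policy caught what. With four log formats in four places, it is an integration project.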

Marketplace trust signal. Altimeter is right that skills are the new apps. App stores are mature precisely because they have trust signals: signed binaries, reviewer queues, takedown processes, version manifests. The skill marketplace has none of that yet. Docker Catalogs are closer than anyone else because OCI artifacts already carry signatures. But a workspace administrator cannot today ask “is this skill from a vendor I have a contract with, and what is its review status?” and get a structured answer.

We covered the early signal of this category in governance as product, three vendors, May 2026 and traced the architectural roots in the symphony control plane spec and agent hooks as a persistence surface. The shape was visible. This week named it.

Why Procurement Should Care This Quarter

When a category goes from emerging to legible inside a week, procurement timelines move. The buyer who was going to write an RFP next year now has comparable line items today. A workspace administrator can ask each vendor in a finalist list five concrete questions:

Which lifecycle hooks do you expose, and which are mandatory versus opt-in?
How are skills distributed into our workspace, and what is the chain of custody?
What is the supervisor mechanism for autonomous changes, and what is its approval audit format?
How do your control plane events surface into our existing SIEM and identity layer?
What is your interop story with the other four floors when we mix vendors?

Those questions are not theoretical anymore. Each vendor in the field above has answered at least one of them publicly this week. The buyer who lets that knowledge slide is buying a model and getting a runtime they did not specify.

We have argued before that the cage pattern for agent fleet governance describes how production teams are already living inside this architecture. The vendor moves this week confirm it. The cage is no longer a metaphor. It is a stack of named layers shipping under named brands.

Do This Now

Pick one agent workflow your organization runs in production. Trace it through the five layers. Where does the trusted tool bundle come from? How is the model wrapped? Which hooks enforce policy? Who or what supervises the autonomous action before it lands? Where does the invocation evidence live for audit?

Write the answer down. If three or more layers resolve to “the prompt handles it” or “the developer knows,” you are running on best-effort, not on a control plane. The vendors who shipped this week are betting that best-effort is the part the category is replacing. The faster you map your stack, the faster you can choose which vendor owns which floor, and where you build the glue yourself.

The control plane is not a thesis anymore. It is a line item. Procurement noticed.

This analysis synthesizes Intercom Now Called Fin (VentureBeat, May 2026), How Claude Code Works in Large Codebases (Anthropic, May 2026), Custom MCP Catalogs (Docker, May 2026), Agent Hooks (Nader Dabit, May 2026), and Clouded Judgement (Altimeter, May 2026).

Victorino Group helps enterprises design and operate the agent control plane that vendor SDKs assume you already have. Let’s talk.

Agents Don't Do Standups: PFF and the Org Inversion

Thiago Victorino — Mon, 18 May 2026 00:00:00 GMT

Software engineering management spent twenty years optimizing for engineer speed. Scrum, sprint planning, daily standups, refinement, retrospectives. Every ceremony descends from one premise: developer hours are the scarce resource, so coordinate them carefully.

At the AI Engineer Conference in May 2026, Mike Spitz, CTO of Pro Football Focus, walked through a three-month experiment that tested what happens when that premise is no longer true. Two engineers, working with agents, against a team of roughly ten engineers working without them. Same codebase, same customers, January through March 2026. Self-reported headline numbers: 25x deploy frequency, 10x output by blended ticket count weighted by code complexity, average customer satisfaction of 8.6 against a pre-AI baseline of roughly 7.0 to 7.5.

PFF is not a research lab. It is a sports data company with 100 million page views a year, nine million fantasy drafts a year, 200 employees and about 20 engineers, serving NFL and NCAA teams alongside a consumer fantasy and betting product. The case study lands at scale, on production code, with paying customers. That is what makes it interesting.

The interesting question is not whether two engineers can replace ten. We have written before about Carlini’s 16-agent compiler experiment and what it implies about agents-as-workforce. The interesting question is what the surrounding organization looks like when you stop optimizing for engineer ergonomics and start optimizing for agent throughput. Spitz’s answer: the ceremonies collapse first.

The Ceremonies Were Solving for a Constraint That Vanished

Scrum was not handed down from a mountain. It is an artifact, designed in the 1990s and codified through the 2000s, to solve a specific coordination problem: how do you get a small number of expensive, slow, human engineers to ship coherent software without stepping on each other? The daily standup answers “what is blocking you today, while you still have eight hours of typing to do?” The sprint plan answers “what is the realistic capacity of these humans over the next two weeks?” The retrospective answers “how do we make these humans slightly less frustrated next sprint?”

Every one of those questions assumes engineer hours are the binding constraint.

PFF dismantled the entire stack. Spitz lists what went: the product manager role, sprint planning, daily standups, sprint refinement, retrospectives. What replaced them is almost embarrassingly small. A half-hour huddle every other day. Engineers flag blockers as they happen, not at 10 AM the next morning. The retrospective signal is replaced by a customer satisfaction survey, because the customers are the ones who know if last week’s work was good. The PM function, the spec writing, the ticket grooming, the status synchronization, all moved into agents.

This is not “we still do Scrum but with AI helping.” It is the explicit deletion of the ceremonies, on the explicit reasoning that the constraint they were designed for is gone.

The Workflow That Replaces It

Spitz described the loop PFF runs now, and it is worth tracing because the topology matters.

A spec comes in. An agent writes a Lightweight Design Document, which it composes by reading every prior LDD in the repository to learn what shape these documents take at PFF. Auto-generated tickets get created from the LDD, preserving non-blocking topology so independent work can proceed in parallel. Pull requests carry status that syncs automatically back to the ticket system. After merge, a QA agent spins up on staging and validates each ticket against its acceptance criteria.
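“Preserving non-blocking topology” can be read as a dependency-graph problem. The sketch below is an illustration of that idea and not PFF’s implementation: the ticket names are invented, and `parallel_waves` is a hypothetical helper built on the standard library’s `graphlib`.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tickets into waves; every ticket in a wave can start at once."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if tickets block each other
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tickets with no unfinished blockers
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so dependents unlock
    return waves


# Hypothetical tickets derived from one design document.
# Each key maps to the set of tickets that must finish first.
tickets = {
    "api-endpoint": {"schema-migration"},
    "ui-form": {"api-endpoint"},
    "docs-update": set(),
    "schema-migration": set(),
}

for number, wave in enumerate(parallel_waves(tickets), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Tickets that share a wave have no dependency on each other, so separate agents can pick them up concurrently. Only the edges in the graph force sequencing.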

The thing to notice is that this is not “agents help engineers code faster.” It is “agents replace the connective tissue between engineers.” The LDD, the tickets, the status updates, the QA passes. All the work that historically required a PM, a tech lead, a scrum master, a QA engineer, and the engineers themselves to keep in sync. Most of that work has nothing to do with writing code. It is coordination overhead, and coordination overhead is exactly the kind of work that agents are good at when the artifacts are structured and the rules are explicit.

The two engineers focus on the parts of the loop that still require taste: system design decisions, code review of architectural choices, and customer-facing judgment calls. Everything in between is delegated.

Code Review Splits, It Does Not Die

The most subtle move in Spitz’s redesign is the split he made on code review. He did not eliminate it. He bisected it.

Style review, naming conventions, “I would have done this differently” bikeshedding, opinion-driven feedback that nobody enjoys giving or receiving: agents handle that. System design review, architectural coherence, the question of whether the change fits the model of the platform: engineers handle that. His framing: “We use agents to do the code reviews engineers hate getting feedback from. Remove the whole emotional aspect out of it.”

This is one of those operational details that sounds small and is not. A meaningful share of engineering culture pain comes from peer review feedback delivered badly. Senior engineers who critique style, junior engineers who feel attacked, the slow erosion of psychological safety when feedback is technically correct but socially expensive. Moving the low-value review surface to an agent does not just save time. It removes a recurring source of organizational friction. The remaining human review is reserved for the conversations that actually require humans, which makes those conversations both more focused and more respected.

The principle generalizes. Anywhere in your engineering process where the work is rule-based but the delivery is emotionally fraught, the agent is the better operator.

Customer Satisfaction Went Up, Not Down

The piece of the case study that most resists the standard skepticism is the customer satisfaction number. Pre-AI baseline at PFF was around 7.0 to 7.5. Over the three-month experiment, average customer satisfaction landed at 8.6.

A common objection to AI-augmented engineering is that velocity comes at the cost of quality, and that customers will notice. PFF’s numbers, self-reported and at one company, point the other way. More frequent deploys mean shorter feedback loops, which means defects get caught faster and feature requests turn around faster. The QA agent running against acceptance criteria on staging catches a class of regressions that previously slipped through. The 25x deploy frequency is not 25x more risk surface; it is 25x more chances to detect and correct.

The caveat to underline: these numbers are disclosed by the CTO at a conference. They are not third-party validated. They reflect one company over three months. Treat them as an existence proof, not a benchmark to copy. The point is not “every team should expect 8.6 CSAT.” The point is that the assumption that AI velocity must trade against quality now has at least one strong counterexample.

The Engineer Profile Shifts

Spitz called out a hiring and retention implication that most discussions of AI-augmented engineering skip. The new setup does not work for every engineer.

Engineers who thrive: the curious ones, willing to dig into unfamiliar systems, comfortable operating without a prescriptive specification handed to them. They treat the agent as a junior team that can take on work, but they take responsibility for the architectural direction. They are intrinsically motivated to figure out what should be built.

Engineers who struggle: the ones who require a fully specified Jira ticket before they begin work, who relied on the PM and the spec doc as the source of direction. The structural support those engineers needed has been removed, and the agents do not replace it. The agents amplify whatever direction the engineer provides, which is wonderful if the engineer has direction and difficult if the engineer was depending on the org to supply it.

This is a real organizational design question for any team contemplating the shift. The engineers who succeed in a post-ceremony environment are a specific profile. Hiring and management practices that filtered for “delivers reliably against tight specs” will produce a roster that does not match the new operating model.

Compounding, Not Linear

One earlier internal data point from PFF deserves attention. Before AI, the same feature set the two-engineer team shipped had been estimated at four months. The two-engineer team shipped in under two months, and one of the engineers was unblocked enough within the first month to start parallel work.

This is not a 2x speedup or a 5x speedup. It is a non-linear gain because the bottleneck shifted. When one engineer’s contribution not only unblocks that engineer but also creates room for the agent fleet to operate on a second workstream, the team capacity compounds. The relevant variable is not “how fast can the engineer type” but “how many independent agent-driven workstreams can the engineer hold open at once.”

The implication for capacity planning is uncomfortable. The estimates your team produces today assume the old constraint. The estimates that match what you can actually ship, given the new tools, are different by a multiple that depends on how thoroughly you have inverted the org.

Do This Now

You do not need to dismantle Scrum next week. You do need to run a single, concrete exercise.

Pick the next two-week sprint. List every ceremony you run: standups, refinement, retrospective, sprint planning, demo. For each ceremony, write down the original problem it was solving. Most of those problems will turn out to be “humans need to coordinate scarce time on scarce keyboards.” Then look at which of those problems still exists in your environment now that agents are part of the team. Some will. Most will not.

That exercise is not a Scrum-killing exercise. It is a constraint-naming exercise. PFF did not delete ceremonies because ceremonies are bad. They deleted ceremonies because the constraints those ceremonies were solving had moved. The exercise is to find out, with honesty, which of your ceremonies are still solving a real problem and which are organizational muscle memory.

The teams that will out-execute the market over the next two years are not the ones that adopt agents. Almost everyone will adopt agents. They are the ones that redesign the surrounding organization to stop optimizing for a constraint that has moved.

This analysis synthesizes Agents Don’t Do Standups (Mike Spitz, PFF, AI Engineer Conference 2026), the PFF consumer and pro-team products, and prior Victorino analysis of the new operating model.

Victorino Group helps engineering leaders redesign org processes when engineer hours stop being the binding constraint. Let’s talk.

When the AI Tests Pass and the Humans Don't: Three Verification Failures, One Pattern

Thiago Victorino — Mon, 18 May 2026 00:00:00 GMT

Daniel sat down to test a marketing site that scored 100% on automated accessibility checks. Within ten minutes, the site had failed him in ways no scanner could see.

The site was built with Lovable, an AI tool that markets itself as producing accessible output by default. Axess Lab’s Hampus Sethfors ran Axe, the industry-standard automated checker. The dashboard reported a perfect score. Then he handed the page to Daniel, a real screen reader user. Daniel tried to open the menu and heard the announcement: “toggle menu.” Nothing else. No state. No “expanded” or “collapsed.” He told Sethfors, “It still says toggle menu, I’m not sure if it works because it doesn’t announce if I have expanded something.”

That is verification failure number one. Three named failures landed in the same week, in three different shapes, with the same root cause underneath. Each one is worth understanding on its own. The pattern they form is worth understanding more.

Lovable: 100% Score, Multiple Critical Failures

The Axess Lab test, published May 13, 2026, is the clearest demonstration of a problem the industry has been talking around for two years. Automated accessibility tooling can only test what it can parse. It checks markup, contrast, focus order, ARIA presence. It cannot check whether a screen reader user can actually accomplish a task on the page.

Lovable’s site passed every automated check. Daniel, using the assistive technology the score was meant to predict, found multiple critical blockers in his first ten minutes. The menu toggle did not announce state. Form fields lacked the context to fill them. The carousel was unusable without sighted navigation.

The diagnostic is not that Lovable’s AI is bad at accessibility. The diagnostic is that “100% accessibility” was never a property the AI could deliver. It was a property of a human judgment that someone replaced with a metric. The score is the contract the team thought they were signing. The user experience is the contract they actually signed.

Bun: 6,755 Commits, Zero Human Reviewers

Six days. 6,755 commits. Zero human reviewers.

That is the count Jiacai Liu pulled from the Bun repository’s commit log between May 8 and May 14, 2026, analyzing the Rust rewrite that the project is conducting at industrial scale. The code is written by Claude. The reviews are written by Claude. The merge decisions are made by Claude. No human is in the loop on any individual commit.

Liu, who has no relationship with Bun and analyzed the data as an outside observer, framed the concern in one sentence: “Code you don’t understand should not run in production.”

The Bun team would presumably argue that the test suite is the verifier, that the metrics will catch regressions, that the scale of generation justifies the absence of human review. That argument is the same shape as Lovable’s accessibility score. Both delegate human judgment to an automated signal. Both assume the signal captures what matters.

The Lovable case demonstrates how that assumption can break. Daniel could not open the menu, and the score said the site was perfect. If a screen reader exposes a category of failure that Axe cannot detect, what category of failure does a test suite fail to detect in 6,755 commits’ worth of new Rust code?

We do not know yet. We will know in six months, when the failure mode arrives in production and someone has to debug a system that no living engineer fully read.

Aviator: Verification Works, When You Did the Spec

The third case complicates the story in a useful way. Ankit Jain at Aviator published an experiment on May 17, 2026, running spec-based review across 6,000 lines of generated code. The team extracted 65 checkable acceptance criteria from the spec. A reviewer agent validated all 65 in roughly six minutes. The result: 60 pass, 4 fail, 1 partial.
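A rough sketch of what the output of such a review pass might look like as data. The criterion ids and helper names are hypothetical and do not come from Aviator’s write-up; only the 60/4/1 split is taken from the text.

```python
from collections import Counter
from typing import Dict, List, Tuple

VALID = ("pass", "fail", "partial")


def summarize(verdicts: List[Tuple[str, str]]) -> Dict[str, int]:
    """Tally (criterion_id, verdict) pairs from one reviewer pass."""
    counts: Counter = Counter()
    for criterion_id, verdict in verdicts:
        if verdict not in VALID:
            raise ValueError(f"{criterion_id}: unknown verdict {verdict!r}")
        counts[verdict] += 1
    return {v: counts[v] for v in VALID}


def needs_attention(verdicts: List[Tuple[str, str]]) -> List[str]:
    """Criteria a human should look at before the change moves forward."""
    return [cid for cid, verdict in verdicts if verdict != "pass"]


# Synthetic verdicts that mirror the reported split; ids are made up.
verdicts = (
    [(f"AC-{i}", "pass") for i in range(1, 61)]
    + [(f"AC-{i}", "fail") for i in range(61, 65)]
    + [("AC-65", "partial")]
)

print(summarize(verdicts))
print(needs_attention(verdicts))
```

The tally only ever covers the criteria that were written down. Anything missing from the list never appears as a failure, which is why the quality of the criteria matters more than the speed of the pass.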

This is verification that scales. Six minutes of automated review, anchored to specification, replaced what would have been hours of human PR review. The four failures were caught. The one partial was flagged. The work could move forward with the confidence the verification provided.

But Jain wrote the sentence that should be on every engineering leader’s wall: “You cannot write tests against requirements you didn’t know to articulate.”

Spec-based verification works only if someone did the cognitive work of writing the spec. That work cannot be delegated to the same model that will generate the code. It is the human judgment that converts intent into checkable claims. It is also the work that most teams skip, because it feels slow and the AI feels fast.

Frederick Vanbrabant modeled this trade-off in a hypothetical Gantt chart published May 15, 2026. A traditional project might look like 70 days of development plus 10 days of scoping. With AI, the development phase collapses to roughly 3 days. The total project does not shrink in proportion, because the scoping and documentation phase expands to about 40 days. The bottleneck moved. It did not disappear.
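Taking the hypothetical figures above at face value, and assuming the two phases run back to back (an assumption of this sketch, not something the chart is quoted as stating), the arithmetic works out as follows.

```python
def project_days(scoping: int, development: int) -> int:
    """Total duration when scoping and development run back to back."""
    return scoping + development


# Figures quoted in the text; the sequential-phase model is an assumption.
OLD_SCOPING, OLD_DEV = 10, 70
NEW_SCOPING, NEW_DEV = 40, 3

traditional = project_days(OLD_SCOPING, OLD_DEV)   # 80 days
with_ai = project_days(NEW_SCOPING, NEW_DEV)       # 43 days

dev_reduction = 1 - NEW_DEV / OLD_DEV              # development shrinks ~96%
total_reduction = 1 - with_ai / traditional        # whole project shrinks ~46%
scoping_share = NEW_SCOPING / with_ai              # scoping is ~93% of the new schedule

print(f"development: -{dev_reduction:.0%}, total: -{total_reduction:.0%}, "
      f"scoping share: {scoping_share:.0%}")
```

Development all but disappears while the total falls by a much smaller fraction, and scoping becomes nearly the whole schedule. That is the sense in which the bottleneck moved.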

The Pattern Underneath

Three cases. Three different verification shapes. One root cause.

Lovable replaced the screen reader user with Axe. Bun replaced the human reviewer with Claude. Vanbrabant’s hypothetical organization tried to replace the spec writer with whoever happened to be holding the prompt. In every case, a category of human judgment was delegated to a system that could not hold it.

The verification debt thesis (covered previously in The AI Verification Debt and The AI Verification Tax) treated the problem as a measurement gap: developers do not trust AI output and do not verify it systematically, so unreviewed code accumulates. These three cases extend the diagnosis. The problem is not only that verification is skipped. The problem is that verification is performed against a proxy that the team mistook for the real thing.

A 100% accessibility score is a proxy for “blind users can use this site.” A passing test suite is a proxy for “the new code does what the old code did.” A reviewer agent’s pass-rate is a proxy for “this code matches what we actually meant to build.” Each proxy has a domain of validity. None of them captures the full property the team needs.

Aviator’s experiment is instructive precisely because it surfaces the limit. The 60 passing criteria do not mean the code is correct. They mean the code satisfies the 65 things the team knew to ask about. Whatever the team did not articulate, the verification cannot catch. The reviewer agent is honest about its scope. The team’s spec work is the substrate that gives the score meaning.

Where the Real Work Moved

If you accept the Vanbrabant model (development time collapses, scoping time expands), the implications for engineering leadership are direct.

The bottleneck for AI-assisted development is no longer typing speed. It is articulation speed. How fast can your team translate “we need a checkout flow that works for screen reader users” into a list of checkable acceptance criteria that a reviewer agent can validate? How fast can you turn “the new Rust port should preserve the behavior of the existing JavaScript implementation” into a property-based test suite that catches the cases your generation model will not catch on its own?
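One way to make “preserve the behavior of the existing implementation” checkable is differential testing: feed both versions the same random inputs and report the first disagreement. The sketch below uses only the standard library. The two `slugify` functions are hypothetical stand-ins for an old and a ported implementation, with a deliberate difference in whitespace handling.

```python
import random
from typing import Callable, Optional


def reference_slugify(text: str) -> str:
    """Stand-in for the existing implementation (hypothetical)."""
    return "-".join(text.lower().split())  # splits on any whitespace


def ported_slugify(text: str) -> str:
    """Stand-in for the generated port (hypothetical); splits on spaces only."""
    return "-".join(word for word in text.lower().split(" ") if word)


def find_divergence(old: Callable[[str], str], new: Callable[[str], str],
                    trials: int = 2000, seed: int = 0) -> Optional[str]:
    """Return the first random input where old and new disagree, else None."""
    rng = random.Random(seed)    # fixed seed keeps any failure reproducible
    alphabet = "abAB \t\n"       # includes the whitespace a happy-path test omits
    for _ in range(trials):
        length = rng.randint(0, 12)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        if old(text) != new(text):
            return text
    return None


counterexample = find_divergence(reference_slugify, ported_slugify)
print(repr(counterexample))
```

The port agrees with the reference on every input made of letters and spaces, so a hand-written example test would likely pass. The divergence only shows up because someone decided tabs and newlines belonged in the input alphabet, which is the articulation work the paragraph above describes.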

That work is human. It is not optional. It is the surface that determines whether the verification you delegate to AI is actually verifying what matters or just generating green dashboards.

The Lovable case names the failure mode in its sharpest form. A real user, with a real assistive technology, found real failures, in real time, that no automated check would ever surface. The site had a 100% score. Daniel could not use it.

If your verification surface looks more like the Axe score than like ten minutes with Daniel, you are accumulating the kind of debt that arrives as a customer support ticket, an accessibility lawsuit, or a production outage that no one on the current team can debug.

Do This Now

Audit your last 30 days of AI-assisted output against one question: for each verification gate that signed off on shipping, what was the underlying property the gate was a proxy for, and how confident are you that the proxy captures it?

If your AI-generated code passes a test suite, name three failure modes the suite does not cover. If your AI-built UI passes an accessibility scanner, run it past one real screen reader user this month. If your AI-generated commits merge automatically, write down which class of regression you are willing to ship to production without human review, and which class you are not, and make that line explicit in the merge automation.
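Making that line explicit can be as small as a path-based rule in the merge automation. This is a sketch under assumed conventions: the glob patterns and the `merge_decision` helper are hypothetical, and a real policy would reflect your own repository layout and risk tolerance.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Hypothetical policy: paths where an automated merge is acceptable.
AUTO_MERGE_OK = ("docs/*", "tests/*", "*.md")
# Paths that always need a human reviewer, whatever else the change touches.
HUMAN_REQUIRED = ("migrations/*", "auth/*", "billing/*")


def _matches(path: str, patterns: Iterable[str]) -> bool:
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def merge_decision(changed_paths: Iterable[str]) -> str:
    """Return 'auto-merge' or 'human-review' for one set of changed files."""
    paths = list(changed_paths)
    if not paths:
        return "human-review"      # nothing to classify: fail closed
    if any(_matches(p, HUMAN_REQUIRED) for p in paths):
        return "human-review"      # one sensitive file is enough
    if all(_matches(p, AUTO_MERGE_OK) for p in paths):
        return "auto-merge"
    return "human-review"          # anything unclassified defaults to a human
```

The design choice worth copying is the default. A change merges automatically only when every file falls inside the allowed set, so new directories start out requiring review until someone decides otherwise.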

Three cases in one week is not a coincidence. It is the industry learning, in public, that the verification surface inherited from the pre-AI era was built for a code-generation rate that no longer applies. The new rate demands a new surface. The teams building that surface now will compound the advantage. The teams trusting the green dashboards will compound the debt.

This analysis synthesizes Lovable’s 100% Accessible Site (Axess Lab, May 2026), My Thoughts on Bun’s Rust Rewrite (Jiacai Liu, May 2026), How to Avoid AI Code Slop (Engineering Leadership Newsletter, May 2026), and I Don’t Think AI Will Make Your Processes Go Faster (Frederick Vanbrabant, May 2026).

Victorino Group helps enterprises design verification gates that protect real users, not green dashboards. Let’s talk.

Explore, Plan, Code, Commit: The Cheapest Place to Fix an Agent's Work Is Before It Writes Code

Thiago Victorino — Mon, 18 May 2026 00:00:00 GMT

Most teams using a coding agent paste a prompt and let the agent type. The model writes code. The engineer reacts to the diff. Every correction at that stage rewrites what was already written. The expensive habit hides in plain sight: nobody planned the work before the agent started spending tokens to produce it.

Anthropic’s canonical workflow for Claude Code has a name for that habit’s opposite. Explore, Plan, Code, Commit. The structure is simple, and the discipline is unfashionable: the agent is not allowed to edit anything until the plan is approved. Plan mode is read-only. The human reviews the plan, not the code. Once the plan is good, the work proceeds.

That single inversion changes the economics of agentic development. The cost of fixing a bad design in a plan is a few sentences of text. The cost of fixing the same bad design in 500 lines of diff is the diff plus the test rerun plus the review loop plus the commit history cleanup. Teams skip plan mode because their organizational muscle still rewards visible typing. They pay for the skip in rework.

The Four Phases, In Order

The workflow has four phases. Each one corresponds to a different posture the agent and the human take toward the work.

Explore. The agent reads files, runs searches, and forms a mental map of where the change belongs. It does not propose actions yet. It is figuring out what it does not know.

Plan. Entered with Shift+Tab in Claude Code, plan mode locks the agent into a read-only posture. The agent can still read, search, and reason. It cannot edit, run shell commands that mutate, or create files. It produces a numbered list of actions it intends to take. The human reads the list and approves, edits, or rejects it.

Code. With the plan approved, the agent steps through the proposed actions and executes them. The plan is now a checklist, not a free-form session. Drift is visible because the plan is visible.

Commit. Before the change is committed, a sub-agent code reviewer inspects the diff. Then the agent generates a commit message in the team’s style. The human approves the commit.

The order matters. Each phase is cheaper than the next to correct. Explore corrections cost a search. Plan corrections cost a sentence. Code corrections cost a diff. Commit corrections cost the diff plus the audit trail. Teams that skip directly from prompt to code are choosing the most expensive correction surface as their first line of defense.

The Canonical Example

The Anthropic tutorial uses a concrete prompt to demonstrate the shape: “I need to add WebP conversion to our image upload pipeline. Figure out where in the pipeline it should happen, whether we need new dependencies, and how to approach it.”

Notice what the prompt does not say. It does not say “write the code.” It does not say “open the file and start.” It says “figure out and propose.” That framing puts the agent in explore-then-plan posture by default. The agent reads the pipeline files, runs a web search to check current best practices, and returns a plan. The human reads the plan and decides whether the proposed dependency, the proposed insertion point, and the proposed handling of edge cases are right. The human is reviewing six lines of plan, not 200 lines of diff.

If the plan is wrong, the conversation continues in plan mode. If the plan is right, the human approves and the agent proceeds. The first line of code is the first line of code that already survived design review.

Three Verification Surfaces

The workflow assumes verification, and Anthropic recommends three surfaces the agent should learn to use.

The first is the test suite as source of truth. The agent runs tests continuously while coding and treats the test result as the authoritative signal of whether the work is done. Passing tests do not prove correctness, but they remove the class of “I think it works” claims that polluted the first year of agentic development.

The second is browser control for UI work. Claude can drive a Chrome tab through MCP, open the running app, and verify that the change behaves as intended before claiming success. The agent does not just compile the change. It checks that the change does what was asked at the surface a user would touch.

The third is the CLAUDE.md file. Recurring fixes, repository conventions, and decisions that the team has already made get written into CLAUDE.md so the agent stops re-discovering them. Treat CLAUDE.md as the agent’s institutional memory. Every time a code reviewer pastes the same correction twice, that correction belongs in CLAUDE.md the third time.

Why Plan Mode Is Architecturally Important

Plan mode is not a UX flourish. It is a containment boundary at the perimeter of the agent’s core loop. We have written before about the while-loop architecture at the heart of Claude Code: the agent is a loop that decides on a tool call, executes it, observes the result, and decides on the next tool call. The loop is fast and capable. The loop is also expensive when it produces work that has to be discarded.

Plan mode wraps the loop. Inside plan mode, the agent’s tool catalog is restricted to read operations. The reasoning is unchanged. The output is a proposal, not a side effect. The human inspects the proposal and either approves it or sends the agent back to think again. The expensive loop only runs against work the human has already endorsed.
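The containment idea can be sketched in a few lines. This is an illustration of the pattern and not Claude Code’s implementation: `Tool`, `Session`, and `PlanModeViolation` are hypothetical names, and the `mutates` flag is an assumed way of marking which tools have side effects.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    name: str
    mutates: bool                 # True if the tool can change files or state
    run: Callable[[str], str]


class PlanModeViolation(Exception):
    """Raised when a mutating tool is called before the plan is approved."""


class Session:
    def __init__(self, tools: List[Tool]) -> None:
        self._tools: Dict[str, Tool] = {t.name: t for t in tools}
        self.plan_mode = True     # read-only until a human approves a plan
        self.plan: List[str] = []

    def available_tools(self) -> List[str]:
        """Tool names the agent may call in the current mode."""
        return sorted(name for name, tool in self._tools.items()
                      if not (self.plan_mode and tool.mutates))

    def call(self, name: str, arg: str) -> str:
        tool = self._tools[name]
        if self.plan_mode and tool.mutates:
            raise PlanModeViolation(f"{name} is not available in plan mode")
        return tool.run(arg)

    def approve_plan(self, steps: List[str]) -> None:
        """Record the approved plan and unlock mutating tools."""
        self.plan = list(steps)
        self.plan_mode = False
```

The restriction lives in the tool dispatch, not in the prompt, so the agent’s reasoning is unchanged while its ability to cause side effects is switched off until `approve_plan` is called.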

This is the same containment instinct that drives the agent harness and the broader harness primitives we have argued for: trust is moved from per-action to per-environment, and the environment now includes a phase where the agent reasons without consequences. The savings are structural. You are not catching bad work after it is written. You are catching it before it is written.

Where Teams Stall

The most common failure is not technical. It is organizational. Engineers feel productive when they see code being typed. Plan mode does not produce typing. It produces deliberation. To a culture that rewards visible motion, deliberation looks like the agent is stuck.

The fix is to measure rework instead of throughput. Count the number of times a change was committed, reverted, and re-committed in the same week. Count the number of PRs that required a second round of substantive changes after the first review. Both numbers fall when plan mode is enforced. Both numbers stay high when teams skip plan mode and react to diffs.
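Both counts are easy to compute once the data is in one place. The record shape below is an assumption; in practice the fields would be filled from your version control and code review systems.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PullRequest:
    number: int
    review_rounds: int   # rounds that requested substantive changes
    reverted: bool       # merged, then reverted within the same week


def rework_metrics(prs: List[PullRequest]) -> Dict[str, float]:
    """Share of PRs that were reverted or needed a second review round."""
    total = len(prs)
    if total == 0:
        return {"prs": 0, "revert_rate": 0.0, "second_round_rate": 0.0}
    reverted = sum(1 for pr in prs if pr.reverted)
    second_round = sum(1 for pr in prs if pr.review_rounds >= 2)
    return {
        "prs": total,
        "revert_rate": reverted / total,
        "second_round_rate": second_round / total,
    }
```

Run it over the month before plan mode was enforced and the weeks after, and compare the two rates rather than the raw PR count.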

The second failure is treating plan mode as optional friction. It is optional in the same way wearing a seatbelt is optional. The cost is small. The expected loss in the small fraction of cases where the plan was wrong is enormous. Teams that learn this learn it after the first time an agent confidently refactors the wrong file at production scale.

Do This Now

Pick one repository this week. Establish the rule: every change made with a coding agent must go through plan mode. The human approves the plan before any file is edited. The plan lives in the PR description so the reviewer can see what was proposed and what shipped.

Add a CLAUDE.md to the repo if it does not have one. Put three things in it: the test command, the lint command, and the three corrections your team has had to make twice in the last month. Update it every Friday.

Spawn a sub-agent code reviewer for the commit step. Pre-commit, the reviewer reads the diff against the plan and flags drift. The human still owns the merge. The reviewer is a cheap second pair of eyes that runs every time, not the times when someone remembers to ask.
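The cheapest form of drift detection is a file-level comparison between what the plan said would change and what the diff actually touched. The sketch below assumes the plan lists file paths explicitly; the function name and the example paths are hypothetical, and a real reviewer sub-agent would go further than a set difference.

```python
def plan_drift(planned_files: set[str], changed_files: set[str]) -> dict[str, list[str]]:
    """Compare the files named in the plan with the files changed in the diff."""
    return {
        # Files the diff touched that the plan never mentioned.
        "unplanned": sorted(changed_files - planned_files),
        # Files the plan promised to change that the diff left alone.
        "missing": sorted(planned_files - changed_files),
    }


drift = plan_drift(
    planned_files={"auth/session.py", "auth/tests/test_session.py"},
    changed_files={"auth/session.py", "billing/invoice.py"},
)
print(drift)
```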

Two weeks in, count the rework. Compare to the prior month. The number that goes down is the number that decides whether your team can scale agent-assisted development without scaling the cost of fixing what the agent already wrote.

This analysis synthesizes The Explore → Plan → Code → Commit workflow in Claude Code (Anthropic, May 2026), the Claude Code overview, and prior Victorino analysis of the while-loop architecture.

Victorino Group helps engineering teams adopt agent-native workflows without losing review discipline. Let’s talk.

Figma Just Quantified the Design AI Adoption Tipping Point

Thiago Victorino — Mon, 18 May 2026 00:00:00 GMT

Sixty percent of Figma’s $100K+ ARR customers used Figma Make weekly in Q1 2026. In the prior quarter that number was fifty. A ten-point jump in a single quarter, inside the highest-spending tier, on a product feature that did not exist eighteen months ago.

That is the number to anchor on. Not the $333.4M revenue. Not the 46% YoY growth. The weekly-active rate among the customers who pay enterprise prices is the signal that closes a debate most design leaders have been postponing.

Why Sixty Percent Matters

Weekly active usage is a tougher metric than monthly active. Weekly means the tool is in the workflow, not the toolbox. When 60% of a vendor’s largest customers touch an AI feature every week, the feature has crossed from “experiment” to “standard practice” inside those organizations.

The trajectory matters more than the absolute number. A move from 50% to 60% in a single quarter is a 20% relative increase, and the curve is still climbing. By the time leadership decides to “look at AI in design next year,” the practice will already be embedded in the teams reporting to them. The question of whether to govern AI design output will be moot. Governance will be retroactive.

This is the empirical threshold. Not a thought experiment. Not a thought leader’s prediction. Figma’s own customer base, on a public earnings call, weekly.

The Pricing Signal Inside the Pricing Signal

Figma reported something else worth reading carefully. AI-credit-purchasing pro teams averaged 3x the ARR of teams that did not buy credits. Seventy-five percent of org and enterprise users continued buying credits after hitting their limits. Ninety-five percent stayed active after that point.

Pull those numbers together. The teams that buy AI credits are not just spending more on AI. They spend more on Figma, period. AI consumption correlates with overall account expansion. The customers who lean into the AI features become the customers who anchor Figma’s revenue.

This validates a pattern showing up across the category. Fin Operator priced AI by outcome. Braze restructured its cost base around AI compute. Figma is treating AI as a separate consumption tier that drives the rest of the relationship. AI is not a feature added to the subscription. It is a tier that reshapes the subscription.

For design leaders, the implication is operational. If your design team is on a Figma org or enterprise contract, you are inside a pricing model that rewards AI usage and penalizes restraint. Restraint will not be cost-neutral. It will be the more expensive choice in twelve months.

What MCP Growth Tells You About the Direction

MCP server usage grew 5x quarter over quarter. That is agent traffic: coding agents, design agents, and IDE-integrated workflows pulling design context through the Model Context Protocol (MCP) server Figma opened in March.

We argued in March that Figma’s MCP beta turned design systems into runtime constraint layers for autonomous software. The 5x growth confirms the direction is real, not theoretical. Agents are reaching into design files at production volume. The constraint layer is now load-bearing.

If your design system has 60% coverage of your product’s UI patterns, agents will improvise the rest. Improvisation at 5x quarterly growth becomes a governance problem fast.

What the Numbers Do Not Say

Figma’s earnings tell you adoption is happening. They do not tell you whether quality is keeping up. A weekly active user can be producing usable design output or struggling against unconstrained generation. A 3x ARR uplift from AI-credit buyers can reflect productive expansion or runaway consumption. The earnings deck rewards both equally.

This is where the empirical threshold cuts both ways. Adoption is no longer the question. Quality and governance are. Customers who reach 60% weekly active without a governance posture have not solved the problem. They have changed which problem they have.

The companies that will compound on this curve are the ones treating the design system as the control surface for AI output. The companies that will compound on cleanup costs are the ones treating it as decoration.

What to Do This Quarter

For design leaders inside organizations on Figma org or enterprise contracts, three actions are no longer optional.

Audit your design system coverage against your product’s actual UI patterns. Coverage below 70% means agents will invent components. That invention will not be reviewed in advance. It will be reviewed in production.

Define which AI-generated design artifacts require human approval before they reach engineering. The default should be “all of them” until you have empirical data to relax the rule. Designers are governance engineers now, whether the title is updated or not.

Treat AI credit consumption as a budget line with named ownership. The pricing model is consumption-based. Without ownership, consumption becomes ambient cost. With ownership, it becomes a managed input.
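The coverage audit in the first action reduces to a set comparison. This is a sketch under the assumption that you can enumerate both the components in the design system and the UI patterns the product actually uses; the pattern names below are invented, and the 70% threshold is the one stated above.

```python
def coverage(system_components: set[str], product_patterns: set[str]) -> float:
    """Fraction of the product's UI patterns that the design system covers."""
    if not product_patterns:
        return 1.0
    return len(product_patterns & system_components) / len(product_patterns)


# Hypothetical inventories for illustration only.
system = {"button", "modal", "table", "toast"}
product = {"button", "modal", "table", "date-picker", "stepper"}

ratio = coverage(system, product)
uncovered = sorted(product - system)  # patterns an agent would have to invent
needs_attention = ratio < 0.70

print(ratio, uncovered, needs_attention)
```

The `uncovered` list is the more useful output: it names the components agents will improvise.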

The broader product-as-workflow shift is showing up first in design tools because design is where AI output is most visible. The same pricing logic will reach every other category your team uses. Figma’s earnings are the early read.

The Threshold Closes the Timing Question

A design leader debating whether AI governance is “next year’s problem” can now answer the question with public data, not opinion. Sixty percent of Figma’s $100K+ ARR customers use Figma Make weekly. MCP server usage is five times what it was last quarter. Pro teams that buy AI credits average 3x the ARR of teams that do not.

If your design organization is on the consumer side of that curve and does not have a governance posture, the curve is governing you. The empirical threshold is here. Acting on it is the work.

This analysis draws on Figma Stock Jumps After Q1 Revenue Surges 46% (SiliconANGLE, May 2026), reporting on Figma’s Q1 2026 earnings.

Victorino Group helps design and product leaders operationalize AI design governance before weekly-active usage forces the conversation. Let’s talk.

Flat Beats Hierarchy: Peer-to-Peer Agents and the Information Lost in Orchestrator Setups

Thiago Victorino — Mon, 18 May 2026 00:00:00 GMT

Most multi-agent systems shipping today share a topology so familiar that nobody questions it. A parent agent decomposes the task. Sub-agents execute the pieces. Results bubble up. The parent assembles the answer. This is the orchestrator-worker pattern, and it has been the default since the first published harness designs.

The pattern is borrowed wholesale from corporate hierarchies. It inherits the same failure mode. In any hierarchy, the most accurate information about what is actually happening lives at the worker level. Hierarchies then bury that information under summarization, translation, and the parent’s preexisting model of the problem. By the time a finding reaches the decision point, it has been compressed into whatever shape fits the parent’s expectations.

IndyDevDan’s Pi to Pi demo, released this week, shows the alternative. It runs on a Unix socket and a Bun server. It exposes four tools: list agents, send command, send prompt, await response. There is no orchestrator role. Every agent can ping every other agent. The repo is public. It is short. You can read it.
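To make the shape of a four-tool peer protocol concrete, here is an in-memory sketch. It is not the Pi to Pi implementation: the `PeerBus` class, the method signatures, and the message tuple are assumptions for illustration, and a queue stands in for the Unix socket.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    inbox: queue.Queue = field(default_factory=queue.Queue)


class PeerBus:
    """In-memory stand-in for the socket: every agent can reach every other."""

    def __init__(self) -> None:
        self._agents: dict[str, Agent] = {}

    def register(self, name: str) -> None:
        self._agents[name] = Agent(name)

    # Tool 1: discover peers. There is no orchestrator to ask.
    def list_agents(self) -> list[str]:
        return sorted(self._agents)

    # Tool 2: send a command to a peer.
    def send_command(self, sender: str, to: str, command: str) -> None:
        self._agents[to].inbox.put((sender, "command", command))

    # Tool 3: send a free-form prompt to a peer.
    def send_prompt(self, sender: str, to: str, prompt: str) -> None:
        self._agents[to].inbox.put((sender, "prompt", prompt))

    # Tool 4: block until a message arrives; raises queue.Empty on timeout.
    def await_response(self, name: str, timeout: float = 1.0) -> tuple[str, str, str]:
        return self._agents[name].inbox.get(timeout=timeout)


bus = PeerBus()
bus.register("dev")
bus.register("prod")
bus.send_prompt("dev", "prod", "describe the lockout state")
msg = bus.await_response("prod")
print(msg)
```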

What the demo proves is not that peer-to-peer is more elegant. It is that peer-to-peer changes which information reaches the decision.

Demo 1: The Production Agent Sees Things the Dev Agent Cannot

The first demo runs two agents on different machines. A production agent on a Mac Mini has access to live data. A dev agent on a MacBook Pro has access to the codebase and the staging environment. They are reproducing a Pro-tier user lockout bug.

In an orchestrator-worker setup, the dev agent would ask the parent for the production state. The parent would ask the production agent. The production agent would return whatever the parent asked for, in whatever shape the parent expected. PII would either flow through unredacted, or the parent would need to know enough about the production schema to ask the right redacted questions.

In Pi-to-Pi, the dev agent pings the production agent directly. The production agent enforces PII redaction at its own boundary. The dev agent never sees raw production data, but it sees the production agent’s own description of the bug state, in the production agent’s own terms. The bug surfaced in minutes.

The architectural property here is worth naming. The boundary of trust is the agent itself, not a control plane between agents. The production agent is the redactor because the production agent is the one that knows what redaction means in its context. A central orchestrator that tries to enforce redaction across heterogeneous data domains has to know all those domains. The peer model lets each agent enforce its own boundary.
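A sketch of what enforcing the boundary at the agent can look like, assuming the only PII in play is email addresses. The `ProductionAgent` class, the record fields, and the regex are hypothetical; a real redactor would cover far more than one pattern.

```python
import re

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    return EMAIL.sub("[REDACTED_EMAIL]", text)


class ProductionAgent:
    """Owns the production data and is the only place redaction happens."""

    def __init__(self, records: dict[str, dict[str, str]]) -> None:
        self._records = records

    def describe_bug_state(self, user_id: str) -> str:
        record = self._records[user_id]
        raw = (
            f"user {record['email']} on plan {record['plan']} "
            f"has lock_reason={record['lock_reason']}"
        )
        # Nothing leaves this agent without passing through its own redactor.
        return redact(raw)


prod = ProductionAgent(
    {"u1": {"email": "ana@example.com", "plan": "pro", "lock_reason": "billing_sync_failed"}}
)
reply = prod.describe_bug_state("u1")
print(reply)
```

The peer on the other side receives the bug state in the production agent's own terms, with the redaction already applied.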

Demo 2: Ten Corrections That Would Have Shipped Silently

The second demo is the one that should change how engineering teams think about agent context windows.

Two agents work in parallel. One holds the full E2B documentation in context. The other holds the full exe.dev documentation in context. Their job is to build a feature-parity skill for a new sandbox provider, using the existing E2B skill as the template.

The peer-to-peer exchange runs ten messages. In those ten messages, the exe.dev agent corrects the E2B agent ten times. Ten factual errors that the originating agent would have baked into the new skill silently. The correction loop only fired because there was a second agent with the authority and the context to push back. Not a worker reporting findings up a chain. A peer disagreeing with a peer.

The information that mattered most lived in the second context window. A hierarchical setup with a single planner agent at the top would not have surfaced any of those ten corrections. The planner would have written the skill from a summary of the docs, not from the docs themselves. The factual errors would have lived in the summary, not in the planner’s awareness.

Two Million Tokens, Two Windows, One Better Result

The other quietly important finding from the second demo: total context budget across two agents was roughly two million tokens. The same budget in a single agent would have produced a worse result. Not because of the model’s raw capacity, but because attention degrades as context grows. Two focused windows of one million tokens each outperform one diluted window of two million.

This is the operational case for peer-to-peer. Specialization of context is the actual asset. The four-tool A2A protocol is the mechanism. The topology is what lets specialized contexts negotiate directly instead of forcing their findings through a translation layer.

Anthropic’s own multi-agent research system, published in 2025, leaned hierarchical. Their March 2026 publication of the long-running app harness used a strict planner-generator-evaluator chain. By May 2026, their AI Engineer Conference talk reported simplifying that hierarchy. Fewer roles. More peer behavior emerging at the model level. The vendor that arguably invented the modern orchestrator pattern is, on stage, recommending less of it.

The Governance Implications Are Not What You Expect

Flat topology is not a free win. It has consequences your platform team has to plan for.

Boundary enforcement moves to the agent. In hierarchical setups, the orchestrator is a natural choke point for policy. Want to enforce data residency? Put the rule in the orchestrator. Want to redact PII? Same place. In peer-to-peer, every agent that owns a sensitive domain has to enforce its own boundary. This is harder to design and easier to scale. The production agent in Demo 1 is the right place to redact production data. The orchestrator never was.

Audit shifts from a single trace to a graph. A hierarchical run produces a linear audit log: parent called child, child returned, parent called next child. A peer-to-peer run produces a directed graph. Your observability stack has to handle that. If you cannot reconstruct who told whom what, and in what order, you cannot debug, and you cannot satisfy a compliance review.
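A minimal sketch of the data you would need to keep, assuming every inter-agent message is logged with a global sequence number. The `Message` record and its field names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    seq: int       # global ordering across all agents
    sender: str
    receiver: str
    summary: str


def build_graph(messages: list[Message]) -> dict[str, list[tuple[int, str]]]:
    """Directed edges: sender -> [(seq, receiver), ...] in send order."""
    graph: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for m in sorted(messages, key=lambda m: m.seq):
        graph[m.sender].append((m.seq, m.receiver))
    return dict(graph)


def timeline(messages: list[Message]) -> list[str]:
    """Who told whom what, in order."""
    ordered = sorted(messages, key=lambda m: m.seq)
    return [f"{m.seq}: {m.sender} -> {m.receiver}: {m.summary}" for m in ordered]


log = [
    Message(2, "prod", "dev", "lock_reason=billing_sync_failed"),
    Message(1, "dev", "prod", "describe lockout state"),
    Message(3, "dev", "reviewer", "proposed fix"),
]
edges = build_graph(log)
history = timeline(log)
print(edges)
print(history)
```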

Loop detection becomes the platform’s problem. Two peers can ping each other indefinitely. Hierarchies have a natural termination signal: the root agent returns. Peer-to-peer needs explicit budgets, deadlines, and cycle detection. Pi-to-Pi’s await_response is synchronous; it forces serialization but does not bound the conversation length. A production deployment has to add those bounds.
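Those bounds can be sketched as a small guard that every send passes through. This is an illustration, not part of Pi to Pi: the `ConversationBudget` class is hypothetical, and treating an exact repeated message as a cycle is a crude stand-in for real cycle detection.

```python
import time
from typing import Callable


class ConversationBudget:
    """Bounds a peer conversation by turn count, wall-clock time, and repeats."""

    def __init__(
        self,
        max_turns: int,
        max_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_turns = max_turns
        self.turns = 0
        self._clock = clock
        self._deadline = clock() + max_seconds
        self._seen: set[tuple[str, str, str]] = set()

    def allow(self, sender: str, receiver: str, prompt: str) -> bool:
        if self.turns >= self.max_turns:
            return False  # turn budget exhausted
        if self._clock() > self._deadline:
            return False  # deadline passed
        key = (sender, receiver, prompt)
        if key in self._seen:
            return False  # identical message repeated: likely a cycle
        self._seen.add(key)
        self.turns += 1
        return True


budget = ConversationBudget(max_turns=3, max_seconds=60.0)
results = [
    budget.allow("dev", "prod", "state?"),
    budget.allow("prod", "dev", "locked"),
    budget.allow("dev", "prod", "state?"),   # repeat of the first message
    budget.allow("dev", "prod", "why?"),
    budget.allow("prod", "dev", "billing"),  # fourth accepted turn would exceed the budget
]
print(results)
```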

The skill of writing for peers replaces the skill of writing for orchestrators. Worker agents are designed to satisfy a parent. Peer agents have to challenge each other and accept being challenged. Prompt engineering for peer behavior is a different discipline from prompt engineering for hierarchical execution. The ten corrections in Demo 2 happened because both agents had been instructed to push back on factual claims, not just answer questions.

The Topology Decision Should Be Explicit

The teams that are still treating orchestrator-worker as the only option are mostly doing so by inertia, not by analysis. The question is no longer whether peer-to-peer works. The Pi-to-Pi demo proved that with four tools and a Unix socket. The question is which parts of your agent system benefit from flat topology, which parts need hierarchical control, and how those two regimes hand off to each other.

We have written elsewhere about multi-agent kernels, about orchestration in production, and about team-shaped operating models. What this week’s demo adds is a clean, runnable proof that the topology choice is a real choice, with measurable effects on information flow, governance boundaries, and context utilization. The harness layer is where you make this choice. It matters which one you pick.

Do This Now

Block 45 minutes with your engineering lead and one senior agent designer. Pull a diagram of your most complex multi-agent flow. Ask three questions.

First: where in this flow does information die because a summary replaces a finding? If you cannot identify any such point, you have not looked hard enough. Every hierarchical flow has one. Mark it.

Second: of the boundaries you enforce centrally today (PII, data residency, rate limits, schema validation), which ones are enforced by an agent that does not own the domain being protected? Move those boundaries to the agents that do own the domain. That is the peer-to-peer pattern even inside a still-hierarchical flow.

Third: pick one node in the diagram where two agents could productively disagree, and currently cannot. Give them the tools to ping each other. Watch what comes out. The Pi-to-Pi repo gives you a four-tool protocol you can copy. The shift you are looking for is not in the tooling. It is in what reaches the decision once the agents can talk.

The orchestrator-worker pattern is not wrong. It is, however, not the only choice. Treating it as the default is how teams quietly lose access to the information their agents already have.

This analysis synthesizes Pi to Pi: Two-Way Agent Orchestration (IndyDevDan, May 2026), the pi-vs-claude-code repo (disler, May 2026), and Build Agents That Run for Hours (Anthropic, AI Engineer Conference 2026).

Victorino Group helps engineering teams choose between hierarchical and flat agent topologies and instrument both for production. Let’s talk.

Marketing and Finance Just Got Their First Real Agent Governance Problems

Thiago Victorino — Mon, 18 May 2026 00:00:00 GMT

For two years the governance conversation lived inside engineering. Least privilege, observability, segregation of duties, audit logs, escalation protocols. These are the disciplines we built to keep code from hurting the business. Then last week three reports landed within a few days of each other, and each one carried the same architecture wearing a different uniform.

A marketing platform handed Claude and ChatGPT access to ad accounts at a granularity that violates least privilege. A finance team automated 90% of a reconciliation workflow while preserving the preparer/reviewer split as the audit gate. Google extended structured experimentation into Performance Max and AI Max, making controlled tests the only observability lever advertisers retain over the black box.

Three domains. Three different vocabularies. One pattern.

Meta’s AI Connectors: least privilege fails at the agency boundary

Jon Loomer ran the test that nobody on the vendor side wanted to publish. He connected Claude to Meta through the new AI Connector, the feature Meta launched to let large language models pull ad performance data and answer natural language questions about campaigns. The agency use case is obvious. Audit a client’s account, summarize performance, draft a recommendation.

The result was not. When Loomer authorized the connection, Meta offered him exactly two options: grant access to a specific business, or grant access to all current and future businesses tied to his account. There is no ad account selector. There is no client picker. There is no granular permission scope.

In Loomer’s own words: “You cannot choose which ad accounts Claude can access. And that can result in exposure to risk that you or your clients do not want.”

Translate that to engineering terms. An agency manages 40 clients across 12 businesses. The agency owner connects Claude once to analyze their own brand. By design, Claude now has read access to every client account under every business the owner can see. The permission model has two states: nothing, or everything. There is no middle.

This is the classic least-privilege failure. An identity should receive the minimum access required for the task. The connector ships with maximum access as the only option. Any engineer auditing an IAM policy that read “grant all current and future S3 buckets under this account” would block it at code review. Meta shipped the equivalent for ad spend and customer audience data.
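The review an engineer would apply to that grant can be written down in a few lines. The sketch assumes a hypothetical `Grant` record; it does not model Meta's actual permission API, only the two-state shape described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    agent: str
    accounts: frozenset[str]             # explicitly listed account IDs
    all_current_and_future: bool = False  # the wildcard option


def violations(grant: Grant, needed: set[str]) -> list[str]:
    """Return least-privilege problems with a grant, given the accounts the task needs."""
    problems = []
    if grant.all_current_and_future:
        problems.append("wildcard grant covers all current and future accounts")
    extra = grant.accounts - needed
    if extra:
        problems.append(f"unneeded accounts: {sorted(extra)}")
    return problems


task_needs = {"acct-own-brand"}
narrow = Grant("claude", frozenset({"acct-own-brand"}))
broad = Grant("claude", frozenset(), all_current_and_future=True)

print(violations(narrow, task_needs))
print(violations(broad, task_needs))
```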

The interesting part is who has to solve it. The agency cannot patch Meta’s permission model. They can refuse to use the connector, accept the exposure, or build a separate Meta business unit per client just to scope the agent. None of those are governance solutions. They are workarounds for a vendor that shipped the capability without the controls.

OnlyCFO: segregation of duties survives 90% automation

The same week, an anonymous finance leader writing as OnlyCFO published a detailed account of agent deployment for month-end close. Prepaid expense reconciliation that used to take two hours collapsed to about five minutes. A full day shaved off the close timeline. Roughly 90% of the workflow now runs through Claude with custom skills, each skill documented in around 200 lines of explicit instructions.

The number that matters is not 90%. It is 10%.

OnlyCFO did not eliminate the reviewer. The agent prepares the reconciliation. A human reviewer signs off. The preparer/reviewer split, the oldest segregation-of-duties pattern in accounting, survived intact. The agent did not replace the reviewer. It replaced the preparer’s tedium, then handed the artifact to the reviewer at the same checkpoint that existed before.
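The gate itself is simple enough to sketch. This is not OnlyCFO's code: the `Reconciliation` record, the identifiers, and the version string are hypothetical, and the only rules encoded are the two stated above (a reviewer must approve before posting, and the preparer cannot be the reviewer).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reconciliation:
    prepared_by: str                 # the agent or person who prepared it
    skill_version: str               # which version of the instructions was used
    approved_by: Optional[str] = None
    posted: bool = False


def approve(rec: Reconciliation, reviewer: str) -> None:
    if reviewer == rec.prepared_by:
        raise PermissionError("preparer cannot review their own work")
    rec.approved_by = reviewer


def post(rec: Reconciliation) -> None:
    if rec.approved_by is None:
        raise PermissionError("reviewer approval required before posting")
    rec.posted = True


rec = Reconciliation(prepared_by="agent:prepaid-recon", skill_version="v12")

try:
    post(rec)          # attempted before review
    blocked = False
except PermissionError:
    blocked = True

approve(rec, reviewer="controller")
post(rec)
print(blocked, rec.posted)
```

Recording `skill_version` alongside the approval is what lets someone explain, months later, which instructions produced the entry.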

Read that again. A finance team running on AI agents reproduced the audit gate without naming it. They documented each skill in 200 lines because they had to be able to explain to an auditor, six months from now, what the agent was instructed to do on the day it generated the journal entry. That is not productivity engineering. That is procedure documentation, the kind that survives a SOX review.

Compare this to the Meta connector. OnlyCFO’s setup has explicit scope per skill (one skill per workflow), explicit human checkpoints (reviewer approval before posting), and explicit instructions (the 200 lines, version-controlled and reviewable). Meta’s connector has none of these. Same week, same agent technology, opposite governance posture.

Google Ads v24.1: experiments as the only observability surface

Performance Max and AI Max are black boxes by design. You give Google a budget, a goal, and creative assets. Google decides which audiences see what, when, on which property, with which creative variation. The advertiser surrenders the levers and trusts the model.

The May 15 release notes from ALM Corp document what Google did next. Version 24.1 extends structured experiment support into AI Max, Video, Demand Gen, and Performance Max campaigns. Three workflows: system-managed experiments, intra-campaign experiments, and asset optimization experiments. Recommended duration is four to six weeks per experiment to reach statistical significance.

The framing in the ALM Corp piece is sharper than Google’s own marketing: “Automation without measurement creates blind spots. Automation with experiments creates a usable decision framework.”

Translate again. Performance Max removed the levers. The experiment system is Google’s admission that automation without controlled measurement is unaccountable automation. The advertiser does not get the levers back. They get a structured way to ask the system “what if I held one variable constant and let you optimize the rest?” That is observability for systems you cannot inspect directly. Run a holdout. Compare. Decide.
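As a worked illustration of "run a holdout, compare, decide," the sketch below computes relative lift and a standard two-proportion z statistic. The conversion counts are invented, and the z-test is a textbook technique swapped in for illustration; the source does not describe how Google evaluates its experiments.

```python
import math


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


# Hypothetical holdout (control) and automated arm (treatment).
control_conv, control_n = 200, 10_000
treat_conv, treat_n = 230, 10_000

control_rate = control_conv / control_n
treat_rate = treat_conv / treat_n
lift = (treat_rate - control_rate) / control_rate
z = two_proportion_z(control_conv, control_n, treat_conv, treat_n)

# 1.96 is the usual two-sided 5% threshold.
significant = abs(z) >= 1.96
print(f"lift={lift:.1%} z={z:.2f} significant={significant}")
```

With these made-up numbers a 15% lift does not clear the threshold, which is one reason a multi-week experiment duration is recommended.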

Engineering teams built canary releases and feature flags for the same reason. When you cannot reason about the system’s internal state, you control the inputs and measure the outputs. Google did not call it observability. The accounting team did not call it segregation of duties. Meta did not call it IAM scoping. The vocabulary is different. The architecture is identical.

The honest reading

The convenient story is that marketing and finance teams are finally catching up to engineering. That reading is wrong, and it is patronizing.

What is actually happening: every domain that deploys autonomous systems hits the same handful of architectural problems. Who can the system act on behalf of? How do you check its work? How do you measure outputs when you cannot inspect the process? These questions do not belong to engineering. Engineering encountered them first because engineering deployed agents first. The questions belong to anyone running an autonomous workflow.

The risk is treating each domain as a fresh problem. Build a marketing governance framework. Build a finance governance framework. Build an ads governance framework. Three separate working groups, three policies, three escalation models, no transfer of learning. Most enterprises will do exactly this, because the organizational charts route by function, not by problem.

The alternative is to recognize the parallel structure and build once. The control questions transfer. The vocabulary needs translation. The architecture does not.

Do this now

If your organization deploys agents in more than one business function, run this audit in the next two weeks.

For every agent in production, answer three questions. What is the minimum scope this agent needs (least privilege)? Who approves the agent’s output before it has external consequence (segregation of duties)? How do you measure the agent’s effect when you cannot inspect its decisions (observability through controlled experiments or human review)?

If any function deploying agents cannot answer those three questions, you do not have a marketing problem or a finance problem. You have an architecture problem in three places, three names, and one underlying shape. Fix it as one problem.

The teams that translate engineering governance into the language of their domain will operate the technology with confidence. The teams that wait for each function to invent its own answer will pay for the lesson three times.

This analysis synthesizes How I Built AI Agents to Close the Books (OnlyCFO, May 2026), AI Connectors May Put Your Clients at Risk (Jon Loomer Digital, May 2026), and Google Ads Expanded Experiment Support v24.1 (ALM Corp, May 2026).

Victorino Group helps marketing, finance, and ops teams adopt the agent governance disciplines engineering already learned. Let’s talk.

Harness Engineering Is Subtraction: Anthropic's Own Talk Shows the Scaffolding Shrinking

Thiago Victorino — Mon, 18 May 2026 00:00:00 GMT

In March 2026 we wrote about Anthropic’s generator-evaluator harness, the sprint contracts, the context resets, the three-role pattern. The post is required prior reading for this one. Read it first: Generator-Evaluator Loops. What follows assumes you have.

In May 2026, at the AI Engineer Conference, Ash Prabaker and Andrew Wilson of Anthropic’s applied AI team walked through what their harness looks like two months later. The headline is not what they added. It is what they removed.

Between Opus 4.5 and Opus 4.6, three things in their own harness got deleted. Forced sprint decomposition, gone. Fresh context windows per sprint, gone. Per-sprint evaluator runs, gone. The hours-long agent did not get worse. By their internal METR-style benchmark, Sonnet 3.7 in February 2025 sustained roughly one hour of coherent agent work under a minimal harness. Opus 4.6 in early 2026 sustains roughly twelve hours under the same minimal baseline. Twelve times the runtime, fewer moving parts in the scaffolding.

That is the discipline this post is about. Harness engineering is subtraction. Most teams are still adding.

The Curve the Model Is Climbing

The shape of the curve matters before the deletions make sense. Wilson presented the timeline as a series of paired releases, where each new model arrived alongside a harness primitive that the previous model could not have supported.

Sonnet 3.5 brought artifacts and computer use. Sonnet 3.7 brought the Claude Code research preview. Opus 4 and Sonnet 4 turned Claude Code into a generally available product with an SDK. Sonnet 4.5 added context-window awareness, Claude Code 2.0 shipped with checkpoints, and the SDK was renamed the Agent SDK. Opus 4.5 introduced many-sub-agent orchestration, with the model positioned as planning-grade. Haiku 4.5 paired with Opus 4.5 to make multi-sub-agent runs economically viable. Opus 4.6 and Sonnet 4.6 brought server-side compaction, 1M-token context as a generally available feature, and the agent-teams primitive.

Each step did two things. It made the model more capable of holding state and intent over longer horizons. It also moved capabilities that used to live in the harness down into the model or the platform.

That second move is the one that changes how you build. When server-side compaction handles context maintenance, your harness no longer needs to schedule context resets. When the model can hold two-hour continuous builds without losing the plot, your planner no longer needs to force a sprint decomposition that exists only to keep the model from drifting. The scaffolding was load-bearing for an earlier model. It is now in the way.

What Anthropic Deleted, and Why

Prabaker was specific about which pieces of the March harness no longer survive in May.

The first deletion was forced sprint decomposition. In the original generator-evaluator design, the planner broke work into bounded sprints because the generator could not maintain coherence across a longer arc. Opus 4.6 can. The team now allows continuous builds of two hours or more without artificially cutting the model’s working session into pieces.

The second deletion was the fresh-context-per-sprint pattern. The original harness archived accumulated context at each sprint boundary and restarted the generator with a clean window plus the sprint contract. Server-side compaction, which arrived with the 4.6 generation, does the equivalent job without requiring the harness to drive it. The orchestration code that managed those resets is gone.

The third deletion was the per-sprint evaluator run. In March, the evaluator ran at every sprint boundary, validating against the sprint contract before the generator was allowed to proceed. The current harness runs the evaluator once, at the end of a one-shot generation. The generator produces a complete artifact against a negotiated contract. The evaluator grades it once.

Each of these deletions removed code, removed cost, removed orchestration complexity, and did not regress quality. That is the test for any harness primitive. If the latest model can absorb the scaffold’s job, the scaffold has earned its way out.

What Survives, and What That Tells You

Three primitives did not get deleted. They are the ones worth understanding, because the absence of deletion is itself a signal.

The planner-generator-evaluator role separation is still there. It is a critic-reviewer pattern with explicit role contracts, not the GAN analogy that the prior post already corrected. The roles persist because the bias of any model grading its own output persists. A model evaluating its own work still misses the same categories of error that produced the work. The fix is structural separation, not better self-reflection.

The file-system as shared state survived. Agents read and write to disk. Disk is the protocol. The team did not move to a richer state-passing abstraction, and the reason is the same reason filesystems beat custom storage in most engineering contexts. You can list it, grep it, audit it, and run it through any other tool. The harness primitive that wins is usually the one that imposes the least new vocabulary.

Contract negotiation between generator and evaluator survived. In their Retro Forge example, the contract had 27 explicit criteria, the run cost about 200 US dollars, and it ran for six hours. The contract was negotiated before any code was generated. This is the artifact that does the heavy lifting. The contract is what the generator builds against and what the evaluator scores against. Without it, you are back to vibes.

There is a quieter primitive that also survived and deserves a paragraph of its own. The evaluator uses an explicit rubric to grade subjective qualities. Anthropic’s example uses a 4-axis rubric: design, originality, craft, functionality. The rubric is calibrated against reference sites. Subjective does not mean ungradable. It means the grading scheme has to be made explicit and external. The rubric is the load-bearing object, not the model’s taste.
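Making the rubric explicit and external can be as plain as the sketch below. The four axis names come from the example above; the 0 to 5 scale and the equal weighting are assumptions, and calibration against reference sites is not modeled.

```python
AXES = ("design", "originality", "craft", "functionality")


def grade(scores: dict[str, int], max_score: int = 5) -> float:
    """Normalize per-axis scores to a 0..1 grade; every axis must be scored."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"unscored axes: {missing}")
    for axis in AXES:
        if not 0 <= scores[axis] <= max_score:
            raise ValueError(f"{axis} score out of range: {scores[axis]}")
    # Equal weighting is an assumption; a real rubric may weight axes differently.
    return sum(scores[axis] for axis in AXES) / (len(AXES) * max_score)


result = grade({"design": 4, "originality": 3, "craft": 5, "functionality": 4})
print(result)  # 16 out of 20 possible points
```

Refusing to grade when an axis is missing is the important part: the evaluator cannot quietly skip a dimension it finds hard to judge.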

The Debugging Loop Nobody Wants to Hear About

Prabaker said something on stage that contradicts most of the literature on agent observability. The primary debugging loop for the harness builder is reading agent traces by hand.

Not dashboards. Not automated trace-analysis pipelines. Not LLM judges grading their own runs. A person sits down, opens the trace, and reads what the agent did and why. The team explicitly rejected the idea of fully automated trace analysis as the primary loop, because automated systems have the same bias as the agents they grade. They miss the same things.

This is uncomfortable advice because it does not scale linearly. You cannot hire a hundred trace readers and call it production. The point of saying it out loud is to set the expectation correctly. Traces are how you understand the system. You can build telemetry on top of trace reading, but you cannot skip it. Teams that try to skip directly to automated trace summarization end up with a confident dashboard sitting on top of a misunderstood system.

The practical implication for engineering leadership is simple. Budget for a small number of trace readers on every team operating long-running agents in production. Make them senior. Make it part of the on-call rotation. The trace is the truth, and someone has to keep reading it.

The Subtraction Discipline

The thesis of this post lives in one sentence. Every harness primitive you ship has an expiration date, and your job as a harness engineer is to delete it before it becomes a tax on the next model generation.

Most teams do not work this way. They add. The harness grows new layers of orchestration, new sub-agent roles, new context-shaping middleware, and the additions stay forever. The team that wrote them is reluctant to remove them because they shipped them, the on-call runbook references them, and the regression tests pass with them in place. Meanwhile, the model has absorbed half of what they do.

The Anthropic team has organizational permission to delete its own code because deletion is part of how they evaluate their own harness. That permission is not exotic. Any platform team can grant it. The mechanism is a quarterly audit. Every quarter, take the current harness, list every primitive in it, and ask whether the current model still needs it. If the answer is “no” or “I am not sure,” run the harness without that primitive on your benchmark suite and compare. If quality holds, the primitive goes.
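
The audit step can be sketched as a one-at-a-time ablation. This is a minimal sketch, assuming a hypothetical `run_suite` callable that runs your benchmark with a given set of primitives enabled and returns a single quality score. The tolerance and the primitive names are placeholders, not figures from the talk.

```python
from typing import Callable


def audit_primitives(
    primitives: set[str],
    run_suite: Callable[[frozenset[str]], float],
    tolerance: float = 0.01,
) -> dict[str, str]:
    """Remove each primitive in turn and compare against the full-harness baseline."""
    baseline = run_suite(frozenset(primitives))
    decisions = {}
    for name in sorted(primitives):
        ablated_score = run_suite(frozenset(primitives - {name}))
        # "Quality holds" is read here as: within tolerance of the baseline.
        decisions[name] = "delete" if ablated_score >= baseline - tolerance else "keep"
    return decisions


def fake_suite(enabled: frozenset[str]) -> float:
    """Stand-in benchmark: only the evaluator gate still affects quality."""
    return 0.90 if "evaluator_gate" in enabled else 0.80


decisions = audit_primitives({"context_reset", "evaluator_gate"}, fake_suite)
```

This version tests each primitive in isolation, so it will not catch two primitives that are only redundant together. Re-running the audit after each deletion is one way to cover that case.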

The audit is the loop. The model improves; the harness shrinks; the audit catches what the model absorbed; the shrunk harness frees engineering attention for the next frontier task that needs new scaffolding. The total investment in harness engineering does not decrease, but the location of that investment moves forward with the frontier.

Do This Now

Pick the harness around one production agent in your stack. Open the orchestration code. Find one primitive that was load-bearing when you wrote it: a context reset, a forced decomposition, an evaluator gate, a planner step that the current model could plausibly skip.

Run your evaluation suite with that primitive removed. If quality holds, delete it. Keep the deletion in a separate commit so you can revert if a future regression surfaces. Do this once a quarter for every long-running agent you operate.

If the deletion regresses quality, you have learned something useful: that primitive is still load-bearing for your specific workload, and the next model generation is where it earns its way out. Mark it, watch it, and re-audit when the next major model lands.

This is the discipline. Add when the frontier task demands it. Delete when the model has absorbed it. The harness that does its job correctly is always smaller next quarter than it was this one.

This analysis synthesizes Build Agents That Run for Hours (Ash Prabaker and Andrew Wilson, Anthropic, AI Engineer Conference 2026), How we built our multi-agent research system (Anthropic Engineering, 2025), and the earlier Victorino analysis Generator-Evaluator Loops.

Victorino Group helps engineering teams audit their agent harness for scaffolding that the latest model has already absorbed. Let’s talk.

Three Signals in Seven Days: AI Cost Just Crossed the Engineering Line

Thiago Victorino — Mon, 18 May 2026 00:00:00 GMT

Three sources, one week, same operator truth. The CTO of a public SaaS company, a market analyst building a unit-economics model, and an engineer publishing a back-of-envelope formula all published inside seven days. None of them were coordinating. All of them landed on the same conclusion. Token economics is no longer an engineering line item. It is a board governance discipline, and the enterprises that built workflows on lab subsidies have unbudgeted exposure that the next IPO filing will surface.

I have written about the end of flat-fee pricing, about the archetypes engineers fall into, and about the April pricing postmortem. This week is different. The pattern compressed. Three independent signals stacked on top of each other in a single week, and the convergence is the news.

Signal one: the CTO admits the puzzle

Jon Hyman, CTO of Braze, sat down with Stack Overflow’s Leaders of Code podcast on May 13. Braze ships AI-generated code at scale: more than 60% of committed code is now AI-authored. He told the host that one engineer spent $150 on inference in a single day, projected to roughly $4,500 per month if that pace held. That is not a corner case. That is the new median for a senior engineer using the tools the way the tools want to be used.

Then he said the line that should make every CFO stop and re-read. “Even if I make everyone 20% more productive, it’s unclear how that’s going to mix into making Braze grow 20% faster.”

A public-company CTO, on record, telling a developer audience that he cannot model the conversion from token spend to revenue growth. That is the honest version of the story every operator is living through. Productivity is real. The revenue lift is not yet legible. The bill, however, is fully legible, and it is going up.

Signal two: the analyst publishes the math

Two days earlier, State of Brand published a model with numbers that detonate the subscription assumption. Anthropic users consume up to $8 in compute per $1 of subscription revenue. Microsoft is reportedly losing $20 or more per user per month on $10 Copilot subscriptions. Power users cost Microsoft up to $80 per month against that same $10. A 50-person team paying $1,000 per month in Claude Pro seats consumes between $15,000 and $40,000 per month in actual tokens. OpenAI is on track for $115 billion in cumulative cash burn through 2029 and $665 billion in committed compute spend by 2030.

Add GitHub’s June 1 migration to usage-based Copilot billing, and the picture finishes itself. The labs are retreating from subsidy pricing. The timing is not synchronized, but the direction is shared. Every enterprise contract signed against a per-seat Copilot SKU is now a contract against a unit that will be metered, repriced, or both before the renewal cycle.

The analyst’s contribution is the model. The CTO’s contribution is the confession that even with the tools working, the revenue side is not yet keeping pace. Two halves of the same equation, published 48 hours apart, by people who do not know each other.

Signal three: the engineer derives the formula

On May 17, Ryan Skidmore published the math under the math. His piece on Claude’s prompt cache showed that the break-even between paying for cache writes and paying for cache reads is governed by a simple ratio: T = 5 × (W/R), where T is the break-even interval in minutes, W is the cache write cost multiplier (1.25), and R is the cache read multiplier (0.10). The arithmetic resolves to 62.5 minutes. If your cache refresh interval is shorter than 62.5 minutes, you are paying more in writes than you save on reads. Longer than that, the cache pays for itself.
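
Written out with the multipliers quoted above, the arithmetic is short. The leading 5 is taken as given from Skidmore’s formula and read here as minutes, an assumption based on the result being stated in minutes.

```latex
\[
T \;=\; 5 \times \frac{W}{R}
  \;=\; 5 \times \frac{1.25}{0.10}
  \;=\; 5 \times 12.5
  \;=\; 62.5 \ \text{minutes}
\]
```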

The point is not the number. The point is that the number is model-independent. The 62.5-minute rule does not change when Anthropic releases a new model, as long as the W/R ratio stays at 12.5. It is a structural constant of the pricing architecture, not a feature of the current model release.

That matters because Opus 4.7’s tokenizer already uses up to 35% more tokens than 4.6 for the same input. A workflow that fit comfortably under cache last quarter may not fit this quarter. The 62.5-minute rule is the only tool that survives the tokenizer change. Anyone modeling token spend without that constant is modeling a moving target with a stationary ruler.

The convergence

A CTO who can measure productivity but not yet revenue. An analyst who can prove subscription pricing is a $7-per-$1-billed loss machine. An engineer who can derive a 62.5-minute constant that holds across model releases. Each piece, taken alone, is a sharp observation. Stacked together, they describe a market structure.

The labs have spent two years pricing AI as a marketing instrument. Subscription tiers were ecosystem investments, not unit economics. The bill was on the lab’s balance sheet, and the customer paid a number that bore no relationship to the cost of serving them. That arrangement worked while the labs were private, capital was cheap, and the revenue trajectory mattered more than the cost trajectory.

That arrangement breaks the moment the labs need to show a public path to profitability. OpenAI’s $115 billion projected cash burn is the wall. The wall is dated. The labs are now pricing toward it, not away from it, and the price moves are no longer marketing decisions. They are governance decisions, made under the pressure of an IPO calendar.

What changed this week, specifically

Two things. First, the math got published. Until State of Brand wrote it down, the $8-per-$1 ratio was an unproven claim. Now it is a public model the buyer side can use in renewal negotiations. Second, a CTO at a public company said it out loud. Hyman is not a guy talking to a niche audience. He runs engineering at Braze. When he tells Stack Overflow that the revenue model for AI-assisted productivity is unclear, every CFO who heard that interview now has a citation for a conversation they were already having.

A confession plus a model plus a constant. Three sources, three roles, one thesis. That is the kind of week that closes a chapter and opens the next one.

Do this now

Put the 62.5-minute rule in your AI cost dashboard. Not as a metric to track. As an alarm. If your team’s cache refresh interval drops below 62.5 minutes on any workflow, you are paying a hidden 12.5x penalty per call until someone fixes it. The math is model-independent, which means the alarm survives the next four releases. Most enterprise AI cost dashboards do not yet measure this. Most are still reading vendor-supplied numbers and reporting them as truth. The vendors will not put this alarm in their dashboards, because the alarm reduces the amount you spend.
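
A minimal sketch of that alarm, assuming your dashboard can already report a cache refresh interval in minutes for each workflow. The function names, the workflow names, and the input shape are hypothetical. The multipliers and the base of 5 are the values quoted from Skidmore’s formula.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # W, as quoted in the article
CACHE_READ_MULTIPLIER = 0.10   # R, as quoted in the article
BASE_INTERVAL_MINUTES = 5.0    # the leading 5 in T = 5 * (W / R)


def break_even_minutes(
    write: float = CACHE_WRITE_MULTIPLIER,
    read: float = CACHE_READ_MULTIPLIER,
    base: float = BASE_INTERVAL_MINUTES,
) -> float:
    """Break-even interval T = base * (W / R)."""
    return base * (write / read)


def cache_alarms(refresh_intervals: dict[str, float]) -> list[str]:
    """Return workflows whose refresh interval is below the break-even threshold."""
    threshold = break_even_minutes()
    return sorted(
        name for name, minutes in refresh_intervals.items() if minutes < threshold
    )


# Hypothetical per-workflow refresh intervals, in minutes.
flagged = cache_alarms({"chat_session": 12.0, "code_review": 90.0, "nightly_batch": 240.0})
```

Keeping the multipliers as named constants means the threshold moves only if the W/R ratio itself changes, which matches the claim that the rule is independent of the model release.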

The second move is the one I keep writing about. Stop pricing AI on the cadence of your fiscal year. Start pricing it on the cadence the labs operate at, which is weekly. The three signals this week are not exceptional. They are the new average. A procurement plan that cannot absorb three independent pricing signals per week is a procurement plan that will be wrong by the second renewal.

The third move is governance. Token spend is now a board agenda item. Not because the numbers are large, though they are. Because the structure of the bill is changing faster than the structure of the company. Boards exist to spot that kind of mismatch. If your board has not yet seen a token-economics briefing, the next one is overdue.

This analysis synthesizes How Braze’s CTO Is Rethinking Engineering for the Agentic Era (Stack Overflow Blog, May 2026), Every AI Subscription Is a Ticking Time Bomb for Enterprise (State of Brand, May 2026), and Tokenomics: The 62.5-Minute Rule for Claude’s Cache (Ryan Skidmore, May 2026).

Victorino Group helps enterprises operationalize token-cost governance before the next pricing reset hits the P&L. Let’s talk.