The AI Control Problem

The Governance Inflection Point: Sonnet 4.6 and the Cost of Near-Frontier Intelligence

Thiago Victorino

Anthropic released Claude Sonnet 4.6 on February 17, 2026. It scores 79.6% on SWE-bench Verified. Opus 4.6, the flagship model released two weeks earlier, scores 80.8%.

The difference is 1.2 percentage points. The price difference is 5x.

Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens. Opus 4.6 costs $15 and $75. For most enterprise tasks — coding, analysis, document processing, agent orchestration — the cheaper model will produce indistinguishable results.

This is the moment the governance conversation changes. Not because a new model is impressive, but because the economics of near-frontier intelligence just made oversight architecture urgent.

What “Good Enough” Actually Means

The AI industry has trained enterprise buyers to think in tiers. Flagship models for serious work. Mid-tier for cost-sensitive tasks. Small models for edge deployment. The implicit assumption: you pay a premium for capability that matters.

Sonnet 4.6 breaks this framing.

On SWE-bench Verified, it closes to within 1.2 points of the flagship. On OSWorld, the computer use benchmark, it scores 72.5% versus Opus 4.6’s 72.7% — a gap so small it is within measurement noise. Users preferred Sonnet 4.6 over its predecessor Sonnet 4.5 approximately 70% of the time in Anthropic’s internal testing.

The ARC-AGI-2 improvement tells the capability story more directly: 58.3%, up from 13.6% on the previous Sonnet. A 4.3x improvement on a benchmark designed to resist incremental progress.

These numbers have a practical consequence that most coverage misses. When the mid-tier model matches the flagship on the tasks enterprises actually use it for, the constraint on AI deployment shifts from budget to governance. The bottleneck is no longer “can we afford to use this?” It is “can we control what happens when everyone does?”

Computer Use Crosses the Production Line

The computer use trajectory deserves its own attention, because it represents a category change, not an incremental improvement.

Anthropic’s computer use scores over the past 16 months: 14.9%, 28.0%, 42.2%, 61.4%, 72.5%. That is nearly a 5x improvement. More importantly, 72.5% on OSWorld means the model can complete real computer tasks — navigating applications, filling forms, executing multi-step workflows through a GUI — roughly three-quarters of the time.

This is not a demo capability. At 72.5%, computer use enters the range where organizations can build production workflows around it, with human review on failures. Insurance platform Pace reports 94% accuracy on its computer use benchmark. Box reports a 15 percentage point improvement in reasoning Q&A tasks.

But here is the governance problem that the capability number obscures. Anthropic’s own system card notes that GUI-based computer use has “noticeably more inconsistent” alignment than text-only interactions. When an agent navigates a GUI, it operates in a richer action space with more ambiguity about intent. A text API call either succeeds or fails. A GUI interaction can partially succeed, navigate to the wrong screen, enter data in the wrong field, or take an unexpected path that produces a result that looks correct but is not.

Traditional AI governance assumes text-in, text-out interactions. Computer use agents interact with the same interfaces humans use. They click buttons, fill forms, navigate menus. The governance framework for these agents requires different categories: screen-level access control, interaction audit trails, and rollback mechanisms for GUI actions. Most enterprises have none of this.
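Those missing categories can be made concrete. Below is a minimal sketch of an interaction audit trail with screen-level access control and a rollback hook. Every name in it — `AuditedAction`, the `undo` callable, the screen whitelist — is a hypothetical construction for illustration, not any vendor's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional

# Illustrative audit-trail shape for GUI agent actions. All names here are
# hypothetical; this is a sketch of the governance categories, not a product.

@dataclass
class AuditedAction:
    agent_id: str
    action: str            # e.g. "click", "fill", "navigate"
    target: str            # "screen/element" identifier
    value_before: Optional[str] = None
    value_after: Optional[str] = None
    undo: Optional[Callable[[], None]] = None  # how to reverse this action
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class AuditTrail:
    def __init__(self, allowed_screens: set[str]):
        self.allowed_screens = allowed_screens  # screen-level access control
        self.log: list[AuditedAction] = []

    def record(self, action: AuditedAction) -> None:
        screen = action.target.split("/")[0]
        if screen not in self.allowed_screens:
            raise PermissionError(
                f"agent {action.agent_id} blocked on screen {screen!r}"
            )
        self.log.append(action)

    def rollback(self) -> int:
        """Undo reversible actions in reverse order; return count undone."""
        undone = 0
        for act in reversed(self.log):
            if act.undo is not None:
                act.undo()
                undone += 1
        return undone
```

The structural point is that every GUI interaction leaves a record with a before/after state and, where possible, a way to reverse it — the same discipline databases have applied to writes for decades.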

The Multi-Model Routing Reality

Sonnet 4.6 ships with Adaptive Thinking — configurable effort levels that let developers control how much reasoning the model applies to a given task. Low effort for simple classification. High effort for complex analysis. The model adjusts its thinking depth, and its cost, accordingly.

This is a routing primitive disguised as a feature.

In practice, enterprise AI architectures are converging on multi-model routing: different models for different tasks, selected by cost, latency, and capability requirements. Sonnet handles the bulk workload. Opus handles the complex edge cases. Haiku handles the high-volume, low-stakes tasks. Adaptive Thinking adds another dimension: even within a single model, the reasoning investment is now tunable.

This is the right architecture. It is also ungoverned in most organizations.

The routing layer — the system that decides which model handles which task, at what effort level, with what permissions — is the most consequential piece of AI infrastructure in the enterprise. It determines cost, quality, speed, and risk exposure. And in most deployments, it is a collection of if-else statements written by the team that shipped fastest.

Governance must cover the routing layer. When a system decides that a customer-facing interaction is “low effort” to save costs, that is a risk decision. When it routes a compliance-sensitive task to a cheaper model to reduce latency, that is a governance decision. These decisions happen thousands of times per minute, and they are rarely logged, rarely audited, and rarely reviewed.
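As a sketch, a governed router makes those decisions explicit and logs every one. The model tiers below follow the article's framing; the sensitivity rules, effort levels, and field names are illustrative assumptions, not a production policy.

```python
import json
import logging
from dataclasses import dataclass

# Sketch of a governed routing layer. Tier names follow the article's
# Opus/Sonnet/Haiku framing; the rules themselves are illustrative.

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("router")

@dataclass
class Task:
    kind: str          # e.g. "classification", "analysis"
    sensitivity: str   # e.g. "low", "compliance", "customer_facing"
    est_tokens: int

# Governance boundary: tasks that may never be downgraded to save cost.
NEVER_DOWNGRADE = {"compliance", "customer_facing"}

def route(task: Task) -> dict:
    if task.sensitivity in NEVER_DOWNGRADE:
        decision = {"model": "opus-4.6", "effort": "high"}
    elif task.kind == "classification" and task.est_tokens < 1_000:
        decision = {"model": "haiku", "effort": "low"}
    else:
        decision = {"model": "sonnet-4.6", "effort": "medium"}
    # Routing decisions are risk decisions: log every one for audit.
    log.info(json.dumps(
        {"task": task.kind, "sensitivity": task.sensitivity, **decision}
    ))
    return decision
```

The value is not in these particular rules but in the shape: the routing policy becomes a single reviewable artifact with a complete decision log, rather than if-else statements scattered across the team that shipped fastest.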

The Thinking Token Cost Trap

A note on economics, because the headline pricing is misleading.

Sonnet 4.6 costs $3/$15 per million tokens. But when Adaptive Thinking is engaged at higher effort levels, the model generates thinking tokens: internal reasoning billed at the output rate of $15 per million. A task that produces 2,000 output tokens but requires 10,000 thinking tokens bills 12,000 tokens at the output rate. The thinking tokens alone cost five times the visible output, and the total is six times what the headline price suggests.

This is not a flaw. It is the correct economic design for tunable reasoning. But it means that the effective cost of Sonnet 4.6 depends on how it is used, not just what it costs. Organizations that deploy it with high-effort thinking on high-volume tasks will discover costs that look nothing like the published rate card.
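The arithmetic is worth making explicit. A minimal cost sketch using the article's published rates; that thinking tokens bill at the output rate is the article's claim, while the function shape is illustrative:

```python
# Effective-cost sketch using the article's rates: $3 per million input
# tokens, $15 per million output tokens, thinking tokens billed as output.

INPUT_RATE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output or thinking token

def task_cost(input_toks: int, output_toks: int, thinking_toks: int = 0) -> float:
    """Total dollar cost for one task; thinking tokens bill as output."""
    return input_toks * INPUT_RATE + (output_toks + thinking_toks) * OUTPUT_RATE

# What the headline price suggests for 2,000 output tokens, ignoring input:
visible = task_cost(input_toks=0, output_toks=2_000)
# What the same task actually costs with 10,000 thinking tokens:
actual = task_cost(input_toks=0, output_toks=2_000, thinking_toks=10_000)
```

Here `actual` is six times `visible`: the thinking tokens dominate the bill while never appearing in the response the user sees.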

The 1M token context window, available in beta for Tier 4+ accounts, adds another layer. Tokens beyond 200K are priced at a premium. The headline “1M context window” is real, but the economics of using it at scale are materially different from the economics of using 200K tokens.
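The long-context tier can be modeled the same way. Since the article does not state the premium rate, the multiplier below is left as an explicit parameter rather than a guessed number:

```python
# Tiered input pricing sketch: tokens beyond a 200K threshold bill at a
# premium. The premium multiplier is deliberately a parameter, because the
# article only says long-context tokens "are priced at a premium".

BASE_INPUT_RATE = 3.00 / 1_000_000  # dollars per input token under 200K
THRESHOLD = 200_000

def long_context_cost(input_toks: int, premium_multiplier: float) -> float:
    """Input cost with a two-tier rate: base up to THRESHOLD, premium beyond."""
    base_part = min(input_toks, THRESHOLD) * BASE_INPUT_RATE
    premium_part = max(0, input_toks - THRESHOLD) * BASE_INPUT_RATE * premium_multiplier
    return base_part + premium_part
```

Even with a modest multiplier, a request that actually uses the full 1M window pays most of its input bill in the premium tier — which is the article's point about headline versus effective economics.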

None of this is unusual for cloud services. But it matters for governance because cost governance and capability governance are the same problem. An uncontrolled thinking effort dial is an uncontrolled cost dial. An uncontrolled context window is an uncontrolled data exposure surface.

What Convergence Means for Enterprise Strategy

Two weeks ago, we analyzed what the simultaneous release of GPT-5.3-Codex and Opus 4.6 revealed about the industry’s direction. Sonnet 4.6 adds a new dimension to that convergence.

The frontier is no longer a single point. It is a band. Opus 4.6 sits at the top, but Sonnet 4.6 sits close enough that for most practical purposes, the performance difference is invisible. GPT-5.2 occupies a similar position in OpenAI’s lineup. Gemini 3 Pro competes on multimodal tasks.

This means the “best model” framing is now doubly obsolete. Not only do frontier models from different companies perform within noise of each other on shared benchmarks — as we documented in the February 5th convergence analysis — but models within the same company’s lineup now perform within noise of each other on most practical tasks.

The strategic implication is stark. Model capability is commoditizing at two levels simultaneously: across vendors and across tiers. The value differentiator for enterprises is not which model they choose. It is how they govern, route, monitor, and control whichever models they deploy.

What a Cautious CTO Should Do Now

If you are leading technology at an enterprise considering or already using AI agents, Sonnet 4.6 does not change what you should build. It changes how fast you need to build it.

Audit your routing layer. If your organization uses multiple models or effort levels, document the logic that decides which model handles which task. Treat routing decisions as governance decisions. Log them. Review them. Set boundaries on what can be routed to lower tiers.

Build computer use governance before deploying computer use agents. The capability is production-ready. The governance frameworks are not. Before any team deploys an agent that clicks through GUIs, define the interaction boundaries, the audit trail requirements, and the rollback procedures. The 72.5% success rate means 27.5% of interactions will need human intervention. Plan for that.
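That planning step can be sketched as a review queue: any run the agent fails goes to a human by construction, before the first incident rather than after it. `run_agent` here is a stub standing in for a real computer use agent.

```python
import random

# Sketch of a human-in-the-loop fallback for computer use workflows.
# run_agent is a hypothetical stub, not a real agent interface.

def run_agent(task: str) -> bool:
    """Stub: pretend roughly 72.5% of runs succeed, per the OSWorld figure."""
    return random.random() < 0.725

def process(tasks, runner=run_agent):
    """Split tasks into completed work and a human review queue."""
    done, review_queue = [], []
    for t in tasks:
        (done if runner(t) else review_queue).append(t)
    return done, review_queue
```

In production the queue would feed a ticketing or workflow system; the structural point is that the human-intervention path exists before the agent ships.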

Instrument thinking effort. If your teams adopt Adaptive Thinking, treat the effort level as a governed parameter. High-effort thinking on sensitive tasks is a safety feature worth its cost. Low-effort thinking on those same tasks is a risk decision that should be explicit, not default.

Model the real costs. The gap between headline pricing and actual spend will surprise organizations that deploy without cost modeling. Build a cost model that accounts for thinking tokens, long-context premiums, and routing patterns before scaling.

Stop waiting for the “right” model. The convergence between Sonnet and Opus makes waiting for a strictly superior option a losing strategy. Near-frontier models are good enough for the vast majority of enterprise tasks. The organizations that will lead are those that invest the waiting time in governance infrastructure instead.

The Real Signal

Sonnet 4.6 is not a story about a new model. It is a story about what happens when near-frontier intelligence becomes cheap.

Every organization that deferred AI governance because deployment was limited to a few high-cost, carefully managed use cases now faces a different reality. At $3 per million input tokens, budget approval is no longer the barrier to deploying capable AI agents. The only friction left is procurement, and in practice even that dissolves: deployment is a developer with a credit card, a team that prototypes on Tuesday and ships to production on Thursday.

The governance question is no longer “should we adopt AI?” That question was settled. It is not even “which model should we use?” That question is becoming irrelevant as models converge.

The governance question is: when capable AI agents cost less than your morning coffee to run, do you have the infrastructure to know what they are doing?


Sources

  • Anthropic, “Claude Sonnet 4.6 Announcement,” February 17, 2026
  • Anthropic, “Claude Sonnet 4.6 System Card,” February 17, 2026
  • SWE-bench Verified leaderboard, February 2026
  • OSWorld benchmark results, February 2026
  • ARC-AGI-2 benchmark results, February 2026
  • Anthropic API pricing page, accessed February 18, 2026
  • Box AI integration case study, February 2026
  • Pace insurance benchmark case study, February 2026

If this resonates, let's talk

We help companies implement AI without losing control.

Schedule a Conversation