Three Models, One Week: Separating Signal from Noise in February 2026
In the first two weeks of February 2026, three companies — OpenAI, Google DeepMind, and Anthropic — released frontier AI models within days of each other. The announcements generated enough breathless commentary to fill a small library. Most of it was noise.
This piece is the filter. We are going to walk through each release, separate what is genuinely new from what is marketing, and tell you what actually changes for organizations building with these tools. If you have read our previous analysis of the February 5th convergence, consider this the expanded field report — now with Google in the picture and a week more data.
The Releases at a Glance
Before we go deep, the timeline:
- February 2: OpenAI ships the Codex macOS app. Google DeepMind announces Gemini 3 Deep Think.
- February 5: OpenAI releases GPT-5.3-Codex. Anthropic releases Claude Opus 4.6.
- February 12: OpenAI releases GPT-5.3-Codex-Spark, powered by Cerebras hardware.
Five major announcements in ten days. Three companies. One industry trying to figure out what any of it means.
Let us take them one at a time.
OpenAI: GPT-5.3-Codex and Codex-Spark
What is new
GPT-5.3-Codex merges GPT-5.2-Codex (coding) with GPT-5.2 (reasoning and professional knowledge) into a single model that runs 25% faster and uses fewer tokens. It expands beyond code: slides, spreadsheets, PRDs, copy editing. The Codex macOS app is now a “command center for agents,” with over 1 million developers and usage that has doubled since December 2025.
A week later, Codex-Spark arrived: a smaller, ultra-fast model running at 1,000+ tokens per second on Cerebras Wafer Scale Engine 3 hardware — the product of a partnership reportedly worth over $10 billion. It is text-only, 128K context, and performs somewhere between GPT-5.3-Codex and GPT-5.1-Codex-Mini. Research preview for ChatGPT Pro users only.
Codex-Spark received OpenAI’s first “High capability” cybersecurity classification under their Preparedness Framework — the same classification that GPT-5.3-Codex received when it discovered two previously unknown Next.js vulnerabilities.
Key benchmarks for GPT-5.3-Codex: SWE-Bench Pro 56.8% (current leader), Terminal-Bench 2.0 75.1% (77.3% via Codex CLI).
What is hype
The “first model instrumental in creating itself” narrative. OpenAI claims early versions of GPT-5.3-Codex debugged their own training pipeline. This sounds dramatic. In practice, it means they used their model as a development tool during training — the same thing every AI lab does. Anthropic builds Claude with Claude. The framing implies something more recursive and profound than the reality warrants.
The Codex-Spark speed claims are impressive in isolation but require context. One thousand tokens per second matters for interactive use. It matters less for autonomous agents that spend most of their time planning, not generating. Speed is a feature, not a moat.
The Cerebras partnership valuation ($10B+) is a business story, not a capability story. It signals that custom silicon for inference is now a competitive necessity, but the dollar figure tells you nothing about what the model can do.
What is valid
The convergence of coding and knowledge work into a single model is real and significant. GPT-5.3-Codex is not a coding assistant that also does other things — it is a general knowledge worker that is particularly strong at code. This distinction matters for enterprise procurement: the evaluation is no longer “which coding tool for engineering” but “which AI platform for the organization.”
The efficiency gains are real. Fewer tokens for equivalent output means lower costs and faster responses in production. For teams running thousands of agent sessions per day, a 25% speed improvement compounds.
Terminal-Bench 2.0 at 75.1% is a strong result, though we will address benchmark comparisons separately below.
Google DeepMind: Gemini 3 Deep Think
What is new
Gemini 3 Deep Think is a specialized reasoning mode — not a general-purpose model, but a “System 2” thinking layer that performs iterative, multi-hypothesis reasoning. Think of it as a model that argues with itself before answering.
The benchmarks are striking. ARC-AGI-2 at 84.6%, independently verified by the ARC Prize organization, an unprecedented result well above any other model’s score. Humanity’s Last Exam at 48.4%, a new benchmark designed to be genuinely difficult for AI. Codeforces Elo of 3455. Gold-medal level performance on IMO, IPhO, and IChO competition problems. GPQA Diamond at 91.9%.
The practical demonstrations are more interesting than the numbers. Google reports that Deep Think found a logical flaw in a peer-reviewed mathematics paper at Rutgers. It optimized crystal growth processes at Duke. These are not benchmark performances — they are applied scientific reasoning tasks.
Available through the Gemini app for AI Ultra subscribers ($20/month) and through API access for select researchers and enterprises. Gemini 3 Pro pricing sits at $2/$12 per million input/output tokens, the best price-to-performance ratio in the frontier tier.
What is hype
The benchmark numbers, while impressive, come with a caveat that Google’s marketing understandably does not emphasize: Deep Think is a specialized reasoning mode, not a general-purpose model. Comparing its ARC-AGI-2 score (84.6%) to Opus 4.6’s (68.8%) or GPT-5.2’s (~54%) is like comparing a Formula 1 car’s lap time to a sedan’s. The F1 car is faster around the track. It also cannot carry groceries.
Humanity’s Last Exam is too new to know whether 48.4% is a meaningful milestone or an artifact of benchmark design. The test was explicitly designed to be hard for AI, which means it selects for exactly the kind of multi-step reasoning Deep Think is optimized for. It may tell us more about the test than about the model.
“System 2 thinking” is a metaphor borrowed from cognitive science that implies a precision the underlying mechanism does not justify. The model does not think slowly and deliberately the way a human does. It runs multiple inference passes. The marketing language flatters the technology.
What is valid
The ARC-AGI-2 result, independently verified, is genuinely significant. ARC-AGI tests abstract reasoning — pattern recognition on problems the model has never seen before. An 84.6% score suggests that iterative reasoning architectures can achieve forms of generalization that single-pass models cannot. This is a real architectural insight, not a benchmark game.
The scientific applications are the most important signal. Finding errors in peer-reviewed papers and optimizing experimental processes are tasks where Deep Think’s strengths — sustained reasoning, hypothesis generation, self-correction — match real needs. If your work involves research, complex engineering, or scientific reasoning, Deep Think is worth serious evaluation.
The pricing is aggressive. Gemini 3 Pro at $2/$12 per million tokens undercuts both OpenAI and Anthropic significantly. For organizations running high-volume inference, the cost difference is material.
Anthropic: Claude Opus 4.6
What is new
Opus 4.6 introduces a 1M-token context window with genuine recall — scoring 76% on MRCR v2 versus 18.5% for Sonnet 4.5. This is not just a larger window; it is a functional capability improvement. Agent Teams in Claude Code enable multiple agents to coordinate on complex tasks. Adaptive thinking with effort controls (low/medium/high/max) lets developers calibrate reasoning depth per task. PowerPoint integration enters research preview.
Key benchmarks: Terminal-Bench 2.0 65.4%, ARC-AGI-2 68.8% (nearly double the 37.6% of Claude 4.5), GPQA Diamond 77.3%, BigLaw Bench 90.2%, and a best-ever result on BrowseComp.
The market impact was outsized. Claude Code reached a $1 billion annual revenue run rate. Anthropic raised $10 billion at a $350 billion valuation. Software stocks experienced a $285 billion rout in the days following the announcement — a signal that investors now see AI agents as substitutes for, not complements to, existing software.
In cybersecurity evaluations, Opus 4.6 was preferred over Claude 4.5 in 38 of 40 investigations.
What is hype
The $285 billion stock rout is a market reaction, not a technical evaluation. Stock prices reflect sentiment, positioning, and narrative as much as capability. Using market movements to validate a model release conflates financial speculation with technical assessment.
The $1 billion run rate for Claude Code is a business metric that tells you about developer adoption, not model quality. Revenue reflects pricing, distribution, and market timing. It is meaningful for Anthropic’s viability as a company. It tells you nothing about whether Opus 4.6 is the right model for your workload.
Agent Teams is a significant architectural feature, but the marketing emphasis on “multiple agents coordinating” implies a sophistication that the current implementation may not fully deliver. Multi-agent coordination is a hard problem. The feature is new. Reserve judgment until production reports accumulate.
What is valid
The context window improvement is the most practically significant advancement across all three releases. Going from 18.5% to 76% recall at long contexts is not incremental — it is a qualitative change in what the model can do. For teams working with large codebases, lengthy legal documents, or complex research papers, this is the single most impactful feature announced in February.
The near-doubling of ARC-AGI-2 (37.6% to 68.8%) suggests real architectural improvements in abstract reasoning, not just benchmark optimization. When a score nearly doubles between model versions, something structural changed.
BigLaw Bench at 90.2% and the cybersecurity results (38/40 investigations) come from external evaluators with professional reputations at stake. These are stronger evidence than self-reported benchmarks.
Pricing at $5/$25 per million input/output tokens positions Opus 4.6 as a premium offering, roughly two and a half times Gemini 3 Pro’s input-token price. The question for buyers is whether the context window, agent capabilities, and ecosystem justify the premium.
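To make that premium concrete, here is a back-of-the-envelope comparison using the list prices above. The monthly token volume is an assumption chosen purely for illustration; substitute your own numbers.

```python
# Assumed workload for illustration only: 200M input and 50M output tokens per month.
input_millions, output_millions = 200, 50

opus_monthly = input_millions * 5.00 + output_millions * 25.00    # $5 / $25 per million tokens
gemini_monthly = input_millions * 2.00 + output_millions * 12.00  # $2 / $12 per million tokens

print(f"Opus 4.6:     ${opus_monthly:,.0f}/month")    # $2,250
print(f"Gemini 3 Pro: ${gemini_monthly:,.0f}/month")  # $1,000
```

At that mix the absolute gap is a little over 2x; whether that is trivial or decisive depends entirely on your volume.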
The Benchmark Table Nobody Should Trust
Here is the comparison table that everyone wants:
| Benchmark | GPT-5.3-Codex | Opus 4.6 | Gemini 3 Deep Think |
|---|---|---|---|
| SWE-Bench Pro | 56.8% | ~55% | ~45% |
| Terminal-Bench 2.0 | 75.1% | 65.4% | ~54% |
| ARC-AGI-2 | ~54% (GPT-5.2) | 68.8% | 84.6% |
| Humanity’s Last Exam | — | — | 48.4% |
| GPQA Diamond | — | 77.3% | 91.9% |
And here is why you should not make decisions based on it.
These numbers were not generated under comparable conditions. Different harnesses. Different compute allocations. Different retry policies. Different sampling configurations. Some numbers are self-reported; others are independently verified. The “~” symbols indicate estimates or numbers from prior model versions because same-generation comparisons do not exist.
We have written extensively about this problem in our analysis of the AI code review benchmark paradox. The pattern is the same: every vendor wins the test they designed. The methodology is the message.
The more useful observation is directional. OpenAI leads on software engineering tasks. Google leads on abstract reasoning and scientific problems. Anthropic leads on long-context work, legal reasoning, and cybersecurity. These are not competing for the same crown — they are optimizing for different kinds of work.
The Three Strategic Bets
Step back from the benchmarks and you see something more interesting: three companies making three different bets about where AI goes next.
OpenAI bets on speed and distribution. The Codex app, the Cerebras partnership, Codex-Spark at 1,000 tokens/second — these are infrastructure plays. OpenAI is betting that the winning model is the one developers reach for first because it is fast, available everywhere, and good enough for most tasks.
Google bets on depth of reasoning. Deep Think is not trying to be the fastest or the most versatile. It is trying to be the smartest on hard problems. Google is betting that as AI moves from assistance to autonomy, the ability to reason deeply about novel problems will be the scarce capability.
Anthropic bets on reliability at scale. The 1M-token context window, Agent Teams, effort controls, the focus on reduced hallucination and lower over-refusal — these are production infrastructure features. Anthropic is betting that the winning model is the one that works most reliably in complex, long-running, high-stakes workflows.
All three bets might be right. The market is large enough to support all three theses, at least for now. The important thing for practitioners is to understand which bet aligns with your needs:
- Building developer tools or high-volume applications? OpenAI’s speed and distribution advantage matters.
- Solving scientific, mathematical, or deeply analytical problems? Google’s reasoning depth matters.
- Running complex, long-context agent workflows in production? Anthropic’s reliability architecture matters.
What the Enterprise Data Actually Says
The announcements are exciting. The enterprise reality is sobering.
According to 2026 industry data: 89% of companies report using AI in some capacity, but only 6% have fully implemented agentic AI. Two-thirds report productivity gains, but only 20% report measurable revenue growth from AI. Perhaps most telling: 83% of AI leaders report major concerns — an eightfold increase in just two years.
The gap between “we use AI” and “AI drives our results” is not a model capability problem. It is an organizational capability problem. Better models do not fix broken workflows, unclear governance, or poor data quality.
Multi-model routing — dynamically selecting which model handles which task based on difficulty, cost, and latency — is now reducing inference costs by 70-80% at organizations that implement it. This is a more important optimization than choosing the “best” model for everything.
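What does that routing look like in practice? Here is a minimal sketch. The model names, prices, and thresholds are placeholders invented for illustration, not any vendor’s actual tiers; a production router would also log its decisions and fall back on errors.

```python
from dataclasses import dataclass

# Illustrative price table: model names and prices are assumptions for this sketch,
# not any vendor's actual tiers. Prices are USD per million input / output tokens.
PRICES = {
    "fast-small":  (0.25, 1.00),
    "general":     (2.00, 12.00),
    "deep-reason": (5.00, 25.00),
}

@dataclass
class Task:
    difficulty: int          # 1 = routine, 3 = novel multi-step reasoning
    context_tokens: int      # how much context the task needs
    latency_sensitive: bool  # interactive (True) vs. batch/agentic (False)

def route(task: Task) -> str:
    """Pick the cheapest model that is strong enough for the task."""
    if task.latency_sensitive and task.difficulty == 1:
        return "fast-small"   # speed wins for easy, interactive work
    if task.difficulty >= 3 or task.context_tokens > 200_000:
        return "deep-reason"  # hard or very long-context work gets the premium model
    return "general"          # everything else takes the mid-tier default

def estimated_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    """Rough per-request cost in USD from the illustrative price table."""
    in_price, out_price = PRICES[model]
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

task = Task(difficulty=1, context_tokens=2_000, latency_sensitive=True)
model = route(task)
print(model, f"${estimated_cost(model, 2_000, 500):.4f}")  # fast-small, a fraction of a cent
```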
The foundation still matters more than the flash: data quality, governance frameworks, evaluation pipelines, and team skills determine outcomes more than which frontier model you subscribe to.
What This Means For You
If you are an engineering leader, CTO, or head of AI at an organization trying to make sense of February 2026, here is what we recommend:
Stop picking sides. The era of “our shop is an OpenAI shop” or “we are all-in on Anthropic” is over. Build multi-model infrastructure. Route tasks to the model that handles them best. The cost savings alone justify the investment, and you eliminate single-vendor dependency.
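One way to make that concrete is a thin provider interface, so application code never imports a vendor SDK directly; switching vendors then becomes a configuration change rather than a rewrite. The sketch below is illustrative: the adapter names are our own and the vendor calls are stubbed.

```python
from typing import Protocol

class ChatProvider(Protocol):
    """Minimal provider interface; every vendor adapter implements the same call."""
    def complete(self, prompt: str, max_tokens: int = 1024) -> str: ...

class VendorAAdapter:
    def complete(self, prompt: str, max_tokens: int = 1024) -> str:
        # Call vendor A's SDK here; stubbed so the sketch stays self-contained.
        return f"[vendor-a] {prompt[:40]}..."

class VendorBAdapter:
    def complete(self, prompt: str, max_tokens: int = 1024) -> str:
        # Call vendor B's SDK here.
        return f"[vendor-b] {prompt[:40]}..."

# Application code depends on the interface and a config key, never on a vendor SDK.
PROVIDERS: dict[str, ChatProvider] = {"a": VendorAAdapter(), "b": VendorBAdapter()}

def answer(prompt: str, provider: str = "a") -> str:
    return PROVIDERS[provider].complete(prompt)

print(answer("Summarize the Q3 incident report.", provider="b"))
```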
Test on your workloads, not on benchmarks. Run each model against your actual tasks, in your actual environment, with your actual constraints. A model that scores 10 points lower on a public benchmark but handles your specific domain better is the better model for you. The benchmark paradox has one resolution: your own evaluation.
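The harness does not need to be elaborate. Here is a minimal sketch, assuming a hand-built task set and simple pass/fail checks; in practice the tasks come from your own backlog and the checks from your own acceptance criteria.

```python
import statistics

# Placeholder tasks: in practice these are drawn from your own tickets, documents,
# and code, each paired with a check you trust (string match, unit tests, or human review).
TASKS = [
    {"prompt": "Refactor the retry logic in billing_client.py",
     "check": lambda out: "backoff" in out.lower()},
    {"prompt": "Summarize the vendor contract's termination clauses",
     "check": lambda out: "notice period" in out.lower()},
]

def run_model(model: str, prompt: str) -> str:
    # Stub: replace with the real API call for each candidate model.
    return f"{model} draft response to: {prompt}"

def evaluate(model: str) -> float:
    """Fraction of your own tasks the model passes under your own checks."""
    results = [1.0 if task["check"](run_model(model, task["prompt"])) else 0.0 for task in TASKS]
    return statistics.mean(results)

# With the stub every model scores 0; wire in real calls to get meaningful numbers.
for candidate in ["candidate-a", "candidate-b", "candidate-c"]:
    print(candidate, evaluate(candidate))
```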
Invest in the boring stuff. Data quality. Prompt engineering. Context management. Evaluation frameworks. Governance policies. Review processes. These determine 80% of your production outcomes. The model determines 20%. Most organizations have this ratio inverted in their attention and budget allocation.
Watch Google closely. Deep Think’s reasoning capabilities are genuinely differentiated. If your work involves research, scientific analysis, or complex problem-solving, the $20/month AI Ultra subscription is the most asymmetric bet in AI right now. Do not dismiss it because it is not leading the coding benchmarks.
Upgrade your context, not just your model. Opus 4.6’s 1M-token context window is a real capability shift. But it only matters if you have the infrastructure to fill that context with the right information. Better retrieval, better chunking, and better context curation will extract more value from any model than simply subscribing to a larger one.
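Context curation can start simple: score candidate chunks against the query and pack only the most relevant ones into the window. The sketch below uses a crude word-overlap score and a rough word-count token estimate as stand-ins for a real retrieval stack.

```python
def relevance(chunk: str, query: str) -> float:
    """Crude relevance score via shared-word overlap; swap in embeddings in practice."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def pack_context(chunks: list[str], query: str, budget_tokens: int) -> str:
    """Fill the window with the most relevant chunks first, not with everything."""
    ranked = sorted(chunks, key=lambda c: relevance(c, query), reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())  # rough estimate: one word is roughly one token
        if used + cost > budget_tokens:
            continue
        picked.append(chunk)
        used += cost
    return "\n\n".join(picked)

docs = [
    "Payment retries use exponential backoff with a cap of five attempts.",
    "The office is closed on Fridays during the summer.",
]
# A deliberately tiny budget forces a choice; only the relevant chunk makes it in.
print(pack_context(docs, query="How do payment retries work?", budget_tokens=12))
```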
Budget for continuous change. These three releases happened in ten days. The next wave will come in weeks, not months. If your AI strategy assumes stable tooling for the next year, revise it. Build for adaptability — in your architecture, your contracts, and your team skills.
The Uncomfortable Summary
Here is the part that none of the three companies will tell you in their announcements:
For most organizations, for most tasks, all three frontier models will produce comparable results. The differences that show up on benchmark tables largely disappear when filtered through real-world constraints: imperfect prompts, messy data, organizational friction, and the thousand small decisions that separate a research demo from a production system.
The real differentiator is not the model. It is the engineering discipline, governance structure, and organizational capability you build around it.
February 2026 gave us faster models, smarter reasoning, and longer context windows. All good. But the organizations that extract disproportionate value from these capabilities will not be the ones that pick the “winning” model. They will be the ones that build the systems, processes, and skills to use any model well.
The model is the commodity. The discipline is the moat.
Sources
- OpenAI. “Introducing GPT-5.3-Codex.” openai.com, February 5, 2026.
- OpenAI. “GPT-5.3-Codex-Spark.” openai.com, February 12, 2026.
- OpenAI. “Codex macOS App.” openai.com, February 2, 2026.
- Google DeepMind. “Gemini 3 Deep Think.” deepmind.google, February 2026.
- ARC Prize. Independent verification of Gemini 3 Deep Think ARC-AGI-2 results.
- Anthropic. “Claude Opus 4.6.” anthropic.com, February 5, 2026.
- CMT-Benchmark, Humanity’s Last Exam, SWE-Bench Pro, Terminal-Bench 2.0 public leaderboards.
- Victorino Group. “The February 5th Convergence.” victorinollc.com, February 2026.
- Victorino Group. “The Benchmark Paradox.” victorinollc.com, February 2026.
At Victorino Group, we help organizations build multi-model AI infrastructure with the governance, evaluation, and engineering discipline that turns capability into reliable outcomes. If your team is navigating model strategy in a market that changes weekly, reach out.