The AI Control Problem

AI Agents Have Opinions: What Claude Code's Tool Picks Reveal About Ungoverned Delegation

Thiago Victorino

Ask Claude Code to set up CI/CD for a new project. It will pick GitHub Actions. Not sometimes. Not usually. 93.8% of the time.

Ask it to choose a state management library. Redux will not appear. Zero primary picks across 2,430 responses. Zustand wins by default.

Ask it to build an API. Express will not be suggested. Zero picks across 119 framework questions. Fastify and Hono take its place.

These are not hallucinations. They are preferences. And the question every engineering leader should be asking is: who decided?

The Study

Amplifying.ai published one of the most rigorous studies of AI agent tool selection to date. The methodology: 2,430 structured responses across three Claude models (Sonnet 3.5, Sonnet 4.5, Opus 4.6), covering 20 tool categories and four project types. Each response was extracted, categorized, and cross-validated. The raw data is on GitHub.

The findings are striking in their consistency. Across 18 of 20 categories, all three models agreed on the primary pick. The agreement rate hit 90%. These are not random selections from a vast possibility space. They are convergent recommendations with narrow variance.
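The kind of tallying involved is simple to picture. Here is a minimal sketch, with made-up responses, of how a primary-pick rate like the 93.8% figure falls out of categorized response data (the tool names and counts below are illustrative, not the study's actual data):

```python
from collections import Counter

def primary_pick_rates(responses):
    """Share of responses naming each tool as the pick for one category."""
    counts = Counter(responses)
    total = len(responses)
    return {tool: n / total for tool, n in counts.items()}

# Hypothetical CI/CD responses showing a GitHub Actions-style skew
responses = ["github-actions"] * 15 + ["gitlab-ci"]
rates = primary_pick_rates(responses)  # github-actions: 0.9375
```

With 2,430 real responses per the study, the same computation runs per category and per model, which is what makes the cross-model agreement comparison possible.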

Some results reflect reasonable engineering judgment. GitHub Actions at 93.8% for CI/CD makes sense: it has deep integration with GitHub repositories, which dominate the projects tested. Vercel hit 100% for JavaScript deployment, which is defensible for the project types in the study. shadcn/ui at 90.1% for component libraries reflects genuine community momentum.

Others reveal something more interesting.

The Express Question

Express getting zero picks sounds like a damning verdict on the most popular Node.js framework. It is not. It is a prompt artifact.

All four test repositories in the study used framework-native routing (Next.js, Nuxt, SvelteKit, and Astro). When the framework already handles HTTP routing, recommending Express is like recommending a second steering wheel. The model made the correct contextual decision for the projects it was given.

Express still has 33.7 million weekly npm downloads. It powers a large fraction of Node.js production workloads. The study did not test “build me an Express API from scratch.” It tested “what tools should this existing project use?” Context shaped the answer.

This distinction matters because it illustrates a deeper problem. The study’s methodology was transparent and rigorous. But anyone reading the headline “Express: 0 picks” without understanding the test conditions would draw the wrong conclusion. Now imagine that same misinterpretation happening silently inside an AI agent making production decisions on your behalf.

The Custom Preference

The most surprising finding is also the most underexamined. In 12 of 20 categories, “custom/DIY” was the primary pick, accounting for 252 selections total.

The instinctive reaction is to call this bias. The model prefers building from scratch over adopting established tools. But there is a simpler explanation that deserves consideration first.

The study tested greenfield projects. For greenfield work, the calculus is different. You do not inherit dependencies. You do not have existing patterns to maintain. The cost of building a tailored solution is lower, and the cost of importing an unnecessary abstraction is higher. A senior architect reviewing the same projects might make the same call.

The question is whether the model is making that judgment for the right reasons (project context) or the wrong reasons (training data frequency). The study cannot distinguish between these. Neither can the agent running in your repository right now.

Model Personality

Here is where the data gets genuinely uncomfortable for organizations treating AI agents as interchangeable.

Sonnet 4.5 and Opus 4.6, both Claude models, showed measurably different preferences across the same questions. Sonnet 4.5 recommended Redis 93% of the time and Prisma 79%. Opus 4.6 recommended Drizzle at 100% and chose custom solutions 11.4% more often than Sonnet 4.5.

Same company. Same model family. Different opinions.

Sonnet 4.5 leans conservative: proven tools, established patterns, safe choices. Opus 4.6 leans forward: newer ORMs, more custom solutions, less reliance on incumbents. These are not configuration differences. They are personality differences baked into training.

If you upgrade your agent from one model version to another, your technology stack recommendations change. Not because the technology changed. Not because your requirements changed. Because the model’s training data distribution shifted.

In any other context, we would call this a supply chain risk. A component in your decision pipeline changed its behavior without notification, without a changelog, and without your approval. As we explored in The Benchmark Paradox, the tools we use to evaluate AI capabilities are already unreliable. Model personality adds another uncontrolled variable.

Training Data as Hidden Policy

Why does Zustand beat Redux? Why does Drizzle beat Prisma in one model and not another? Why does Hono appear as a primary pick when it has a fraction of Express’s install base?

The answer is training data composition.

Research published on arXiv confirms that large language models contradict their own recommendations 83% of the time when prompted differently. The preferences are not stable conclusions from systematic evaluation. They are statistical artifacts of which blog posts, tutorials, GitHub repositories, and Stack Overflow answers dominated the training corpus.

Recent content gets weighted. Popular tutorials get weighted. Trending repositories get weighted. The model does not evaluate Zustand against Redux on technical merits. It reflects the aggregate sentiment of its training data, which skews toward whatever the developer community was enthusiastic about during the training window.

This is not a flaw to be fixed. It is a structural property of how these systems work. And it means that every agent recommendation carries an invisible policy: the collective opinion of the internet, filtered through a training pipeline, frozen at a point in time, and presented as objective technical guidance.

The Cline Lesson

Cline, an open-source AI coding agent, recently improved its Terminal Bench score from 47% to 57%. A ten-point jump sounds like meaningful progress. The details tell a different story.

Terminal Bench tests 89 coding tasks. Cline’s biggest improvement came from a single change: increasing the task timeout from 600 seconds to 2,400 seconds. Four times longer. The agent did not get smarter. It got more patient.

This is not a criticism of Cline. The team was transparent about their methodology, publishing a detailed hill-climbing guide that others can replicate. But it reveals something important about how we measure agent capability.

Benchmark scores conflate multiple dimensions: reasoning quality, tool selection, context management, and infrastructure constraints like timeouts and rate limits. A ten-point improvement could mean better reasoning. Or it could mean the agent was previously failing tasks it already knew how to solve, simply because the clock ran out.

For organizations evaluating agents, the lesson is direct. Ask what changed. A better score is not the same as a better agent.

The Benchmark Vendor Problem

DeepSource published a thoughtful analysis of AI code review benchmarks in February 2026, noting that every vendor who publishes a benchmark conveniently wins it. The criticism is valid: there is no SWE-bench equivalent for code review, no independent standard that levels the field.

The irony is that DeepSource published this analysis alongside its own favorable benchmark results. The company that critiques self-serving benchmarks also participates in the practice.

This pattern, where everyone recognizes the problem and everyone contributes to it, is characteristic of markets without independent measurement standards. As we noted in How AI Decides What to Quote, AI systems have predictable patterns shaped by their architecture and training data. The same principle applies to agent tool selection: without independent standards, we cannot distinguish signal from artifact.

What This Means for Governance

The Amplifying.ai study has limitations that deserve acknowledgment. It tested JavaScript and Python projects only. No Go, Rust, Java, or C#. It tested greenfield projects, not legacy codebases with existing dependencies. And it used Claude’s own output to assess Claude’s preferences, a self-judging methodology the researchers estimate at 85% accuracy. Fifteen percent noise is significant when you are drawing conclusions about systematic bias.

These limitations do not invalidate the findings. They bound them. The tool preferences are real. The model agreement is real. The personality differences between model versions are real. What remains uncertain is whether these patterns hold across other languages, other project types, and other model families.

But even with those caveats, the governance implications are concrete.

When an AI agent picks your CI/CD platform, your ORM, your component library, and your state management approach, it is making architecture decisions. Those decisions will persist for years. They will shape hiring, vendor relationships, and maintenance costs. They will determine which security vulnerabilities you are exposed to and which performance characteristics you inherit.

The agent makes these decisions based on training data composition, not on your organization’s specific context, compliance requirements, or technical strategy. It does not know your vendor agreements. It does not know your team’s expertise. It does not know your security posture. It knows what was popular on the internet during its training window.

That is not a tool making a recommendation. That is an ungoverned policy being applied to your architecture.

Three Things to Do

Audit agent defaults before they become permanent. The first tool an agent picks for a greenfield project tends to stick. Review agent-selected dependencies within the first week, not after they are woven into production. A component library or ORM choice made in week one becomes a migration project in month six.
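The audit itself can be mechanical: snapshot the dependency manifest when the project starts, then diff against it during review. A minimal sketch (the package names and versions are hypothetical):

```python
def new_dependencies(baseline: dict, current: dict) -> dict:
    """Dependencies present now that were absent from the baseline snapshot."""
    return {name: ver for name, ver in current.items() if name not in baseline}

# Hypothetical package.json snapshots from week zero and week one
baseline = {"react": "^18.2.0"}
current = {"react": "^18.2.0", "zustand": "^4.5.0", "drizzle-orm": "^0.30.0"}
added = new_dependencies(baseline, current)  # the agent's picks, surfaced
```

In practice you would load both snapshots from version control and route the diff to a human reviewer; the point is that every agent-selected dependency appears on a list someone actually reads.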

Version-pin your models, not just your dependencies. If Sonnet 4.5 and Opus 4.6 produce different technology recommendations, a model upgrade is a policy change. Treat it like one. Test agent outputs against your architecture standards before and after model changes.
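One way to make that test concrete is a drift check: record each model version's primary picks per category, then flag categories where the recommendation changed. A sketch, using illustrative picks loosely based on the study's Prisma/Drizzle split:

```python
def recommendation_drift(before: dict, after: dict) -> dict:
    """Categories where the primary pick changed between model versions."""
    return {
        category: (before[category], after[category])
        for category in before
        if category in after and before[category] != after[category]
    }

sonnet_picks = {"orm": "Prisma", "cache": "Redis", "ci": "GitHub Actions"}
opus_picks = {"orm": "Drizzle", "cache": "Redis", "ci": "GitHub Actions"}
drift = recommendation_drift(sonnet_picks, opus_picks)  # {"orm": (...)}
```

Run this as a gate before rolling a model upgrade to your agents: an empty result means the upgrade is behaviorally quiet on tool selection; a non-empty one means a policy change that deserves review.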

Establish a tool allowlist. Your organization has opinions about technology choices. Your AI agent has different opinions. One of them should win, and it should be yours. Maintain an approved tool list that agents must operate within, the same way you would constrain a new hire’s technology choices until they understand your environment.
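Enforcement can be as simple as checking agent-proposed packages against the approved set before anything is installed. A minimal sketch (the approved list and proposed packages are hypothetical):

```python
# Your organization's approved tools, not the model's training-data favorites
APPROVED = {"react", "zustand", "drizzle-orm", "vitest"}

def violations(proposed: list) -> list:
    """Agent-proposed packages that fall outside the approved list."""
    return [pkg for pkg in proposed if pkg not in APPROVED]

flagged = violations(["zustand", "hono", "left-pad"])  # ["hono", "left-pad"]
```

Wire a check like this into the same pre-merge pipeline that runs your linters, and the allowlist stops being a wiki page and starts being policy.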

Training data is hidden policy. The question is whether you will govern it, or discover it after the architecture is built.


This analysis synthesizes Amplifying.ai’s Claude Code study (February 2026), Cline’s hill climbing methodology (February 2026), and DeepSource’s benchmark analysis (February 2026).

Victorino Group helps organizations govern the invisible decisions AI agents make on their behalf. Let’s talk.
