Who Watches the 700 Experiments?
On March 8, Andrej Karpathy posted a 630-line Python script to GitHub. By Monday morning, it had 24,000 stars, and his announcement had drawn 8.6 million views on X. The tool, called autoresearch, runs AI experiments while you sleep. An agent reads its own source code, forms a hypothesis, modifies the code, trains the model, evaluates the result, and keeps or reverts the change. Then it does it again. And again.
In one overnight run, the agent completed 126 experiments, improving a language model’s efficiency from 0.9979 to 0.9697 bits per byte. Left running for two days on a deeper model, it processed approximately 700 autonomous changes and found roughly 20 improvements that transferred to larger models. It dropped the “Time to GPT-2” benchmark from 2.02 hours to 1.80 hours, an 11% efficiency gain on a codebase Karpathy considered already well-tuned.
The agent caught oversights in attention scaling and regularization that Karpathy says he had missed across two decades of work. It broke what the ML community considers "established folklore" about weight decay exclusions by finding that tiny weight decay on embeddings and value embeddings improved performance. A narrow optimum at a transformer initialization scale of 0.68x emerged from brute-force search across hundreds of variations.
The tool works. The question no one is asking is: who governs it?
What autoresearch actually is
Intellectual honesty requires precision here. Autoresearch is not “automating the scientific method,” despite what the headlines suggest. The scientific method involves formulating novel hypotheses, designing experiments to test them, and revising theoretical understanding. Autoresearch does something narrower: it runs a loop of modify-measure-keep/revert on a fixed training script against a fixed validation metric.
This is closer to automated hill-climbing than to scientific reasoning. The agent cannot question its own objective function, propose new evaluation criteria, or recognize when its optimization target is misaligned with the actual goal. It optimizes within the frame a human defines. It does not reframe.
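Stripped of the LLM, the loop really is hill-climbing. A minimal sketch, where `state` stands in for train.py, `mutate` for the LLM's code edit, and `evaluate` for a full train-plus-validation run; these names are illustrative abstractions, not Karpathy's code:

```python
import random

def hill_climb(n_experiments, mutate, evaluate, state, seed=0):
    """Sketch of a modify-measure-keep/revert loop (lower score = better).
    `state`, `mutate`, and `evaluate` are hypothetical stand-ins for
    train.py, the LLM's edit, and a training + validation run."""
    rng = random.Random(seed)
    best_state, best_score = state, evaluate(state)
    log = []
    for i in range(n_experiments):
        candidate = mutate(best_state, rng)   # agent proposes a change
        score = evaluate(candidate)           # retrain, re-measure
        kept = score < best_score             # keep strict improvements only
        if kept:
            best_state, best_score = candidate, score
        log.append((i, score, kept))          # commit-log-style audit trail
    return best_state, best_score, log

# Toy stand-in: the "code" is a single scalar, the val metric a parabola.
best, score, log = hill_climb(
    50,
    mutate=lambda s, rng: s + rng.uniform(-0.5, 0.5),
    evaluate=lambda s: (s - 1.0) ** 2,
    state=3.0,
)
```

Note what the loop never does: it never questions `evaluate` itself. That asymmetry is the whole governance problem in miniature.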
That said, autoresearch is genuinely different from traditional AutoML tools like Optuna, Ray Tune, or Hyperopt, which have been doing automated experimentation for over a decade. The difference: autoresearch uses an LLM to modify actual code, not just sweep predefined parameter spaces. The agent can change model architecture, add regularization techniques, restructure training loops, and articulate its reasoning in natural language. The experiment log reads like a research notebook, not a parameter grid.
This is a real increment. It is not a revolution.
The progression no one is governing
As we explored in The Phase Shift in Software Engineering, Karpathy described a profound shift in January 2026: flipping from 80% manual code to 80% AI-generated code. The progression since then maps a clear trajectory of diminishing human involvement:
- 2025: Human writes code. AI assists with completion and suggestions.
- January 2026: Human describes intent in natural language. AI writes code. Human reviews. This is "vibe coding."
- March 2026: Human defines a fitness function and a compute budget. AI writes code, runs it, evaluates results, and decides what to keep. Human sleeps.
Each step reduces human involvement and expands the governance surface area that organizations must cover. In vibe coding, the human still reviews every change. In autoresearch, review happens after 126 or 700 changes have already been committed.
As we argued in The Thinking Gap Is the Governance Gap, the bottleneck has moved from writing code to defining what should be written and reviewing what was generated. Autoresearch pushes this further: the bottleneck is now defining the search space and trusting that the agent stays within it.
The validation integrity problem
GitHub user alexisthual raised the sharpest question in the discussion thread: “Aren’t you concerned that launching that many experiments will eventually ‘spoil’ the validation set?”
This is not a hypothetical concern. With 126 experiments evaluated against the same validation data, autoresearch faces the multiple testing problem. Each experiment is a hypothesis test. Without correction, the probability of keeping changes that exploit validation set idiosyncrasies grows with each iteration.
Research from the MLE-bench study (arXiv 2507.02554) confirms this at scale: when AI agents select solutions by test score instead of validation score, results inflate by 9-13%. Agents begin demonstrable overfitting around the 50-hour mark. Autoresearch’s loop has no cross-validation, no holdout rotation, no drift detection.
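The arithmetic behind the concern is stark. Under the simplifying assumption that each keep/revert decision is an independent test with per-test false-positive rate alpha, the chance that at least one spurious "improvement" survives grows geometrically with the experiment count:

```python
def p_spurious_keep(n_experiments, alpha=0.05):
    """P(at least one false positive survives) = 1 - (1 - alpha)^n,
    assuming independent tests at per-test false-positive rate alpha.
    A simplification: repeated tests against one shared validation
    set are correlated, but the qualitative trend is the same."""
    return 1.0 - (1.0 - alpha) ** n_experiments

print(p_spurious_keep(1))     # ≈ 0.05 by construction
print(p_spurious_keep(126))   # > 0.99: a spurious keep is near-certain
print(p_spurious_keep(700))
```

The model is crude, but the direction is not: without holdout rotation or correction, 126 uncorrected tests make validation-set exploitation the expected outcome, not the edge case.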
Karpathy reported that 20 improvements “transferred to larger models,” which provides some evidence against pure overfitting. But “transferred” is vague. What was the effect size? Did all 20 transfer equally? A tweet is not a replication study.
The 81% failure rate in the published session (102 of 126 experiments discarded) and the fragility of results (5% warmup “did NOT reproduce” across sessions, seed changes produced no benefit) suggest the method’s reliability as a research tool remains unproven.
The swarm scenario
The story escalated when Varun Mathur, CEO of Hyperspace AI, distributed autoresearch across a peer-to-peer network. Thirty-five autonomous agents ran 333 experiments completely unsupervised using the GossipSub protocol for real-time discovery sharing. VentureBeat reported that these agents “independently rediscovered ML milestones that took human researchers 8 years to formalize.”
This claim requires scrutiny. The “milestones” in question are RMSNorm and tied embeddings, both standard techniques documented in countless papers and tutorials. The LLM agents driving the experiments were trained on this literature. They did not derive these techniques from first principles. They recognized contexts where known techniques apply. A calculator that outputs “2+2=4” has not rediscovered arithmetic.
What is genuinely interesting about the Hyperspace experiment is the emergent coordination. GPU-rich agents and CPU-only agents developed different experimental strategies. The GPU machines brute-forced aggressive learning rates; the CPU machines, constrained by compute, focused on initialization and normalization choices. When one agent found that Kaiming initialization dropped loss by 21%, the discovery propagated through the gossip network within hours.
This is swarm intelligence. It is valuable. It is also fundamentally resistant to centralized governance. No human approved the Kaiming discovery before 23 other agents incorporated it. No one verified it would transfer before it spread. The coordination happened faster than any human oversight process could operate.
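The propagation speed is easy to see in a toy push-gossip model. This simulates the dynamics only: the fanout, the synchronous rounds, and the function names are my assumptions, not how GossipSub actually works:

```python
import random

def gossip_rounds(n_agents=35, fanout=3, seed=0, max_rounds=100):
    """Rounds until one agent's 'discovery' reaches the whole swarm,
    with every informed agent pushing to `fanout` random peers per
    round. Note what is missing: no peer verifies the result before
    adopting it."""
    rng = random.Random(seed)
    informed = {0}                           # agent 0 finds the improvement
    for rounds in range(1, max_rounds + 1):
        for _ in sorted(informed):           # snapshot of this round's senders
            informed.update(rng.sample(range(n_agents), fanout))
        if len(informed) == n_agents:
            return rounds
    return max_rounds

print(gossip_rounds())  # full coverage in a handful of rounds
```

Exponential spread is the feature. It is also why any review step that operates in hours, not rounds, arrives after the swarm has already converged.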
The governance vacuum
The EY Technology Pulse Poll from March 2026 provides the enterprise-scale context: 97% of tech executives view autonomous AI as a high or essential priority. But 52% of department-level AI initiatives operate without formal oversight, 85% prioritize speed-to-market over exhaustive AI vetting, and 45% have experienced data leaks from unauthorized AI tools.
Autoresearch fits squarely in this gap. It is open source, MIT licensed, 630 lines of Python. Anyone with GPU access and an LLM API key can deploy it tonight. The simplicity is part of the appeal. It is also the governance risk.
The tool has exactly one containment mechanism: the agent can only modify train.py. That is the entire safety architecture. No approval gates. No anomaly detection. No scope constraints beyond file boundaries. No audit trail beyond git commits. No mechanism to detect when the agent has drifted into metric gaming rather than genuine improvement.
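That single mechanism is at least externally checkable. A sketch of the kind of guard a wrapper script could run before accepting a commit; the function names are mine, not part of autoresearch:

```python
import subprocess

def touched_files(repo="."):
    """Paths changed in the working tree, via `git diff --name-only`."""
    out = subprocess.run(["git", "diff", "--name-only"],
                         cwd=repo, capture_output=True, text=True, check=True)
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def diff_is_contained(changed, allowed=frozenset({"train.py"})):
    """True iff every changed path is on the allowlist."""
    return set(changed) <= set(allowed)
```

Even this checks scope, not intent: a change confined to train.py can still game the metric, which is exactly the drift the tool cannot detect.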
When we discuss agent governance in theory, the scenarios can feel abstract. Autoresearch makes them concrete. Who is accountable when an agent-discovered “improvement” causes a regression in production? Who reviews 700 autonomous changes for correctness, not just for metric improvement? What happens when agents share bad results through a gossip network and multiple systems incorporate flawed discoveries?
What “experimental designer” actually means
Karpathy’s framing is that the human role shifts from “experimenter” to “experimental designer.” This is accurate but incomplete. The experimental designer must also be the experimental governor.
Defining the fitness function is not just a technical decision. In autoresearch, val_bpb is clean and unambiguous. But as Ken Huang notes in his analysis of autoresearch’s enterprise implications, most organizational objectives lack this clarity. What is the val_bpb equivalent for customer satisfaction? For regulatory compliance? For brand trust?
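The contrast is worth making concrete. A bits-per-byte metric is "clean" because it is a two-line arithmetic function of two measurable quantities; whether autoresearch computes val_bpb exactly this way is an assumption on my part, but the common definition is:

```python
import math

def bits_per_byte(total_nats, n_bytes):
    """One common definition of a bits-per-byte metric: validation
    cross-entropy summed over tokens (in nats), converted to bits,
    divided by the raw byte count of the validation text. Assumed,
    not confirmed, to match autoresearch's val_bpb."""
    return total_nats / math.log(2) / n_bytes
```

There is no analogous two-line function for customer satisfaction, compliance, or brand trust, which is why "define the fitness function" is a governance decision wearing a technical costume.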
The agent broke established folklore about weight decay exclusions. This is valuable when the folklore is wrong. It is dangerous when the “folklore” is actually a safety constraint that exists for reasons the agent cannot understand.
Setting the compute budget is also a governance act. Five minutes per experiment limits blast radius. But the Hyperspace experiment shows what happens when the constraint loosens: 35 agents, 333 experiments, zero human checkpoints. The agents were bounded by compute, not by policy.
The measurement prerequisite is the unglamorous foundation of governed autonomous experimentation. Before deploying any autoresearch-like system, organizations need answers to: What metric are we optimizing? Does improving this metric guarantee we are improving the thing we actually care about? Who reviews when the metric improves but the outcome does not? How many unsupervised experiments are acceptable before a human checkpoint?
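Those questions can be captured as an enforceable artifact rather than a meeting topic. A sketch of what such a policy might look like, with every field name and threshold invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentPolicy:
    """Hypothetical governance policy for an autonomous experiment
    loop; nothing here is an existing standard."""
    metric: str              # what the agent optimizes, e.g. "val_bpb"
    max_unreviewed: int      # experiments allowed before a human checkpoint
    holdout_required: bool   # re-evaluate kept changes on a fresh holdout set
    reviewer: str            # who signs off when metric and outcome diverge

def needs_checkpoint(policy: ExperimentPolicy, since_review: int) -> bool:
    """Gate the loop: pause for human review once the budget is spent."""
    return since_review >= policy.max_unreviewed

policy = ExperimentPolicy("val_bpb", max_unreviewed=25,
                          holdout_required=True, reviewer="ml-lead")
```

The point is not the specific thresholds. It is that a loop wrapped in a policy object can be audited; a loop wrapped in good intentions cannot.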
The honest assessment
Autoresearch is a useful tool that has been wrapped in revolutionary framing. Karpathy himself is relatively measured: he notes fragility, disclaims production use, and invites scrutiny. The overclaiming comes primarily from VentureBeat’s editorial layer and from commercially interested parties like Hyperspace AI and Eric Siu, who projected “36,500 marketing experiments per year” without acknowledging that marketing experiments require real user traffic, not GPU cycles.
The genuine contribution is the LLM-as-hypothesis-generator pattern: using natural language reasoning to navigate a search space that traditional AutoML tools traverse mechanically. This is a real increment over Optuna and Ray Tune. It is not the automation of science.
For enterprise leaders, the lesson is not “deploy autoresearch.” It is that autonomous experimentation loops are now simple enough that any team with GPU access can run them. The governance question is not whether to allow autonomous AI experimentation. It is whether your organization will know when it is happening.
This analysis synthesizes VentureBeat’s coverage of autoresearch (March 2026), Karpathy’s session report on GitHub (March 2026), the EY Technology Pulse Poll on autonomous AI oversight (March 2026), the MLE-bench study on agent overfitting (arXiv 2507.02554), and Ken Huang’s enterprise analysis (March 2026).
Victorino Group helps organizations build the governance infrastructure that makes autonomous AI experimentation safe, auditable, and actually useful. Let’s talk.