Operating AI

25 Hours, 13 Million Tokens: What a Codex Marathon Reveals About Agent Memory

Thiago Victorino
8 min read

Derrick Choi from OpenAI published a cookbook entry this week about an experiment most engineers would consider implausible. GPT-5.3-Codex, running at maximum reasoning, built a complete design tool from scratch. Uninterrupted. For 25 hours. Consuming approximately 13 million tokens and generating around 30,000 lines of code.

The delivered application includes canvas editing, live collaboration, layers management, alignment guides, history with replay, prototype mode with hotspot navigation, threaded comments, and export to JSON and React+Tailwind. Ten major capability categories, end to end.

The numbers are impressive. They are also, by themselves, meaningless. A single experimental run by an OpenAI employee using the latest model at maximum settings does not establish a benchmark. What it establishes is a methodology — and that methodology is worth studying carefully.

Time Horizon Is the New Metric

The conceptual shift Choi identifies is subtle but important. “Agentic coding is increasingly about time horizon, not just one-shot intelligence.” The advance is not that models got smarter in any given moment. The advance is that agents can stay coherent for longer, complete larger work segments end to end, and recover from errors without losing the thread.

METR’s research quantifies this: the length of software tasks that frontier agents can complete reliably doubles roughly every seven months. If that trajectory holds, by early 2027 agents will reliably handle tasks that take days, not hours.

But coherence over time does not emerge from model capability alone. It requires operational infrastructure. And that infrastructure is what makes Choi’s experiment interesting.

The Four-Document Memory System

The methodology that kept Codex coherent for 25 hours is not a prompt engineering technique. It is an external memory architecture consisting of four markdown documents that the agent revisits throughout the session.

Prompt.md freezes the target. It contains goals, non-goals, hard constraints, explicit deliverables, and “done when” criteria. The purpose is blunt: “Freeze the target so the agent doesn’t build something impressive but wrong.” This is specification discipline applied to autonomous work. Every constraint that exists only in the operator’s head is a specification that the agent cannot follow.

Plan.md turns open-ended work into a sequence of checkpoints. Milestones are sized for single-loop completion. Each milestone has acceptance criteria paired with validation commands. The critical rule: “If validation fails, repair before moving on.” Decision notes are recorded to prevent circular reasoning — if the agent already decided against an approach, it should not reconsider it without new information.

Implement.md is the execution runbook. It designates Plan.md as the authoritative source and mandates post-milestone validation. Scope containment is explicit: each milestone’s changes must stay focused and reviewable. The directive is direct: “Follow the plan, keep diffs scoped, run validations, update docs.”

Documentation.md is the shared memory and audit log. Current milestone status, decision rationale, operational instructions, known issues. This document lets a human check in after hours of absence and understand exactly where the agent is and why it made the decisions it made.
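A minimal sketch of how such a scaffold could be bootstrapped. The four file names follow the article; the section headings inside each file are illustrative placeholders, not the contents of the original experiment's documents.

```python
from pathlib import Path

# Hypothetical starter contents for the four memory documents.
# The headings are placeholders, not taken from the original experiment.
SCAFFOLD = {
    "Prompt.md": "# Prompt\n\n## Goals\n\n## Non-goals\n\n## Hard constraints\n\n## Done when\n",
    "Plan.md": "# Plan\n\n## Milestones\n\n## Decision notes\n",
    "Implement.md": (
        "# Implement\n\nFollow Plan.md. After each milestone: "
        "run validations, keep diffs scoped, update docs.\n"
    ),
    "Documentation.md": "# Documentation\n\n## Current milestone\n\n## Decisions\n\n## Known issues\n",
}

def scaffold(root: str) -> None:
    """Create the four memory documents if they do not already exist."""
    base = Path(root)
    base.mkdir(parents=True, exist_ok=True)
    for name, body in SCAFFOLD.items():
        path = base / name
        if not path.exists():
            path.write_text(body)

scaffold("agent-session")
```

The point of the scaffold is not the contents but the discipline: the files exist before the agent starts, so every constraint, milestone, and decision has a designated place to live.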

Why This Works

The four documents solve different problems.

Prompt.md solves specification drift. Without a frozen target, the agent optimizes for local impressiveness — each step looks good, but the overall direction wanders. In a 5-minute task, this manifests as a minor misalignment. In a 25-hour task, it manifests as a completely wrong application.

Plan.md solves the coherence problem. An agent operating without milestones has no way to measure progress. It cannot distinguish between “making progress” and “generating tokens.” Milestones with validation criteria create objective checkpoints: either the build passes or it does not, either the tests pass or they do not.

Implement.md solves execution discipline. The gap between “having a plan” and “following a plan” is real even for autonomous agents. Without explicit instructions to follow the plan and validate at each step, the agent may take shortcuts that undermine the overall architecture.

Documentation.md solves inspectability. A 25-hour autonomous session is only practical if the operator can check in periodically, understand the state, and course-correct if needed. Without a status log, the operator must read 30,000 lines of code to understand what happened.

Milestone Verification as Governance

The most operationally significant pattern in the experiment is the verification gate. After every milestone, the agent runs lint, typecheck, build, and tests. No forward progress without passing all gates.

When lint failures occurred, the agent identified violations and corrected them before advancing. This is not a nice-to-have. For a 25-hour session, a bug introduced in hour two that is not caught until hour twenty produces cascading failures that may be unrecoverable.
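The gate itself can be a very small piece of infrastructure. This sketch assumes a Node-style toolchain; the `npm` commands are placeholders for whatever lint, typecheck, build, and test commands your project actually uses.

```python
import subprocess
import sys

# Hypothetical gate commands; substitute your project's real tooling.
GATES = [
    ("lint", ["npm", "run", "lint"]),
    ("typecheck", ["npm", "run", "typecheck"]),
    ("build", ["npm", "run", "build"]),
    ("test", ["npm", "test"]),
]

def run_gates(gates=GATES) -> bool:
    """Run each gate in order; stop at the first failure so errors never compound."""
    for name, cmd in gates:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"gate failed: {name}\n{result.stdout}{result.stderr}", file=sys.stderr)
            return False  # repair before moving on
    return True
```

Ordering matters: the cheap, fast gates (lint, typecheck) run before the expensive ones (build, test), so most failures surface in seconds rather than minutes.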

This pattern maps directly to what Factory’s Signals system does at the session level: automated verification preventing error accumulation. The implementation differs — Factory monitors user friction; Codex monitors code quality — but the architectural principle is identical: continuous verification prevents drift from compounding.

For organizations building autonomous workflows, this is the transferable insight. The verification gate is not optional infrastructure. It is the mechanism that makes extended autonomy possible. Without it, longer agent sessions produce more code and more problems in equal measure.

The Delegation Model

Choi frames the shift as moving from pair programming to delegation with guardrails. The human writes the specification, defines the milestones, and checks in at boundaries. The agent executes, verifies, and documents. Steering happens at milestones, not at every line.

This is not fundamentally different from how senior engineers delegate to junior engineers. The difference is scale: the agent does not take lunch breaks, does not context-switch between projects, and can be given a 25-hour uninterrupted block. The cost is tokens, not time.

The practical enabler is git worktrees. The agent works in an isolated branch. The developer continues their normal work in the main branch. When the agent completes, the developer reviews the diff and decides what to merge. Long autonomous work does not block anything.
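The isolation can be wrapped in a couple of lines. This is a sketch; the branch naming is illustrative, and it simply shells out to the standard `git worktree` subcommands.

```python
import subprocess

def start_agent_worktree(repo: str, branch: str, path: str) -> None:
    """Check out a new branch in an isolated worktree for the agent session."""
    subprocess.run(["git", "worktree", "add", "-b", branch, path], cwd=repo, check=True)

def finish_agent_worktree(repo: str, path: str) -> None:
    """Remove the worktree after the diff has been reviewed and merged."""
    subprocess.run(["git", "worktree", "remove", path], cwd=repo, check=True)
```

The agent works in `path` while the developer keeps working in the main checkout; when the session ends, `git diff main...<branch>` shows everything the agent produced, and nothing merges without review.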

What This Does Not Prove

Important caveats before applying this pattern.

First, this is a greenfield application. Building from scratch is the easiest case for autonomous generation. Modifying an existing codebase with legacy dependencies, incomplete documentation, and implicit architectural decisions is a fundamentally harder problem. The four-document system assumes you can specify the target completely, which is often impossible for maintenance work.

Second, 30,000 lines of generated code is not 30,000 lines of production-ready code. The article confirms that lint, typecheck, and build pass. It does not confirm security review, performance benchmarking, accessibility compliance, or long-term maintainability. Someone will need to maintain those 30,000 lines.

Third, the cost is undisclosed. Thirteen million tokens at GPT-5.3 maximum reasoning is not cheap. The economics of a 25-hour agent session matter for organizational decisions, and they are not addressed.

Fourth, the token-to-code ratio is high. Thirteen million tokens for 30,000 lines works out to roughly 433 tokens per line of code. That is an enormous overhead, reflecting the reasoning, rework, and context maintenance required for coherence. The efficiency will need to improve significantly for this to be economically practical at scale.
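The back-of-envelope arithmetic, with a deliberately hypothetical price per token since the actual cost is undisclosed:

```python
tokens = 13_000_000
lines = 30_000

tokens_per_line = tokens / lines  # ~433 tokens per line of code

# Hypothetical blended rate, for illustration only; real pricing
# for a GPT-5-class model at maximum reasoning is not disclosed.
usd_per_million_tokens = 10.0
session_cost = tokens / 1_000_000 * usd_per_million_tokens  # 130.0 at the assumed rate

print(f"{tokens_per_line:.0f} tokens/line, ~${session_cost:.0f} at the assumed rate")
```

Whatever the real rate, the structure of the calculation holds: the overhead is dominated by reasoning and rework tokens, not by the tokens that end up as code.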

The Pattern That Transfers

Strip away the GPT-5-specific details and a general pattern remains.

For any autonomous agent task longer than a few minutes:

Freeze the target in a specification document. Everything the agent needs to know about what “done” means should be written down, not assumed.

Decompose into verifiable milestones. Each milestone must have acceptance criteria that can be checked automatically. No forward progress without passing gates.

Create an execution runbook. Explicit instructions on how to follow the plan, when to validate, and what to do when validation fails.

Maintain a status log. The operator must be able to check in at any point, understand the current state, and decide whether to course-correct.
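The four steps above compose into a small operator-side loop. This is a sketch under the article's conventions: the milestone list is hypothetical (in practice it lives in Plan.md), the validation commands are placeholders, and the status log appends to Documentation.md.

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical milestones; in practice these live in Plan.md.
MILESTONES = [
    {"name": "canvas-editing", "validate": ["npm", "test", "--", "canvas"]},
    {"name": "layers", "validate": ["npm", "test", "--", "layers"]},
]

def log_status(entry: str, path: str = "Documentation.md") -> None:
    """Append a timestamped line to the shared status log."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a") as f:
        f.write(f"- {stamp} {entry}\n")

def run_milestones(milestones=MILESTONES, runner=subprocess.run) -> bool:
    """Advance milestone by milestone; a failed validation halts the run."""
    for m in milestones:
        result = runner(m["validate"], capture_output=True)
        if result.returncode != 0:
            log_status(f"FAILED {m['name']} -- repair before moving on")
            return False
        log_status(f"PASSED {m['name']}")
    return True
```

Everything the framework demands is visible here: verifiable gates between milestones, a halt-and-repair rule on failure, and a log the operator can read at any point without reading the diff.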

This framework works whether your agent runs for 25 minutes or 25 hours. It works whether you are using GPT-5 or a local model. The model provides the capability. The documents provide the governance. Neither is sufficient without the other.

That is what operating autonomous AI actually requires: not a better model, but a better contract between human intent and agent execution.

If this resonates, let's talk

We help companies implement AI without losing control.

Schedule a Conversation