Engineering Notes

What the Codex Agent Loop Reveals About Building Production AI Agents

Thiago Victorino

OpenAI recently published the first in a series of engineering articles dissecting the internals of Codex CLI, their open-source software development agent. Written by Michael Bolin, the article covers the agent loop — the core orchestration logic that connects users, models, and tools. The source code is available at github.com/openai/codex, written in Rust.

This is not a product announcement. It is one of the rare cases where a frontier AI lab publishes genuine engineering details about how their agent actually works in production. For teams building agentic systems, the lessons here go beyond OpenAI’s specific stack.

The Loop Itself Is Simple. The Engineering Around It Is Not.

Every AI agent follows the same basic pattern: take user input, query a model, execute tool calls if requested, feed results back to the model, repeat until the model returns a final message. Codex is no different.
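The pattern above can be sketched in a few lines. This is an illustrative toy, not the Codex implementation (which is written in Rust and calls the Responses API); `fake_model` and `run_tool` are stand-ins.

```python
# Minimal agent loop sketch. `fake_model` stands in for a real inference call.

def fake_model(history):
    # Pretend the model asks for one tool call, then finishes.
    if not any(item.get("type") == "tool_result" for item in history):
        return [{"type": "tool_call", "id": "c1", "name": "echo",
                 "arguments": {"text": "hi"}}]
    return [{"type": "message", "role": "assistant", "content": "done"}]

def run_tool(name, arguments):
    return arguments["text"].upper()  # stand-in tool executor

def agent_loop(user_message):
    history = [{"role": "user", "content": user_message}]
    while True:
        output = fake_model(history)           # one inference call per iteration
        history.extend(output)                 # full history is resent every turn
        tool_calls = [i for i in output if i.get("type") == "tool_call"]
        if not tool_calls:                     # no tool calls: final message, loop ends
            return output[-1]["content"]
        for call in tool_calls:                # execute tools, feed results back
            result = run_tool(call["name"], call["arguments"])
            history.append({"type": "tool_result", "call_id": call["id"],
                            "output": result})
```

Note that `history` only ever grows: every iteration appends the model's output and the tool results, which is exactly what makes the cost question below interesting.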

What matters is what Codex does around this loop to make it performant and reliable at scale.

Prompt Composition: Four Layers of Context

Each inference call to the Responses API includes three top-level fields: instructions, tools, and input. The input itself is assembled from four layers of context, followed by the actual user message:

  1. Permissions instructions (developer role) — sandbox description, writable folders, approval policies
  2. Developer instructions (developer role) — custom instructions from config.toml
  3. User instructions (user role) — aggregated from AGENTS.md files found between the git root and the current working directory, plus skill metadata
  4. Environment context (user role) — current working directory, shell type
  5. User message — the actual request

The roles follow a priority hierarchy: system > developer > user > assistant. This structure enables layered governance where platform-level safety rules cannot be overridden by user-provided project instructions.
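The layered assembly described above might look like the following sketch (field names and ordering are illustrative, not the exact Codex data structures):

```python
# Layered input assembly: static, policy-level content first, then
# progressively more user-specific content, ending with the request itself.

def build_input(permissions, dev_instructions, agents_md, env_context, user_message):
    return [
        {"role": "developer", "content": permissions},       # 1. sandbox/approval policy
        {"role": "developer", "content": dev_instructions},  # 2. config.toml instructions
        {"role": "user", "content": agents_md},              # 3. AGENTS.md + skill metadata
        {"role": "user", "content": env_context},            # 4. cwd, shell type
        {"role": "user", "content": user_message},           # 5. the actual request
    ]

msgs = build_input("sandboxed; writable: /tmp", "prefer small diffs",
                   "run tests before committing", "cwd=/repo shell=bash",
                   "fix the failing test")
assert [m["role"] for m in msgs] == ["developer", "developer", "user", "user", "user"]
```

Because developer-role messages come first and outrank user-role ones, a project's AGENTS.md cannot override the sandbox policy above it.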

The Quadratic Problem and Prompt Caching

Here is the non-obvious insight. Without optimization, the agent loop is quadratic in the total JSON sent to the API over a conversation. Each inference call includes the full conversation history, and each call adds more content to that history.

Prompt caching converts this to linear cost. The key constraint: cache hits only work on exact prefix matches. This means the beginning of the prompt must remain identical across calls. Static content (instructions, tools, environment) goes first. Dynamic content (tool calls and results) appends at the end.
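A back-of-envelope model makes the quadratic-vs-linear point concrete. The numbers are toy values; the shape of the growth is what matters.

```python
# Tokens processed across a conversation, with and without prefix caching.

def tokens_processed(turns, tokens_per_turn, cached):
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn
        # With an exact-prefix cache hit, only the newly appended suffix is
        # processed at full cost; without it, the whole history is reprocessed.
        total += tokens_per_turn if cached else history
    return total

assert tokens_processed(100, 500, cached=False) == 500 * 100 * 101 // 2  # quadratic: 2,525,000
assert tokens_processed(100, 500, cached=True) == 500 * 100              # linear: 50,000
```

A 100-turn session at 500 tokens per turn costs roughly 50x more without caching, and the gap widens with every additional turn.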

This has real engineering consequences:

  • MCP tool enumeration must be deterministic. Codex discovered a bug (PR #2611) where MCP tools were listed in inconsistent order, causing cache misses across every inference call.
  • Mid-conversation configuration changes are appended, not replaced. If the sandbox config or working directory changes, Codex adds a new message rather than modifying an earlier one, preserving the cache prefix.
  • Tool list changes from MCP servers mid-conversation are expensive. The notifications/tools/list_changed MCP notification can invalidate the entire cache.
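The deterministic-enumeration fix for the first bullet can be as simple as sorting tools by a stable key before serializing them into the prompt. This is a sketch of the idea, not the actual Codex patch:

```python
import json

# MCP servers may report tools in arbitrary order; sorting by name makes the
# serialized prefix byte-identical across inference calls, preserving cache hits.

def serialize_tools(tools):
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True)

a = serialize_tools([{"name": "read_file"}, {"name": "apply_patch"}])
b = serialize_tools([{"name": "apply_patch"}, {"name": "read_file"}])
assert a == b  # identical bytes → identical prefix → cache hit
```

Without the sort, the same two tools serialized in different orders produce different prefixes, and every subsequent inference call misses the cache.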

Context Compaction: Not Just Summarization

When the conversation exceeds a token threshold, Codex uses a dedicated /responses/compact endpoint. This does not simply summarize the conversation in natural language. It returns a list of items including an opaque encrypted_content that preserves the model’s latent understanding — meaning the compacted representation retains information that a text summary would lose.

The trade-off: compaction introduces a processing delay and the agent loses explicit access to older tool call details. But it keeps long-running sessions functional rather than crashing into context limits.
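A threshold-triggered compaction step might look like the sketch below. The /responses/compact endpoint is real per the article, but `compact_endpoint` here is a stand-in client and the threshold is an invented placeholder:

```python
TOKEN_THRESHOLD = 200_000  # placeholder; the real threshold is model-dependent

def compact_endpoint(history):
    # Stand-in for the real endpoint, which returns items including opaque
    # encrypted_content preserving the model's latent understanding.
    return [{"type": "compacted", "encrypted_content": "<opaque>"}]

def maybe_compact(history, token_count):
    if token_count <= TOKEN_THRESHOLD:
        return history
    # Older items are replaced by the compacted representation; explicit access
    # to old tool-call details is lost, but the session keeps running.
    return compact_endpoint(history)

assert maybe_compact([{"role": "user", "content": "hi"}], 1_000)[0]["role"] == "user"
assert maybe_compact([{}] * 500, 300_000)[0]["type"] == "compacted"
```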

Stateless by Design

Codex does not use the previous_response_id parameter that the Responses API offers for server-side state management. Every request is fully stateless. This simplifies the architecture and enables Zero Data Retention (ZDR) configurations, but at the cost of sending the full conversation JSON on every call.

The reasoning tokens from prior turns are preserved through encrypted_content — the server stores the decryption key but not the data itself. This is a pragmatic solution for enterprise customers who need both conversation continuity and data compliance.
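In request terms, the stateless pattern means there is no previous_response_id: every call carries the full history, with prior-turn reasoning riding along as encrypted items. The request shape below is illustrative, not the exact API schema:

```python
# Stateless request construction: no server-side conversation handle.

def build_request(instructions, tools, history):
    return {
        "instructions": instructions,
        "tools": tools,
        "input": history,   # full conversation resent on every call
        "store": False,     # nothing persisted server-side (ZDR-friendly)
    }

history = [
    {"role": "user", "content": "refactor this module"},
    {"type": "reasoning", "encrypted_content": "<opaque>"},  # prior-turn reasoning
    {"type": "tool_result", "output": "tests pass"},
]
req = build_request("be careful", [], history)
assert "previous_response_id" not in req and len(req["input"]) == 3
```

The server can decrypt the reasoning items at inference time but never stores the conversation itself, which is how continuity and ZDR coexist.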

What This Means for Engineering Teams

1. Cache-Aware Prompt Architecture Is Non-Negotiable

If you are building an agent loop, prompt caching is not an optimization; it is a requirement. Without it, total token cost grows quadratically with conversation length, and per-call latency climbs with every turn. This means:

  • Keep your system prompt, tools, and instructions stable across inference calls
  • Enumerate tools in a deterministic order
  • Append rather than modify when context changes

2. MCP Is Powerful but Introduces Cache Fragility

The Model Context Protocol enables powerful tool extensibility, but it introduces a subtle failure mode: if an MCP server changes its tool list mid-conversation, you lose your entire prompt cache. Teams adopting MCP should batch tool changes and avoid dynamic tool registration during active conversations.

3. LLM-Agnosticism Has Limits

The Codex CLI is designed to work with any provider implementing the Responses API — including local models via Ollama or LM Studio. However, the model-specific system prompts (e.g., gpt-5.2-codex_prompt.md) and the opaque encrypted_content for reasoning tokens suggest that full performance requires OpenAI models. Local models will run the loop but may lack the specialized training that makes the agent effective at software tasks.

4. The Agent Loop Is the Easy Part

The loop pattern (input -> inference -> tool call -> repeat) is well-understood. The hard engineering is in context management: deciding what goes in the prompt, maintaining cache efficiency, handling compaction gracefully, and ensuring stateless operation. These are infrastructure concerns, not AI concerns.

5. Open Source Transparency Sets a New Standard

Publishing the full agent harness with links to specific code lines, PRs, and design decisions is valuable. It allows the community to learn from production engineering rather than marketing abstractions. Other agent builders — Anthropic with Claude Code, Cursor, Windsurf — face pressure to match this level of transparency.

Critical Perspective

Several things the article does not address:

  • Error recovery. What happens when a tool call fails? When the model produces malformed JSON? When a compaction loses critical context? The article describes the happy path.
  • Reliability metrics. No data on success rates, error rates, or recovery strategies.
  • Cost. The stateless design means sending the full conversation on every call. For long sessions with many tool calls, this adds up.
  • Model dependency. Despite the LLM-agnostic framing, the architecture is deeply coupled to OpenAI’s Responses API spec. Switching to a different API format requires significant adaptation.

These are not criticisms of the article — it explicitly positions itself as the first in a series, with sandboxing and tool implementation coming later. But engineers should not treat this as a complete guide to building production agents.

Conclusion

The Codex agent loop article is most valuable as an engineering case study in context management at scale. The core lessons — cache-aware prompt design, deterministic tool enumeration, append-only context updates, latent-space compaction — are applicable to any team building agentic AI systems, regardless of which LLM provider they use.

The agent loop pattern itself is commodity knowledge. The engineering that makes it work in production is not.


Source: Unrolling the Codex Agent Loop by Michael Bolin, OpenAI Engineering, January 23, 2026.

If this resonates, let's talk

We help companies implement AI without losing control.

Schedule a Conversation