Containment Is Architecture: Letting AI Draft a Database Change Without Touching Production

The most important passage in OpenAI’s SchemaFlow reference design is a list of things the system refuses to do. It does not execute SQL. It does not apply migrations. It does not open pull requests. It does not touch production. It produces a bundle of reviewable artifacts, and then it stops.

That refusal is the whole design. SchemaFlow, published in the OpenAI Cookbook as a partner reference architecture, takes the riskiest place to put an AI (the layer where one bad statement drops a table or locks a hot index) and shows how to extract real work from a model while the keys stay in human hands. The containment lives inside the shape of the pipeline, encoded in the stages themselves rather than bolted on afterward.

Five Narrow Agents, Not One Smart One

The instinct with database work is to ask a capable model for the SQL and review what comes back. SchemaFlow rejects that shape. The cookbook decomposes the task into five staged agents, each with a single job and a typed output.

First, a parse stage reads the requested change and extracts intent into a structured form. Second, an impact analysis stage examines what the change touches: dependent objects, downstream consumers, the blast radius. Third, an execution plan stage produces prechecks, postchecks, and a rollback path before any SQL exists. Fourth, the SQL generation stage writes the actual statements, now constrained by everything the earlier stages established. Fifth, a deterministic validation stage checks the output against rules that do not depend on a model’s judgment.

The ordering carries the argument. SQL arrives fourth, after the groundwork. By the time the model writes a statement, impact has already been mapped and a rollback already exists. The agent that writes SQL works only on changes whose consequences have already been examined, because that examination was a separate, earlier stage with its own output.
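The ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the cookbook's code: the stage functions are hypothetical stubs standing in for model calls, and the artifact keys (`parse`, `impact`, `plan`, `sql`, `validate`) are names assumed for this example.

```python
from typing import Callable

Artifacts = dict[str, dict]
StageFn = Callable[[str, Artifacts], dict]

def parse_request(request: str, done: Artifacts) -> dict:
    # Stub for a model call that extracts structured intent (hypothetical output).
    return {"table": "orders", "action": "add_column", "column": "shipped_at"}

def analyze_impact(request: str, done: Artifacts) -> dict:
    intent = done["parse"]
    return {"dependent_objects": [f"view:{intent['table']}_summary"]}

def plan_execution(request: str, done: Artifacts) -> dict:
    intent = done["parse"]
    return {
        "prechecks": [f"column {intent['column']} does not exist"],
        "postchecks": [f"column {intent['column']} exists"],
        "rollback": f"ALTER TABLE {intent['table']} DROP COLUMN {intent['column']};",
    }

def generate_sql(request: str, done: Artifacts) -> dict:
    # SQL generation refuses to run until impact and plan artifacts exist.
    missing = [name for name in ("impact", "plan") if name not in done]
    if missing:
        raise RuntimeError(f"cannot generate SQL before stages: {missing}")
    intent = done["parse"]
    statement = f"ALTER TABLE {intent['table']} ADD COLUMN {intent['column']} TIMESTAMP;"
    return {"statements": [statement]}

def validate(request: str, done: Artifacts) -> dict:
    # Deterministic check: plain code, no model call.
    statements = done["sql"]["statements"]
    return {"passed": all(s.rstrip().endswith(";") for s in statements)}

STAGES: list[tuple[str, StageFn]] = [
    ("parse", parse_request),
    ("impact", analyze_impact),
    ("plan", plan_execution),
    ("sql", generate_sql),
    ("validate", validate),
]

def run_pipeline(request: str, stages: list[tuple[str, StageFn]] = STAGES) -> Artifacts:
    done: Artifacts = {}
    for name, stage in stages:
        done[name] = stage(request, done)  # each stage sees only earlier artifacts
    return done
```

Because each stage records its own artifact under its own key, a reviewer can inspect any of the five outputs independently, and a pipeline that tries to skip ahead to SQL fails before a statement is written.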

A single large prompt could in theory do all of this. It would also collapse the boundaries that make the work auditable. When one agent parses, analyzes, plans, writes, and validates in a single pass, the reasoning blurs into one opaque step you cannot inspect. Five stages give you five inspection points.

Typed Artifacts Are the Control Surface

Each stage emits a structured artifact, and the structure is enforced. The cookbook uses Pydantic output-schema contracts so that a stage cannot hand the next stage a blob of prose. Impact analysis produces an impact object with defined fields. The execution plan produces a plan object with prechecks, postchecks, and rollback as first-class elements. The contract is the handoff.
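As a rough sketch of what such a contract might look like, the example below defines an impact object and a plan object. It deliberately uses standard-library dataclasses as a stand-in for the Pydantic models the cookbook describes, so that it runs without third-party packages. The class names, field names, and the `parse_artifact` helper are all assumptions made for illustration, not the cookbook's schema.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ImpactReport:
    # Hypothetical fields; the cookbook's actual schema may differ.
    dependent_objects: list
    downstream_consumers: list
    blast_radius: str  # one of "low", "medium", "high"

    def __post_init__(self):
        for name in ("dependent_objects", "downstream_consumers"):
            value = getattr(self, name)
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                raise TypeError(f"{name} must be a list of strings")
        if self.blast_radius not in {"low", "medium", "high"}:
            raise ValueError(f"blast_radius must be low, medium, or high, got {self.blast_radius!r}")

@dataclass(frozen=True)
class ExecutionPlan:
    # Prechecks, postchecks, and rollback are required, first-class fields.
    prechecks: list
    postchecks: list
    rollback: str

    def __post_init__(self):
        if not isinstance(self.rollback, str) or not self.rollback.strip():
            raise ValueError("rollback must be a non-empty string")

def parse_artifact(cls, raw):
    """Build a typed artifact from a raw stage output, rejecting anything off-contract."""
    if not isinstance(raw, dict):
        raise ValueError(f"{cls.__name__}: expected an object, got {type(raw).__name__}")
    expected = {f.name for f in fields(cls)}
    if set(raw) != expected:
        raise ValueError(f"{cls.__name__}: expected fields {sorted(expected)}, got {sorted(raw)}")
    return cls(**raw)
```

A stage that returns a paragraph of prose, or an object with missing or extra fields, never becomes an `ImpactReport` or `ExecutionPlan`, so the next stage has nothing loosely shaped to interpret.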

This matters more than it first appears. Free-text handoffs between AI steps fail silently. A stage produces something plausible, the next stage interprets it loosely, and an error propagates downstream wearing the costume of a valid result. Typed contracts make the failure loud. When impact analysis fails to produce a conforming impact object, the pipeline stops at that boundary and leaves the malformed assumption behind, well short of SQL generation.
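To make the "loud failure" concrete, here is a small sketch of a pipeline halting at a stage boundary. Everything in it is hypothetical: the `ContractError` exception, the `require_fields` helper, and the two stub stages are illustration only, with the impact stub deliberately returning free text the way a misbehaving model might.

```python
class ContractError(Exception):
    """Raised when a stage output does not match its declared contract."""

def require_fields(stage: str, artifact: object, required: dict) -> dict:
    # Reject prose, missing fields, and wrongly typed fields at the boundary.
    if not isinstance(artifact, dict):
        raise ContractError(f"{stage}: expected an object, got {type(artifact).__name__}")
    for name, expected_type in required.items():
        if name not in artifact:
            raise ContractError(f"{stage}: missing field {name!r}")
        if not isinstance(artifact[name], expected_type):
            raise ContractError(f"{stage}: field {name!r} must be {expected_type.__name__}")
    return artifact

def impact_stage_returning_prose(intent: dict) -> object:
    # Simulates a model that answers in plausible prose instead of structure.
    return "This change should be safe; nothing important depends on it."

def sql_stage(impact: dict) -> dict:
    return {"statements": ["ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMP;"]}

def run(intent: dict, impact_stage, ran: list) -> dict:
    ran.append("impact")
    raw = impact_stage(intent)
    # The boundary check: a malformed impact artifact stops the run here.
    impact = require_fields("impact", raw, {"dependent_objects": list, "blast_radius": str})
    ran.append("sql")
    return sql_stage(impact)
```

When the impact stage returns prose, `run` raises `ContractError` and the `ran` list contains only `"impact"`: the SQL stage is never reached.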

Between the stages sit deterministic validation gates. These are rules, executed as plain code rather than model calls. They catch the class of silent failure that probabilistic checks miss, the cases where the output looks right to another model but violates a hard constraint. A human reviewer at the end reads artifacts that already passed deterministic checks, which frees the review to focus on judgment while the mechanical errors a rule could catch are already gone.
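A deterministic gate of this kind could be as simple as the sketch below. The three rules are invented for illustration and are not taken from the cookbook, and the regular expressions are intentionally naive (a real gate would more likely use a proper SQL parser). The point is only that the check is ordinary code with the same answer every time.

```python
import re

# Illustrative rules only; these are assumptions, not the cookbook's rule set.
DESTRUCTIVE = re.compile(r"\s*(DROP\s+TABLE|TRUNCATE)\b", re.IGNORECASE)
UNBOUNDED_WRITE = re.compile(
    r"\s*(UPDATE|DELETE)\b(?!.*\bWHERE\b)", re.IGNORECASE | re.DOTALL
)

def check_statements(statements: list, rollback: str) -> list:
    """Return a list of rule violations; an empty list means the gate passes."""
    violations = []
    for index, statement in enumerate(statements):
        if DESTRUCTIVE.match(statement):
            violations.append(f"statement {index}: destructive statement not allowed")
        if UNBOUNDED_WRITE.match(statement):
            violations.append(f"statement {index}: UPDATE/DELETE without WHERE")
    if not rollback.strip():
        violations.append("plan: rollback is empty")
    return violations
```

A gate like this says nothing about whether a change is wise. It only guarantees that the artifacts a reviewer reads have already cleared the constraints a rule can express.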

The Draft-Not-Execute Boundary

The hardest and most valuable decision in the design is the one that limits its own power. SchemaFlow draws a line at execution and holds it. The model’s entire output is a proposal. A human applies it, or sets it aside.

This inverts the usual risk equation for AI on infrastructure. The danger with autonomous agents on production systems lies less in their reasoning than in the moment a plausible-looking action with real-world effects runs before anyone catches the flaw. Remove execution from the agent’s reach and the worst case changes character entirely. A wrong SchemaFlow output is a bad document; a wrong autonomous-agent output is a bad migration already running against production.
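One way to picture the boundary is a final step whose only capability is writing files. The sketch below is an assumption-laden illustration: the `Proposal` type, the `write_bundle` function, and the file names are hypothetical, and nothing in the module imports a database driver or holds a connection.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Proposal:
    # Hypothetical bundle shape: the reviewed artifacts, and nothing executable.
    impact: dict
    plan: dict
    statements: tuple

def write_bundle(proposal: Proposal, out_dir: Path) -> list:
    """Write the proposal to disk for human review. This is the pipeline's last act."""
    out_dir.mkdir(parents=True, exist_ok=True)
    files = {
        "impact.json": json.dumps(proposal.impact, indent=2),
        "plan.json": json.dumps(proposal.plan, indent=2),
        "migration.sql": "\n".join(proposal.statements) + "\n",
    }
    written = []
    for name, body in files.items():
        path = out_dir / name
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written

if __name__ == "__main__":
    proposal = Proposal(
        impact={"blast_radius": "low"},
        plan={"rollback": "ALTER TABLE orders DROP COLUMN shipped_at;"},
        statements=("ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMP;",),
    )
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_bundle(proposal, Path(tmp) / "bundle"):
            print(path.name)
```

The worst outcome of this code is a misleading file in a review folder. Applying `migration.sql` remains a separate act performed by a person with their own credentials.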

We made a related argument in our piece on the agent permissions model as system of record: safety turns on what the agent is allowed to touch, far more than on how smart it is. SchemaFlow is that principle expressed as an architecture for the data layer specifically. The agents are narrow on purpose, and the most important narrowing sits at the point of action.

It also changes what review means. Because the system stops at a draft, the human reviews a proposal at rest rather than rubber-stamping an action already in flight. They read an impact analysis, a rollback plan, and a set of statements, and decide whether to proceed. The architecture creates the conditions for a real decision in place of a reflexive approval.

Repeatability as a First-Class Concern

The cookbook integrates Promptfoo for evaluation, which addresses a problem most AI tooling ignores until it bites. Prompts drift. Models get swapped. A pipeline that worked last quarter quietly degrades, and the failure stays invisible until an output is wrong in production. Building eval into the design means the staged agents can be reassessed whenever a prompt or model changes, against a known set of cases.

This is the same discipline that separates production agent systems from demos. We covered the broader pattern in the agent operations stack: readiness is something you measure rather than assert. SchemaFlow applies that to the data layer by making the pipeline testable. The artifacts are typed, so they can be asserted against. The stages are discrete, so a regression localizes to a stage. Evaluation belongs to the core design, part of how the system stays trustworthy as its parts change.
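The idea that a regression localizes to a stage can be shown with a tiny harness. This is a plain-Python stand-in written for illustration, not Promptfoo and not its configuration format. The pipelines, case descriptions, and artifact fields are all hypothetical.

```python
from typing import Callable

# A case is (stage name, description, assertion over that stage's artifact).
Case = tuple[str, str, Callable[[dict], bool]]

def evaluate(pipeline: Callable[[str], dict], request: str, cases: list) -> dict:
    """Run the pipeline once and group failed assertions by the stage they target."""
    artifacts = pipeline(request)
    failures: dict = {}
    for stage, description, check in cases:
        if not check(artifacts[stage]):
            failures.setdefault(stage, []).append(description)
    return failures

def baseline_pipeline(request: str) -> dict:
    # Stub artifacts representing last quarter's known-good behavior.
    return {
        "impact": {"dependent_objects": ["view:orders_summary"]},
        "plan": {
            "prechecks": ["column shipped_at does not exist"],
            "rollback": "ALTER TABLE orders DROP COLUMN shipped_at;",
        },
    }

def regressed_pipeline(request: str) -> dict:
    # Simulates a prompt or model change that silently drops the rollback.
    artifacts = baseline_pipeline(request)
    artifacts["plan"] = {**artifacts["plan"], "rollback": ""}
    return artifacts

CASES: list = [
    ("impact", "impact lists the dependent view",
     lambda a: "view:orders_summary" in a["dependent_objects"]),
    ("plan", "plan has at least one precheck", lambda a: len(a["prechecks"]) > 0),
    ("plan", "plan includes a rollback", lambda a: bool(a["rollback"].strip())),
]
```

Running `evaluate` against the baseline returns an empty dictionary, while the regressed pipeline reports a single failure filed under `plan`, which tells you which stage's prompt or model to look at first.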

What This Pattern Covers and What It Leaves Open

Two honest caveats. SchemaFlow is vendor-authored, an OpenAI partner cookbook rather than an independent benchmark. Read it as a reference architecture and a set of design choices worth adopting, while treating any specific outcome as unproven. The cookbook shows a pattern. How much risk it removes in your environment is for you to measure.

And the pattern stays narrower than full automation. SchemaFlow stops well short of hands-off database operations. It promises that the AI portion of the work ends at a reviewable artifact, with a human owning the act of execution. For teams whose instinct is to chase autonomy, that can read as a limitation. The narrowing is exactly what makes the AI usable on a system where mistakes are expensive.

The connection to hallucination control is direct. We have written about building systems that contain hallucinations rather than hoping models stop producing them. SchemaFlow is that thesis applied where it matters most. It assumes the model will sometimes be wrong and arranges the architecture so a wrong output meets a deterministic gate or a human reviewer, never a live database.

Do This Now

If you are letting AI touch your data layer, or planning to, copy the boundary first and the cleverness second. Decompose the task into narrow staged agents with typed outputs. Put impact analysis and a rollback plan ahead of SQL generation. Add deterministic validation gates between stages so silent failures stop at a boundary instead of propagating. And draw the line at execution: the AI drafts, a human applies. The safety mechanism lives in the architecture, not in the intelligence of your model.

This analysis synthesizes OpenAI’s Database Change Analysis (SchemaFlow Design Guide) (OpenAI Cookbook, June 2026).

Victorino Group designs containment architecture for AI on production data systems.