46.5 Million Messages in 2 Hours: Why Agent Security Is an Architecture Problem
On February 28, 2026, a security firm called CodeWall pointed an autonomous testing agent at McKinsey’s internal AI platform. Within two hours, the agent had extracted 46.5 million chat messages, 728,000 files, 57,000 user accounts, and the system prompts governing 384 AI assistants used by 43,000 McKinsey employees.
The vulnerability was SQL injection. Not a novel exploit. Not an adversarial prompt. Not a sophisticated chain of AI-specific attacks. A JSON key injection where field names were concatenated directly into SQL queries. Twenty-two of more than 200 API endpoints had no authentication at all.
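To make the vulnerability class concrete, here is a minimal sketch of the pattern, not McKinsey's actual code: a query builder that concatenates JSON field names into SQL, next to a fixed version that allowlists column names and binds values as parameters. The table and column names are hypothetical.

```python
def build_update_unsafe(table, fields):
    # VULNERABLE: field names from a JSON payload are concatenated
    # directly into the SQL text. A key like
    # "title = (SELECT secret) --" rewrites the query itself.
    cols = ", ".join(f"{k} = ?" for k in fields)
    return f"UPDATE {table} SET {cols}", list(fields.values())

ALLOWED_COLUMNS = {"title", "body"}  # illustrative allowlist

def build_update_safe(table, fields):
    # Field names are checked against a fixed allowlist; only the
    # values are attacker-controlled, and those are bound as
    # parameters rather than spliced into the query text.
    for k in fields:
        if k not in ALLOWED_COLUMNS:
            raise ValueError(f"unexpected column: {k}")
    cols = ", ".join(f"{k} = ?" for k in fields)
    return f"UPDATE {table} SET {cols}", list(fields.values())
```

Note that parameterized values alone do not help here: the injection rides in the key, not the value, which is why the allowlist on field names is the essential fix.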
OWASP ZAP, the industry-standard vulnerability scanner, missed it entirely. CodeWall’s autonomous agent found it through error-message analysis across 15 blind iterations. The scanner followed playbooks. The agent explored.
McKinsey patched the vulnerability within 24 hours of disclosure, which is fast and responsible. But the breach exposes something more important than one company’s security posture. It reveals the structural fragility of AI platforms that were built for capability first and security second.
What Was Actually Exposed
The scale of the Lilli breach deserves specific attention because it illustrates what AI platforms actually contain.
Beyond the 46.5 million chat messages: 192,000 PDFs, 93,000 Excel files, 93,000 PowerPoints, 58,000 Word documents. These are the files McKinsey consultants uploaded to their AI assistants. Client strategy documents. Financial models. Due diligence reports. The intellectual product of the world’s most influential consulting firm, accessible through a single unauthenticated API.
Then the AI-specific data. 3.68 million RAG document chunks. 266,000 OpenAI vector stores. 95 system prompt configurations across 12 model types. 94,000 workspaces. 1.1 million files and messages exchanged via external API integrations.
The most dangerous finding was not the read access. It was the write access. CodeWall confirmed that the vulnerability enabled modification of system prompts. No deployment needed. No code change. A single UPDATE statement could have altered the instructions governing AI assistants used by 70 percent of McKinsey’s workforce, processing 500,000 prompts per month.
Consider what that means. An attacker with write access to system prompts could have instructed McKinsey’s AI assistants to exfiltrate data, inject false information into analyses, or subtly alter recommendations, all without any visible change to the user interface. The consultants would have continued using Lilli, trusting its outputs, unaware that the instructions beneath had been rewritten.
The Uncomfortable Pattern
As we explored in AI Governance IS Cybersecurity, the organizational separation between AI governance and cybersecurity creates exploitable seams. The McKinsey breach is the strongest evidence yet for that thesis.
Lilli is not an unusual platform. It is a standard enterprise AI deployment: a web application connecting users to LLMs, with document upload, retrieval-augmented generation, and workspace management. Every enterprise building internal AI tools is building some version of this architecture.
The vulnerability class is not unusual either. SQL injection has been in the OWASP Top 10 since the list was created. The security community has been writing about it for two decades. The fact that it appeared in a platform built by one of the world’s most resource-rich organizations tells us something about how AI platforms are being developed.
AI teams build AI platforms. They focus on model performance, prompt engineering, retrieval quality, and user experience. The security fundamentals of the web application layer receive less attention because the team’s expertise and incentives point elsewhere. This is not negligence. It is specialization without sufficient security review.
The pattern repeats. Prompt injection as a supply chain weapon documented how the Clinejection attack succeeded not because prompt injection was novel, but because conventional infrastructure misconfigurations (legacy tokens, overprivileged CI/CD environments) created the blast radius. The entry point was new. The damage mechanism was old.
McKinsey’s Lilli breach follows the same structure. The data at risk was AI-specific (prompts, RAG chunks, vector stores). The vulnerability was not.
OpenAI Changes the Frame
Two days after CodeWall’s public disclosure, OpenAI published a paper titled “Designing Agents to Resist Prompt Injection.” The timing may be coincidental. The content is not.
OpenAI’s core argument: prompt injection is evolving from a technical exploit into social engineering. Attackers are not just inserting malicious tokens. They are constructing context, urgency cues, and authority signals designed to manipulate an agent’s behavior the way a social engineer manipulates a human.
OpenAI documents a real attack from 2025, reported by Radware, where a prompt injection delivered via email successfully manipulated ChatGPT in approximately 50 percent of attempts. The payload did not look like code. It looked like a persuasive message.
This reframing matters because it changes the defensive calculus. You cannot filter social engineering at the input layer. You can train people to recognize it, but they still fall for it at measurable rates. The same will be true for AI agents.
OpenAI’s proposed defense is architectural, not behavioral. They introduce a source-sink analysis: an attacker needs both a source (the ability to influence the agent’s context) and a sink (a dangerous capability the agent can execute). Defense focuses on constraining sinks rather than detecting sources.
Their concrete mechanism is called “safe URL.” It detects when an agent is about to transmit information learned during a conversation to a third party and blocks the action. The agent is not told “do not exfiltrate data.” The system makes exfiltration structurally impossible for that specific vector.
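The idea can be sketched in a few lines. This is not OpenAI's implementation, only an illustration of the sink-constraint principle: the guard records sensitive strings the agent encounters during a session, then blocks any outbound URL to an unapproved host that carries one of those strings. All names here are invented for the example.

```python
from urllib.parse import urlparse

class SinkGuard:
    """Sketch of a safe-URL-style sink check: block outbound requests
    that would carry conversation-derived data to an unapproved host."""

    def __init__(self, approved_hosts):
        self.approved_hosts = set(approved_hosts)
        self.learned = set()  # sensitive strings seen during the session

    def observe(self, text):
        # Called whenever the agent ingests sensitive context.
        self.learned.add(text)

    def allow(self, url):
        host = urlparse(url).hostname or ""
        if host in self.approved_hosts:
            return True
        # Deny if any value learned in-session appears anywhere in
        # the URL (path, query string, fragment).
        return not any(v in url for v in self.learned)
```

The agent is never asked to behave; the check is deterministic and sits outside the model, so a successful injection cannot talk its way past it.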
The principle they articulate is precise: “Defending against [prompt injection] cannot rely only on filtering inputs. It also requires designing the system so that the impact of manipulation is constrained, even if some attacks succeed.”
This is blast radius control. Accept that some attacks will penetrate the perimeter. Design the system so penetration does not cascade into catastrophe.
Architecture, Not Prompting
As we examined in The Architecture of Agent Trust, reliable agent behavior comes from environmental constraints, not instructions. OpenAI’s framework provides the clearest articulation yet of why this principle extends to security.
OpenAI’s Instruction Hierarchy Challenge trains models to enforce trust hierarchies across four layers: system instructions, developer instructions, user instructions, and external context. Models learn to prioritize higher-trust instructions over lower-trust ones. When an email tells the agent to ignore its system prompt, the agent treats the email as lower-authority than the system prompt.
This is a training-level intervention. It does not depend on the prompt containing the words “ignore prompt injections.” It restructures the model’s decision process so that external inputs carry less weight than developer-defined boundaries.
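The hierarchy itself is simple to state. As a sketch (this is the concept, not OpenAI's training mechanism, and the directives are invented), conflicting instructions are resolved by trust level, with external content at the bottom:

```python
from enum import IntEnum

class Trust(IntEnum):
    EXTERNAL = 0   # retrieved documents, emails, web pages
    USER = 1
    DEVELOPER = 2
    SYSTEM = 3

def resolve(instructions):
    # instructions: list of (trust, directive) pairs competing over
    # the same decision. The highest-trust directive wins; Python's
    # max() keeps the earliest entry on a tie.
    return max(instructions, key=lambda pair: pair[0])[1]
```

In a trained model this ordering is internalized rather than applied as a lookup, but the effect is the same: an email demanding the agent ignore its system prompt is outranked by construction.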
The idealloc analysis of AI data breach architecture articulates the complementary principle for infrastructure: agents are fundamentally non-deterministic, so infrastructure controls must verify actions at the data layer. You cannot predict what an agent will attempt to do. You can control what it is permitted to do.
The practical implications are specific.
Credential scoping. An agent with email access can reset passwords and take over accounts. Granting tool access means granting the full capability surface of that tool, not just the intended use case. Short-lived credentials with narrow permissions limit what a compromised agent can reach.
Action verification. Every agent action passes through a deterministic verification layer before execution. The verification layer does not use an LLM. It applies rules: is this action within the agent’s permitted scope? Does it match the expected pattern? Does it target an authorized resource?
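A minimal version of such a layer, with a hypothetical policy shape and tool names chosen for illustration, looks like this:

```python
def verify_action(action, policy):
    """Deterministic pre-execution check. No LLM is involved: the
    policy maps each permitted tool to its authorized targets, and
    anything outside that map is refused before execution."""
    allowed_targets = policy.get(action["tool"])
    if allowed_targets is None:
        return False, "tool not in agent scope"
    if action["target"] not in allowed_targets:
        return False, "target not authorized"
    return True, "ok"
```

The point of keeping the verifier rule-based is that it cannot be prompt-injected: its decisions depend only on the policy, not on anything in the agent's context window.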
Circuit breakers. Anomaly detection triggers automatic containment. If an agent’s behavior deviates from its baseline (requesting unusual resources, operating at unusual volume, accessing data outside its normal scope), the system suspends the agent and alerts a human. Kiteworks found in 2026 that 63 percent of organizations cannot enforce AI purpose limits, and 60 percent cannot terminate misbehaving agents. Circuit breakers are the minimum viable containment.
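The simplest useful breaker is a volume threshold. This sketch (an assumption about the shape such a control might take, not a reference to any specific product) trips once an agent exceeds its baseline request budget and refuses everything after that until a human intervenes:

```python
class CircuitBreaker:
    """Suspend an agent whose request volume exceeds its baseline."""

    def __init__(self, max_requests_per_window):
        self.limit = max_requests_per_window
        self.count = 0
        self.tripped = False

    def record(self):
        # Returns True if the request may proceed.
        if self.tripped:
            return False
        self.count += 1
        if self.count > self.limit:
            self.tripped = True  # suspend the agent; alert a human
            return False
        return True

    def reset_window(self):
        # Called on a timer; a tripped breaker stays tripped until
        # a human clears it.
        self.count = 0
```

Production breakers would also watch resource types and access patterns, not just volume, but even this degenerate version satisfies the "can you terminate a misbehaving agent" test that 60 percent of organizations currently fail.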
Trust boundaries at the data layer. The agent’s runtime environment enforces boundaries that the agent itself cannot modify. Database permissions, network segmentation, API rate limits. These controls exist independent of the agent’s instructions and persist regardless of what the agent is told to do.
The Scanner Problem
One detail from the McKinsey breach deserves separate attention. OWASP ZAP, a widely deployed security scanner, tested the same endpoints CodeWall’s agent tested. ZAP found nothing. The agent found 22 unauthenticated endpoints and a SQL injection vulnerability that exposed 46.5 million records.
The difference is methodology. Scanners follow predefined attack patterns. They check known vulnerability signatures against known endpoints. CodeWall’s agent used error-message analysis: it sent malformed requests, studied the error responses, inferred the database structure, and iterated. Fifteen blind iterations to move from “this endpoint exists” to “this endpoint leaks data.”
This does not mean scanners are useless. It means that AI platforms present attack surfaces that traditional scanning methodologies were not designed to find. JSON key injection (where the field name itself is the attack vector, not the field value) is a variant that many scanners do not test for. The attack surface of an AI platform includes prompt templates, RAG retrieval paths, vector store configurations, and agent orchestration logic. None of these appear in a standard vulnerability scanner’s playbook.
Organizations relying exclusively on periodic scanning to secure their AI platforms are testing the lock on the front door while the loading dock stands open.
What McKinsey Got Right
Responsible analysis requires acknowledging what worked. McKinsey’s CISO responded within 24 hours of CodeWall’s disclosure. The vulnerability was patched. The response timeline (discovery February 28, disclosure March 1, patch March 2, public disclosure March 9) represents competent incident handling.
CodeWall is a security vendor. Their disclosure serves their commercial interest in demonstrating autonomous security testing. This does not invalidate the findings. The vulnerability was real. The data exposure was verifiable. But the framing (an autonomous agent finding what scanners missed) is also a sales pitch. Evaluate the technical claims independently of the marketing narrative.
The lesson is not that McKinsey failed. It is that even well-resourced organizations, organizations that advise other companies on digital transformation, build AI platforms with conventional web security vulnerabilities. If McKinsey’s platform had these exposures, the baseline assumption for every enterprise AI deployment should be that similar vulnerabilities exist until proven otherwise.
Three Questions for Every AI Platform
The convergence of these events (McKinsey’s breach, OpenAI’s architectural defense framework, and the growing body of evidence that agent security requires infrastructure controls) produces three questions every organization deploying AI should answer.
Can your AI platform’s system prompts be modified without a code deployment? If yes, whoever controls the data store controls the behavior of every AI assistant in your organization. This is not a theoretical risk. McKinsey’s Lilli platform demonstrated that a single SQL statement could rewrite the instructions for 384 AI assistants serving 43,000 users.
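One inexpensive mitigation, sketched here as an assumption about how a deployment might pin prompts rather than a description of any existing platform, is to refuse to serve a system prompt whose hash no longer matches the value recorded at deploy time:

```python
import hashlib

def fingerprint(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def load_prompt(stored_prompt: str, pinned_hash: str) -> str:
    """Compare the prompt read from the data store against the hash
    pinned in the deployment artifact. A bare UPDATE against the
    database then breaks loading loudly instead of silently changing
    agent behavior."""
    if fingerprint(stored_prompt) != pinned_hash:
        raise RuntimeError("system prompt modified outside deployment")
    return stored_prompt
```

This does not prevent the write; it converts a silent behavior change into a visible failure, which moves prompt modification back behind the code deployment process where it belongs.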
Does your security testing cover AI-specific attack surfaces? Standard vulnerability scanners test for standard vulnerabilities. JSON key injection, prompt template manipulation, RAG poisoning, and vector store corruption are not standard. If your security testing does not include these vectors, you are testing a subset of your actual attack surface.
Are your agent permissions scoped to minimum viable capability? Gartner reports that 40 percent of enterprise applications will feature AI agents by 2026, while only 6 percent of organizations have advanced AI security strategies. The default in most deployments is broad permissions, because broad permissions are easier to configure and agents work better with more access. Broad permissions also mean that every compromised agent is a skeleton key.
The answers to these questions determine whether an organization’s AI security posture is architectural or aspirational. Architecture constrains blast radius by design. Aspiration depends on nothing going wrong.
The McKinsey breach proved that things go wrong. OpenAI’s framework acknowledges that they will continue to go wrong. The only defensible position is building systems where “wrong” does not mean “catastrophic.”
This analysis synthesizes CodeWall’s McKinsey Lilli breach disclosure (March 2026), OpenAI’s prompt injection defense framework (March 2026), and idealloc’s analysis of AI data breach architecture (March 2026).
Victorino Group helps enterprises build agent security architecture that doesn’t depend on prompting for protection. Let’s talk.