Zero Tenant Leaks at 0.89 Recall: Isolation Lives in the Cluster

Elastic Search Labs published a build in June 2026 that moves the agent-memory conversation from policy to measurement. A persistent multi-tenant memory layer scored 0.89 recall@10 across a 168-question evaluation, with zero cross-tenant leaks. The isolation number is the one that matters. Not because zero is a marketing round number, but because of where the zero came from: the database enforced it, not the application, and not the prompt.

That single design decision is the whole argument. Most teams building agent memory put tenant boundaries in the retrieval layer, then trust the agent to respect them. Elastic put the boundary in Document-Level Security, expressed through API-key role descriptors. The agent cannot query across tenants because the query engine refuses to return documents the key is not scoped to. The boundary holds whether the prompt is well-behaved or hostile.

The Recall Numbers, Read Correctly

The headline is 0.89 recall@10 on average. The breakdown tells you more than the average does.

Elastic reports semantic memory at 0.81, episodic at 0.98, and procedural at 1.0. The spread is the signal. Procedural memory (how to do a thing, the steps and tools) and episodic memory (what happened, when, in what order) are highly structured and retrieve almost perfectly. Semantic memory (general facts and learned preferences) is fuzzier and pulls the average down. A team treating “agent memory” as one undifferentiated store would never see this. The architecture splits memory into separate indices precisely because the retrieval characteristics differ, and the eval confirms the split was right.
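To make the per-type breakdown concrete, here is a minimal sketch of how recall@10 can be computed per question and then grouped by memory type. The function names and the tuple layout are assumptions for illustration, not Elastic's evaluation harness. Note that the overall figure here is a mean over questions, so a memory type with more questions pulls the average harder than a type with fewer.

```python
from collections import defaultdict
from statistics import mean


def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("each question needs at least one relevant id")
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)


def recall_report(questions, k=10):
    """questions: iterable of (memory_type, retrieved_ids, relevant_ids).

    Returns the mean recall@k per memory type plus an "overall" entry
    averaged over all questions (not over the per-type means).
    """
    per_type = defaultdict(list)
    for memory_type, retrieved, relevant in questions:
        per_type[memory_type].append(recall_at_k(retrieved, relevant, k))
    report = {t: mean(scores) for t, scores in per_type.items()}
    report["overall"] = mean(s for scores in per_type.values() for s in scores)
    return report


# Toy data, invented for illustration only.
TOY_QUESTIONS = [
    ("semantic", ["m1", "m9"], ["m1", "m2"]),
    ("semantic", ["m3"], ["m3"]),
    ("procedural", ["p1", "p2"], ["p2"]),
]
```

Reporting the per-type rows next to the overall number is what exposes a weak store that an average alone would hide.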

The layout is three memory indices plus one catalog. Write refresh is under 100 milliseconds, so a fact the agent just learned is queryable almost immediately. That latency decides whether the agent remembers what you told it ten seconds ago or has already forgotten.

One line from the writeup deserves to be quoted exactly, because it reframes a comfortable assumption: “A 1M-token context window is a scratchpad. It is not a memory system.” Stuffing history into a long context is recall by brute force. It does not survive the session, it does not isolate by tenant, and it does not let you audit what the agent knew at decision time.

Isolation as a Database Property

Document-Level Security is the mechanism worth copying. Each tenant’s memories carry access metadata. Each agent operates under an API key whose role descriptor scopes it to exactly one tenant. Retrieval runs through that key. The filter is not a WHERE tenant_id = ? clause the application appends and might forget. It is enforced by the engine on every query, including the ones a compromised or confused agent might try to run.
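As a rough sketch of what a tenant-scoped credential can look like, the snippet below builds a request body for Elasticsearch's create API key endpoint (POST /_security/api_key), where each index entry in a role descriptor accepts a query that restricts which documents the key can read. The index names, the tenant_id field, and the helper name are hypothetical, and the body is only constructed here, not sent to a cluster.

```python
# Hypothetical index names; the writeup's actual names are not given here.
MEMORY_INDICES = ["memory-semantic", "memory-episodic", "memory-procedural"]


def tenant_api_key_body(tenant_id, indices=MEMORY_INDICES):
    """Build a create-API-key body whose only role reads one tenant's documents.

    Assumes every memory document carries a keyword field named `tenant_id`.
    """
    if not tenant_id:
        raise ValueError("tenant_id must be a non-empty string")
    return {
        "name": f"agent-memory-{tenant_id}",
        "role_descriptors": {
            f"tenant-{tenant_id}-reader": {
                "indices": [
                    {
                        "names": list(indices),
                        "privileges": ["read"],
                        # Document-Level Security query: the engine applies it
                        # to every search made with this key.
                        "query": {"term": {"tenant_id": tenant_id}},
                    }
                ]
            }
        },
    }
```

Two details are worth keeping in mind when adapting this. The document-level query restricts reads, so write paths need their own controls. And an API key cannot exceed the permissions of the user who created it, so the creating account's privileges matter as well.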

This is the part the industry keeps getting wrong. When isolation lives in application code, every new retrieval path is a new chance to leak. A developer adds a feature, writes a query, forgets the tenant filter, and the boundary is gone. When isolation lives in the cluster as a property of the credential, no query made with that credential bypasses it. The agent could be fully prompt-injected and still see only its own tenant’s data. The leak it is trying to cause is a query the engine refuses to serve, and no prompt instruction changes that.

Elastic’s eval design reflects this. The 168 questions include cross-tenant probes. Zero leaks across that set is a claim about the architecture, not a claim about the agent’s good behavior. That distinction is the entire reason the number is credible.
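A cross-tenant probe can be sketched as an ordinary test. The example below uses an in-memory list as a stand-in for the cluster and two hypothetical clients: one whose scope is fixed by its credential, and one that trusts a tenant value passed in by the caller. All names and data are invented for illustration. The point is that the same probe set returns zero leaks for the first and a nonzero count for the second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    tenant_id: str
    text: str


class ScopedClient:
    """Stand-in for a client bound to a tenant-scoped credential."""

    def __init__(self, store, tenant_id):
        self._store = store
        self._tenant_id = tenant_id

    def search(self, term, requested_tenant=None):
        # The scope comes from the credential; requested_tenant is ignored.
        return [m for m in self._store
                if m.tenant_id == self._tenant_id and term.lower() in m.text.lower()]


class LeakyClient(ScopedClient):
    """Application-layer filtering that trusts whatever tenant the caller names."""

    def search(self, term, requested_tenant=None):
        tenant = requested_tenant or self._tenant_id
        return [m for m in self._store
                if m.tenant_id == tenant and term.lower() in m.text.lower()]


def count_leaks(client, own_tenant, probes):
    """Run (term, target_tenant) probes and count hits from any other tenant."""
    leaks = 0
    for term, target_tenant in probes:
        hits = client.search(term, requested_tenant=target_tenant)
        leaks += sum(1 for m in hits if m.tenant_id != own_tenant)
    return leaks


STORE = [
    Memory("acme", "Acme ships to 12 Oak St"),
    Memory("globex", "Globex ships to 98 Pine Ave"),
]
PROBES = [("ships", "globex"), ("Pine", "globex")]
```

Running count_leaks in a test suite turns a boundary regression into a failed assertion instead of an incident.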

Contradictions Get an Audit Trail, Not a Delete

The second design decision is quieter and just as important. When a new fact contradicts a stored one, the system does not overwrite or delete the old fact. It supersedes it and keeps the trail.

Consider what deletion costs you. An agent learns a customer’s shipping address, then learns a new one. Delete the old address and you have a cleaner store and a blind spot. You can no longer answer “what address did the agent have when it shipped the wrong order in March?” Supersession keeps both facts, marks which is current, and records when the change happened and why. Memory integrity and auditability come from the same mechanism.
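The address example can be sketched as an append-only log in which a new fact closes out the old one instead of replacing it. This is a minimal illustration with hypothetical class and field names, not the schema from the writeup.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    valid_from: datetime
    reason: str
    superseded_at: Optional[datetime] = None  # None means "still current"


class FactLog:
    """Append-only store: contradicting facts supersede, nothing is deleted."""

    def __init__(self):
        self._facts = []

    def record(self, key, value, at, reason):
        current = self.current(key)
        if current is not None:
            current.superseded_at = at  # keep the old fact, mark when it ended
        self._facts.append(Fact(key, value, at, reason))

    def current(self, key):
        for fact in reversed(self._facts):
            if fact.key == key and fact.superseded_at is None:
                return fact
        return None

    def as_of(self, key, when):
        """Return the fact that was current at `when`, or None."""
        for fact in self._facts:
            ended = fact.superseded_at
            if fact.key == key and fact.valid_from <= when and (ended is None or when < ended):
                return fact
        return None

    def history(self, key):
        return [fact for fact in self._facts if fact.key == key]
```

With this shape, the question about the March shipment becomes a single as_of lookup, and the reason field records why the change happened.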

This matters for anyone operating under a regulator’s gaze. The right-to-erasure conversation around agent memory usually treats deletion as the goal. Supersession reframes it: you can prove what the agent knew at any point in time, which is exactly what an audit or an incident reconstruction requires. Erasure of a specific subject’s data becomes a targeted operation on identifiable records, not a blind purge of a compressed blob.

Two more parameters from the writeup show the same governance instinct. Memory decays on a 180-day offset with a five-year half-life, so old facts lose weight without vanishing. A source-prior coefficient of 0.85 weights where a memory came from, so a fact from a trusted source outranks a fact from a noisy one. Both are knobs that belong to an operator, set in infrastructure, not improvised by the agent at runtime.
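One plausible reading of those decay parameters is an exponential curve that stays flat for the first 180 days and then halves every five years, which matches the shape of Elasticsearch's exp decay function with an offset. The sketch below assumes that reading. How the 0.85 source-prior coefficient enters the score is not described here, so the multiplier in ranked_score is purely an assumption, as are the function names.

```python
OFFSET_DAYS = 180.0          # no decay inside this window
HALF_LIFE_DAYS = 5 * 365.0   # weight halves every five years after the offset


def decay_weight(age_days, offset_days=OFFSET_DAYS, half_life_days=HALF_LIFE_DAYS):
    """Weight in (0, 1]: flat until the offset, then exponential half-life decay."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    effective_age = max(0.0, age_days - offset_days)
    return 0.5 ** (effective_age / half_life_days)


def ranked_score(relevance, age_days, source_weight=1.0):
    """Combine relevance, age decay, and a per-source weight.

    Assumption: the source prior acts as a plain multiplier, for example
    0.85 for a lower-trust source. The writeup may apply it differently.
    """
    return relevance * decay_weight(age_days) * source_weight
```

Under these assumptions a memory that is 180 days old still carries full weight, and one that is five years past the offset carries half. Old facts drop in ranking but never reach zero.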

Why This Reframes the Governance Question

Earlier work on agent memory, including ours, framed the problem this way: persistent memory creates risks that data-governance frameworks were not built for. That framing was correct but incomplete. It left the impression that governance is a policy layer you bolt on top.

This build shows the opposite. The strongest governance properties here are infrastructure properties. Tenant isolation is a credential scope. Auditability is a supersession record. Retention is a decay curve. Source trust is a coefficient. None of these is a prompt instruction, and none is application logic that a future feature can quietly break. They live below the application, where the agent cannot reach to weaken them.

That is the lesson worth internalizing. A governance property that a clever prompt or a sloppy code change can override is just a hope with a better name. The properties that hold are the ones the system enforces structurally, on every operation, regardless of what the agent intends.

Treat the absolute numbers with the caution any first-party vendor benchmark deserves. Elastic built the system and ran its own eval. The pattern survives that caveat. Whether your real-world recall lands at 0.89 or 0.79, the architecture of where you put isolation, how you handle contradictions, and what you make tunable by operators is what determines whether your agent memory is governable at all.

Do This Now

Audit where your agent-memory isolation lives. If a developer can write a retrieval query that forgets a tenant filter, your boundary is in the wrong layer. Move it into the credential: scope each agent’s database access to exactly one tenant, enforced by the engine, and add a cross-tenant probe to your evaluation set so a regression shows up as a failed test rather than a breach. Then replace deletion with supersession for any fact an agent can revise, so you can reconstruct what the agent knew at any moment a regulator or a customer asks.

This analysis synthesizes “0.89 Recall and Zero Tenant Leaks” (Elastic Search Labs, June 2026), a first-party engineering writeup whose benchmarks are self-reported.

Victorino Group helps organizations build agent-memory architectures where isolation, audit, and retention are infrastructure properties rather than prompt-level hopes.