Three Memory Patterns That Hold Up (and Two That Don't)

Thiago Victorino

Tim Kellogg has been building production agent systems long enough to be picky about what works. He is the author of dura, the Git auto-commit daemon with 4.4K stars on GitHub. When he writes a taxonomy of agent memory patterns, the taxonomy is empirical, not theoretical. His piece Agent Memory Patterns is the cleanest catalogue I have seen of what holds up in production and what does not.

Three patterns hold up. Two do not. And a single number — 500 characters — is the most operationally useful detail in the post for any team designing the governance layer.

The Three That Hold Up

Files. Hierarchical key-value stores for data and knowledge. The agent navigates them with the tools you would expect: ls, find, cat, write. The literal substrate is irrelevant — flat files on disk, database records, S3 blobs. What matters is that the structure is hierarchical and the access pattern is exploration.

Files are the right pattern when the data is too large or too varied to live inline, and when the agent needs to discover what is there before reading it. The exploration tools are not optional. If your agent jumps to specific file paths without listing or finding, you have built a hard-coded address book, not a file system, and the design is not really using the Files pattern at all.
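To make the access pattern concrete, here is a minimal sketch of the four navigation tools on a flat-file substrate. The function names and the agent_memory root are illustrative assumptions, not an API from Kellogg's post; the point is the shape: list and find for discovery first, read and write second.

```python
# Hypothetical tool surface for the Files pattern. Substrate is local
# disk here, but the same four tools work over a database or S3.
from pathlib import Path

MEMORY_ROOT = Path("agent_memory")  # assumed root directory

def list_dir(rel_path: str = ".") -> list[str]:
    """Discovery: let the agent see what exists before reading anything."""
    return sorted(p.name for p in (MEMORY_ROOT / rel_path).iterdir())

def find(pattern: str) -> list[str]:
    """Recursive search, the agent's equivalent of `find`."""
    return [str(p.relative_to(MEMORY_ROOT)) for p in MEMORY_ROOT.rglob(pattern)]

def read_file(rel_path: str) -> str:
    return (MEMORY_ROOT / rel_path).read_text()

def write_file(rel_path: str, content: str) -> None:
    target = MEMORY_ROOT / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
```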

Memory Blocks. A flat key-value store, included inline in the system or user prompt. The Letta framework popularized the implementation: WriteBlock(), ListBlocks(), ReadBlock(). The defining property is visibility: the contents of a memory block are guaranteed to be in the model’s context window for the turn. There is no retrieval step that might fail. The agent sees it, period.

That guarantee is the whole point. Memory blocks are for behavior, preferences, identity — the small set of facts that must shape every turn. “Always respond in Brazilian Portuguese.” “The user prefers terse code review feedback.” “This account is read-only.” If the data needs to be in front of the model every time, it goes in a block.
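A minimal sketch of the pattern, assuming a plain in-memory store and a render step (Letta's real API differs in detail): the defining move is that every block is concatenated into the prompt on every turn, so there is no retrieval step that can fail.

```python
# Memory Blocks: flat key-value store, rendered inline every turn.
blocks: dict[str, str] = {}

def write_block(key: str, value: str) -> None:
    blocks[key] = value

def list_blocks() -> list[str]:
    return list(blocks)

def read_block(key: str) -> str:
    return blocks[key]

def render_blocks() -> str:
    """Inlined verbatim into the system prompt; guaranteed visible."""
    return "\n".join(f"[{key}] {value}" for key, value in blocks.items())

write_block("language", "Always respond in Brazilian Portuguese.")
write_block("review_style", "The user prefers terse code review feedback.")
system_prompt = f"You are a support agent.\n\nMemory:\n{render_blocks()}"
```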

Skills. Directory-based structures combining files with a system-prompt representation. The Claude Code Skill tool is the reference implementation: a SKILL.md metadata file with a name and a description sits in the system prompt; the rest of the directory loads only when the description matches the situation. Progressive disclosure is the architectural principle. Most of the data is invisible most of the time. The metadata is what triggers contextual loading.

Skills are the right pattern when you have a body of instructions or data that applies in specific circumstances and would be wasteful — or actively confusing — to include in every turn. The trigger description is the design surface. If it is too vague, the skill loads when it should not. If it is too specific, the skill never loads when it should.
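Here is a sketch of progressive disclosure under an assumed directory layout (skills/&lt;name&gt;/SKILL.md plus supporting files); the matching and loading steps are illustrative, not Claude Code's actual loader. Only the metadata lines ever sit in the system prompt; the body loads on demand.

```python
# Skills: metadata in the prompt, body loaded only when triggered.
from pathlib import Path

SKILLS_ROOT = Path("skills")  # assumed layout: skills/deploy/SKILL.md, ...

def skill_metadata() -> str:
    """What the system prompt carries: one name + description per skill."""
    entries = []
    for skill_md in sorted(SKILLS_ROOT.glob("*/SKILL.md")):
        lines = skill_md.read_text().splitlines()
        # assumes SKILL.md starts with "name: ..." then "description: ..."
        description = lines[1].removeprefix("description: ")
        entries.append(f"- {skill_md.parent.name}: {description}")
    return "Available skills:\n" + "\n".join(entries)

def load_skill(name: str) -> str:
    """Progressive disclosure: pull the full directory in only on a match."""
    parts = []
    for f in sorted((SKILLS_ROOT / name).iterdir()):
        parts.append(f"--- {f.name} ---\n{f.read_text()}")
    return "\n".join(parts)
```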

The Two That Do Not

Knowledge graphs. They look elegant on a whiteboard. They do not work in production. Kellogg’s empirical observation is that LLMs do not reason well in token-space over graph structures, and the indirection between “the model says it walked the graph” and “what the graph actually says” is where the failures live. The model hallucinates edges. It misinterprets node types. The graph is internally consistent and the agent’s behavior is not.

Writable SQL-backed data models. Same root cause. The agent issues a query, the database answers with a tabular result, and the agent’s reasoning over that tabular result is a token-space operation that the model is not architecturally good at. Reads are sometimes salvageable. Writes are where the wheels come off — the agent commits a row that violates an invariant nobody told it about, and the data model is now poisoned. Kellogg’s verdict is direct: these patterns “tend to not work very well.”

This is the part of the post worth quoting at the next architecture review. Two patterns that look like first-class engineering — a knowledge graph, a writable database — empirically fail when an agent is the consumer. The teams shipping this kind of memory in production are mostly not doing it on purpose. They picked the pattern because it was familiar from human-facing systems. The agent is a different consumer.

The 500-Character Threshold

The single most useful sentence in Kellogg’s piece is the failure mode for memory blocks. Above approximately 500 characters, blocks “tend to confuse the agent.” That is not a stylistic preference. It is an operating constraint.

500 characters is also a permission boundary, even if Kellogg does not frame it that way. A memory block under 500 characters is small enough that a human reviewer can read it in one glance and decide whether it should exist. A 480-character block that says “always escalate financial transactions over $10K to a human reviewer” is a governable artifact. A 4,800-character block that says everything an agent might need to know about the company’s escalation policies is not — it is a small document, and the model will get confused by it.

The architectural implication is clean. The moment a memory block crosses the threshold, it is no longer a block. It should be a File (if the agent needs to navigate it on demand) or a Skill (if it should auto-load when relevant). The threshold is not a soft limit you fight against. It is a signal that you have picked the wrong pattern.
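One way to operationalize that signal is a routing rule at write time. The 500-character figure is Kellogg's; the triage function below is an assumption about how a team might enforce it, not code from the post.

```python
BLOCK_LIMIT = 500  # Kellogg's empirical ceiling for memory blocks

def route_memory(content: str, always_visible: bool, auto_load: bool) -> str:
    """Pick the pattern by what the data is for, and refuse the wrong one."""
    if always_visible:
        if len(content) <= BLOCK_LIMIT:
            return "memory_block"  # small enough to guarantee and to govern
        raise ValueError(
            f"{len(content)} chars cannot be a block: split it, "
            "or demote it to a file or a skill."
        )
    return "skill" if auto_load else "file"
```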

This matters for governance because the small size of memory blocks is what makes them reviewable. Anything that must be guaranteed visible to the agent must also be small enough for a human to govern. The 500-character ceiling enforces both properties at once.

Choose by What the Data Is For

The architectural takeaway from Kellogg’s taxonomy is not “use all three.” It is: choose the pattern by what the data is for, and let the pattern’s failure modes tell you when you have picked wrong.

  • Files when the agent needs to explore. The bias to detect: hard-coded paths instead of ls and find calls.
  • Memory Blocks when behavior must be guaranteed visible. The bias to detect: blocks that grow past 500 characters because nobody wanted to refactor them.
  • Skills when progressive disclosure matters. The bias to detect: vague trigger descriptions that load the wrong skill at the wrong time.

And the patterns to rule out: knowledge graphs and writable SQL stores. They look like good engineering. They are not when the consumer is an LLM.

The reason this taxonomy is worth treating as a checklist is that the failure modes are now empirical. We are no longer guessing whether a graph-backed memory will work. Production teams have tried, and it does not. We are no longer guessing what size a memory block should be. The number is 500. The teams who internalize these constraints early will spend the next year building memory architectures that hold up. The teams who do not will spend it discovering, expensively, why their elegant graph never converged.


This analysis synthesizes Agent Memory Patterns (Tim Kellogg, April 2026).

Victorino Group helps engineering teams choose AI-agent memory architectures that hold up in production and rule out the patterns that empirically fail. Let’s talk.

All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com. About The Thinking Wire →
