Memory for AI Agents: Benchmarks, Architectures, and a Surprising Discovery
AI agents need to remember to be useful in real-world tasks. Without persistent memory, each interaction starts from scratch. The agent does not know its history, user preferences, or context from previous conversations.
This limitation is at the center of a fascinating technical debate: what is the best way to implement memory for AI agents? The answer is surprising.
The Discovery That Defies Expectations
Letta, an agent infrastructure company, tested a minimalist approach: storing conversation histories as text files and using basic filesystem operations for search.
The result: 74.0% accuracy on the LoCoMo benchmark using GPT-4o mini with simple grep and semantic search operations.
Compare that with Mem0, which uses a sophisticated knowledge graph architecture with LLM-powered conflict resolution: 68.5%.
A simple solution outperformed a complex one by more than 5 percentage points.
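To make the minimalist approach concrete, here is a sketch of what filesystem-backed memory can look like: turns appended to plain text files and retrieved with a grep-style scan. This is an illustration, not Letta's actual code; their system also layers semantic search on top, and all names here are mine.

```python
import re
from pathlib import Path

MEMORY_DIR = Path("agent_memory")  # illustrative layout: one .txt file per conversation


def append_turn(conversation_id: str, role: str, text: str) -> None:
    """Persist one conversation turn as a plain-text line."""
    MEMORY_DIR.mkdir(exist_ok=True)
    with open(MEMORY_DIR / f"{conversation_id}.txt", "a", encoding="utf-8") as f:
        f.write(f"{role}: {text}\n")


def grep_memory(pattern: str, max_hits: int = 10) -> list[str]:
    """Keyword search across all stored conversations, grep-style."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits: list[str] = []
    for path in sorted(MEMORY_DIR.glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if regex.search(line):
                hits.append(f"{path.name}: {line}")
                if len(hits) >= max_hits:
                    return hits
    return hits
```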
Why Simplicity Can Win
The explanation is counterintuitive but makes sense: agents are extensively trained on filesystem operations. Commands like grep, find, and cat are part of the basic repertoire of any model trained on code.
When you give an agent familiar tools, it uses them more effectively than new, specialized tools.
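This is also why exposing memory as plain tools works well. A hypothetical OpenAI-style function-calling schema for the grep_memory sketch above might look like this; the schema shape is standard, but the names are mine:

```python
# Hypothetical tool definition exposing grep-style memory search to an agent.
# Models have seen countless grep invocations during training, so the
# semantics of "search files for a pattern" are already familiar to them.
GREP_MEMORY_TOOL = {
    "type": "function",
    "function": {
        "name": "grep_memory",
        "description": "Search stored conversation files for a regex pattern.",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string", "description": "Regex to look for."},
                "max_hits": {"type": "integer", "description": "Result cap.", "default": 10},
            },
            "required": ["pattern"],
        },
    },
}
```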
Letta summarizes it this way: “Memory for agents is about whether agents successfully retrieve needed information when required. Effective tool usage matters more than the specific retrieval mechanism.”
The LoCoMo Benchmark
To understand these numbers, we need to know the benchmark.
LoCoMo (Long-Context Conversation Memory) was created by Snap Research and published at ACL 2024. It is the standard for evaluating long-term conversational memory.
Dataset composition:
- 50 human conversations
- Up to 35 sessions per conversation
- An average of roughly 300 turns per conversation
- An average of roughly 9K tokens per conversation
Types of questions evaluated:
- Single-hop: Direct fact retrieval
- Multi-hop: Connecting multiple pieces of information
- Temporal: Reasoning about time and sequence
- Adversarial: Trick questions designed to mislead the system
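To make the categories concrete, here are invented examples in the spirit of each type; none of these are drawn from the actual dataset:

```python
# Illustrative question types; not taken from the LoCoMo dataset itself.
EXAMPLE_QUESTIONS = {
    "single-hop": "What city did the speaker say they moved to?",
    "multi-hop": "Which hobby did the speaker pick up after the move mentioned in an earlier session?",
    "temporal": "Did the speaker adopt the dog before or after changing jobs?",
    "adversarial": "What did the speaker say about a topic that was never actually discussed?",
}
```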
Human performance on LoCoMo is around 88%, a bar most automated systems still fall short of.
Architecture Comparison
Here is the complete overview of tested approaches:
| Architecture | Benchmark | Score | Complexity |
|---|---|---|---|
| Filesystem (Letta) | LoCoMo | 74.0% | Low |
| Graph Memory (Mem0) | LoCoMo | 68.5% | High |
| Hindsight (4 Networks) | LoCoMo | 89.6% | Very High |
| EmergenceMem (RAG) | LongMemEval | 82.4% | Medium |
| Full Context GPT-4o | LoCoMo | 58% | N/A |
Some important observations:
Filesystem: High auditability. Data in readable format. Standard security models applicable.
Graph Memory: Strong relational reasoning. Good for multi-hop queries. Requires specialized graph expertise.
Hindsight: Best absolute performance. Separates facts from opinions. Requires significant engineering investment.
The Hindsight Architecture
The highest-performing approach deserves a closer look. Hindsight uses four separate networks:
World Network: Objective environmental facts. Verifiable information independent of the agent.
Experience Network: Agent actions in first person. History of what was done and its results.
Opinion Network: Subjective judgments with confidence scores that evolve with new evidence.
Observation Network: Neutral entity summaries. Consolidated information about people, places, things.
The main innovation: separating what the agent observes from what it believes. Opinions can change when new evidence arrives, while facts remain stable.
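Here is a minimal sketch of how the four stores and the evolving-opinion idea could be modeled. Hindsight's actual schema is not public, so every field name and the update rule below are assumptions:

```python
from dataclasses import dataclass


@dataclass
class WorldFact:
    statement: str  # objective, verifiable, independent of the agent


@dataclass
class Experience:
    action: str   # first-person: what the agent did
    outcome: str  # what happened as a result


@dataclass
class Opinion:
    judgment: str
    confidence: float  # evolves as new evidence arrives

    def update(self, supports: bool, weight: float = 0.1) -> None:
        """Nudge confidence toward 1 or 0 based on new evidence (illustrative rule)."""
        target = 1.0 if supports else 0.0
        self.confidence += weight * (target - self.confidence)


@dataclass
class Observation:
    entity: str   # person, place, or thing
    summary: str  # consolidated neutral description
```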
Benchmark Limitations
Before making decisions based on these numbers, some important caveats:
Scores are not directly comparable. Hindsight uses Gemini-3 while Letta uses GPT-4o mini. Different benchmarks (LoCoMo vs. LongMemEval) use different question sets and scoring. Methodologies are also disputed: Zep and Mem0 contest each other's results.
What benchmarks do not measure:
- Performance under multi-user load
- Degradation with conflicting information
- Coherence in tool-calling chains
- Cost-performance at scale
Human gap persists. Humans (~88%) still outperform most systems on LoCoMo; Hindsight's 89.6% nominally edges past that bar, but under a different model and methodology, so the gap should not be considered closed.
Conflicts of interest. Each company publishes results favorable to its products. Letta, Mem0, Zep — all have commercial incentives.
Tiered Decision Framework
Based on this analysis, I propose a practical framework for choosing memory architectures:
Tier 1: Filesystem-Based (Recommended)
Conversations stored as text files with semantic + keyword search (a scoring sketch follows below).
When to use:
- Histories under 100K tokens
- Per-session/user memory
- Team without graph expertise
- Auditability is priority
Expected performance: 70-75%
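A minimal sketch of the blended retrieval score, assuming an embed() callable from whichever embedding provider you use; the 50/50 weighting is an arbitrary starting point to tune on your own data:

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_score(
    query: str,
    doc: str,
    embed: Callable[[str], list[float]],
    alpha: float = 0.5,  # keyword vs. semantic weight; a tuning assumption
) -> float:
    """Blend keyword overlap with embedding similarity."""
    q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
    keyword = len(q_terms & d_terms) / (len(q_terms) or 1)
    semantic = cosine(embed(query), embed(doc))
    return alpha * keyword + (1 - alpha) * semantic
```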
Tier 2: Graph-Enhanced (Advanced)
Filesystem + entity and relationship extraction into a graph (a traversal sketch follows below).
When to use:
- Frequent multi-hop queries
- Knowledge evolves and conflicts
- Relationships are central
- Entity-oriented domain
Expected performance: 68-75%
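To show why graphs pay off for multi-hop queries, here is a toy in-memory triple store with a bounded traversal. Production systems would use a real graph database and LLM-based entity extraction; everything below is illustrative:

```python
from collections import defaultdict

# Toy triple store: subject -> [(relation, object), ...]
graph: dict[str, list[tuple[str, str]]] = defaultdict(list)


def add_relation(subject: str, relation: str, obj: str) -> None:
    graph[subject].append((relation, obj))


def multi_hop(start: str, depth: int = 2) -> list[tuple[str, str, str]]:
    """Collect facts reachable within `depth` hops, the query shape graphs excel at."""
    facts, frontier = [], [start]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for relation, obj in graph.get(node, []):
                facts.append((node, relation, obj))
                next_frontier.append(obj)
        frontier = next_frontier
    return facts


add_relation("Alice", "works_at", "Acme")
add_relation("Acme", "located_in", "Berlin")
print(multi_hop("Alice"))  # two hops: Alice -> Acme -> Berlin
```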
Tier 3: Structured (Enterprise)
A Hindsight-style four-network architecture.
When to use:
- Mission-critical applications
- Opinions must evolve
- Temporal reasoning essential
- Budget for high investment
Expected performance: 85-90%+
Guiding principle: Start simple. Add complexity only when data shows it is necessary. Tier 1 solves most use cases.
Governance Implications
Memory architecture directly impacts governance and compliance of AI systems.
| Architecture | Control | Auditability |
|---|---|---|
| Filesystem | High | High |
| Graph DB | Medium | Medium |
| Vector Store | Low | Low |
Audit Trails: Filesystems generate standard access logs. Graphs and vectors require additional instrumentation.
Right to Be Forgotten: Deleting data is simpler in files than in distributed embeddings, which matters directly for GDPR erasure requests (a deletion sketch follows these points).
Data Sovereignty: Clear control over where data resides. Files can be inspected; vectors are opaque.
Poisoning Attacks: Vector stores are more vulnerable to malicious data injections that can compromise agent behavior.
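To illustrate the erasure point, here is a sketch of handling a right-to-be-forgotten request against filesystem memory, assuming one directory per user (my layout, not a standard). The equivalent operation over a vector store means locating and deleting every embedding derived from that user's data:

```python
import shutil
from pathlib import Path


def forget_user(user_id: str, memory_root: Path = Path("agent_memory")) -> None:
    """Handle an erasure request: remove every file the user's data lives in."""
    user_dir = memory_root / user_id
    if user_dir.exists():
        shutil.rmtree(user_dir)  # files are enumerable; distributed embeddings are not
```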
Trends to Watch
Memory-Specific Training: Models trained specifically on memory operations may close the gap between simple and complex approaches.
Hybrid Retrieval: TEMPR (semantic + keyword + graph + temporal) shows that combining strategies outperforms single approaches.
Benchmark Evolution: New benchmarks will focus on memory management, not just retrieval. Adversarial robustness will be central.
Latency vs Accuracy: Production systems must balance memory quality with response time. Hindsight's 89.6% accuracy comes at higher latency than Mem0, which reports a 1.44s p95 at 68.5% accuracy.
Open Questions
Some questions that research has not yet answered:
- How do architectures perform with multimodal memory (images, documents)?
- What are the attack surfaces for memory poisoning in each approach?
- How does performance degrade over very long time horizons (months, years)?
- What governance frameworks should wrap each type of architecture?
Key Conclusions
Simplicity can win. Filesystem-based (74%) outperformed Graph-based (68.5%) on LoCoMo. Complexity is no guarantee of performance.
Familiarity matters. Agents use tools they know from training (grep, find) better than new specialized APIs.
Benchmarks have limits. Results are not directly comparable. Models, prompts, and methodologies vary. Context is essential.
Governance varies by architecture. Filesystems offer greater auditability. Vector stores are more opaque. Choice impacts compliance.
Start with Tier 1. For most cases, filesystem-based is sufficient. Add complexity when data justifies it.
Human gap persists. Even the best systems fall below humans. Human-in-the-loop is still valuable.
At Victorino Group, we implement governed agentic AI for companies that cannot afford to fail. If you need memory for your agents with full control over data and decisions, let’s talk.