Memory for AI Agents: Benchmarks, Architectures, and a Surprising Discovery
AI agents need to remember to be useful in real-world tasks. Without persistent memory, each interaction starts from scratch. The agent does not know its history, user preferences, or context from previous conversations.
This limitation is at the center of a fascinating technical debate: what is the best way to implement memory for AI agents? The answer is surprising.
The Discovery That Defies Expectations
Letta, an agent infrastructure company, tested a minimalist approach: storing conversation histories as text files and using basic filesystem operations for search.
The result: 74.0% accuracy on the LoCoMo benchmark using GPT-4o mini with simple grep and semantic search operations.
Compare that with Mem0, which uses a sophisticated knowledge graph architecture with LLM-powered conflict resolution: 68.5%.
A simple solution outperformed a complex one by more than 5 percentage points.
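To make the minimalist approach concrete, here is a sketch of what filesystem-backed memory can look like: turns appended to plain text files and retrieved with a grep-style scan. This is an illustration, not Letta's actual code; their system also layers semantic search on top, and all names here are mine.

```python
import re
from pathlib import Path

MEMORY_DIR = Path("agent_memory")  # illustrative layout: one .txt file per conversation


def append_turn(conversation_id: str, role: str, text: str) -> None:
    """Persist one conversation turn as a plain-text line."""
    MEMORY_DIR.mkdir(exist_ok=True)
    with open(MEMORY_DIR / f"{conversation_id}.txt", "a", encoding="utf-8") as f:
        f.write(f"{role}: {text}\n")


def grep_memory(pattern: str, max_hits: int = 10) -> list[str]:
    """Keyword search across all stored conversations, grep-style."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits: list[str] = []
    for path in sorted(MEMORY_DIR.glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if regex.search(line):
                hits.append(f"{path.name}: {line}")
                if len(hits) >= max_hits:
                    return hits
    return hits
```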
Why Simplicity Can Win
The explanation is counterintuitive but makes sense: agents are extensively trained on filesystem operations. Commands like grep, find, and cat are part of the basic repertoire of any model trained on code.
When you give an agent familiar tools, it uses them more effectively than new, specialized tools.
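This is also why exposing memory as plain tools works well. A hypothetical OpenAI-style function-calling schema for the grep_memory sketch above might look like this; the schema shape is standard, but the names are mine:

```python
# Hypothetical tool definition exposing grep-style memory search to an agent.
# Models have seen countless grep invocations during training, so the
# semantics of "search files for a pattern" are already familiar to them.
GREP_MEMORY_TOOL = {
    "type": "function",
    "function": {
        "name": "grep_memory",
        "description": "Search stored conversation files for a regex pattern.",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string", "description": "Regex to look for."},
                "max_hits": {"type": "integer", "description": "Result cap.", "default": 10},
            },
            "required": ["pattern"],
        },
    },
}
```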
Letta summarizes it this way: “Memory for agents is about whether agents successfully retrieve needed information when required. Effective tool usage matters more than the specific retrieval mechanism.”
The LoCoMo Benchmark
To understand these numbers, we need to know the benchmark.
LoCoMo (Long-Context Conversation Memory) was created by Snap Research and published at ACL 2024. It is the standard for evaluating long-term conversational memory.
Dataset composition:
- 50 human conversations
- Up to 35 sessions per conversation
- An average of roughly 300 turns per conversation
- An average of roughly 9K tokens per conversation
Types of questions evaluated:
- Single-hop: Direct fact retrieval
- Multi-hop: Connecting multiple pieces of information
- Temporal: Reasoning about time and sequence
- Adversarial: Trick questions designed to mislead the system
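To make the categories concrete, here are invented examples in the spirit of each type; none of these are drawn from the actual dataset:

```python
# Illustrative question types; not taken from the LoCoMo dataset itself.
EXAMPLE_QUESTIONS = {
    "single-hop": "What city did the speaker say they moved to?",
    "multi-hop": "Which hobby did the speaker pick up after the move mentioned in an earlier session?",
    "temporal": "Did the speaker adopt the dog before or after changing jobs?",
    "adversarial": "What did the speaker say about a topic that was never actually discussed?",
}
```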
Human performance on LoCoMo is around 88%, a bar most automated systems still fall short of.
Architecture Comparison
Here is the complete overview of tested approaches:
| Architecture | Benchmark | Score | Complexity |
|---|---|---|---|
| Filesystem (Letta) | LoCoMo | 74.0% | Low |
| Graph Memory (Mem0) | LoCoMo | 68.5% | High |
| Hindsight (4 Networks) | LoCoMo | 89.6% | Very High |
| EmergenceMem (RAG) | LongMemEval | 82.4% | Medium |
| Full Context GPT-4o | LoCoMo | 58% | N/A |
Some important observations:
Filesystem: High auditability. Data in readable format. Standard security models applicable.
Graph Memory: Strong relational reasoning. Good for multi-hop queries. Requires specialized graph expertise.
Hindsight: Best absolute performance. Separates facts from opinions. Requires significant engineering investment.
The Hindsight Architecture
The highest-performing approach deserves a closer look. Hindsight uses four separate networks:
World Network: Objective environmental facts. Verifiable information independent of the agent.
Experience Network: Agent actions in first person. History of what was done and its results.
Opinion Network: Subjective judgments with confidence scores that evolve with new evidence.
Observation Network: Neutral entity summaries. Consolidated information about people, places, things.
The main innovation: separating what the agent observes from what it believes. Opinions can change when new evidence arrives, while facts remain stable.
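Here is a minimal sketch of how the four stores and the evolving-opinion idea could be modeled. Hindsight's actual schema is not public, so every field name and the update rule below are assumptions:

```python
from dataclasses import dataclass


@dataclass
class WorldFact:
    statement: str  # objective, verifiable, independent of the agent


@dataclass
class Experience:
    action: str   # first-person: what the agent did
    outcome: str  # what happened as a result


@dataclass
class Opinion:
    judgment: str
    confidence: float  # evolves as new evidence arrives

    def update(self, supports: bool, weight: float = 0.1) -> None:
        """Nudge confidence toward 1 or 0 based on new evidence (illustrative rule)."""
        target = 1.0 if supports else 0.0
        self.confidence += weight * (target - self.confidence)


@dataclass
class Observation:
    entity: str   # person, place, or thing
    summary: str  # consolidated neutral description
```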
Benchmark Limitations
Before making decisions based on these numbers, some important caveats:
Scores are not directly comparable. Hindsight uses Gemini-3 while Letta uses GPT-4o mini. Different benchmarks (LoCoMo vs. LongMemEval) use different question sets and scoring. Methodologies are also disputed: Zep and Mem0 contest each other's results.
What benchmarks do not measure:
- Performance under multi-user load
- Degradation with conflicting information
- Coherence in tool-calling chains
- Cost-performance at scale
Human gap persists. Humans (~88%) still outperform most systems on LoCoMo; Hindsight's 89.6% nominally edges past that bar, but under a different model and methodology, so the gap should not be considered closed.
Conflicts of interest. Each company publishes results favorable to its products. Letta, Mem0, Zep — all have commercial incentives.
Tiered Decision Framework
Based on this analysis, I propose a practical framework for choosing memory architectures:
Tier 1: Filesystem-Based (Recommended)
Conversations stored as text files with semantic + keyword search (a scoring sketch follows below).
When to use:
- Histories under 100K tokens
- Per-session/user memory
- Team without graph expertise
- Auditability is priority
Expected performance: 70-75%
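A minimal sketch of the blended retrieval score, assuming an embed() callable from whichever embedding provider you use; the 50/50 weighting is an arbitrary starting point to tune on your own data:

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_score(
    query: str,
    doc: str,
    embed: Callable[[str], list[float]],
    alpha: float = 0.5,  # keyword vs. semantic weight; a tuning assumption
) -> float:
    """Blend keyword overlap with embedding similarity."""
    q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
    keyword = len(q_terms & d_terms) / (len(q_terms) or 1)
    semantic = cosine(embed(query), embed(doc))
    return alpha * keyword + (1 - alpha) * semantic
```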
Tier 2: Graph-Enhanced (Advanced)
Filesystem + entity and relationship extraction into a graph (a traversal sketch follows below).
When to use:
- Frequent multi-hop queries
- Knowledge evolves and conflicts
- Relationships are central
- Entity-oriented domain
Expected performance: 68-75%
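To show why graphs pay off for multi-hop queries, here is a toy in-memory triple store with a bounded traversal. Production systems would use a real graph database and LLM-based entity extraction; everything below is illustrative:

```python
from collections import defaultdict

# Toy triple store: subject -> [(relation, object), ...]
graph: dict[str, list[tuple[str, str]]] = defaultdict(list)


def add_relation(subject: str, relation: str, obj: str) -> None:
    graph[subject].append((relation, obj))


def multi_hop(start: str, depth: int = 2) -> list[tuple[str, str, str]]:
    """Collect facts reachable within `depth` hops, the query shape graphs excel at."""
    facts, frontier = [], [start]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for relation, obj in graph.get(node, []):
                facts.append((node, relation, obj))
                next_frontier.append(obj)
        frontier = next_frontier
    return facts


add_relation("Alice", "works_at", "Acme")
add_relation("Acme", "located_in", "Berlin")
print(multi_hop("Alice"))  # two hops: Alice -> Acme -> Berlin
```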
Tier 3: Structured (Enterprise)
A Hindsight-style four-network architecture.
When to use:
- Mission-critical applications
- Opinions must evolve
- Temporal reasoning essential
- Budget for high investment
Expected performance: 85-90%+
Guiding principle: Start simple. Add complexity only when data shows it is necessary. Tier 1 solves most use cases.
Governance Implications
Memory architecture directly impacts governance and compliance of AI systems.
| Architecture | Control | Auditability |
|---|---|---|
| Filesystem | High | High |
| Graph DB | Medium | Medium |
| Vector Store | Low | Low |
Audit Trails: Filesystems generate standard access logs. Graphs and vectors require additional instrumentation.
Right to Be Forgotten: Deleting data is simpler in files than in distributed embeddings, which matters directly for GDPR erasure requests (a deletion sketch follows these points).
Data Sovereignty: Clear control over where data resides. Files can be inspected; vectors are opaque.
Poisoning Attacks: Vector stores are more vulnerable to malicious data injections that can compromise agent behavior.
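To illustrate the erasure point, here is a sketch of handling a right-to-be-forgotten request against filesystem memory, assuming one directory per user (my layout, not a standard). The equivalent operation over a vector store means locating and deleting every embedding derived from that user's data:

```python
import shutil
from pathlib import Path


def forget_user(user_id: str, memory_root: Path = Path("agent_memory")) -> None:
    """Handle an erasure request: remove every file the user's data lives in."""
    user_dir = memory_root / user_id
    if user_dir.exists():
        shutil.rmtree(user_dir)  # files are enumerable; distributed embeddings are not
```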
Trends to Watch
Memory-Specific Training: Models trained specifically on memory operations may close the gap between simple and complex approaches.
Hybrid Retrieval: TEMPR (semantic + keyword + graph + temporal) shows that combining strategies outperforms single approaches.
Benchmark Evolution: New benchmarks will focus on memory management, not just retrieval. Adversarial robustness will be central.
Latency vs Accuracy: Production systems must balance memory quality with response time. Hindsight's 89.6% accuracy comes at higher latency than Mem0, which reports a 1.44s p95 at 68.5% accuracy.
Open Questions
Some questions that research has not yet answered:
- How do architectures perform with multimodal memory (images, documents)?
- What are the attack surfaces for memory poisoning in each approach?
- How does performance degrade over very long time horizons (months, years)?
- What governance frameworks should wrap each type of architecture?
Key Conclusions
Simplicity can win. Filesystem-based (74%) outperformed Graph-based (68.5%) on LoCoMo. Complexity is no guarantee of performance.
Familiarity matters. Agents use tools they know from training (grep, find) better than new specialized APIs.
Benchmarks have limits. Results are not directly comparable. Models, prompts, and methodologies vary. Context is essential.
Governance varies by architecture. Filesystems offer greater auditability. Vector stores are more opaque. Choice impacts compliance.
Start with Tier 1. For most cases, filesystem-based is sufficient. Add complexity when data justifies it.
Human gap persists. Even the best systems fall below humans. Human-in-the-loop is still valuable.
At Victorino Group, we implement governed agentic AI for companies that cannot afford to fail. If you need memory for your agents with full control over data and decisions, let’s talk.