Memory for AI Agents: Benchmarks, Architectures, and a Surprising Discovery

Thiago Victorino
15 min read

AI agents need to remember to be useful in real-world tasks. Without persistent memory, each interaction starts from scratch. The agent does not know its history, user preferences, or context from previous conversations.

This limitation is at the center of a fascinating technical debate: what is the best way to implement memory for AI agents? The answer is surprising.

The Discovery That Defies Expectations

Letta, an agent infrastructure company, tested a minimalist approach: storing conversation histories as text files and using basic filesystem operations for search.

The result: 74.0% accuracy on the LoCoMo benchmark using GPT-4o mini with simple grep and semantic search operations.

Compare that with Mem0, which uses a sophisticated knowledge graph architecture with LLM-powered conflict resolution: 68.5%.

A simple solution outperformed a complex one by more than 5 percentage points.

Why Simplicity Can Win

The explanation is counterintuitive but makes sense: agents are extensively trained on filesystem operations. Commands like grep, find, and cat are part of the basic repertoire of any model trained on code.

When you give an agent familiar tools, it uses them more effectively than new, specialized tools.

Letta summarizes it this way: “Memory for agents is about whether agents successfully retrieve needed information when required. Effective tool usage matters more than the specific retrieval mechanism.”
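To make the idea concrete, here is a minimal Python sketch of the filesystem approach: one append-only text file per session, searched with a grep-style regex scan. This is illustrative only; the class and method names are my own, and Letta's actual implementation also layers semantic (embedding) search on top of these primitives.

```python
import re
from pathlib import Path


class FilesystemMemory:
    """Toy memory store: one plain-text file per session, searched like grep."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def append(self, session: str, turn: str) -> None:
        # Each conversation turn is appended as one readable line of text.
        with open(self.root / f"{session}.txt", "a", encoding="utf-8") as f:
            f.write(turn + "\n")

    def grep(self, pattern: str) -> list[str]:
        # Equivalent of `grep -i pattern *.txt`: return every matching line.
        rx = re.compile(pattern, re.IGNORECASE)
        hits = []
        for path in sorted(self.root.glob("*.txt")):
            for line in path.read_text(encoding="utf-8").splitlines():
                if rx.search(line):
                    hits.append(f"{path.name}: {line}")
        return hits
```

Everything here is auditable by design: the memory is human-readable text, and retrieval is an operation any code-trained model has seen millions of times.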

The LoCoMo Benchmark

To understand these numbers, we need to know the benchmark.

LoCoMo (Long-Context Conversation Memory) was created by Snap Research and published at ACL 2024. It is the standard for evaluating long-term conversational memory.

Dataset composition:

  • 50 human conversations
  • Up to 35 sessions per conversation
  • ~300 turns per conversation on average
  • ~9K tokens per conversation on average

Types of questions evaluated:

  1. Single-hop: Direct fact retrieval
  2. Multi-hop: Connecting multiple pieces of information
  3. Temporal: Reasoning about time and sequence
  4. Adversarial: Questions with traps

Human performance on LoCoMo is around 88%, a ceiling most automated systems still fall short of.

Architecture Comparison

Here is the complete overview of tested approaches:

Architecture              Benchmark     Score   Complexity
Filesystem (Letta)        LoCoMo        74.0%   Low
Graph Memory (Mem0)       LoCoMo        68.5%   High
Hindsight (4 Networks)    LoCoMo        89.6%   Very High
EmergenceMem (RAG)        LongMemEval   82.4%   Medium
Full Context GPT-4o       LoCoMo        58%     N/A

Some important observations:

Filesystem: High auditability. Data in readable format. Standard security models applicable.

Graph Memory: Strong relational reasoning. Good for multi-hop queries. Requires specialized graph expertise.

Hindsight: Best absolute performance. Separates facts from opinions. Requires significant engineering investment.

The Hindsight Architecture

The highest-performing approach deserves detail. Hindsight uses four separate networks:

World Network: Objective environmental facts. Verifiable information independent of the agent.

Experience Network: Agent actions in first person. History of what was done and its results.

Opinion Network: Subjective judgments with confidence scores that evolve with new evidence.

Observation Network: Neutral entity summaries. Consolidated information about people, places, things.

The main innovation: separating what the agent observes from what it believes. Opinions can change when new evidence arrives, while facts remain stable.
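A sketch of how the four stores could be modeled in code. The class names mirror the networks above, but the confidence-update rule is hypothetical: Hindsight's real update mechanism is not public, so this only illustrates the "opinions evolve, facts stay stable" separation.

```python
from dataclasses import dataclass, field


@dataclass
class Opinion:
    """A subjective judgment whose confidence evolves with evidence."""
    claim: str
    confidence: float  # 0.0 to 1.0

    def update(self, supports: bool, weight: float = 0.2) -> None:
        # Hypothetical rule: nudge confidence toward 1.0 on supporting
        # evidence, toward 0.0 on contradicting evidence.
        target = 1.0 if supports else 0.0
        self.confidence += weight * (target - self.confidence)


@dataclass
class HindsightStyleMemory:
    world: list[str] = field(default_factory=list)         # objective facts
    experience: list[str] = field(default_factory=list)    # first-person actions
    opinions: list[Opinion] = field(default_factory=list)  # evolving judgments
    observations: dict[str, str] = field(default_factory=dict)  # entity summaries
```

Facts land in `world` and never change; an `Opinion` drifts as evidence accumulates, which is exactly the separation the architecture is built around.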

Benchmark Limitations

Before making decisions based on these numbers, some important caveats:

Scores are not directly comparable. Hindsight uses Gemini-3, Letta uses GPT-4o mini. Different benchmarks (LoCoMo vs LongMemEval) have different scales. Methodologies are disputed — Zep and Mem0 contest each other’s results.

What benchmarks do not measure:

  • Performance under multi-user load
  • Degradation with conflicting information
  • Coherence in tool-calling chains
  • Cost-performance at scale

Human gap persists. Human performance on LoCoMo (~88%) still exceeds most automated systems; Hindsight's 89.6% nominally surpasses it, but that score comes from a different model and setup and is not directly comparable.

Conflicts of interest. Each company publishes results favorable to its products. Letta, Mem0, Zep — all have commercial incentives.

Tiered Decision Framework

Based on this analysis, I propose a practical framework for choosing memory architectures:

Tier 1: Filesystem-Based (Baseline)

Conversations stored as text files with semantic + keyword search.

When to use:

  • Histories under 100K tokens
  • Per-session/user memory
  • Team without graph expertise
  • Auditability is priority

Expected performance: 70-75%

Tier 2: Graph-Enhanced (Advanced)

Filesystem + entity and relationship extraction to graph.

When to use:

  • Frequent multi-hop queries
  • Knowledge evolves and conflicts
  • Relationships are central
  • Entity-oriented domain

Expected performance: 68-75%

Tier 3: Structured (Enterprise)

A Hindsight-style 4-network architecture.

When to use:

  • Mission-critical applications
  • Opinions must evolve
  • Temporal reasoning essential
  • Budget for high investment

Expected performance: 85-90%+

Guiding principle: Start simple. Add complexity only when data shows it is necessary. Tier 1 solves most use cases.
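The tiers above can be condensed into a simple decision function. The thresholds and flags are illustrative, taken from the criteria listed in each tier, not prescriptive values:

```python
def choose_tier(history_tokens: int,
                multi_hop_heavy: bool,
                mission_critical: bool) -> str:
    """Sketch of the tiered decision framework as code."""
    if mission_critical:
        # Tier 3: evolving opinions, temporal reasoning, high budget.
        return "Tier 3: structured (Hindsight-style)"
    if multi_hop_heavy or history_tokens > 100_000:
        # Tier 2: relationship-centric or beyond the 100K-token guideline.
        return "Tier 2: graph-enhanced"
    # Tier 1 covers most use cases: start simple.
    return "Tier 1: filesystem-based"
```

In practice the real decision involves more criteria (team expertise, auditability requirements), but the ordering matters: you only escalate when the simpler tier demonstrably falls short.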

Governance Implications

Memory architecture directly impacts governance and compliance of AI systems.

Architecture    Control   Auditability
Filesystem      High      High
Graph DB        Medium    Medium
Vector Store    Low       Low

Audit Trails: Filesystems generate standard access logs. Graphs and vectors require additional instrumentation.

Right to Be Forgotten: Deleting data is simpler in files than in distributed embeddings. GDPR has direct implications.
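A sketch of why deletion is simpler with files, assuming a hypothetical one-directory-per-user layout. With a filesystem store, honoring a GDPR erasure request is a single recursive delete; with distributed embeddings, it means locating and re-indexing every vector derived from the user's data.

```python
import shutil
from pathlib import Path


def forget_user(memory_root: str, user_id: str) -> bool:
    """Erase all memory for one user (right to be forgotten).

    Assumes each user's memory lives under memory_root/<user_id>/.
    Returns True if data was found and deleted, False otherwise.
    """
    user_dir = Path(memory_root) / user_id
    if user_dir.is_dir():
        shutil.rmtree(user_dir)
        return True
    return False
```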

Data Sovereignty: Clear control over where data resides. Files can be inspected; vectors are opaque.

Poisoning Attacks: Vector stores are more vulnerable to malicious data injections that can compromise agent behavior.

Looking Ahead

Memory-Specific Training: Models trained specifically on memory operations may close the gap between simple and complex approaches.

Hybrid Retrieval: TEMPR (semantic + keyword + graph + temporal) shows that combining strategies outperforms single approaches.
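One common way to combine ranked lists from multiple retrievers is reciprocal rank fusion (RRF). TEMPR's exact combination strategy may differ; RRF is shown here only to illustrate the principle of merging keyword, semantic, graph, and temporal rankings:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: each document scores 1/(k + rank + 1)
    per list it appears in, and fused results are sorted by total score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks moderately well in several retrievers beats one that tops a single list, which is why hybrid strategies tend to outperform any single approach.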

Benchmark Evolution: New benchmarks will focus on memory management, not just retrieval. Adversarial robustness will be central.

Latency vs Accuracy: Production systems must balance memory quality with response time. Hindsight's 89.6% comes at higher latency, while Mem0 reports 68.5% with a 1.44s p95 response time.

Open Questions

Some questions that research has not yet answered:

  • How do architectures perform with multimodal memory (images, documents)?
  • What are the attack surfaces for memory poisoning in each approach?
  • How does performance degrade over very long time horizons (months, years)?
  • What governance frameworks should wrap each type of architecture?

Key Conclusions

Simplicity can win. Filesystem-based (74%) outperformed Graph-based (68.5%) on LoCoMo. Complexity is no guarantee of performance.

Familiarity matters. Agents use tools they know from training (grep, find) better than new specialized APIs.

Benchmarks have limits. Results are not directly comparable. Models, prompts, and methodologies vary. Context is essential.

Governance varies by architecture. Filesystems offer greater auditability. Vector stores are more opaque. Choice impacts compliance.

Start with Tier 1. For most cases, filesystem-based is sufficient. Add complexity when data justifies it.

Human gap persists. Even the best systems fall below humans. Human-in-the-loop is still valuable.


At Victorino Group, we implement governed agentic AI for companies that cannot afford to fail. If you need memory for your agents with full control over data and decisions, let’s talk.

If this resonates, let's talk

We help companies implement AI without losing control.

Schedule a Conversation