Your Agent Forgets Everything. Here Are Three Ways to Fix It — and One Question Nobody Is Answering
An MIT study found that 95% of generative AI pilots fail. Not because the models are bad. Because the agents do not adapt.
Every session starts fresh. The agent re-reads raw transcripts. It re-discovers the same constraints. It repeats the same mistakes. Three months in, the pilot looks exactly like it did on day one. Leadership kills the project.
Two publications from April 2026 attack this problem from different angles. IBM Research published ALTK-Evolve, a system that converts agent trajectories into scored, reusable guidelines. Rahul Garg, writing on Martin Fowler’s blog, described the Feedback Flywheel, a framework for teams to systematically capture signals from AI tools and feed them back into shared infrastructure.
Both are serious work. Both advance the field. And both leave the same question unanswered.
The First Approach: Teach the Agent from Its Own Mistakes
ALTK-Evolve, from Vatche Isahagian and colleagues at IBM Research, starts from a simple observation: agents repeat errors because they re-read raw execution traces instead of extracting principles from them.
The architecture runs a two-phase loop. The downward phase captures agent trajectories through observability tools like Langfuse and OpenTelemetry. Every tool call, every reasoning step, every outcome gets logged. The upward phase is where the learning happens: a separate LLM reads failed trajectories, extracts reusable guidelines, scores them by reliability, and prunes low-performing ones over time.
When the agent encounters a new task, the system retrieves only the highest-scoring guidelines relevant to that specific context. Not the full history. Not a dump of everything the agent has ever seen. Just the principles that have proven reliable.
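The mechanics of that loop are easier to see in code. The sketch below is a minimal illustration of the score-prune-retrieve pattern, not ALTK-Evolve's actual API: `Guideline`, `GuidelineStore`, the smoothed success-rate score, and the 0.3 pruning threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Guideline:
    """A principle extracted from failed trajectories, scored by outcomes."""
    text: str
    successes: int = 0
    failures: int = 0

    @property
    def score(self) -> float:
        # Laplace-smoothed success rate as a reliability score,
        # so a brand-new guideline starts at a neutral 0.5
        return (self.successes + 1) / (self.successes + self.failures + 2)

class GuidelineStore:
    def __init__(self, prune_below: float = 0.3):
        self.guidelines: list[Guideline] = []
        self.prune_below = prune_below

    def record_outcome(self, g: Guideline, success: bool) -> None:
        # Upward phase: task outcomes feed back into the score
        if success:
            g.successes += 1
        else:
            g.failures += 1

    def prune(self) -> None:
        # Drop guidelines whose reliability has fallen below threshold
        self.guidelines = [g for g in self.guidelines
                           if g.score >= self.prune_below]

    def retrieve(self, relevant: list[Guideline], k: int = 3) -> list[Guideline]:
        # Return only the top-k highest-scoring guidelines for this context,
        # never the full history
        return sorted(relevant, key=lambda g: g.score, reverse=True)[:k]
```

The design choice that matters is in `retrieve`: the agent never sees raw trajectories again, only the small set of principles that have earned their place.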
The results on the AppWorld benchmark are worth examining closely.
Easy tasks improved 5.2 percentage points (79% to 84.2%). Medium tasks improved 6.3 points (56.2% to 62.5%). Hard tasks improved 14.2 points (19.1% to 33.3%).
That hard-task number deserves scrutiny. A 74% relative increase sounds dramatic, and it is. But the baseline was 19.1%. Moving from one-in-five to one-in-three is meaningful progress. It is not production-ready performance.
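The gap between the dramatic relative number and the modest absolute one is worth making explicit:

```python
# AppWorld hard-task numbers reported above
baseline, improved = 19.1, 33.3

absolute = improved - baseline        # percentage points gained
relative = absolute / baseline * 100  # percent of the baseline

print(f"{absolute:.1f} points absolute, {relative:.0f}% relative")
```

Both figures describe the same result; only the framing changes.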
The pattern is telling: the harder the task, the more the agent benefits from accumulated guidelines. Simple tasks need less help. Complex tasks, where the agent is most likely to get lost in multi-step reasoning, benefit most from curated principles that prevent known failure paths.
ALTK-Evolve is open source and includes a Claude Code plugin. The engineering is solid. But the interesting question is not about engineering.
The Second Approach: Build a Team Flywheel
Garg’s Feedback Flywheel, published on Martin Fowler’s site, operates at a different level. Where ALTK-Evolve teaches individual agents, the Flywheel teaches teams how to compound their AI practice.
The core insight: “A fast output that requires extensive rework is not a productivity gain. It is rework with extra steps.”
Teams plateau with AI assistants because they use them ad hoc. One developer discovers a prompting technique and keeps it to herself. Another builds a useful CLAUDE.md file but never shares the pattern. A third finds that a particular instruction prevents a recurring error but stores it only in her local setup.
Garg identifies four signal types that teams should capture and feed back:
Context signals become priming documents. When your agent needs the same background information repeatedly, codify it once and share it across the team.
Instruction signals become shared commands. The prompt patterns that work get extracted from individual practice into team infrastructure.
Workflow signals become team playbooks. Not just what to tell the agent, but when to use it and when not to.
Failure signals become guardrails. When an agent produces a category of error, encode the prevention as a constraint that every team member’s setup inherits.
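The four signal types amount to a routing table: each kind of observation lands in a shared team artifact instead of one developer's local setup. The sketch below is a hypothetical illustration of that idea; the file paths and the `capture` helper are assumptions, not part of Garg's framework.

```python
from enum import Enum

class Signal(Enum):
    CONTEXT = "context"          # recurring background info -> priming document
    INSTRUCTION = "instruction"  # prompt patterns that work  -> shared commands
    WORKFLOW = "workflow"        # when (not) to use the agent -> team playbook
    FAILURE = "failure"          # recurring error category   -> guardrail

# Hypothetical mapping from signal type to shared team artifact
ARTIFACT_FOR = {
    Signal.CONTEXT: "docs/priming.md",
    Signal.INSTRUCTION: "team/commands.md",
    Signal.WORKFLOW: "team/playbook.md",
    Signal.FAILURE: "team/guardrails.md",
}

def capture(signal: Signal, note: str, store: dict[str, list[str]]) -> None:
    # Append the observation to the team artifact, not a local config
    store.setdefault(ARTIFACT_FOR[signal], []).append(note)
```

The point of the routing is that a signal captured once is inherited by every team member's setup, which is exactly the compounding effect the Flywheel is after.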
This is familiar territory for anyone who has built self-improving coding agents. The AGENTS.md pattern, repository memory, instruction persistence. Garg’s contribution is extending the concept from individual developer practice to team-level infrastructure.
His sharpest observation: “Teams adopting AI tools at roughly the same time can arrive at very different places six months later.” The difference is not talent. It is whether the team builds feedback loops or relies on individual heroics.
The Third Approach: Let the Agent Redesign Its Own Memory
Neither ALTK-Evolve nor the Flywheel is the most aggressive approach to agent self-improvement. That distinction belongs to the paradigm we examined last month: agents that redesign their own memory architecture.
Zak El Fassi’s experiment showed an agent evaluating its own memory system, identifying structural blind spots, and executing fixes through parallel subagents. Recall jumped from 60% to 93%. Cost: two dollars.
Where ALTK-Evolve extracts guidelines from trajectories, and the Flywheel captures team signals, self-designed memory lets the agent restructure the substrate of its own cognition. It is not learning what to do differently. It is reorganizing how it stores and retrieves everything it knows.
Three Approaches, One Blind Spot
Here is what all three approaches share: none of them answers who validates what the agent learns.
ALTK-Evolve scores guidelines automatically. A guideline that leads to task failures gets downranked. One that leads to successes gets promoted. But scoring against task completion is not the same as validating the guideline itself. A guideline could produce correct outputs through flawed reasoning. It could encode a shortcut that works on benchmarks but fails in production. It could embed a bias that no automated scoring metric would catch.
The Flywheel relies on teams to curate signals. But Garg offers no framework for quality control over what enters the shared infrastructure. Who reviews the guardrails before they become team defaults? What happens when two team members encode contradictory instructions? How do you deprecate a guideline that was useful six months ago but is counterproductive now?
Self-designed memory, as we explored in depth, has the most acute version of this problem. The agent that evaluates, restructures, and validates its own memory is both judge and defendant. The validator’s dilemma: any system that grades its own performance will systematically undercount the failure modes it cannot conceptualize.
We mapped the observability layer needed to watch agents in production. Factory’s Signals system detects friction patterns across thousands of sessions. But detecting friction is not the same as validating learning. You can observe that an agent is performing better on a metric without understanding whether the knowledge it acquired is sound, complete, or safe.
The Governance Architecture That Does Not Exist Yet
What would a validated learning system look like?
Separation of learner and validator. The agent that uses guidelines should not be the sole judge of their quality. This principle is old. Financial audits require external auditors. Code reviews require different eyes. We made this argument for memory; it applies equally to learned guidelines and team playbooks.
Provenance tracking for every learned rule. Where did this guideline come from? Which failures generated it? When was it last validated? Against what tasks? ALTK-Evolve tracks scores, which is a start. But a score is not provenance.
Adversarial testing of learned knowledge. After an agent learns a new guideline, test it against scenarios designed to expose the guideline’s limits. ALTK-Evolve prunes low-scoring guidelines. That removes the obviously bad ones. It does not catch the subtly wrong ones that happen to produce correct outputs for incorrect reasons.
Deprecation and sunset mechanisms. Knowledge has a shelf life. A guideline extracted from GPT-4 behavior may not apply to Claude. A team playbook written for a monolith may be counterproductive in a microservices architecture. Without explicit expiration policies, learned knowledge accumulates indefinitely, and stale rules become invisible constraints.
Cross-validation between approaches. What happens when an ALTK-Evolve guideline contradicts a team Flywheel playbook? When an agent’s self-designed memory structure conflicts with organizational retention policies? These collisions are inevitable when multiple learning systems operate on the same agent. Nobody has proposed a resolution mechanism.
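The first, second, and fourth requirements can be combined into a single record shape. This is a speculative sketch of what such a record might look like, not an existing system: `GovernedGuideline`, `admit`, and the 180-day default shelf life are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class GovernedGuideline:
    text: str
    derived_from: list[str]  # trajectory IDs that generated it (provenance)
    validated_by: str        # external validator's identity, never the learner's
    validated_on: date
    ttl: timedelta = timedelta(days=180)  # knowledge has a shelf life

    def is_live(self, today: date) -> bool:
        # Expired guidelines must be re-validated before reuse
        return today - self.validated_on <= self.ttl

def admit(g: GovernedGuideline, learner_id: str) -> bool:
    # Separation of duties: a guideline validated by the agent that
    # learned it, or with no traceable origin, is rejected outright
    return g.validated_by != learner_id and g.derived_from != []
```

A score is one float; this record is the minimum that would let a reviewer answer the questions above: where the rule came from, who signed off on it, and when it stops being trusted.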
Why This Matters Now
These are not theoretical concerns. ALTK-Evolve is open source with a Claude Code plugin. The Flywheel is a framework that teams can implement today. Self-improving memory costs two dollars. The tools for agent learning are here. The tools for validating what agents learn are not.
We have spent the last three months building the intellectual foundation for this argument: how agents learn, how to make learning persist, how to observe agents in production, who validates self-designed memory. Each piece added a layer. This one connects them.
The trajectory is clear. Agents will learn. They will improve from their own mistakes. They will absorb team knowledge. They will redesign their own memory. The question is not whether this happens. The question is whether humans remain in the validation loop when it does.
An agent that learns without validation is not self-improving. It is self-reinforcing. Those are different things. One gets better. The other gets more confident.
This analysis synthesizes ALTK-Evolve: Agentic LLM Toolkit Evolution by Isahagian et al., IBM Research (April 2026), The Feedback Flywheel by Rahul Garg, Thoughtworks (April 2026), and Victorino Group’s prior research on agent memory governance and agent observability.
Victorino Group builds governance infrastructure for self-improving AI systems, from validation frameworks to separation-of-duties architecture for agent learning. Let’s talk.
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.