For years, the dominant concern for AI agents was capability. Could agents reason well enough? Could they use tools? Could they plan across multiple steps? The concern has shifted to whether we can trust what they do over long periods of time when no one is watching.
As agents move from short, bounded tasks into hours and days of sustained autonomous work, new types of failure have emerged. Agents declare success on incomplete work, commit to the wrong strategy, drift without producing any inspectable signal, or exploit weaknesses in the evaluation system to make it appear as if they’re succeeding. These are not isolated bugs but structural properties of systems designed to act at length without adequate supervision.
This is the agent reliability crisis, and it is a crisis of architecture.
When Agents Judge Themselves
The first and most consequential problem is that agents are poor judges of their own work. A system operating inside a long-running task is designed to optimize towards progress and completion. When the evaluation function lives within the same loop as the execution, agents overstate output quality. They stop before external evidence confirms success and produce artifacts that look finished because the system that made them also assessed them.
The fix is not a matter of prompting more carefully. Evaluation must sit outside execution. The process that produced the work cannot be the process that verifies it. When those two functions are separated, with an independent supervisor checking claims against actual artifacts and measurable results, agents see tasks through to the end rather than stopping when they prefer to.
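One way to picture this separation is a verifier that never sees the agent's self-assessment, only its claims and the artifacts it left behind. The sketch below is illustrative rather than a description of any particular system: the `Claim` structure, the `verify_claims` function, and the artifact names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Claim:
    """What the agent says it accomplished (hypothetical structure)."""
    description: str
    artifact_key: str  # where the supervisor should look for evidence


def verify_claims(
    claims: List[Claim],
    artifacts: Dict[str, str],
    checks: Dict[str, Callable[[str], bool]],
) -> List[str]:
    """Return descriptions of claims that the evidence does not support.

    Runs outside the agent loop: it reads only the artifacts, never the
    agent's own opinion of them.
    """
    unsupported = []
    for claim in claims:
        artifact = artifacts.get(claim.artifact_key)
        check = checks.get(claim.artifact_key)
        # A missing artifact or a missing check counts as unverified.
        if artifact is None or check is None or not check(artifact):
            unsupported.append(claim.description)
    return unsupported


# Example: the agent reports two finished deliverables, but one is empty.
claims = [
    Claim("wrote the report", "report.md"),
    Claim("tests pass", "test_log.txt"),
]
artifacts = {"report.md": "", "test_log.txt": "12 passed, 0 failed"}
checks = {
    "report.md": lambda text: len(text.strip()) > 0,
    "test_log.txt": lambda text: ", 0 failed" in text,
}
unsupported = verify_claims(claims, artifacts, checks)
print(unsupported)
```

The design choice that matters here is the default: a claim with no artifact or no check is treated as unverified, so the task stays open until evidence exists.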
The Anchoring Effect
Another common failure mode is overcommitment, where an agent anchors on the first plausible solution it generates and continues to develop it even when it would make better sense to try other approaches. The inner loop keeps executing and the trajectory keeps advancing, all while the search space being explored narrows, and the expected value of continuing degrades.
In one trial we ran, an agent spent 14 hours on a sophisticated modeling approach, accumulating complexity and cost, while a separate run on the same task found a comparably strong result in under three hours by trying 19 simpler alternatives. The first agent wasn’t exactly failing, because no single step was obviously wrong, but it was succeeding at the wrong thing. The trajectory as a whole was the problem.
The inverse occurs as well. Agents sometimes conclude that a task is impossible because their current strategy is not working, when they should instead simply abandon the current strategy. A transient infrastructure failure, a bad initial estimate, or a misleading early result can cause an agent to write off an entire problem. These are, once again, supervision failures. The decision of whether to continue, pivot, or stop progress on a specific strategy requires a broader view than any agent operating from inside a single trajectory can have.
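A supervisor with that broader view can be sketched as a small decision function over the whole trajectory rather than any single step. Everything in this example is an assumption made for illustration, including the `review_trajectory` name, the checkpoint scores, and the thresholds; a real policy would be tuned to the task.

```python
from enum import Enum
from typing import List


class Decision(Enum):
    CONTINUE = "continue"
    PIVOT = "pivot"
    STOP = "stop"


def review_trajectory(
    scores: List[float],
    hours_spent: float,
    hours_budget: float,
    strategies_tried: int,
    max_strategies: int,
    window: int = 3,
    min_gain: float = 0.01,
) -> Decision:
    """Decide from outside the trajectory whether to continue, pivot, or stop."""
    # The overall budget is a hard limit regardless of strategy.
    if hours_spent >= hours_budget:
        return Decision.STOP
    # Too few checkpoints to judge progress yet.
    if len(scores) <= window:
        return Decision.CONTINUE
    # Compare the best recent score with the best score before the window.
    recent_gain = max(scores[-window:]) - max(scores[:-window])
    if recent_gain >= min_gain:
        return Decision.CONTINUE
    # A stalled strategy is a reason to pivot, not to abandon the task,
    # as long as untried strategies remain.
    if strategies_tried < max_strategies:
        return Decision.PIVOT
    return Decision.STOP


stalled = review_trajectory(
    scores=[0.60, 0.71, 0.72, 0.72, 0.72, 0.72],
    hours_spent=6.0, hours_budget=12.0,
    strategies_tried=1, max_strategies=5,
)
improving = review_trajectory(
    scores=[0.5, 0.6, 0.7, 0.8],
    hours_spent=2.0, hours_budget=12.0,
    strategies_tried=1, max_strategies=5,
)
print(stalled, improving)
```

Note that a stall leads to a pivot before it leads to a stop, which separates "this strategy is not working" from "this task is impossible."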
State, Memory, and the Drift Problem
Agents that run for hours accumulate history. That history should be useful, but often leads them astray. Memory that was accurate when recorded becomes stale. Failure summaries accumulate negative framing that discourages exploration. Constraints that no longer apply continue to narrow the search space. In one case, an agent loaded 10 accurate, well-sourced records about prior environmental difficulties and then remained inactive for 12 hours. The actual fix for the underlying problem was a 10-minute configuration change. The agent never tried it because its memory captured a record of a hostile environment.
Counterintuitively, deleting records is often more valuable than adding them. An append-only memory store is an accumulation problem waiting to manifest. Persistent memory needs active curation, with explicit policies about what to keep, what to downweight, and what to remove entirely. Memory that degrades judgment should not survive.
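A curation policy of this kind can be quite small. The sketch below assumes a hypothetical `MemoryRecord` with a `kind` label and an age in hours; the record kinds, the half-life for failure summaries, and the cutoffs are illustrative choices, not recommendations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryRecord:
    text: str
    kind: str          # assumed labels: "fact", "failure", "constraint"
    age_hours: float
    weight: float = 1.0


def curate(
    records: List[MemoryRecord],
    max_constraint_age_hours: float = 24.0,
    failure_half_life_hours: float = 6.0,
    min_weight: float = 0.1,
) -> List[MemoryRecord]:
    """Apply keep / downweight / remove rules and return the surviving records."""
    kept = []
    for record in records:
        # Remove: old constraints may no longer apply and narrow the search.
        if record.kind == "constraint" and record.age_hours > max_constraint_age_hours:
            continue
        weight = record.weight
        # Downweight: failure summaries decay so they stop discouraging retries.
        if record.kind == "failure":
            weight *= 0.5 ** (record.age_hours / failure_half_life_hours)
        # Remove: anything decayed below the floor is dropped entirely.
        if weight < min_weight:
            continue
        kept.append(MemoryRecord(record.text, record.kind, record.age_hours, weight))
    return kept


store = [
    MemoryRecord("dataset lives in bucket A", "fact", age_hours=100.0),
    MemoryRecord("cluster rejected job at startup", "failure", age_hours=12.0),
    MemoryRecord("network was unreachable", "failure", age_hours=30.0),
    MemoryRecord("do not use GPU nodes", "constraint", age_hours=48.0),
]
curated = curate(store)
for record in curated:
    print(record.kind, record.weight, record.text)
```

In this example the 30-hour-old failure and the 48-hour-old constraint are removed, while the 12-hour-old failure survives at a quarter of its original weight.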
Tools Designed for Humans, Not Agents
A less visible failure point is the interface between agents and the tools they use. Most tools were designed for human callers who bring judgment and contextual reading to every interaction. Agents cannot do that reliably. When a tool returns an ambiguous response to a timeout or invalid input, a human interprets it instantly. An agent turns it into a reasoning problem. Tools intended for agent consumption need structured, machine-readable responses with explicit success and failure states and clear signals about whether a failure is retryable. The tool layer directly shapes agent behavior, and poorly designed interfaces degrade reliability across every run that touches them.
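As a minimal sketch of what such a response could look like, the example below defines a hypothetical `ToolResult` with an explicit success flag, an error code, and a retryable signal, plus a caller that acts on those fields instead of interpreting free text. The field names and the `call_with_retry` helper are assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolResult:
    ok: bool                          # explicit success or failure state
    data: Optional[str] = None
    error_code: Optional[str] = None  # machine-readable, e.g. "timeout"
    retryable: bool = False           # whether trying again can help


def call_with_retry(
    tool: Callable[[], ToolResult],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> ToolResult:
    """Retry only when the tool itself says the failure is retryable."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        result = tool()
        if result.ok or not result.retryable:
            return result
        if attempt < max_attempts - 1:
            time.sleep(backoff_seconds * (2 ** attempt))
    return result


calls = {"flaky": 0, "invalid": 0}


def flaky_tool() -> ToolResult:
    """Times out twice, then succeeds."""
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        return ToolResult(ok=False, error_code="timeout", retryable=True)
    return ToolResult(ok=True, data="42 rows")


def invalid_input_tool() -> ToolResult:
    """Fails in a way that retrying cannot fix."""
    calls["invalid"] += 1
    return ToolResult(ok=False, error_code="invalid_input", retryable=False)


flaky_result = call_with_retry(flaky_tool)
invalid_result = call_with_retry(invalid_input_tool)
print(flaky_result, calls)
```

The timeout is retried until it succeeds, while the invalid-input failure returns after a single call, so the agent spends no reasoning on guessing which kind of failure it saw.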
A related problem accumulates at the context level. Long-running agents gather capabilities over time, and loading everything by default means the working context contains material that is rarely relevant to the current decision. The fix is to treat the default context as a capability index rather than a capability library, with tools and reference materials loaded explicitly when needed. The same discipline applies to memory retrieval. Durable state should be queried for the relevant slice, not copied wholesale into the prompt. Context is a limited resource, and how it is managed is an architectural choice with direct consequences for reliability.
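The index-versus-library distinction can be illustrated with a small registry that puts only one-line summaries in the default context and loads full material on request. The `CapabilityIndex` class and the tool names below are hypothetical, chosen only to show the shape of the idea.

```python
from typing import Callable, Dict


class CapabilityIndex:
    """Keeps short summaries in context; loads full material only on demand."""

    def __init__(self) -> None:
        self._summaries: Dict[str, str] = {}
        self._loaders: Dict[str, Callable[[], str]] = {}
        self._loaded: Dict[str, str] = {}

    def register(self, name: str, summary: str, loader: Callable[[], str]) -> None:
        self._summaries[name] = summary
        self._loaders[name] = loader

    def index_text(self) -> str:
        """The default context: one line per capability, no full documents."""
        return "\n".join(
            f"{name}: {summary}" for name, summary in sorted(self._summaries.items())
        )

    def load(self, name: str) -> str:
        """Fetch the full material the first time it is needed, then cache it."""
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]


load_counts = {"sql_tool": 0}


def load_sql_docs() -> str:
    load_counts["sql_tool"] += 1
    return "FULL SQL TOOL REFERENCE " * 50


index = CapabilityIndex()
index.register("sql_tool", "run read-only SQL queries", load_sql_docs)
index.register("plotter", "render charts from CSV", lambda: "FULL PLOTTER REFERENCE " * 50)

default_context = index.index_text()
sql_docs = index.load("sql_tool")
index.load("sql_tool")  # second request is served from the cache
print(default_context)
```

The default context here is two short lines, and the long reference text enters the working context only when a decision actually calls for it.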
Lack of Observability
When an agent goes silent during a multi-hour computation, no one outside the system can tell whether it is working productively or has stalled. Observability also has to cover background jobs and other long-running tasks that live outside the agent loop.
Observability cannot be bolted on after the fact, but must instead be built into the system from the start as a structured, durable record of every decision, tool call, state change, and supervisory action. That record is what makes human review possible, what lets supervisors intervene when trajectories go wrong, and what makes outcomes auditable after the fact.
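A structured, durable record does not have to be elaborate to be useful. The sketch below assumes an append-only JSON Lines file as the storage format and uses hypothetical event kinds and field names; it also shows how the same record lets an outside observer measure how long an agent has been silent.

```python
import json
import os
import tempfile
import time
from typing import Any, Dict, List, Optional


class EventLog:
    """Append-only JSON Lines log of decisions, tool calls, and interventions."""

    def __init__(self, path: str) -> None:
        self.path = path

    def record(self, kind: str, **fields: Any) -> None:
        event = {**fields, "kind": kind, "ts": time.time()}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event, sort_keys=True) + "\n")
            # Flush to disk so the record survives a crash of the agent process.
            f.flush()
            os.fsync(f.fileno())

    def read(self) -> List[Dict[str, Any]]:
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def seconds_silent(events: List[Dict[str, Any]], now: float) -> Optional[float]:
    """Time since the last recorded event, or None if nothing was recorded."""
    if not events:
        return None
    return now - events[-1]["ts"]


log_path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
log = EventLog(log_path)
log.record("decision", detail="try gradient boosting first")
log.record("tool_call", tool="train_model", status="started")
log.record("supervisor_action", action="pivot", reason="no gain in 3 checkpoints")

events = log.read()
# Simulate a supervisor checking in ten minutes after the last event.
silence = seconds_silent(events, now=events[-1]["ts"] + 600.0)
print([event["kind"] for event in events], silence)
```

Because the log is written by the system rather than narrated by the agent, a supervisor or a human reviewer can read it without trusting the agent's own account of what happened.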
The agent reliability crisis will not be resolved by better frontier models, but by building better systems around them. Agents need a supervisor, an evaluation layer, a memory substrate, constraint enforcement, and an observability surface. As long as those components are absent or weak, capable agents will keep producing unreliable outcomes.
About the author: Doris Xin is CEO and cofounder of Disarray, a company developing the underlying architecture for RSI and enabling long-horizon autonomy for agents. She collaborated with Google Research and Microsoft Research, was an early ML engineer at LinkedIn, and contributed to Spark MLlib as Databricks’ first intern. She earned her PhD at UC Berkeley’s RISELab as an NSF Graduate Research Fellow.
The post The Agent Reliability Crisis appeared first on AIwire.


