Skip to main content

3 posts tagged with "agent-reliability"

View all tags

Production AI Incident Response: When Your Agent Goes Wrong at 3am

· 11 min read
Tian Pan
Software Engineer

A multi-agent cost-tracking system at a fintech startup ran undetected for eleven days before anyone noticed. The cause: Agent A asked Agent B for clarification. Agent B asked Agent A for help interpreting the response. Neither had logic to break the loop. The $127 weekly bill became $47,000 before a human looked at the invoice.

No errors were thrown. No alarms fired. Latency was normal. The system was running exactly as designed—just running forever.

This is what AI incidents actually look like. They're not stack traces and 500 errors. They're silent behavioral failures, runaway loops, and plausible wrong answers delivered at production scale with full confidence. Your existing incident runbook almost certainly doesn't cover any of them.

The Retry Storm Problem in Agentic Systems: Why Every Failed Tool Call Burns Your Token Budget

· 10 min read
Tian Pan
Software Engineer

Every backend engineer knows that retries are essential. Every distributed systems engineer knows that retries are dangerous. When you put an LLM agent in charge of retrying tool calls, you get both problems at once — plus a new one: every retry burns tokens. A single flaky API endpoint can turn a $0.01 agent task into a $2 meltdown in under a minute.

The retry storm problem isn't new. Distributed systems have dealt with thundering herds and cascading failures for decades. But agentic systems amplify the problem in ways that microservice patterns don't fully address, because the retry logic lives inside a probabilistic reasoning engine that doesn't understand backpressure.

The Stale World Model Problem in Long-Running Agents

· 10 min read
Tian Pan
Software Engineer

An AI agent reads a file at turn 3, reasons about its contents through turns 4 through 30, and then — at turn 31 — writes a modified version back to disk. The file was edited by another process at turn 17. The agent overwrites the newer version with a stale one, silently. No exception is raised. No alert fires. From the outside, the agent completed its task successfully.

This is the stale world model problem, and it's one of the most under-discussed failure modes in production agentic systems. Unlike context window overflows or tool call failures — which surface as errors — world model staleness produces agents that look operational while making decisions on outdated information. The failures are quiet, often irreversible, and they compound over the length of a task.