19 posts tagged with "debugging"

AI-Native Logging: Capture Decisions, Not Just I/O

May 5, 2026 · 10 min read

Software Engineer

A customer support agent was generating hallucinated troubleshooting steps for 12% of tickets. The HTTP logs showed 200 OK across the board. Latency was normal. Error rates were flat. The system looked healthy by every conventional metric — and it was quietly fabricating answers at scale.

When engineers finally instrumented the decision layer, the root cause emerged in minutes: similarity scores for retrieved chunks were all below 0.4, confidence in the context was 0.28, and yet the model's stated output confidence read 0.91. A massive mismatch — invisible in traditional logs, obvious in a trace that captured the decision state.

This is the fundamental problem with applying conventional logging to LLM systems. I/O logs tell you your system ran. AI-native logging tells you whether it reasoned correctly.

The Agentic Debugger's Trap: When Your Agent Patches Faster Than You Can Diagnose

May 2, 2026 · 10 min read

Tian Pan

Software Engineer

A staff engineer I worked with last quarter caught a bug that had already been "fixed" three times in the previous six weeks. Three different engineers. Three different files. Three green CI runs. Three accepted agent-generated patches. Each patch made the failing test pass and the user-reported error disappear. Each one moved the bug somewhere else, where it waited until a different surface area triggered it again. The fourth time it surfaced, the data corruption it caused had been silently compounding for forty days.

The bug was a single off-by-one in a pagination cursor. The agent had been right that the symptom would go away. It had been wrong about why. And the engineers — competent, senior, well-intentioned — had each accepted a passing patch before they understood the failure mechanism.

This is the agentic debugger's trap: your agent can produce a fix faster than you can build the mental model needed to evaluate whether the fix is correct. Patch velocity outruns diagnosis. The bug count drops, the CI dashboard goes green, and you ship a codebase whose failure modes you no longer understand.

The 'Try a Bigger Model' Reflex Is a Refactor Smell

April 28, 2026 · 10 min read

Tian Pan

Software Engineer

A regression lands in standup: the support agent answered three customer questions wrong overnight. Someone says, "let's try Opus on this route and see if it fixes it." Forty minutes later the eval pass rate ticks back up, the team closes the ticket, and the inference bill quietly tripled on that path. Six weeks later the same shape of regression appears on a different route, and the same fix is applied. Your team has just trained a Pavlovian reflex: quality regression → escalate compute. The bigger model is the most expensive debugging tool in your stack, and you're now reaching for it first.

The trouble isn't that bigger models don't help. They do — sometimes a lot. The trouble is that bigger models are a strictly dominant masking strategy. When the prompt has a conflicting instruction, the retrieval is returning stale chunks, the tool description is being misread, or the eval set doesn't cover the failing distribution, a more capable model will round the corner of the failure without fixing any of those things. The next regression has the same root cause, the bill has compounded, and the underlying system is more brittle, not less, because the slack created by the upgrade kept anyone from looking under the hood.

Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The agent took four hundred milliseconds to answer a simple question, then crashed with a recursion-limit error. The trace showed twenty-five tool calls. Reading the trace top-to-bottom, an engineer would conclude the agent was confused — calling the same handful of tools in slightly different orders, never converging. That conclusion is wrong. The agent wasn't confused. It was stuck in a cycle: tool A invoked the model, the model picked tool B, tool B's implementation invoked the model again to format its output, and the formatter chose tool A. The trace UI rendered four nested calls as four sibling calls in a flat list, and the cycle was invisible to the only human who could have caught it.

This is tool reentrancy, and it's a bug class your function-calling layer almost certainly doesn't model. Concurrency-safe code has decades of primitives for it: reentrant mutexes that count nested acquisitions by the same thread, recursion limits at the language level, stack inspection APIs, and a cultural understanding that any function which calls back into the runtime needs a clear contract about what re-entry is allowed. Tool-calling layers default to fire-and-forget. There is no call stack the runtime can inspect, no cycle detector before dispatch, no reentrancy attribute on the tool definition, and the trace UI is shaped like a log, not a graph. The result is that every tool catalog past about a dozen entries silently becomes a recursion the framework can't see.

Pattern-Matching Failures: When Your LLM Solves the Wrong Problem Fluently

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

A user pastes a long, complicated bug report into your AI assistant. It looks like a classic null-pointer question, with the same phrasing and code layout as thousands of Stack Overflow posts. The model responds confidently, cites the usual fix, and sounds authoritative. The user thanks it. The bug is still there. The report was actually about a race condition; the null-pointer framing was incidental to how the user described the symptom.

This is the single hardest bug class to catch in a production LLM system. The model did not refuse. It did not hedge. It did not hallucinate a fake API. It solved the wrong problem, fluently, and everyone downstream — the user, your eval pipeline, your guardrails — saw a plausible on-topic answer and moved on. I call these pattern-matching failures: the model latched onto surface features of the query and produced a confident answer to something adjacent to what was actually asked.

Data Lineage for AI Systems: Tracking the Path from Source to Response

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

A user files a support ticket: "Your AI assistant told me the contract renewal deadline was March 15th. It was February 28th. We missed it." You pull up the logs. The response was generated. The model didn't error. Every metric is green. But you have no idea which document it retrieved, what the model read, or whether the date came from the context or was hallucinated entirely.

This is the data lineage gap. And it's not a monitoring problem — it's an architecture problem baked in from the start.

Pipeline Attribution in Compound AI Systems: Finding the Weakest Link Before It Finds You

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your retrieval precision went up. Your reranker scores improved. Your generator faithfulness metrics look better than last quarter. And yet your users are complaining that the system is getting worse.

This is one of the more disorienting failure modes in production AI engineering, and it happens more often than teams expect. When you build a compound AI system — one where retrieval feeds a reranker, which feeds a generator, which feeds a validator — you inherit a fundamental attribution problem. End-to-end quality is the only metric that actually matters, but it's the hardest one to act on. You can't fix "the system is worse." You need to fix a specific component. And in a four-stage pipeline, that turns out to be genuinely hard.

Hallucination Is Not a Root Cause: A Debugging Methodology for AI in Production

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

When a lawyer cited non-existent court cases in a federal filing, the incident was widely reported as "ChatGPT hallucinated." When a consulting firm's government report contained phantom footnotes, the postmortem read "AI fabricated citations." When a healthcare transcription tool inserted violent language into medical notes, the explanation was simply "the model hallucinated." In each case, an expensive failure got a three-word root cause that made remediation impossible.

"The model hallucinated" is the AI equivalent of writing "unknown error" in a stack trace. It describes what happened without telling you why it happened or how to fix it. Every hallucination has a diagnosable cause — usually one of four categories — and each category demands a different engineering response. Teams that understand this distinction ship AI systems that degrade gracefully. Teams that don't keep playing whack-a-mole with prompts.

The Mental Model Shift That Separates Good AI Engineers from the Rest

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

The most common pattern among engineers who struggle with AI work isn't a lack of technical knowledge. It's that they keep asking the wrong question. They want to know: "Does this work?" What they should be asking is: "At what rate does this fail, and is that rate acceptable for this use case?"

That single shift — from binary correctness to acceptable failure rates — is the core of what experienced AI engineers think differently about. It sounds simple. It isn't. Everything downstream of it is different: how you debug, how you test, how you deploy, what you monitor, what you build your confidence on. Engineers who haven't made this shift will keep fighting their tools and losing.

Tracing the Planning Layer: Why Your Agent Traces Are Missing Half the Story

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

Your agent called the wrong tool three times before finally succeeding, and your trace dashboard shows you exactly which tools were called, in what order, with full latency breakdowns. What the trace doesn't show you is the part that matters: why the agent thought those tool calls were the right move, what goal it was trying to satisfy, and what assumption it was operating under when it made each wrong decision.

This is the gap at the center of agent observability in 2026. Practitioners have invested heavily in tool-call tracing. The tooling is mature, the OpenTelemetry semantic conventions are established, and the dashboards are beautiful. But agent debugging keeps running into the same wall: you have complete visibility into what the agent did, and zero visibility into why.

The Debugging Regression: How AI-Generated Code Shifts the Incident-Response Cost Curve

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

In March 2026, a single AI-assisted code change cost one major retailer 6.3 million lost orders and a 99% drop in North American order volume — a six-hour production outage traced to a change deployed without proper review. It wasn't a novel attack. There was no exotic failure mode. The system just did what the AI told it to do, and no one on-call had the mental model to understand why that was wrong until millions of customers had already seen errors.

This is the debugging regression. The productivity gains from AI-generated code are front-loaded and visible on dashboards. The costs are back-loaded and invisible until your alerting wakes you up at 3am.

Debugging AI at 3am: Incident Response for LLM-Powered Systems

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

You're on-call. It's 3am. Your alert fires: customer satisfaction on the AI chat feature dropped 18% in the last hour. You open the logs and see... nothing. Every request returned HTTP 200. Latency is normal. No errors anywhere.

This is the AI incident experience. Traditional on-call muscle memory — grep for stack traces, find the exception, deploy the fix — doesn't work here. The system isn't broken. It's doing exactly what it was designed to do. The outputs are just wrong.

About Tian Pan