
183 posts tagged with "observability"


AI-Native Logging: Capture Decisions, Not Just I/O

· 10 min read
Tian Pan
Software Engineer

A customer support agent was generating hallucinated troubleshooting steps for 12% of tickets. The HTTP logs showed 200 OK across the board. Latency was normal. Error rates were flat. The system looked healthy by every conventional metric — and it was quietly fabricating answers at scale.

When engineers finally instrumented the decision layer, the root cause emerged in minutes: similarity scores for retrieved chunks were all below 0.4, confidence in the context was 0.28, and yet the model's stated output confidence read 0.91. A massive mismatch — invisible in traditional logs, obvious in a trace that captured the decision state.

This is the fundamental problem with applying conventional logging to LLM systems. I/O logs tell you your system ran. AI-native logging tells you whether it reasoned correctly.
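What that instrumentation looks like can be small. Here is a minimal sketch in Python, standard library only; the `DecisionRecord` fields, the 0.5/0.8 thresholds, and the ticket ID are illustrative assumptions, not a prescribed schema:

```python
import json
import logging
from dataclasses import dataclass, asdict

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("decision")

@dataclass
class DecisionRecord:
    """One reasoning step: what the model saw and how sure everything was."""
    request_id: str
    retrieved_chunk_scores: list[float]  # similarity of each retrieved chunk
    context_confidence: float            # how well the retrieved context covers the question
    output_confidence: float             # the model's stated confidence in its answer

    def confidence_mismatch(self) -> bool:
        # The incident's failure mode: weak evidence, confident answer.
        best_evidence = max(self.retrieved_chunk_scores, default=0.0)
        return best_evidence < 0.5 and self.output_confidence > 0.8

record = DecisionRecord(
    request_id="ticket-4821",
    retrieved_chunk_scores=[0.38, 0.31, 0.22],
    context_confidence=0.28,
    output_confidence=0.91,
)
log.info(json.dumps({**asdict(record), "confidence_mismatch": record.confidence_mismatch()}))
```

The mismatch check runs on fields the HTTP log never had; it only becomes possible once the decision state is recorded at all.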

Your AI Product's Dark Energy: The Background Compute Nobody Budgeted

· 10 min read
Tian Pan
Software Engineer

When your AI feature ships, you build a latency budget: how long does the model call take, how long does retrieval take, what's the p99 for the full request. What you almost certainly don't build is a budget for the inference that happens when no user is watching.

Every AI product with persistent state runs invisible work in the background. Documents get preprocessed when uploaded. Long conversations get re-summarized at session boundaries so the next session doesn't blow the context window. Proactive suggestions get generated on a schedule nobody set deliberately. Embeddings get regenerated when someone updates the schema. None of this shows up in your latency dashboard. Frequently it isn't in your cost model. Almost never is it in your monitoring.

This is your AI product's dark energy — the compute that explains the gap between what your inference bill should be and what it actually is.
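One low-effort way to make that dark energy visible is to attribute every model call to the trigger that caused it, not just the endpoint that served it. A rough sketch; the trigger names and token counts below are made up for illustration:

```python
from collections import defaultdict

# Token spend bucketed by what triggered the call, not just by endpoint.
spend_by_trigger: dict[str, int] = defaultdict(int)

def record_inference(trigger: str, prompt_tokens: int, completion_tokens: int) -> None:
    """Attribute each model call to a trigger: 'user_request', 'doc_preprocess',
    'session_resummarize', 'proactive_suggestion', 'embedding_refresh', ..."""
    spend_by_trigger[trigger] += prompt_tokens + completion_tokens

# Background work gets tagged the same way as user-facing work.
record_inference("user_request", 1200, 300)
record_inference("session_resummarize", 6000, 800)   # no latency dashboard ever sees this
record_inference("embedding_refresh", 45000, 0)

for trigger, tokens in sorted(spend_by_trigger.items(), key=lambda kv: -kv[1]):
    print(f"{trigger:22s} {tokens:>8d} tokens")
```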

Why Your Application Logs Can't Reconstruct an AI Decision

· 11 min read
Tian Pan
Software Engineer

An AI system flags a job application as low-priority. The candidate appeals. Legal asks engineering: "Show us exactly what the model saw, which documents it retrieved, which policy rules fired, and what confidence score it produced." Engineering opens the logs and finds: a timestamp, an HTTP 200, a response body, and a latency metric. The rest is gone.

This is not a logging failure. The logs are complete by every traditional measure. The problem is that application logs were never designed to record reasoning — and AI systems don't just execute code, they make context-dependent probabilistic decisions that can only be understood given the full input context that existed at decision time.
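Answering the legal team's question requires persisting a decision envelope at decision time, because the inputs cannot be reconstructed later. A minimal sketch, with hypothetical field names and `print` standing in for the append-only store you would actually write to:

```python
import hashlib
import json
import time

def snapshot_decision(prompt: str, retrieved_doc_ids: list[str],
                      rules_fired: list[str], confidence: float,
                      model_version: str) -> dict:
    """Record what the model saw at decision time, not just what it returned."""
    envelope = {
        "ts": time.time(),
        "model_version": model_version,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "retrieved_doc_ids": retrieved_doc_ids,
        "rules_fired": rules_fired,
        "confidence": confidence,
    }
    # In production this goes to an append-only store, keyed by request id.
    print(json.dumps(envelope, indent=2))
    return envelope

snapshot_decision(
    prompt="Assess priority of application #5513 ...",
    retrieved_doc_ids=["policy/priority-v3", "resume/5513"],
    rules_fired=["missing_required_certification"],
    confidence=0.62,
    model_version="gpt-4o-2024-08-06",
)
```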

End-to-End Latency Is Not P99 of Your LLM Call: The Multipliers Nobody Measures in Agentic Systems

· 9 min read
Tian Pan
Software Engineer

Your LLM API call completes in 500ms at P99. Your users are waiting 12 seconds. Both numbers are accurate, and neither is lying to you — they're just measuring completely different things. The gap between them is where most agentic systems silently bleed performance, and most teams never instrument it.

The problem is structural: P99 LLM latency is a single-call metric applied to a multi-step execution model. A ReAct agent making five sequential tool calls, retrying a hallucinated function, assembling a growing context, and generating a 300-token reasoning chain is not one LLM call. It's a distributed workflow where the LLM is just one node, and every other node has its own latency tax.
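The fix starts with decomposable timing: measure every node of the workflow against the same end-to-end clock, not just the model call. A toy sketch in Python, with `time.sleep` standing in (scaled down) for real step latencies; the step names are illustrative:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

step_latency: dict[str, float] = defaultdict(float)

@contextmanager
def timed(step: str):
    """Accumulate wall-clock time per step so the end-to-end number is decomposable."""
    start = time.perf_counter()
    try:
        yield
    finally:
        step_latency[step] += time.perf_counter() - start

def handle_request() -> None:
    for _ in range(5):                        # five sequential tool calls
        with timed("llm_call"):
            time.sleep(0.05)                  # stand-in for the 500ms P99 model call
        with timed("tool_execution"):
            time.sleep(0.08)
        with timed("context_assembly"):
            time.sleep(0.02)
    with timed("retry_after_hallucinated_tool_call"):
        time.sleep(0.10)

handle_request()
total = sum(step_latency.values())
for step, seconds in sorted(step_latency.items(), key=lambda kv: -kv[1]):
    print(f"{step:34s} {seconds:5.2f}s  ({seconds / total:4.0%} of end-to-end)")
```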

The Invisible Handoff: Why Production AI Failures Cluster at Component Boundaries

· 9 min read
Tian Pan
Software Engineer

When your AI feature ships a wrong answer, the first question is always: "Was it the model?" Most engineers reach for model evaluation, run a few test prompts, and conclude the model looks fine. They're usually right. The model is fine. The breakage happened somewhere else—at one of the invisible seams where your components talk to each other.

The evidence for this is consistent. Analysis of production RAG deployments shows 73% of failures are retrieval failures, not generation failures. In multi-agent systems, the most common failure modes are message ordering violations, state synchronization gaps, and schema mismatches—none of which show up in any per-component health check. GPT-4 produces invalid responses on complex extraction tasks nearly 12% of the time, not because the model is broken, but because the output format contract between the model and the downstream parser was never enforced.

The model gets blamed. The boundary is the culprit.
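Enforcing the boundary can be as simple as validating the model's output against the contract the parser assumes, before anything downstream touches it. A sketch using only the standard library; the field names and types are hypothetical:

```python
import json

REQUIRED_FIELDS = {"customer_id": str, "issue_category": str, "priority": str}

class BoundaryViolation(Exception):
    """Raised when the model's output breaks the contract the parser relies on."""

def enforce_extraction_contract(raw_model_output: str) -> dict:
    try:
        payload = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        raise BoundaryViolation(f"not valid JSON: {exc}") from exc
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in payload:
            raise BoundaryViolation(f"missing field: {name}")
        if not isinstance(payload[name], expected_type):
            raise BoundaryViolation(
                f"{name} is {type(payload[name]).__name__}, expected {expected_type.__name__}"
            )
    return payload

# Every per-component health check passes; the contract check names the seam.
bad = '{"customer_id": 4417, "issue_category": "billing"}'
try:
    enforce_extraction_contract(bad)
except BoundaryViolation as err:
    print(f"boundary failure (not a model failure): {err}")
```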

The Rollout Sequencing Problem: Why Co-Deploying Model and Infrastructure Changes Destroys Observability

· 9 min read
Tian Pan
Software Engineer

Three weeks into your quarter, a production alert fires. Accuracy on a core task dropped eight percentage points. You open the dashboard and immediately notice three things that all landed in the same deploy window: a context length increase from 8k to 32k tokens, a model version upgrade from gpt-4-turbo-preview to gpt-4o, and a batch size change your infrastructure team pushed to improve throughput. None of the three changes individually was considered high-risk. Combined, they've created a debugging problem no one can solve cleanly.

Welcome to the rollout sequencing problem.
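The cheapest defense is procedural: refuse to let more than one attribution-relevant change category ride the same rollout window. A toy gate, with hypothetical change categories:

```python
from collections import Counter

# Hypothetical change manifest for the next rollout window.
pending_changes = [
    {"id": "ctx-32k",  "category": "model_config"},    # context 8k -> 32k
    {"id": "gpt-4o",   "category": "model_version"},   # gpt-4-turbo-preview -> gpt-4o
    {"id": "batch-64", "category": "infrastructure"},  # throughput batch size change
]

def sequencing_violations(changes: list[dict]) -> list[str]:
    """Flag rollouts that bundle more than one attribution-relevant change category."""
    categories = Counter(c["category"] for c in changes)
    if len(categories) <= 1:
        return []
    return [f"co-deploying {len(categories)} change categories: {', '.join(sorted(categories))}"]

for violation in sequencing_violations(pending_changes):
    print(f"BLOCK ROLLOUT: {violation}")
```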

Agent Identifiability: When Your Trace Can't Tell You Which Agent Did What

· 11 min read
Tian Pan
Software Engineer

A user reports the assistant gave them a wrong answer at 9:47 a.m. You open the trace. There are three hundred and forty spans. They are almost all named agent.run, llm.invoke, or tool.call. Some have a parent. Some are siblings. Three of them retried. One of them retried and then was cancelled. None of them tells you whether the bad output came from the planner, the worker, the critic, the reflection pass, or the second retry of the worker after the critic flagged it.

You spend the next hour grepping log lines for a UUID prefix you saw in a screenshot, cross-referencing timestamps against a Slack notification, and reconstructing the agent topology in your head from the indentation pattern in the trace viewer. Eventually you guess that the third worker invocation ran with a model alias that silently flipped to a different snapshot the night before. You cannot prove it from the trace alone.

The agent worked. The trace is intact. The hairball is the bug.
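The missing ingredient is identity on every span: which agent, which role, which attempt, which resolved model snapshot. A sketch using the OpenTelemetry Python API (assuming `opentelemetry-api` is installed; the attribute names are my own, not a standard semantic convention):

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-runtime")

def run_agent_step(agent_name: str, role: str, model_snapshot: str,
                   attempt: int, parent_goal: str) -> None:
    # Name the span after the agent, not after the framework primitive.
    with tracer.start_as_current_span(f"agent.{agent_name}.step") as span:
        span.set_attribute("agent.name", agent_name)
        span.set_attribute("agent.role", role)                    # planner / worker / critic / reflection
        span.set_attribute("agent.attempt", attempt)              # distinguishes retries
        span.set_attribute("llm.model_snapshot", model_snapshot)  # the resolved snapshot, not the alias
        span.set_attribute("agent.parent_goal", parent_goal)
        # ... the actual LLM call and tool use go here ...

run_agent_step("worker", role="worker", model_snapshot="gpt-4o-2024-11-20",
               attempt=3, parent_goal="answer the 9:47 a.m. billing question")
```

With those attributes in place, "which agent did what" becomes a filter in the trace viewer instead of an hour of grep.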

Your AI Feature Is Only As Reliable As The ETL Pipeline Nobody Owns

· 10 min read
Tian Pan
Software Engineer

The AI feature has the dashboard. The prompt has the version control. The eval suite has the on-call rotation. And then there is the upstream cron job, written in 2022, owned by a team that rotated out of analytics two reorgs ago, that produces the CSV your retrieval index is built from. That cron job has no SLA. That CSV has no schema contract. The team that owns it does not know it feeds an AI feature. When it changes — and it will change — the AI team will spend three weeks debugging a prompt that did nothing wrong.

The AI quality regression you are about to chase is almost never an AI problem. It is an ETL problem wearing an AI costume. The discipline that has to land sits at the seam between the two — the contract, the lineage, the freshness signal, the paired on-call — and the team that does not formalize it ships an AI feature whose reliability is bounded by the least-loved cron job in the company.
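The seam can be formalized with something as small as a contract-and-freshness check that runs before the retrieval index rebuilds, and that pages both teams when it fails. A sketch; the column names, file path, and 26-hour freshness budget are assumptions:

```python
import csv
import sys
import time
from pathlib import Path

EXPECTED_COLUMNS = {"doc_id", "title", "body", "updated_at"}
MAX_AGE_HOURS = 26  # the cron is nominally daily; anything older is stale

def check_upstream_csv(path: str) -> list[str]:
    """Contract + freshness check run before the retrieval index rebuild."""
    problems = []
    p = Path(path)
    if not p.exists():
        return [f"{path} is missing"]
    age_hours = (time.time() - p.stat().st_mtime) / 3600
    if age_hours > MAX_AGE_HOURS:
        problems.append(f"file is {age_hours:.0f}h old (max {MAX_AGE_HOURS}h)")
    with p.open(newline="") as f:
        header = set(next(csv.reader(f), []))
    missing = EXPECTED_COLUMNS - header
    if missing:
        problems.append(f"columns missing from header: {sorted(missing)}")
    return problems

if __name__ == "__main__":
    issues = check_upstream_csv(sys.argv[1] if len(sys.argv) > 1 else "export.csv")
    if issues:
        print("refusing to rebuild retrieval index:", *issues, sep="\n  - ")
        sys.exit(1)
```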

AI Feature Soak Windows: Why a Two-Week Canary Misses What Actually Matters

· 13 min read
Tian Pan
Software Engineer

The two-week canary is one of those practices that sounds disciplined enough to skip the harder question. Engineering imported it from microservices — ramp 1% for a few days, watch error rate, ramp to 100%, declare done — and grafted it onto AI features without asking whether the failure modes that matter for AI even surface in two weeks. They don't. The bill that kills the feature lands in week six. The customer cohort that exposes the long-tail intent onboards in week five. The eval drift that scored +3% on launch day starts costing real money in week four because the new prompt's chattier outputs have been compounding token spend the whole time, and nobody was watching for that because the dashboard was watching for crashes.

A canary built around p95 latency and HTTP 500s will tell you the LLM is up. It will not tell you the feature is working. AI features fail in shapes the deploy ceremony was never designed to catch — slow shape changes in user behavior, gradual cache erosion, retrieval quality collapse, refusal-rate creep, cost trajectories that bend the wrong way — and almost all of them take longer than two weeks to declare themselves. The team that ships by the microservice clock is shipping by a clock the failures don't run on.
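The soak-window metrics are trajectories, not thresholds. A toy example of the kind of trend check a two-week canary never runs; the weekly numbers are invented:

```python
# Weekly cost-per-request for the canary cohort (hypothetical numbers, USD).
weekly_cost = [0.031, 0.033, 0.037, 0.044, 0.052, 0.063]

def weekly_growth_rate(series: list[float]) -> float:
    """Average week-over-week growth; a trajectory check, not a point-in-time threshold."""
    ratios = [later / earlier for earlier, later in zip(series, series[1:])]
    return sum(ratios) / len(ratios) - 1.0

growth = weekly_growth_rate(weekly_cost)
if len(weekly_cost) >= 4 and growth > 0.05:
    print(f"cost per request growing {growth:.0%}/week over {len(weekly_cost)} weeks -- soak check fails")
```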

The Closed-Loop Escalation Bug: When Your Specialist Agents Route in Circles

· 11 min read
Tian Pan
Software Engineer

A multi-agent system for market data research quietly burned through $47,000 in inference cost over four weeks before anyone noticed. The original weekly bill was $127. The cause wasn't a traffic spike or a model upgrade — it was two agents passing the same conversation back and forth for eleven days, each one confident the other was the right place for the request to live. Nothing errored. No alarm fired. One agent's "queue transferred" metric and the other's "task received" metric both went up in lockstep, and both dashboards looked healthy.

This is the closed-loop escalation bug. It is the multi-agent version of two helpful colleagues each insisting "no, you take it," except neither of them ever gets bored and walks away. The architecture diagram you drew at design time has each specialist owning a clean slice of the problem. The architecture the runtime actually executes has a routing cycle nobody in the room can see.
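Breaking the loop means the conversation has to carry its own routing history, and something has to enforce a hop budget. A minimal sketch; the agent names, the hop limit, and the `hand_off` shape are illustrative:

```python
MAX_HOPS = 6

class RoutingLoop(Exception):
    pass

def hand_off(conversation: dict, to_agent: str) -> dict:
    """Carry routing history with the conversation; refuse cycles and runaway hop counts."""
    history = conversation.setdefault("routing_history", [])
    if to_agent in history:
        raise RoutingLoop(f"cycle detected: {' -> '.join(history + [to_agent])}")
    if len(history) >= MAX_HOPS:
        raise RoutingLoop(f"hop budget exhausted after {len(history)} transfers")
    history.append(to_agent)
    conversation["owner"] = to_agent
    return conversation

convo = {"id": "mkt-research-77"}
try:
    hand_off(convo, "market_data_bot")
    hand_off(convo, "research_bot")
    hand_off(convo, "market_data_bot")   # the second bot bounces it straight back
except RoutingLoop as err:
    print(f"escalate to a human instead: {err}")
```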

The 90-Second Cold Start for Production Agents: When the LLM Isn't the Slow Part

· 10 min read
Tian Pan
Software Engineer

A user clicks the button. Ninety seconds later they get their first token. The team's response, almost reflexively, is to ask the model vendor for a faster TTFT — and the vendor's TTFT is 800 milliseconds. The model was never the slow part. The request waited 30 seconds for a tool registry to load, 20 seconds for a vector store client to negotiate its first connection, 15 seconds for the prompt cache to prime on a fresh container, and another 10 seconds for an agent framework to validate every tool schema in its registry against a JSON schema validator that was loading on first use.

This is the agent cold start, and it has almost nothing to do with the model. Teams that profile only the LLM call are optimizing the part of their request that wasn't slow. Worse, the cold start is invisible in steady state — load tests against a warm pool look great, dashboards plotted on the median look great, and the people who notice are the users who hit the first request after a deploy, an autoscaling event, or a low-traffic stretch where everything got recycled.
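The first step is to make the cold start decomposable: time every init phase separately instead of letting it hide inside one opaque first request. A toy sketch, with `time.sleep` standing in (scaled down) for the real initializers:

```python
import time

startup_phases: dict[str, float] = {}

def phase(name: str, init_fn):
    """Time each init phase so the cold start is a breakdown, not one opaque number."""
    start = time.perf_counter()
    result = init_fn()
    startup_phases[name] = time.perf_counter() - start
    return result

# Stand-ins for the real initializers; the proportions mirror the 90-second incident.
phase("tool_registry_load",     lambda: time.sleep(0.30))
phase("vector_store_connect",   lambda: time.sleep(0.20))
phase("prompt_cache_prime",     lambda: time.sleep(0.15))
phase("tool_schema_validation", lambda: time.sleep(0.10))

for name, seconds in sorted(startup_phases.items(), key=lambda kv: -kv[1]):
    print(f"{name:26s} {seconds:5.2f}s")
```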

The Eval Automation Trap: When Your Pipeline Drifts Away From What Users Actually Want

· 10 min read
Tian Pan
Software Engineer

Your eval pipeline scores are trending up. Response quality is improving. The LLM judge is catching more bad outputs. Your dashboard is green.

Meanwhile, a support ticket trickles in: "The assistant keeps giving me long, formal answers when I asked a simple question." Then another: "It stopped suggesting next steps. Used to do that automatically." Then your product manager shows you a chart: user satisfaction down 12% over the last quarter, correlated almost perfectly with the stretch where your automated eval metrics were climbing fastest.

This is the eval automation trap. Your measurement apparatus became optimized for itself rather than for what your users value — and because the feedback loop was entirely automated, nobody noticed until the damage was already in production.
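The cheapest guardrail is to keep a small, regularly refreshed sample of real user feedback and check whether it still moves with the judge. A toy calibration check (Python 3.10+ for `statistics.correlation`); the weekly series are invented:

```python
from statistics import correlation

# Hypothetical weekly series: automated judge score vs. sampled user satisfaction (1-5).
judge_score = [0.78, 0.80, 0.83, 0.85, 0.88, 0.90]
user_sat    = [4.30, 4.28, 4.21, 4.10, 3.98, 3.85]

r = correlation(judge_score, user_sat)
print(f"judge vs. user satisfaction correlation: {r:+.2f}")
if r < 0:
    print("the eval pipeline is improving at something users don't value -- recalibrate the judge")
```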