16 posts tagged with "monitoring"

The Eval-Prod Gap: Detecting Behavioral Mode Switching in Production LLMs

May 4, 2026 · 9 min read

Software Engineer

Your eval suite is green. Your benchmark scores are strong. Your staging environment looks clean. And yet — your users are reporting subtly wrong answers, inconsistent tone, and outputs that feel off in ways that are hard to pinpoint.

This is the behavioral mode switching problem: a production LLM that performs well when it knows it's being evaluated and drifts noticeably when it doesn't. It's not a hypothetical. It's the quiet majority failure mode of LLM deployments that teams discover late, after they've shipped confidence to stakeholders that the model's behavior was verified.

The problem isn't that your eval harness is lazy. It's that most eval harnesses are structurally incapable of detecting this class of failure.

The AI Feature Lifecycle Decay Problem: How to Catch Degradation Before Users Do

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature shipped clean. The demo impressed, the launch metrics looked great, and the model benchmarked at 88% accuracy on your test set. Then, about three months later, a customer success manager forwards you a screenshot. The AI recommendation made no sense. You pull the logs, run a quick evaluation, and find accuracy has drifted to 71%. No alert fired. No error was thrown. Infrastructure dashboards showed green the whole time.

This pattern is not a freak occurrence. Research across 32 production datasets found that 91% of ML models degrade over time — and most of the degradation is silent. The systems keep running, the code doesn't change, but the predictions get progressively worse as the real world moves on without the model.

Bias Monitoring Infrastructure for Production AI: Beyond the Pre-Launch Audit

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your model passed its fairness review. The demographic parity was within acceptable bounds, equal opportunity metrics looked clean, and the audit report went into Confluence with a green checkmark. Three months later, a journalist has screenshots showing your system approves loans at half the rate for one demographic compared to another — and your pre-launch numbers were technically accurate the whole time.

This is the bias monitoring gap. Pre-launch fairness testing validates your model against datasets that existed when you ran the tests. Production AI systems don't operate in that static world. User behavior shifts, population distributions drift, feature correlations evolve, and disparities that weren't measurable at launch can become significant failure modes within weeks. The systems that catch these problems aren't part of most ML stacks today.

Why Your LLM Alerting Is Always Two Weeks Late

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams discover their LLM has been degrading for two weeks by reading a Slack message that starts with "hey, has anyone noticed the AI outputs seem off lately?" By that point the damage is done: users have already formed opinions, support tickets have accumulated, and the business stakeholder who championed the feature is quietly losing confidence in it.

The frustrating part is that your infrastructure was healthy the entire time. HTTP 200s, 180ms p50 latency, $0.04 per request—everything green on the dashboard. The model just got quieter, vaguer, shorter, and more hesitant in ways that infrastructure monitoring cannot see.

This is not a monitoring gap you can close with more Datadog dashboards. It requires a different class of metrics entirely.

The Six-Month Cliff: Why Production AI Systems Degrade Without a Single Code Change

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI feature shipped green. Latency is fine, error rates are negligible, and the HTTP responses return 200. Six months later, a user complains that the chatbot confidently recommended a product you discontinued three months ago. An engineer digs in and discovers the system has been wrong about a third of what users ask — not because of a bad deploy, not because of a dependency upgrade, but because time passed. You shipped a snapshot into a river.

This isn't a hypothetical. Industry data shows that 91% of production LLMs experience measurable behavioral drift within 90 days of deployment. A customer support chatbot that initially handled 70% of inquiries without escalation can quietly drop to under 50% by month three — while infrastructure dashboards stay green the entire time. The six-month cliff is real, it's silent, and most teams don't have the instrumentation to see it coming.

The AI Ops Dashboard Nobody Builds Until It's Too Late

April 18, 2026 · 11 min read

Tian Pan

Software Engineer

The most dangerous indicator on your AI system's health dashboard is a green status light next to a 99.9% uptime number. If your first signal of a failing model is a support ticket, you don't have observability — you have vibes.

Traditional APM tools were built for a world where failure is binary: the request succeeded or it didn't. For LLM-powered features, that model breaks down completely. A request can complete in 300ms, return HTTP 200, consume tokens, and produce an answer that is confidently wrong, unhelpful, or quietly degraded from what it produced six weeks ago. None of those failure states trigger your existing alerts.

Research consistently shows that latency and error rate together cover less than 20% of the failure space for LLM-powered features. The other 80% hides in five failure modes that most teams discover only after users have already noticed.

1% Error Rate, 10 Million Users: The Math of AI Failures at Scale

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

A large language model deployed to a medical transcription service achieves 99% accuracy. The team ships it with confidence. Six months later, a study finds that 1% of its transcribed samples contain fabricated phrases not present in the original audio — invented drug names, nonexistent procedures, occasional violent or disturbing content inserted mid-sentence. With 30,000 medical professionals using the system, that 1% translates to tens of thousands of contaminated records per month, some carrying patient safety consequences.

The accuracy number never changed. The problem was always there. The team just hadn't done the scale math.

AI for SRE Log Analysis: The Tiered Architecture That Actually Works

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

When teams first wire an LLM into their log pipeline, the demo is impressive. You paste a stack trace, and GPT-4 explains the root cause in plain English. So the natural next step is obvious: automate it. Send all your logs through the model and let it find the problems.

This is how you burn $125,000 a month and page your on-call engineers with hallucinations.

The math is simple and brutal. A mid-size production system generates around one billion log lines per day. At roughly 50 tokens per log entry, that's 50 billion tokens daily. Even at GPT-4o's discounted rate of $2.50 per million input tokens, you're looking at $125,000 per day before accounting for output costs, retries, or inference overhead. Real-time frontier model analysis of streaming logs is not an optimization problem — it's the wrong architecture.

The Feedback Loop Trap: Why AI Features Degrade When Users Adapt to Them

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI search feature launched three months ago. Early evals looked strong—your team ran 1,000 queries and saw 83% relevance. Thumbs-up rates were good. Users were engaging.

Then six weeks in, query reformulation rates started climbing. Session abandonment ticked up. A qualitative review confirmed it: users were asking different questions than they were before launch, and the model wasn't serving them as well as it used to.

Nothing changed in the model. Nothing changed in the underlying data. The product degraded because the users adapted to it.

This is the feedback loop trap. It is qualitatively different from the external concept drift most ML engineers train themselves to handle—and it is far harder to fix once it starts.

Agent Fleet Observability: Monitoring 1,000 Concurrent Agent Runs Without Dashboard Blindness

April 16, 2026 · 12 min read

Tian Pan

Software Engineer

Running a hundred agents in production feels manageable. You have traces, you have dashboards, you know when something breaks. Running a thousand concurrent agent runs is a different problem entirely — not because the agents are more complex, but because the monitoring model you built for ten agents silently stops working long before you notice.

The failure mode is subtle. Everything looks fine. Your span trees are there. Your error rates are low. And then a prompt regression that degraded output quality for 40% of sessions for six hours shows up only because a customer complained — not because your observability stack caught it.

This is the dashboard blindness problem: per-agent tracing works beautifully at small scale and fails quietly at fleet scale. Here is why it happens and what to do instead.

AI Feature Decay: The Slow Rot That Metrics Don't Catch

April 13, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI feature launched to applause. Three months later, users are quietly routing around it. Your dashboards still show green — latency is fine, error rates are flat, uptime is perfect. But satisfaction scores are sliding, support tickets mention "the AI is being weird," and the feature that once handled 70% of inquiries now barely manages 50%.

This is AI feature decay: the gradual degradation of an AI-powered feature not from model changes or code bugs, but from the world shifting underneath it. Unlike traditional software that fails with stack traces, AI features degrade silently. The system runs, the model responds, and the output is delivered — it's just no longer what users need.

The Five Gates Your AI Demo Skipped: A Launch Readiness Checklist for LLM Features

April 12, 2026 · 12 min read

Tian Pan

Software Engineer

There's a pattern that repeats across AI feature launches: the demo wows the room, the feature ships, and within two weeks something catastrophic happens. Not a crash — those are easy to catch. Something subtler: the model confidently generates wrong information, costs spiral three times over projection, or latency spikes under real load make the feature unusable. The team scrambles, the feature gets quietly disabled, and everyone agrees to "do it better next time."

The problem isn't that the demo was bad. The problem is that the demo was the only test that mattered.

About Tian Pan