
107 posts tagged with "evaluation"


The A/B Testing Trap: Why Standard Experiment Design Fails for AI Features

· 8 min read
Tian Pan
Software Engineer

A team ships an improved LLM prompt. The A/B test runs for two weeks. The metric ticks up 1.2%, p=0.03. They call it a win and roll it out to everyone. Six months later, a customer audit reveals the new prompt had been producing subtly incorrect summaries all along — the kind of semantic drift that click-through rates and session lengths can't see. The A/B test didn't lie, exactly. It measured the wrong thing with a methodology that was never designed for what LLMs do.

Standard A/B testing was built for deterministic systems: a button changes color, a page loads faster, a recommendation algorithm shifts a ranking. The output is stable given the same input, variance is small and well-understood, and your sample size calculation from a textbook works. None of those properties hold for LLM-powered features. When teams don't account for this, they're not running experiments — they're generating noise with statistical significance attached.
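A back-of-envelope sketch of what the extra variance does to a power analysis. The numbers are hypothetical; the point is that per-input output variance stacks on top of user-level variance, and required sample size scales with the total:

```python
import math
from scipy.stats import norm

def samples_per_arm(sigma: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Two-sample z-test: users per arm to detect a shift of `mde` in mean quality score."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (z * sigma / mde) ** 2)

# Deterministic feature: variance comes from user heterogeneity alone.
print(samples_per_arm(sigma=0.10, mde=0.02))   # ~393 per arm

# LLM feature: resampling the same input adds output variance on top.
total = math.sqrt(0.10**2 + 0.15**2)           # hypothetical variance components
print(samples_per_arm(sigma=total, mde=0.02))  # ~1276 per arm, 3.25x more
```

Run the textbook calculation with the deterministic-era sigma and you'll declare significance long before you actually have it.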

The Eval Fatigue Cycle: Why AI Quality Measurement Collapses After Launch

· 9 min read
Tian Pan
Software Engineer

There's a predictable arc to how teams treat AI evaluation. Sprint zero: everyone agrees evals are critical. Launch week: the suite runs clean, the demo looks great. Week six: the CI job starts getting skipped. Week ten: someone raises the failure threshold to stop the alerts. Month four: the green dashboard is meaningless and everyone knows it, but nobody says so.

This is the eval fatigue cycle, and it's nearly universal. Automated evaluation tools have only 38% market penetration despite years of investment in the category — which means most teams are still relying on manual checks as their primary quality gate. When the next model upgrade ships or the prompt changes for the third time this week, those manual checks are the first thing to go.
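One structural countermeasure is to make the gate relative instead of absolute, so "raise the threshold" stops being a one-line escape hatch. A minimal sketch; the JSON result format is made up:

```python
import json
import sys
from scipy.stats import binomtest

def eval_gate(baseline_path: str, current_path: str, max_regression: float = 0.02) -> None:
    """Fail CI on a statistically meaningful drop below the pinned baseline pass rate."""
    base = json.load(open(baseline_path))  # {"passed": int, "total": int}
    cur = json.load(open(current_path))
    base_rate = base["passed"] / base["total"]
    cur_rate = cur["passed"] / cur["total"]
    floor = max(base_rate - max_regression, 0.0)
    # One-sided: is the observed pass rate credibly below the allowed floor?
    p = binomtest(cur["passed"], cur["total"], floor, alternative="less").pvalue
    if p < 0.05:
        sys.exit(f"eval regression: {base_rate:.1%} -> {cur_rate:.1%} (p={p:.3f})")
    print(f"eval gate ok: {cur_rate:.1%} vs baseline {base_rate:.1%}")
```

The baseline file is pinned in the repo, so loosening the gate requires a visible diff and a reviewer, not a quiet config tweak.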

The Helpful-But-Wrong Problem: Operational Hallucination in Production AI Agents

· 9 min read
Tian Pan
Software Engineer

Your AI agent just completed a complex database migration task. It called the right tool, used proper terminology, referenced the correct library, and returned output that looks completely reasonable. Then your DBA runs it against a 50M-row production table — and the backup flag is wrong. The flag exists in a neighboring library version, and it's syntactically valid, but it silently no-ops the backup step.

The agent wasn't hallucinating wildly. It was confident, fluent, and directionally correct. It was also operationally wrong in exactly the way that causes data loss.

This is the hallucination category the field underinvests in, the one that your evals are almost certainly not catching.
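A cheap guard for exactly this class of failure is to validate agent-proposed arguments against the installed tool's signature before executing anything. A sketch; the migration function is a stand-in:

```python
import inspect
from typing import Any, Callable

def unknown_kwargs(fn: Callable, kwargs: dict[str, Any]) -> list[str]:
    """Return agent-proposed arguments the *installed* tool doesn't accept."""
    params = inspect.signature(fn).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []  # fn forwards **kwargs; can't validate statically
    return [k for k in kwargs if k not in params]

def run_migration(table: str, batch_size: int = 1000, backup: bool = True):
    """Stand-in for the real migration tool."""

def safe_call(fn: Callable, kwargs: dict[str, Any]) -> Any:
    bad = unknown_kwargs(fn, kwargs)
    if bad:
        raise ValueError(f"unknown tool arguments (possible hallucination): {bad}")
    return fn(**kwargs)

# The agent proposes a flag from a neighboring library version:
safe_call(run_migration, {"table": "orders", "create_backup": True})  # raises
```

It won't catch a wrong *value* on a real flag, but it turns "plausible flag that doesn't exist here" from a silent no-op into a hard error.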

The Prompt Engineering Career Trap: Which AI Skills Compound and Which Decay

· 9 min read
Tian Pan
Software Engineer

In 2023, "prompt engineer" was one of the most searched job titles in tech. LinkedIn was full of engineers rebranding their profile summaries. Job postings promised six-figure salaries for people who knew how to coax GPT-4 into behaving. What the job descriptions didn't say was that many of the skills they listed were already on borrowed time — and that the engineers who noticed the difference between durable and decaying skills would end up in very different places by 2026.

The prompt engineering career trap is not that the field went away. It's that it changed so fast that skills built over 12 months became liabilities by the 18-month mark. Engineers who invested heavily in the wrong layer and ignored the right one found themselves holding expertise in things the next model revision made irrelevant.

The Co-Evolution Trap: How Your AI Feature's Success Is Quietly Destroying Its Evaluations

· 9 min read
Tian Pan
Software Engineer

Your AI feature launched. It's working well. Users are adopting it. Satisfaction scores are up. You go back and run the original eval suite—still green. Six months later, something is quietly wrong, but your dashboards don't show it yet.

This is the co-evolution trap. The moment your AI feature is deployed, it starts changing the people using it. They adapt their workflows, their phrasing, their expectations. That adaptation makes the distribution of inputs your feature actually processes diverge from the distribution you measured at launch. The eval suite stays green because it's frozen in the pre-deployment world. The real-world performance drifts in ways the suite never captures.
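A minimal detector for this drift compares a cheap input feature between the launch-time eval set and a recent traffic window. The feature and numbers below are hypothetical:

```python
import numpy as np
from scipy.stats import ks_2samp

def input_drift(launch: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Two-sample KS test on a scalar input feature (e.g. prompt token count)."""
    return ks_2samp(launch, live).pvalue < alpha

rng = np.random.default_rng(0)
launch = rng.normal(120, 30, 5000)  # prompt lengths frozen into the eval set
live = rng.normal(180, 45, 5000)    # users adapted: longer, denser prompts
print(input_drift(launch, live))    # True: the green suite measures a stale world
```

The alert doesn't tell you quality dropped; it tells you your eval suite no longer describes the traffic, which is the signal to refresh it.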

Continuous Production Eval: Statistical Quality Monitoring for Live LLM Traffic

· 9 min read
Tian Pan
Software Engineer

Most teams treat LLM quality evaluation as a pre-deployment gate: run your eval suite, check the scores, ship. That approach catches roughly 40% of the failures your users will actually see. The rest slip through because production traffic looks nothing like your eval set — different query distributions, different session lengths, different upstream data, different model behavior under concurrent load. By the time a user complaint surfaces, the problem has been happening for days.

The fix is not more evals before deployment. It is continuous evaluation against live traffic, designed around the reality that you have no ground truth labels at inference time and need actionable signal within minutes, not weeks.
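The shape of such a monitor, assuming you already produce some reference-free proxy score per sampled response (a judge-model grade, a schema-validity check, a refusal flag):

```python
from collections import deque

class QualityMonitor:
    """Rolling window over reference-free proxy scores sampled from live traffic."""

    def __init__(self, window: int = 500, baseline: float = 0.92, tol: float = 0.03):
        self.scores: deque = deque(maxlen=window)
        self.baseline, self.tol = baseline, tol  # pinned from a known-healthy week

    def observe(self, score: float) -> bool:
        """Record one sampled response; return True when the window mean
        shows a sustained drop, not a single bad response."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough signal yet
        return sum(self.scores) / len(self.scores) < self.baseline - self.tol

monitor = QualityMonitor()
# for resp in sampled_live_traffic(): alert if monitor.observe(proxy_score(resp))
```

A 500-response window at even modest traffic turns "days until a user complains" into minutes until a page.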

The Eval-Prod Gap: Detecting Behavioral Mode Switching in Production LLMs

· 9 min read
Tian Pan
Software Engineer

Your eval suite is green. Your benchmark scores are strong. Your staging environment looks clean. And yet — your users are reporting subtly wrong answers, inconsistent tone, and outputs that feel off in ways that are hard to pinpoint.

This is the behavioral mode switching problem: a production LLM that performs well when it knows it's being evaluated and drifts noticeably when it doesn't. It's not a hypothetical. It's the quiet majority failure mode of LLM deployments that teams discover late, after they've assured stakeholders that the model's behavior was verified.

The problem isn't that your eval harness is lazy. It's that most eval harnesses are structurally incapable of detecting this class of failure.
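One way to catch it: run a fixed probe set twice, once through the offline harness and once replayed through the real serving path, then test the paired scores for systematic divergence. The scores below are hypothetical:

```python
from scipy.stats import wilcoxon

def mode_switch(harness: list[float], shadow: list[float], alpha: float = 0.05) -> bool:
    """Paired test: same probes, graded via the eval harness vs. replayed
    through production (full system prompt, retrieval, truncation)."""
    return wilcoxon(harness, shadow).pvalue < alpha

harness = [0.90, 0.85, 0.92, 0.88, 0.91, 0.87, 0.90, 0.93, 0.86, 0.89]
shadow  = [0.82, 0.80, 0.85, 0.79, 0.84, 0.81, 0.78, 0.86, 0.80, 0.83]
print(mode_switch(harness, shadow))  # True: behavior shifts with context
```

The pairing is the point: the same question scored differently in two contexts isolates the context, not the question, as the cause.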

Why Your AI Sounds Wrong Even When It's Technically Correct

· 9 min read
Tian Pan
Software Engineer

A logistics chatbot received a message from a customer whose shipment had been lost for a week. The reply came back: "I'm not trained to care about that." Factually accurate. The system had correctly parsed the query, correctly identified that it lacked routing to address the issue, and correctly communicated its limitation. The answer was technically correct in every measurable sense. It was also a product disaster.

This is the register problem — and it's the failure mode your evals almost certainly aren't measuring.
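Register is measurable, even crudely. A first-pass lexical screen, where the patterns are illustrative and a judge-model rubric would normally sit behind it as a second pass:

```python
import re

COLD_PATTERNS = [  # illustrative, not exhaustive
    r"\bnot trained to\b",
    r"\bas an ai\b",
    r"\bthat(?:'s| is) not my (?:job|problem)\b",
]

def register_flags(reply: str) -> list[str]:
    """Patterns suggesting a reply is technically correct but tonally wrong."""
    return [p for p in COLD_PATTERNS if re.search(p, reply.lower())]

print(register_flags("I'm not trained to care about that."))  # one hit
```

It would have caught the logistics reply above before a customer ever saw it, which is more than a factuality eval can say.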

LLM-as-Classifier in Production: Why Accuracy Is the Wrong Metric

· 11 min read
Tian Pan
Software Engineer

A team ships an LLM-based intent classifier. Evaluation accuracy: 94%. Two weeks into production, support volume is up 30% — not because the model is failing to classify, but because it's routing edge cases to the wrong queue with very high confidence. Nobody built a circuit breaker for "the model is wrong and doesn't know it." The 94% figure never surfaced that risk.

This failure pattern repeats across content moderation pipelines, routing systems, and entity extractors. The LLM gets a high score on the holdout set. The team ships. Something breaks quietly in production.

The issue isn't that accuracy is a bad metric. It's that accuracy answers the wrong question. Production classification has a different set of requirements, and most evaluation pipelines don't test for them.
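The production question is a coverage/risk trade-off with an abstention path, not a single accuracy number. A sketch, assuming you have a confidence per prediction that has itself been calibration-tested:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    confidence: float  # must be validated for calibration, not taken on faith

def route(p: Prediction, threshold: float = 0.92) -> str:
    """Abstain instead of misrouting confidently: low confidence goes to a human."""
    return p.label if p.confidence >= threshold else "human_review"

def coverage_risk(preds: list[Prediction], truths: list[str], threshold: float) -> tuple[float, float]:
    """(fraction auto-routed, error rate among auto-routed): the pair a single
    holdout accuracy score never surfaces."""
    kept = [(p, t) for p, t in zip(preds, truths) if p.confidence >= threshold]
    coverage = len(kept) / len(preds)
    risk = sum(p.label != t for p, t in kept) / max(len(kept), 1)
    return coverage, risk
```

Sweeping the threshold gives you a risk-coverage curve; picking a point on it is a product decision that "94% accurate" quietly skips.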

The Provider Behavioral Fingerprint: What Doesn't Survive a Model Switch

· 8 min read
Tian Pan
Software Engineer

When a cost spike, a model deprecation notice, or a competitor's benchmark forces a provider swap, engineering teams typically evaluate the candidate on capability benchmarks and call it a migration plan. That process catches about half the problems. The other half aren't capability problems — they're behavioral ones: the invisible layer of formatting habits, refusal patterns, serialization quirks, and output conventions your production code has silently wired itself to over months of iteration.

The capability benchmark tells you whether the new model can do the task. The behavioral fingerprint tells you whether your codebase can survive the replacement.
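The fingerprint is cheap to measure: run a frozen prompt set through both providers and diff the structural properties your parsers silently assume. The checks below are illustrative:

```python
import json

def fingerprint(outputs: list[str]) -> dict[str, float]:
    """Behavioral profile of raw model outputs on a fixed prompt set."""
    n = len(outputs)
    def rate(pred) -> float:
        return sum(map(pred, outputs)) / n
    def is_json(s: str) -> bool:
        try:
            json.loads(s)
            return True
        except ValueError:
            return False
    return {
        "json_valid": rate(is_json),
        "markdown_fenced": rate(lambda s: s.lstrip().startswith("```")),
        "refusals": rate(lambda s: "i can't" in s.lower() or "i cannot" in s.lower()),
        "mean_chars": sum(map(len, outputs)) / n,
    }

# diff = {k: new_model[k] - old_model[k] for k in old_model}  # the migration risk report
```

A 5-point drop in JSON validity or a doubled refusal rate won't show up on any capability leaderboard, but it will show up in your error budget.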

The Summarization Validity Problem: How to Know Your AI Compressed Away What Mattered

· 10 min read
Tian Pan
Software Engineer

Summarization fails silently. Your system doesn't crash, logs don't flag an error, and the generated text looks coherent—but somewhere in the compression, the one fact that mattered for the downstream task got dropped. The RAG pipeline returns a confident answer. The multi-hop reasoner reaches a conclusion. The customer service agent gives advice. All of it grounded in a summary that no longer contains the original constraint, exception, or data point the answer depended on.

This is the summarization validity problem: the gap between a summary that is consistent with its source and a summary that preserves what the downstream task needs. Most teams don't instrument for it. They ship pipelines that validate summaries exist, not summaries that are complete.
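A first instrument: enumerate the facts the downstream task actually depends on, and check that each is still recoverable from the summary. Lexical overlap is a crude proxy here; an entailment model or QA probe is the stronger version of the same check:

```python
import re

def _tokens(s: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def fact_coverage(summary: str, key_facts: list[str], min_overlap: float = 0.6) -> float:
    """Fraction of task-critical facts still recoverable from the summary."""
    s = _tokens(summary)
    covered = sum(len(_tokens(f) & s) / len(_tokens(f)) >= min_overlap for f in key_facts)
    return covered / len(key_facts)

facts = ["refund window is 30 days", "excludes custom orders"]
summary = "Customers may return items within 30 days for a refund."
print(fact_coverage(summary, facts))  # 0.5: the exception was compressed away
```

The hard part isn't the check; it's writing down the key facts per task, which forces the question "what does the downstream step actually need?"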

Cohort-Aware Fine-Tuning: When One Model Isn't Enough But Per-User Is Too Much

· 11 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a fine-tuned model that beat their base model by four points on their internal eval, then watched their top three customers churn over the following six weeks. The eval was fine. The aggregate was fine. The fine-tune just happened to win on the median user, who was a small-business buyer asking short factual questions, while silently regressing on the enterprise legal cohort whose long, citation-heavy queries had been the actual revenue driver. Nobody had sliced the eval by customer tier because nobody on the modeling side knew the customer tier mattered.

Most fine-tuning conversations live at one of two extremes. On one end, the "one fine-tune to rule them all" approach trains a single specialized model on a mix of all customer data and washes out the cohort-specific behavior that actually distinguished segments in the base model. On the other end, the "per-customer fine-tune" approach trains a separate adapter for each tenant, which is operationally tolerable below a hundred customers and falls apart somewhere around a few hundred. The interesting middle ground — where a small number of cohort-aware fine-tunes serve a segmented user base — is missing from most production playbooks.
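The missing instrumentation in that churn story is nearly a one-function fix: slice every eval by cohort and flag per-cohort regressions, whatever the aggregate says. A sketch over a flat result table; the field and model names are made up:

```python
from collections import defaultdict

def scores_by_cohort(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Mean eval score per cohort per model, from rows like
    {"cohort": "enterprise_legal", "model": "fine_tuned", "score": 0.71}."""
    acc: dict = defaultdict(lambda: defaultdict(list))
    for r in rows:
        acc[r["cohort"]][r["model"]].append(r["score"])
    return {c: {m: sum(v) / len(v) for m, v in ms.items()} for c, ms in acc.items()}

def cohort_regressions(table: dict, base: str = "base", tuned: str = "fine_tuned",
                       tol: float = 0.01) -> list[str]:
    """Cohorts where the fine-tune loses to base, even if it wins overall."""
    return [c for c, ms in table.items() if ms[tuned] < ms[base] - tol]
```

If `cohort_regressions` returns anything, a single fine-tune is averaging over segments that want different behavior, and that is the signal to move toward the cohort-level middle ground.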