3 posts tagged with "testing"


Test-Driven Development for LLM Applications: Where the Analogy Holds and Where It Breaks

· 10 min read
Tian Pan
Software Engineer

A team built an AI research assistant using Claude. They iterated on the prompt for three weeks, demoed it to stakeholders, and launched it feeling confident. Two months later they discovered that the assistant had been silently hallucinating citations in roughly 30% of outputs — a failure mode no one had tested for, because the eval suite was built after the prompt had already "felt right" in demos.

This pattern is the rule, not the exception. The LLM development industry has largely adopted test-driven development vocabulary — evals, regression suites, golden datasets, LLM-as-judge — while ignoring the most important rule TDD establishes: write the test before the implementation, not after.

Here is how to do that correctly, and the three places where the TDD analogy breaks down so badly that following it literally will make your system worse.
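The write-the-test-first idea can be made concrete without any model in the loop. The sketch below (hypothetical names and a hardcoded allowlist; a real suite would verify citations against a reference database) is the kind of check the team above could have written before a single prompt existed — it fails on any citation it cannot verify:

```python
import re

# Golden references for this test case (assumption: a known-source
# allowlist stands in for a real citation database).
KNOWN_SOURCES = {"Smith 2021", "Garcia 2023"}

def extract_citations(output: str) -> list[str]:
    """Pull '(Author Year)'-style citations out of model output."""
    return re.findall(r"\(([A-Z][a-z]+ \d{4})\)", output)

def citation_check(output: str, known: set[str]) -> dict:
    """Fail the eval if the output cites anything we cannot verify."""
    cited = extract_citations(output)
    unverified = [c for c in cited if c not in known]
    return {"cited": cited, "unverified": unverified, "passed": not unverified}

# The test exists before any prompt does; a hallucinated citation fails it.
good = citation_check("As shown in (Smith 2021), rates fell.", KNOWN_SOURCES)
bad = citation_check("Per (Jones 1999), rates rose.", KNOWN_SOURCES)
```

Once a check like this is in place, "feels right in the demo" stops being the bar: every prompt iteration either passes the citation eval or it doesn't.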

Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You

· 10 min read
Tian Pan
Software Engineer

An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with the wrong parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.

This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.
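One way out of the trap is to grade each intermediate step on its own, alongside the final answer. The sketch below assumes a hypothetical trace format (`route`, `tool_calls`, `final_answer`); the point is that a lucky final answer can no longer mask a wrong route or a malformed tool call:

```python
def grade_trace(trace: dict, expected_route: str) -> dict:
    """Score routing and tool-call steps separately from the final answer."""
    results = {
        "routed_correctly": trace["route"] == expected_route,
        "tool_calls_valid": all(
            "query" in call["args"] for call in trace["tool_calls"]
        ),
        "final_nonempty": bool(trace["final_answer"].strip()),
    }
    results["passed"] = all(results.values())
    return results

# An agent that stumbles into a plausible answer still fails the step checks.
lucky_trace = {
    "route": "web_search",  # should have been "internal_docs"
    "tool_calls": [{"name": "search", "args": {"query": "refund policy"}}],
    "final_answer": "Refunds are processed within 14 days.",
}
report = grade_trace(lucky_trace, expected_route="internal_docs")
```

Here the final answer looks fine in isolation, but `routed_correctly` is false, so the trace fails — exactly the 40%-misrouting failure an output-only eval would have missed.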

Your AI Product Needs Evals

· 8 min read
Tian Pan
Software Engineer

Every AI product demo looks great. The model generates something plausible, the stakeholders nod along, and everyone leaves the meeting feeling optimistic. Then the product ships, real users appear, and things start going sideways in ways nobody anticipated. The team scrambles to fix one failure mode, inadvertently creates another, and after weeks of whack-a-mole, the prompt has grown into a 2,000-token monster that nobody fully understands anymore.

The root cause is almost always the same: no evaluation system. Teams that ship reliable AI products build evals early and treat them as infrastructure, not an afterthought. Teams that stall treat evaluation as something to worry about "once the product is more mature." By then, they're already stuck.
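"Evals as infrastructure" can start very small. The sketch below (hypothetical cases and a stub standing in for a real model call) is a minimal harness: a golden dataset run on every prompt change, reporting a pass rate instead of a demo vibe:

```python
def run_evals(model_fn, cases: list[dict]) -> dict:
    """Run every golden case through the model and tally pass/fail."""
    results = []
    for case in cases:
        output = model_fn(case["input"])
        results.append({"input": case["input"], "passed": case["grade"](output)})
    passed = sum(r["passed"] for r in results)
    return {
        "pass_rate": passed / len(results),
        "failures": [r for r in results if not r["passed"]],
    }

# A tiny golden dataset: each case pairs an input with a grading function.
GOLDEN = [
    {"input": "2+2", "grade": lambda out: "4" in out},
    {"input": "capital of France", "grade": lambda out: "Paris" in out},
]

def stub_model(prompt: str) -> str:
    # Stand-in for a real model call so the harness runs offline.
    return {"2+2": "The answer is 4.", "capital of France": "Rome."}[prompt]

report = run_evals(stub_model, GOLDEN)
```

Even a harness this small changes the workflow: a prompt edit that fixes one failure mode and silently breaks another shows up as a drop in `pass_rate` before it ships, not as a support ticket afterward.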