
14 posts tagged with "ci-cd"


Eval-as-Code: When Your Release Gate Is a Notebook on Someone's Laptop

13 min read
Tian Pan
Software Engineer

The number that decides whether a model goes to production is being produced by a Jupyter notebook running on a single engineer's MacBook, against a CSV that lives in a Slack DM, scored by a judge model that nobody pinned. Two weeks later, after the engineer has touched the notebook three more times and the API provider has silently shipped a minor model update, nobody on the team can reproduce the number — including the engineer who originally generated it. And yet that number is the gate. It decided that GPT-4o-mini was good enough to replace GPT-4 in the customer support flow. It decided the new prompt template shipped. It decided the fine-tune was promoted. The team is treating it like a load-bearing artifact and storing it like a sticky note.

This is the eval gap. The industry has spent five years writing about evaluation as a methodology problem — which scoring technique, which judge model, which rubric, which dataset — and almost no time writing about evaluation as an engineering problem. But the moment your eval suite starts gating production releases, it inherits every requirement that the rest of your production stack lives by: reproducibility, version control, ownership, observability, dependency management, latency and reliability budgets, and a pipeline that survives the engineer who built it leaving the team. Most teams skip this layer entirely and discover its absence only after a major incident, usually one where the eval score said green and the customer experience said red.
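
What "eval as engineering" looks like in practice can start as small as a run manifest that pins every input the score depends on. A minimal sketch, assuming a local golden_set.csv, dated model snapshots, and illustrative field names:

```python
# A run manifest: every input that can change the score is pinned and
# recorded next to the number it produced. Field names are illustrative.
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class EvalManifest:
    dataset_path: str
    dataset_sha256: str     # pins the exact CSV, not "the one in the Slack DM"
    candidate_model: str    # a dated snapshot, never a floating alias
    judge_model: str        # the judge gets pinned too
    prompt_sha: str         # git SHA of the prompt under test
    rubric_version: str


def build_manifest(dataset: Path, candidate: str, judge: str,
                   prompt_sha: str, rubric: str) -> EvalManifest:
    digest = hashlib.sha256(dataset.read_bytes()).hexdigest()
    return EvalManifest(str(dataset), digest, candidate, judge, prompt_sha, rubric)


if __name__ == "__main__":
    manifest = build_manifest(
        Path("golden_set.csv"),
        candidate="gpt-4o-mini-2024-07-18",  # dated snapshot, not "gpt-4o-mini"
        judge="gpt-4o-2024-08-06",
        prompt_sha="9f2c1ab",
        rubric="support-v3",
    )
    # Commit this next to the score; a score without its manifest is a sticky note.
    print(json.dumps(asdict(manifest), indent=2))
```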

Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review

11 min read
Tian Pan
Software Engineer

Most teams that say "we have evals" mean: there is a dashboard, somebody runs the suite weekly, and the numbers get pasted into a Slack channel that nobody reads. Reviewers approve a prompt change without ever seeing whether it moved the suite, and the regression shows up two weeks later in a customer ticket. The eval exists; the eval is not in the loop.

The fix is structural, not motivational. Evals only gate quality when they live where the change lives — in the pull request comment, next to the diff, with a per-PR delta and a regression callout that the reviewer cannot scroll past. Anywhere else, they are a performative artifact: real work was done to build them, and they catch nothing.
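
One concrete shape for this, sketched under the assumption that the eval job writes per-slice scores to a JSON artifact per git ref: a CI step that computes the per-PR delta and posts it through the GitHub comments API. The score-file naming and the PR_NUMBER, BASE_SHA, and HEAD_SHA variables are placeholders:

```python
# Sketch of the CI step that puts the eval where the change lives: compute
# the per-PR delta against the base branch and post it as a PR comment.
import json
import os
import urllib.request


def load_scores(ref: str) -> dict[str, float]:
    # Placeholder: read the per-slice scores the eval job produced for this ref.
    with open(f"scores-{ref}.json") as f:
        return json.load(f)


def render_comment(base: dict[str, float], head: dict[str, float]) -> str:
    lines = ["### Eval delta vs base", "",
             "| slice | base | head | delta |", "|---|---|---|---|"]
    for name in sorted(base):
        b, h = base[name], head.get(name, float("nan"))
        flag = " :warning: regression" if h < b - 0.01 else ""
        lines.append(f"| {name} | {b:.3f} | {h:.3f} | {h - b:+.3f}{flag} |")
    return "\n".join(lines)


def post_comment(repo: str, pr: str, body: str, token: str) -> None:
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues/{pr}/comments",
        data=json.dumps({"body": body}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
    )
    urllib.request.urlopen(req)


if __name__ == "__main__":
    comment = render_comment(load_scores(os.environ["BASE_SHA"]),
                             load_scores(os.environ["HEAD_SHA"]))
    post_comment(os.environ["GITHUB_REPOSITORY"], os.environ["PR_NUMBER"],
                 comment, os.environ["GITHUB_TOKEN"])
```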

Why Your Prompt Library Should Be a Monorepo, Not a Cookbook

11 min read
Tian Pan
Software Engineer

A team I worked with recently had three different "summarize this contract" prompts. One lived in a Notion page that the legal-tech squad copy-pasted into their service. One lived in a prompts/ folder in the customer-success backend, slightly modified to handle their tone preferences. One lived inline in a Python file inside the data team's notebook, hardcoded between two f-string interpolations. When OpenAI deprecated the model they all ran on, the migration plan involved Slack archaeology — each owner had to be tracked down, each variant had to be re-evaluated, and two of the three subtly broke in production for a week before anyone noticed.

This is what a prompt cookbook looks like at scale. Cookbooks make sense for ten prompts and one team. They become unmanageable somewhere around a hundred prompts and four teams. By the time you're running an AI organization, your prompts/ folder of .md files behaves exactly like vendored copy-paste code from 2008: every consumer has its own snapshot, drift is invisible, and breaking changes ripple outward in unpredictable ways.
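
The monorepo treatment can be small. A sketch, assuming an illustrative prompts/ layout with explicit versions and a meta.json per prompt; the schema here is invented for the example:

```python
# Sketch of a monorepo-style prompt registry: one canonical copy per prompt,
# explicit versions, and consumers that import instead of copy-pasting.
#
#   prompts/
#     summarize_contract/
#       v1.md
#       v2.md
#       meta.json      # owner, model constraints, deprecation status
import json
from pathlib import Path

PROMPTS_ROOT = Path("prompts")


def load_prompt(name: str, version: str) -> str:
    meta = json.loads((PROMPTS_ROOT / name / "meta.json").read_text())
    if version in meta.get("deprecated", []):
        # A deprecation breaks loudly at load time, not silently in
        # production a week after the model is gone.
        raise RuntimeError(f"{name}@{version} is deprecated: {meta['migration_note']}")
    return (PROMPTS_ROOT / name / f"{version}.md").read_text()


# Every consumer, in every team's service, resolves to the same file:
# template = load_prompt("summarize_contract", "v2")
```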

Eval Differential as Branch Protection: Ship Score Diffs, Not Score Floors

10 min read
Tian Pan
Software Engineer

A team I worked with had a clean-looking eval gate: every prompt PR had to score above 0.85 on the golden set or the merge button stayed grey. They were proud of it. Six weeks in, average quality had quietly drifted from 0.93 to 0.87 — every PR cleared the bar, every PR landed, and no individual change owned the regression because none of them broke the rule. The bar was set against a snapshot of last quarter's quality, not against last week's.

That's the failure mode of an absolute-threshold eval gate: a PR that drops the score from 0.92 to 0.86 ships green, while a PR that lifts the score from 0.80 to 0.84 fails the same gate. The team learns "ship if it clears the bar" — a quality story. The signal you actually want is "ship if this change is non-regressive on the slices that matter" — a regression-detector story.

Coverage tools figured this out a decade ago. They report the diff against the parent commit and they break it down per file. Eval gates haven't caught up.
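
A differential gate is not much code once per-slice scores exist for both commits. A sketch, with hardcoded scores standing in for the real artifacts and an arbitrarily chosen tolerance:

```python
# Sketch of a differential gate: instead of "score >= 0.85", fail the check
# when any slice regresses beyond a tolerance relative to the parent commit.
import sys

TOLERANCE = 0.01  # absorbs judge noise, not drift


def gate(base: dict[str, float], head: dict[str, float]) -> int:
    failures = [
        f"{name}: {base[name]:.3f} -> {head.get(name, 0.0):.3f}"
        for name in base
        if head.get(name, 0.0) < base[name] - TOLERANCE
    ]
    for line in failures:
        print("REGRESSION", line)
    return 1 if failures else 0


if __name__ == "__main__":
    base_scores = {"refunds": 0.92, "billing": 0.89}  # parent commit's scores
    head_scores = {"refunds": 0.86, "billing": 0.90}  # this PR's scores
    # A 0.92 -> 0.86 drop fails here even though 0.86 clears a 0.85 floor.
    sys.exit(gate(base_scores, head_scores))
```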

Prompt Linting Is the Missing Layer Between Eval and Production

11 min read
Tian Pan
Software Engineer

The incident report read like a unit-test horror story. A prompt edit removed a five-line safety clause as part of a "preamble cleanup." Every eval in the suite passed. Every judge score held within tolerance. Two weeks later, a customer-facing assistant produced a response that should have been refused, the kind that triggers a Trust & Safety page at 11pm. The post-mortem traced the regression to a single deletion in a PR that nobody had flagged because the suite that was supposed to catch regressions had no opinion on whether the safety clause was present — it only had opinions on whether the model behaved well in the cases the suite remembered to ask about.

This is the gap between behavioral evals and structural correctness. Evals measure what the model produces; they do not measure what the prompt is. And prompts, like code, have a structural layer that exists independently of behavior — sections that must be present, references that must resolve, variables that must interpolate, length budgets that must hold, deprecated identifiers that must not appear. When that structural layer breaks, the behavior often stays green for a while, until the right edge case in production surfaces the failure as an incident.
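
A structural linter for that layer can run in milliseconds, with no model call at all. A sketch, where the required section, banned identifier, length budget, and known variables are all invented for the example:

```python
# Sketch of a structural prompt linter: it checks what the prompt *is*,
# not how the model behaves. All rule values below are illustrative.
import re
import sys

REQUIRED_SECTIONS = ["## Safety"]        # e.g. the five-line safety clause
DEPRECATED_IDENTIFIERS = ["gpt-4-0314"]  # identifiers that must not appear
MAX_CHARS = 12_000                       # length budget
KNOWN_VARS = {"customer_name", "ticket_body"}  # variables the caller interpolates


def lint(prompt: str) -> list[str]:
    errors = []
    for section in REQUIRED_SECTIONS:
        if section not in prompt:
            errors.append(f"missing required section: {section}")
    for ident in DEPRECATED_IDENTIFIERS:
        if ident in prompt:
            errors.append(f"deprecated identifier present: {ident}")
    if len(prompt) > MAX_CHARS:
        errors.append(f"length budget exceeded: {len(prompt)} > {MAX_CHARS}")
    for var in re.findall(r"{(\w+)}", prompt):
        if var not in KNOWN_VARS:
            errors.append(f"unresolved template variable: {{{var}}}")
    return errors


if __name__ == "__main__":
    errors = lint(open(sys.argv[1]).read())
    for err in errors:
        print("LINT:", err)
    sys.exit(1 if errors else 0)
```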

Semantic Diff for Prompts: Why Git Diff Lies About What Your Prompt Change Will Do

10 min read
Tian Pan
Software Engineer

A teammate opens a pull request that rewrites your agent's system prompt from 420 lines to 380. The diff is green-and-red carnage: deleted paragraphs, moved sections, tightened language. You approve it because the cleanup looks sensible. A week later, refund-request accuracy has dropped eight points and nobody can say which line did it.

A different teammate adds the word "concise" to one instruction. One word of diff. Nobody reviews it closely because there is almost nothing to review. That edit flips tool-call behavior on 22% of queries.
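
A semantic diff answers the question the textual diff cannot: what did this edit do to behavior? A sketch, assuming a fixed probe set and a call_model placeholder standing in for whatever client the team already has:

```python
# Sketch of a semantic diff: instead of reading the textual diff, run both
# prompt versions against a fixed probe set and report where behavior moved.
def call_model(system_prompt: str, query: str) -> str:
    raise NotImplementedError  # placeholder: one (ideally low-temperature) model call


def semantic_diff(old_prompt: str, new_prompt: str, probes: list[str]) -> float:
    changed = 0
    for query in probes:
        before = call_model(old_prompt, query)
        after = call_model(new_prompt, query)
        if before != after:  # or a semantic-similarity check instead of equality
            changed += 1
            print(f"BEHAVIOR CHANGED on probe: {query!r}")
    return changed / len(probes)


# A one-word edit that flips behavior on 22% of queries shows up here as a
# changed rate of ~0.22, however small the text diff is.
```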

The Unmergeable Agentic Refactor: Why Multi-File Diffs Break at the Seam

9 min read
Tian Pan
Software Engineer

A 40-file refactor from a coding agent lands on your desk. You open the PR, scroll through the diff, and every hunk looks fine. The rename is consistent, the imports are tidy, the tests compile in isolation. You merge. Forty minutes later, CI on main goes red because two call sites in a sibling package still pass three arguments to a function that now takes four, and the type checker that would have caught it was never part of the agent's inner loop.

This is the most common failure mode in agent-authored refactors today, and it has almost nothing to do with the quality of the individual edits. Each file, reviewed on its own, looks like something a careful human would have written. The bug lives at the seams — the boundaries where edits from different files have to agree. File-level review hides seam-level correctness, and most review workflows were designed around files.
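
The fix is to gate on the merged whole, not the per-file parts. A sketch of a seam-level CI step, where the merge simulation and the choice of mypy as the repo-wide checker are illustrative:

```python
# Sketch of a seam-level gate: before merging an agent-authored refactor,
# type-check the *merged* tree as a whole, not each file in isolation.
import subprocess
import sys


def run(cmd: list[str]) -> int:
    print("+", " ".join(cmd))
    return subprocess.call(cmd)


if __name__ == "__main__":
    # Simulate the post-merge state of main, then run the repo-wide checker
    # that was never part of the agent's inner loop.
    if run(["git", "merge", "--no-commit", "--no-ff", "origin/main"]) != 0:
        sys.exit("textual merge conflict; resolve before seam checks")
    sys.exit(run(["mypy", "."]))  # whole-repo check catches cross-file seams
```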

AI as a CI/CD Gate: What Agents Can and Cannot Reliably Block

9 min read
Tian Pan
Software Engineer

An AI reviewer blocks a merge. A developer stares at the failing check, clicks "view details," skims three paragraphs of boilerplate, and files a "force-push exception" without reading the actual finding. Within a week, every engineer on the team has internalized that the AI gate is background noise — something to dismiss, not engage with.

This is the outcome most teams building AI CI/CD gates actually ship, even when the underlying model is technically capable. The problem is not whether AI can review code. The problem is what you ask it to block, and what you expect to happen when it does.
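
One way to make that distinction concrete is a routing policy that reserves blocking for findings a human would never dispute. A sketch with invented category names:

```python
# Sketch of a gate policy: the AI reviewer only *blocks* on findings that are
# mechanically verifiable; everything subjective becomes a comment instead.
BLOCKING = {
    "secret_in_diff",        # verifiable: the credential is right there
    "disabled_test",         # verifiable: a test was skipped or deleted
}
ADVISORY = {
    "naming_style",          # subjective: never worth a failed check
    "possible_perf_issue",   # speculative: needs human judgment
}


def route_finding(category: str) -> str:
    if category in BLOCKING:
        return "fail_check"   # merge button goes grey
    if category in ADVISORY:
        return "pr_comment"   # visible and dismissible, not noise-by-default
    return "log_only"         # unknown categories never block


assert route_finding("secret_in_diff") == "fail_check"
assert route_finding("naming_style") == "pr_comment"
```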

Prompt Regression Tests That Actually Block PRs

10 min read
Tian Pan
Software Engineer

Ask any AI engineering team if they test their prompts and they'll say yes. Ask if a bad prompt can fail a pull request and block a merge, and you'll get a much quieter room. The honest answer for most teams is no — they have eval notebooks they run occasionally, maybe a shared Notion doc of known prompt quirks, and a vague sense that things are worse than they used to be. That is not testing. That is hoping.

The gap exists because prompt testing feels qualitatively different from unit testing. Code either behaves correctly or it doesn't. Prompts produce outputs on a spectrum, outputs are non-deterministic, and running enough examples to feel confident costs real money. Those are real constraints. None of them are insurmountable. Teams that have built prompt CI that actually blocks merges are not spending fifty dollars a build — they're running in under three minutes at under a dollar using a few design decisions that make the problem tractable.
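
At its smallest, a blocking prompt test is just a parametrized test with a fixed case set and cheap assertions, run by the same CI that runs unit tests. A sketch, with call_model as a placeholder for the team's client and the cases invented for the example:

```python
# Sketch of a prompt regression test that can actually fail a PR: a small,
# fixed case set and one cheap assertion per case, so the job stays fast.
import pytest

CASES = [
    # (user_input, substring the answer must contain, substring it must not)
    ("I want a refund for order 1234", "refund", "I cannot help"),
    ("What's your pricing?", "plan", "refund"),
]


def call_model(user_input: str) -> str:
    raise NotImplementedError  # placeholder for the team's model client


@pytest.mark.parametrize("user_input,must_have,must_not_have", CASES)
def test_prompt_regression(user_input, must_have, must_not_have):
    answer = call_model(user_input)
    assert must_have.lower() in answer.lower()
    assert must_not_have.lower() not in answer.lower()
```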

AI Agents in Your CI Pipeline: How to Gate Deployments That Can't Be Unit Tested

10 min read
Tian Pan
Software Engineer

Shipping a feature that calls an LLM is easy. Knowing whether the next version of that feature is better or worse than the one in production is hard. Traditional CI/CD gives you a pass/fail signal on deterministic behavior: either the function returns the right value or it doesn't. But when the function wraps a language model, the output is probabilistic — the same input produces different outputs across runs, across model versions, and across days.

Most teams respond to this by skipping the problem. They run their unit tests, do a quick manual check on a few prompts, and ship. That works until it doesn't — until a model provider silently updates the underlying weights, or a prompt change that looked fine in isolation shifts the output distribution in ways that only become obvious in production at 3 AM.

The better answer isn't to pretend LLM outputs are deterministic. It's to build CI gates that operate on distributions, thresholds, and rubrics rather than exact matches.
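
A distribution gate in miniature: sample the same input N times and assert on the pass rate rather than on any single output. The N, the threshold, and the rubric check below are illustrative stand-ins:

```python
# Sketch of a distribution-level gate: gate on the pass *rate* across N
# samples, because any single sample of a probabilistic system is noise.
def call_model(prompt: str) -> str:
    raise NotImplementedError  # placeholder model call


def passes_rubric(output: str) -> bool:
    # Cheap deterministic proxy for a rubric; a judge model also fits here.
    return "refund" in output.lower()


def distribution_gate(prompt: str, n: int = 20, threshold: float = 0.9) -> bool:
    passes = sum(passes_rubric(call_model(prompt)) for _ in range(n))
    rate = passes / n
    print(f"pass rate: {rate:.2f} (threshold {threshold})")
    return rate >= threshold  # gate on the distribution, not one sample
```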

The Agent Test Pyramid: Why the 70/20/10 Split Breaks Down for Agentic AI

12 min read
Tian Pan
Software Engineer

Every engineering organization that graduates from "we have a chatbot" to "we have an agent" hits the same wall: their test suite stops making sense.

The classical test pyramid — 70% unit tests, 20% integration tests, 10% end-to-end — is built on three foundational assumptions: units are cheap to run, isolated from external systems, and deterministic. Agentic AI systems violate all three at once. A "unit" is a model call that costs tokens and returns different answers each time. An end-to-end run can take several minutes and burn through more API budget than an entire sprint of a junior engineer's test runs could justify. And isolation is nearly impossible when the agent's intelligence emerges precisely from interacting with external tools and state.
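
One way to restore cheap, deterministic units is record-and-replay: capture real model responses once and replay them in the unit layer, reserving live calls for a thin end-to-end tier. A sketch, with an invented cassette format:

```python
# Sketch of record-and-replay for model calls: recorded responses make the
# unit layer free and deterministic again. The cache format is illustrative.
import hashlib
import json
from pathlib import Path

CACHE = Path("model_cassettes")


def call_model_live(prompt: str) -> str:
    raise NotImplementedError  # placeholder: the real, token-costing call


def call_model(prompt: str, record: bool = False) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    cassette = CACHE / f"{key}.json"
    if cassette.exists():
        return json.loads(cassette.read_text())["response"]  # free, deterministic
    if not record:
        raise RuntimeError(f"no cassette for this prompt; re-record: {key}")
    response = call_model_live(prompt)
    CACHE.mkdir(exist_ok=True)
    cassette.write_text(json.dumps({"prompt": prompt, "response": response}))
    return response
```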

LLM-Powered Test Generation: Using AI to Find Bugs in Your Software, Not Just Write It

9 min read
Tian Pan
Software Engineer

Most engineering teams using LLMs are focused on code generation — getting the model to write features faster. But there's a higher-leverage application that gets far less attention: using LLMs to generate the tests that find bugs humans miss. Not testing the AI — testing your software with AI.

The pitch is compelling. Hand-written test suites are shaped by human imagination, which means they cluster around the scenarios developers think of. LLMs explore state spaces differently. They generate inputs and edge cases that feel alien to the original author — and that's precisely where undiscovered bugs live.

But the reality is messier than the pitch. Raw LLM-generated tests fail compilation more than half the time. Over 85% of failures come from incorrect assertions. And integrating non-deterministic generation into a deterministic CI pipeline creates its own class of engineering problems. Here's how to make it work anyway.
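
The standard mitigation is a generate-then-filter pipeline: candidates must compile and run green against the current code before a human ever sees them. A sketch, with generate_tests as a placeholder for the actual generation call:

```python
# Sketch of a generate-then-filter pipeline for LLM-written tests: drop
# candidates that fail to compile, triage the rest by running them, and only
# keep green tests as regression guards.
import py_compile
import subprocess
import tempfile
from pathlib import Path


def generate_tests(source: str) -> list[str]:
    raise NotImplementedError  # placeholder: LLM call returning test modules


def filter_candidates(candidates: list[str]) -> list[Path]:
    kept = []
    for i, code in enumerate(candidates):
        path = Path(tempfile.gettempdir()) / f"gen_test_{i}.py"
        path.write_text(code)
        try:
            # Drop the large fraction of candidates that do not even compile.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError:
            continue
        # A test that fails against current code is a *candidate* bug report,
        # but most such failures are bad assertions, so it needs human triage
        # before it can block anything.
        result = subprocess.run(["python", "-m", "pytest", str(path), "-q"])
        if result.returncode == 0:
            kept.append(path)  # green tests join the suite as regression guards
    return kept
```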