20 posts tagged with "devops"

The LLM Provider Incident Runbook: Staying Up When Your AI Stack Goes Down

Tian Pan · Software Engineer · 11 min read

In December 2024, OpenAI's entire platform went dark for over four hours. A new telemetry service had been deployed with a configuration that caused every node in a massive fleet to hammer the Kubernetes API simultaneously. The control plane buckled. DNS-based service discovery broke with it, and every service went down. Recovery took so long partly because the team lacked what they later called "break-glass tooling" — pre-built emergency mechanisms they could reach for when normal procedures stopped working.

If you were running an AI-powered product that day, you were making fast decisions under pressure. Multi-provider routing? Graceful degradation? Cached responses? Or just a status page and a prayer?

This is the runbook you should have written before that call came in.
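
To make those options concrete, here is a minimal sketch of the fallback chain the post argues you should design in advance: retry the primary provider, fail over to a secondary, then degrade to cached responses. The provider functions are hypothetical stand-ins, not real SDK calls.

```python
import time

# Hypothetical provider callables -- in practice each would wrap a
# vendor SDK. These names are illustrative, not real client APIs.
def call_primary(prompt: str) -> str:
    raise TimeoutError("primary provider is down")

def call_secondary(prompt: str) -> str:
    return "response from secondary provider"

# Last-resort degradation: serve a cached answer for known prompts.
RESPONSE_CACHE: dict[str, str] = {"common question": "cached answer"}

def complete_with_fallback(prompt: str, retries: int = 2) -> str:
    for provider in (call_primary, call_secondary):
        for attempt in range(retries):
            try:
                return provider(prompt)
            except Exception:
                time.sleep(2 ** attempt)  # back off before retrying
    # Every provider exhausted: degrade gracefully instead of erroring.
    return RESPONSE_CACHE.get(prompt, "Service degraded; try again shortly.")

print(complete_with_fallback("hello"))  # falls over to the secondary
```

The point of writing this before the outage is that the ordering of providers, the backoff budget, and the degraded message are all product decisions, not things to improvise at 2 a.m.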

The Provider Abstraction Tax: Building LLM Applications That Can Swap Models Without Rewrites

Tian Pan · Software Engineer · 10 min read

A healthcare startup migrated from one major frontier model to a newer version of the same provider's offering. The result: 400+ engineering hours to restore feature parity. The new model emitted five times as many tokens per response, eliminating projected cost savings. It started offering unsolicited diagnostic opinions—a liability problem. And it broke every JSON parser downstream because it wrapped responses in markdown code fences. Same provider, different model, total rewrite.

This is the provider abstraction tax: not the cost of switching providers, but the cumulative cost of not planning for it. It is not a single migration event. It is an ongoing drain—the behavioral regressions you discover three weeks after an upgrade, the prompt engineering work that does not transfer across models, the retry logic that silently fails because one provider meters input and output tokens against separate rate limits. Teams that build directly on a single provider accumulate this debt invisibly, until a deprecation notice or a pricing change makes the bill come due all at once.
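
The code-fence regression in particular is cheap to defend against once you stop assuming bare JSON. A small sketch of a tolerant parser (my illustration of the failure mode, not the startup's actual fix):

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse JSON from a model response, tolerating markdown code fences.

    Some models wrap JSON in ```json ... ``` fences; strip them before
    parsing instead of assuming the response body is bare JSON.
    """
    text = raw.strip()
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    return json.loads(text)

# Both shapes of response parse to the same dict:
assert parse_model_json('{"ok": true}') == {"ok": True}
assert parse_model_json('```json\n{"ok": true}\n```') == {"ok": True}
```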

The Model EOL Clock: Treating Provider LLMs as External Dependencies

Tian Pan · Software Engineer · 11 min read

In January 2026, OpenAI retired several GPT models from ChatGPT with two weeks' notice — weeks after its CEO had publicly promised "plenty of notice" following an earlier backlash. For teams that had built workflows around those models, the announcement arrived like a pager alert on a Friday afternoon. The API remained unaffected that time. But it won't always.

Every model you're currently calling has a deprecation date. Some of those dates are already listed on your provider's documentation page. Others haven't been announced yet. The operational question isn't whether your production model will be retired — it's whether you'll find out in time to handle it gracefully, or scramble to migrate after users start seeing failures.
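
One way to make the clock visible is a deprecation registry checked in CI, so a retirement date fails a build months before it fails users. The model names and dates below are placeholders, not real announcements.

```python
from datetime import date

# Placeholder names and dates -- maintain this table from your provider's
# published deprecation page; nothing here is a real announcement.
MODEL_EOL: dict[str, date] = {
    "example-model-v1": date(2026, 6, 1),
    "example-model-v2": date(2027, 1, 15),
}

WARN_DAYS = 90  # start failing CI this many days before retirement

def check_model_eol(model: str, today: date | None = None) -> None:
    today = today or date.today()
    eol = MODEL_EOL.get(model)
    if eol is None:
        raise RuntimeError(f"{model}: no EOL date on file, go look it up")
    days_left = (eol - today).days
    if days_left < 0:
        raise RuntimeError(f"{model} was retired on {eol}")
    if days_left < WARN_DAYS:
        raise RuntimeError(f"{model} retires in {days_left} days: migrate now")

check_model_eol("example-model-v2", today=date(2026, 5, 1))  # >90 days out: passes
```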

Prompt Linting: The Pre-Deployment Gate Your AI System Is Missing

Tian Pan · Software Engineer · 8 min read

Every serious engineering team runs a linter before merging code. ESLint catches undefined variables. Prettier enforces formatting. Semgrep flags security anti-patterns. Nobody ships JavaScript to production without running at least one static check first.

Now consider what your team does before shipping a prompt change. If you're like most teams, the answer is: review it in a PR, eyeball it, maybe test it manually against a few inputs. Then merge. The system prompt for your production AI feature — the instruction set that controls how the model behaves for every single user — gets less pre-deployment scrutiny than a CSS change.

This gap is not a minor process oversight. A study analyzing over 2,000 developer prompts found that more than 10% were vulnerable to prompt injection attacks, and roughly 4% had measurable bias issues — all without anyone noticing before deployment. The tooling to catch these problems automatically exists. Most teams just haven't wired it in yet.
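
What would wiring it in look like? A deliberately tiny sketch of a pre-merge prompt linter; the rules are heuristics I invented for illustration, not any published tool's rule set.

```python
import pathlib
import re
import sys

# Toy rules of my own invention -- a real linter would ship many more.
RULES = [
    (r"(?i)ignore (all )?previous instructions", "embedded override phrase"),
    (r"(?i)you can do anything", "over-broad capability grant"),
    (r"(?i)never refuse", "blanket refusal override invites abuse"),
]

def lint_prompt(text: str) -> list[str]:
    findings = [msg for pattern, msg in RULES if re.search(pattern, text)]
    # Hypothetical template convention: {{user_input}} marks interpolation.
    if "{{user_input}}" in text and "###" not in text:
        findings.append("user input interpolated without delimiters")
    return findings

if __name__ == "__main__":
    # Run as a pre-merge gate: a nonzero exit blocks the pull request.
    findings = lint_prompt(pathlib.Path(sys.argv[1]).read_text())
    for finding in findings:
        print(f"LINT: {finding}")
    sys.exit(1 if findings else 0)
```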

Agent Credential Rotation: The DevOps Problem Nobody Mapped to AI

Tian Pan · Software Engineer · 8 min read

Every DevOps team has a credential rotation policy. Most have automated it for their services, CI pipelines, and databases. But the moment you deploy an autonomous AI agent that holds API keys across five different integrations, that rotation policy becomes a landmine. The agent is mid-task — triaging a bug, updating a ticket, sending a Slack notification — and suddenly its GitHub token expires. The process looks healthy. The logs show no crash. But silently, nothing works anymore.

This is the credential rotation problem that nobody mapped from DevOps to AI. Traditional rotation assumes predictable, human-managed workloads with clear boundaries. Autonomous agents shatter every one of those assumptions.
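
One mitigation is to stop treating credentials as process-lifetime constants: resolve them through a wrapper that refreshes on auth failure mid-task. A sketch under that assumption, with fetch_secret standing in for your secrets-manager client:

```python
class AuthError(Exception):
    """Raised by a tool wrapper when a call returns 401/403."""

def fetch_secret(name: str) -> str:
    # Stand-in for a secrets-manager lookup (Vault, AWS Secrets Manager...).
    return "fresh-token"

class RotatingCredential:
    """Re-resolves a credential on auth failure instead of caching it
    for the life of the agent process."""

    def __init__(self, secret_name: str):
        self.secret_name = secret_name
        self._token = fetch_secret(secret_name)

    def call(self, tool, *args):
        try:
            return tool(self._token, *args)
        except AuthError:
            # The token may have rotated under us mid-task: refresh once
            # and retry. A second failure is a real authorization problem,
            # not rotation, so let it propagate.
            self._token = fetch_secret(self.secret_name)
            return tool(self._token, *args)
```

The design choice worth noticing is the single retry: it distinguishes "my token rotated" from "I genuinely lost access," which an agent should surface rather than loop on.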

Simulation Environments for Agent Testing: Building Sandboxes Where Consequences Are Free

Tian Pan · Software Engineer · 10 min read

Your agent passes every test in staging. Then it hits production and sends 4,000 emails, charges a customer twice, and deletes a record it wasn't supposed to touch. The staging tests weren't wrong — they just tested the wrong things. The staging environment made the agent look safe because everything it could break was fake in the wrong way: mocked just enough to not crash, but realistic enough to fool you into thinking the test meant something.

This is the simulation fidelity trap. It's different from ordinary software testing failures. For a deterministic function, a staging environment that mirrors production schemas and APIs is usually sufficient. For an agent, behavior emerges from the interaction between reasoning, tool outputs, and accumulated state across a multi-step trajectory. A staging environment that diverges from production in any of those dimensions will produce agents that are systematically over-confident about how they'll behave under real conditions.
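
One ingredient of a higher-fidelity sandbox is a tool layer that records the agent's attempted side effects instead of executing them, so a full trajectory can run end-to-end and be audited for consequences afterward. A minimal sketch (my construction, not the post's design):

```python
from dataclasses import dataclass, field

@dataclass
class SandboxLedger:
    """Records what the agent *tried* to do; nothing actually executes."""
    effects: list[tuple[str, dict]] = field(default_factory=list)

    def send_email(self, to: str, body: str) -> str:
        self.effects.append(("send_email", {"to": to, "body": body}))
        return "queued"  # realistic return value keeps the agent loop going

    def charge_customer(self, customer_id: str, cents: int) -> str:
        self.effects.append(("charge", {"customer": customer_id, "cents": cents}))
        return "ok"

# After a trajectory finishes, assert on consequences, not just outputs:
ledger = SandboxLedger()
ledger.send_email("a@example.com", "hi")
assert sum(1 for name, _ in ledger.effects if name == "send_email") <= 1
```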

CI/CD for LLM Applications: Why Deploying a Prompt Is Nothing Like Deploying Code

Tian Pan · Software Engineer · 10 min read

Your code ships through a pipeline: feature branch → pull request → automated tests → staging → production. Every step is gated. Nothing reaches users without passing the checks you've defined. It's boring in the best way.

Now imagine you need to update a system prompt. You edit the string in your dashboard, hit save, and the change is live immediately — no tests, no staging, no diff in version control, no way to roll back except by editing it back by hand. This is how most teams operate, and it's why prompt changes are a leading source of unexpected production outages for LLM applications.

The challenge isn't that teams are careless. It's that the discipline of continuous delivery was built for deterministic systems, and LLMs aren't deterministic. The entire mental model needs to be rebuilt from scratch.
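
Rebuilding that pipeline starts with two moves: keep prompts in the repo, and gate merges on an evaluation run instead of a human eyeball. A sketch with a stubbed scorer where a real harness would call the model:

```python
import hashlib
import pathlib

def run_eval(prompt: str, cases: list[tuple[str, str]]) -> float:
    # Stub: a real harness would call the model on each case, score the
    # outputs against expectations, and return a pass rate.
    return 0.97

def gate_prompt_change(path: str, baseline: float = 0.95) -> None:
    prompt = pathlib.Path(path).read_text()
    # Hash the exact bytes being shipped so the deploy is reproducible.
    digest = hashlib.sha256(prompt.encode()).hexdigest()[:12]
    score = run_eval(prompt, cases=[("sample input", "expected output")])
    if score < baseline:
        raise SystemExit(f"prompt {digest} scored {score:.2f}, below {baseline}")
    print(f"prompt {digest} passed the eval gate at {score:.2f}")
```

The baseline threshold does the work the green checkmark does for code: it turns "looks fine to me" into a number the pipeline can refuse.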

Prompt Versioning and Change Management in Production AI Systems

Tian Pan · Software Engineer · 9 min read

A team added three words to a customer service prompt to make it "more conversational." Within hours, structured-output error rates spiked and a revenue-generating pipeline stalled. Engineers spent most of a day debugging infrastructure and code before anyone thought to look at the prompt. There was no version history. There was no rollback. The three-word change had been made inline, in a config file, by a product manager who had no reason to think it was risky.

This is the canonical production prompt incident. Variations of it play out at companies of every size, and the root cause is almost always the same: prompts were treated as ephemeral configuration instead of software.
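
Treating prompts as software means, at minimum, immutable versions and an explicitly pinned active version, so rollback is a re-pin rather than a hand edit. A sketch of that convention (the file layout is an assumption for illustration):

```python
import pathlib

class PromptRegistry:
    """Prompts as git-tracked, immutable files: prompts/<name>/<version>.txt.
    The layout is a hypothetical convention, not a standard."""

    def __init__(self, root: str = "prompts"):
        self.root = pathlib.Path(root)
        self.active: dict[str, str] = {}  # name -> pinned version

    def load(self, name: str) -> str:
        version = self.active[name]  # never an implicit "latest"
        return (self.root / name / f"{version}.txt").read_text()

    def pin(self, name: str, version: str) -> None:
        # Deploys change this pin through review, like any code change.
        self.active[name] = version

    def rollback(self, name: str, known_good: str) -> None:
        # Rollback is re-pinning a previous version, not editing a string.
        self.pin(name, known_good)
```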