2 posts tagged with "postmortem"

Your SRE Postmortem Template Is Missing Six Fields That Decide Every LLM Incident

11 min read
Tian Pan
Software Engineer

The first time you run an LLM incident through a classic SRE postmortem template, the template wins and the incident loses. Timeline, contributing factors, mitigation, prevention — every field is filled in, every box ticked, and at the end of the document nobody can answer the only question that matters: which variable actually moved? Not the deploy event. Not the infra fault. Not the code change. The prompt revision, the model slice the router picked, the judge configuration scoring the eval that failed to fire, the retrieval index state that was serving when the quality complaints landed, the tool schema versions the planner was composing against, the traffic mix that hit during the bad window. None of those have a row.
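
To make that concrete, here is a minimal sketch of those six fields as a structured record you could snapshot at incident time alongside the classic sections. Everything in it is illustrative: the class and field names are hypothetical, not taken from any existing template.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the six LLM-specific postmortem fields.
# Names are illustrative; capture these when the incident opens, not
# after the serving state has already rotated.
@dataclass
class LLMIncidentSnapshot:
    prompt_revision: str        # prompt version or registry SHA in effect
    model_slice: str            # model/variant the router actually served
    judge_config: dict          # eval-judge settings that should have fired
    retrieval_index_state: str  # index snapshot serving during the window
    tool_schema_versions: dict = field(default_factory=dict)  # per-tool schema version
    traffic_mix: dict = field(default_factory=dict)  # request-class shares in the bad window
```
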

The SRE template wasn't designed for systems where the source of truth is an observed behavior rather than a code path. The variables that move silently in an LLM stack are the ones the template never had to enumerate. Borrowing the template anyway is what produces the "we don't know what changed" postmortem that files itself under "investigating" forever.

Why 'Fix the Prompt' Is a Root Cause Fallacy: Blameless Postmortems for AI Systems

9 min read
Tian Pan
Software Engineer

Your LLM-powered feature starts returning nonsense. The on-call engineer pages the ML team. They look at the output, compare it to what the prompt was supposed to produce, and within the hour the ticket is resolved: "bad prompt — tweaked and redeployed." Incident closed. Postmortem written. Action items: "improve prompt engineering process."

Two weeks later, the same class of failure happens again. Different prompt, different feature — but the same invisible root cause.

The "fix the prompt" reflex is the AI engineering equivalent of blaming the last developer to touch a file. It gives postmortems a clean ending without requiring anyone to understand what actually broke. And unlike traditional software, where this reflex is merely lazy, in AI systems it's structurally dangerous — because non-deterministic systems fail in ways that prompt changes cannot fix.