AI Ops Is Not Platform Engineering: How Running LLM Services Breaks Your SRE Playbook
Your SRE team is excellent at running microservices. They've mastered blue-green deployments, canary rollouts, distributed tracing, SLO burn-rate alerts, and postmortem culture. Then someone ships an LLM-powered feature, and within a week they hit an incident none of those practices were designed to handle: the model starts generating plausible-sounding but structurally wrong outputs, no error is logged, no health check fails, and by the time anyone notices, users have been silently receiving garbage for four hours.
This isn't a skills gap. It's an architectural gap. Running LLM services is a distinct operational discipline from running microservices, and the practices that don't transfer will burn your team if you don't identify them explicitly.
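
To make that failure mode concrete, here is a minimal sketch of the kind of output-level check that would have surfaced the incident: validating the structure of sampled model responses rather than the health of the process. The schema fields and response shapes below are hypothetical, invented purely for illustration; the point is that the failure signal only exists if you inspect output content, which no liveness probe or HTTP status code ever does.

```python
import json
import logging

logger = logging.getLogger("llm_output_guard")

# Hypothetical schema for an LLM feature's structured output.
# These field names are invented for this example.
REQUIRED_FIELDS = {"summary": str, "action_items": list}


def structurally_valid(raw_response: str) -> bool:
    """Return True if a model response matches the expected structure.

    A traditional health check never sees this failure: the process is
    up, latency is normal, and the HTTP status is 200 either way.
    """
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return all(
        isinstance(payload.get(field), expected_type)
        for field, expected_type in REQUIRED_FIELDS.items()
    )


def check_sample(raw_response: str) -> None:
    """Run on a sample of live traffic; a rising failure rate should page."""
    if not structurally_valid(raw_response):
        # In production this would increment a metrics counter feeding an alert.
        logger.warning("structurally invalid model output: %.80s", raw_response)


if __name__ == "__main__":
    # Both of these would return HTTP 200 in production; only the content differs.
    good = '{"summary": "Q3 revenue grew 4%", "action_items": ["email finance"]}'
    bad = '{"summary": null, "action_items": "follow up soon"}'  # plausible, wrong shape
    print(structurally_valid(good))  # True
    print(structurally_valid(bad))   # False -- the alert signal nothing else emits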
