A 22% agent latency regression that turned out to be a cpufreq governor change in a runner image — and why CI benchmark numbers measure the host, not your code.
A coding agent does not push when work is ready — it pushes to find out if work is ready. CI cost stops scaling with commits and starts scaling with plan steps, and the forecast model finance built last year no longer holds.
Auto-summarization preserves the conversational arc but quietly mutates the literal phrasing downstream tools depend on. Here is how that bug surfaces and how to architect around it.
An endpoint alias is not an artifact. When the auditor asks which checkpoint produced a decision, only per-decision checkpoint pinning gives a defensible answer.
Recency-and-length pruning evicts the constraint a later turn silently depends on, and the user reads a confident wrong answer as a competence regression. Pruning is the dual of retrieval, and the team that tuned it for token count is silently regressing answer quality.
Compaction preserves what your agent said and forgets what your user chose. Treat conversation memory as two streams — semantic and structured — or ship a privacy violation in the second.
A negotiated unit price is not a constant — it is the output of a state machine the vendor runs against your account. When seasonality trips the volume floor, the discount lapses and your forecast quietly goes wrong.
When LLM-generated prompts replace hand-written ones, the per-task labeling rate you signed in 2023 becomes a silent margin transfer until the renewal cycle forces a price-discovery fight.
A revoked dataset license can turn a deployed fine-tuned model into a regulated artifact overnight. Provenance, unlearning, and architecture choices made before training decide how recoverable that becomes.
Agent identity has no quarterly review, no team transfer, and no termination event. The day-one IAM grant becomes the day-ninety inheritance, and the org chart is the real obstacle to fixing it.
Your eval pipeline sets seed=42 and reports reproducible numbers. The provider may have dropped it at the gateway, the batch size shifted under load, or the system fingerprint rotated overnight — and your benchmark was always one batch away from a different answer.
The seed parameter on hosted LLM APIs is a best-effort hint, not a contract. Why byte-exact CI assertions flake — and what to assert instead.