80 posts tagged with "infrastructure"

The Codebase Index Your Coding Agent Rebuilt From a Checkout Three Weeks Behind Main

· 10 min read
Tian Pan
Software Engineer

A coding agent on your team opens a pull request that calls parseUserToken() four times across two files. The function does not exist in the repository, has not existed for nineteen days, and was replaced by decodeSessionClaim() in a commit your engineers all remember reviewing. The agent did not invent the name. It read the name from its semantic index — a vector store rebuilt from a working copy that was twenty-one days behind main. The agent's edit step, by contrast, ran git pull at session start and operated on fresh code. Two views of the same repository, three weeks apart, and the agent confidently bridged them with code that does not compile against anything real.

This is the failure mode that doesn't announce itself. The agent ran. The tests appeared to pass. The PR landed. The first reviewer noticed only because a stubbed-out function shared a name with an unrelated helper and tripped the linter. By then the agent had spent a full sprint writing against a phantom version of the codebase, and no one on the team — including the agent — had any signal that something was wrong.
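A minimal sketch of the guard this scenario is missing, assuming the index build records which commit it was built from. `IndexManifest`, `index_is_fresh`, and the one-day `max_lag` default are hypothetical names and values chosen for illustration, not part of any particular agent framework.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IndexManifest:
    """Hypothetical metadata written next to the vector index at build time."""
    commit_sha: str
    commit_time: datetime


def index_is_fresh(manifest, head_sha, head_time, max_lag=timedelta(days=1)):
    """Return True if the index was built from HEAD or from a commit within max_lag of it."""
    if manifest.commit_sha == head_sha:
        return True
    return head_time - manifest.commit_time <= max_lag


# Example: an index built from a commit twenty-one days behind HEAD is rejected.
head_time = datetime(2024, 6, 22, tzinfo=timezone.utc)
stale = IndexManifest("abc123", head_time - timedelta(days=21))
print(index_is_fresh(stale, "def456", head_time))
```

In a real session the HEAD values would come from the same checkout the edit step uses (for example via `git rev-parse HEAD`), so that retrieval and editing are checked against one view of the repository.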

The KV Cache Eviction Your Provider Called Cache Pressure and Your Bill Called a Doubled Prefix Charge

· 11 min read
Tian Pan
Software Engineer

Your application opens a long conversation with a forty-thousand-token system prompt and a full tool inventory. Turn 1 pays the prefix at write rates and the provider's KV cache warms up. Turn 2 arrives ninety seconds later. You assume it's a cache hit. Sometimes it is. Sometimes the same forty thousand tokens land on your invoice again at uncached prices, and nothing in your code changed between turn 1 and turn 2.

The thing that changed was somebody else's traffic. The KV cache is shared infrastructure. Your tenant was colocated on a serving node that, in the ninety seconds between your two turns, took on enough other tenants to evict your prefix from memory. The provider's dashboard will describe this as "cache pressure." Your finance team will describe it as a line item that doubled. Both descriptions are accurate. Neither is in your code.
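One way to make the eviction visible in your own telemetry is to compare the cached-token count the provider reports on each turn against the prefix size you expected to be cached. The sketch below assumes you have already extracted that count into a plain integer; the exact usage field name differs between providers, and `classify_turn`, `unexpected_misses`, and the 0.9 threshold are hypothetical.

```python
def classify_turn(prefix_tokens, cached_tokens, min_hit_fraction=0.9):
    """Label a turn 'hit' when most of the expected prefix was served from cache."""
    if prefix_tokens <= 0:
        raise ValueError("prefix_tokens must be positive")
    return "hit" if cached_tokens / prefix_tokens >= min_hit_fraction else "miss"


def unexpected_misses(turns, prefix_tokens):
    """Return turn numbers after the first that paid for the prefix again.

    turns is a list of (turn_number, cached_tokens_reported) pairs.
    Turn 1 is expected to miss, because it is the turn that writes the cache.
    """
    return [
        number
        for number, cached in turns
        if number > 1 and classify_turn(prefix_tokens, cached) == "miss"
    ]


# Turn 2 hit the cache; turn 3 was billed for the full prefix again.
print(unexpected_misses([(1, 0), (2, 40_000), (3, 0)], prefix_tokens=40_000))
```

Alerting on this list, rather than on total spend, separates "somebody else's traffic evicted us" from "our own prompt changed."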

The Model Identifier Your Provider Re-Pointed to a Finetune for One Tenant and to Base for Everyone Else

· 11 min read
Tian Pan
Software Engineer

A customer support team escalates: "Your assistant used to handle refund-eligibility questions correctly. Last week it started getting them wrong." The on-call engineer pulls a transcript, replays the exact prompt against the same model identifier in a dev account, gets the correct answer, and closes the ticket as "cannot reproduce." Two weeks later the same complaint shows up from a different customer. The engineer replays again, in the same dev account, and gets the correct answer again. The team starts blaming a prompt change nobody made.

The model identifier in the request never changed. The string in the response field matched the string in the request field. The eval suite stayed green for six weeks. The model serving production traffic was a different set of weights from the model serving the eval suite, and had been for the entire life of the account — except for the last six weeks, when it became the same set of weights and the team noticed only because a customer noticed first.

The SSE Keep-Alive Your Reverse Proxy Stripped, And The Prompt You Paid For Twice

· 10 min read
Tian Pan
Software Engineer

Your agent called a tool that took 35 seconds. During those 35 seconds, no tokens flowed from the model back to the browser. The provider's SSE stream was still open. Your tool was still running. The user's spinner was still spinning. And somewhere in the middle, a reverse proxy you do not control decided the connection had been quiet for too long, closed it, and your client's reconnection logic dutifully restarted the entire request from scratch.

The first response was 4,200 prompt tokens and 600 completion tokens. The second response was 4,200 prompt tokens and 600 completion tokens. The user got one answer. Your invoice got two.
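The arithmetic is 2 × (4,200 + 600) = 9,600 billed tokens for one delivered answer. A common mitigation is to have the client resend the same request ID on reconnect and to let a gateway replay the stored result instead of calling the model again. The sketch below is a simplified, in-memory, synchronous version of that idea; `DedupingGateway` is a hypothetical name, and a production version would also need to track in-flight requests and expire old entries.

```python
class DedupingGateway:
    """Replay a stored completion when the same request ID arrives again."""

    def __init__(self, call_model):
        self._call_model = call_model  # function that actually bills tokens
        self._results = {}

    def complete(self, request_id, prompt):
        # Only the first arrival of a request ID reaches the model.
        if request_id not in self._results:
            self._results[request_id] = self._call_model(prompt)
        return self._results[request_id]


billed_calls = []


def fake_model(prompt):
    """Stand-in for the provider call; records each billed invocation."""
    billed_calls.append(prompt)
    return f"answer to: {prompt}"


gateway = DedupingGateway(fake_model)
first = gateway.complete("req-1", "summarize the ticket")
retry = gateway.complete("req-1", "summarize the ticket")  # client reconnect
print(len(billed_calls), first == retry)
```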

The GPU Reservation Your Batch Workload Starved Your Real-Time Path On

· 9 min read
Tian Pan
Software Engineer

The nightly fine-tune job starts at 02:00 UTC. It walks into the shared GPU pool, takes every slot it can find, and holds them. By 09:30, when the first inference traffic of the business day arrives, the autoscaler tries to claim capacity that has been continuously occupied for seven and a half hours. The first ninety minutes of the morning run at roughly four times the baseline p99 latency. The dashboard reports a "noisy morning tail" that the inference team attributes to user behavior, because the actual contention lives in a job queue nobody on the inference team owns.

This is the GPU-sharing failure mode that the cost-attribution slide in your capacity review does not capture. The sharing was sold as a utilization win — train at night, serve in the day, fill the trough. What actually shipped was a latency tail you cannot escape until the pool is partitioned by latency class, not by team or by clock.
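One reading of "partitioned by latency class" is a floor of slots that batch work may never occupy, regardless of the hour. The function below is a sketch of that admission rule under that assumption; the name `batch_slots_grantable` and the slot counts in the example are hypothetical.

```python
def batch_slots_grantable(total_slots, reserved_realtime, batch_in_use, requested):
    """Return how many of the requested slots a batch job may take.

    reserved_realtime slots are held back for the latency-sensitive class,
    so batch work is capped at total_slots - reserved_realtime.
    """
    if min(total_slots, reserved_realtime, batch_in_use, requested) < 0:
        raise ValueError("slot counts must be non-negative")
    batch_cap = max(0, total_slots - reserved_realtime)
    return max(0, min(requested, batch_cap - batch_in_use))


# An 8-slot pool with 3 slots reserved for inference: a batch job already
# holding 4 slots can get only 1 more, however many it asks for.
print(batch_slots_grantable(total_slots=8, reserved_realtime=3, batch_in_use=4, requested=4))
```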

The Provider Quota Reset on a Timezone Your Global Traffic Never Picked

· 8 min read
Tian Pan
Software Engineer

Your monthly token quota resets at 00:00 UTC. Your largest customer is in Tokyo and hits peak load at 21:00 UTC, which is 6:00 AM the next morning in Tokyo. By the time the reset arrives, the Tokyo workday has already chewed through the last six hours of the cycle on quota-exhaustion fallback. The 429s look "occasional" because the UTC calendar axis on your dashboard hides the reset boundary inside an ordinary timestamp.

This is not a rate limit bug. It is a calendar bug. The provider chose a reset clock for their bookkeeping convenience, and the geography of your traffic decided which customers got the empty end of the cycle. The team that priced the quota as a uniform resource is rationing it on a calendar the user never sees.
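To see the calendar the customer actually lives on, it helps to convert the reset boundary and the peak hour into the customer's local time. The sketch below uses a fixed UTC+9 offset for Tokyo (Japan Standard Time has no daylight saving time); `local_view` and the example dates are illustrative choices.

```python
from datetime import datetime, timedelta, timezone

# Japan Standard Time is a fixed UTC+9 offset.
JST = timezone(timedelta(hours=9), "JST")


def local_view(utc_moment, local_tz):
    """Convert a timezone-aware UTC moment into the customer's local time."""
    if utc_moment.tzinfo is None:
        raise ValueError("utc_moment must be timezone-aware")
    return utc_moment.astimezone(local_tz)


peak_utc = datetime(2024, 1, 31, 21, 0, tzinfo=timezone.utc)
reset_utc = datetime(2024, 2, 1, 0, 0, tzinfo=timezone.utc)

print(local_view(peak_utc, JST).isoformat())   # peak load, Tokyo wall clock
print(local_view(reset_utc, JST).isoformat())  # quota reset, Tokyo wall clock
```

Plotting 429s on the local axis rather than the UTC axis puts the reset boundary at 09:00 Tokyo time, where its relationship to the customer's day is visible.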

The Reranker You Added That Slowed Recall More Than It Improved Precision

· 11 min read
Tian Pan
Software Engineer

The offline eval was unambiguous. After bolting a cross-encoder on top of the top-50 from vector search, nDCG@5 went up four points. The team shipped it on a Tuesday. By Thursday, p99 retrieval latency had crossed the SLO by 700 milliseconds, and customer success was forwarding screenshots of empty results pages that the old pipeline would have populated. The graph that mattered — user-perceived answer quality — was down. The reranker was a regression that the team had branded as an improvement, and the eval rubric was the thing that hid the regression in plain sight.

This is one of the most common failure modes in production retrieval, and it is rarely described as what it actually is: an evaluation bug. The reranker did what it was advertised to do. It reordered the top-50 with finer-grained precision. The problem is that the metric used to justify it — offline nDCG, computed at infinite budget, against the full reranked list — describes a world the production system does not live in. In production, the answer that ships is not the best-scored reranked list. It is whatever the system can return before the request deadline. And once you write the metric that way, the reranker's contribution is no longer a four-point lift. It is a curve.
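A sketch of what "writing the metric that way" could look like: score each query on the list that would actually have been served, which is the reranked list only when reranking finished before the deadline and a fallback list otherwise. The `Query` fields, the fallback behavior, and the function names are assumptions made for illustration; the nDCG formula is the standard one with a log2 position discount.

```python
import math
from dataclasses import dataclass


def dcg(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg(served, ideal, k):
    """nDCG@k of the served list against the ground-truth relevances."""
    best = dcg(sorted(ideal, reverse=True), k)
    return dcg(served, k) / best if best > 0 else 0.0


@dataclass
class Query:
    reranked: list            # relevances in reranked order
    fallback: list            # relevances of what ships if the reranker is too slow
    ideal: list               # ground-truth relevances for this query
    rerank_latency_ms: float


def deadline_aware_ndcg(queries, deadline_ms, k=5):
    """Mean nDCG@k over the lists that would ship under the given deadline."""
    scores = []
    for q in queries:
        served = q.reranked if q.rerank_latency_ms <= deadline_ms else q.fallback
        scores.append(ndcg(served, q.ideal, k))
    return sum(scores) / len(scores)


queries = [
    Query(reranked=[1, 0], fallback=[0, 1], ideal=[1, 0], rerank_latency_ms=100),
    Query(reranked=[1, 0], fallback=[], ideal=[1, 0], rerank_latency_ms=900),
]
# Sweeping the deadline traces the curve instead of a single offline number.
for deadline in (50, 500, 1000):
    print(deadline, round(deadline_aware_ndcg(queries, deadline), 3))
```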

The Streaming Response Your Backend Infrastructure Was Not Built For

· 12 min read
Tian Pan
Software Engineer

Streaming was a product decision. Somebody on the design team watched a competitor's chat UI tick out tokens like a typewriter, watched a user's shoulders relax when the first character appeared two hundred milliseconds in instead of after a four-second blank stare, and the decision was made: we stream. The pull request changed three files in the API gateway. The model output now flushes incrementally over Server-Sent Events. The launch went out on a Tuesday and the satisfaction score moved up by a measurable amount on a Wednesday. Nobody opened a ticket against infrastructure.

A month later the on-call engineer is staring at three dashboards that no longer agree with each other. The autoscaler is provisioning twice as many pods as the CPU graphs say it should need. The p99 latency dashboard is broken — not malfunctioning, but uninterpretable, because the histogram buckets stop at five seconds and most spans now live in the overflow. The capacity model that priced the previous quarter's bill said the service could handle twelve hundred requests per second per node. The graph in front of the on-call says it is handling four hundred and falling over.

The Vector Index That Was Sharded by Ingestion Date

· 9 min read
Tian Pan
Software Engineer

There is a specific kind of recall lie that hides inside time-partitioned vector indexes, and the people who built the offline eval are usually the last to find it. The dashboard says recall@10 is 0.94. The retriever is shipping the right snippet 94% of the time. The product team is shipping more retrieval-grounded features on the back of that number. And then the support tickets arrive: "the assistant cited a guide that does not match the answer," "the assistant linked to last week's version of the policy," "the assistant could not find a document I uploaded two months ago." None of those tickets contradict the 0.94. They are evidence that the 0.94 is measuring the wrong thing.

The mechanism is simple and easy to miss. The vector index is sharded by ingestion date because that is the easiest way to keep write throughput high, retire old data, and keep the hot working set in fast memory. The offline test set is generated nightly from production logs, which means the queries are drawn from the same recent window that the freshest shard happens to hold. Recall is measured against ground truth that lives one or two shards deep. The retriever performs beautifully on those queries because, in production, those queries are the ones the routing layer keeps inside the same shard.
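One way to expose the gap is to stop reporting a single recall number and break it out by the age of the ground-truth document. The sketch below assumes each eval record carries that age in days and a boolean for whether the document appeared in the top results; `recall_by_age_bucket` and the 30-day bucket width are hypothetical.

```python
from collections import defaultdict


def recall_by_age_bucket(results, bucket_days=30):
    """Group eval results by ground-truth document age and compute recall per bucket.

    results is an iterable of (doc_age_days, hit) pairs, where hit is True
    when the ground-truth document was retrieved.
    Returns {bucket_index: recall}, with bucket 0 holding the newest documents.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for age_days, hit in results:
        bucket = age_days // bucket_days
        totals[bucket] += 1
        hits[bucket] += int(hit)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


# Recent documents look perfect; documents around two months old do not.
sample = [(1, True), (2, True), (65, False), (70, True)]
print(recall_by_age_bucket(sample))
```

A bucket with few or no eval queries is itself a finding: it shows which shards the nightly test set never reaches.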

The Token Budget You Cannot See Until You Hit It

· 10 min read
Tian Pan
Software Engineer

Your team negotiated a monthly token allocation with your inference provider. The contract specifies the cap. The dashboard in the provider portal shows yesterday's usage with a one-day lag. The API itself returns per-minute rate-limit headers — anthropic-ratelimit-tokens-remaining, x-ratelimit-remaining-requests — and nothing about the monthly bucket you actually have to plan against. And your agent fleet has no mechanism to slow down as the budget depletes, because the only signal that arrives in real time is the 429 — which arrives after the budget is already gone, dressed up as the same transient error your retry logic was tuned to ignore.

This is a different shape of problem than rate limiting. Rate limits are a fast-moving throttle the consumer must react to within seconds; the headers tell you the bucket has a thousand tokens left and refills in forty seconds, and a well-written client backs off and tries again. Monthly quota is a slow-moving budget the consumer must plan against over weeks. The two get confused because they share the failure code and sometimes share the dashboard, but they require different controls — and the gap between what the provider exposes and what the consumer needs is where the worst incident of the month lives.
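Because the provider does not expose the monthly bucket in real time, the consumer has to meter it locally. The sketch below assumes you count your own token usage and know the contracted cap; it computes the fraction of the current burn rate that would land on the cap at month end. `throttle_factor` is a hypothetical name, and a linear pacing rule is only one of several reasonable choices.

```python
def throttle_factor(used_tokens, cap_tokens, days_elapsed, days_in_month):
    """Return the fraction of the current daily burn rate the fleet can sustain.

    1.0 means no slowdown is needed; 0.0 means the budget is already spent.
    """
    if days_elapsed <= 0 or days_in_month <= 0:
        raise ValueError("days_elapsed and days_in_month must be positive")
    remaining = cap_tokens - used_tokens
    days_left = days_in_month - days_elapsed
    if remaining <= 0:
        return 0.0
    if days_left <= 0 or used_tokens == 0:
        return 1.0
    current_rate = used_tokens / days_elapsed   # tokens per day so far
    allowed_rate = remaining / days_left        # tokens per day that fit the cap
    return min(1.0, allowed_rate / current_rate)


# Half the budget gone a third of the way through the month:
# the fleet must run at half its current rate to finish on the cap.
print(throttle_factor(used_tokens=500, cap_tokens=1000, days_elapsed=10, days_in_month=30))
```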

The Cached Prompt Prefix That Grew Arms and Legs

· 11 min read
Tian Pan
Software Engineer

Six months ago your prompt prefix was 4,000 tokens. It was stable, cache-warm, and amortized to almost nothing — the per-call surcharge for system instructions was a rounding error against the per-call cost of the response. Today that prefix is 11,000 tokens, your cache hit rate has slid from 92% to 31%, and your inference bill is up 4x. Nobody on the team can point to the PR that did it. There is no commit message saying "increase prompt tokens by 7,000." Every change was small, every change was defended, every change shipped clean.

The prefix grew arms and legs the way a basement collects boxes. One team needed the user's tier injected so the agent could explain plan limits. Another needed today's date in the user's timezone for "remind me tomorrow" to work. A third stapled in the active A/B variant name so eval traces could be sliced. Marketing added the current promo banner so the agent could mention it when asked. Compliance added a feature-flag manifest so the model could refuse beta features for users not in the rollout. Each was a one-line addition. Each was defensible in isolation. The aggregate destroyed your cache.
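A common way to contain this is to keep per-user and per-day values out of the cached prefix and to fingerprint the prefix so drift shows up in review. The sketch below assumes prompt caching keys on an exact leading span of the prompt; `STABLE_KEYS`, `build_prompt`, and `prefix_fingerprint` are hypothetical names, and the field names are illustrative.

```python
import hashlib

# Fields allowed in the cacheable prefix; everything else goes after it.
STABLE_KEYS = ("system_instructions", "tool_inventory")


def build_prompt(fields):
    """Split prompt fields into a stable prefix and a volatile suffix."""
    prefix = "\n".join(fields[key] for key in STABLE_KEYS)
    suffix = "\n".join(
        f"{key}: {fields[key]}" for key in sorted(fields) if key not in STABLE_KEYS
    )
    return prefix, suffix


def prefix_fingerprint(prefix):
    """Hash of the cacheable prefix; a change here means a cold cache."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


base = {"system_instructions": "You are a support agent.", "tool_inventory": "refund, lookup"}
request_a = {**base, "user_tier": "free", "local_date": "2024-06-01"}
request_b = {**base, "user_tier": "pro", "local_date": "2024-06-02", "ab_variant": "B"}

prefix_a, _ = build_prompt(request_a)
prefix_b, _ = build_prompt(request_b)
print(prefix_fingerprint(prefix_a) == prefix_fingerprint(prefix_b))
```

Logging the fingerprint per deploy gives the team the missing commit message: any PR that changes it is a PR that resets the cache.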

The Carbon Line Item Nobody Puts in the AI Feature Spec

· 10 min read
Tian Pan
Software Engineer

Open any AI feature review and you will hear the same three numbers debated: latency, token cost, and accuracy. Someone pulls up the p95 chart, someone else does the math on cost-per-thousand-requests, and a third person argues the eval score is good enough to ship. Nobody mentions energy. Nobody mentions carbon. And because nobody mentions it, the environmental footprint of the feature still gets decided — implicitly, by whoever wins the argument about the dollar figure.

That is the quiet problem with AI sustainability. It is not that teams choose a high-carbon design on purpose. It is that they never choose at all. The footprint is a side effect of a cost decision, and cost only loosely tracks carbon. A routing rule that looks like a clean win on the spend dashboard can quietly double emissions, and no one in the room would know, because the number that would have told them was never on a dashboard.

This post treats energy and carbon as what they actually are: a measurable, ownable property of an AI system, on the same footing as latency and cost. Not a corporate-values footnote. A line item.
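As a sketch of what the line item could look like, emissions can be estimated as energy consumed multiplied by the carbon intensity of the grid that supplied it. Every number below is made up for illustration, and `Route` and `emissions_g_per_1k` are hypothetical names; the point is only that the cheaper route is not necessarily the lower-carbon one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    cost_usd_per_1k_requests: float
    energy_kwh_per_1k_requests: float
    grid_gco2e_per_kwh: float  # carbon intensity of the serving region


def emissions_g_per_1k(route):
    """Estimated grams of CO2-equivalent per thousand requests."""
    return route.energy_kwh_per_1k_requests * route.grid_gco2e_per_kwh


# Illustrative values only: route B wins on spend and doubles emissions.
route_a = Route("region-a", cost_usd_per_1k_requests=10.0,
                energy_kwh_per_1k_requests=2.0, grid_gco2e_per_kwh=200.0)
route_b = Route("region-b", cost_usd_per_1k_requests=8.0,
                energy_kwh_per_1k_requests=2.0, grid_gco2e_per_kwh=400.0)

for route in (route_a, route_b):
    print(route.name, route.cost_usd_per_1k_requests, emissions_g_per_1k(route))
```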