Skip to main content

202 posts tagged with "agents"

View all tags

The Agent Wall-Clock Budget That Raced Your Tool's Own Timeout

· 11 min read
Tian Pan
Software Engineer

There is a class of agent bug that does not appear in any single component when you look at it in isolation. The model is fine. The tool is fine. The retry policy is fine. The timeout values are even, on paper, generous. And yet a tool that consistently completes in eight seconds keeps landing against an agent that has already declared it a failure at seven point nine, replanned around an "error" that never happened, and started a second call that the first call's result is about to collide with.

The bug is not in any of the boxes. It is in the gap between two clocks that nobody agreed should be the same clock.

The Downstream API That Kept Writing After the User Cancelled the Conversation

· 10 min read
Tian Pan
Software Engineer

The user hits stop. The browser closes the SSE connection. Your AI SDK fires onAbort. The agent runtime sees the signal, stops requesting more tokens from the model, and tears down its loop. From inside your codebase, the cancellation looks crisp. Every subsystem you can see is doing the right thing.

Meanwhile, two seconds earlier, the model emitted a tool call. The runtime dispatched it. The tool's execute function opened a TCP connection to a third-party API and posted a payload. That HTTP request is still in flight, the third party's server is still processing it, and the third party has no way of knowing that the conversation it is serving no longer exists. The write commits. The user's mental model says they escaped the action by hitting stop. The downstream system's database says otherwise.

The MCP Server Your Team Forgot Was Running with Prod Credentials

· 10 min read
Tian Pan
Software Engineer

A new engineer joined the team on Monday. By Wednesday, she had a working local agent setup: an MCP server bridged to the company's deployment API, pointed at staging, talking to her editor. The onboarding doc walked her through the OAuth flow. The token she pasted into the server's environment file was the one her teammate had emailed her — the same token the CI pipeline uses to ship to staging. By Friday, she had joined the team for a working session at a coworking space.

The MCP server was still running. Bound to 127.0.0.1. No authentication. The token was loaded into the process. She didn't think about it because she was not using it. But any tab that visited any website that day could speak to her local server through her own browser. So could any other laptop on the coworking wifi, because she had not noticed that the server was actually bound to 0.0.0.0. The OAuth token your CI pipeline uses to push to staging was now reachable by anyone who could trick a browser into making a request to a local IP — which, in 2026, is a one-pop-up problem.

This post is about that class of failure: the gap between "I'm developing on my laptop" and "my laptop is a server reachable by adversaries." MCP servers, by design, sit right in that gap. Most teams have not noticed.

The Model Identifier Your Provider Re-Pointed to a Finetune for One Tenant and to Base for Everyone Else

· 11 min read
Tian Pan
Software Engineer

A customer support team escalates: "Your assistant used to handle refund-eligibility questions correctly. Last week it started getting them wrong." The on-call engineer pulls a transcript, replays the exact prompt against the same model identifier in a dev account, gets the correct answer, and closes the ticket as "cannot reproduce." Two weeks later the same complaint shows up from a different customer. The engineer replays again, in the same dev account, and gets the correct answer again. The team starts blaming a prompt change nobody made.

The model identifier in the request never changed. The string in the response field matched the string in the request field. The eval suite stayed green for six weeks. The model serving production traffic was a different set of weights from the model serving the eval suite, and had been for the entire life of the account — except for the last six weeks, when it became the same set of weights and the team noticed only because a customer noticed first.

The OAuth Scope Your Agent Inherited When On-Behalf-Of Quietly Became Act-As

· 10 min read
Tian Pan
Software Engineer

The security review said the agent acts "on behalf of" the user. The OAuth token said something else, and the audit log agreed with the token.

A small distinction in language did a lot of architectural work nobody noticed. "On behalf of" is the language a security review reaches for when it wants to capture an arrangement where the agent is a delegate, recognizable as a delegate, and constrained by being a delegate. "Act as" is the runtime behavior when the agent holds a token indistinguishable from the user's own and is therefore the user as far as every downstream system can tell. These two phrases describe completely different threat models. A typical enterprise OAuth integration ships the second one and prices it as the first.

The PII Redactor That Protected Your Logs and Let the Model Leak the Outputs

· 12 min read
Tian Pan
Software Engineer

A PII redactor that runs only on inbound traffic is a one-way valve installed at the wrong end of the pipeline. It catches user-submitted names, emails, and account numbers before they reach your logs. It does nothing about the model's own outputs — the place where the same model is now actively assembling text that may contain those same identifiers, drawn from RAG retrievals, tool returns, conversation history, or content the user pasted from another tenant's data. Every team I've watched ship an input-side redactor has a follow-up ticket in the backlog labeled "output-side parity." Most of those tickets never close, because no incident surfaces the gap for six months, and after six months the ticket has accumulated enough re-prioritization to look like a feature request rather than a missing half of a security control.

The failure mode is invariant: input redaction is treated as the canonical control because it is the easier engineering problem and the easier audit story. You wrote a regex set, you ran a labeled benchmark, you proved precision and recall on a fixed corpus, you shipped it behind a feature flag, and the security review accepted it as the PII boundary. The output side has none of that benefit. The model's response is generative, the surface area is unbounded, and the test methodology — "what should it not say in any of infinitely many contexts" — is structurally harder than "what should we strip from a known input." So the team that ships the inlet treats the outlet as future work and the future never arrives until a customer reports another customer's email landing in their transcript.

The Revoked Tool Your Agent Kept Calling Because the Registry Cache Was an Hour Stale

· 11 min read
Tian Pan
Software Engineer

A user opens the integrations page, finds the Stripe connector they installed last month, clicks Remove, and closes the tab. They believe they have just rescinded an authority. What they have actually done is decrement a row in a database that the agent currently talking to them will not read again for another forty-three minutes. In the interval, the agent will try to call that Stripe tool, the registry's authorization layer will correctly say no, the agent's harness will see the denial as a transient downstream blip and retry three times, and the user's own Stripe audit log will record three unauthorized access attempts arriving from a vendor they thought they had just severed.

The user's escalation will read, almost verbatim: your platform kept trying to access my Stripe after I removed it. That is exactly what happened, and the root cause sits one layer deeper than the bug report ever reaches. The tool registry was the source of truth for what the agent was allowed to do. The agent did not read the source of truth. It read a cache.

The SSE Keep-Alive Your Reverse Proxy Stripped, And The Prompt You Paid For Twice

· 10 min read
Tian Pan
Software Engineer

Your agent called a tool that took 35 seconds. During those 35 seconds, no tokens flowed from the model back to the browser. The provider's SSE stream was still open. Your tool was still running. The user's spinner was still spinning. And somewhere in the middle, a reverse proxy you do not control decided the connection had been quiet for too long, closed it, and your client's reconnection logic dutifully restarted the entire request from scratch.

The first response was 4,200 prompt tokens and 600 completion tokens. The second response was 4,200 prompt tokens and 600 completion tokens. The user got one answer. Your invoice got two.

The Agent's I-Don't-Know Rate That Fell After You Added More Tools

· 9 min read
Tian Pan
Software Engineer

You added the search tool, then the calendar tool, then the CRM tool, then four database wrappers and a calculator. The dashboard moved the way you wanted: task-completion ticked up, latency held, the "I don't know" rate dropped from 14% to 4%. Looks like a capability win. It is not. The planner did not learn more; it learned less abstention. Every question now looks answerable because there is always some tool that pattern-matches the query well enough to call. The 10 percentage points of "I don't know" you removed did not turn into correct answers — they turned into confident wrong ones, distributed across the long tail where nobody is grading carefully.

This is the false-competence trap of tool surface expansion. It is the most common way a team ships a regression while celebrating an improvement. The eval rubric measures whether the agent attempted the task and produced a plausible-shaped answer; it does not measure whether the agent should have refused. Abstention is not free, but it is the cheapest correct behavior available, and you stop being able to see it the moment your tool palette gets large enough that something always fires.

The Compaction Strategy That Summarized Away the User's Original Question

· 10 min read
Tian Pan
Software Engineer

A user asked our support agent: "Why was invoice INV-2025-08-44719 charged twice on April 3rd?" Forty-five minutes and eighteen tool calls later, the agent confidently reported back: there was no evidence of any duplicate billing on the account that quarter. The user, understandably, escalated. When we replayed the trace, the answer became obvious. The agent had compacted its conversation at turn nine. The summary said the user was "asking about a duplicate charge in early April." It did not contain the string "INV-2025-08-44719." Every subsequent tool call — the ledger lookup, the chargeback API query, the audit log scan — was issued against a paraphrased intent, not the literal invoice number the user typed.

The bug was not in the tools. It was not in the model's reasoning. It was that our context manager had a contract with every downstream component, and nobody had written it down. The contract said: "I will preserve meaning." The components needed: "I will preserve strings."

The Conversation Memory Pruning Heuristic That Erased the Context the Next Question Needed

· 9 min read
Tian Pan
Software Engineer

A user opens your long-session agent and says, in turn 3, "I'm vegetarian and on a tight budget." The conversation continues. Eleven turns later, the pruner runs. It counts tokens, finds turn 3 old and short, and drops it to keep the window inside budget. Turn 14 asks, "what should I cook tonight?" The model, looking at a window where the constraint no longer exists, recommends a $40 ribeye. The user reads this as the agent getting worse, opens the satisfaction survey, and rates the session a 2.

Nothing in your stack will report a memory failure. The token-budget dashboard will show the window staying healthily under the cap. The latency dashboard will be green. The eval suite — which scores single-turn answers against a held-out set — will report no regression. The only signal that the agent's competence dropped is a thumbs-down rating that your product team will attribute to "model variance." It will not be model variance. It will be a pruning heuristic doing exactly what it was tuned to do, on the wrong objective.

The Conversation Summarization That Erased the Consent Flag the User Gave You

· 11 min read
Tian Pan
Software Engineer

At turn 3, your user clicked "do not retain my code." At turn 7, they toggled off "use my conversations to improve the model." At turn 12, they opted out of cross-session memory. At turn 40, your context budget runs out. The compaction pass folds turns 1–30 into a tidy 200-token summary that reads beautifully: it captures what the user asked, what your agent did, and what came of it. At turn 41, your agent — armed with that summary and the most recent ten turns — confidently writes the user's code into a memory store the user opted out of at turn 7.

Your audit log now contains a consent event at t=3, a violating action at t=41, and between them a paragraph of prose that has no field for why the action was permitted. The summarizer was trained to compress conversations, not to forward control state. Nobody told it the consent toggle was load-bearing. Nobody could have, because consent wasn't in the conversation — it was in a structured field next to it, and the structured field didn't survive the trip through summarization.