
21 posts tagged with "tool-calling"


The Async Tool Call That Resolved After the User Already Closed the Conversation

· 12 min read
Tian Pan
Software Engineer

The clearest sign that an agent's session model is broken is when a tool result has nowhere to go. The agent fired a long-running call — a render, a provisioning job, a multi-step query. The user watched the spinner for a few seconds, decided they didn't need it after all, closed the tab, and moved on. Forty seconds later the tool finishes. Its callback hits your gateway with a conversation_id that no longer points at anything. The gateway has two equally bad options: silently drop the result, or stitch it into whatever session inherits that ID next.

Most teams discover this failure mode the same way: a support ticket where a user sees an answer they did not ask for, attached to a conversation they did not start. Or a downstream system that processed the same charge twice because the gateway helpfully "retried" delivery against the next active session. Or — most commonly — nothing visible at all, just a slow drift in completion metrics that nobody can correlate to anything specific, because the failures don't fire alerts; they fire emptiness.
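One way to avoid both bad options can be sketched as follows. This is a minimal illustration, not the article's implementation: the `Gateway` class, its method names, and the dead-letter list are all hypothetical. It assumes session IDs are never reused and that each tool call gets its own `call_id`, so a late callback either reaches the session that issued it or is parked for inspection.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Gateway:
    """Hypothetical gateway that routes async tool results by call_id."""

    # call_id -> session_id that issued the call
    pending: Dict[str, str] = field(default_factory=dict)
    open_sessions: Set[str] = field(default_factory=set)
    delivered: List[Tuple[str, str]] = field(default_factory=list)
    dead_letters: List[Tuple[str, str]] = field(default_factory=list)

    def open_session(self) -> str:
        # Random IDs are never handed to a later session.
        session_id = uuid.uuid4().hex
        self.open_sessions.add(session_id)
        return session_id

    def close_session(self, session_id: str) -> None:
        self.open_sessions.discard(session_id)

    def start_call(self, session_id: str) -> str:
        call_id = uuid.uuid4().hex
        self.pending[call_id] = session_id
        return call_id

    def on_callback(self, call_id: str, result: str) -> str:
        session_id = self.pending.pop(call_id, None)
        if session_id is None:
            # Unknown or already-handled call: park it, do not guess a target.
            self.dead_letters.append((call_id, result))
            return "unknown_call"
        if session_id not in self.open_sessions:
            # The user left: record the orphan instead of dropping or re-routing.
            self.dead_letters.append((call_id, result))
            return "orphaned"
        self.delivered.append((session_id, result))
        return "delivered"
```

Under these assumptions an orphaned result becomes a countable event rather than a silent drop, which gives the completion metrics something to correlate against.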

The Legal Disclaimer That Leaked From The Answer Into The Tool Call Arguments

· 9 min read
Tian Pan
Software Engineer

Your counsel approved a one-line system-prompt directive: append "This information is not legal advice and should not be relied upon as such" to every response touching a regulated domain. Three weeks later, a user files a bug because their calendar event's description field opens with that same line, followed by a contract summary the agent was supposed to put into a meeting invite. The agent did not malfunction. It did exactly what the system prompt told it to do, and that instruction turned out to apply to every channel the model writes text into, including the JSON arguments of the next tool it called.

The instruction was a content-formatting rule and the model treated it as one. It did not distinguish "user-facing response" from "tool call argument" because nothing in the prompt told it those were different surfaces. The disclaimer ended up in the calendar, in the email draft, in the Slack message your agent posted on the user's behalf. Each of these was a separate downstream system whose author had no idea a compliance string was about to be injected into a structured field, and each had a different cleanup cost.
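One possible mitigation, sketched under assumptions, is to treat the two surfaces differently in code: append the disclaimer in a post-processing step on the user-facing response only, and scan tool-call arguments for the same string before dispatch. The function names below are hypothetical and the scan is a plain substring check, so it would miss paraphrased disclaimers.

```python
from typing import Any, List

DISCLAIMER = (
    "This information is not legal advice and should not be relied upon as such"
)


def finalize_user_response(text: str, regulated: bool) -> str:
    """Append the disclaimer to the user-facing surface only."""
    if regulated and DISCLAIMER not in text:
        return f"{text.rstrip()}\n\n{DISCLAIMER}"
    return text


def find_leaked_disclaimer(args: Any, path: str = "$") -> List[str]:
    """Return paths of string values in tool arguments that contain the disclaimer."""
    hits: List[str] = []
    if isinstance(args, str):
        if DISCLAIMER.lower() in args.lower():
            hits.append(path)
    elif isinstance(args, dict):
        for key, value in args.items():
            hits.extend(find_leaked_disclaimer(value, f"{path}.{key}"))
    elif isinstance(args, list):
        for index, value in enumerate(args):
            hits.extend(find_leaked_disclaimer(value, f"{path}[{index}]"))
    return hits
```

A dispatcher could refuse or strip any call for which `find_leaked_disclaimer` returns a non-empty list; which of the two is appropriate depends on the downstream system.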

The OAuth Scope One Tool Requested That Every Other Tool Quietly Inherited

· 10 min read
Tian Pan
Software Engineer

The design document said each tool gets its own OAuth token, scoped to the minimum permissions that tool needs. The implementation stored tokens keyed by (user_id, provider). Both statements were true on the day v1 shipped, because there was exactly one tool per provider. The day a second tool against the same provider went live, the design document still said the same thing, and the storage layer silently violated it.

Six months later, a security review traced an incident back to that line of schema. A calendar-reader tool, compromised through a prompt injection in an event description, had successfully called events.delete on the user's primary calendar. The reader had never been granted that scope. The writer had. The token store didn't distinguish between them.

This is the failure mode where a per-provider key shape silently aggregates privilege across tools that share a provider — and the architectural realization that OAuth scope is a property of a token, not a property of a tool.
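The key-shape problem can be reproduced in a few lines. The two store classes, the tool names, and the scope string `events.read` are illustrative assumptions; only the `(user_id, provider)` key and the `events.delete` call come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Token:
    value: str
    scopes: FrozenSet[str]


class PerProviderStore:
    """Keyed by (user_id, provider): the last tool to authorize wins for everyone."""

    def __init__(self) -> None:
        self._tokens: Dict[Tuple[str, str], Token] = {}

    def put(self, user_id: str, provider: str, tool: str, token: Token) -> None:
        self._tokens[(user_id, provider)] = token  # tool is ignored

    def get(self, user_id: str, provider: str, tool: str) -> Token:
        return self._tokens[(user_id, provider)]


class PerToolStore:
    """Keyed by (user_id, provider, tool): each tool only sees its own grant."""

    def __init__(self) -> None:
        self._tokens: Dict[Tuple[str, str, str], Token] = {}

    def put(self, user_id: str, provider: str, tool: str, token: Token) -> None:
        self._tokens[(user_id, provider, tool)] = token

    def get(self, user_id: str, provider: str, tool: str) -> Token:
        return self._tokens[(user_id, provider, tool)]


def authorize_both(store) -> Token:
    """Authorize a reader and a writer, then return the token the reader receives."""
    reader = Token("tok-reader", frozenset({"events.read"}))
    writer = Token("tok-writer", frozenset({"events.read", "events.delete"}))
    store.put("u1", "calendar", "calendar_reader", reader)
    store.put("u1", "calendar", "calendar_writer", writer)
    return store.get("u1", "calendar", "calendar_reader")
```

With the per-provider key, `authorize_both` hands the reader the writer's token, including `events.delete`; with the per-tool key, the reader keeps its read-only grant.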

The Tool Description That Drifted Out of Sync With the Tool It Described

· 12 min read
Tian Pan
Software Engineer

A backend engineer renames a parameter from user_id to account_id because the two stopped being the same thing six months ago, and a support ticket finally made the ambiguity intolerable. The JSON schema for the tool gets updated in the pull request that ships the rename. The tool's prose description — the one paragraph the model actually reads to decide whether to call the tool and how — lives in a different repository, owned by a different team, updated through a ticket queue, and still reads "pass the user_id to look up the account." Nobody flags it. The model dutifully calls the tool with the right schema, fills the right field, and gets the right answer on every single happy-path query. The bug is invisible until the day a user types something where their authenticated user_id and the account_id they were asking about are two different entities, and the agent confidently returns somebody else's data.
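A cheap guard against this kind of drift is a CI check that compares the prose description with the schema. The sketch below is a heuristic under stated assumptions: it treats any snake_case word in the description as a parameter mention, and the function name is hypothetical. It will miss parameters that are described in plain English and can flag snake_case words that are not parameters.

```python
import re
from typing import Dict, List

# Words with at least one underscore, e.g. user_id or account_id.
SNAKE_CASE = re.compile(r"\b[a-z]+(?:_[a-z0-9]+)+\b")


def stale_parameter_mentions(description: str, schema: Dict) -> List[str]:
    """Return snake_case names in the description that the schema does not define."""
    known = set(schema.get("properties", {}))
    mentioned = set(SNAKE_CASE.findall(description))
    return sorted(mentioned - known)


schema = {
    "type": "object",
    "properties": {"account_id": {"type": "string"}},
    "required": ["account_id"],
}
description = "pass the user_id to look up the account."

print(stale_parameter_mentions(description, schema))
```

Run against the renamed schema, the check reports `user_id` as a name the description still uses but the schema no longer defines.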

The Async Tool Call Your Agent Fired and Forgot

· 10 min read
Tian Pan
Software Engineer

The clearest sign that an agent's tool-call abstraction is broken is when the trace shows the step marked done and the downstream system shows nothing happened. The model called a tool, received a job ID back, treated the job ID as the answer, and moved on. Three minutes later the actual work either succeeded with nobody listening or failed with the error landing in a log nobody reads. The user sees a confident summary; the operations queue sees a stranded task.

This is the failure mode the function-calling abstraction quietly enables. JSON schemas describe parameters and return types, but they do not distinguish between "this tool returns a result" and "this tool returns a receipt for an operation whose result you will need to ask about later." The model treats both the same way, because to the planner they look the same — a successful tool call with a non-error payload.
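One way to make the distinction visible is to encode it in the return type, so the orchestration layer, not the planner, decides when a step is done. The `Result` and `Receipt` types and the `resolve` helper below are hypothetical names for illustration, and the polling loop omits the sleep or backoff a real one would need.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Result:
    """The tool finished and this is its answer."""
    value: str


@dataclass
class Receipt:
    """The tool only accepted the work; the answer must be fetched later."""
    job_id: str


Outcome = Union[Result, Receipt]


def resolve(
    outcome: Outcome,
    poll: Callable[[str], Outcome],
    max_polls: int = 5,
) -> Result:
    """Follow a receipt until it yields a result; never treat a job ID as the answer."""
    if isinstance(outcome, Result):
        return outcome
    for _ in range(max_polls):
        status = poll(outcome.job_id)
        if isinstance(status, Result):
            return status
    raise TimeoutError(f"job {outcome.job_id} did not finish after {max_polls} polls")
```

The step is marked done only when `resolve` returns; a timeout surfaces as an error the agent has to handle instead of a confident summary.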

The Hallucinated Tool Argument That Passed Schema Validation

· 9 min read
Tian Pan
Software Engineer

The agent calls fetch_order with order_id: "ORD-739241". The schema accepts it — three letters, a dash, six digits, matches the pattern exactly. The tool returns 404. The agent hedges, generates "ORD-739242", calls again, gets another 404, generates "ORD-739243". Your dashboard records three successful tool invocations and three clean schema validations. The customer waits. Somewhere in the trace, every layer of your safety stack is reporting green while the model invents identifiers at full speed.

The team's belief is that the schema caught it. The schema caught what it could catch: shape. It checked that the argument was a string, that it matched a regex, that the required field was present. The schema cannot check that ORD-739241 corresponds to a real order in your database, because the schema does not know your database exists. That gap — between syntactic plausibility and semantic correctness — is where most production tool-calling bugs live, and the failure is so quiet that the only signal is a customer's confusion.
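The gap between shape and existence is easy to show. In this sketch the regex mirrors the pattern described above (three letters, a dash, six digits), while `check_order_id` and the in-memory set standing in for the order database are assumptions made for illustration.

```python
import re
from typing import Set

# Shape check only: three uppercase letters, a dash, six digits.
ORDER_ID_PATTERN = re.compile(r"^[A-Z]{3}-\d{6}$")


def check_order_id(order_id: str, known_orders: Set[str]) -> str:
    """Classify an argument as bad_shape, unknown_order, or ok."""
    if not ORDER_ID_PATTERN.match(order_id):
        return "bad_shape"  # the schema layer can catch this
    if order_id not in known_orders:
        return "unknown_order"  # only a lookup against real data can catch this
    return "ok"


known = {"ORD-100001", "ORD-100002"}  # stand-in for the order database
for guess in ["ORD-739241", "ORD-739242", "ORD-739243"]:
    print(guess, check_order_id(guess, known))
```

All three guesses pass the pattern and fail the lookup, so counting `unknown_order` outcomes per conversation gives a signal that schema validation alone never emits.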

The Filler Tool Call: When Agents Perform Diligence Instead of Doing Work

· 9 min read
Tian Pan
Software Engineer

Open the trace of any production agent and look at the tool calls that ran between the user's question and the first useful action. You will find a get_user_profile that returned a name nobody used, a check_status that came back green and was never referenced, a list_recent_orders whose result was summarized as "ok" and dropped on the floor. None of these calls changed the answer. All of them cost real money, real latency, and a real line in the trace. Your agent has learned to look diligent — and looking diligent is now your single largest source of waste.

This is the filler tool call: an action the agent emits not because it needs the result, but because the surrounding pattern of "thinking out loud, then acting" has been rewarded enough times during training that the model now performs thoroughness as a side effect of answering anything. It is the LLM equivalent of a junior analyst opening five tabs they never read so the senior across the room sees activity. The difference is that the junior gets bored. The agent never does.

MCP Server Sprawl: The Unbounded Tool Surface Nobody Owns

· 9 min read
Tian Pan
Software Engineer

The Model Context Protocol did exactly what it set out to do: it made giving an agent a new capability almost free. Wiring in a calendar server, a database server, an internal company server, or one of the 30,000-tool catalogs that vendors now publish is a config change, not a project. That frictionlessness is the feature. It is also the problem.

Because adding a tool is cheap, every team adds tools. The data team wires in a warehouse server. The support team adds a ticketing server. Someone connects a filesystem server for a one-off task and never removes it. None of these decisions is wrong. But there is no decision that owns their sum — the aggregate tool surface your agent now carries on every single request. The tool list has become a dependency graph with a real carrying cost, and in most organizations it is the one dependency graph nobody is responsible for.

The result is sprawl: a tool catalog that grows monotonically, gets reviewed by no one, costs more every quarter, and quietly makes the agent worse. This is the unowned surface, and it deserves the same scrutiny you already give your API surface and your npm tree.

The Degradation Signals Your Agent Never Receives

· 9 min read
Tian Pan
Software Engineer

When a downstream API starts to wobble, a human operator finds out a dozen ways before anything actually breaks. The status page flips to yellow. A changelog email lands in the inbox. A warning banner appears in the provider's dashboard. The on-call channel lights up with a 429 someone spotted in the logs. A teammate posts "anyone else seeing slow writes?" None of these are responses to a request. They are the ambient operational signal that surrounds the API, and a human absorbs it almost passively.

An agent calling the same API receives exactly one thing: the response to the request it just made. Status code, headers, body. That is the entire channel. It has no inbox, no dashboard, no Slack, no peripheral vision. It cannot notice that the last ten calls each took twice as long as the ten before. It cannot read the status page, because nobody handed it the URL and it has no standing instruction to look. When the dependency degrades, the agent is the last party in the system to find out — and it usually finds out by failing.

This asymmetry is not a model capability problem. A smarter model does not fix it. The agent is blind to operational signals because the plumbing never delivers them, and most agent stacks ship without anyone noticing the plumbing is missing.
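As a sketch of what delivering one such signal could look like, the wrapper below tracks call latency and produces a note that the orchestrator could add to the agent's context. The `LatencyWatch` class, the window of ten calls, and the 2x threshold are illustrative assumptions that echo the "last ten calls took twice as long" example above.

```python
from collections import deque
from statistics import mean
from typing import Optional


class LatencyWatch:
    """Compare the most recent window of call latencies with the window before it."""

    def __init__(self, window: int = 10, ratio: float = 2.0) -> None:
        self.window = window
        self.ratio = ratio
        self.samples = deque(maxlen=2 * window)

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def degradation_note(self) -> Optional[str]:
        """Return a short note for the agent's context, or None if nothing stands out."""
        if len(self.samples) < 2 * self.window:
            return None  # not enough history to compare
        ordered = list(self.samples)
        older = ordered[: self.window]
        recent = ordered[self.window:]
        if mean(older) > 0 and mean(recent) >= self.ratio * mean(older):
            return (
                f"Recent calls averaged {mean(recent):.2f}s versus "
                f"{mean(older):.2f}s earlier; the dependency may be degrading."
            )
        return None
```

The note is plain text on purpose: under this design the plumbing does the measuring and the model only has to read the outcome.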

You Can't Email a Changelog to a Model: Why API Deprecation Breaks When the Caller Is an LLM

· 10 min read
Tian Pan
Software Engineer

API deprecation is a communication protocol that assumes the receiver can read. You publish a changelog, send an email to registered developers, add a Deprecation header, give six months of notice, and trust that a human on the other end will see the warning, file a ticket, and migrate before the sunset date. That entire workflow quietly stopped working the moment your most active caller became a language model.

An LLM does not subscribe to your developer newsletter. It does not have a Slack channel where someone pastes your migration guide. It rediscovers your API on every single call — from a tool description it was handed, a documentation page that may be eighteen months stale, or a memory of how your API looked in its training data. There is no persistent client you can version, notify, or page. Each request is a fresh negotiation with an entity that has no memory of your last announcement and no obligation to read your next one.

This is not a hypothetical. As agents become the dominant consumers of internal and external APIs, the deprecation playbook every backend team has used for fifteen years is failing in a specific, diagnosable way — and most teams discover it only when a "deprecated for six months" endpoint is still serving an agent in production with no path to make it stop.

MCP Tool Deprecation: Why the Model Still Calls the Old Name

· 9 min read
Tian Pan
Software Engineer

You renamed get_user_email to lookup_contact six weeks ago. The new name shipped, the old handler was removed, the changelog noted it, and your eval set passed. Then last Tuesday a customer support engineer pinged you: an agent had returned an error on roughly three percent of its tool calls during the previous week — tool_not_found: get_user_email. The renamed-away name. The one nothing in the live system advertises anymore.

The prior is sticky. The model your agent is talking to was trained on a corpus where get_user_email was overwhelmingly the canonical way to ask "what is this person's email." Even when the tools array you pass at inference time lists only lookup_contact, the model occasionally — under certain context conditions, especially long traces or recovery-after-error states — falls back to the name it remembers. A hard cutover doesn't eliminate the long tail; it just turns soft failures into hard ones.
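A softer cutover can be sketched as an alias layer in the dispatcher. This assumes, which the text above does not confirm, that `lookup_contact` accepts the same arguments as `get_user_email`; the `dispatch` function and the hit counter are hypothetical.

```python
from typing import Any, Callable, Dict

# Retired name -> current name. Only this one rename comes from the text above.
ALIASES: Dict[str, str] = {"get_user_email": "lookup_contact"}


def dispatch(
    name: str,
    args: Dict[str, Any],
    handlers: Dict[str, Callable[..., Any]],
    alias_hits: Dict[str, int],
) -> Any:
    """Route a tool call, mapping retired names to their replacement and counting them."""
    canonical = ALIASES.get(name, name)
    if canonical != name:
        # Count fallbacks so the long tail is measurable before the alias is removed.
        alias_hits[name] = alias_hits.get(name, 0) + 1
    handler = handlers.get(canonical)
    if handler is None:
        raise KeyError(f"tool_not_found: {name}")
    return handler(**args)
```

The counter shows how often the old name still appears, which is the evidence needed to decide when the alias can be dropped.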

The Tool Schema Evolution Trap: When One Optional Parameter Changed Your Planner's Prior

· 10 min read
Tian Pan
Software Engineer

A new optional parameter goes into a tool description on a Tuesday. The change is small — six lines in the diff, no breaking signature change, no callers updated, no eval cases touched. The PR description says "adds support for an optional language filter to the existing search tool." Two reviewers approve. It ships.

A week later, the cost dashboard shows that the search tool is being called eighteen percent more often than the prior baseline. Latency on the affected agent has crept up by roughly the same proportion. Nobody can point to a single failing eval. The new parameter, when used, behaves correctly. The new parameter, when not used, doesn't matter. And yet the planner has clearly changed its mind about when to reach for this tool — and the eval suite, which grades tool correctness, has nothing to say about a shift in tool frequency.
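A frequency check of the kind the eval suite lacks can be very small. The sketch below compares calls per request across two periods; the function names, the 10% threshold, and the sample counts are assumptions chosen to reproduce the eighteen percent shift described above.

```python
def call_rate_shift(
    baseline_calls: int,
    baseline_requests: int,
    current_calls: int,
    current_requests: int,
) -> float:
    """Relative change in tool calls per request between two periods."""
    baseline_rate = baseline_calls / baseline_requests
    current_rate = current_calls / current_requests
    return (current_rate - baseline_rate) / baseline_rate


def frequency_alert(shift: float, threshold: float = 0.10) -> bool:
    """Flag shifts larger than the threshold in either direction."""
    return abs(shift) > threshold


# Illustrative numbers: 1,000 search calls per 1,000 requests before, 1,180 after.
shift = call_rate_shift(1000, 1000, 1180, 1000)
print(f"{shift:.0%}", frequency_alert(shift))
```

Normalizing by request count matters here, since raw call counts also rise with traffic and would hide or exaggerate a change in the planner's behavior.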