8 posts tagged with "function-calling"

Retiring an Agent Tool the Planner Learned to Depend On

· 10 min read
Tian Pan
Software Engineer

You unregister lookup_account_v1 from the tool catalog, swap in lookup_account_v2, and edit one paragraph of the system prompt to point at the new name. Tests pass. Three days later, support tickets start mentioning that the assistant "keeps trying to call something that doesn't exist," or — more disturbingly — that it answers customer questions with confident, plausible numbers and never hits the database at all. The deprecation didn't fail at the wire. It failed in the planner.

This is the gap between treating a tool deprecation as a syntactic change and treating it as a behavioral migration. The agent didn't just have your function in a registry; it had months of plans, multi-step recipes, and few-shot examples that routed through that function as a checkpoint. Pulling it out is closer to retiring an internal API your downstream services have informally hardcoded — except the downstream service is a model whose habits you cannot grep, and whose fallback when its preferred tool disappears is to invent one.
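One way to treat the retirement as a behavioral migration is to leave a tombstone behind: a minimal sketch, assuming a hypothetical dict-based tool registry, where `lookup_account_v1` stays registered but answers every call with an in-band redirect the model can actually read and act on, instead of vanishing from the catalog.

```python
# Hypothetical registry sketch: instead of unregistering lookup_account_v1,
# keep a tombstone handler that redirects the planner in-band, where the
# model will see it on its next turn.

def lookup_account_v2(account_ref: str) -> dict:
    """Stand-in for the real v2 implementation (illustrative only)."""
    return {"account_ref": account_ref, "status": "active"}

def lookup_account_v1_tombstone(**kwargs) -> dict:
    """DEPRECATED shim: tells the planner where to go next, in its own channel."""
    return {
        "error": "deprecated_tool",
        "message": (
            "lookup_account_v1 has been retired. "
            "Call lookup_account_v2 with the same arguments."
        ),
        "replacement": "lookup_account_v2",
    }

TOOLS = {
    "lookup_account_v1": lookup_account_v1_tombstone,  # tombstone, not removal
    "lookup_account_v2": lookup_account_v2,
}
```

The redirect lives in the tool-result channel, the one place the planner's learned habits are guaranteed to route through, so stale plans self-correct instead of hallucinating.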

Argument Hallucination Is a Drift Signal, Not a Model Bug

· 10 min read
Tian Pan
Software Engineer

The ticket says "model hallucinated a user ID." The triage label is model-quality. The fix is one more sentence in the system prompt. Six weeks later a different tool starts hallucinating a date format, and the loop runs again. After a year of this, the prompt has grown into a 4,000-token apology for the entire backend, and the team is convinced the model is just unreliable on tool arguments.

The model isn't unreliable. The model is a contract-conformance machine reading the contract you gave it — and the contract you gave it has been quietly drifting away from the contract on the other side of the wire. Most production "argument hallucinations" are not model failures. They are integration tests your tool description is silently failing, surfacing as model output because that is the only place in the stack where the divergence becomes visible.
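If the divergence is an integration-test failure, it can run as one. A hedged sketch, with illustrative schema shapes: a CI check that diffs the parameters the model is told about against the parameters the endpoint actually accepts, so drift fails a build instead of surfacing as a "hallucinated" argument.

```python
# Sketch: diff the model-facing tool schema against the backend's schema
# in CI. Shapes are illustrative (tool schemas in the common
# {"parameters": {"properties": ...}} form, backend as plain JSON Schema).

def diff_contracts(tool_schema: dict, backend_schema: dict) -> list[str]:
    """Return human-readable mismatches between what the model is told
    and what the endpoint actually accepts."""
    problems = []
    tool_params = tool_schema["parameters"]["properties"]
    backend_params = backend_schema["properties"]
    for name in tool_params.keys() - backend_params.keys():
        problems.append(f"tool advertises '{name}' but backend dropped it")
    for name in backend_params.keys() - tool_params.keys():
        problems.append(f"backend expects '{name}' but the model never sees it")
    for name in tool_params.keys() & backend_params.keys():
        t, b = tool_params[name].get("type"), backend_params[name].get("type")
        if t != b:
            problems.append(f"'{name}' type drifted: {t} vs {b}")
    return problems
```

Run it against every registered tool on every backend schema change; an empty list is the only passing result.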

Tool Schemas Are Prompts, Not API Contracts

· 11 min read
Tian Pan
Software Engineer

The most expensive line in your agent codebase is the one that auto-generates tool schemas from your existing OpenAPI spec. It looks like a clean engineering choice — single source of truth, no duplication, auto-sync on every API change. It is also why your agent picks searchUsersV2 when it should have picked searchUsersV3, fills limit=20 because your spec's example said so, and silently drops the tenant_id because it was buried in the seventh parameter slot.

Nothing about this shows up in unit tests. The schema validates. The endpoint exists. The agent's call is well-formed JSON. And yet the model uses the tool wrong, every time, in ways your QA pipeline never sees because it tests the API, not the agent's reading of the API.

The bug is conceptual. OpenAPI was designed to describe APIs to humans who write SDK code; tool schemas are read by an LLM at every single call as a piece of the prompt. Treating them as the same artifact is the same category mistake as auto-generating user-facing copy from your database column names.
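A middle ground between auto-sync and hand-writing everything is a curation pass over the generated schema. A minimal sketch, assuming a generated schema in the usual `{"parameters": {"properties": ...}}` shape: promote required parameters (like the buried `tenant_id`) to the front of the property order the model reads, and strip the spec's `example` values the model tends to copy verbatim (like `limit=20`).

```python
# Illustrative curation pass over an auto-generated tool schema.
# Python dicts preserve insertion order, so reordering properties changes
# what the model reads first in its context window.

def curate_for_agent(autogen_schema: dict) -> dict:
    """Reorder required parameters first and drop spec examples that the
    model tends to echo back as argument values."""
    params = autogen_schema["parameters"]
    props = params["properties"]
    required = params.get("required", [])
    ordered = {k: props[k] for k in required if k in props}
    ordered.update({k: v for k, v in props.items() if k not in required})
    cleaned = {k: {f: v for f, v in spec.items() if f != "example"}
               for k, spec in ordered.items()}
    out = dict(autogen_schema)
    out["parameters"] = {**params, "properties": cleaned}
    return out
```

The single source of truth survives; what changes is that the generated artifact is treated as input to a prompt-authoring step, not as the prompt itself.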

Your Tool Descriptions Are Prompts, Not API Docs

· 10 min read
Tian Pan
Software Engineer

The tool description is not documentation. It is the prompt the model reads, every single turn, to decide whether this tool fires and how. You are not writing for the developer integrating against the tool — the developer already has the schema, the types, the examples in the PR. You are writing for a stochastic reader that has never seen this codebase, is holding twenty other tool descriptions in the same context window, and has to pick one in the next forward pass.

Most teams don't write for that reader. They paste the OpenAPI summary into the description field, stick the JSON Schema under it, and ship. Then the agent undercalls the tool, confidently calls the wrong adjacent tool, or fires the right tool with parameters that were "obviously" wrong to any human reading the schema. The team blames the model. The model was reading exactly what you wrote.
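A hypothetical before/after for the same tool makes the difference concrete. The names and wording here are illustrative, not from any real codebase: the first description is a pasted API summary; the second is written for a stochastic reader choosing among twenty tools, with explicit when-to-use and when-not-to-use boundaries.

```python
# Illustrative contrast: same tool, two descriptions, two very different
# prompts. Only the second gives the model selection criteria.

before = {
    "name": "search_orders",
    "description": "Searches orders.",  # OpenAPI summary, pasted verbatim
}

after = {
    "name": "search_orders",
    "description": (
        "Look up a customer's existing orders by email or order ID. "
        "Use this when the user asks about an order's status, contents, "
        "or delivery. Do NOT use it to create or change orders "
        "(use create_order / update_order instead), and do not guess an "
        "order ID: if you only have an email, search by email first."
    ),
}
```

Disambiguation against adjacent tools and a rule against guessing identifiers are exactly the sentences an API summary never contains, and exactly the ones the model needs in its next forward pass.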

Tool Manifest Lies: When Your Agent Trusts a Schema Your Backend No Longer Honors

· 10 min read
Tian Pan
Software Engineer

The most dangerous bug in a production agent isn't the one that throws. It's the one where a tool description says returns user_id and the backend quietly started returning account_id two sprints ago, and the model is still happily inventing user_id in downstream reasoning — because the manifest said so, and the few-shot history reinforced it, and nothing in the loop ever fetched ground truth.

This is manifest drift: the slow, silent divergence between what your tool descriptions claim and what your endpoints actually do. It rarely produces stack traces. It produces bad decisions with clean audit trails — the worst class of bug in agent systems.
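One mitigation is to make the loop fetch ground truth on every call. A minimal sketch, with illustrative names: validate each live tool response against the return fields the manifest declares, so drift like `user_id` becoming `account_id` raises inside the loop instead of silently feeding the model a field that no longer exists.

```python
# Sketch: runtime conformance check between the manifest's declared
# return fields and what the backend actually sent. Declared shapes
# would come from the same manifest the model reads.

DECLARED_RETURNS = {"get_user": {"user_id", "email"}}  # from the manifest

class ManifestDriftError(RuntimeError):
    """The backend's response no longer matches the manifest's promise."""

def check_response(tool_name: str, response: dict) -> dict:
    declared = DECLARED_RETURNS[tool_name]
    missing = declared - response.keys()
    if missing:
        raise ManifestDriftError(
            f"{tool_name} manifest promises {sorted(missing)} "
            f"but backend returned {sorted(response.keys())}"
        )
    return response
```

The point is where the failure lands: a thrown `ManifestDriftError` produces a stack trace and a page, instead of a bad decision with a clean audit trail.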

Tool Docstring Archaeology: The Description Field Is Your Highest-Leverage Prompt

· 11 min read
Tian Pan
Software Engineer

The highest-leverage prompt in your agent is not in your system prompt. It is the one-sentence description you wrote under a tool definition six months ago, committed alongside the implementation, and never touched again. The model reads it on every turn to decide whether to invoke the tool, which arguments to bind, and how to recover when the response doesn't match expectations. Engineers treat it as API documentation for humans. The model treats it as a prompt.

The gap between those two framings is where the worst kind of tool-use bugs live: the model invokes the right function name with the right arguments, and the right API call goes out — but for the wrong reasons, in the wrong situation, or in preference over a better tool sitting next to it. No exception fires. Your eval suite still passes. The regression only shows up as a slow degradation in whatever metric you use to measure whether the agent is actually helping.

Writing Tools for Agents: The ACI Is as Important as the API

· 9 min read
Tian Pan
Software Engineer

Most engineers approach agent tools the same way they approach writing a REST endpoint or a library function: expose the capability cleanly, document the parameters, handle errors. That's the right instinct for humans. For AI agents, it's exactly wrong.

A tool used by an agent is consumed non-deterministically, parsed token by token, and selected by a model that has no persistent memory of which tool it used last Tuesday. The tool schema you write is not documentation — it is a runtime prompt, injected into the model's context at inference time, shaping every decision the agent makes. Every field name, every description, every return value shape is a design decision with measurable performance consequences. This is the agent-computer interface (ACI), and it deserves the same engineering investment you'd put into any critical user-facing interface.
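One place the ACI framing changes everyday code is error handling. A hedged sketch, with hypothetical names and data: instead of a bare not-found, the tool returns an error the model can act on in its next step, including the shape of a valid ID and a pointer to the adjacent tool it probably meant.

```python
# Illustrative ACI-minded error return: the error message is part of the
# agent's prompt on the next turn, so it should be actionable, not bare.

DB = {"inv_123456789012": {"invoice_id": "inv_123456789012", "total": 42.0}}

def get_invoice(invoice_id: str) -> dict:
    invoice = DB.get(invoice_id)
    if invoice is None:
        return {
            "error": "not_found",
            "hint": (
                "No invoice with that ID. Invoice IDs look like 'inv_' "
                "followed by 12 characters; if you have an order ID "
                "instead, call get_order."
            ),
        }
    return invoice
```

A human integrator would read the 404 and check the docs; the agent has no docs, only the bytes the tool returns, so the recovery path has to travel in-band.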

Tool Use in Production: Function Calling Patterns That Actually Work

· 9 min read
Tian Pan
Software Engineer

The most surprising thing about LLM function calling failures in production is where they come from. Not hallucinated reasoning. Not the model picking the wrong tool. The number one cause of agent flakiness is argument construction: wrong types, missing required fields, malformed JSON, hallucinated extra fields. The model is fine. Your schema is the problem.

This is good news, because schemas are cheap to fix.
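As an illustration of how cheap, here is a hedged before/after for one hypothetical tool's argument schema, targeting the failure list above: a closed `enum` instead of free text, a correct numeric type with bounds, explicit `required`, and `additionalProperties: false` to reject hallucinated extra fields.

```python
# Illustrative JSON Schema tightening. Field names are made up; the
# constraints map one-to-one onto the failure modes: wrong types,
# missing required fields, hallucinated extras.

loose = {
    "type": "object",
    "properties": {
        "status": {"type": "string"},   # invites free text like "Shipped!"
        "limit": {"type": "string"},    # wrong type invites "20 items"
    },
}

tight = {
    "type": "object",
    "properties": {
        "status": {"type": "string",
                   "enum": ["pending", "shipped", "delivered"]},  # closed set
        "limit": {"type": "integer", "minimum": 1, "maximum": 100},
    },
    "required": ["status"],              # missing fields fail validation
    "additionalProperties": False,       # hallucinated extras get rejected
}
```

Every constraint the schema enforces is a constraint the model no longer has to guess, and a malformed call that fails validation before it ever reaches your backend.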