Your Tool-Result Cache Is a Stale-Data Contract You Never Wrote
The trace looks clean. The agent called get_inventory_status, the tool returned {"available": 142, "warehouse": "SEA-3"}, and the model wove that into a confident answer. The customer placed an order. The warehouse said the item had been out of stock since 9 a.m. The cached row was four hours old. Nobody on the team had decided four hours was acceptable; that was just whatever the cache framework defaulted to when the platform team wired up the wrapper.
This is the failure mode that gets misfiled as a hallucination. The model isn't confabulating; it is faithfully reasoning over a stale tool result that nobody bothered to label as stale. The trace logs a clean call and a clean response, the eval set never saw a stale-cache case, and the regression compounds quietly across every customer who hits the same TTL window.
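One way to make the staleness visible is to have the cache wrapper label every result with its age and to force each call site to state the maximum age it will accept. The sketch below is a minimal illustration under assumptions, not the wrapper from the incident: ToolResult, LabeledToolCache, the sku argument, and the injected clock are all hypothetical names introduced here.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass
class ToolResult:
    """A tool payload plus the freshness metadata the model and trace can see."""
    payload: Any
    fetched_at: float   # epoch seconds when the underlying tool actually ran
    age_seconds: float  # how old the payload is at the moment it is returned
    from_cache: bool    # True if served from cache rather than a live call


class LabeledToolCache:
    """Hypothetical cache wrapper: no default TTL, every result carries its age."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so tests can control time
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def call(
        self,
        tool_name: str,
        fetch: Callable[..., Any],
        args: Dict[str, Any],
        max_age_seconds: float,  # required: the caller must pick a staleness budget
    ) -> ToolResult:
        key = (tool_name, repr(sorted(args.items())))
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None:
            fetched_at, payload = hit
            age = now - fetched_at
            if age <= max_age_seconds:
                # Cache hit within budget: return it, labeled with its real age.
                return ToolResult(payload, fetched_at, age, True)
        # Miss or too old: call the real tool and record when we did.
        payload = fetch(**args)
        self._store[key] = (now, payload)
        return ToolResult(payload, now, 0.0, False)
```

With a wrapper shaped like this, a call such as cache.call("get_inventory_status", fetch_inventory, {"sku": "A1"}, max_age_seconds=60) cannot silently inherit a framework default, and the age_seconds field can be logged in the trace or passed to the model alongside the payload. Whether 60 seconds is the right budget for inventory is still a decision someone has to make; the point of the sketch is only that the decision becomes explicit.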
