
29 posts tagged with "multi-agent"


Prompt Contract Testing: How Teams Building Different Agents Coordinate Without Breaking Each Other

· 10 min read
Tian Pan
Software Engineer

When two microservices diverge in their API assumptions, your integration tests catch it before production does. When two agents diverge in their prompt assumptions, you find out when a customer gets contradictory answers—or when a cascading failure takes down the entire pipeline. Multi-agent AI systems fail at rates of 41–87% in production. More than a third of those failures aren't model quality problems; they're coordination breakdowns: one agent changed how it formats output, another still expects the old schema, and nobody has a test for that.

The underlying problem is that agents communicate through implicit contracts. A research agent agrees—informally, in someone's mental model—to return results as a JSON object with a sources array. The orchestrating agent depends on that shape. Nobody writes this down. Nobody tests it. Six weeks later the research agent's prompt is refined to return a ranked list instead, and the orchestrator silently drops half its inputs.
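
What the missing test could look like: a minimal sketch of a contract check for the hypothetical research-agent output described above, run against pinned fixtures in CI whenever either team edits its prompt. All names here are illustrative.

```python
# A minimal sketch of a prompt contract test. The contract and agent
# names are hypothetical, standing in for whatever your teams agreed on.
import json

RESEARCH_AGENT_CONTRACT = {
    "required_keys": {"sources"},
    "types": {"sources": list},
}

def check_contract(raw_output: str, contract: dict) -> list[str]:
    """Return a list of contract violations for one agent output."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    violations = []
    for key in contract["required_keys"]:
        if key not in data:
            violations.append(f"missing required key: {key}")
    for key, expected in contract["types"].items():
        if key in data and not isinstance(data[key], expected):
            violations.append(
                f"{key} is {type(data[key]).__name__}, expected {expected.__name__}"
            )
    return violations

# The old shape passes; the "refined" ranked-list shape fails loudly in CI.
assert check_contract('{"sources": ["a", "b"]}', RESEARCH_AGENT_CONTRACT) == []
assert check_contract('["a", "b"]', RESEARCH_AGENT_CONTRACT)  # ranked list breaks the contract
```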

The Consistency Gap: Why Parallel LLM Calls Contradict Each Other and How to Fix It

· 10 min read
Tian Pan
Software Engineer

Imagine a multi-agent pipeline that processes a user's support ticket. Agent A reads the ticket history and decides the user is a power user who needs an advanced response. Agent B reads the same ticket history in a parallel call and decides the user is a beginner who needs step-by-step guidance. Both agents finish at the same time and hand their outputs to a composer agent—which now has to reconcile two fundamentally incompatible mental models of the same person.

This isn't a rare edge case. Research analyzing production multi-agent failures found that 36.9% of failures are caused by inter-agent misalignment: conflicting outputs, context loss during handoffs, and incompatible conclusions reached independently. The consistency gap—the tendency for parallel LLM calls to contradict each other about shared entities—is one of the most underappreciated failure modes in agentic systems.
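
A minimal sketch of one mitigation, assuming hypothetical agent outputs that each classify the same user: gate the composer on agreement about shared entities instead of letting it reconcile two incompatible mental models silently.

```python
# A sketch of a pre-compose consistency gate. Field and agent names
# are illustrative, not from any particular framework.
from dataclasses import dataclass

@dataclass
class AgentView:
    agent: str
    user_level: str  # the shared entity attribute both agents inferred

def reconcile(views: list[AgentView]) -> str:
    """Refuse to compose when parallel views of a shared entity conflict."""
    levels = {v.user_level for v in views}
    if len(levels) > 1:
        # Surface the disagreement instead of letting the composer guess.
        conflict = ", ".join(f"{v.agent}={v.user_level}" for v in views)
        raise ValueError(f"consistency gap on user_level: {conflict}")
    return levels.pop()

views = [AgentView("agent_a", "power_user"), AgentView("agent_b", "beginner")]
try:
    reconcile(views)
except ValueError as e:
    print(e)  # route to a tiebreak step rather than composing silently
```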

Agent Identifiability: When Your Trace Can't Tell You Which Agent Did What

· 11 min read
Tian Pan
Software Engineer

A user reports the assistant gave them a wrong answer at 9:47 a.m. You open the trace. There are three hundred and forty spans. They are almost all named agent.run, llm.invoke, or tool.call. Some have a parent. Some are siblings. Three of them retried. One of them retried and then was cancelled. None of them tells you whether the bad output came from the planner, the worker, the critic, the reflection pass, or the second retry of the worker after the critic flagged it.

You spend the next hour grepping log lines for a UUID prefix you saw in a screenshot, cross-referencing timestamps against a Slack notification, and reconstructing the agent topology in your head from the indentation pattern in the trace viewer. Eventually you guess that the third worker invocation ran with a model alias that silently flipped to a different snapshot the night before. You cannot prove it from the trace alone.

The agent worked. The trace is intact. The hairball is the bug.
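
One way out, sketched with the OpenTelemetry Python API: make every span carry the identity you will need at 9:47 a.m. The attribute names below are illustrative conventions of this sketch, not a standard.

```python
# A minimal sketch of identity-bearing spans. Requires opentelemetry-api;
# without an SDK configured it runs as a no-op, which is fine for the demo.
from opentelemetry import trace

tracer = trace.get_tracer("agent-pipeline")

def run_agent(role: str, instance_id: str, attempt: int, model_snapshot: str):
    # Name the span after the role, and pin everything a trace reader needs.
    with tracer.start_as_current_span(
        f"{role}.run",
        attributes={
            "agent.role": role,                    # planner / worker / critic / reflection
            "agent.instance_id": instance_id,      # distinguishes siblings
            "agent.attempt": attempt,              # distinguishes retries
            "llm.model_snapshot": model_snapshot,  # catches silent alias flips
        },
    ):
        ...  # the actual agent call

run_agent("worker", "worker-3", attempt=2, model_snapshot="model-x-2024-06-01")
```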

The Closed-Loop Escalation Bug: When Your Specialist Agents Route in Circles

· 11 min read
Tian Pan
Software Engineer

A multi-agent system for market data research quietly burned through $47,000 in inference cost over four weeks before anyone noticed. The original weekly bill was $127. The cause wasn't a traffic spike or a model upgrade — it was two agents passing the same conversation back and forth for eleven days, each one confident the other was the right place for the request to live. Nothing errored. No alarm fired. The bot's "queue transferred" metric and the other bot's "task received" metric both went up in lockstep, and both dashboards looked healthy.

This is the closed-loop escalation bug. It is the multi-agent version of two helpful colleagues each insisting "no, you take it," except neither of them ever gets bored and walks away. The architecture diagram you drew at design time has each specialist owning a clean slice of the problem. The architecture the runtime actually executes has a routing cycle nobody in the room can see.
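
A minimal sketch of one guardrail, assuming a hypothetical envelope that travels with each handoff: a hop budget plus a check for the A→B→A→B signature, so the cycle fails on the fourth hop instead of on day eleven.

```python
# A sketch of routing-loop detection. MAX_HOPS and the agent names are
# illustrative; tune the budget per workflow.
from dataclasses import dataclass, field

MAX_HOPS = 8

@dataclass
class Envelope:
    task_id: str
    hops: list[str] = field(default_factory=list)

def transfer(env: Envelope, to_agent: str) -> None:
    """Record a handoff; refuse two-agent ping-pong or runaway chains."""
    env.hops.append(to_agent)
    if len(env.hops) > MAX_HOPS:
        raise RuntimeError(f"{env.task_id}: hop budget exhausted: {env.hops}")
    last = env.hops[-4:]
    if len(last) == 4 and last[0] == last[2] and last[1] == last[3]:
        raise RuntimeError(f"{env.task_id}: routing cycle detected: {last}")

env = Envelope("task-123")
try:
    for agent in ["billing_bot", "data_bot", "billing_bot", "data_bot"]:
        transfer(env, agent)
except RuntimeError as e:
    print(e)  # fires on the fourth hop, while the bill is still $127
```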

When Your Agents Disagree: Conflict Resolution Patterns for Parallel AI Systems

· 9 min read
Tian Pan
Software Engineer

Here is the uncomfortable fact that architecture reviews of multi-agent systems rarely surface: when you run two agents over the same task, they will not agree on the answer somewhere between 20% and 40% of the time, depending on task type. Most systems respond to this by silently picking one answer. The logs show a final decision; the intermediate disagreement disappears. Everything looks healthy until something downstream breaks, and you spend three to five times longer debugging it than you would a single-agent failure — because you can't tell which agent was wrong, or even that they disagreed at all.

Disagreement between agents is not a fringe case to handle later. As parallel agent topologies become a standard architecture pattern, conflict resolution graduates from a footnote into a first-class reliability discipline.
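
A minimal sketch of the first step, with illustrative names: record that agents disagreed, and who said what, before any resolution policy runs, so the disagreement never vanishes from the logs.

```python
# A sketch of making disagreement observable instead of silently picking
# one answer. The resolution policy here is a deliberate placeholder.
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("conflict")

def resolve(task_id: str, answers: dict[str, str]) -> str:
    """Pick an answer, but never let the conflict disappear."""
    if len(set(answers.values())) > 1:
        # The disagreement itself is the signal you need three weeks later.
        logger.warning("task=%s agents_disagree answers=%s", task_id, answers)
    # Placeholder policy: first answer wins; swap in your real rule here.
    return next(iter(answers.values()))

print(resolve("t-42", {"agent_a": "refund", "agent_b": "escalate"}))
```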

Cross-Team Agent SLAs Don't Compose: The 99% Math Your Org Forgot to Budget

· 11 min read
Tian Pan
Software Engineer

Team A's agent advertises a 99% success rate. Team B's agent advertises 99%. The new joint workflow that calls both lands at 98% on a good day, 96% on a bad one — and the team that owns the joint workflow is now the de facto SRE for two systems they don't own, can't reproduce locally, and didn't write the eval set for. Each upstream team is hitting its SLO. The composite product is missing its SLO. Nobody's pager is ringing on the right side of the boundary.

This is the math of independent failure rates, and it has been hiding in plain sight ever since the org started letting agents call each other. Five components at 99% reliability give you 95% end-to-end. Ten components give you 90%. A 20-step process at 95% per-step succeeds only 36% of the time — nearly two-thirds of operations fail before completion. By the time a workflow chains 50 components — not unusual once an enterprise agent starts calling sub-agents that call tool agents — a system where every individual piece is "99% reliable" will fail roughly four out of ten requests.
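
The compounding arithmetic from the paragraph above, as a few lines you could paste into an SLO review:

```python
# End-to-end success of a chain of independent steps is the product of
# per-step success rates.
for steps, per_step in [(5, 0.99), (10, 0.99), (20, 0.95), (50, 0.99)]:
    end_to_end = per_step ** steps
    print(f"{steps} steps at {per_step:.0%}/step -> {end_to_end:.1%} end-to-end")

# 5 steps at 99%/step -> 95.1% end-to-end
# 10 steps at 99%/step -> 90.4% end-to-end
# 20 steps at 95%/step -> 35.8% end-to-end
# 50 steps at 99%/step -> 60.5% end-to-end
```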

Researchers analyzing five popular multi-agent frameworks across more than 150 tasks identified failure rates between 41% and 87%, with the top three failures being step repetition, reasoning–action mismatch, and unawareness of termination conditions — and unstructured multi-agent networks have been observed to amplify errors up to 17× compared to single-agent baselines. The math isn't subtle. The problem is that the org's SLO sheets, dashboards, on-call rotations, and PRDs are still scoped one agent at a time.

Debate Diversity Collapse: When Three Agents Vote 3-0 Because They Read the Same Internet

· 11 min read
Tian Pan
Software Engineer

The architecture diagram says "ensemble of three frontier models, debate-and-reconcile, majority vote." The trace says all three agents converged on the same answer in round one and spent two more rounds politely paraphrasing each other. The eval says +0.4 points over a single call. The bill says 4.2x. Somewhere in there, somebody decided the panel was working.

Multi-agent debate is sold as a way to get disagreement-driven reasoning: three minds arguing toward a better answer than any one of them would reach alone. It depends on the agents actually disagreeing. Frontier LLMs trained on overlapping web corpora, instruction-tuned against overlapping preference datasets, and aligned against overlapping safety taxonomies share priors more than the architecture diagrams admit. After a round of "let's reconcile," what you observe is not three perspectives converging on truth — it is three samples from one distribution converging on the mode they were never that far from.

The pattern has a name in the recent literature: when an ensemble's vote-disagreement rate trends to zero independent of question difficulty, you have debate diversity collapse. The panel is still voting. The vote no longer carries information.
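
A minimal sketch of instrumenting for it, with illustrative names: track the first-round disagreement rate over a sliding window and alert when it trends toward zero regardless of question difficulty.

```python
# A sketch of the collapse metric: if the panel stops disagreeing in
# round one, the vote stops carrying information.
from collections import deque

class DisagreementMonitor:
    def __init__(self, window: int = 200):
        self.votes = deque(maxlen=window)

    def record(self, round_one_answers: list[str]) -> None:
        # A round "disagrees" if first-round answers are not unanimous.
        self.votes.append(len(set(round_one_answers)) > 1)

    @property
    def disagreement_rate(self) -> float:
        return sum(self.votes) / len(self.votes) if self.votes else 0.0

monitor = DisagreementMonitor()
monitor.record(["A", "A", "A"])  # unanimous round one: no information gained
monitor.record(["A", "B", "A"])
print(monitor.disagreement_rate)  # 0.5; alert when this trends toward zero
```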

The Silent Corruption Problem in Parallel Agent Systems

· 12 min read
Tian Pan
Software Engineer

When a multi-agent system starts behaving strangely — giving inconsistent answers, losing track of tasks, making decisions that contradict earlier reasoning — the instinct is to blame the model. Tweak the prompt. Switch to a stronger model. Add more context.

The actual cause is often more mundane and more dangerous: shared state corruption from concurrent writes. Two agents read the same memory, both compute updates, and one silently overwrites the other. The resulting state is technically valid — no exceptions thrown, no schema violations — but semantically wrong. Every agent that reads it afterward reasons correctly over incorrect information.

This failure mode is invisible at the individual operation level, hard to reproduce in test environments, and nearly impossible to distinguish from model error by looking at outputs alone. O'Reilly's 2025 research on multi-agent memory engineering found that 36.9% of multi-agent system failures stem from inter-agent misalignment — agents operating on inconsistent views of shared information. It's not a theoretical concern.
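
A minimal sketch of one fix, assuming a hypothetical in-process store (a real system would lean on its database's compare-and-swap or transactions): version-checked writes that fail loudly instead of losing updates.

```python
# A sketch of optimistic concurrency over shared agent memory. The store
# is illustrative; the point is that a stale write is rejected, not lost.
import threading

class VersionedStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._value, self._version = {}, 0

    def read(self):
        with self._lock:
            return dict(self._value), self._version

    def write(self, new_value: dict, expected_version: int) -> bool:
        with self._lock:
            if self._version != expected_version:
                return False  # someone wrote since we read; caller must re-read
            self._value, self._version = new_value, self._version + 1
            return True

store = VersionedStore()
state, version = store.read()
state["plan"] = "agent_a's update"
if not store.write(state, version):
    state, version = store.read()  # re-read and retry instead of clobbering
```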

Temperature Governance in Multi-Agent Systems: Why Variance Is a First-Class Budget

· 11 min read
Tian Pan
Software Engineer

Most production multi-agent systems apply a single temperature value—copied from a tutorial, set once, never revisited—to every agent in the pipeline. The classifier, the generator, the verifier, and the formatter all run at 0.7 because that's what the README said. This is the equivalent of giving every database query the same timeout regardless of whether it's a point lookup or a full table scan. It feels fine until you start debugging failure modes that look like model errors but are actually sampling policy errors.

Temperature is not a global dial. It's a per-role policy decision, and getting it wrong creates distinct failure signatures depending on which direction you miss in.
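
A minimal sketch of what a per-role policy could look like; the numbers here are illustrative, and the right ones come from your evals, not a README.

```python
# A sketch of temperature as a per-role policy rather than a global dial.
TEMPERATURE_POLICY = {
    "classifier": 0.0,  # deterministic routing: variance here is pure risk
    "verifier":   0.0,  # checks should be reproducible
    "generator":  0.7,  # creative drafting benefits from sampling variance
    "formatter":  0.2,  # mostly mechanical, small tolerance for rewording
}

def temperature_for(role: str) -> float:
    # Fail loudly on unknown roles rather than inheriting a global default.
    if role not in TEMPERATURE_POLICY:
        raise KeyError(f"no temperature policy for role: {role}")
    return TEMPERATURE_POLICY[role]

print(temperature_for("classifier"))  # 0.0
```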

Backpressure in Agent Pipelines: When AI Generates Work Faster Than It Can Execute

· 9 min read
Tian Pan
Software Engineer

A multi-agent research tool built on a popular open-source stack slipped into a recursive loop and ran for 11 days before anyone noticed. The bill: $47,000. Two agents had been talking to each other non-stop, burning tokens while the team assumed the system was working normally. This is what happens when an agent pipeline has no backpressure.

The problem is structural. When an orchestrator agent decomposes a task into sub-tasks and spawns sub-agents to handle each one, and those sub-agents can themselves spawn further sub-agents or fan out across multiple tool calls, you get exponential work generation. The pipeline produces work faster than it can execute, finish, or even account for. This is the same problem that reactive systems, streaming architectures, and network protocols solved decades ago — and the same solutions apply.
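
A minimal sketch of those solutions applied to agent fan-out, using asyncio with illustrative limits and a placeholder standing in for a real sub-agent call: a hard cap on total work plus a semaphore that blocks producers once too many sub-agents are in flight.

```python
# A sketch of backpressure on sub-agent fan-out. Limits are illustrative.
import asyncio

MAX_IN_FLIGHT = 8       # hard cap on concurrent sub-agents
MAX_TOTAL_TASKS = 100   # hard cap on work one request may generate

async def run_subtask(task: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for an LLM/tool call
    return f"done: {task}"

async def orchestrate(subtasks: list[str]) -> list[str]:
    if len(subtasks) > MAX_TOTAL_TASKS:
        raise RuntimeError(f"work budget exceeded: {len(subtasks)} subtasks")
    gate = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(task: str) -> str:
        async with gate:  # producers wait once MAX_IN_FLIGHT are running
            return await run_subtask(task)

    return await asyncio.gather(*(bounded(t) for t in subtasks))

print(asyncio.run(orchestrate([f"t{i}" for i in range(20)])))
```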

Consensus Protocols for Multi-Agent Decisions: What Happens When Your Agents Disagree

· 9 min read
Tian Pan
Software Engineer

You have three agents analyzing a customer support ticket. Two say "refund immediately," one says "escalate to fraud review." You pick the majority answer and ship the refund. Three days later, the fraud team asks why you auto-refunded a known chargeback pattern.

This is the consensus problem in multi-agent systems, and it turns out that distributed systems engineers solved important pieces of it decades ago. But naively transplanting those solutions — or worse, defaulting to majority vote — creates failure modes that are uniquely dangerous when your "nodes" are language models with opinions.
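
A minimal sketch of one safer default, with illustrative risk labels: majority vote, except a dissent naming a high-risk outcome escalates to a human instead of being outvoted.

```python
# A sketch of risk-weighted consensus: asymmetric costs mean a single
# fraud-flavored dissent outweighs two cheerful refunds.
from collections import Counter

HIGH_RISK = {"escalate_to_fraud_review", "block_account"}

def decide(votes: list[str]) -> str:
    if any(v in HIGH_RISK for v in votes):
        return "escalate_to_human"  # dissent on a high-risk label never loses
    winner, _ = Counter(votes).most_common(1)[0]
    return winner

print(decide(["refund", "refund", "escalate_to_fraud_review"]))  # escalate_to_human
```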

Race Conditions in Concurrent Agent Systems: The Bugs That Look Like Hallucinations

· 13 min read
Tian Pan
Software Engineer

Three agents processed a customer account update concurrently. All three logged success. The final database state was wrong in three different ways simultaneously, and no error was ever thrown. The team spent two weeks blaming the model.

It wasn't the model. It was a race condition.

This is the failure mode that gets misdiagnosed more than any other in production multi-agent systems: data corruption caused by concurrent state access, mistaken for hallucination because the downstream agents confidently reason over corrupted inputs. The model isn't making things up. It's faithfully processing garbage.
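
A minimal sketch of the boring fix, assuming a hypothetical in-process account store: serialize read-modify-write per entity, so the concurrent updates that produced three wrong states can no longer interleave.

```python
# A sketch of per-entity write serialization with asyncio locks. The
# store and field names are illustrative.
import asyncio
from collections import defaultdict

accounts: dict[str, dict] = {"acct-1": {"plan": "free", "seats": 1}}
locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

async def update_account(account_id: str, field: str, value) -> None:
    async with locks[account_id]:  # one writer per account at a time
        state = dict(accounts[account_id])  # read
        state[field] = value                # modify
        await asyncio.sleep(0)              # yield point where races used to hide
        accounts[account_id] = state        # write

async def main():
    # Three concurrent updates that previously interleaved and lost writes.
    await asyncio.gather(
        update_account("acct-1", "plan", "pro"),
        update_account("acct-1", "seats", 5),
        update_account("acct-1", "owner", "alice"),
    )
    print(accounts["acct-1"])  # all three updates survive

asyncio.run(main())
```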