13 posts tagged with "streaming"

Inter-Token Jitter: The Streaming UX Failure Your p95 Dashboards Can't See

11 min read
Tian Pan
Software Engineer

Your latency dashboard is green. Time-to-first-token is under the 800ms target on p95. Total completion time is under the four-second budget on p99. Then a senior PM forwards a support thread: "the assistant froze for like three seconds in the middle of an answer," "it stuttered and then dumped a whole paragraph," "I thought it crashed." Three users uninstalled this week with the same complaint. Nobody on the team can reproduce it on their laptop, and every metric you log says the system is healthy.

The metric that would explain the bug is the one you're not measuring: the distribution of gaps between consecutive tokens. A clean p95 total time can hide a stream where 8% of responses contain a 2.5-second pause halfway through, and to a user watching characters appear in real time, that pause reads as a broken system — not a slow one. Your dashboard is measuring the movie's runtime; your user is watching the movie.
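
A minimal sketch of the missing instrumentation, assuming an async-iterable token stream (the names here are illustrative, not any SDK's API): record the gap before every chunk and keep the distribution, not just the total.

```ts
// Sketch: per-response inter-token gap stats. Names are illustrative.
async function consumeWithGapStats(
  stream: AsyncIterable<string>,
  onToken: (t: string) => void,
): Promise<{ p95GapMs: number; maxGapMs: number }> {
  const gaps: number[] = [];
  let last = Date.now();
  for await (const token of stream) {
    const now = Date.now();
    gaps.push(now - last); // first entry approximates TTFT; the rest are the gaps users feel
    last = now;
    onToken(token);
  }
  gaps.sort((a, b) => a - b);
  return {
    p95GapMs: gaps[Math.floor(gaps.length * 0.95)] ?? 0,
    maxGapMs: gaps[gaps.length - 1] ?? 0, // the 2.5-second stall a green dashboard hides
  };
}
```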

Speculative Decoding Is a Streaming Protocol Decision, Not an Inference Optimization

12 min read
Tian Pan
Software Engineer

The "identical output" guarantee that ships with every speculative decoding paper is a guarantee about token distributions, not about what your user sees. Read the proofs carefully and you find a clean mathematical equivalence: the rejection-sampling acceptance criterion is designed so that the output distribution after speculation is exactly the distribution the target model would have produced on its own. That guarantee binds the bytes that leave the inference engine. It says nothing about the bytes that arrived on the user's screen five hundred milliseconds ago and have to be taken back.

If you stream draft tokens to the client the moment the small model emits them, you are running an A/B test on your own users every time the verifier rejects a suffix. Half a paragraph rewrites itself. A function name changes after the IDE has already syntax-highlighted it. A TTS voice has already pronounced "the answer is likely no" before the verifier swaps in "the answer is yes, with caveats." The math says the final distribution is the same as the slow path. The user's experience says they watched the model change its mind in public.

This is the part of speculative decoding that doesn't make it into the speedup numbers. It is also the part that turns "free 3× throughput" into half a quarter of streaming-protocol work that nobody scoped.
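
To make the protocol framing concrete, here is a hedged sketch of a commit-point stream, with `draftBatch` and `verify` as invented stand-ins for the draft model and the target model's accept/reject pass, not any real engine's API: tokens cross the wire only after verification, so nothing on screen is ever retracted.

```ts
// Sketch of a commit-point stream. `draftBatch` and `verify` are hypothetical
// stand-ins for the draft model and the target model's accept/reject pass.
type VerifyResult = { committed: string[]; done: boolean }; // accepted prefix + correction token

async function* committedStream(
  draftBatch: () => Promise<string[]>, // cheap model speculates a few tokens ahead
  verify: (speculated: string[]) => Promise<VerifyResult>,
): AsyncGenerator<string> {
  let done = false;
  while (!done) {
    const speculated = await draftBatch();
    const result = await verify(speculated); // rejection sampling happens server-side
    done = result.done;
    yield* result.committed; // only verified tokens cross the wire; nothing is retracted
  }
}
```

The cost is exactly what the speedup numbers omit: the client now sees tokens in verified bursts rather than at the draft model's pace, and scoping that tradeoff is the streaming-protocol work.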

Streaming JSON Parsers: The Gap Between Tokens and Typed Objects

12 min read
Tian Pan
Software Engineer

The model is emitting JSON token by token. Your UI wants to render fields the moment they materialize — a confidence score before the long answer body, the arguments of a tool call as the model fills them in. Then someone wires up JSON.parse on every chunk and the whole thing falls over, because JSON.parse is all-or-nothing. It needs a balanced document to return anything. Until the model emits the closing brace, you have nothing to show.

This is not a parser problem you can fix with a try/catch. The standard JSON parser was designed for the complete, content-length-known HTTP response. Partial input is not a state it models — it is "input error." When you treat a token stream as if it were an HTTP body, you inherit thirty years of "the document is either complete or invalid," and your UI pays the bill.
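
One common workaround, sketched below with no claim of edge-case completeness: track what is still open as chunks arrive, close it, and attempt a parse on the repaired snapshot. Dedicated streaming-JSON libraries handle far more cases; this is the idea in miniature.

```ts
// Sketch: best-effort partial JSON. Close whatever is still open and try a parse.
// Assumes the model emits well-nested JSON; real streaming parsers handle more.
function tryParsePartial(buffer: string): unknown | undefined {
  let closers = "";
  let inString = false;
  let escaped = false;
  for (const ch of buffer) {
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") closers = "}" + closers;
    else if (ch === "[") closers = "]" + closers;
    else if (ch === "}" || ch === "]") closers = closers.slice(1);
  }
  try {
    return JSON.parse(buffer + (inString ? '"' : "") + closers);
  } catch {
    return undefined; // buffer ends mid-number, mid-keyword, or after a comma: not renderable yet
  }
}
```

Run it on every appended chunk and render whichever fields come back defined: `{"score": 0.93, "answer": "The` already yields a usable score.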

The Cancellation Tax: Your Inference Bill After the User Hits Stop

9 min read
Tian Pan
Software Engineer

Your stop button is a lie. When a user clicks it, your UI stops rendering tokens; your provider, in most configurations, keeps generating them. The bytes never reach a browser, but they reach your invoice. The gap between what the user saw and what you paid for is the cancellation tax, and it is the single most under-reported line item on AI cost dashboards.

The reason the tax exists is structural. Autoregressive inference is a GPU-bound pipeline: by the time your client closes the TCP connection, the model has already been scheduled, KV-cached, and is emitting tokens at 30–200 per second. Most serving stacks do not check for client liveness between tokens. They finish the job, log the usage, and bill you. The client saw ten tokens; the log recorded eight hundred. Langfuse, Datadog, and every other observability platform will faithfully report the eight hundred, because that's what the provider's usage block reported.
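
The first mitigation is simply propagating the disconnect. A sketch under two assumptions: an Express-style server, and a provider endpoint (the URL and payload below are placeholders) that cancels generation when the HTTP request aborts. Whether a given vendor actually stops generating, and stops billing, on abort varies; the sketch only guarantees that your side stops forwarding and gives the provider the chance to stop.

```ts
// Sketch: propagate Stop upstream. Express-style; the provider URL and payload
// are placeholders. Whether the provider stops billing on abort varies by vendor.
import express from "express";

const app = express();
app.use(express.json());

app.post("/chat", async (req, res) => {
  const upstream = new AbortController();
  // Browser gone before we finished? Cancel the provider call too.
  res.on("close", () => {
    if (!res.writableEnded) upstream.abort();
  });

  try {
    const r = await fetch("https://provider.example/v1/stream", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ prompt: req.body.prompt }),
      signal: upstream.signal,
    });
    for await (const chunk of r.body!) {
      res.write(chunk); // forward tokens only while someone is listening
    }
    res.end();
  } catch (err) {
    if (!upstream.signal.aborted) throw err; // aborts are the point; real errors are not
  }
});

app.listen(3000);
```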

The Output Commitment Problem: Why Streaming Self-Correction Destroys User Trust More Than the Original Error

10 min read
Tian Pan
Software Engineer

A user asks your agent a question. Tokens start flowing. Three sentences in, the model writes "Actually, let me reconsider — " and pivots to a different answer. The revised answer is better. The user closes the tab.

This is the output commitment problem, and it is one of the most consistently underestimated UX failures in shipped AI products. The engineering mindset treats self-correction as a feature — the model noticed its own error, that is the system working as intended. The user-perception mindset treats it as a disaster — the product demonstrated, live, that its first confident claim was wrong. Those two readings are both correct, and they do not reconcile on their own.

The core asymmetry is that streaming makes thinking legible, and legible thinking is auditable thinking. A model that hallucinated silently and then produced a clean final answer would look competent. The same model, streaming every half-thought, looks like it is flailing. The answer quality is identical. The perception is not.
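
One way to think about mitigation is commitment granularity. The sketch below is an illustration, not the post's prescription: commit text to the screen at sentence boundaries instead of token boundaries, so a within-sentence pivot happens in a buffer rather than in public. Coarser granularities trade more first-paint latency for fewer visible reversals.

```ts
// Sketch: commit at sentence boundaries instead of token boundaries.
// An illustration of commitment granularity, not the post's prescription.
async function* sentenceCommitted(stream: AsyncIterable<string>): AsyncGenerator<string> {
  let buffer = "";
  for await (const token of stream) {
    buffer += token;
    let boundary: number;
    while ((boundary = buffer.search(/[.!?]\s/)) !== -1) {
      yield buffer.slice(0, boundary + 2); // the sentence is finished; now it's safe to show
      buffer = buffer.slice(boundary + 2);
    }
  }
  if (buffer) yield buffer; // flush the tail when the stream ends
}
```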

Time-to-First-Token Is the Latency SLO You Aren't Instrumenting

11 min read
Tian Pan
Software Engineer

Pull the last week of production traces and look at your latency dashboard. You almost certainly have p50 and p99 on total request latency. You probably have token throughput. You may even have a tokens-per-second chart, because a provider benchmark talked you into it. What you almost certainly do not have is a per-model, per-route, per-tenant histogram of time to first token — the single number that governs how fast your product feels.

This is not a small oversight. For any streaming interface — chat, code completion, agent sidebars, voice — perceived speed is set by how long the user stares at a blinking cursor before anything appears. Once the first token lands, the user is reading; subsequent tokens compete with their reading speed, not with their patience. Total latency matters for throughput planning and budget. TTFT matters for whether the product feels alive.

The gap between these two numbers is widening. Reasoning models can produce identical total latency to their non-reasoning siblings while pushing TTFT from 400 ms to 30 seconds. A routing change that "keeps latency flat" can silently turn a snappy assistant into a hanging window. If you are not graphing TTFT, you are shipping UX regressions you cannot see.
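
The fix is one timestamp. A sketch that wraps any token stream and reports TTFT per model, route, and tenant, where `recordHistogram` is a hypothetical stand-in for whatever metrics client you already run:

```ts
// Sketch: one timestamp per stream. `recordHistogram` stands in for whatever
// metrics client you use (the name is hypothetical).
declare function recordHistogram(
  name: string,
  ms: number,
  tags: Record<string, string>,
): void;

async function* withTtft<T>(
  stream: AsyncIterable<T>,
  tags: { model: string; route: string; tenant: string },
): AsyncGenerator<T> {
  const start = Date.now();
  let first = true;
  for await (const chunk of stream) {
    if (first) {
      recordHistogram("llm.ttft_ms", Date.now() - start, tags); // the number that sets perceived speed
      first = false;
    }
    yield chunk;
  }
}
```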

The Latency Perception Gap: Why a 3-Second Stream Feels Faster Than a 1-Second Batch

11 min read
Tian Pan
Software Engineer

Your users don't have a stopwatch. They have feelings. And those feelings diverge from wall-clock reality in ways that matter enormously for how you build AI interfaces. A response that appears character-by-character over three seconds will consistently feel faster to users than a response that materializes all at once after one second — even though the batch system is objectively faster. This isn't irrational or a bug in human cognition. It's a well-documented perceptual phenomenon, and if you're building AI products without accounting for it, you're optimizing for the wrong metric.

This post breaks down the psychology behind latency perception, the metrics that actually predict user satisfaction, the frontend patterns that exploit these perceptual quirks, and when streaming adds more complexity than it's worth.
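
As a preview of those patterns, here is one in sketch form: a pacing buffer that turns bursty network delivery into a steady on-screen drip. The 30ms cadence is an illustrative assumption, not a recommendation from the post.

```ts
// Sketch: a pacing buffer. Tokens arrive in bursts; the screen gets a steady drip.
// (A real version would clear the interval when the stream ends.)
function makePacedRenderer(render: (t: string) => void, intervalMs = 30) {
  const queue: string[] = [];
  setInterval(() => {
    const next = queue.shift();
    if (next !== undefined) render(next); // one token per tick, whatever the network did
  }, intervalMs);
  return (token: string) => queue.push(token); // feed this as chunks arrive
}
```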

SSE vs WebSockets vs gRPC Streaming for LLM Apps: The Protocol Decision That Bites You Later

11 min read
Tian Pan
Software Engineer

Most teams building LLM features pick a streaming protocol the same way they pick a font: quickly, without much thought, and they live with the consequences for years. The first time the choice bites you is usually in production — a Cloudflare 524 timeout that corrupts your SSE stream, a WebSocket server that leaks memory under sustained load, or a gRPC-Web integration that worked fine in unit tests and then silently failed when a client needed to send messages upstream. The protocol shapes your failure modes. Picking based on benchmark throughput is the wrong frame.

Every major LLM provider — OpenAI, Anthropic, Cohere, Hugging Face — streams tokens over Server-Sent Events. That fact is a strong prior, but not because SSE is fast. It's because SSE is stateless, trivially compatible with HTTP infrastructure, and its failure modes are predictable. The question is whether your application has requirements that force you off that path.
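
Part of why SSE is the strong prior is how little of it there is. A sketch of a complete token endpoint, Express-style with an illustrative route: plain HTTP, no upgrade handshake, and `id:` lines give browsers a free resume cursor via Last-Event-ID.

```ts
// Sketch: a complete SSE token endpoint. Plain HTTP; no upgrade, no connection
// state. Route and payload are illustrative.
import express from "express";

const app = express();
app.get("/stream", (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.flushHeaders();

  const tokens = ["Hello", ",", " world"]; // stand-in for model output
  let i = 0;
  const timer = setInterval(() => {
    if (i >= tokens.length) {
      res.write("event: done\ndata: {}\n\n");
      clearInterval(timer);
      res.end();
      return;
    }
    // `id:` gives the client a resume cursor; browsers send it back as Last-Event-ID.
    res.write(`id: ${i}\ndata: ${JSON.stringify({ token: tokens[i++] })}\n\n`);
  }, 50);

  req.on("close", () => clearInterval(timer));
});
app.listen(3000);
```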

AI-Native API Design: Why REST Breaks When Your Backend Thinks Probabilistically

11 min read
Tian Pan
Software Engineer

Most backend engineers can recite the REST contract from memory: client sends a request, server processes it, server returns a status code and body. A 200 means success. A 4xx means the client did something wrong. A 5xx means the server broke. The response is deterministic, the timeout is predictable, and idempotency keys guarantee safe retries.

LLM backends violate every one of those assumptions. A 200 OK can mean your model hallucinated the entire response. A successful request can take twelve minutes instead of twelve milliseconds. Two identical requests with identical parameters will return different results. And if your server times out mid-inference, you have no idea whether the model finished or not.

Teams that bolt LLMs onto conventional REST APIs end up with a graveyard of hacks: timeouts that kill live agent tasks, clients that treat hallucinated 200s as success, retry logic that charges a user's credit card three times because idempotency keys weren't designed for probabilistic operations. This post walks through where the mismatch bites hardest and which interface patterns actually hold up in production.
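
To make one of those mismatches concrete: against a non-deterministic backend, an idempotency key has to replay the stored first response, not dedupe the request. A sketch with an in-memory map standing in for durable storage:

```ts
// Sketch: idempotency for a non-deterministic backend means replaying the stored
// response, not deduping the request. In-memory map stands in for durable storage.
const results = new Map<string, { status: "pending" | "done"; body?: string }>();

async function handleGenerate(
  idempotencyKey: string,
  run: () => Promise<string>, // the expensive, non-repeatable model call
): Promise<string> {
  const existing = results.get(idempotencyKey);
  if (existing?.status === "done") return existing.body!; // retry: same answer, no second charge
  if (existing?.status === "pending") throw new Error("409: original request still running");
  results.set(idempotencyKey, { status: "pending" });
  try {
    const body = await run();
    results.set(idempotencyKey, { status: "done", body });
    return body;
  } catch (err) {
    results.delete(idempotencyKey); // failed runs may be retried
    throw err;
  }
}
```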

The Streaming Infrastructure Behind Real-Time Agent UIs

12 min read
Tian Pan
Software Engineer

Most agent streaming implementations break in one of four ways: the proxy eats the stream silently, the user closes the tab and the agent runs forever burning tokens, the page refreshes and the task is simply gone, or a tool call fails mid-stream and the agent goes quietly idle. None of these are model problems. They are infrastructure problems that teams discover in production after their demo went fine on localhost.

This post is about that gap — the server-side architecture decisions that determine whether a real-time agent UI is actually reliable, not just impressive in a demo environment.
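
As a taste of those decisions, one pattern that addresses the refresh and tab-close failures is decoupling task lifetime from connection lifetime: the agent appends to a durable event log keyed by task id, and a reconnecting client replays from its last seen offset. A sketch, with an in-memory map standing in for real storage:

```ts
// Sketch: the agent writes to a durable log whether or not anyone is watching;
// connections are just views onto it. In-memory map stands in for Redis/Postgres.
const eventLog = new Map<string, string[]>(); // taskId -> ordered event payloads

function appendEvent(taskId: string, payload: string): void {
  const log = eventLog.get(taskId) ?? [];
  log.push(payload); // the task does not die with the socket
  eventLog.set(taskId, log);
}

// On (re)connect, replay everything after the client's Last-Event-ID.
function replayFrom(taskId: string, lastEventId: number): Array<{ id: number; data: string }> {
  const log = eventLog.get(taskId) ?? [];
  return log.slice(lastEventId + 1).map((data, i) => ({ id: lastEventId + 1 + i, data }));
}
```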

Voice AI in Production: Engineering the 300ms Latency Budget

10 min read
Tian Pan
Software Engineer

Most teams building voice AI discover the latency problem the same way: in production, with real users. The demo feels fine. The prototype sounds impressive. Then someone uses it on an actual phone call and says it feels robotic — not because the voice sounds bad, but because there's a slight pause before every response that makes the whole interaction feel like talking to someone with a bad satellite connection.

That pause is almost always between 600ms and 1.5 seconds. The target is under 300ms. The gap between those two numbers explains everything about how voice AI systems are actually built.
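
To be precise about what that budget measures: it is the mouth-to-ear gap, from detected end of user speech to the first audio frame out. A sketch of that single measurement, with hypothetical callback names for the VAD and TTS playback hooks:

```ts
// Sketch: the one measurement the budget is judged on. Callback names are
// hypothetical hooks into your VAD and TTS playback path.
let endOfSpeechAt = 0;

function onEndOfUserSpeech(): void {
  endOfSpeechAt = Date.now(); // VAD says the user stopped talking
}

function onFirstAudioFrameOut(): void {
  const gapMs = Date.now() - endOfSpeechAt;
  console.log(`voice turn gap: ${gapMs}ms (budget: 300ms)`); // naive pipelines land at 600-1500ms
}
```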

Streaming AI Applications in Production: What Nobody Warns You About

10 min read
Tian Pan
Software Engineer

The first sign something is wrong: your staging environment streams perfectly, but in production every user sees a blank screen, then the entire response appears at once. You check the LLM provider — fine. You check the backend — fine. The server is streaming tokens. They just never make it to the browser.

The culprit, 90% of the time: NGINX is buffering your response.

This is the most common streaming failure mode, and it's entirely invisible unless you know to look for it. It also captures something broader about production streaming: the problems aren't usually in the LLM integration. They're in all the infrastructure between the model and the user.
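
The fix is small enough to show in full. Either disable buffering at the proxy (`proxy_buffering off;` on the relevant location block) or have the application request it per response with the `X-Accel-Buffering: no` header, which NGINX honors. An Express-style sketch:

```ts
// Sketch: the application-side fix. Equivalent nginx.conf fix: `proxy_buffering off;`
// on the location block that fronts this route.
import express from "express";

const app = express();
app.get("/stream", (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("X-Accel-Buffering", "no"); // tells NGINX: forward chunks as they arrive
  res.flushHeaders();
  res.write("data: first token\n\n"); // reaches the browser immediately, not at res.end()
  res.end();
});
app.listen(3000);
```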