
2 posts tagged with "webgpu"


Browser-Native AI Is a Per-Feature Decision: Four Axes Your Team Hasn't Priced

12 min read
Tian Pan
Software Engineer

The model-in-the-tab story used to be easy to dismiss: small models, novelty demos, a cute Whisper transcription that ran for thirty seconds before the laptop fan turned on. That story is dead. Quantization improved, WebGPU shipped in every major browser, on-device caches got a persistent quota, and 4-bit 3B models now stream tokens at a rate users perceive as "snappy" on a $500 laptop. The "should this run server-side?" question is no longer a default — it is a load-bearing architectural decision your product team is making by accident every time they accept the platform team's first answer.
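A concrete sketch of what that capability check looks like, using only standard browser APIs: navigator.gpu for WebGPU, plus the Storage API's estimate() and persist() calls behind the persistent quota. The 4 GiB threshold is an illustrative assumption for caching a quantized 3B model, not a measured requirement.

```ts
// Sketch: probe whether this device can plausibly host browser-native
// inference. requestAdapter(), storage.estimate(), and storage.persist()
// are standard browser APIs; the 4 GiB threshold is an assumed ballpark
// for a 4-bit 3B model's weights plus cache, not a benchmark.
async function canRunLocalInference(): Promise<boolean> {
  // WebGPU ships by default in current Chrome, Firefox, Edge, and Safari,
  // but the adapter can still be null (blocklisted driver, headless, etc.).
  if (!("gpu" in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  if (adapter === null) return false;

  // Persistent quota: is there room to cache quantized weights across visits?
  const { quota = 0, usage = 0 } = await navigator.storage.estimate();
  if (quota - usage < 4 * 1024 ** 3) return false;

  // Ask the browser to exempt the weight cache from eviction.
  await navigator.storage.persist();
  return true;
}
```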

The mistake that follows is bigger than the demo getting worse. Teams pick one backend (usually server inference, sometimes browser inference) for the entire product, and then pay the wrong tax on every feature that doesn't fit. The privacy-sensitive feature loses to the latency-sensitive one because the architecture forces a single answer. Or worse, the team picks browser-native because the demo was magical, then ships it to the whole fleet, where the 30% of users on long-tail devices get a degraded product the dashboard can't see.

Browser-native AI is not a faster TensorFlow.js. It is a different runtime with a different SRE story, a different cost model, and a four-axis trade-off that does not collapse into a single answer. Treating it as "the cheap version of the API call" is the architectural mistake of 2026.
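One way to avoid forcing a single answer is to make the backend a per-feature policy resolved against the device. A minimal sketch: FeaturePolicy and resolveBackend are hypothetical names, not anything from the post, and the privacy/latency split comes straight from the failure mode described above.

```ts
// Sketch: the backend as a per-feature decision instead of a product-wide
// default. FeaturePolicy and resolveBackend are hypothetical names.
type Backend = "browser" | "server" | "disabled";

interface FeaturePolicy {
  privacySensitive: boolean; // input should never leave the device
  latencySensitive: boolean; // round trip dominates perceived quality
}

function resolveBackend(policy: FeaturePolicy, deviceCapable: boolean): Backend {
  if (policy.privacySensitive) {
    // Routing this to a server would defeat the feature; degrade by
    // disabling it visibly rather than silently rerouting.
    return deviceCapable ? "browser" : "disabled";
  }
  if (policy.latencySensitive && deviceCapable) return "browser";
  // Everything else takes the well-understood server path.
  return "server";
}
```

The specific rules matter less than the shape: the decision runs once per feature and per device, so the privacy-sensitive feature and the latency-sensitive one no longer have to share an answer.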

Browser-Native LLM Inference: The WebGPU Engineering You Didn't Know You Needed

10 min read
Tian Pan
Software Engineer

Most AI features are architected the same way: user input travels to an API, a cloud GPU processes it, and a response travels back. That round trip is so normalized that engineers rarely question it. But it carries a hidden tax: 200–800ms of network latency on every interaction, an API key that must live somewhere accessible (and therefore vulnerable), and a hard dependency on uptime you don't control.

Browser-native LLM inference via WebGPU removes all three of those costs. The model runs on the user's GPU, inside a browser sandbox, with no network round trip. This isn't a future capability: as of late 2025, WebGPU ships by default across Chrome, Firefox, Edge, and Safari, covering roughly 82.7% of global browser traffic. The engineering question has shifted from "can we do this?" to "when does it beat the cloud, and how do we route intelligently between the two?"
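The routing half of that question has a small core. In this sketch, generateLocal and generateCloud are hypothetical stand-ins for a WebGPU inference engine and a server API; only the adapter check is a real browser call.

```ts
// Sketch of routing between the two backends. generateLocal and
// generateCloud are hypothetical stand-ins, not a real library API.
declare function generateLocal(prompt: string): Promise<string>;
declare function generateCloud(prompt: string): Promise<string>;

async function generate(prompt: string): Promise<string> {
  // Prefer the user's GPU when the sandbox actually exposes one.
  const adapter =
    "gpu" in navigator ? await navigator.gpu.requestAdapter() : null;
  if (adapter !== null) {
    try {
      return await generateLocal(prompt);
    } catch {
      // Local inference failed (out of memory, device lost, etc.);
      // fall through to the cloud path.
    }
  }
  return generateCloud(prompt);
}
```

The shape is the point: local inference is attempted only when the capability check passes, and the cloud path survives as a fallback, so neither backend becomes a hard dependency.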