6 posts tagged with "performance"

Semantic Caching for LLM Applications: What the Benchmarks Don't Tell You

· 8 min read
Tian Pan
Software Engineer

Every vendor selling an LLM gateway will show you a slide with "95% cache hit rate." What that slide won't show you is the fine print: that number refers to match accuracy when a hit is found, not how often a hit is found in the first place. Real production systems see 20–45% hit rates — and that gap between marketing and reality is where most teams get burned.

Semantic caching is a genuinely useful technique. But deploying it without understanding its failure modes is how you end up returning wrong answers to users with high confidence, wondering why your support queue doubled.
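The distinction between hit rate and match accuracy is easy to conflate in code. As a minimal sketch (the `SemanticCache` class, threshold value, and toy embeddings here are illustrative, not from any particular library), hit rate is simply hits over lookups; whether a returned hit was actually the *right* answer has to be judged separately against labeled data:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a cached answer when a query embedding is close enough
    to a previously stored one; otherwise report a miss."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []   # list of (embedding, answer) pairs
        self.lookups = 0
        self.hits = 0

    def get(self, embedding):
        self.lookups += 1
        best_answer, best_sim = None, -1.0
        for emb, answer in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        if best_answer is not None and best_sim >= self.threshold:
            self.hits += 1
            return best_answer
        return None  # miss: caller falls through to the LLM

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

    @property
    def hit_rate(self):
        # This is the 20-45% number production systems actually see.
        # It says nothing about whether the hits were correct answers.
        return self.hits / self.lookups if self.lookups else 0.0

cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0, 0.0], "cached answer")
hit = cache.get([0.95, 0.05, 0.0])   # near-duplicate query: hit
miss = cache.get([0.0, 1.0, 0.0])    # unrelated query: miss
```

Note that a tighter threshold trades hit rate for match accuracy, which is exactly why quoting one number without the other is misleading.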

Speculative Execution in AI Pipelines: Cutting Latency by Betting on the Future

· 11 min read
Tian Pan
Software Engineer

Most LLM pipelines are embarrassingly sequential by accident. An agent calls a weather API, waits 300ms, calls a calendar API, waits another 300ms, calls a traffic API, waits again — then finally synthesizes an answer. That 900ms of total latency could have been 300ms if those three calls had run in parallel. Nobody designed the system to be sequential; it just fell out naturally from writing async calls one after another.
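The fix for accidental sequencing is often one line. A minimal sketch (the `fetch` coroutine stands in for the weather/calendar/traffic calls above; the 100 ms delay simulates network latency):

```python
import asyncio
import time

async def fetch(name: str) -> str:
    # Stand-in for one external API call with ~100 ms of latency.
    await asyncio.sleep(0.1)
    return f"{name}: ok"

async def sequential() -> list[str]:
    # Awaiting one call at a time: latencies add up (~300 ms total).
    return [await fetch("weather"), await fetch("calendar"), await fetch("traffic")]

async def parallel() -> list[str]:
    # Independent calls launched together: ~100 ms total.
    return list(await asyncio.gather(
        fetch("weather"), fetch("calendar"), fetch("traffic")))

start = time.perf_counter()
results = asyncio.run(parallel())
elapsed = time.perf_counter() - start
```

The calls must actually be independent for this to be safe; if the calendar call depends on the weather result, parallelizing them changes the program's meaning, not just its speed.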

Speculative execution is the umbrella term for a family of techniques that cut perceived latency by doing work before you know you need it — running parallel hypotheses, pre-fetching likely next steps, and generating multiple candidate outputs simultaneously. These techniques borrow directly from CPU design, where processors have speculatively executed future instructions since the 1990s. Applied to AI pipelines, the same instinct — commit to likely outcomes, cancel the losers, accept the occasional waste — can produce dramatic speedups. But the coordination overhead can also swallow the gains whole if you're not careful about when to apply them.
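The race-and-cancel pattern at the heart of this can be sketched in a few lines of asyncio (the task names and delays here are illustrative; in a real pipeline each task would be a model call or tool invocation):

```python
import asyncio

async def hypothesis(name: str, delay: float) -> str:
    # One speculative branch of the pipeline.
    await asyncio.sleep(delay)
    return name

async def first_to_finish() -> str:
    # Launch all candidate branches, commit to whichever completes first,
    # and cancel the losers -- their partial work is the accepted waste.
    tasks = [
        asyncio.create_task(hypothesis("fast-path", 0.05)),
        asyncio.create_task(hypothesis("slow-path", 0.5)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

winner = asyncio.run(first_to_finish())
```

Cancellation is the part that actually requires care: if the losing branches hold connections or have side effects, cancelling them mid-flight is where the coordination overhead mentioned above comes from.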

Load Testing LLM Applications: Why k6 and Locust Lie to You

· 11 min read
Tian Pan
Software Engineer

You ran your load test. k6 reported 200ms average latency, 99th percentile under 800ms, zero errors at 50 concurrent users. You shipped to production. Within a week, users were reporting 8-second hangs, dropped connections, and token budget exhaustion mid-stream. What happened?

The test passed because you measured the wrong things. Conventional load testing tools were designed for stateless HTTP endpoints that return a complete response in milliseconds. LLM APIs behave like nothing those tools were built to model: they stream tokens over seconds, charge by the token rather than the request, saturate GPU memory rather than CPU threads, and respond completely differently depending on whether a cache is warm. A k6 script that hammer-tests your /chat/completions endpoint will produce numbers that look like performance data but contain almost no signal about what production actually looks like.
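A load harness for streaming endpoints has to record at least two timestamps per request, not one. A minimal sketch of the measurement side (the `simulated_stream` generator is a stand-in for an SSE response from a `/chat/completions`-style endpoint; the delays are invented):

```python
import time

def simulated_stream(n_tokens=20, ttft=0.05, per_token=0.005):
    # Stand-in for a streaming LLM response: a prefill pause,
    # then tokens arriving at a steady decode cadence.
    time.sleep(ttft)
    yield "first"
    for i in range(n_tokens - 1):
        time.sleep(per_token)
        yield f"tok{i}"

def measure(stream):
    # Record time-to-first-token and total stream time separately.
    # A request-level timer (what k6 reports) sees only `total`.
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        tokens += 1
    total = time.perf_counter() - start
    return {"ttft": first_token_at, "total": total, "tokens": tokens}

m = measure(simulated_stream())
```

Token counts belong in the report too, since cost and GPU saturation scale with tokens served, not requests served.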

LLM Latency Decomposition: Why TTFT and Throughput Are Different Problems

· 11 min read
Tian Pan
Software Engineer

Most engineers building on LLMs treat latency as a single dial. They tune something — a batch size, a quantization level, an instance type — observe whether "it got faster," and call it done. This works until you hit production and discover that your p50 TTFT looks fine while your p99 is over 3 seconds, or that the optimization that doubled your throughput somehow made individual users feel the system got slower.

TTFT and throughput are not two ends of the same slider. They are caused by fundamentally different physics, degraded by different bottlenecks, and fixed by different techniques. Treating them as interchangeable is the root cause of most LLM inference incidents I've seen in production.
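The decomposition is simple enough to write down: total streamed latency is TTFT (prefill plus queueing) plus one inter-token interval per remaining output token. A sketch with invented numbers shows how a throughput-friendly configuration can still feel slower to each user:

```python
def request_latency(ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    """Total wall-clock latency of one streamed response:
    prefill time (TTFT) plus one time-per-output-token interval
    for each token after the first."""
    return ttft_s + (output_tokens - 1) * tpot_s

# Hypothetical comparison: heavier batching halves time-per-token
# (better aggregate throughput) but queueing inflates TTFT, so the
# user-perceived latency gets worse, not better.
interactive = request_latency(ttft_s=0.2, output_tokens=100, tpot_s=0.02)
batched     = request_latency(ttft_s=1.5, output_tokens=100, tpot_s=0.01)
```

Here `interactive` comes to about 2.18 s against 2.49 s for `batched`, even though the batched configuration serves more tokens per GPU-second: the two metrics genuinely move independently.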

LLM Latency in Production: What Actually Moves the Needle

· 10 min read
Tian Pan
Software Engineer

Most LLM latency advice falls into one of two failure modes: it focuses on the wrong metric, or it recommends optimizations that are too hardware-specific to apply unless you're running your own inference cluster. If you're building on top of a hosted API or a managed inference provider, a lot of that advice is noise.

This post focuses on what actually moves the needle — techniques that apply whether you control the stack or not, grounded in production data rather than benchmark lab conditions.

Three Skills to Boost Team Performance

· 3 min read

Teamwork is crucial. Even a genius like Turing needed help from others to crack the Enigma. So, what are the key factors that enable a team to succeed? People naturally assume it is the ability of individual members, but the reality might surprise you.

At the beginning of The Culture Code, the author describes an interesting competition among kindergarten children, business school students, and lawyers: participants had to build the tallest structure possible using raw spaghetti, tape, string, and marshmallows. The kindergarteners won. Why did the seemingly least capable group defeat the others? Reviewing the competition revealed that the business school students typically analyzed the problem first, debated the right strategy, and quietly established a hierarchy, whereas the kindergarteners simply started building and experimenting with different approaches.

A strong team culture emphasizes communication among team members rather than individual skills. Such a culture maximizes overall performance. To foster a positive team culture that enhances collective performance, there are three key skills.

1. Create a Safe Work Environment

People are more likely to unleash their full potential in an environment that feels safe, so creating that sense of safety is crucial. Safety within a team comes from familiarity and connection among its members. To cultivate it, learn to listen and let others know they have been heard. When people know that what they say is valued, they feel secure. Offering appropriate feedback while listening deepens the interaction and makes people feel needed.

2. Be Vulnerable to Build Trust

Although it may seem counterintuitive, showing vulnerability can indeed enhance team performance. We often observe the behaviors of those around us and learn by imitation. Admitting your weaknesses and mistakes to team members shows that they can do the same. This helps to strengthen internal trust among the team.

At the same time, displaying your shortcomings expresses an expectation for collaboration. When you show that you rely on others for help, they can also comfortably acknowledge their need for assistance. Over time, everyone understands that they shouldn’t bear everything alone, naturally fostering a sense of unity within the team.

3. Establish Common Goals and Reinforce Them

A steadfast pursuit of common goals is key to good team performance. A team's common goal refers to the beliefs and values that motivate the actions of its members. This common goal clarifies the team's self-identity and communicates it to the outside world. Psychologist Gabriele Oettingen has demonstrated through several studies that communicating common goals helps unite members and makes achieving those goals easier.

Repetition is what makes goals stick. Stating them ten or even a hundred times is worthwhile if it makes them clearer. You can repeatedly convey the company's mission in meetings, or distill the goals into catchy slogans.