Blog

Page 7

12 articles

Your Eval Suite Is a Production Workload: When Nightly Tests Starve Live Traffic
A nightly eval suite and a live product sharing one provider organization is a noisy-neighbor outage waiting to happen. Here is how to isolate quota, gate PRs on token impact, and treat eval as the production workload it actually is.
llm-ops, evaluation
Jun 2 · 11 min
Your Latency SLO Is a Function of Other Teams' Prompt Sizes
Shared token-per-minute LLM limits decouple your latency SLO from your own service. The fix is to denominate internal capacity in the unit the provider throttles in — not in requests or dollars.
insider, llm
Jun 2 · 10 min
Your RAG Corpus Trust Boundary Is Whoever Can Write to Its Sources
Indirect prompt injection through a competitor's public review can turn your RAG pipeline into an exfiltration channel. The trust boundary is not who wrote the ingestion code; it is who can write to the sources.
security, rag
Jun 2 · 10 min
Retrieval Pipeline Residency: The Embedding That Crossed the Border Your LLM Call Didn't
Your inference endpoint is pinned to Frankfurt. Your embedding API, vector control plane, rerank service, prompt cache, and trace store are not. A walkthrough of the six residency surfaces in a RAG request and the org gap where each one quietly crosses the border.
insider, data-residency
Jun 1 · 9 min
The 40-Point Gap Between Your Interviewers When the Candidate Says 'I'd Just Prompt It'
A forty-point disagreement on the same candidate is not a candidate problem — it's a rubric problem. How to calibrate an AI-engineer hiring loop your own team cannot yet agree on.
hiring, ai-engineering
Jun 1 · 9 min
The 429 Whose Body Said OK And Your Client Believed The Body
When a 429's body says ok, naive clients trust the body, skip the backoff, and turn a rate limit into a retry-storm outage. The fix is structural: read status, headers, and body together and let the strictest claim win.
insider, rate-limiting
Jun 1 · 9 min
The A/B Test Powered by Token Counts Instead of Outcomes
When the experiment platform makes token counts easy and user outcomes hard, prompt A/B tests ship local maxima the team cannot distinguish from regressions.
insider, ai-engineering
Jun 1 · 13 min
The Agent Budget That Approved Cost-Per-Call and Never Measured Cost-Per-Resolved-Task
An agent that drives cost-per-call down 25% while cost-per-resolved-task drifts up 40% is the most common unit-economics failure in agentic deployments. Here is why the vendor SKU is not the unit of work, and how to put the right metric on the wall.
insider, ai-agents
Jun 1 · 10 min
When 'Escalate to Human' Becomes the Queue: The Hidden Incentive Bug in Your AI Support Stack
Deflection dashboards lie. The reward function you shipped quietly turned 'escalate to human' into your AI agent's cheapest action — and your support team into its overflow queue.
ai-agents, customer-support
Jun 1 · 10 min
The Agent Plan That Branched on a Fact Your Context Pruner Already Dropped
When a context pruner evicts a tool result that a later plan step silently depends on, the agent keeps branching against evidence that no longer exists — and the trace looks like a hallucination.
ai-agents, context-engineering
Jun 1 · 11 min
The Agent Rollout Cadence Your Customer Success Team Could Not Absorb
When the AI team ships behavior changes weekly behind feature flags but customer success trains monthly, the gap shows up as customer trust quietly collapsing. The fix is a coordination contract, not more meetings.
insider, ai-agents
Jun 1 · 11 min
The Agent Runbook Your Incident Commander Could Not Execute
Most agent runbooks read fine in daylight and run blocked at 02:17 because the author has access the on-call SRE does not. Federation, declared scopes, break-glass endpoints, and drills are the fix.
ai-agents, sre
Jun 1 · 10 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
