
4 posts tagged with "chunking"


The Citation Index Your Chunker Shifted by One When It Started Prefixing Line Numbers

· 11 min read
Tian Pan
Software Engineer

The chunker started prepending [line N] to every chunk. The eval went green. From that day on, every citation the model produced pointed to the paragraph one position before the actual evidence, on every document, in a product that serves a regulated industry. The team did not find out from the eval. They found out from an auditor who read the cited sentence and pointed out that it contradicted the claim it was supposed to support.

This is the kind of regression that survives a code review, a manual QA pass on three sample documents, and a feature-flag rollout. None of those checks was wrong in isolation. They all asked the same question (does a citation appear where one is expected?), and none of them asked the auditor's question: does the citation point at the sentence the claim came from? The gap between those two questions is where the off-by-one lived for as long as it lived.

What makes this failure mode worth a separate write-up is not the bug itself. Off-by-one errors are old news. The interesting part is that the failure was produced by two systems that continued to agree on the structure of an integer while silently disagreeing about what the integer meant.
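To make the disagreement concrete, here is a minimal sketch. The excerpt does not say which convention each side used, so this assumes one plausible cause: a chunker that labels paragraphs with a 0-based index and a citation resolver that reads the same integer as 1-based. The names `label_chunks`, `resolve_citation`, `citation_present`, and `citation_supports` are hypothetical, as is the sample text.

```python
import re

# Hypothetical three-paragraph document.
paragraphs = [
    "Scope of the policy.",
    "Records must be retained for three years.",
    "Records must be retained for seven years in regulated accounts.",
]

def label_chunks(paras):
    # Assumed chunker behaviour: prefix each paragraph with a 0-based index.
    return [f"[line {i}] {p}" for i, p in enumerate(paras)]

def resolve_citation(paras, n):
    # Assumed resolver behaviour: treat the cited integer as 1-based.
    return paras[n - 1]

def citation_present(answer):
    # The question the eval asked: is there a citation at all?
    return re.search(r"\[line \d+\]", answer) is not None

def citation_supports(paras, answer, evidence):
    # The question the auditor asked: does the citation resolve to the evidence?
    n = int(re.search(r"\[line (\d+)\]", answer).group(1))
    return resolve_citation(paras, n) == evidence

chunks = label_chunks(paragraphs)
# The model reads "[line 2] Records must be retained for seven years ..." and cites it.
answer = "Regulated accounts keep records for seven years [line 2]."

print(citation_present(answer))                          # True
print(citation_supports(paragraphs, answer, paragraphs[2]))  # False
print(resolve_citation(paragraphs, 2))                   # the three-year paragraph
```

Both sides agree that the citation is a non-negative integer in square brackets, which is why a presence check passes. Only a check that resolves the citation and compares it with the evidence exposes the shift.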

The Chunk Boundary That Bisected the Sentence Your Answer Depended On

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline chunks documents into 512-token spans with 50-token overlap. It is a clean industry default. Somewhere in your corpus there is a sentence — "Refunds are processed within five business days unless the order originated from the EU region, in which case the regulatory window is fourteen days" — that landed across a chunk boundary. Chunk N contains the first half. Chunk N+1 contains the second.

A user asks "how long do EU refunds take." Retrieval scores chunk N highest because the query embedding aligns with "EU region" in the first fragment. Chunk N+1, which contains the only actual answer, ranks too low to be retrieved alongside it. The agent answers "five business days" with a confident citation to chunk N. The customer is in Frankfurt. The answer is wrong. The pipeline behaved exactly as designed.

This is the failure mode that does not show up in your chunk-quality eval. The chunks are well-formed. The corpus is well-formed. The embedding model is well-formed. The boundaries between chunks — the lines you drew through your own documents — are where the answer lives.
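One way to look for the kind of split described above is to check whether each sentence appears whole in at least one chunk. The sketch below assumes a plain sliding window over token positions with the 512/50 settings from the excerpt, and assumes sentence boundaries are already known as token offsets. `chunk_spans` and `sentences_split_by_chunks` are hypothetical helpers, not part of any particular library, and real splitters may apply overlap differently.

```python
def chunk_spans(n_tokens, size=512, overlap=50):
    """Return half-open (start, end) token spans for a sliding window."""
    step = size - overlap
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += step
    return spans

def sentences_split_by_chunks(spans, sentence_spans):
    """Return the sentences that are not fully contained in any chunk."""
    return [
        (s, e)
        for (s, e) in sentence_spans
        if not any(cs <= s and e <= ce for cs, ce in spans)
    ]

spans = chunk_spans(1000)
print(spans)  # [(0, 512), (462, 974), (924, 1000)]

# Hypothetical sentence offsets: the first is 70 tokens long and straddles
# the boundary at token 512; the second is 50 tokens and fits in chunk two.
sentences = [(450, 520), (480, 530)]
print(sentences_split_by_chunks(spans, sentences))  # [(450, 520)]
```

In this window scheme, a sentence longer than the overlap that straddles a boundary appears whole in no chunk. A check like this runs at index time, before any query is issued.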

Chunking for Agents vs. RAG: Why One Strategy Breaks Both

· 9 min read
Tian Pan
Software Engineer

Most teams pick a chunk size, tune it for retrieval quality, and call it done. Then they build an agent on the same index and wonder why the agent fails in strange ways — it executes half a workflow, ignores conditional logic, or confidently acts on incomplete instructions. The chunk size that maximized your NDCG score is exactly what's making your agent unreliable.

RAG retrieval and agent execution are not the same problem. They have different goals, different failure modes, and fundamentally different definitions of what a "good chunk" looks like. When you optimize chunking for one, you systematically degrade the other. Most teams don't realize this until they've already built on the wrong foundation.

Corpus Architecture for RAG: The Indexing Decisions That Determine Quality Before Retrieval Starts

· 12 min read
Tian Pan
Software Engineer

When a RAG system returns the wrong answer, the post-mortem almost always focuses on the same suspects: the retrieval query, the similarity threshold, the reranker, the prompt. Teams spend days tuning these components while the actual cause sits untouched in the indexing pipeline. The failure happened weeks ago when someone decided on a chunk size.

Most RAG quality problems are architectural, not operational. They stem from decisions made at index time that silently shape what the LLM will ever be allowed to see. By the time a user complains, the retrieval system is doing exactly what it was designed to do — it's just that the design was wrong.