
12 posts tagged with "context-engineering"


Agentic Engineering: Build Your Own Software Pokémon Army

· 13 min read
Tian Pan
Software Engineer

How one person replaced a 15-person engineering team with autonomous AI agents — and the spectacular failures along the way.

This material was prepared for CIVE 7397 Guest Lecture at the University of Houston.

I didn't study CS in college. I was a management major in Beijing. Somehow I ended up at Yale for a CS master's, then at Uber building systems for 90 million users, then at Brex and Airbnb, and eventually started my own company.

I'm telling you this because the rules of who can build software are being rewritten right now — and your background might be more of an advantage than you think.

Act I: The Solo Grind

150 Lines Per Day Is the Ceiling

Every engineer starts the same way. Blank editor. Blinking cursor. A ticket that says "Build a subscription billing system."

A senior engineer — someone with ten years of experience — produces about 100 to 150 lines of production code per day. The rest is meetings, code reviews, debugging, context-switching. That's the ceiling.

The "10x engineer" was the myth we all chased. But even a 10x engineer was still one person. Productivity scaled linearly with headcount. Want to ship faster? Hire more people — each one takes three to six months to onboard.

And the worst part? Knowledge lived in people's heads. Why was that system designed that way? Ask Chen. Oh, Chen left. Good luck.

The Real Bottleneck: Brain Bandwidth

At Uber, the hardest part of any task was never writing the code. It was the research phase — figuring out where and what to change.

When the codebase is massive, the docs are gone, and the previous owner quit, you spend 80% of your time building a mental model of someone else's system. The bottleneck was always people — their availability, their context window, their bus factor. Not compute. Not ideas.

And then something showed up at the workshop door.

Copilot, Cursor, and the Rare Candy Effect

You discover Copilot. Then Cursor. Then Windsurf. Press Tab and entire functions materialize. It's like someone handed you a Rare Candy after years of manual grinding.

The gains are real — we have field studies now:

  • Microsoft & Accenture ran a randomized trial across 4,000 developers: 26% more merged PRs.
  • Cognition's Devin completes file migrations 10x faster than humans.
  • Junior developers saw +35% productivity gains; seniors got +8 to 16%.

But even with these gains, the ceiling is still you. You're faster at cutting wood, but you haven't built a factory. You're still the one reading specs, making decisions, debugging at 2am.

Rare Candy buffs you. It doesn't give you a Pokémon. And the only way to break through the ceiling is to remove yourself from the production line entirely.

Act II: Catching Your First Pokémon

From Typing Code to Writing Specs

This is the moment everything changes — and it's deceptively simple.

You write a spec. Not code — a spec. Acceptance criteria, constraints, edge cases. You hand it to an autonomous agent like Claude Code. You walk away.

The agent reads your codebase, plans its approach, writes code, runs tests, reads the errors, fixes them, loops. You come back to a pull request. You just caught your first Pokémon.

This is fundamentally different from Cursor or Copilot. Those are power tools — they boost your output. An autonomous agent is a separate worker. The critical skill shifts from prompt engineering to context engineering: designing the world your Pokémon operates in.

My Non-Negotiable Workflow

I always start in Plan Mode. The agent analyzes the codebase and proposes an approach. I review the plan, adjust it, then say "execute."

One rule I never break: "You debug it yourself. I only want results." The agent has to curl the API, read the logs, and write tests to prove its own work. If it can't verify itself, the spec isn't good enough.

Why Context Engineering Beats Prompt Engineering

You've caught your first Pokémon. How do you make it good?

Anthropic's own guidance says the quality of an agent depends less on the model itself and more on how its context is structured and managed. The model is the engine. The context — specs, codebase structure, feedback signals — is the skill book. What you teach it determines how well it fights.

Three inputs matter:

  • Specs. Write clear specifications with acceptance criteria before the agent writes a single line of code. A vague spec gets vague code. A precise spec gets working software.
  • Codebase. Structure your repo so the agent can navigate it — clear file naming, clean module boundaries, up-to-date docs. The agent reads your code the same way a new hire would on day one. If a new hire would be lost, your agent will be lost.
  • Feedback signals. Tests, type checkers, linters. Without feedback, your Pokémon will confidently produce garbage and tell you everything's fine. We've all had coworkers like that.
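The feedback-signal idea can be sketched as a simple gate runner. This is an illustrative Python stub, not a real CI setup — in practice each gate would shell out to your actual test runner, type checker, and linter:

```python
from typing import Callable

def run_gates(gates: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every gate and return the names of the ones that failed."""
    return [name for name, check in gates.items() if not check()]

# Stub checks standing in for real tools.
failures = run_gates({
    "tests": lambda: True,   # e.g. `pytest` exited 0
    "types": lambda: False,  # e.g. `mypy` found an error
    "lint":  lambda: True,   # e.g. `ruff check` passed
})
print(failures)  # ['types']
```

The point of the shape: the agent doesn't get a vague "looks wrong" — it gets a named list of failed gates it can act on.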

Defects at Scale: Building the Inspection Line

Your Pokémon wrote code. It compiles. You feel great.

Then you run the tests. Half fail. The agent hallucinated an API endpoint that doesn't exist, used a deprecated library, and introduced a subtle race condition.

This is the central challenge: a Pokémon without quality control manufactures defects at scale. The most important thing you build is not the production system — it's the inspection line.

The agent operates in a tight loop: write → test → fail → read error → fix → repeat, until every check passes green. The magic isn't perfect output on the first try; the agent almost never gets that. The magic is that the feedback loop runs in seconds, not hours.

My inspection line in practice:

  • Backend: the agent curls the actual API and verifies responses.
  • Frontend: Playwright MCP — the agent opens a real browser, navigates the UI, clicks buttons, and verifies rendered output.
  • Every task: the agent writes its own tests as a deliverable.

The teams getting real value from agents aren't the ones with the best models. They're the ones with the tightest inspection lines.

From One Pokémon to a Full Party

One Pokémon handles one bounded task. Real software projects have many moving parts. You need a party — and for a party to work, you need shared tooling and a shared playbook.

MCP (Model Context Protocol) is the item bag. Any Pokémon can reach in and grab any tool, any API, any data source. It gives your agents hands.

CLAUDE.md and custom skills are the trainer's manual. Custom slash commands — /today, /blog, /ci — encode repeatable combo moves. CLAUDE.md is the rulebook every agent reads on startup: same context, same standards, no babysitting required.
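A minimal CLAUDE.md might look like this. The contents are illustrative — your rulebook will encode your own commands and boundaries:

```markdown
# CLAUDE.md

## Commands
- Install dependencies: `yarn install` (never npm)
- Run tests: `yarn test` — must pass before any PR

## Rules
- Never edit files under `src/generated/`
- Every task ships with its own tests as a deliverable
- Debug it yourself: curl the API and read the logs before reporting done
```

Because every agent reads this on startup, a correction you make once becomes a rule every future session follows.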

As Anthropic advises: find the simplest solution possible, and only increase complexity when needed.

Your party is assembled. Everything is running. It looks beautiful on the whiteboard. Then it breaks.

The Abyss: When Everything Breaks

The Silent Failure That Shipped

The most dangerous failure isn't the loud one — it's the silent one.

I had a coding agent make changes that passed all existing tests, looked correct in review, and shipped. Days later, I discovered it had broken a subtle invariant that no test covered. No error logs. No crash. Just wrong behavior that took days to trace back to the agent's commit.

That's the nightmare scenario: a Pokémon that produces defective work that passes inspection. Your inspection line has blind spots, and the agent will find every single one.

The Research Confirms It

This isn't just my experience. A NeurIPS 2025 study analyzed 1,600 execution traces across seven multi-agent frameworks and found:

  • Failure rates of 41% to 87% across frameworks.
  • 14 distinct failure modes identified.
  • Coordination breakdowns were the #1 category at 36.9% of all failures — agents losing context during handoffs, contradicting each other, going in circles.

Why Adding More Agents Makes It Worse

Your instinct after a wipeout: "I need more agents." That instinct is wrong.

Google DeepMind and MIT tested this rigorously — 180 configurations, 5 architectures, 3 model families:

  • A centralized orchestrator improved performance by 80.9% on parallelizable tasks.
  • But all multi-agent setups degraded performance by 39–70% on sequential work.
  • Gains plateau at 4 agents. Beyond that, you're paying coordination tax with no return.
  • Uncoordinated agents amplify errors 17.2x. Even with a coordinator: 4.4x.

The lesson: don't add Pokémon. Add the right Pokémon.

Act III: Rebuilding Smarter

Four Principles That Survived Every Explosion

The naive optimism is gone. In its place: hard-won knowledge.

The SWE-Bench leaderboard tracks 80 unique approaches to agentic coding, and no single architecture consistently wins. But four principles held up:

  1. Inspection over production. Your team wiped because unchecked errors cascaded. The fix isn't stronger Pokémon — it's better inspection gates.
  2. Context beats model. Agents didn't fail because models were weak. They failed because they lacked context. Better skill books beat better engines every time.
  3. Start with one. Gains plateau at four agents (per DeepMind/MIT). Start simple. Add agents only when forced to.
  4. Co-learn with AI. Don't just assign tasks — ask agents to audit your codebase, research best practices, and update CLAUDE.md. Every conversation makes the next one better.

A practical note on costs: you don't need a fortune to start. Claude.ai free tier, GitHub Copilot student plan, and Cursor free tier get you surprisingly far. I run my entire operation on multiple $200/mo subscriptions with a CLI-to-API proxy — roughly 1/7 to 1/10 the cost of raw API calls.

What One Person's Gym Actually Looks Like

This is not a metaphor. This is my literal setup today:

  • 10 Claude Code agents running in parallel across 4 Macs and 6 screens.
  • 5 agent writers producing SEO content 24/7 through an automated yarn blog loop.
  • 1 person running a startup that would have needed 10–15 people two years ago.

Here's how a typical day works:

  • Morning: I run /today. An agent reviews my TODO.md, checks what's in progress, and proposes priorities.
  • Workday: I dispatch tasks to 10 coding agents, each with a bounded spec. While they work, I review PRs and make architecture decisions.
  • Background: Five agent writers run continuously — writing, editing, publishing. I review during breaks.
  • Bug fixes: GitHub Copilot handles small, bounded tasks — quick fixes, adding test coverage.
  • Every six months: Roadmap and OKR planning — irreducibly human, but even that I do with Claude, Gemini, and ChatGPT to reach a quorum.

Six Rules for Training the Army

Two years of running this system gave me six rules. All from painful experience:

  1. "You debug it yourself." The agent curls the API, searches logs, writes tests. If it can't self-verify, the spec needs work.
  2. Tokens consumed = efficiency. The only metric: how many agents can I keep busy simultaneously? Idle agents are wasted capacity.
  3. Work without supervision. The best agents don't wait for assignments. Cron jobs. Infinite task loops. See something that needs doing? Do it.
  4. Architecture = freedom to fail. Good architecture contains the blast radius. Agents can experiment but can't break what matters.
  5. Measurable, improvable, composable. If you can't measure a capability, you can't improve it. Everything should be testable and combinable.
  6. Use agents for everything. Not just code — content, video, social media, customer support, calendar. Then: build tools for agents, not just for humans.

What Makes a Gym Leader

The DORA Gap: Individual Gains, Zero Organizational Improvement

Here's the uncomfortable truth. The DORA 2025 Report — Google's annual study of software delivery — found that while 80% of individual developers report AI productivity gains, organizational delivery metrics show no improvement. AI amplifies existing quality. The Pokémon doesn't fix the strategy.

The Pokémon handles commodity work: boilerplate, tests, spec-to-code translation, docs, well-defined bugs. That stuff is getting cheap fast.

The trainer handles the hard stuff: defining what to build and why. Designing testable systems. Writing specs worth translating. Making architecture decisions under uncertainty.

The Four Skills That Won't Get Automated

  • Context engineering — designing the skill books your Pokémon learn from.
  • Evaluation design — building the inspection line. If you can't evaluate output, you can't run a gym.
  • Systems thinking — understanding where defects cascade. Pokémon do local optimization; trainers do global coherence.
  • Product taste — when anyone can build anything, the question becomes what's worth building.

Why Non-CS Backgrounds Have an Edge

People with CS backgrounds tend to be conservative at the edges of what agents can do. They know too much about what should be hard, so they self-censor. "There's no way the agent can handle distributed transactions." They never ask.

People without CS backgrounds use their imagination. They say "what if I just told it to do this?" and discover it works far more often than experts expected. They push boundaries because they don't know where the boundaries are.

That was me. I didn't know what was "supposed" to be hard, so I tried everything. That's how I built a system that people with ten years more experience hadn't attempted.

The Paradigm Shift: Three Pillars

Everything in this post points to something bigger — a fundamental shift in how software gets built.

Using AI as "fancy autocomplete" is like bolting an electric motor onto a steam engine. You get a little more power, but you're stuck with the old architecture. The real revolution is tearing the steam engine out entirely.

Pillar 1: AI-first design. Stop asking "how can AI help my workflow?" Start asking "what obstacles can I remove so AI can do the work?" This mindset separates trainers who get 2x gains from those who get 100x.

Pillar 2: Closed-loop iteration. Remove humans from the execution loop. Let AI iterate autonomously with full environment access. Extending reliable autonomy from minutes to hours is the trillion-dollar question — every improvement unlocks exponential gains in what one person can build.

Pillar 3: Harness engineering. Humans define boundaries. Decouple architecture into minimal components. Use multi-agent cross-validation. You're not writing code — you're designing the harness that keeps the system honest.

Your First Quest

You started as a solo grinder — just you and a blinking cursor. You got Rare Candy and things got faster, but the ceiling was still you. You caught your first Pokémon, learned context engineering, built an inspection line, assembled a party — and watched it wipe spectacularly.

Then you rebuilt. Smarter. With constraints. With hard-won principles.

The Pokémon will keep getting stronger — new models, new protocols, new frameworks every quarter. But the trainer who designs the system, who decides what to build, how to inspect it, and when to ship it — that person doesn't get automated away.

That person can be you.

Tonight: pick one project. Write a one-page spec. Hand it to Claude Code. Review what comes back.

You just caught your first Pokémon.

Six Context Engineering Techniques That Make Manus Work in Production

· 11 min read
Tian Pan
Software Engineer

The Manus team rebuilt their agent framework four times in less than a year. Not because of model changes — the underlying LLMs improved steadily. They rebuilt because they kept discovering better ways to shape what goes into the context window.

They called this process "Stochastic Graduate Descent": manual architecture searching, prompt fiddling, and empirical guesswork. Honest language for what building production agents actually looks like. After millions of real user sessions, they've settled on six concrete techniques that determine whether a long-horizon agent succeeds or spirals into incoherence.

The unifying insight is simple to state and hard to internalize: "Context engineering is the delicate art and science of filling the context window with just the right information for the next step." A typical Manus task runs ~50 tool calls with a 100:1 input-to-output token ratio. At that scale, what you put in the context — and how you put it there — determines everything.

The Action Space Problem: Why Giving Your AI Agent More Tools Makes It Worse

· 9 min read
Tian Pan
Software Engineer

There's a counterintuitive failure mode that most teams encounter when scaling AI agents: the more capable you make the agent's toolset, the worse it performs. You add tools to handle more cases. Accuracy drops. You add better tools. It gets slower and starts picking the wrong ones. You add orchestration to manage the tool selection. Now you've rebuilt complexity on top of the original complexity, and the thing barely works.

The instinct to add is wrong. The performance gains in production agents come from removing things.

Four Strategies for Engineering Agent Context That Actually Scales

· 8 min read
Tian Pan
Software Engineer

There's a failure mode in production agents that most engineers discover the hard way: your agent works well on the first few steps, then starts hallucinating halfway through a task, misses details it was explicitly given at the start, or issues a tool call that contradicts instructions it received twenty steps ago. The model didn't change. The task didn't get harder. The context did.

Long-running agents accumulate history the way browser tabs accumulate memory — silently, relentlessly, until something breaks. Every tool response, observation, and intermediate reasoning trace gets appended to the window. The model sees all of it, which means it has to reason through all of it on every subsequent step. As context grows, precision drops, reasoning weakens, and the model misses information it should catch. This is context rot, and it's one of the most common failure modes in production agents.
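One common mitigation is compaction: keep the newest turns verbatim and collapse everything older into a summary. This is a schematic Python sketch — the message format and the placeholder summarizer are assumptions, and a production system would call the model to write the summary:

```python
def compact_history(messages, keep_recent=4, summarize=None):
    """Keep the newest turns verbatim; collapse everything older
    into a single summary message so the window stops growing."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summarize = summarize or (
        lambda msgs: f"[summary of {len(msgs)} earlier messages]"
    )
    return [{"role": "system", "content": summarize(old)}] + recent

history = [{"role": "user", "content": f"step {i}"} for i in range(10)]
compacted = compact_history(history)
print(len(compacted))  # 5: one summary + four recent turns
```

The window size is now bounded by `keep_recent` plus one summary, no matter how long the task runs.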

Context Engineering: Memory, Compaction, and Tool Clearing for Production Agents

· 10 min read
Tian Pan
Software Engineer

Most production AI agent failures don't happen because the model ran out of context. They happen because the model drifted long before it hit the limit. Forrester has named "agent drift" the silent killer of AI-accelerated development, and its 2025 research shows that nearly 65% of enterprise AI failures trace back to context drift or memory loss during multi-step reasoning, not raw token exhaustion.

The distinction matters. A hard context limit is clean: the API rejects the request, the agent stops, you get an error you can handle. Context rot is insidious: the model keeps running, keeps generating output, but performance quietly degrades. GPT-4's accuracy drops from 98.1% to 64.1% based solely on where in the context window information is positioned. You don't get an error signal — you get subtly wrong answers.

This post covers the three primary tools for managing context in production agents — compaction, tool-result clearing, and external memory — along with the practical strategies for applying them before your agent drifts.
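Tool-result clearing, the second of those three, can be sketched like this. The message shapes are illustrative; the idea is that stale tool outputs are usually the bulkiest and least-needed part of the history:

```python
def clear_tool_results(messages, keep_last=2,
                       placeholder="[tool output cleared]"):
    """Replace all but the most recent tool results with a stub,
    preserving the conversation structure while reclaiming tokens."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    return [
        {**m, "content": placeholder} if i in stale else m
        for i, m in enumerate(messages)
    ]

msgs = [
    {"role": "tool", "content": "500 rows of SQL output"},
    {"role": "assistant", "content": "Looks good."},
    {"role": "tool", "content": "giant JSON blob"},
    {"role": "tool", "content": "recent log tail"},
]
cleared = clear_tool_results(msgs)
print(cleared[0]["content"])  # [tool output cleared]
```

The agent still sees that a tool was called — the structure survives — but the dead weight of old outputs is gone.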

CLAUDE.md and AGENTS.md: The Configuration Layer That Makes AI Coding Agents Actually Follow Your Rules

· 9 min read
Tian Pan
Software Engineer

Your AI coding agent doesn't remember yesterday. Every session starts cold — it doesn't know you use yarn not npm, that you avoid any types, or that the src/generated/ directory is sacred and should never be edited by hand. So it generates code with the wrong package manager, introduces any where you've banned it, and occasionally overwrites generated files you'll spend an hour recovering. You correct it. Tomorrow it makes the same mistake. You correct it again.

This is not a model quality problem. It's a configuration problem — and the fix is a plain Markdown file.

CLAUDE.md, AGENTS.md, and their tool-specific cousins are the briefing documents AI coding agents read before every session. They encode what the agent would otherwise have to rediscover or be corrected on: which commands to run, which patterns to avoid, how your team's workflow is structured, and which directories are off-limits. They're the equivalent of a thorough engineering onboarding document, compressed into a form optimized for machine consumption.

Effective Context Engineering for AI Agents

· 11 min read
Tian Pan
Software Engineer

Nearly 65% of enterprise AI failures in 2025 traced back to context drift or memory loss during multi-step reasoning — not model capability issues. If your agent is making poor decisions or losing coherence across a long task, the most likely cause is not the model. It is what is sitting in the context window.

The term "context engineering" is proliferating fast, but the underlying discipline is concrete: active, deliberate management of what enters and exits the LLM's context window at every inference step in an agent's trajectory. Not a prompt. A dynamic information architecture that the engineer designs and the agent traverses. The context window functions as RAM — finite, expensive, and subject to thrashing if you don't manage it deliberately.

Harness Engineering: The Discipline That Determines Whether Your AI Agents Actually Work

· 10 min read
Tian Pan
Software Engineer

Most teams running AI coding agents are optimizing the wrong variable. They obsess over model selection — Claude vs. GPT vs. Gemini — while treating the surrounding scaffolding as incidental plumbing. But benchmark data and production war stories tell a different story: the gap between a model that impresses in a demo and one that ships production code reliably comes almost entirely from the harness around it, not the model itself.

The formula is deceptively simple: Agent = Model + Harness. The harness is everything else — tool schemas, permission models, context lifecycle management, feedback loops, sandboxing, documentation infrastructure, architectural invariants. Get the harness wrong and even a frontier model produces hallucinated file paths, breaks its own conventions twenty turns into a session, and declares a feature done before writing a single test.

Context Engineering: The Discipline That Matters More Than Prompting

· 9 min read
Tian Pan
Software Engineer

Most engineers building LLM systems spend the first few weeks obsessing over their prompts. They A/B test phrasing, argue about whether to use XML tags or JSON, and iterate on system prompt wording until the model outputs something that looks right. Then they hit production, add real data, memory, and tool calls — and the model starts misbehaving in ways that no amount of prompt tuning can fix. The problem was never the prompt.

The real bottleneck in production LLM systems is context — what information is present in the model's input, in what order, how much of it there is, and whether it's relevant to the decision the model is about to make. Context engineering is the discipline of designing and managing that input space as a first-class system concern. It subsumes prompt engineering the same way software architecture subsumes variable naming: the smaller skill still matters, but it doesn't drive outcomes at scale.

Context Engineering: The Invisible Architecture of Production AI Agents

· 10 min read
Tian Pan
Software Engineer

Most AI agent bugs are not model bugs. The model is doing exactly what it's told—it's what you're putting into the context that's broken. After a certain point in an agent's execution, the problem isn't capability. It's entropy: the slow accumulation of noise, redundancy, and misaligned attention that degrades every output the model produces. Researchers call this context rot, and every major model—GPT-4.1, Claude Opus 4, Gemini 2.5—exhibits it, at every input length increment, without exception.

Context engineering is the discipline of managing this problem deliberately. It's broader than prompt engineering, which is mostly about the static system prompt. Context engineering covers everything the model sees at inference time: what you include, what you exclude, what you compress, where you position things, and how you preserve cache state across a long-running task.

Why Your AI Agent Wastes Most of Its Context Window on Tools

· 10 min read
Tian Pan
Software Engineer

You connect your agent to 50 MCP tools. It can query databases, call APIs, read files, send emails, browse the web. On paper, it has everything it needs. In practice, half your production incidents trace back to tool use—wrong parameters, blown context budgets, cascading retry loops that cost ten times what you expected.

Here's the part most tutorials skip: every tool definition you load is a token tax paid upfront, before the agent processes a single user message. With 50+ tools connected, definitions alone can consume 70,000–130,000 tokens per request. That's not a corner case—it's the default state of any agent connected to multiple MCP servers.
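One response to that token tax is to load only the definitions relevant to the current task. This is a hedged sketch — the chars-per-token heuristic and keyword matching are crude stand-ins for a real tokenizer and a real relevance model:

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English/JSON.
    return len(text) // 4

def select_tools(tools, query_keywords, budget_tokens=2000):
    """Load only tool definitions relevant to the task, stopping
    when the token budget is spent."""
    relevant = [
        t for t in tools
        if any(k in t["description"].lower() for k in query_keywords)
    ]
    chosen, spent = [], 0
    for tool in relevant:
        cost = estimate_tokens(tool["definition"])
        if spent + cost > budget_tokens:
            break
        chosen.append(tool)
        spent += cost
    return chosen

tools = [
    {"name": "sql_query", "description": "Query the database",
     "definition": "x" * 400},
    {"name": "send_email", "description": "Send an email",
     "definition": "x" * 400},
]
picked = select_tools(tools, ["database"])
print([t["name"] for t in picked])  # ['sql_query']
```

Even this naive filter changes the default: the agent starts each request paying for the tools it might use, not for all fifty.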

Context Engineering for Personalization: How to Build Long-Term Memory Into AI Agents

· 8 min read
Tian Pan
Software Engineer

Most agent demos are stateless. A user asks a question, the agent answers, the session ends — and the next conversation starts from scratch. That's fine for a calculator. It's not fine for an assistant that's supposed to know you.

The gap between a useful agent and a frustrating one often comes down to one thing: whether the system remembers what matters. This post breaks down how to architect durable, personalized memory into production AI agents — covering the four-phase lifecycle, layered precedence rules, and the specific failure modes that will bite you if you skip the engineering.
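The layered-precedence idea can be sketched as an ordered merge. The layer names here are hypothetical — the post's actual layering may differ — but the mechanism is the same: more specific memory overrides more general memory:

```python
# Hypothetical precedence: session facts override the user profile,
# which overrides global defaults.
LAYERS = ["global", "user", "session"]  # lowest to highest precedence

def resolve(memory: dict) -> dict:
    """Merge memory layers so higher-precedence keys win."""
    merged = {}
    for layer in LAYERS:
        merged.update(memory.get(layer, {}))
    return merged

memory = {
    "global": {"tone": "formal", "units": "metric"},
    "user": {"tone": "casual"},
    "session": {"units": "imperial"},
}
print(resolve(memory))  # {'tone': 'casual', 'units': 'imperial'}
```

Making the precedence explicit is what keeps a remembered preference from silently overriding a fact the user just stated this session.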