3 posts tagged with "quality"

The AI Bystander Effect: Why Five-Team Launches Ship Eval Suites Nobody Watches

· 10 min read
Tian Pan
Software Engineer

In 1964, thirty-eight people watched Kitty Genovese being attacked outside their apartment building in Queens. None of them called the police until it was too late. Latané and Darley spent the next decade explaining why: the more people who can see a problem, the less likely any single one of them is to act. They called it diffusion of responsibility. In their famous seizure experiment, 85% of participants intervened when they thought they were alone with the victim. When they believed four others could also hear the seizure, only 31% did.

Now picture your last AI feature launch. Product wrote the prompt. Engineering picked the model and wired the gateway. The data team curated the retrieval corpus. Safety bolted on the input and output filters. Customer support drafted the escalation playbook. Five teams in the room. Each one shipped its piece on time. Three months in, the feature's accuracy has quietly slid from 89% to 71%, the eval suite has not been run since launch week, and when you ask who owns the regression, every team can name three other teams that own it more.
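One way to keep an eval suite from going unwatched is to run it on a schedule with a single named owner and a hard accuracy threshold, so a regression pages one person instead of waiting for five teams to notice. A minimal sketch of that gate, where `GOLDEN_SET`, the owner address, and the 0.85 threshold are all illustrative, not drawn from the post:

```python
# Minimal sketch of a scheduled eval gate with exactly one owner.
# GOLDEN_SET, OWNER, and THRESHOLD are hypothetical stand-ins for
# a real eval suite's dataset, on-call alias, and release bar.

GOLDEN_SET = [
    ("What is our refund window?", "30 days"),
    ("Do you ship to Canada?", "yes"),
]

OWNER = "team-evals@example.com"  # one recipient, not five teams
THRESHOLD = 0.85

def evaluate(ask):
    """Score the model under test (ask: question -> answer string)."""
    correct = sum(1 for q, expected in GOLDEN_SET if expected in ask(q))
    return correct / len(GOLDEN_SET)

def eval_gate(ask, alert):
    """Run the suite; page the owner if accuracy falls below the bar."""
    score = evaluate(ask)
    if score < THRESHOLD:
        alert(OWNER, f"eval accuracy {score:.0%} is below {THRESHOLD:.0%}")
    return score
```

Run it nightly from cron or CI; the design choice that matters is that failure has exactly one recipient, which is what keeps diffusion of responsibility out of the loop.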

The Rubber-Stamp Collapse: Why AI-Authored PRs Are Hollowing Out Code Review

· 10 min read
Tian Pan
Software Engineer

A senior engineer approves a 400-line PR in four minutes. The diff is clean. Names are sensible. Tests pass. Two weeks later the on-call engineer is paging through a query that returns the right shape of rows but from the wrong column — user.updated_at where user.created_at was meant — and the cohort analysis dashboard has been quietly lying to the CFO for nine days. The reviewer was competent. The code was well-structured. The bug was invisible in the diff because it wasn't a syntactic smell. It was a semantic one, and the reviewer had nothing to anchor against because no one had written down what the change was supposed to do.

This is the failure mode that shows up once the majority of diffs in your repo start life as model output. Reviewers stop asking "is this correct?" and start asking "does this look like code?" The answer is almost always yes. AI-authored code is grammatically fluent in a way that bypasses the review heuristics engineers spent a decade sharpening on human-written slop.
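The updated_at/created_at failure is easy to reproduce in miniature. In the hypothetical sketch below (the schema and records are illustrative), the buggy and correct versions differ by a single identifier, both look like perfectly reasonable code in a diff, and only a statement of intent, cohorts key off signup date, tells a reviewer which one the author meant:

```python
from datetime import date

# Hypothetical user records; the schema is illustrative.
users = [
    {"id": 1, "created_at": date(2024, 1, 5), "updated_at": date(2024, 6, 1)},
    {"id": 2, "created_at": date(2024, 3, 9), "updated_at": date(2024, 3, 9)},
]

def signup_cohort_buggy(month):
    # Looks plausible in review: right shape of rows, wrong column.
    return [u["id"] for u in users if u["updated_at"].month == month]

def signup_cohort(month):
    # The intent: cohort membership is determined by signup date.
    return [u["id"] for u in users if u["created_at"].month == month]
```

Here `signup_cohort_buggy(1)` returns an empty list while `signup_cohort(1)` returns `[1]`, yet nothing in the diff flags the difference; the anchor has to come from a written spec or a test, not from the reviewer's syntactic instincts.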

Patrick McKenzie: Why is Stripe's Engineering Quality So High?

· 2 min read

You need enough chips to play the game: hire enough high-caliber engineers who are smart and who genuinely care about quality. Then repeatedly reinforce a culture that values quality, with formal routines for reviewing large pieces of work and fixing what needs fixing.

Tactically, one best practice stands out: reduce the difficulty of doing the right thing. The Stripe tech team makes deliberate trade-offs so that any engineer can improve any part of the system, which encourages a sense of ownership.

There are dedicated internal tools that check the level of internationalization. That may seem tedious, but it is worth the time, and it comes back to culture: when an individual contributor says, "I spent some time on i18n last week," they should be able to assume leadership values this enough to respond, "Of course you took the time to do this. Great job."

"Open a ticket for the relevant team and someone will handle it" is a good practice, but only if the system actually resolves tickets quickly and well; a responsive pipeline is what motivates people to keep opening tickets.

The company provides dedicated channels, such as mailing list aliases, for reporting product quality bugs. Dedicated teams triage these reports or route them to the appropriate groups for fixing, and established routines keep the entire company informed of the bug fix rate.

Before making significant API changes, test both internally and externally. Regularly ask, "Who has a real Stripe account on hand? Can we upgrade to the beta version and try it out?" People need to set aside dedicated time for this and document it thoroughly. Think of it as maintaining a group of picky in-house customers: you may never use the product as deeply and broadly as real users do, but it beats guessing.

Hearing someone surface "a piece of payment code hasn't been touched in five years, I don't know how it works, and there are no tests" is rare, but when it happens it is valuable for the engineering team.

None of the above is high-tech, and none of it is sufficient on its own to guarantee quality. But Stripe never settles for its current level of quality: rather than passively declaring "our standards are high," the team keeps proactively, continuously improving.