Stripe published a post about Minions back in February — their internal system of one-shot coding agents. The numbers: 1,000+ PRs merged per week, ~80% first-pass success rate, 400+ MCP tools in a shared toolshed.
We've been building something similar with NRNS, so naturally we read it closely. Some of their decisions match ours. Others pushed us to think differently. Here's the breakdown.
Where we agree
Don't let LLMs orchestrate
This is the big one. Stripe uses deterministic code for routing and orchestration. The LLM does the creative work, writing code and making judgment calls, but plain code decides which agent gets which task. We do the same: TypeScript routes tasks based on task type, required skills, and agent availability.
It's tempting to throw an LLM at the coordination problem. Don't. Deterministic orchestration is debuggable and predictable. You can trace exactly why a task went to a specific agent without asking a model to explain itself.
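To make that concrete, here's a minimal sketch of what deterministic routing can look like. The Task and Agent shapes and the matching rules are illustrative, not NRNS's actual schema:

```typescript
// Illustrative types -- not NRNS's real data model.
type TaskType = "feature" | "bugfix" | "refactor";

interface Task {
  id: string;
  type: TaskType;
  requiredSkills: string[]; // e.g. ["typescript", "react"]
}

interface Agent {
  id: string;
  skills: Set<string>;
  busy: boolean;
}

// Plain code decides which agent gets which task. No LLM involved,
// so every routing decision is reproducible and easy to trace.
function routeTask(task: Task, agents: Agent[]): Agent | undefined {
  const candidates = agents.filter(
    (a) => !a.busy && task.requiredSkills.every((s) => a.skills.has(s))
  );
  // Deterministic tie-break: lowest id wins, so reruns give the same answer.
  candidates.sort((a, b) => a.id.localeCompare(b.id));
  return candidates[0];
}
```

Because routing is a pure function of task and agent state, you can log the inputs and replay any decision later.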
Gate completion on real checks
Stripe runs layered verification: lint, type-check, CI, then human review. A PR isn't "done" until all layers pass. We do the same — before NRNS marks a task complete, it has to pass type-checking, linting, and affected tests. Failures move the task to "needs review" with the errors attached. By the time you look at it, the mechanical stuff is handled.
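A sketch of what that gate can look like, assuming a Node project; the exact commands (tsc, eslint, vitest) stand in for whatever checks the repo actually runs:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

type GateResult =
  | { status: "complete" }
  | { status: "needs_review"; errors: string[] };

// Each layer must pass before the next runs. Placeholder commands --
// substitute whatever your repo uses for each layer.
const checks: Array<[string, string[]]> = [
  ["npx", ["tsc", "--noEmit"]],            // type-check
  ["npx", ["eslint", "."]],                // lint
  ["npx", ["vitest", "run", "--changed"]], // affected tests
];

async function gateCompletion(): Promise<GateResult> {
  const errors: string[] = [];
  for (const [cmd, args] of checks) {
    try {
      await run(cmd, args);
    } catch (err: any) {
      // The failing layer's output travels with the task into review.
      errors.push(err.stdout ?? String(err));
      break; // stop at the first failing layer
    }
  }
  return errors.length === 0
    ? { status: "complete" }
    : { status: "needs_review", errors };
}
```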
Better tools, not more retries
Stripe's Toolshed has 400+ MCP tools. Not raw shell commands, but high-level stuff like "run tests for this service" or "check types against production." When an agent fails, they improve the tools and context rather than adding retry logic.
Same philosophy here. NRNS agents use MCP tools — update_task, assign_task, run_tests — not raw CLI access. Higher-level tools mean fewer weird failure modes and better audit trails.
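To illustrate the difference, here's a hand-rolled sketch of a high-level tool, independent of any particular MCP SDK. The service names and the npm workspace layout are assumptions:

```typescript
import { execFileSync } from "node:child_process";

interface ToolResult {
  ok: boolean;
  output: string;
}

// Hypothetical service list -- in practice this would come from config.
const KNOWN_SERVICES = new Set(["billing", "auth", "api"]);

// A narrow, validated interface instead of raw shell access. The agent
// can only express "run tests for service X", not arbitrary commands.
function runTestsForService(service: string): ToolResult {
  if (!KNOWN_SERVICES.has(service)) {
    return { ok: false, output: `Unknown service: ${service}` };
  }
  try {
    // The tool, not the agent, decides the exact command line.
    const output = execFileSync("npm", ["test", "--workspace", service], {
      encoding: "utf8",
    });
    return { ok: true, output };
  } catch (err: any) {
    return { ok: false, output: err.stdout ?? String(err) };
  }
}
```

The failure modes shrink to "unknown service" and "tests failed", both of which are trivial to log and audit.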
Where we differ
One-shot vs. collaborative
Stripe's agents are one-shot: prompt in, PR out. If it fails, they improve the system for next time rather than retrying. At their volume (1,000+ PRs/week) that math works — system improvements compound fast.
We went a different way. NRNS agents can ask questions mid-task, hand off subtasks to other agents, and resume after feedback. For smaller teams, a one-shot failure means real delay. Being able to ask "REST or GraphQL?" before writing the wrong thing saves hours.
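One way to model that, sketched with hypothetical state names: the task lifecycle gets explicit states for questions and handoffs, so a blocked task parks instead of failing.

```typescript
// Hypothetical lifecycle for a collaborative (non-one-shot) agent task.
type TaskState =
  | { kind: "running" }
  | { kind: "awaiting_answer"; question: string }   // e.g. "REST or GraphQL?"
  | { kind: "handed_off"; toAgentId: string; subtaskId: string }
  | { kind: "needs_review"; errors: string[] }
  | { kind: "complete" };

// Asking a question parks the task instead of failing it.
function ask(question: string): TaskState {
  return { kind: "awaiting_answer", question };
}

// An answer arrives later (from a human or another agent) and work
// resumes with it in the agent's context.
function resume(state: TaskState, _answer: string): TaskState {
  if (state.kind !== "awaiting_answer") {
    throw new Error(`cannot resume from "${state.kind}"`);
  }
  return { kind: "running" };
}
```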
Containers vs. worktrees
Stripe spins up a full devbox container per agent. Gold standard for isolation, but heavy infrastructure. We use git worktrees — each agent gets its own branch and working directory, starts in seconds, no containers needed. Your code stays on your machine. We'll probably add container support eventually for server-side execution, but worktrees are plenty for now.
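Under the hood it's plain git. A sketch, with hypothetical path and branch conventions:

```typescript
import { execFileSync } from "node:child_process";
import { join } from "node:path";

// One branch + working directory per agent, created in seconds.
// The .worktrees/ layout and agent/ branch prefix are illustrative.
function createAgentWorktree(repoRoot: string, agentId: string): string {
  const branch = `agent/${agentId}`;
  const dir = join(repoRoot, ".worktrees", agentId);
  // `git worktree add -b <branch> <path>` creates the branch and checks
  // it out into its own directory, sharing the same object store.
  execFileSync("git", ["worktree", "add", "-b", branch, dir], {
    cwd: repoRoot,
  });
  return dir;
}
```

Cleanup is equally cheap: `git worktree remove <path>`, then delete the branch once the work merges.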
Internal tool vs. product
Biggest difference: Minions is built for Stripe's codebase, review culture, and CI. It doesn't need to handle arbitrary repos or team structures. NRNS has to work with your monorepo or your 12 microservices, your commit conventions, your branch policies. Different design challenge entirely.
What this tells us
Stripe running this in production at scale is a strong signal. The question isn't "will AI agents write code for teams" anymore. It's "how do you make them reliable?"
The answer from both their experience and ours: deterministic orchestration, good tools, layered verification. Treat agent infrastructure as real engineering, not a weekend hack on a chat API.
Stripe built this for Stripe. We're building it for everyone else — still early, but the architecture is coming together. If you want to follow along, the waitlist is open.