Adam Miles - AI on Real Codebases

How I run an entire module rebuild, start to finish

Adam Miles — Mon, 29 Jun 2026 15:04:08 GMT

In Rebuilding a brownfield application module-by-module with AI, I made the case for the strangler-fig approach: instead of a big-bang rewrite, you replace a legacy system one module at a time while the product keeps shipping. That post is the why. This one is the how.

What follows is the actual command sequence I run to take one module from an empty folder to shipped code. Each command runs in its own fresh AI session and writes a numbered file on disk, and that file is the handoff to the next command, so each step picks up exactly where the last one stopped. The module is just this sequence: run once at the top, then looped over each feature inside it.
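As a rough sketch of that handoff-through-files idea (not the pipeline's actual code): the command names below come from this post, but the file names and the `next_command` helper are assumptions made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Command names are from the post; the file each one writes is an assumed name.
MODULE_STEPS = [
    ("/module:00-define", "00-charter.md"),
    ("/module:01-research", "01-research.md"),
    ("/module:02-requirements", "02-requirements.md"),
    ("/module:03-feature-map", "03-feature-map.md"),
]


def next_command(module_dir: Path) -> Optional[str]:
    """Return the first command whose output file is not on disk yet."""
    for command, filename in MODULE_STEPS:
        if not (module_dir / filename).exists():
            return command
    return None  # every module-level step has already written its file


with tempfile.TemporaryDirectory() as tmp:
    module_dir = Path(tmp)
    before = next_command(module_dir)   # empty folder: start at the charter
    (module_dir / "00-charter.md").write_text("charter goes here")
    after = next_command(module_dir)    # charter exists: research is next
```

The point of the sketch is that no session has to remember anything: the folder contents alone say which command runs next.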

The sequence

1. Define the module

I start with `/module:00-define`. This writes a short charter: what the module is, who uses it, what already exists in the legacy system for this area, and the constraints any rebuild has to respect. It is deliberately cheap and locks nothing. No feature list, no decisions, just the problem framed clearly.

Why it matters: the framing is the whole point. For a scheduling module, that is a sentence or two: let staff manage the day’s appointments in one place, cut double-bookings, replace a clunky legacy calendar the team has outgrown. No screens, no fields. That framing gives the research a real target, and gives every later step a shared understanding of what we are building and why.

2. Research the landscape

Next is `/module:01-research`, which runs deep competitive and industry research against that charter. It reads across the whole landscape: what competitors ship, what the industry treats as table stakes, where the real opportunities are, and what our own users are actually trying to do. It proposes; it decides nothing.

This step is the easiest to underrate. Our product manager is regularly surprised by features the research proposes, and the surprise is the tell: the AI understood the user’s problem well enough to find a gap the team had missed. Weak research produces a confident module that solves the wrong problem. Strong research is where most of the value gets created.

3. Lock the scope

Then `/module:02-requirements` turns the research into a frozen scope: the capabilities the module must have, what we keep or kill from the legacy system, and the decisions every feature inside the module will inherit. This step ends with the first human gate, the freeze. Nothing expensive gets built until I confirm it, because everything after this point is built on top of it.

It is also not the AI working alone. This step runs as a working session between the AI, a developer, and the domain expert or product manager. The research did the reading; this step surfaces the decisions that steer the module: what is in scope, where the technical trade-offs are, and the judgment calls that only someone who knows the business and the customer can make. It proposes, the people in the room decide. These answers set the direction of every feature that follows, so the freeze is only as good as the people confirming it.

4. Break it into features

With scope frozen, `/module:03-feature-map` splits the module into its features and puts them in build order. It scaffolds a numbered folder per feature, ordered so that features producing data come before the features that read it.

Why it matters: the ordering is the payoff. Features are built in dependency order, so each feature’s inputs already exist by the time you reach it. The numbered folders make this self-documenting: the lowest-numbered unfinished folder is always the next thing to build. And because every capability from the frozen scope has to land in exactly one feature, nothing falls through the cracks.
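A minimal sketch of both checks, assuming the feature map can be reduced to "which features read which other features' data". The feature and capability names are hypothetical examples for a scheduling module, and the standard-library `graphlib` sorter stands in for whatever the real command does.

```python
from collections import Counter
from graphlib import TopologicalSorter

# Hypothetical features: each maps to the features whose data it reads.
reads_from = {
    "daily-view": {"appointments", "staff-roster"},
    "appointments": {"staff-roster"},
    "staff-roster": set(),
}

# static_order() yields producers before the features that consume them.
build_order = list(TopologicalSorter(reads_from).static_order())
folders = [f"{i:02d}-{name}" for i, name in enumerate(build_order, start=1)]

# Every capability in the frozen scope must land in exactly one feature.
frozen_scope = {"manage staff", "book appointment", "block double-booking", "see the day"}
feature_capabilities = {
    "staff-roster": {"manage staff"},
    "appointments": {"book appointment", "block double-booking"},
    "daily-view": {"see the day"},
}
counts = Counter(cap for caps in feature_capabilities.values() for cap in caps)
unassigned = sorted(frozen_scope - set(counts))
duplicated = sorted(cap for cap, n in counts.items() if n > 1)
```

With these inputs `folders` comes out as `01-staff-roster`, `02-appointments`, `03-daily-view`, and both `unassigned` and `duplicated` are empty, which is the "nothing falls through the cracks" condition.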

Steps 1 through 4 run once. The next four run per feature, looping until the module is done.

5. Lock the feature’s requirements

Now I drop into a single feature. A larger feature gets sliced into smaller phases first, and the requirements and prototype steps run over each phase in turn. An optional `/feature:00-research` covers anything the module research did not reach. Then `/feature:03-requirements` does a codebase recon and freezes the feature’s requirements and a decision register: every decision, the recommendation, the final answer, and the reason where I overrode it.

Why it matters: like locking the module scope, this is a working session, not the AI alone. The difference is altitude. Module requirements set the high-level scope; this drops into one feature, where intent actually lives and where the AI will otherwise fill in the blanks with its own assumptions. The register that comes out of it is the contract everything downstream is checked against, which is how you catch “we are about to build the wrong thing” while it still costs a conversation instead of a rewrite. The codebase recon is the other half: it teaches the AI the unwritten conventions of this feature’s corner of the system before a line is written, so the generated code fits what is already there. This is the one step I never skip.
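One way to picture the decision register is as a list of small records. This is an illustrative shape only: the field names and the sample decisions are assumptions, and the real register lives in a file the command writes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Decision:
    question: str
    recommendation: str                    # what the AI proposed
    final_answer: str                      # what the people in the room decided
    override_reason: Optional[str] = None  # expected only when the two differ

    @property
    def overridden(self) -> bool:
        return self.final_answer != self.recommendation


def missing_reasons(register: List[Decision]) -> List[Decision]:
    """Overrides with no recorded reason: the register is not ready to freeze."""
    return [d for d in register if d.overridden and not d.override_reason]


# Hypothetical entries for a scheduling feature.
register = [
    Decision("Allow overlapping appointments?", "No", "No"),
    Decision("Default slot length?", "30 minutes", "15 minutes",
             override_reason="Front desk books short follow-ups all day."),
    Decision("Show cancelled appointments?", "Hide", "Show greyed out"),
]
unexplained = missing_reasons(register)
```

Here `unexplained` holds only the third entry, the one override nobody justified yet.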

6. Prototype it

If the feature has a user interface, `/feature:04-prototype` builds a clickable prototype. Then comes the second human gate, the prototype review: I open it in the browser and actually use it, and `/feature:05-freeze` locks the final design once it matches what we intended. Features with no screen of their own, a pure backend or data job, skip the prototype build and go straight from requirements to a thin freeze.

Why it matters: this is where we find out what needs to change, while changing it is still cheap. The agreed prototype then becomes the contract the implementation builds against. A spec document leaves room for interpretation, and in the past that room is exactly where the work went sideways: the wrong tangents nobody caught until late. Iterating on something you can click closes that gap before any implementation starts. And the prototype is not throwaway: it is the actual production UI, so the design work done here is not redone later. (I go deep on this step in [Most “AI in a day” ships slop. This pipeline has to clear the bar first.])

7. Implement it

`/feature:06-plan` decomposes the frozen feature into implementation steps sized to fit an AI session, then `/feature:07-implement-all` works through them: it dispatches the right specialist for each step, runs review gates, and commits when things pass.

Why it matters: the size limit is what keeps the AI sharp. Each step runs in a fresh agent with clean context, so no single session has to hold the whole feature in its head. The planner refuses oversized steps before they run, because oversized steps are the single biggest reason large features go wrong with AI. (That context-discipline problem, and why orchestration is what addresses it, is the subject of [Subagents solve one AI problem. Orchestration hides another.]) Splitting the work also lets the right specialist take each piece, instead of one general-purpose agent doing everything adequately and nothing well. And because every step is checked against the frozen register before it commits, the implementation cannot quietly invent scope that nobody signed off on.
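A sketch of the "refuse oversized steps" gate, under a loud assumption: this post does not say how step size is measured, so the files-touched metric and the limit of 5 below are invented purely to show the shape of the check.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_FILES_PER_STEP = 5  # assumed limit, not a number from the pipeline


@dataclass
class Step:
    name: str
    files_touched: int  # assumed proxy for "fits in one AI session"


def gate_plan(steps: List[Step], limit: int = MAX_FILES_PER_STEP) -> Tuple[List[Step], List[Step]]:
    """Split a plan into steps that may run and steps sent back for re-planning."""
    accepted = [s for s in steps if s.files_touched <= limit]
    refused = [s for s in steps if s.files_touched > limit]
    return accepted, refused


plan = [
    Step("add migration", 1),
    Step("new endpoint + tests", 4),
    Step("rewrite booking service", 12),
]
accepted, refused = gate_plan(plan)
```

The refusal happens before anything runs: `refused` contains only the 12-file step, which goes back to the planner to be split.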

8. Test it

`/test:create-scenarios` generates test scenarios from the frozen requirements, and `/test:run-scenarios` drives them against the real app.

Why it matters: code that compiles is not code that works. On a system like this, the only real test is whether the feature behaves correctly end to end, with real-shaped data, without breaking the features next to it, and that is exactly what unit tests miss. Generating the scenarios from the frozen requirements closes the loop the contract opened: the same document that said what to build becomes the checklist for whether it got built. A feature is not done when it runs. It is done when it does what it promised.
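To make "the same document becomes the checklist" concrete, here is a small sketch that traces scenario results back to requirement IDs. The IDs and the result format are assumptions for illustration, not the output of `/test:run-scenarios`.

```python
from typing import Dict, Iterable, List


def coverage_report(requirement_ids: Iterable[str], scenarios: List[dict]) -> Dict[str, List[str]]:
    """Each scenario dict carries a 'requirement' ID and a 'passed' flag."""
    covered = {s["requirement"] for s in scenarios}
    uncovered = sorted(set(requirement_ids) - covered)
    failing = sorted({s["requirement"] for s in scenarios if not s["passed"]})
    return {"uncovered": uncovered, "failing": failing}


# Hypothetical frozen requirements and scenario results.
frozen_requirements = ["REQ-1", "REQ-2", "REQ-3"]
results = [
    {"requirement": "REQ-1", "passed": True},
    {"requirement": "REQ-2", "passed": False},
    {"requirement": "REQ-2", "passed": True},
]
report = coverage_report(frozen_requirements, results)
```

A requirement with no scenario (`REQ-3`) is as much a finding as one with a failing scenario (`REQ-2`); both belong in the case a human reads before the ship decision.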

This is also where the third human gate, the ship decision, lives. The AI prepares the case: the review against the frozen register plus the scenario results. It does not get to ship. I read that and make the call.

Then repeat, feature by feature

That four-step loop runs in the build order the breakdown set: take the lowest-numbered unfinished feature from requirements to tested, then move to the next number, until the last feature exits the loop.

Because each command writes its state to a file, I never have to keep a session open to hold context. I can close everything, come back days later, and the files on disk tell the next command exactly where to start. (That handoff-through-files trick is the state machine I described in the first post.) Nothing goes stale, and nothing is lost between sessions.

Knowing where you are: `/module:status`

With several features each moving through requirements, prototype, implementation, and test, the one thing I need at a glance is where the module stands. That is what `/module:status` is for. It reads the same files the pipeline writes and prints a compact rollup: a row per feature, a glyph per step. Then it does the thing that matters more than the status itself: it hands me the literal command to paste for the next unfinished feature’s next step. I never have to remember where I left off. It writes nothing and decides nothing, so I reach for it constantly.
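A toy version of that rollup, to show the idea rather than the real output. It is simplified on purpose: the four steps each map to one command from this post, the glyphs are invented, and passing the folder name as an argument to the command is an assumption.

```python
from typing import Dict, List, Optional, Set, Tuple

# Simplified: one command per step. Command names are from the post.
STEPS = [
    ("requirements", "/feature:03-requirements"),
    ("prototype", "/feature:04-prototype"),
    ("implement", "/feature:06-plan"),
    ("test", "/test:create-scenarios"),
]


def status(features: Dict[str, Set[str]]) -> Tuple[List[str], Optional[str]]:
    """features maps a numbered folder to the set of step names already finished."""
    rows, next_cmd = [], None
    for folder in sorted(features):  # numbered folders sort into build order
        done = features[folder]
        glyphs = " ".join("✔" if step in done else "·" for step, _ in STEPS)
        rows.append(f"{folder:<20} {glyphs}")
        if next_cmd is None:
            for step, command in STEPS:
                if step not in done:
                    next_cmd = f"{command} {folder}"  # argument form is assumed
                    break
    return rows, next_cmd


rows, next_cmd = status({
    "02-appointments": {"requirements"},
    "01-staff-roster": {"requirements", "prototype", "implement", "test"},
})
```

The function only reads its input and returns text, which mirrors the property that makes the real command safe to run constantly: it writes nothing and decides nothing.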

That is the whole spine. One module is this sequence, start to finish. The strangler-fig rewrite is the same loop run module after module, until the legacy system is gone. The sequence is fixed. The discipline is in running it every time.

I’m Adam Miles. 25 years of shipping software, including the messy brownfield kind. I write about using AI to move real codebases forward: legacy, greenfield, and the awkward middle.

If you’re rebuilding a legacy system, or thinking about it, I’m happy to compare notes.

The companion to this post:

Rebuilding a brownfield application module-by-module with AI: the strategic case for the strangler migration this sequence runs inside. If this post is the how, that one is the why.

Deeper dives on two of the steps:

Most “AI in a day” ships slop. This pipeline has to clear the bar first: a deeper look at the prototype step.
Subagents solve one AI problem. Orchestration hides another: why context discipline is the load-bearing problem, and where orchestration helps.

Rebuilding a brownfield application module-by-module with AI

Adam Miles — Mon, 29 Jun 2026 14:59:42 GMT

A brownfield system is software that already exists and already runs: years of accumulated code, real customers depending on it, constraints you inherited rather than chose. It’s the opposite of greenfield, the blank repo and clean prompt that most AI coding demos are built on. Those demos look impressive on Twitter and tell you almost nothing about whether AI can ship on the codebases that actually run the world.

Those codebases look more like this: a legacy backend, and a team that knows AI could help but isn’t sure where to start without breaking production.

I’m in the middle of rebuilding one of those codebases right now. Module by module. With humans in the loop at the right gates, not all of them. This isn’t a retrospective: it’s the system I’m running on today, refined over several months of daily use.

One data point to anchor what this methodology does: UX and UI work that used to take us four to six weeks now takes a single 1-2 hour session: one developer, one domain expert or PM, and a clickable prototype at the end of it. The rest of this post is the strategic case for why that’s possible and when you’d want to do it.

Why module-by-module, and not a from-scratch rewrite

The fantasy version of a legacy rewrite is this: take the team offline for six months, build a clean new application, ship it, sunset the old one. Most engineering leaders have considered it. Most have rejected it, and so did we. Here’s why:

We don’t have the luxury of going dark for six months. The business runs on this software. Customers depend on it for their daily operations. Even if we tried, the requirements would shift underneath us before we shipped, and we’d ship a system already six months out of date.
Big-bang migrations are a customer punishment. Every customer hits the same wall on the same day: workflows change, integrations break, muscle memory resets. The customers who didn’t ask for a rewrite (almost all of them) experience it as a regression, and your team spends the next six months firefighting churn instead of building.
The cutover concentrates the risk into one day. Everything either works or it doesn’t the moment you flip traffic from old to new. Module-by-module spreads that risk across dozens of small cutovers, each independently rollback-able. You trade one terrifying day for many boring weeks.

The alternative, the one we’re actually doing, is the strangler pattern. Named after the strangler fig, which grows around its host tree and replaces it from the outside in until the host can be retired and the fig is load-bearing on its own.

In practice that means:

All new development happens on the new backend. No new endpoints get added to the legacy backend, ever. The legacy backend stops growing the day you commit to this.
Any time we touch a legacy endpoint for a change, we strongly consider porting it. Not every touch is a port. But “is it cheap to port this now?” is a question we ask every time, and the answer is yes more often than the team expects.
Customers don’t experience the migration. From their perspective, features keep shipping. The UI gets better. Performance improves on the modules we’ve rebuilt. They never hit a “welcome to the new app” wall. Their muscle memory survives.

And critically: we’re not doing parity. Each module that gets rebuilt is also an opportunity to improve. Features customers have been asking about for years, features the legacy stack couldn’t accommodate, finally become possible because the module is being touched anyway. The strangler pattern isn’t lift-and-shift. It’s rebuild-while-advancing.

This is where AI changes the math. The strangler pattern stops being the expensive option and becomes the only sane one for complex brownfield applications.

The methodology in one screen

Every feature, from a one-day fix to a multi-week module rebuild, moves through the same phases. The shape is two tiers. A handful of steps run once per module to set direction, then a tight loop runs once per feature until the module is done. Not all features need every step, but the ordering is invariant.

Module-level, run once:

Define the module. A short charter: what it is, who uses it, what the legacy system already does here, what a rebuild has to respect.
Research the landscape. Deep competitive and industry research against that charter. The AI becomes an expert in the problem, not a code generator pointed at the repo.
Lock the scope. Turn the research into a frozen scope every feature inherits. First human gate: the freeze. A developer and a domain expert confirm what’s in, and what’s out.
Break it into features. Split the scope into features in dependency order, so each feature’s inputs already exist when you reach it.

Per feature, looped until the module is done:

Lock the feature’s requirements. A working session that freezes this slice’s requirements.
Prototype it. A multi-agent pipeline produces a clickable, polish-gated UI in 1-2 hours. The prototype is the production UI: same React app, same design system, only the backend mocked. (Its own deep-dive: Most “AI in a day” ships slop. This pipeline has to clear the bar first) Second human gate: the prototype review. Someone opens it in a browser and uses it. This is a product review, not a code review.
Implement it. An orchestrator dispatches specialist subagents through plan-sized steps, runs review gates between them, and commits when they pass.
Test it. Scenarios automatically generated from the frozen requirements get tested against the real app. Automated testing takes care of 95% of test scenarios, and flags the 5% that need a human to run by hand. Third human gate: the ship decision. A human uses the module and decides whether it’s ready for production.

The full command sequence that runs a module through every one of these steps, start to finish, is its own post: How I run an entire module rebuild, start to finish.

Three things hold this together. The AI recommends, but a human always makes the final call: the AI does the reading, the human owns the decision. The work is broken into small pieces, because the AI does careful work on a small, well-defined task and sloppy work on a big vague one. And every step is written down as it finishes, so the work can always be picked up later exactly where it stopped.

AI for the parts where AI is good, humans for the parts where humans are good, and gates between them that neither side can skip.

What this is worth: the actual time savings

This isn’t a “10x productivity” story. Brownfield rewrites that used to be 18-month death marches now run as a sequence of 1-2 week phases with clear gates. The AI does most of the typing, reading, and scaffolding, but not the team’s judgment about what to build.

And the time savings aren’t the whole story. The output is better:

It’s an actually-clickable prototype, not a static mock. The team reacts to a running thing.
The requirements are sharper, and some are ideas the team wouldn’t have produced on its own. Not because the AI is smarter, but because it reads at a breadth no human has time for. It studies how dozens of other products solve the same problem, including ones in neighboring markets the team would never think to look at, and surfaces patterns we hadn’t considered.
It catches requirements gaps that would have shipped. A working prototype surfaces the “wait, that’s not what we meant” moments that a spec document hides. That alone has saved us from multiple multi-week reworks.

That’s the trade: less time, better output, and a working artifact to validate it. Everything downstream of the prototype (the freeze, the plan gates, the orchestrated implementation) exists to hold that gain across an entire rewrite, not just a single feature.

What I’d tell a team starting this work

If you’re staring at a legacy system wondering whether AI can help with the rewrite, the answer is yes, but the approach matters more than the tools. What I’d put on the wall before you start:

Break your application into large modules, and rebuild one at a time. The module is the unit of work, of risk, and of cutover.
Make competitor and adjacent-industry research a real step, not a footnote.
Put at least a developer and a product manager in the room for requirements. Some of the decisions are business decisions, and they need someone who owns the business there to make them.
Build the prototype before you commit to the implementation, and freeze requirements after it, not before. The prototype is the cheapest place to be wrong, and the contract everything downstream builds against.

The brownfield rewrite is the hardest thing software does. AI doesn’t make it easy. It makes it possible at a different speed and cost, but only if you build the scaffolding that respects how hard the problem actually is.

If you’re rebuilding a legacy system, or thinking about it, I’m happy to compare notes. I’m doing this work right now, the lessons are fresh, and I learn as much from the conversations as I share.

The companion to this post:

How I run an entire module rebuild, start to finish: this same spine written out as the exact command sequence that takes one module from empty folder to shipped. If this post is the why, that one is the how.

Deeper dives on two of the steps:

Most “AI in a day” ships slop. This pipeline has to clear the bar first: the multi-agent pipeline that compresses feature design from weeks to hours.
Subagents solve one AI problem. Orchestration hides another: the engineering discipline that keeps long-running AI work in the sharp zone.

Subagents solve one AI problem. Orchestration hides another.

Adam Miles — Tue, 19 May 2026 19:51:31 GMT

You already know AI quality degrades as context fills. Past about 40% you start to feel it; past 70% you’re in the Dumb Zone. (Working heuristic, not a constant. The readable write-ups: Justin Smith, “The 40% Rule: Beating Claude’s ‘Dumb Zone’ on Large Codebases”, and Dale Husband, “Escaping the Dumbzone, Part 1: Why Your AI Gets Stupider the More You Talk to It”.) The single-session toolkit is well-covered: `/clear`, `/compact`, scoped file reads, deliberate session resets, and spawning subagents to do research for you. Use them. They work.

This post isn’t about any of that. It’s about what happens when you take the next step and let an orchestrator drive multiple AI agents on your behalf, each with its own context window you can’t see.

The orchestration problem

An orchestrator is a foreman. It doesn’t lay bricks or wire outlets. It reads the blueprint, calls the right specialist in, inspects what they handed back, calls the next specialist, and keeps the job moving until the building stands. No specialist sees the whole job; only the foreman does.

A good foreman also doesn’t run their crew into the ground. Tired bricklayers make mistakes, so the foreman rotates them out before that happens. The orchestrator does the same: dispatch isn’t just about who’s doing the work, it’s about who’s still sharp enough to do it well.

The state-machine layer (inspect handoff, decide next step, dispatch) is what makes orchestration different from a bash script that runs commands in a fixed order.
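A minimal sketch of that inspect-then-decide step, assuming a handoff can be read as a dictionary. The keys (`refused`, `remaining`) and the three outcomes are illustrative names, not the orchestrator's real vocabulary.

```python
def decide(handoff: dict) -> str:
    """Inspect a worker's handoff and choose the orchestrator's next move."""
    if handoff.get("refused"):
        return "escalate"   # the worker could not do the task; something must change first
    if handoff.get("remaining"):
        return "dispatch"   # steps are left: spawn a fresh worker with this handoff
    return "done"           # nothing left to do for this phase


# Hypothetical handoffs in the three situations.
finished = {"completed": ["step 1", "step 2"], "remaining": []}
partial = {"completed": ["step 1"], "remaining": ["step 2"]}
blocked = {"refused": "missing migration inputs", "remaining": ["step 1"]}

moves = [decide(finished), decide(partial), decide(blocked)]
```

A fixed-order script would run step 2 regardless of what step 1 reported. Here the next move depends on what actually came back, which is the whole difference.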

The invisible failure mode

Subagents solved the visible problem. Orchestration brings it back invisible. When you spawn a subagent manually, you’re still in the loop: you see the digest, you decide whether to spawn another, you catch what looks off. When an orchestrator spawns subagents on your behalf, the context windows that matter are no longer ones you can see. No dashboard tells you agent #3 came back from a 65%-full session and agent #5 came back from an 18%-full session. Each long-running subagent degrades as it fills, and you weren’t the one who decided to spawn it.

A worker at 70% context misreads a column type, returns a migration that looks right at a glance, and the orchestrator builds an API endpoint on top of it. A long session crashes loud. An orchestration produces bad work quietly, the whole time.

The good news: it’s a structural problem with a structural solution. The piece that makes it work is handoff files. A handoff file is a short, structured note an agent writes before its session ends. The next agent reads the note and picks up from a 0% context window. The state lives on disk, not in any session’s memory.

With that piece in place, the rest follows. Three rules keep every session in the tree under the 40% line: one for the orchestrator, one for the subagents it dispatches, one for the long-running workers that save their state to disk.

Rule 1: The orchestrator never does the work itself

The hardest session to keep sharp is the orchestrator’s own. A wrong call by an orchestrator at 50% context isn’t one mistake. It’s a dozen subagents working on the wrong thing.

The mechanism that keeps the orchestrator’s context low is simple: it doesn’t do the work. No source file reads. No tool output dumps in its own session. No long-form analysis. The orchestrator’s only job is to dispatch: spawn the right subagent, inspect the handoff that comes back, decide what’s next, spawn again. Every byte of “real work” that lives in the orchestrator’s session is a byte that didn’t need to be there.

Rule 2: Delegate discovery to read-only locators

The orchestrator stays light by not reading source files. But knowing which source files matter, and what’s in them, is still required to make good decisions. That’s discovery: read-only research about the codebase, the database, the conventions. Someone has to do it.

The answer is a specialized subagent (a locator) that the orchestrator delegates the discovery to. Treat every file read as a context withdrawal. Spawning a subagent costs the parent almost nothing; five files pulled into the parent’s context window stay there for the rest of the session.

A subagent that reads 2,000 lines of source and returns 80 words of summary is the trade you want, every time.

One general-purpose search agent isn’t enough. Every distinct research domain in your project deserves its own locator. I run two currently:

`code-finder`: filesystem locator. “Where is X defined?” “What touches Y?” Returns file:line maps with ≤10-line excerpts. Never dumps whole files.
`db-finder`: database locator. Read-only DB access via MCP tools. Returns table:column digests instead of raw `describe_table` JSON, plus flags for project-specific conventions a general agent wouldn’t know.
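The locators themselves are plain-English agent definitions, so the following is only a Python sketch of the digest shape a `code-finder` hands back: a file:line map with capped excerpts instead of whole files. The file contents are invented, and the in-memory `files` dict stands in for a real filesystem.

```python
from typing import Dict, List

MAX_EXCERPT_LINES = 10  # the "≤10-line excerpts" cap from the description above


def locate(term: str, files: Dict[str, str], context: int = 2) -> List[dict]:
    """Return file:line hits with a short excerpt around each, never the whole file."""
    hits = []
    for path, text in sorted(files.items()):
        lines = text.splitlines()
        for i, line in enumerate(lines):
            if term in line:
                start = max(0, i - context)
                # The slice is small by default; the cap guards larger context values.
                excerpt = lines[start : i + context + 1][:MAX_EXCERPT_LINES]
                hits.append({"where": f"{path}:{i + 1}", "excerpt": excerpt})
    return hits


# Invented file contents, for illustration only.
files = {
    "BookingController.cs": "\n".join([
        "public class BookingController {",
        "    public object GetCouples(int id) {",
        "        return db.Exec(\"sp_Booking_GetCouples\", id);",
        "    }",
        "}",
    ]),
    "BookingService.cs": "public class BookingService { }",
}
hits = locate("sp_Booking_GetCouples", files)
```

The caller receives one entry, `BookingController.cs:3`, plus a few lines of surrounding code. Everything else the search waded through stays out of the parent's context.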

You don’t need an orchestrator to want these. The locators pay off in any manual session too. Next time you’d point Claude at half a dozen files to answer “where does this get called from,” let it route to a `code-finder` instead and hand back a tight file:line map while your own context stays clean. The discovery happened; the source it waded through never entered your window. They’re the cheapest, highest-leverage thing in this whole post to steal, in an orchestrator or out of one.

For cross-layer questions (“where does this stored proc get called from the API?”), both locators get spawned in parallel from the same turn. One turn, two agents, two digests. The orchestrator never reads a file.

I don’t type anything different to use these. Claude reads the CLAUDE.md rules (“DB questions go to db-finder, code questions go to code-finder, cross-layer questions go to both in parallel”) and routes the request automatically. Write the rule once; get correct routing forever.

Each locator is 40-60 lines of plain English plus an output template, and pays for itself the third time you use it.

Rule 3: Long-running agents save state to disk and hand themselves off

Workers (implementers, reviewers, anything multi-step) take many turns and do real writes. No session can hold that safely across the budget, so the state goes where it already belongs: on disk.

When a worker finishes its assigned task, it writes a `handoff.md` describing what it did, then ends. The orchestrator reads the handoff to decide what comes next. If more work remains, a fresh worker reads the same handoff and continues. The orchestrator never holds the worker’s session context; the worker never carries forward across sessions.

A good handoff says four things:

What was produced. “Wrote three migrations and one stored proc.”
Completed steps. What not to redo.
Remaining steps. In execution order.
Recovery notes. Gotchas, partial work in flight, decisions made along the way.

A real one:

Produced: Migration `2024-05-17-add-couples-flag.sql`, stored proc `sp_Booking_GetCouples`, updated `BookingController.cs`, integration tests for the new endpoint.
Completed: Steps 1–4 of phase 02-backend-implementation. Tests written and passing.
Remaining: Step 5 (add the audit-log row on couples-flag change), step 6 (wire the new proc into the existing `BookingService` retry path).
Recovery notes: The proc returns `'True'`/`'False'` BOOLSTR, not `bit`. Any caller has to coerce. The `BookingController` already does; `BookingService` retry path does not yet.

Every worker writes one in the same shape, every time, so any future agent can parse it without guessing. Refusal cases are first-class citizens too: when a worker can’t do the work (wrong tool, oversized request, blocked by missing inputs) it still writes a handoff saying why and what needs to change.
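"Same shape, every time" is what makes a handoff machine-readable. A small sketch, assuming one `Label: value` line per section as in the example above; the helper names are hypothetical, and real handoffs may well span multiple lines per section.

```python
from typing import Dict

SECTIONS = ("Produced", "Completed", "Remaining", "Recovery notes")


def render_handoff(fields: Dict[str, str]) -> str:
    """Write all four sections in a fixed order, refusing an incomplete handoff."""
    missing = [s for s in SECTIONS if s not in fields]
    if missing:
        raise ValueError(f"handoff is missing sections: {missing}")
    return "\n".join(f"{s}: {fields[s]}" for s in SECTIONS) + "\n"


def parse_handoff(text: str) -> Dict[str, str]:
    """Read the sections back; lines with unknown labels are ignored."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(": ")
        if sep and label in SECTIONS:
            fields[label] = value
    return fields


original = {
    "Produced": "Wrote three migrations and one stored proc.",
    "Completed": "Steps 1-4.",
    "Remaining": "Step 5, step 6.",
    "Recovery notes": "Proc returns BOOLSTR, not bit: callers must coerce.",
}
round_trip = parse_handoff(render_handoff(original))
```

Because the writer rejects a handoff with a missing section, the reader never has to guess whether "no Remaining line" means "nothing left" or "the worker forgot".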

Even if you never build a full orchestrator, the handoff pattern is stealable today. At the end of any meaningful session, ask Claude to write one for “the next you,” save it in the repo, and open the next session by reading it. You’ll feel the difference immediately.

Task completion isn’t the only handoff trigger. A worker that gets handed 30 sub-steps will start in the Safe Zone and end deep in the Dumb Zone if it tries to run them all in one session. So the agents themselves know about the budget. My implementer agents self-monitor their step and turn count. After a set threshold, they finish the step they’re on, write a handoff, and return. The orchestrator catches the early return and spawns a fresh implementer with clean context to continue. Nothing in the system is allowed to outlive its budget.
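The budget mechanism can be sketched in a few lines. The threshold of 8 steps is an assumed number (the post only says "a set threshold"), and each call to `run_worker` stands in for one fresh agent session.

```python
from typing import List, Tuple


def run_worker(steps: List[str], budget: int) -> dict:
    """Do at most `budget` steps, then return a handoff instead of pushing on."""
    return {"completed": steps[:budget], "remaining": steps[budget:]}


def orchestrate(steps: List[str], budget: int = 8) -> Tuple[List[str], int]:
    """Respawn a fresh worker on every early return until nothing remains."""
    if budget < 1:
        raise ValueError("budget must be at least 1 step")
    completed, sessions = [], 0
    remaining = list(steps)
    while remaining:
        handoff = run_worker(remaining, budget)  # fresh session, clean context
        completed += handoff["completed"]
        remaining = handoff["remaining"]
        sessions += 1
    return completed, sessions


all_steps = [f"sub-step {n}" for n in range(1, 31)]
completed, sessions = orchestrate(all_steps)
```

Thirty sub-steps with a budget of 8 take four sessions (8, 8, 8, 6), and no single session ever carries more than 8 steps of accumulated context.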

The same discipline keeps the orchestrator out of trouble. If you’ve applied the three rules (it dispatches instead of working, discovery goes to locators, long tasks hand themselves off), the orchestrator’s own context stays lean almost by construction. So if you ever watch an orchestrator climb past 40%, don’t reach for a bigger budget. Take a step back and find what’s bloating it: it’s reading files it should be delegating, or holding work that belonged in a subagent. A well-run orchestrator should rarely get near the line. One caveat: an orchestrator can’t safely resume itself (a Claude-harness limitation I ran into), so mine don’t try. When one crosses the threshold, it saves its state and stops, printing the command to resume. I re-run it in a fresh session and it picks up where it left off. A one-line handoff instead of a silent failure. If you want the details of why self-resume breaks, message me.

There’s a second benefit most people miss: each forced handoff is an implicit review checkpoint. When work is split into bounded sessions, the orchestrator inspects each handoff before spawning the next agent. It’s AI reviewing AI’s output, with fresh context, at a deliberate stopping point. Drift gets caught at the first step, while it’s still small and cheap to fix.

The orchestration discipline, in one screen

The orchestrator dispatches; it never does the work itself. No source reads, no tool output in its own session.
Discovery goes to read-only locators. Digest output, word-capped, gotcha-surfacing. One locator per domain.
Long-running subagents hand themselves off. Structured handoff files on disk, self-triggered when budget pressure hits. Workers and orchestrators both stop while still sharp. A worker’s parent respawns it; the orchestrator has no parent to do that, so it can only resume as a fresh top-level session.

The promise this delivers: larger multi-step work runs cleanly across many fresh sessions, with no single session falling into the degraded zone. Work where the combination of steps is the size, not any single step, becomes work the orchestrator handles one fresh session at a time.

None of this is magic. All of it is boring discipline applied to the one resource everyone underestimates the moment they start chaining agents.

Subagents are the visible fix. The discipline in this post is the invisible one.

---

If you’re building AI orchestration for real codebases and running into context discipline problems, I’m happy to compare notes. I’m doing this work right now, the lessons are fresh, and I learn as much from the conversations as I share.

Most "AI in a day" ships slop. This pipeline has to clear the bar first.

Adam Miles — Tue, 19 May 2026 08:47:56 GMT

Requirements to production-grade code in a workday — gated on a 4/5 quality score against the team’s own design system, with the prototype that becomes the shipped UI, plus what it changes about how you work.

We all know AI can take a feature from requirements to production code now. That part is easy, and the demos are everywhere. So the interesting question stopped being can it build the thing and became does the thing it builds actually hit the mark, and that’s where almost everyone falls back to AI feature slop: fast, plausible, and subtly wrong in the ways that matter.

What separates slop from production-quality output isn’t the generation. It’s the discipline around it. The real unlock here isn’t speed at all, it’s that a built-in research step and a working, clickable prototype surface decisions and ideas the team hadn’t thought of yet: the edge case nobody scoped, the field that should have come from a column already in the database, the flow that only looks wrong once you click it. The prototype becomes a shared surface the whole team reacts to instead of a spec they imagine. Then a measured quality bar and human gates placed at exactly the right moments keep the output honest. Consistency with the system everything else uses isn’t a nice-to-have; it’s enforced. The day is just what’s left over once that discipline is doing its job.

Here’s what a single command produces. I run the orchestrator, walk away, and about 1.5 to 2 hours later come back to a clickable, interactive prototype and a git commit. The UI is production-quality first time, not after rounds of revision: scored above 4/5 on the quality checklist before I see it. Every screen has populated, empty, loading, and error states. The components live inside the production application, in the folder structure features ship from, using the same design system everything else uses. Zero touches between the trigger and the result.

A morning requirements session and an after-lunch team review bracket that two-hour run. After sign-off, the backend gets built and the whole feature gets tested — and on a feature that fits a common pattern, that lands the same workday too.

That same outcome used to take six to ten weeks: initial scoping, multiple rounds of static mocks, stakeholder reviews, the engineering pushback when a mock turned out to be infeasible, the rebuild as code, the iteration cycles after that. Now it lands in a workday.

Not the demo you’ve seen

Claude can sketch a prototype from a one-line prompt. That isn’t this. This pipeline runs against a real codebase with twenty years of schema behind it and walks a PM through thirty-plus grounded requirements questions before a line of UI gets generated. Domain, UX, and UI passes run until a polish reviewer scores the result 4/5 against the team’s own design contract. The output isn’t a demo of what AI could do. It’s the UI that ships in the next release.

This post is how the pipeline works, what it produces, and what it changes about the way you build features.

Why the old way was the rational way

The old workflow had three load-bearing problems:

The artifact arrived late. Teams argued about static mocks for two weeks, then started over once a clickable thing finally existed.
The quality bar was vibes. “Looks good to me” is not a gate. The “approved” mock might have been brilliant, or it might have had eight UX problems nobody caught. Which one you got depended on who was in the room.
Requirements gaps surfaced mid-implementation, when fixing them cost days instead of minutes.

None of that was a mistake. It was a sensible response to a real constraint: building the artifact was the most expensive thing in the cycle, so you spent weeks studying the blueprint before anyone broke ground. The multi-week design phase was an elaborate workaround for the fact that you couldn’t cheaply try the thing before committing to it.

AI changes the economics. The artifact stops being expensive. When a single command produces a clickable, polished, real-component prototype between breakfast and lunch, the workflow inverts. Framing the house is now cheaper than arguing over the blueprint, so the rational move is to build first and decide second.

The Pipeline

One slash command kicks off a sequence of specialized AI agents, each catching what the previous pass couldn’t. Five phases, plus two half-step review passes:

Seven-phase pipeline diagram

Build

An `implementer` agent reads the frozen requirements and builds the components, screens, mocked API endpoints, and a manifest of states (populated, empty, loading, error) for each screen. Between iterations, Playwright drives a real browser to screenshot every screen and state; the orchestrator runs visual sanity checks and spawns fix-up builders when anything renders blank, clips at the edges, or overlaps.
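
To make the state manifest concrete, here is a minimal TypeScript sketch of how a manifest could expand into one screenshot job per screen and state. The `ScreenManifest` shape, the `?state=` query parameter, and the file naming are assumptions for illustration, not the pipeline's actual format.

```typescript
// Hypothetical manifest shape: the post only says each screen declares
// populated, empty, loading, and error states.
type ScreenState = "populated" | "empty" | "loading" | "error";

interface ScreenManifest {
  screen: string; // e.g. "record-list"
  route: string;  // path on the dev server
  states: ScreenState[];
}

interface ScreenshotJob {
  url: string;
  file: string;
}

const REQUIRED_STATES: ScreenState[] = ["populated", "empty", "loading", "error"];

// Returns the states a screen has not declared, so a build can fail early.
function missingStates(manifest: ScreenManifest): ScreenState[] {
  return REQUIRED_STATES.filter((s) => manifest.states.indexOf(s) === -1);
}

// Expands the manifests into one job per (screen, state) pair. A browser
// driver such as Playwright would visit each url and save to each file.
function screenshotJobs(baseUrl: string, manifests: ScreenManifest[]): ScreenshotJob[] {
  const jobs: ScreenshotJob[] = [];
  for (const m of manifests) {
    for (const state of m.states) {
      jobs.push({
        url: `${baseUrl}${m.route}?state=${state}`,
        file: `screenshots/${m.screen}--${state}.png`,
      });
    }
  }
  return jobs;
}
```

Keeping the job list as plain data means the visual sanity checks can iterate over the same list the screenshots came from, so a missing state shows up as a missing file rather than a silent gap.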

Builder self-review

A fresh implementer, with no conversation context from the original build (it still reads the code that was just committed), re-reads the requirements, looks at the screenshots, and fixes what the original missed.

Domain expert review

An agent role-playing as someone who knows the industry and the users critiques the prototype against how real users would actually do this work. The implementer applies the remediation.

UX review - two rounds

A UX reviewer scores the prototype against a rubric: flow clarity, information hierarchy, discoverability, state transitions, sensible defaults, error recovery. The implementer applies fixes. A second round runs because the first round’s fixes always introduce new problems; two rounds is the minimum that produces stable UX.

UI review

A separate agent reviews visual quality against the design system: styles used correctly, spacing consistent, typography on-spec, components matching the library. Fixes applied. Screenshots retaken.

UI polish - up to three rounds, gated on a quality bar

A polish-expert agent, running on a larger model because polish requires judgment, scores every visual aspect on a 5-point scale. The loop only exits when the average is ≥ 4/5 and no individual item scores below 4 and all blocking items pass. The rubric is anchored to the design system itself: spacing tokens, type scale, component variants, color usage, the rules already encoded as the team’s design contract. It’s not subjective taste-grading; it’s spec compliance against rules the codebase already enforces.
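
The exit condition is easy to state in code. The sketch below is one way to express it, assuming a hypothetical `RubricItem` shape; the post does not say what happens when three rounds pass without clearing the bar, so the sketch simply reports failure.

```typescript
// Hypothetical rubric item: the post names the scale and the three exit
// conditions, but not the data shape.
interface RubricItem {
  name: string;      // e.g. "spacing tokens"
  score: number;     // 1 to 5
  blocking: boolean; // blocking items must pass outright
  passed: boolean;
}

const MIN_SCORE = 4;  // bar for the average and for each item
const MAX_ROUNDS = 3; // "up to three rounds"

function average(items: RubricItem[]): number {
  if (items.length === 0) return 0;
  return items.reduce((sum, i) => sum + i.score, 0) / items.length;
}

// True only when all three exit conditions hold at once.
function gatePasses(items: RubricItem[]): boolean {
  const averageOk = average(items) >= MIN_SCORE;
  const noLowItem = items.every((i) => i.score >= MIN_SCORE);
  const blockersOk = items.every((i) => !i.blocking || i.passed);
  return items.length > 0 && averageOk && noLowItem && blockersOk;
}

// Runs review rounds until the gate passes or the round budget is spent.
// `review` stands in for the polish-expert agent plus the fix-up pass.
function polishLoop(
  review: (round: number) => RubricItem[]
): { passed: boolean; rounds: number } {
  for (let round = 1; round <= MAX_ROUNDS; round++) {
    if (gatePasses(review(round))) return { passed: true, rounds: round };
  }
  return { passed: false, rounds: MAX_ROUNDS };
}
```

With both thresholds at 4, the per-item floor already implies the average condition. The sketch keeps the two checks separate so the thresholds can diverge, for example a 4.2 average with a floor of 4.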

Learnings + commit

The orchestrator captures non-obvious discoveries into a `learnings.md` file, then commits the whole prototype to git.

The result on a typical feature: a fully-reviewed, polish-gated, clickable prototype that would have taken a designer-and-developer team a week or more to produce by hand.

A walked-through example

Here’s an actual run, real numbers from one feature: a record list with an edit form, the kind of thing every product ships dozens of.

9:00am, requirements session.

Engineer + PM. The AI walks through 32 questions: fields on the list, filters, edit-form validation rules, what happens on save, what happens on conflict, who can see what. The PM accepts about 25 of the AI’s recommendations and overrides 7 (mostly around business-rule edge cases the AI couldn’t have known). 45 minutes. Decision register saved to disk.

What makes those 32 questions sharper than a normal scoping doc: the AI has read-only access to the local database schema and data. It knows what tables exist, what columns are on them, what naming conventions the project uses. So when it recommends fields, it’s grounded in what’s actually there. A recommendation like “surface the existing `effective_date` column from the appointments table” is qualitatively different from “add an effective date field, somehow.” The prototype never proposes endpoints that would require schema changes nobody wants to make.
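
As an illustration of what a grounded recommendation could look like on disk, here is a small sketch of a decision register entry and a check that each cited column exists. The `Decision` and `Schema` shapes are assumptions; the post only says that a decision register is saved and that recommendations are grounded in the real schema.

```typescript
// Hypothetical shapes, for illustration only.
type Schema = Record<string, string[]>; // table name -> column names

interface Decision {
  question: string;
  recommendation: string;
  groundedIn?: { table: string; column: string }; // schema evidence, if any
  outcome: "accepted" | "overridden";
  overrideNote?: string; // the PM's reason when overriding
}

// True when the cited column really exists in the schema.
function isGrounded(d: Decision, schema: Schema): boolean {
  if (!d.groundedIn) return false;
  const columns = schema[d.groundedIn.table];
  return columns !== undefined && columns.indexOf(d.groundedIn.column) !== -1;
}

// Tallies outcomes and lists the questions whose recommendation cites
// nothing in the schema, so they can be reviewed before the build starts.
function summarize(register: Decision[], schema: Schema) {
  return {
    accepted: register.filter((d) => d.outcome === "accepted").length,
    overridden: register.filter((d) => d.outcome === "overridden").length,
    ungrounded: register
      .filter((d) => !isGrounded(d, schema))
      .map((d) => d.question),
  };
}
```

Not every question needs schema evidence (a permissions rule may have none), so the `ungrounded` list is a prompt for a second look rather than a failure.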

9:50am, engineer triggers the pipeline.

`/prototype:02-orchestrate-prototype`. Walks away. Goes to a meeting. Gets coffee.

11:20am, pipeline finishes.

Engineer comes back to a Slack notification, a git commit, and a dev server URL. Opens the URL. Five screens. Every screen has populated/empty/loading/error states. The list filters work. The edit form validates. The error state renders correctly when the mocked save endpoint returns a 422. Polish score: 4.3/5, all blocking items passed.

1:00pm, team review gate.

Engineer + PM + designer in a room, dev-server URL on the shared screen. They click through. About 30 minutes of real use. Three things surface, and none of them are AI mistakes. They’re decisions that sounded right in the requirements session and now, in front of the working artifact, are obviously the wrong call: an empty-state CTA that prompts the wrong action, a “save and continue” button nobody actually wants, an “effective date” field that turns out to mean two different things. The kind of thing that used to cost a team weeks because nobody notices until the feature is built. Here, it’s a comment on a prototype.

1:30pm, notes go back into the pipeline.

Each change can run as its own one-shot iteration or batch together with others, and the team can cycle through review-and-iterate as many times as the prototype warrants. Simple features land in one pass. Denser or more ambiguous ones go two or three. This team batches and runs one pass. 20 minutes later the prototype is fixed, screenshots retaken, polish bar re-cleared. The “effective date” ambiguity gets written into the final requirements freeze as two separate fields.

2:00pm, requirements frozen.

Engineer runs the next command in the chain. It splits the backend work into right-sized phases against the API contracts the prototype already implied, and a separate orchestration takes over to write the database changes, the endpoints, and the tests. That run takes 30 to 90 minutes.

2:40pm, discovery first.

Before any scripted testing, the engineer points an exploratory agent at the new feature and lets it loose with no test plan. It works the screen the way a curious new user would: it pokes at every button, field, and menu, tries the odd path nobody specified, and watches for anything that errors or behaves wrong. This pass isn’t grounded in the requirements on purpose — it’s hunting for the problems nobody thought to write a test for. What it finds gets collected into one batch, ranked worst-first. Then it stops and asks. And here the engineer’s whole job is one decision: skim the list and tick which bugs to fix. That’s it. From there the agent takes each approved bug, writes the fix, applies it, and then re-runs the exact steps that triggered the bug to prove the fix actually worked — and only reports back the ones it has confirmed resolved. The human supplies the judgment about what matters; the AI does all of the how, including grading its own work.
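
The rank, approve, fix, and re-verify flow can be sketched in a few lines. The severity levels and field names below are assumptions; `applyFix` and `reproStillFails` stand in for the agent's own fix and replay steps.

```typescript
// Hypothetical shapes for the discovery pass.
type Severity = "critical" | "major" | "minor";

interface Finding {
  id: string;
  summary: string;
  severity: Severity;
  reproSteps: string[]; // the exact steps that triggered the bug
}

const SEVERITY_ORDER: Record<Severity, number> = { critical: 0, major: 1, minor: 2 };

// Worst-first ranking for the batch the engineer skims.
function rankWorstFirst(findings: Finding[]): Finding[] {
  return findings
    .slice()
    .sort((a, b) => SEVERITY_ORDER[a.severity] - SEVERITY_ORDER[b.severity]);
}

// For each approved finding: apply a fix, replay the original repro steps,
// and report only the findings whose repro no longer fails.
function fixApproved(
  findings: Finding[],
  approvedIds: string[],
  applyFix: (f: Finding) => void,
  reproStillFails: (f: Finding) => boolean
): Finding[] {
  const confirmed: Finding[] = [];
  for (const f of rankWorstFirst(findings)) {
    if (approvedIds.indexOf(f.id) === -1) continue; // the human did not tick it
    applyFix(f);
    if (!reproStillFails(f)) confirmed.push(f);
  }
  return confirmed;
}
```

The human decision enters in exactly one place, the `approvedIds` argument. Everything else is mechanical, which is the division of labor the paragraph above describes.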

3:10pm, the grounded test pass.

Running discovery first wasn’t arbitrary. It clears the rubble — the dead buttons and broken paths — so that when the grounded pass runs, anything it flags is a real gap against the requirements, not a symptom of some obvious breakage drowning out the signal. Now a second test command goes the other direction: it reads the frozen requirements and writes out the full list of scenarios the feature is supposed to satisfy, then works through most of them on its own: the screen-level checks by driving a real browser, the data-level checks by querying the live database. Where the discovery pass hunted for the unknown, this one confirms the known. Every requirement the team signed off on becomes a concrete, repeatable test that reruns for free from then on. The few scenarios that genuinely need human judgment are flagged as such, and the engineer works through that short list before the feature is called done.
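
One way to picture the grounded pass is as a plan that sorts scenarios by who runs them and flags any requirement with no scenario at all. The `Scenario` shape and the three `CheckKind` values are assumptions drawn from the description above, not the command's real output.

```typescript
// Hypothetical scenario shape. The text describes three kinds of check:
// browser-driven, database queries, and a short list needing human judgment.
type CheckKind = "browser" | "database" | "human";

interface Scenario {
  requirementId: string; // which frozen requirement this proves
  description: string;
  kind: CheckKind;
}

interface TestPlan {
  automated: Scenario[]; // run by the agent, rerunnable later
  manual: Scenario[];    // flagged for the engineer
  uncovered: string[];   // requirement ids with no scenario at all
}

function buildPlan(requirementIds: string[], scenarios: Scenario[]): TestPlan {
  const covered: Record<string, boolean> = {};
  for (const s of scenarios) covered[s.requirementId] = true;
  return {
    automated: scenarios.filter((s) => s.kind !== "human"),
    manual: scenarios.filter((s) => s.kind === "human"),
    uncovered: requirementIds.filter((id) => !covered[id]),
  };
}
```

An empty `uncovered` list is the code-level version of the claim that every signed-off requirement becomes a repeatable test.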

Maybe 90 minutes of actual human attention across the day. The team walks out with a frozen spec, a clickable prototype that matches it, and a backend implementation already running against the approved contracts.

The contrarian claim: the prototype is the production UI

When most teams say “prototype,” they mean a throwaway approximation: a Figma mock, a click-through, or at best a coded mock in Storybook. The mock gets approved. Then someone rebuilds it as production code. The build phase is essentially the second time the same UI gets made.

That’s not what happens here. The components I build in the prototype phase are the components that ship to production. They live inside the application itself, in the same folder structure features ship from, using the same design system tokens production components do. There is no second build. The “prototype” label refers to the backend layer: API endpoints are mocked, no database is involved, data shapes are honest but the implementation is disposable. The frontend is production-grade from day one.

When implementation begins, the diff between “prototype” and “production” is concentrated in the data layer. A `mockApi.getBookings()` call becomes `api.getBookings()`, returning the same shape the prototype was already rendering. Same component tree. Same styling. Same behavior. The thing the team approved in the prototype review is what ships.
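
A minimal sketch of that swap, assuming a hypothetical `Booking` shape and endpoint path (only the `mockApi.getBookings()` to `api.getBookings()` change comes from the text above):

```typescript
// Hypothetical data shape, for illustration only.
interface Booking {
  id: number;
  customer: string;
  startsAt: string; // ISO 8601 timestamp
}

// Both implementations satisfy one interface, so components cannot tell
// which one they were given.
interface BookingsApi {
  getBookings(): Promise<Booking[]>;
}

// Prototype phase: fixtures, no network, no database.
const mockApi: BookingsApi = {
  getBookings: () =>
    Promise.resolve([
      { id: 1, customer: "A. Example", startsAt: "2026-05-19T09:00:00Z" },
    ]),
};

// Production phase: same shape, real transport. `fetchJson` is injected so
// the sketch does not depend on a particular HTTP client.
function createApi(fetchJson: (path: string) => Promise<unknown>): BookingsApi {
  return {
    getBookings: async () => (await fetchJson("/api/bookings")) as Booking[],
  };
}

// A component-side consumer that works unchanged with either implementation.
async function bookingLabels(api: BookingsApi): Promise<string[]> {
  const bookings = await api.getBookings();
  return bookings.map((b) => `${b.customer} @ ${b.startsAt}`);
}
```

Because `bookingLabels` depends only on the interface, moving from prototype to production means changing which object is passed in, not the component. A real implementation would also validate the response rather than cast it.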

Prototype-IS-production-UI diff diagram

This matters for two reasons:

1. The 4/5 polish bar is also the production bar. In the usual flow, prototype polish doesn’t carry forward, and production has to clear a separate bar under time pressure. When the prototype passes the polish loop here, it’s not “good enough for review.” It’s “good enough to ship.”

2. There’s no second design-and-rebuild cycle. The gap between “prototype approved” and “UI ready for real APIs” is the time it takes to swap the API layer. Often a single day. Sometimes hours.

Past the prototype: trusting the code AI writes

A polished prototype is the easy thing to believe. The hard sell is the backend. Show a skeptical engineer a clean UI and they’ll grant you the UI. Tell them an AI wrote the migrations, the stored procs, and the endpoints against a database with twenty years of schema behind it, and the trust evaporates, and they’re right to be cautious. So this is the part that has to earn it, and it earns it two ways.

The interesting decisions are already made, so the backend is a fill-in, not a design problem. By the time backend work starts, the approved prototype has already pinned the contracts. The form renders these fields, so the endpoint returns these fields. The list filters on these columns, so the query takes these parameters. Nobody is in a room negotiating what an endpoint should look like, because the prototype the team signed off on already answered that — in a shape grounded in real schema from the very first requirements question. What’s left isn’t the creative part of backend work; it’s the mechanical translation of a settled contract into migrations, procs, and endpoints. That’s exactly the kind of constrained, well-specified work an orchestration of agents is good at, and exactly why it can run unattended in under 90 minutes.

You don’t trust the output by reading every line. You trust it because of how it’s verified. This is the real answer to the skeptic. The two test passes pull in opposite directions on purpose: the requirements-blind discovery pass proves the feature doesn’t break in the ways nobody anticipated, and the grounded pass proves it does everything the team explicitly asked for. What survives both isn’t “code an AI wrote and a human skimmed.” It’s code measured against the requirements and the real world, and every one of those checks becomes a permanent, replayable test that re-runs for free on the next change.

That’s the whole trust argument in one line: the contracts make the backend cheap to build, and the two-sided verification makes it safe to ship — without anyone pretending they read every generated line and understood it.

What this changes about how you work

Who does what in the new flow

Once the artifact is cheap and the polish bar is the ship bar, the human work changes shape. The design-phase roles become foreman work, not trade work: PM, engineer, and stakeholder all shift from producing the design artifact to steering the one the AI produces. Expertise gets applied to a working artifact instead of an imagined one:

Engineer: owns the requirements pass with the PM, triggers the orchestrator, shepherds the prototype to the review gate. A workday to first prototype in place of multi-week revise-and-rebuild cycles. More features designed per quarter.

PM: drives the requirements conversation directly, making the high-judgment calls (business rules, edge cases, what the customer actually needs) instead of authoring a long-form spec that gets reinterpreted downstream. Their judgment lands on the artifact the team approves.

Stakeholders: click through the deployed prototype the same way they would click through a shipped feature. Feedback grounded in actual use, not in imagining what a mockup might feel like.

Teams with a dedicated designer plug them into the human review gate: applying taste to a working prototype, directing the next iteration, catching the emotional-resonance and cultural-fit calls that a checklist-based AI review won’t. Same roles, same names, more time on the work that needed their judgment.

Redistributed roles

The order of operations flips

Roles aren’t the only thing that rearranges. The sequence of decisions does too. Three categories that used to happen before the prototype now happen in front of it:

Requirements get a first pass before the prototype, enough to give the build agent something to work from, then freeze after the team has clicked through and surfaced the real gaps. “That flow doesn’t make sense; this field should have been on the other screen; what happens when a user does X-then-Y, we never thought about that.” The prototype isn’t a downstream artifact of requirements. It’s a tool that produces better requirements.
UX is exempt from the requirements phase entirely. No modal-vs-drawer debates upstream of the artifact. By the time the prototype reaches the team they’re reacting to a thing, not debating an idea.
API shape comes out of the components the AI builds, not the other way around. By the time the backend orchestration kicks off, the contracts are already pinned, and a separate set of AI agents implements them.

The pattern across all three: judgment used to be applied to the imagined version of the thing. Now it’s applied to the working one.

The honest trade-offs

Compressing a whole feature into a day has real costs. The ones worth flagging:

The UX-reviewer has limits. Emotional resonance, cultural sensitivity, accessibility nuances the checklist doesn’t catch: these require human review. The human review gate is where those get caught. Don’t skip it.
The prototype is only as good as the requirements. The requirements command walks the team through structured questions and doesn’t move on until each one has a concrete answer. The team’s job is showing up and answering questions, not authoring a 40-page PRD.
The workday timeline assumes scaffolding is in place. If you’re starting from scratch (no orchestrator, no review checklists, no design-system tokens, no requirements pipeline), that’s a few days of work before the first prototype lands in a day.
This works best for features that fit common patterns, which is most of them. Booking screens, record lists, edit forms, multi-step wizards: probably 95% of the features any product needs to ship are well-understood patterns, and the pipeline produces strong output on every one of them. The remaining 5% (genuinely novel interactions: multi-touch gestures, real-time collaboration, anything without a category prior) require more human steering during the review rounds.

What I’d tell a team thinking about this

Three things hold up across every team I’ve watched try this:

The AI needs explicit acceptance criteria, or it grades on vibes. Vague review prompts produce vague reviews. Specific criteria (spacing tokens, type scale, empty-state coverage, error-recovery flows) produce specific, actionable ones. Build the criteria first. The rest of the pipeline is in service of them.
Teams who skip the human review break. Between 15 and 40 percent of features hit a “wait, that’s not quite right” moment at the gate, and those moments almost always surface a missed assumption in the requirements. The AI produces a strong default; the human’s job is the editing pass that turns the default into the right answer.
Don’t make the prototype pipeline handle everything. It’s a single-feature tool, not a system-design tool. Cross-feature flows, multi-screen state machines, third-party integrations: those still need real design work upstream.
