Building with Claude Code in Your Rails App

A founder asks Claude Code to build a "share invoice with my accountant" feature. The agent works for ten minutes. It produces code. It looks right. Two weeks later something else breaks, and on closer reading the share feature wasn't quite what the founder asked for either: it shared the invoice publicly, with no authentication, because nothing told the agent that "share with accountant" meant a private link.

That is what happens when you skip the part where you and the agent agree on what "done" looks like before the agent starts working.

I work with founders who build with AI. The loop I run with them has six phases, each with a specific artifact, a specific approval, and a way for you to verify the work even though you can't read most of the code.

The whole thing is closer to agile than to vibe coding. It pivots on a single idea: the test is the spec you can read.

This article walks through the loop and shows you what a system test looks like in practice. It is the second article in the Getting Started series. The first one got you set up. This one shows you how we build.

If you haven't installed Rails and Claude Code yet, start there. If you have a running Rails app but no clear spec for the feature you want, the spec phase is its own work and we'll cover it in a separate article. For now, assume you have a feature you want and we're about to agree on what it should do.

The thesis: definition of done, made executable

Agile teams have a phrase for what we're after here: the definition of done. It's the shared written understanding of what "finished" means for a piece of work. In a traditional team, the definition of done lives in a document or a ticket. People read it and argue about whether they did it.

When the worker is an AI agent, the definition of done has to be executable. The agent doesn't read your document and pause to argue with you. It picks the most plausible interpretation and ships. If your definition of done is fuzzy, the output will be plausible-but-wrong, and you won't catch it because you can't read the code.

So we make the definition of done a test. The test is written in a form close enough to plain English that you can read it. The agent treats the test as the contract. When the test passes, the work is done. When the test fails, the agent isn't finished.

That is the load-bearing move. Everything else in the loop exists to get to a good test and to lock it before the agent starts implementing.

The Spec-Driven Development series on martinfowler.com makes the same observation more bluntly: with AI agents in the loop, "your role isn't just to steer. It's to verify." The test is what makes verification possible without you reading code.

This loop is also one concrete shape of what I've called the middle loop elsewhere: the supervisory engineering work that sits between inner-loop coding and outer-loop delivery. The middle loop article frames the why. This one is the how.

Where everything lives: GitHub or GitLab

GitHub and GitLab serve three roles in this loop, and naming them upfront is worth a minute of your time.

- Storage of your code. Your Rails app lives on your Mac while you work; the platform is where it's backed up and shareable. Your project gets pushed up there as a repository (repo for short).
- The structure of the loop itself. Specs live as issues. Tests live in pull requests (PRs). My review and the "go" lock are comments on the PR. The whole flow is visible on the platform.
- CI (continuous integration). Every time tests or code change, the platform runs the full test suite automatically on its own servers. Green checkmarks mean the tests pass; red Xs mean they don't. CI is the safety net and the final sign-off: even if we both said "go", CI's verdict is the one nobody gets to argue with.

Either platform works; pick one and stay there.

How Claude Code talks to GitHub or GitLab. Both platforms publish a command-line tool: gh for GitHub, glab for GitLab. Claude Code uses these behind the scenes to create issues, open PRs, read review comments, and watch CI. You won't see the commands; you'll see Claude saying things like "I've opened PR #4 with the failing tests."

NOTE

If you ran /jr-rails-bootstrap from Article 1's Quick path, both CLIs are already installed and authenticated. If you went the manual route, the install commands live in Step 7 of Article 1's manual reference. Default to HTTPS auth - SSH key setup is parked for a future article.

Phase 1: Spec agreement

We start with a conversation. You tell me what you want to build. I ask questions. We agree on a piece of work small enough to ship in a single pull request, large enough to be worth shipping at all.

If the piece is too big, we break it into sub-issues, so each one fits in a single round of the loop without overwhelming the test suite or the review. The spec lives in a GitHub or GitLab issue, with sub-issues as children. The parent issue is the feature. The child issues are the increments.

If you've never created an issue before, it's two clicks from your repository page: "Issues" → "New issue", give it a title, write the spec into the description, submit. To "point Claude Code at an issue" later, paste the issue URL into your prompt (or reference it by number like issue #4).

This phase ends when both of us can say, in two sentences, what each issue is for and why it matters. If you can't say it, we haven't agreed. If I can't say it, the spec is too vague to test.

The most common failure here is the founder specifying the implementation when they should be specifying the outcome. "Add a share button to the invoice page" is implementation. "When a user is on an invoice they own, they should be able to send the accountant a private link" is outcome. We work in outcomes. The implementation belongs to the agent.

This is a version of the XY problem: asking how to do Y when the real question is X. With AI agents the cost of getting it wrong compounds, because the agent will happily build Y and you won't notice until you're three features deep into the wrong abstraction.

An actual spec looks like this. The issue body we write together in Phase 1 ends up something like:

Title: Share invoice with accountant via private link

Outcome
When a user is on an invoice they own, they can send their accountant
a private link that only the accountant can open. The link expires
after 30 days.

Out of scope (for now)
- Multiple recipients per share
- Customising the expiry duration
- Tracking whether the accountant viewed it (separate issue later)

Acceptance
- User can initiate a share from the invoice page
- An email goes to the accountant with the link
- Only the recipient email can open the link
- The link expires after 30 days

That's it. Plain language. The "Outcome" is what we'll test against. The "Out of scope" stops the agent from over-building. The "Acceptance" list becomes the test names in Phase 2.

What fits in one round? Roughly: one user-facing flow, one or two new database tables, three to five system tests. The invoice-sharing example above is the right size.

Here's the same idea written too big:

Title: Build invoice management

Outcome
Users need to create, edit, share, and archive invoices. They should
see paid/unpaid status. Multiple users per account. Email notifications
when invoices are viewed. PDF export. Recurring billing.

That's at least six features in one issue, and closer to nine if you count create, edit, share, and archive separately. The agent will try to build all of them at once and produce a tangle. The right move is to break it into individual issues (create invoices, list invoices, edit invoices, share invoice with accountant, and so on), each going through this loop separately.

IMPORTANT

Pause here. Draft your spec in the format above, in your repo's Issues, and bring it to your senior reviewer (me, or whoever's playing that role for your project). The rest of this article walks through what happens after the spec is locked. Reading on now is fine; acting on it without an agreed spec is the most common way founders get into the bog.

Phase 2: System tests as spec translation

Once we agree on the spec, you (with your Claude Code) translate it into Rails system tests. You open Claude Code in your project, point it at the spec issue, and ask something like: "write the failing system tests for issue #N. Don't implement anything yet."

The agent writes the tests, opens a pull request, and pushes it up. The PR contains the tests and nothing else. The tests fail because there's no implementation yet. That's expected.

Here's the kind of thing your agent produces:

require "application_system_test_case"

class InvoiceSharingTest < ApplicationSystemTestCase
  setup do
    @user = users(:alice)
    @invoice = invoices(:alice_invoice_one)
    sign_in @user
  end

  test "user can share invoice with their accountant via private link" do
    visit invoice_path(@invoice)
    click_on "Share with accountant"
    fill_in "Accountant's email", with: "accountant@example.com"
    click_on "Send"

    assert_text "Share link sent to accountant@example.com"
    assert_emails 1
  end

  test "share link expires after 30 days" do
    share = @invoice.share_links.create!(
      recipient_email: "accountant@example.com",
      created_at: 31.days.ago,
    )

    visit shared_invoice_path(share.token)
    assert_text "This share link has expired"
  end

  test "share link is only viewable by the recipient email" do
    share = @invoice.share_links.create!(
      recipient_email: "accountant@example.com",
    )

    visit shared_invoice_path(share.token)
    fill_in "Your email", with: "wrong@example.com"
    click_on "Continue"

    assert_text "This link isn't for you"
  end
end

Read it without trying to understand the syntax. (One note: the sign_in @user line near the top logs the test user in before each test runs. The jr-rails-new skill from Article 1 sets up Rails' built-in authentication by default, so your project already has a user/login system to test against.)

The three test names tell you what is being checked:

- A user can share an invoice with their accountant via a private link.
- The share link expires after 30 days.
- Only the recipient email can view the link.

The body of each test says how the behaviour is observed. "Visit the invoice page, click 'Share with accountant', fill in the email, click 'Send', expect to see a confirmation message." That sequence is what your user actually does.

The Capybara vocabulary, for reference

visit means open this page. click_on means click the link or button with this label. fill_in ... with means type into the labeled field. assert_text means expect to see this text on the page. assert_emails 1 comes from Rails' mailer test helpers rather than Capybara, and means exactly one email should have been sent.

There's no Ruby you need to know to read these tests. Once you've read four or five of them, you can skim them as fluently as you'd skim a Notion doc.

If you read this and think "yes, that's what I asked for", you'll say "go" at the lock gate in Phase 3. If you read it and think "wait, I never said anything about a 30-day expiry", that's the conversation we need to have before we lock.

System tests are end-to-end tests that drive a real browser against a real database. They describe what a user sees and does, and they read close to English.

I deliberately ask you to drive this step. Your agent has the Rails knowledge to write the test; your job is to direct it from the spec and read what came out. The lock gate in Phase 3 catches what either of you missed, so first-pass perfection isn't the goal. The goal is the mental model. You see the spec become a test, you read what your own agent produced, you develop a feel for what a testable spec looks like. By the time you reach Phase 6 (verification), you're reading tests as fluently as English prose.

We use system tests, not the lower-level kinds Rails also offers, at this stage. System tests are the readable layer. A lower-level test for the Invoice model tells you nothing useful unless you already know what the model is. A system test tells you what happens when someone clicks "Share with accountant."

There's a tradeoff. System tests are slower to run and more brittle than unit tests. We address that in the simplification pass later.

Phase 3: The "go" lock gate

Once the PR is up, two reviews happen.

You read the tests first. Not the implementation, since there isn't one yet. Just the tests. You're checking one thing: do these tests describe what you actually want? If not, you ask your agent to revise. (Often the agent over-specified or missed an edge case the spec implied but didn't state.)

I review the tests next. Same question at a different grain: are they covering the right behaviour at the right level, without being brittle or missing scenarios? If anything's off, I tell you, and you and your agent revise. (In practice my review lives as comments on the PR, usually within a few hours during your working day. If you flag a question, we'll often jump on a 15-minute call.)

When both reviews pass, I say "go." The test is now locked. I won't change it. Your agent won't change it. It's the contract.

This is the moment that makes the rest of the loop work. The lock turns a fluid conversation into an immutable artifact. The agent can no longer reinterpret what you meant. It has to satisfy the test.

Phase 4: Implementation

With the test locked, you ask your agent to implement against it. In Claude Code: "make the failing tests in PR #N pass without modifying the tests themselves."

Claude Code writes models, controllers, views, migrations: whatever the test needs. It iterates until the test passes. If it gets stuck it asks you a question or escalates.

Most of what the agent does during this phase is invisible. It reads your codebase to find the right place to add things, writes a database migration if the schema needs to change, generates the model/controller/view files, runs the tests itself, sees what fails, fixes, runs again. The lock means it can't drift the spec to make the work easier. The test passing is the only signal it's done.
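The lock in this loop is an agreement between people, not a tool. If you wanted a mechanical check on top of it, one option (sketched here as an assumption, not as part of the loop described above) is to fingerprint the test files when "go" is said and compare them again before merging. The helper names fingerprint and changed_since_lock are hypothetical, and the temporary directory stands in for your test/system folder.

```ruby
require "digest"
require "tmpdir"

# Maps every *_test.rb file under dir to a SHA-256 digest of its contents.
def fingerprint(dir)
  Dir.glob(File.join(dir, "**", "*_test.rb")).sort.to_h do |path|
    [path, Digest::SHA256.file(path).hexdigest]
  end
end

# Returns the paths that were edited, added, or removed since the lock.
def changed_since_lock(locked, current)
  (locked.keys | current.keys).reject { |path| locked[path] == current[path] }
end

# Demo in a throwaway directory (hypothetical file name and contents).
Dir.mktmpdir do |dir|
  test_file = File.join(dir, "invoice_sharing_test.rb")
  File.write(test_file, "assert_text \"Share link sent to accountant@example.com\"\n")
  locked = fingerprint(dir) # taken at the moment of "go"

  # Later, someone loosens the assertion to get the build green.
  File.write(test_file, "assert_text \"Share link sent\"\n")

  changed = changed_since_lock(locked, fingerprint(dir))
  puts "Locked tests changed: #{changed.map { |p| File.basename(p) }.join(', ')}" unless changed.empty?
end
```

An empty list means the contract is untouched. Anything else is a conversation, not a merge.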

The shape of what gets produced (Rails class stubs)

For the invoice-sharing example:

# db/migrate/..._create_share_links.rb
# Adds a share_links table: belongs to an invoice, holds the
# recipient email and a unique token, plus the usual timestamps.

# app/models/share_link.rb
class ShareLink < ApplicationRecord
  belongs_to :invoice

  # Generates a random URL-safe token before the record is saved.
  before_create { ... }

  # True if more than 30 days have passed since creation.
  def expired? ... end
end

# app/controllers/invoice_shares_controller.rb
class InvoiceSharesController < ApplicationController
  # Creates a share link for the current user's invoice and emails
  # it to the recipient.
  def create ... end
end

# app/mailers/share_link_mailer.rb
class ShareLinkMailer < ApplicationMailer
  # Emails the share link to the accountant.
  def invitation(share) ... end
end

Stubs only. You're not expected to read or write any of this. But seeing the shape of what "Claude Code is implementing" actually means helps you know what to look for in Phase 6: did files appear in these folders, did the tests turn green.

You don't need to read most of this code; I do. The test is what you're verifying against.
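If you're curious what might sit behind those stubs, here is a minimal sketch in plain Ruby rather than Rails, so it runs on its own. ShareLinkSketch, EXPIRY_SECONDS and viewable_by? are names made up for illustration; the model your agent writes will be an ActiveRecord class and may look quite different.

```ruby
require "securerandom"

# Plain-Ruby stand-in for the ShareLink model (hypothetical, not the real class).
class ShareLinkSketch
  EXPIRY_SECONDS = 30 * 24 * 60 * 60 # 30 days, taken from the spec

  attr_reader :recipient_email, :token, :created_at

  def initialize(recipient_email:, created_at: Time.now)
    @recipient_email = recipient_email
    @created_at = created_at
    # Random URL-safe token, so the link can't be guessed.
    @token = SecureRandom.urlsafe_base64(32)
  end

  # True if more than 30 days have passed since creation.
  def expired?(now: Time.now)
    now - created_at > EXPIRY_SECONDS
  end

  # True only when the given email matches the recipient, ignoring
  # letter case and surrounding whitespace.
  def viewable_by?(email)
    email.to_s.strip.downcase == recipient_email.strip.downcase
  end
end

link = ShareLinkSketch.new(recipient_email: "accountant@example.com")
puts link.expired?                          # false, the link was just created
puts link.viewable_by?("wrong@example.com") # false, not the recipient
```

Each locked test maps to one small rule here: a hard-to-guess token, a 30-day cutoff, and a recipient match.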

Phase 5: Code review

The agent made the tests pass. Now I read what it wrote.

This is a different review from the test-locking review in Phase 3. Then, we both checked whether the tests describe what you want. Now I'm checking whether the implementation does what the tests demand and nothing else: whether it fits the rest of your codebase, whether it took shortcuts that will hurt later, whether it introduced anything the tests don't catch.

In practice I'm looking for architectural fit (is the new code in the right place?), security (any unsafe defaults, missing permission checks?), performance (any database queries that will be slow at scale?), and style (does it read like the rest of your codebase?). The tests cover correctness of what you can describe. This review catches what you can't.
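To make the "missing permission checks" item concrete: in a Rails controller the difference is often one line, looking a record up across all invoices versus only among the signed-in user's own (something like Current.user.invoices.find(params[:id]), assuming a has_many :invoices association exists). The sketch below mimics that difference in plain Ruby. InvoiceRecord, InvoiceLookup and InvoiceNotFound are hypothetical stand-ins, not code from your app.

```ruby
# Hypothetical stand-ins so the sketch runs without Rails.
InvoiceRecord = Struct.new(:id, :owner_id)
class InvoiceNotFound < StandardError; end

class InvoiceLookup
  def initialize(invoices)
    @invoices = invoices
  end

  # Unsafe: anyone who guesses an id gets the invoice.
  def find_any(invoice_id)
    @invoices.find { |invoice| invoice.id == invoice_id } || raise(InvoiceNotFound)
  end

  # Safer: the search is limited to invoices the given user owns.
  def find_owned_by(user_id, invoice_id)
    @invoices.find { |invoice| invoice.id == invoice_id && invoice.owner_id == user_id } ||
      raise(InvoiceNotFound)
  end
end

lookup = InvoiceLookup.new([InvoiceRecord.new(1, 10), InvoiceRecord.new(2, 20)])

puts lookup.find_any(2).owner_id # 20: user 10 could read user 20's invoice
begin
  lookup.find_owned_by(10, 2)
rescue InvoiceNotFound
  puts "user 10 cannot load invoice 2"
end
```

A system test that only ever signs in as the owner passes with either version, which is why this kind of gap is caught in review rather than by the locked tests.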

If something needs fixing, I leave comments on the PR. Your agent reads them and addresses them. The test suite reruns. We iterate until the PR is clean and I approve. Then you move on to verification.

This is the second of the two gates that make the loop work. The lock gate at Phase 3 protects you from the wrong tests; this code review protects you from the right tests being passed in the wrong way.

Phase 6: Verification

When my review is in and the tests still pass, you do one final check. You read the locked tests one more time, scan the green checkmarks in the PR, and either approve or flag.

The check is short because the test is the contract and I've already vetted the code behind it. If the test passes and the test describes what you wanted, the implementation is correct enough to ship. If something feels off (the test passes but the app behaves strangely in your hands), that's when you ping me, because something the test doesn't cover has slipped through.

What we simplify later (test suite performance, post-MVP)

System tests are slow. A test that spins up a browser, navigates pages, fills forms, and waits for responses can take five to ten seconds. A unit test takes milliseconds. If we keep every test as a system test, the suite eventually takes twenty minutes to run and slows the agent down considerably.
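As a rough back-of-the-envelope check on those numbers, assuming the tests run one after another with no parallel workers (an assumption; your setup may run them in parallel), the helper below (a hypothetical name) shows how few system tests it takes to reach twenty minutes.

```ruby
# How many tests of a given duration fit into a time budget, run serially.
def tests_in_budget(seconds_per_test, budget_seconds = 20 * 60)
  budget_seconds / seconds_per_test
end

puts tests_in_budget(5)  # 240 tests at five seconds each
puts tests_in_budget(10) # 120 tests at ten seconds each
```

At three to five system tests per feature, that is a few dozen features before the suite becomes a drag.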

After the loop closes (after the feature is merged, you've used it, things are stable), we go back and rewrite the tests that don't really need a browser into faster non-browser equivalents. The "share link expires after 30 days" test, for example, has no real interactive element worth checking in a browser. It can be rewritten to run in milliseconds instead of seconds.
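Here is a sketch of what the rewritten expiry test could look like, using Minitest against a plain-Ruby stand-in so that it runs without Rails or a browser. ShareLinkStandIn is hypothetical; in your app the test would exercise the real ShareLink model, and the exact shape is up to whoever does the simplification pass.

```ruby
require "minitest/autorun"

# Hypothetical stand-in for the ShareLink model: only the expiry rule.
ShareLinkStandIn = Struct.new(:created_at) do
  # True if more than 30 days have passed since creation.
  def expired?(now = Time.now)
    now - created_at > 30 * 24 * 60 * 60
  end
end

class ShareLinkExpiryTest < Minitest::Test
  DAY = 24 * 60 * 60 # seconds

  def test_link_is_still_valid_after_29_days
    link = ShareLinkStandIn.new(Time.now - 29 * DAY)
    refute link.expired?
  end

  def test_link_has_expired_after_31_days
    link = ShareLinkStandIn.new(Time.now - 31 * DAY)
    assert link.expired?
  end
end
```

No browser starts and no page loads, so the same rule is checked in a fraction of the time. The test names still read as the spec.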

We keep the interactive system tests around because they're the readable layer. The agent can read them. You can read them. They're the spec, durably.

The simplification happens later because we don't want to slow down the loop before we've shipped working software. Speed of iteration matters more than test suite speed while we're building. Test suite speed matters more once the feature has landed and we want to move on to the next one.

What this isn't

This article doesn't cover the upstream spec phase. Most founders spec too broadly. Narrowing the spec to something testable is its own work and gets its own article.

It doesn't cover running multiple agents in parallel against the same codebase via git worktrees. That's an advanced layer on top of this loop, useful once you've shipped a few features and want to parallelize. For the first ten features, the single-track version is enough.

It doesn't cover deployment. Article 4 in this series will.

It doesn't replace technical judgement. Good system tests need Rails knowledge to write and to vet. In our loop your agent has the writing knowledge; I have the vetting judgement at the lock gate. The loop is the scaffolding for that division of labour.

What now?

If you've been chatting with Claude about a product idea and you have a Rails app from the install guide, this loop is how you actually build features. The two ways to start working with me are the same as before: a codebase audit if you've already shipped something and want a read, or a CTO retainer if you want me in the loop while you build.

FAQ

Why system tests and not unit tests?

Unit tests describe what a single object does in isolation. System tests describe what a user sees and does, in a real browser against a real database. A founder can read a system test without knowing Rails; a unit test requires knowing what the objects are.

What if I approve a test and the result still feels bogus?

That means the test has a coverage gap. We add a follow-up test, re-lock, iterate. The trap to avoid is loosening the test to get past a difficult feature.

What does this cost compared to letting Claude build features unattended?

More expensive per feature in time. Much less expensive over the lifetime of the codebase. Vibe-coded features compound problems; by feature 10 you're rewriting features 1 through 9. The loop feels slow because each feature has more steps. That time comes back as compounding stability.

Can I run this loop without a fractional CTO?

You can run a version of it. The hard part is test quality. Founders I work with who try the loop solo usually get the structure right and the test quality wrong, which means the tests pass and the implementation is plausible-but-wrong, which is what the loop was supposed to prevent.

How long does one feature take in this loop?

Spec agreement is a 30-minute conversation. Test writing is one or two hours. Lock gate is 15 to 30 minutes of your time. Implementation is whatever Claude Code takes, often 10 to 30 minutes. Code review is 30 to 60 minutes of my time. Verification is 10 minutes. Per shippable feature: half a day to a day of elapsed time, of which you're actively engaged for about two hours.

What if the agent can't make the test pass?

Usually it can. When it can't, the test is over-specified, the spec was wrong, or the agent doesn't know enough about your codebase. We diagnose, fix the right thing (often the spec, sometimes the test, occasionally a Rails convention the agent didn't know), and the loop continues.

Julian Rubisch is a fractional CTO based in Vienna. 16 years of Rails. Maintainer of StimulusReflex. Codebase audits and CTO retainers at railsreviews.com.

Last updated: 2026-05-11