
Harness Engineering Report

8 min read

A survey of how teams are setting up automated coding agent pipelines (Feb 2026).


1. Stripe Minions -- Enterprise Internal Fleet

Scale: 1,300 PRs/week, 0 human-written code
Trigger: Slack message, CLI, web UI, or automated (flaky test detected)

Slack msg / CLI / auto-trigger
         |
         v
┌─────────────────┐
│   Warm Devbox   │ <-- EC2, pre-cloned repo, ~10s ready
│   (isolated)    │     no internet, no prod access
└────────┬────────┘
         |
         v
┌─────────────────┐
│    Blueprint    │ <-- state machine: deterministic + agentic nodes
│  Orchestration  │
└────────┬────────┘
         |
   ┌─────┴──────┐
   |            |
   v            v
[Agentic]   [Deterministic]
"Implement"  "Run linters"
"Fix CI"     "Push changes"
   |            |
   └─────┬──────┘
         |
         v
┌─────────────────┐
│   Local Lint    │ <-- heuristic, <5s
│  (shift left)   │
└────────┬────────┘
         |
         v
┌─────────────────┐
│  CI: selective  │ <-- from 3M+ tests, only relevant
│    test run     │
└────────┬────────┘
         |
       pass? ──no──> autofix? ──yes──> apply, retry once
         |                │
        yes               └──no──> hand to human
         |
         v
┌─────────────────┐
│   PR created    │ <-- follows Stripe PR template
│ (human review)  │
└─────────────────┘

Context sources:

  • Rule files (Cursor format, directory-scoped)
  • MCP "Toolshed" (~500 internal tools, curated subset per agent)
  • Pre-hydrated links from conversation context
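The pass/autofix gate in the diagram can be sketched as a small script. Both `run_ci` and `try_autofix` are stand-in stubs for Stripe's internal tooling; their behavior here is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the CI gate: run selective tests, attempt at most one autofix
# retry, otherwise hand the task to a human. Both functions are stubs.
attempt=0
run_ci() { attempt=$((attempt + 1)); [ "$attempt" -gt 1 ]; }  # stub: fails once
try_autofix() { return 0; }                                   # stub: fix applies

if run_ci; then
  result="pass"
elif try_autofix && run_ci; then   # apply fix, retry exactly once
  result="pass-after-autofix"
else
  result="hand-to-human"
fi
echo "$result"
```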

Key insight: "Often one, at most two CI runs." The base agent is a fork of Block's Goose.

Full reference | Source


2. OpenAI Harness Engineering -- Zero Human Code

Scale: ~1M LOC in 5 months, 3.5 PRs/engineer/day
Trigger: Human writes a prompt describing a task

Engineer writes prompt
         |
         v
┌─────────────────┐
│   Codex agent   │ <-- reads AGENTS.md (table of contents, ~100 lines)
│    (isolated    │     walks dir tree root -> CWD
│    worktree)    │     loads docs/ as needed (progressive disclosure)
└────────┬────────┘
         |
         v
┌─────────────────┐
│   Work depth-   │ <-- break goal into building blocks
│      first      │     design -> code -> test -> review
└────────┬────────┘
         |
         v
┌─────────────────┐
│  Custom linters │ <-- Codex-written, error msgs include remediation
│  (architectural │     enforce layer deps, naming, file size
│   constraints)  │
└────────┬────────┘
         |
         v
┌─────────────────┐
│   Agent self-   │ <-- review own changes
│     review      │     request additional agent reviews
│                 │     respond to feedback, iterate
└────────┬────────┘
         |
         v
┌─────────────────┐
│ Agent-to-agent  │ <-- humans optional in review
│   review loop   │     squash & merge when satisfied
└────────┬────────┘
         |
         v
┌─────────────────┐
│    PR merged    │
└─────────────────┘
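The root-to-CWD walk (progressive disclosure) can be sketched as a small helper. Treating the argument as a directory path relative to the repo root, and "closest file printed last so it takes precedence", are illustrative assumptions.

```shell
# Sketch of progressive disclosure: print every AGENTS.md from the repo
# root down to the target directory. The closest file is printed last,
# so its instructions can override earlier ones (assumed convention).
collect_agents_md() {
  local rel="$1" dir="." part
  [ -f "$dir/AGENTS.md" ] && echo "$dir/AGENTS.md"
  for part in ${rel//\// }; do          # split the relative path on '/'
    dir="$dir/$part"
    [ -f "$dir/AGENTS.md" ] && echo "$dir/AGENTS.md"
  done
  return 0
}
```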

── background ──────────────────────
"Garbage collection" agents run periodically:
- scan for stale docs
- detect architectural violations
- open fix-up PRs
"Doc-gardening" agent:
- cross-link and validate knowledge base

Three pillars:

  1. Context engineering (knowledge base + observability + browser via Chrome DevTools)
  2. Architectural constraints (custom linters + structural testing)
  3. Garbage collection (periodic entropy-fighting agents)
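As an illustration of the second pillar, here is a minimal architectural linter whose errors carry remediation hints an agent can act on. The 500-line cap and the message wording are assumptions, not OpenAI's actual rules.

```shell
# Illustrative architectural linter: reject files over a line cap and pair
# every error with a remediation hint, so an agent reading the output knows
# what to do next. The 500-line cap is an assumed policy.
lint_file_size() {
  local max=500 f lines status=0
  for f in "$@"; do
    lines=$(( $(wc -l < "$f") ))
    if [ "$lines" -gt "$max" ]; then
      echo "ERROR: $f has $lines lines (max $max)."
      echo "  remediation: split $f into smaller modules and re-export."
      status=1
    fi
  done
  return $status
}
```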

Key insight: "When the agent struggles, treat it as a signal. Identify what's missing and feed it back into the repo -- by having the agent write the fix."

Full reference | Source


3. Code Factory / Ralph -- Solo Autonomous Loop

Scale: Ships features while you sleep, 1 agent in a bash loop
Trigger: ./scripts/compound/loop.sh N or ralph.sh

prd.json (task inventory)
prompt.md (instructions)
AGENTS.md (conventions)
        |
        v
while stories remain:
        |
        v
┌───────────────┐
│  Agent picks  │ <-- reads prd.json, selects next by priority
│  next story   │
└───────┬───────┘
        |
        v
┌───────────────┐
│   Implement   │ <-- single context window per story
└───────┬───────┘
        |
        v
┌───────────────┐
│  Typecheck +  │ <-- must be fast, "broken code compounds"
│     Tests     │
└───────┬───────┘
        |
      pass? ──no──> skip, log failure
        |
       yes
        |
        v
┌───────────────┐
│  Auto-commit  │
│  Mark story   │
│     done      │
└───────┬───────┘
        |
        v
┌───────────────┐
│   Append to   │ <-- pattern accumulation
│ progress.txt  │     by iteration 10, agent understands patterns
└───────┬───────┘
        |
        └──> next iteration
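The loop above, reduced to a minimal ralph.sh-style sketch. The flat stories file and the pluggable check command are simplifications of the real prd.json-driven loop; the agent invocation itself is elided.

```shell
# Minimal ralph.sh-style loop sketch. Reads one story per line (the real
# loop reads prd.json), runs a check command standing in for typecheck +
# tests, and appends each outcome to a progress log for pattern accumulation.
run_loop() {
  local stories="$1" progress="$2" check="${3:-make typecheck test}"
  local story
  while read -r story; do
    echo "implementing: $story"            # the coding agent is invoked here
    if $check; then
      echo "$story: done" >> "$progress"   # auto-commit + mark story done
    else
      echo "$story: failed, skipped" >> "$progress"  # broken code compounds
    fi
  done < "$stories"
}
```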

── code review layer (Code Factory) ──
Risk tiers:
Low -> fully automated merge
Medium -> automated with CI gates
High -> require human confirmation

Review agent validates PR:
- review state must match current HEAD SHA
- evidence: tests + browser recording + review
- auto-resolve only bot-only stale threads
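The risk tiers above amount to a simple dispatcher. The tier names follow the list; the action strings are illustrative labels, not real commands.

```shell
# Route a PR by risk tier, following the Code Factory tiers above.
# Action strings are illustrative labels only.
merge_policy() {
  case "$1" in
    low)    echo "auto-merge" ;;
    medium) echo "merge-after-ci-gates" ;;
    high)   echo "await-human-confirmation" ;;
    *)      echo "unknown-tier: $1" >&2; return 1 ;;
  esac
}
```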

Key files: ralph.sh, prd.json, prompt.md, progress.txt, AGENTS.md

Key insight: Small stories, fast feedback, explicit criteria. "By iteration 10, the agent understands patterns from previous stories."

Full reference | Source


4. dmux -- Parallel Agents via tmux + Worktrees

Scale: N concurrent agents, each in an isolated git worktree
Trigger: Press n in the dmux TUI, type a prompt

     dmux TUI
         |
         |──> press 'n'
         |
         v
┌─────────────────┐
│  Generate slug  │ <-- AI-generated branch name via OpenRouter
└────────┬────────┘
         |
         v
┌─────────────────┐
│   Create git    │ <-- .dmux/worktrees/<slug>/
│    worktree     │     full independent working copy
└────────┬────────┘
         |
         v
┌─────────────────┐
│ Split tmux pane │
│  Launch agent   │ <-- claude/codex/opencode
│ (--acceptEdits) │
└────────┬────────┘
         |
         v
┌─────────────────┐
│   Agent works   │ <-- status detected via LLM analysis of terminal
│  autonomously   │     polls every 1s
└────────┬────────┘
         |
         v
  press 'm' to merge
         |
         v
┌─────────────────┐
│  AI commit msg  │ <-- conventional commits via OpenRouter
│  Merge to main  │
│ Remove worktree │
└─────────────────┘
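Underneath, the worktree lifecycle is plain git. The slug and the `.dmux/worktrees/<slug>` layout follow the diagram; the tmux pane split and agent launch are elided here.

```shell
# The worktree lifecycle dmux automates (tmux pane + agent launch elided).
spawn_worktree() {
  local slug="$1"
  git worktree add ".dmux/worktrees/$slug" -b "$slug"  # independent working copy
}
merge_worktree() {
  local slug="$1"
  git merge --no-edit "$slug"                  # bring the agent's work to main
  git worktree remove ".dmux/worktrees/$slug"  # clean up the working copy
  git branch -d "$slug"
}
```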

Hooks fire at each stage:
worktree_created -> e.g. pnpm install
pre_merge -> e.g. run tests
post_merge -> e.g. git push, close issue

A/B mode: Run two agents on same prompt side-by-side to compare outputs.

Key insight: Git worktrees give true isolation -- each agent has its own working copy, no conflicts. Hooks enable custom automation at every lifecycle point.

Full reference | Source


5. Superconductor -- Parallel Agents with Live Previews

Scale: N agents per ticket, cloud sandboxes, live browser previews
Trigger: Web dashboard, iOS app, Slack, or GitHub comment (@superconductor)

Create ticket (informal description)
         |
         v
┌─────────────────┐
│ Launch N agents │ <-- each gets isolated container
│ on same ticket  │     full repo, dev tools, test runners
└────────┬────────┘
         |
    ┌────┼────┐
    v    v    v
[Agent1][Agent2][Agent3] <-- Claude/Codex/Amp/Gemini
    |    |    |
    v    v    v
 [Live] [Live] [Live] <-- browser previews appear ~30s
 [prev] [prev] [prev]
    |    |    |
    └────┼────┘
         |
         v
┌──────────────────┐
│ Compare previews │ <-- interact with each, test functionality
│   Diff viewer    │     audit code changes across agents
└────────┬─────────┘
         |
         v
┌─────────────────┐
│   Select best   │
│  One-click PR   │
└─────────────────┘

Key insight: Fire many agents in parallel on the same task. Visual comparison of live previews is the quality gate, not just code review.

Full reference | Source


6. Terragon -- Background Fire-and-Forget Fleet

Scale: ~30 concurrent tasks/day, auto-PR creation
Trigger: Web dashboard, terry CLI, GitHub comment, mobile, Slack

Create task (any interface)
         |
         v
┌─────────────────┐
│   Spawn cloud   │ <-- fresh isolated container
│     sandbox     │     clone repo, create branch
└────────┬────────┘
         |
         v
┌─────────────────┐
│ Agent executes  │ <-- writes code, runs tests, iterates
│  autonomously   │     checkpoints pushed to GitHub
│  (background)   │     AI-generated commits
└────────┬────────┘
         |
         v
┌─────────────────┐
│   PR created    │ <-- automatic when agent finishes
│  automatically  │
└────────┬────────┘
         |
         v
┌─────────────────┐
│  Human reviews  │ <-- dashboard, CLI, or GitHub
│   and merges    │
└─────────────────┘

If agent struggles:
"Abandon and retry with different instructions"
(more effective than course-correcting)

Best for: exploration/prototyping, one-shot cleanup, boilerplate, context-intensive debugging.

Key insight: Async-first. Close your laptop, come back to finished PRs. Volume alone doesn't guarantee gains -- task selection matters.

Full reference | Source


7. Gas Town (Steve Yegge) -- K8s for Agents

Scale: 20-30 parallel Claude Code instances
Trigger: Task queue

Task queue
         |
         v
┌─────────────────┐
│  Orchestrator   │ <-- "K8s for agents"
│   (Gas Town)    │
└────────┬────────┘
         |
   ┌─────┼─────┼─────┐
   v     v     v     v
[Agent][Agent][Agent][Agent] ... x20-30
   |     |     |     |
   v     v     v     v
[Git-backed persistent state]
         |
         v
┌─────────────────┐
│   Merge queue   │ <-- conflict resolution between agents
└────────┬────────┘
         |
         v
┌─────────────────┐
│  Patrol agents  │ <-- quality control watchdogs
└────────┬────────┘
         |
         v
  merged to main

K8s analogy:

  • Pod = Agent
  • Health check = "Is it done?"
  • Service mesh = Merge queue
  • DaemonSet = Patrol agent

Full reference


Comparison Matrix

System          Trigger        Agents        Isolation        Quality Gate                           Human Role
--------------  -------------  ------------  ---------------  -------------------------------------  --------------------
Stripe Minions  Slack/auto     1 per task    Devbox (EC2)     Linters + selective CI + autofix       Review PR
OpenAI Harness  Prompt         1 per task    Worktree         Custom linters + agent review          Prioritize, validate
Code Factory    Cron/manual    1 (loop)      Branch           Typecheck + tests + browser recording  Review high-risk
dmux            TUI key        N (tmux)      Git worktree     Hooks (custom)                         Merge decision
Superconductor  Ticket         N per ticket  Cloud container  Live preview comparison                Select best
Terragon        Any interface  N (cloud)     Container        CI + auto-PR                           Review PR
Gas Town        Task queue     20-30         Git state        Patrol agents + merge queue            Supervise

Common Patterns

All systems follow roughly the same skeleton:

trigger (human or automated)
      |
      v
isolate (worktree / container / devbox)
      |
      v
agent works (agentic + deterministic nodes)
      |
      v
fast feedback (lint / typecheck / tests -- shift left)
      |
      v
quality gate (CI / agent review / live preview / patrol)
      |
      v
output (PR / branch / merged code)
      |
      v
human decision point (review / select / merge / abandon)
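The shared skeleton can also be written as a generic driver. Every function here is a placeholder stub; real pipelines substitute their own isolation, checks, and gates.

```shell
# Generic skeleton shared by all seven systems. Each function is a stub
# echo standing in for the system-specific step.
isolate()       { echo "isolate: sandbox for $1"; }
agent_work()    { echo "agent: working on $1"; }
fast_feedback() { echo "checks: lint / typecheck / tests"; }
quality_gate()  { echo "gate: CI / agent review / preview"; }

run_pipeline() {
  local task="$1"
  isolate "$task"
  agent_work "$task"
  if fast_feedback && quality_gate; then
    echo "output: PR for $task"           # human reviews / selects / merges
  else
    echo "hand-off: human decision for $task"
  fi
}
```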

Universal principles:

  1. Isolation first -- every agent gets its own sandbox
  2. Shift feedback left -- catch errors before CI, not after
  3. Context is scarce -- small focused instructions > one giant file
  4. Constraints enable speed -- linters and gates prevent drift
  5. Humans supervise loops; they don't sit inside them