Harness Engineering Report
A survey of how teams are setting up automated coding agent pipelines (Feb 2026).
1. Stripe Minions -- Enterprise Internal Fleet
Scale: 1,300 PRs/week, 0 human-written code
Trigger: Slack message, CLI, web UI, or automated (flaky test detected)
Slack msg / CLI / auto-trigger
|
v
┌─────────────────┐
│ Warm Devbox │ <-- EC2, pre-cloned repo, ~10s ready
│ (isolated) │ no internet, no prod access
└────────┬────────┘
|
v
┌─────────────────┐
│ Blueprint │ <-- state machine: deterministic + agentic nodes
│ Orchestration │
└────────┬────────┘
|
┌─────┴──────┐
| |
v v
[Agentic] [Deterministic]
"Implement" "Run linters"
"Fix CI" "Push changes"
| |
└─────┬───────┘
|
v
┌─────────────────┐
│ Local Lint │ <-- heuristic, <5s
│ (shift left) │
└────────┬────────┘
|
v
┌─────────────────┐
│ CI: selective │ <-- from 3M+ tests, only relevant
│ test run │
└────────┬────────┘
|
pass? ──no──> autofix? ──yes──> apply, retry once
  |                └──no──> hand to human
yes
|
v
┌─────────────────┐
│ PR created │ <-- follows Stripe PR template
│ (human review) │
└─────────────────┘
Context sources:
- Rule files (Cursor format, directory-scoped)
- MCP "Toolshed" (~500 internal tools, curated subset per agent)
- Pre-hydrated links from conversation context
Key insight: "Often one, at most two CI runs." Forked Block's Goose as base agent.
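Stripe hasn't published Blueprint's internals; this is a minimal sketch of the deterministic-plus-agentic node idea, with hypothetical node names and a plain dict as shared state:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    name: str
    agentic: bool                 # True = LLM call, False = deterministic tool
    run: Callable[[dict], dict]

def run_blueprint(nodes: list[Node], state: dict) -> dict:
    """Execute nodes in order, threading a shared state dict through each."""
    for node in nodes:
        state = node.run(state)
        state.setdefault("log", []).append(node.name)
    return state

# Hypothetical pipeline: one agentic "implement" node, then deterministic
# lint and push nodes -- mirroring the Agentic/Deterministic split above.
pipeline = [
    Node("implement", True, lambda s: {**s, "diff": f"fix for {s['task']}"}),
    Node("run_linters", False, lambda s: {**s, "lint_ok": True}),
    Node("push_changes", False, lambda s: {**s, "pushed": s["lint_ok"]}),
]

result = run_blueprint(pipeline, {"task": "flaky test"})
```

The point of the split: deterministic nodes are cheap and repeatable, so the state machine only spends model calls where judgment is needed.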
2. OpenAI Harness Engineering -- Zero Human Code
Scale: ~1M LOC in 5 months, 3.5 PRs/engineer/day
Trigger: Human writes a prompt describing a task
Engineer writes prompt
|
v
┌─────────────────┐
│ Codex agent │ <-- reads AGENTS.md (table of contents, ~100 lines)
│ (isolated │ walks dir tree root -> CWD
│ worktree) │ loads docs/ as needed (progressive disclosure)
└────────┬────────┘
|
v
┌─────────────────┐
│ Work depth- │ <-- break goal into building blocks
│ first │ design -> code -> test -> review
└────────┬────────┘
|
v
┌─────────────────┐
│ Custom linters │ <-- Codex-written, error msgs include remediation
│ (architectural │ enforce layer deps, naming, file size
│ constraints) │
└────────┬────────┘
|
v
┌─────────────────┐
│ Agent self- │ <-- review own changes
│ review │ request additional agent reviews
│ │ respond to feedback, iterate
└────────┬────────┘
|
v
┌─────────────────┐
│ Agent-to-agent │ <-- humans optional in review
│ review loop │ squash & merge when satisfied
└────────┬────────┘
|
v
┌─────────────────┐
│ PR merged │
└─────────────────┘
── background ──────────────────────
"Garbage collection" agents run periodically:
- scan for stale docs
- detect architectural violations
- open fix-up PRs
"Doc-gardening" agent:
- cross-link and validate knowledge base
Three pillars:
- Context engineering (knowledge base + observability + browser via Chrome DevTools)
- Architectural constraints (custom linters + structural testing)
- Garbage collection (periodic entropy-fighting agents)
Key insight: "When the agent struggles, treat it as a signal. Identify what's missing and feed it back into the repo -- by having the agent write the fix."
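The custom-linter idea is concrete enough to sketch. The layer names, line limit, and rule set below are illustrative, not OpenAI's actual rules; the key property is that every error carries its own remediation:

```python
# Architectural linter sketch: enforce layer dependencies and file size,
# with remediation instructions embedded in each error message so an agent
# can act on the failure without extra context.
LAYER_ORDER = ["domain", "service", "api"]   # lower layers must not import higher
MAX_LINES = 400                              # illustrative limit

def lint_imports(file_layer: str, imported_layer: str) -> list[str]:
    errors = []
    if LAYER_ORDER.index(file_layer) < LAYER_ORDER.index(imported_layer):
        errors.append(
            f"{file_layer} imports {imported_layer}: invert the dependency. "
            f"Remediation: move the shared types into '{file_layer}' or a lower layer."
        )
    return errors

def lint_size(path: str, line_count: int) -> list[str]:
    if line_count > MAX_LINES:
        return [f"{path} has {line_count} lines (max {MAX_LINES}). "
                f"Remediation: split by responsibility into smaller modules."]
    return []
```

An agent hitting `lint_imports("domain", "api")` gets told what to do, not just that it failed -- which is what keeps the fix loop to one iteration.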
3. Code Factory / Ralph -- Solo Autonomous Loop
Scale: Ships features while you sleep, 1 agent in a bash loop
Trigger: ./scripts/compound/loop.sh N or ralph.sh
prd.json (task inventory)
prompt.md (instructions)
AGENTS.md (conventions)
|
v
while stories remain:
|
v
┌───────────────┐
│ Agent picks │ <-- reads prd.json, selects next by priority
│ next story │
└───────┬───────┘
|
v
┌───────────────┐
│ Implement │ <-- single context window per story
└───────┬───────┘
|
v
┌───────────────┐
│ Typecheck + │ <-- must be fast, "broken code compounds"
│ Tests │
└───────┬───────┘
|
pass? ──no──> skip, log failure
|
yes
|
v
┌───────────────┐
│ Auto-commit │
│ Mark story │
│ done │
└───────┬───────┘
|
v
┌───────────────┐
│ Append to │ <-- pattern accumulation
│ progress.txt │ by iteration 10, agent understands patterns
└───────┬───────┘
|
└──> next iteration
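The loop above is simple enough to sketch end to end. Story fields and the `checks` callback are assumptions, not Ralph's actual prd.json schema; the structure (priority pick, gate, commit, log) follows the diagram:

```python
def run_loop(prd: dict, checks, progress: list[str]) -> dict:
    """One story per iteration: select by priority, gate, mark done, record."""
    while True:
        todo = [s for s in prd["stories"] if not s["done"] and not s.get("skipped")]
        if not todo:
            break
        story = min(todo, key=lambda s: s["priority"])  # next by priority
        if not checks(story):                           # typecheck + tests
            progress.append(f"FAIL {story['id']}")      # skip, log failure
            story["skipped"] = True
            continue
        story["done"] = True                            # auto-commit + mark done
        progress.append(f"DONE {story['id']}")          # pattern accumulation
    return prd
```

`progress` plays the role of progress.txt: each iteration appends, so later iterations (or later runs) can read what patterns already worked.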
── code review layer (Code Factory) ──
Risk tiers:
Low -> fully automated merge
Medium -> automated with CI gates
High -> require human confirmation
Review agent validates PR:
- review state must match current HEAD SHA
- evidence: tests + browser recording + review
- auto-resolve only bot-only stale threads
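The two gates above compose naturally: the SHA check runs first, then the risk tier decides the merge path. A minimal sketch, assuming a flat PR dict (the tier names are from the report; the PR structure is an assumption):

```python
# Route a PR by risk tier, but only if the review still matches HEAD.
TIER_POLICY = {
    "low": "auto_merge",          # fully automated merge
    "medium": "merge_after_ci",   # automated with CI gates
    "high": "human_confirm",      # require human confirmation
}

def review_gate(pr: dict) -> str:
    if pr["review_sha"] != pr["head_sha"]:
        return "stale_review"     # code moved since review; re-review required
    return TIER_POLICY[pr["risk"]]
```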
Key files: ralph.sh, prd.json, prompt.md, progress.txt, AGENTS.md
Key insight: Small stories, fast feedback, explicit criteria. "By iteration 10, the agent understands patterns from previous stories."
4. dmux -- Parallel Agents via tmux + Worktrees
Scale: N concurrent agents, each in isolated git worktree
Trigger: Press n in dmux TUI, type a prompt
dmux TUI
|
|──> press 'n'
|
v
┌─────────────────┐
│ Generate slug │ <-- AI-generated branch name via OpenRouter
└────────┬────────┘
|
v
┌─────────────────┐
│ Create git │ <-- .dmux/worktrees/<slug>/
│ worktree │ full independent working copy
└────────┬────────┘
|
v
┌─────────────────┐
│ Split tmux pane │
│ Launch agent │ <-- claude/codex/opencode
│ (--acceptEdits) │
└────────┬────────┘
|
v
┌─────────────────┐
│ Agent works │ <-- status detected via LLM analysis of terminal
│ autonomously │ polls every 1s
└────────┬────────┘
|
v
press 'm' to merge
|
v
┌─────────────────┐
│ AI commit msg │ <-- conventional commits via OpenRouter
│ Merge to main │
│ Remove worktree │
└─────────────────┘
Hooks fire at each stage:
worktree_created -> e.g. pnpm install
pre_merge -> e.g. run tests
post_merge -> e.g. git push, close issue
A/B mode: Run two agents on same prompt side-by-side to compare outputs.
Key insight: Git worktrees give true isolation -- each agent has its own working copy, no conflicts. Hooks enable custom automation at every lifecycle point.
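The dmux lifecycle reduces to a short sequence of git and tmux commands. This sketch just assembles them for a given slug (the paths and the `--acceptEdits` flag mirror the diagram; dmux's actual invocation may differ):

```python
def worktree_commands(slug: str, agent: str = "claude") -> list[str]:
    """Commands to spin up one isolated agent: worktree, pane, launch."""
    path = f".dmux/worktrees/{slug}"
    return [
        f"git worktree add {path} -b {slug}",  # full independent working copy
        f"tmux split-window -c {path}",        # new pane rooted in the worktree
        f"{agent} --acceptEdits",              # launch the agent in that pane
    ]

def merge_commands(slug: str, message: str) -> list[str]:
    """Commands for the 'press m' path: commit, merge, clean up."""
    path = f".dmux/worktrees/{slug}"
    return [
        f'git -C {path} commit -am "{message}"',  # AI-generated commit message
        f"git merge {slug}",
        f"git worktree remove {path}",
        f"git branch -d {slug}",
    ]
```

The hooks in the list above slot between these steps: `worktree_created` after the first command, `pre_merge` before `git merge`, `post_merge` after cleanup.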
5. Superconductor -- Parallel Agents with Live Previews
Scale: N agents per ticket, cloud sandboxes, live browser previews
Trigger: Web dashboard, iOS app, Slack, or GitHub comment (@superconductor)
Create ticket (informal description)
|
v
┌─────────────────┐
│ Launch N agents │ <-- each gets isolated container
│ on same ticket │ full repo, dev tools, test runners
└────────┬────────┘
|
┌─────┼─────┐
v v v
[Agent1][Agent2][Agent3] <-- Claude/Codex/Amp/Gemini
| | |
v v v
[Live] [Live] [Live] <-- browser previews appear ~30s
[prev] [prev] [prev]
| | |
└─────┼─────┘
|
v
┌──────────────────┐
│ Compare previews │ <-- interact with each, test functionality
│ Diff viewer      │     audit code changes across agents
└────────┬─────────┘
|
v
┌─────────────────┐
│ Select best │
│ One-click PR │
└─────────────────┘
Key insight: Fire many agents in parallel on the same task. Visual comparison of live previews is the quality gate, not just code review.
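The fan-out/select pattern itself is a few lines. Real agents are cloud sandboxes with live previews; in this sketch they are plain callables, and `score` stands in for the human comparing previews:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(ticket: str, agents: list, score) -> tuple[str, str]:
    """Run every agent on the same ticket concurrently, return the best result."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        results = list(pool.map(lambda a: (a.__name__, a(ticket)), agents))
    return max(results, key=lambda r: score(r[1]))   # select best, one-click PR
```

The design choice worth noting: selection happens after all candidates finish, so a bad agent costs compute but never blocks the others.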
6. Terragon -- Background Fire-and-Forget Fleet
Scale: ~30 concurrent tasks/day, auto-PR creation
Trigger: Web dashboard, terry CLI, GitHub comment, mobile, Slack
Create task (any interface)
|
v
┌─────────────────┐
│ Spawn cloud │ <-- fresh isolated container
│ sandbox │ clone repo, create branch
└──────── ┬────────┘
|
v
┌─────────────────┐
│ Agent executes │ <-- writes code, runs tests, iterates
│ autonomously │ checkpoints pushed to GitHub
│ (background) │ AI-generated commits
└────────┬────────┘
|
v
┌─────────────────┐
│ PR created │ <-- automatic when agent finishes
│ automatically │
└────────┬────────┘
|
v
┌─────────────────┐
│ Human reviews │ <-- dashboard, CLI, or GitHub
│ and merges │
└─────────────────┘
If agent struggles:
"Abandon and retry with different instructions"
(more effective than course-correcting)
Best for: exploration/prototyping, one-shot cleanup, boilerplate, context-intensive debugging.
Key insight: Async-first. Close your laptop, come back to finished PRs. Volume alone doesn't guarantee gains -- task selection matters.

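The "abandon and retry" advice can be expressed as a small wrapper: discard the failed attempt entirely and relaunch with rephrased instructions, rather than steering mid-run. `run_agent` and `rephrase` are stand-ins for the real task API:

```python
def retry_with_new_instructions(run_agent, prompt: str, rephrase, max_attempts: int = 3):
    """Fresh sandbox per attempt; on failure, rewrite the prompt and relaunch."""
    for attempt in range(max_attempts):
        result = run_agent(prompt)
        if result is not None:               # agent produced a PR
            return result
        prompt = rephrase(prompt, attempt)   # abandon, retry with new instructions
    return None
```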
7. Gas Town (Steve Yegge) -- K8s for Agents
Scale: 20-30 parallel Claude Code instances
Trigger: Task queue
Task queue
|
v
┌─────────────────┐
│ Orchestrator │ <-- "K8s for agents"
│ (Gas Town) │
└────────┬────────┘
|
┌─────┼─────┼─────┐
v v v v
[Agent][Agent][Agent][Agent] ... x20-30
| | | |
v v v v
[Git-backed persistent state]
|
v
┌─────────────────┐
│ Merge queue │ <-- conflict resolution between agents
└────────┬────────┘
|
v
┌─────────────────┐
│ Patrol agents │ <-- quality control watchdogs
└────────┬────────┘
|
v
merged to main
K8s analogy: Pod=Agent, Health check="Is it done?", Service mesh=Merge queue, DaemonSet=Patrol agent.
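Gas Town's merge-queue-plus-patrol stage can be sketched as a drain loop: agent branches queue up, patrol checks act as watchdogs, and only branches that satisfy all of them land on main. The patrol predicates here are a simplification of whatever the real patrol agents check:

```python
from collections import deque

def drain_merge_queue(branches: list[dict], patrols: list) -> list[str]:
    """Merge each queued branch only if every patrol check passes."""
    queue, merged = deque(branches), []
    while queue:
        branch = queue.popleft()
        if all(patrol(branch) for patrol in patrols):  # watchdogs must agree
            merged.append(branch["name"])              # merged to main
    return merged
```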
Comparison Matrix
| System | Trigger | Agents | Isolation | Quality Gate | Human Role |
|---|---|---|---|---|---|
| Stripe Minions | Slack/auto | 1 per task | Devbox (EC2) | Linters + selective CI + autofix | Review PR |
| OpenAI Harness | Prompt | 1 per task | Worktree | Custom linters + agent review | Prioritize, validate |
| Code Factory | Cron/manual | 1 (loop) | Branch | Typecheck + tests + browser recording | Review high-risk |
| dmux | TUI key | N (tmux) | Git worktree | Hooks (custom) | Merge decision |
| Superconductor | Ticket | N per ticket | Cloud container | Live preview comparison | Select best |
| Terragon | Any interface | N (cloud) | Container | CI + auto-PR | Review PR |
| Gas Town | Task queue | 20-30 | Git state | Patrol agents + merge queue | Supervise |
Common Patterns
All systems follow roughly the same skeleton:
trigger (human or automated)
|
v
isolate (worktree / container / devbox)
|
v
agent works (agentic + deterministic nodes)
|
v
fast feedback (lint / typecheck / tests -- shift left)
|
v
quality gate (CI / agent review / live preview / patrol)
|
v
output (PR / branch / merged code)
|
v
human decision point (review / select / merge / abandon)
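The skeleton above can be sketched as a generic staged pipeline: each stage either advances the task or stops it at a gate, and a stopped task falls back to the human decision point. Stage names follow the diagram; the gate logic is illustrative:

```python
def run_skeleton(task: dict, stages: list) -> dict:
    """Run (name, stage) pairs in order; a failing stage halts the pipeline."""
    for name, stage in stages:
        task, passed = stage(task)
        task.setdefault("trace", []).append(name)
        if not passed:
            task["stopped_at"] = name   # hand back to the human decision point
            break
    return task
```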
Universal principles:
- Isolation first -- every agent gets its own sandbox
- Shift feedback left -- catch errors before CI, not after
- Context is scarce -- small focused instructions > one giant file
- Constraints enable speed -- linters and gates prevent drift
- Humans supervise loops, not sit inside them