
Harness Engineering Report

8 min read

A survey of how teams are setting up automated coding agent pipelines (Feb 2026).


1. Stripe Minions -- Enterprise Internal Fleet

Scale: 1,300 PRs/week, 0 human-written code
Trigger: Slack message, CLI, web UI, or automated (flaky test detected)

Slack msg / CLI / auto-trigger
         |
         v
┌─────────────────┐
│   Warm Devbox   │ <-- EC2, pre-cloned repo, ~10s ready
│   (isolated)    │     no internet, no prod access
└────────┬────────┘
         |
         v
┌─────────────────┐
│    Blueprint    │ <-- state machine: deterministic + agentic nodes
│  Orchestration  │
└────────┬────────┘
         |
   ┌─────┴──────┐
   |            |
   v            v
[Agentic]   [Deterministic]
"Implement"  "Run linters"
"Fix CI"     "Push changes"
   |            |
   └─────┬──────┘
         |
         v
┌─────────────────┐
│   Local Lint    │ <-- heuristic, <5s
│  (shift left)   │
└────────┬────────┘
         |
         v
┌─────────────────┐
│  CI: selective  │ <-- from 3M+ tests, only relevant
│    test run     │
└────────┬────────┘
         |
       pass? ──no──> autofix? ──yes──> apply, retry once
         |                │
        yes               └──no──> hand to human
         |
         v
┌─────────────────┐
│   PR created    │ <-- follows Stripe PR template
│ (human review)  │
└─────────────────┘

Context sources:

  • Rule files (Cursor format, directory-scoped)
  • MCP "Toolshed" (~500 internal tools, curated subset per agent)
  • Pre-hydrated links from conversation context
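The pass/autofix gate in the diagram can be sketched as a small script. Both `run_ci` and `try_autofix` are stand-in stubs for Stripe's internal tooling; their behavior here is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the CI gate: run selective tests, attempt at most one autofix
# retry, otherwise hand the task to a human. Both functions are stubs.
attempt=0
run_ci() { attempt=$((attempt + 1)); [ "$attempt" -gt 1 ]; }  # stub: fails once
try_autofix() { return 0; }                                   # stub: fix applies

if run_ci; then
  result="pass"
elif try_autofix && run_ci; then   # apply fix, retry exactly once
  result="pass-after-autofix"
else
  result="hand-to-human"
fi
echo "$result"
```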

Key insight: "Often one, at most two CI runs." The base agent is a fork of Block's Goose.

Full reference | Source


2. OpenAI Harness Engineering -- Zero Human Code

Scale: ~1M LOC in 5 months, 3.5 PRs/engineer/day
Trigger: Human writes a prompt describing a task

Engineer writes prompt
         |
         v
┌─────────────────┐
│   Codex agent   │ <-- reads AGENTS.md (table of contents, ~100 lines)
│    (isolated    │     walks dir tree root -> CWD
│    worktree)    │     loads docs/ as needed (progressive disclosure)
└────────┬────────┘
         |
         v
┌─────────────────┐
│   Work depth-   │ <-- break goal into building blocks
│      first      │     design -> code -> test -> review
└────────┬────────┘
         |
         v
┌─────────────────┐
│  Custom linters │ <-- Codex-written, error msgs include remediation
│  (architectural │     enforce layer deps, naming, file size
│   constraints)  │
└────────┬────────┘
         |
         v
┌─────────────────┐
│   Agent self-   │ <-- review own changes
│     review      │     request additional agent reviews
│                 │     respond to feedback, iterate
└────────┬────────┘
         |
         v
┌─────────────────┐
│ Agent-to-agent  │ <-- humans optional in review
│   review loop   │     squash & merge when satisfied
└────────┬────────┘
         |
         v
┌─────────────────┐
│    PR merged    │
└─────────────────┘
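The root-to-CWD walk (progressive disclosure) can be sketched as a small helper. Treating the argument as a directory path relative to the repo root, and "closest file printed last so it takes precedence", are illustrative assumptions.

```shell
# Sketch of progressive disclosure: print every AGENTS.md from the repo
# root down to the target directory. The closest file is printed last,
# so its instructions can override earlier ones (assumed convention).
collect_agents_md() {
  local rel="$1" dir="." part
  [ -f "$dir/AGENTS.md" ] && echo "$dir/AGENTS.md"
  for part in ${rel//\// }; do          # split the relative path on '/'
    dir="$dir/$part"
    [ -f "$dir/AGENTS.md" ] && echo "$dir/AGENTS.md"
  done
  return 0
}
```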

── background ──────────────────────
"Garbage collection" agents run periodically:
- scan for stale docs
- detect architectural violations
- open fix-up PRs
"Doc-gardening" agent:
- cross-link and validate knowledge base

Three pillars:

  1. Context engineering (knowledge base + observability + browser via Chrome DevTools)
  2. Architectural constraints (custom linters + structural testing)
  3. Garbage collection (periodic entropy-fighting agents)
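As an illustration of the second pillar, here is a minimal architectural linter whose errors carry remediation hints an agent can act on. The 500-line cap and the message wording are assumptions, not OpenAI's actual rules.

```shell
# Illustrative architectural linter: reject files over a line cap and pair
# every error with a remediation hint, so an agent reading the output knows
# what to do next. The 500-line cap is an assumed policy.
lint_file_size() {
  local max=500 f lines status=0
  for f in "$@"; do
    lines=$(( $(wc -l < "$f") ))
    if [ "$lines" -gt "$max" ]; then
      echo "ERROR: $f has $lines lines (max $max)."
      echo "  remediation: split $f into smaller modules and re-export."
      status=1
    fi
  done
  return $status
}
```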

Key insight: "When the agent struggles, treat it as a signal. Identify what's missing and feed it back into the repo -- by having the agent write the fix."

Full reference | Source


3. Code Factory / Ralph -- Solo Autonomous Loop

Scale: Ships features while you sleep, 1 agent in a bash loop
Trigger: ./scripts/compound/loop.sh N or ralph.sh

prd.json (task inventory)
prompt.md (instructions)
AGENTS.md (conventions)
        |
        v
while stories remain:
        |
        v
┌───────────────┐
│  Agent picks  │ <-- reads prd.json, selects next by priority
│  next story   │
└───────┬───────┘
        |
        v
┌───────────────┐
│   Implement   │ <-- single context window per story
└───────┬───────┘
        |
        v
┌───────────────┐
│  Typecheck +  │ <-- must be fast, "broken code compounds"
│     Tests     │
└───────┬───────┘
        |
      pass? ──no──> skip, log failure
        |
       yes
        |
        v
┌───────────────┐
│  Auto-commit  │
│  Mark story   │
│     done      │
└───────┬───────┘
        |
        v
┌───────────────┐
│   Append to   │ <-- pattern accumulation
│ progress.txt  │     by iteration 10, agent understands patterns
└───────┬───────┘
        |
        └──> next iteration
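The loop above, reduced to a minimal ralph.sh-style sketch. The flat stories file and the pluggable check command are simplifications of the real prd.json-driven loop; the agent invocation itself is elided.

```shell
# Minimal ralph.sh-style loop sketch. Reads one story per line (the real
# loop reads prd.json), runs a check command standing in for typecheck +
# tests, and appends each outcome to a progress log for pattern accumulation.
run_loop() {
  local stories="$1" progress="$2" check="${3:-make typecheck test}"
  local story
  while read -r story; do
    echo "implementing: $story"            # the coding agent is invoked here
    if $check; then
      echo "$story: done" >> "$progress"   # auto-commit + mark story done
    else
      echo "$story: failed, skipped" >> "$progress"  # broken code compounds
    fi
  done < "$stories"
}
```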

── code review layer (Code Factory) ──
Risk tiers:
Low -> fully automated merge
Medium -> automated with CI gates
High -> require human confirmation

Review agent validates PR:
- review state must match current HEAD SHA
- evidence: tests + browser recording + review
- auto-resolve only bot-only stale threads
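The risk tiers above amount to a simple dispatcher. The tier names follow the list; the action strings are illustrative labels, not real commands.

```shell
# Route a PR by risk tier, following the Code Factory tiers above.
# Action strings are illustrative labels only.
merge_policy() {
  case "$1" in
    low)    echo "auto-merge" ;;
    medium) echo "merge-after-ci-gates" ;;
    high)   echo "await-human-confirmation" ;;
    *)      echo "unknown-tier: $1" >&2; return 1 ;;
  esac
}
```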

Key files: ralph.sh, prd.json, prompt.md, progress.txt, AGENTS.md

Key insight: Small stories, fast feedback, explicit criteria. "By iteration 10, the agent understands patterns from previous stories."

Full reference | Source


4. dmux -- Parallel Agents via tmux + Worktrees

Scale: N concurrent agents, each in an isolated git worktree
Trigger: Press n in the dmux TUI, type a prompt

     dmux TUI
         |
         |──> press 'n'
         |
         v
┌─────────────────┐
│  Generate slug  │ <-- AI-generated branch name via OpenRouter
└────────┬────────┘
         |
         v
┌─────────────────┐
│   Create git    │ <-- .dmux/worktrees/<slug>/
│    worktree     │     full independent working copy
└────────┬────────┘
         |
         v
┌─────────────────┐
│ Split tmux pane │
│  Launch agent   │ <-- claude/codex/opencode
│ (--acceptEdits) │
└────────┬────────┘
         |
         v
┌─────────────────┐
│   Agent works   │ <-- status detected via LLM analysis of terminal
│  autonomously   │     polls every 1s
└────────┬────────┘
         |
         v
  press 'm' to merge
         |
         v
┌─────────────────┐
│  AI commit msg  │ <-- conventional commits via OpenRouter
│  Merge to main  │
│ Remove worktree │
└─────────────────┘
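Underneath, the worktree lifecycle is plain git. The slug and the `.dmux/worktrees/<slug>` layout follow the diagram; the tmux pane split and agent launch are elided here.

```shell
# The worktree lifecycle dmux automates (tmux pane + agent launch elided).
spawn_worktree() {
  local slug="$1"
  git worktree add ".dmux/worktrees/$slug" -b "$slug"  # independent working copy
}
merge_worktree() {
  local slug="$1"
  git merge --no-edit "$slug"                  # bring the agent's work to main
  git worktree remove ".dmux/worktrees/$slug"  # clean up the working copy
  git branch -d "$slug"
}
```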

Hooks fire at each stage:
worktree_created -> e.g. pnpm install
pre_merge -> e.g. run tests
post_merge -> e.g. git push, close issue

A/B mode: Run two agents on same prompt side-by-side to compare outputs.

Key insight: Git worktrees give true isolation -- each agent has its own working copy, no conflicts. Hooks enable custom automation at every lifecycle point.

Full reference | Source


5. Superconductor -- Parallel Agents with Live Previews

Scale: N agents per ticket, cloud sandboxes, live browser previews
Trigger: Web dashboard, iOS app, Slack, or GitHub comment (@superconductor)

Create ticket (informal description)
         |
         v
┌─────────────────┐
│ Launch N agents │ <-- each gets isolated container
│ on same ticket  │     full repo, dev tools, test runners
└────────┬────────┘
         |
    ┌────┼────┐
    v    v    v
[Agent1][Agent2][Agent3] <-- Claude/Codex/Amp/Gemini
    |    |    |
    v    v    v
 [Live] [Live] [Live] <-- browser previews appear ~30s
 [prev] [prev] [prev]
    |    |    |
    └────┼────┘
         |
         v
┌──────────────────┐
│ Compare previews │ <-- interact with each, test functionality
│   Diff viewer    │     audit code changes across agents
└────────┬─────────┘
         |
         v
┌─────────────────┐
│   Select best   │
│  One-click PR   │
└─────────────────┘

Key insight: Fire many agents in parallel on the same task. Visual comparison of live previews is the quality gate, not just code review.

Full reference | Source


6. Terragon -- Background Fire-and-Forget Fleet

Scale: ~30 concurrent tasks/day, auto-PR creation
Trigger: Web dashboard, terry CLI, GitHub comment, mobile, Slack

Create task (any interface)
         |
         v
┌─────────────────┐
│   Spawn cloud   │ <-- fresh isolated container
│     sandbox     │     clone repo, create branch
└────────┬────────┘
         |
         v
┌─────────────────┐
│ Agent executes  │ <-- writes code, runs tests, iterates
│  autonomously   │     checkpoints pushed to GitHub
│  (background)   │     AI-generated commits
└────────┬────────┘
         |
         v
┌─────────────────┐
│   PR created    │ <-- automatic when agent finishes
│  automatically  │
└────────┬────────┘
         |
         v
┌─────────────────┐
│  Human reviews  │ <-- dashboard, CLI, or GitHub
│   and merges    │
└─────────────────┘

If agent struggles:
"Abandon and retry with different instructions"
(more effective than course-correcting)

Best for: exploration/prototyping, one-shot cleanup, boilerplate, context-intensive debugging.

Key insight: Async-first. Close your laptop, come back to finished PRs. Volume alone doesn't guarantee gains -- task selection matters.

Full reference | Source


7. Gas Town (Steve Yegge) -- K8s for Agents

Scale: 20-30 parallel Claude Code instances
Trigger: Task queue

Task queue
         |
         v
┌─────────────────┐
│  Orchestrator   │ <-- "K8s for agents"
│   (Gas Town)    │
└────────┬────────┘
         |
   ┌─────┼─────┼─────┐
   v     v     v     v
[Agent][Agent][Agent][Agent] ... x20-30
   |     |     |     |
   v     v     v     v
[Git-backed persistent state]
         |
         v
┌─────────────────┐
│   Merge queue   │ <-- conflict resolution between agents
└────────┬────────┘
         |
         v
┌─────────────────┐
│  Patrol agents  │ <-- quality control watchdogs
└────────┬────────┘
         |
         v
  merged to main

K8s analogy:

  • Pod = Agent
  • Health check = "Is it done?"
  • Service mesh = Merge queue
  • DaemonSet = Patrol agent

Full reference


Comparison Matrix

System          Trigger        Agents        Isolation        Quality Gate                           Human Role
--------------  -------------  ------------  ---------------  -------------------------------------  --------------------
Stripe Minions  Slack/auto     1 per task    Devbox (EC2)     Linters + selective CI + autofix       Review PR
OpenAI Harness  Prompt         1 per task    Worktree         Custom linters + agent review          Prioritize, validate
Code Factory    Cron/manual    1 (loop)      Branch           Typecheck + tests + browser recording  Review high-risk
dmux            TUI key        N (tmux)      Git worktree     Hooks (custom)                         Merge decision
Superconductor  Ticket         N per ticket  Cloud container  Live preview comparison                Select best
Terragon        Any interface  N (cloud)     Container        CI + auto-PR                           Review PR
Gas Town        Task queue     20-30         Git state        Patrol agents + merge queue            Supervise

Common Patterns

All systems follow roughly the same skeleton:

trigger (human or automated)
      |
      v
isolate (worktree / container / devbox)
      |
      v
agent works (agentic + deterministic nodes)
      |
      v
fast feedback (lint / typecheck / tests -- shift left)
      |
      v
quality gate (CI / agent review / live preview / patrol)
      |
      v
output (PR / branch / merged code)
      |
      v
human decision point (review / select / merge / abandon)
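The shared skeleton can also be written as a generic driver. Every function here is a placeholder stub; real pipelines substitute their own isolation, checks, and gates.

```shell
# Generic skeleton shared by all seven systems. Each function is a stub
# echo standing in for the system-specific step.
isolate()       { echo "isolate: sandbox for $1"; }
agent_work()    { echo "agent: working on $1"; }
fast_feedback() { echo "checks: lint / typecheck / tests"; }
quality_gate()  { echo "gate: CI / agent review / preview"; }

run_pipeline() {
  local task="$1"
  isolate "$task"
  agent_work "$task"
  if fast_feedback && quality_gate; then
    echo "output: PR for $task"           # human reviews / selects / merges
  else
    echo "hand-off: human decision for $task"
  fi
}
```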

Universal principles:

  1. Isolation first -- every agent gets its own sandbox
  2. Shift feedback left -- catch errors before CI, not after
  3. Context is scarce -- small focused instructions > one giant file
  4. Constraints enable speed -- linters and gates prevent drift
  5. Humans supervise loops; they don't sit inside them