Harness Engineering: Leveraging Codex in an Agent-First World
Source: OpenAI Blog | Martin Fowler analysis
Author: Ryan Lopopolo, OpenAI
Date: 2026
The Experiment
Starting late August 2025, OpenAI built an internal project with a radical constraint: no manually-written code. Even the initial AGENTS.md was written by Codex. Everything -- application logic, tests, CI, docs, tooling -- generated by agents.
Results
- ~1M lines of code over 5 months
- ~1,500 PRs merged by 3 engineers (later 7)
- Average: 3.5 PRs per engineer per day
- Throughput increased as team grew
The Shift
A software engineering team's primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow agents to do reliable work.
Humans interact through prompts: describe a task, run the agent, let it open a PR. Humans work at a different abstraction layer -- prioritizing work, translating user feedback into acceptance criteria, and validating outcomes.
Context Management
Failed: One Big AGENTS.md
Context is scarce. A giant instruction file crowds out the task, code, and docs. When everything is "important," nothing is.
Solution: AGENTS.md as Table of Contents
Short AGENTS.md (~100 lines) as a map with pointers to deeper sources:
- docs/ directory as system of record
- Progressive disclosure: agents start small, learn where to look
- Codex walks the directory tree root -> CWD, loading AGENTS.override.md then AGENTS.md
- Combined size capped at 32 KiB
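The loading order described above can be sketched in a few lines. This is a minimal illustration, not Codex's actual implementation; the file names and the 32 KiB cap come from the source, while the function name and truncation behavior are assumptions:

```python
from pathlib import Path

MAX_BYTES = 32 * 1024  # combined-size cap stated in the source

def collect_agent_docs(repo_root: Path, cwd: Path) -> str:
    """Concatenate agent docs from repo root down to cwd, within the byte cap."""
    # Build the directory list ordered root -> cwd.
    dirs, cur = [], cwd
    while True:
        dirs.append(cur)
        if cur == repo_root:
            break
        cur = cur.parent
    dirs.reverse()

    budget, chunks = MAX_BYTES, []
    for d in dirs:
        # In each directory, the override file is loaded before the regular file.
        for name in ("AGENTS.override.md", "AGENTS.md"):
            f = d / name
            if f.is_file() and budget > 0:
                data = f.read_bytes()[:budget]  # truncate to remaining budget
                budget -= len(data)
                chunks.append(data.decode("utf-8", errors="ignore"))
    return "".join(chunks)
```

Deeper directories land later in the prompt, so the most specific instructions appear closest to the task.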
Knowledge Base Structure
- Design docs -- catalogued, indexed, with verification status + "core beliefs"
- Architecture docs -- domain/package layer map
- Quality grades -- per domain, per layer, tracked over time
- Execution plans -- checked into repo, with progress and decision logs
- Doc-gardening agent -- scans for stale docs, opens fix-up PRs
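A doc-gardening pass of the kind listed above could start with something as simple as checking docs for links to files that no longer exist. This is a hypothetical sketch (the source does not describe the agent's internals); the `docs/` location matches the source, everything else is assumed:

```python
import re
from pathlib import Path

LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#]+)\)")  # markdown link targets

def find_stale_docs(repo_root: Path) -> dict[str, list[str]]:
    """Map each doc under docs/ to relative link targets that no longer exist."""
    stale: dict[str, list[str]] = {}
    for doc in (repo_root / "docs").rglob("*.md"):
        broken = [
            target
            for target in LINK_RE.findall(doc.read_text())
            # Skip external URLs; check targets relative to the doc and the root.
            if not target.startswith(("http://", "https://"))
            and not (doc.parent / target).exists()
            and not (repo_root / target).exists()
        ]
        if broken:
            stale[str(doc.relative_to(repo_root))] = broken
    return stale
```

The report from a scan like this is what the gardening agent would turn into fix-up PRs.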
Architectural Constraints
Rigid model: each domain divided into fixed layers with strictly validated dependency directions.
Enforced mechanically via custom linters (Codex-generated). Custom lint error messages inject remediation instructions into agent context.
Constraints are what allow speed without decay. Rules become multipliers: once encoded, they apply everywhere at once.
Agent Tools
- Standard dev tools (gh, local scripts, repo-embedded skills)
- Chrome DevTools Protocol -- DOM snapshots, screenshots, navigation for UI validation
- Local observability stack -- logs, metrics, traces, ephemeral per worktree
PR Workflow
Minimal blocking merge gates. PRs are short-lived.
Agent-driven lifecycle:
- Review own changes locally
- Request additional agent reviews (local + cloud)
- Respond to human/agent feedback
- Iterate until all reviewers satisfied
- Squash and merge
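The lifecycle above is essentially a loop: gather reviewer feedback, respond, repeat until everyone is satisfied, then squash-merge. A toy sketch, with an entirely hypothetical reviewer interface standing in for local and cloud review agents:

```python
from dataclasses import dataclass, field

@dataclass
class Reviewer:
    """Stand-in for a local or cloud review agent (hypothetical interface)."""
    pending: list[str] = field(default_factory=list)

    def feedback(self) -> list[str]:
        out, self.pending = self.pending, []
        return out

    def satisfied(self) -> bool:
        return not self.pending

def drive_pr(apply_fix, reviewers: list[Reviewer]) -> str:
    """Iterate on review feedback until all reviewers are satisfied, then merge."""
    while not all(r.satisfied() for r in reviewers):
        for r in reviewers:
            for comment in r.feedback():
                apply_fix(comment)  # the agent responds to each piece of feedback
    return "squash-merged"
```

The human's role in this loop is reduced to supplying feedback as one more reviewer, which is why review effort can migrate to agent-to-agent over time.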
Over time, almost all review effort shifted to agent-to-agent review.
Three Components of the Harness
- Context Engineering -- knowledge base + dynamic context (observability, browser)
- Architectural Constraints -- LLM-based + deterministic linters, structural testing
- "Garbage Collection" -- periodic agents finding doc inconsistencies, constraint violations, fighting entropy
Key Philosophy
When the agent struggles, treat it as a signal: identify what is missing -- tools, guardrails, documentation -- and feed it back into the repository, always by having the agent itself write the fix.