Harness Engineering: Leveraging Codex in an Agent-First World
Source: OpenAI Blog | Martin Fowler analysis
Author: Ryan Lopopolo, OpenAI
Date: 2026
The Experiment
Starting late August 2025, OpenAI built an internal project with a radical constraint: no manually-written code. Even the initial AGENTS.md was written by Codex. Everything -- application logic, tests, CI, docs, tooling -- generated by agents.
Results
- ~1M lines of code over 5 months
- ~1,500 PRs merged by 3 engineers (later 7)
- Average: 3.5 PRs per engineer per day
- Throughput increased as team grew
The Shift
A software engineering team's primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow agents to do reliable work.
Humans interact through prompts: describe a task, run the agent, let it open a PR. Humans work at a different abstraction layer -- prioritizing work, translating user feedback into acceptance criteria, and validating outcomes.
Context Management
Failed: One Big AGENTS.md
Context is scarce. A giant instruction file crowds out the task, code, and docs. When everything is "important," nothing is.
Solution: AGENTS.md as Table of Contents
Short AGENTS.md (~100 lines) as a map with pointers to deeper sources:
- docs/ directory as system of record
- Progressive disclosure: agents start small, learn where to look
- Codex walks the directory tree root -> CWD, loading AGENTS.override.md then AGENTS.md
- Combined size capped at 32 KiB
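The loading order described above can be sketched in a few lines. This is a minimal illustration, not Codex's actual implementation; the file names and the 32 KiB cap come from the source, while the function name and truncation behavior are assumptions:

```python
from pathlib import Path

MAX_BYTES = 32 * 1024  # combined-size cap stated in the source

def collect_agent_docs(repo_root: Path, cwd: Path) -> str:
    """Concatenate agent docs from repo root down to cwd, within the byte cap."""
    # Build the directory list ordered root -> cwd.
    dirs, cur = [], cwd
    while True:
        dirs.append(cur)
        if cur == repo_root:
            break
        cur = cur.parent
    dirs.reverse()

    budget, chunks = MAX_BYTES, []
    for d in dirs:
        # In each directory, the override file is loaded before the regular file.
        for name in ("AGENTS.override.md", "AGENTS.md"):
            f = d / name
            if f.is_file() and budget > 0:
                data = f.read_bytes()[:budget]  # truncate to remaining budget
                budget -= len(data)
                chunks.append(data.decode("utf-8", errors="ignore"))
    return "".join(chunks)
```

Deeper directories land later in the prompt, so the most specific instructions appear closest to the task.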
Knowledge Base Structure
- Design docs -- catalogued, indexed, with verification status + "core beliefs"
- Architecture docs -- domain/package layer map
- Quality grades -- per domain, per layer, tracked over time
- Execution plans -- checked into repo, with progress and decision logs
- Doc-gardening agent -- scans for stale docs, opens fix-up PRs
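A doc-gardening pass of the kind listed above could start with something as simple as checking docs for links to files that no longer exist. This is a hypothetical sketch (the source does not describe the agent's internals); the `docs/` location matches the source, everything else is assumed:

```python
import re
from pathlib import Path

LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#]+)\)")  # markdown link targets

def find_stale_docs(repo_root: Path) -> dict[str, list[str]]:
    """Map each doc under docs/ to relative link targets that no longer exist."""
    stale: dict[str, list[str]] = {}
    for doc in (repo_root / "docs").rglob("*.md"):
        broken = [
            target
            for target in LINK_RE.findall(doc.read_text())
            # Skip external URLs; check targets relative to the doc and the root.
            if not target.startswith(("http://", "https://"))
            and not (doc.parent / target).exists()
            and not (repo_root / target).exists()
        ]
        if broken:
            stale[str(doc.relative_to(repo_root))] = broken
    return stale
```

The report from a scan like this is what the gardening agent would turn into fix-up PRs.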
Architectural Constraints
Rigid model: each domain divided into fixed layers with strictly validated dependency directions.
Enforced mechanically via custom linters (Codex-generated). Custom lint error messages inject remediation instructions into agent context.
Constraints are what allow speed without decay. Rules become multipliers: once encoded, they apply everywhere at once.
Agent Tools
- Standard dev tools (gh, local scripts, repo-embedded skills)
- Chrome DevTools Protocol -- DOM snapshots, screenshots, navigation for UI validation
- Local observability stack -- logs, metrics, traces, ephemeral per worktree
PR Workflow
Minimal blocking merge gates. PRs are short-lived.
Agent-driven lifecycle:
- Review own changes locally
- Request additional agent reviews (local + cloud)
- Respond to human/agent feedback
- Iterate until all reviewers satisfied
- Squash and merge
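The lifecycle above is essentially a loop: gather reviewer feedback, respond, repeat until everyone is satisfied, then squash-merge. A toy sketch, with an entirely hypothetical reviewer interface standing in for local and cloud review agents:

```python
from dataclasses import dataclass, field

@dataclass
class Reviewer:
    """Stand-in for a local or cloud review agent (hypothetical interface)."""
    pending: list[str] = field(default_factory=list)

    def feedback(self) -> list[str]:
        out, self.pending = self.pending, []
        return out

    def satisfied(self) -> bool:
        return not self.pending

def drive_pr(apply_fix, reviewers: list[Reviewer]) -> str:
    """Iterate on review feedback until all reviewers are satisfied, then merge."""
    while not all(r.satisfied() for r in reviewers):
        for r in reviewers:
            for comment in r.feedback():
                apply_fix(comment)  # the agent responds to each piece of feedback
    return "squash-merged"
```

The human's role in this loop is reduced to supplying feedback as one more reviewer, which is why review effort can migrate to agent-to-agent over time.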
Over time, almost all review effort shifted to agent-to-agent review.
Three Components of the Harness
- Context Engineering -- knowledge base + dynamic context (observability, browser)
- Architectural Constraints -- LLM-based + deterministic linters, structural testing
- "Garbage Collection" -- periodic agents finding doc inconsistencies, constraint violations, fighting entropy
Key Philosophy
When the agent struggles, treat it as a signal: identify what is missing -- tools, guardrails, documentation -- and feed it back into the repository, always by having the agent itself write the fix.