
Harness Engineering: Leveraging Codex in an Agent-First World


Source: OpenAI Blog | Martin Fowler analysis
Author: Ryan Lopopolo, OpenAI
Date: 2026


The Experiment

Starting in late August 2025, OpenAI built an internal project under a radical constraint: no manually written code. Even the initial AGENTS.md was written by Codex. Everything -- application logic, tests, CI, docs, tooling -- was generated by agents.

Results

  • ~1M lines of code over 5 months
  • ~1,500 PRs merged by 3 engineers (later 7)
  • Average: 3.5 PRs per engineer per day
  • Throughput increased as team grew
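The figures above can be loosely sanity-checked with back-of-the-envelope arithmetic. The working-day count is an assumption (not from the source); the PR totals and rate are as reported:

```python
# Rough consistency check of the reported throughput figures.
# Assumption (not from the source): ~21 working days per month.
months = 5
working_days = months * 21                             # ~105 working days
total_prs = 1500
prs_per_engineer_per_day = 3.5

# Engineer-days implied by the merged PRs:
engineer_days = total_prs / prs_per_engineer_per_day   # ~429

# Implied average team size over the 5 months:
avg_team_size = engineer_days / working_days           # ~4.1

# Plausible for a team that started at 3 engineers and grew to 7.
assert 3 <= avg_team_size <= 7
print(round(avg_team_size, 1))
```

The implied average of roughly four engineers sits between the starting headcount of 3 and the later 7, so the reported numbers hang together.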

The Shift

A software engineering team's primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow agents to do reliable work.

Humans interact through prompts: describe a task, run the agent, let it open a PR. They work at a different abstraction layer -- prioritizing work, translating user feedback into acceptance criteria, and validating outcomes.

Context Management

Failed: One Big AGENTS.md

Context is scarce. A giant instruction file crowds out the task, code, and docs. When everything is "important," nothing is.

Solution: AGENTS.md as Table of Contents

Short AGENTS.md (~100 lines) as a map with pointers to deeper sources:

  • docs/ directory as system of record
  • Progressive disclosure: agents start small, learn where to look
  • Codex walks directory tree root -> CWD, loading AGENTS.override.md then AGENTS.md
  • Combined size capped at 32 KiB
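The discovery walk in the last two bullets can be sketched in Python. This is a minimal illustration assuming a simple concatenate-and-truncate reading of the 32 KiB cap; the function name and the exact loading logic are hypothetical, only the file names, walk order, and cap come from the source:

```python
from pathlib import Path

MAX_COMBINED_BYTES = 32 * 1024  # combined size cap from the source: 32 KiB

def collect_agent_instructions(root: Path, cwd: Path) -> str:
    """Walk the directory tree from the repo root down to the current
    working directory, loading AGENTS.override.md before AGENTS.md in
    each directory, truncating the result at the combined cap."""
    # Build the chain of directories root -> ... -> cwd.
    # (This sketch assumes cwd is inside root.)
    chain = []
    d = cwd.resolve()
    while True:
        chain.append(d)
        if d == root.resolve():
            break
        d = d.parent
    chain.reverse()

    data = b""
    for directory in chain:
        for name in ("AGENTS.override.md", "AGENTS.md"):
            f = directory / name
            if f.is_file():
                data += f.read_bytes()
    return data[:MAX_COMBINED_BYTES].decode("utf-8", "ignore")
```

Because directories closer to the root are read first, a deeply nested AGENTS.override.md refines rather than replaces the repo-wide instructions, and the byte cap keeps the combined instructions from crowding out the task context.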

Knowledge Base Structure

  • Design docs -- catalogued, indexed, with verification status + "core beliefs"
  • Architecture docs -- domain/package layer map
  • Quality grades -- per domain, per layer, tracked over time
  • Execution plans -- checked into repo, with progress and decision logs
  • Doc-gardening agent -- scans for stale docs, opens fix-up PRs
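A doc-gardening scan like the one in the last bullet could be sketched as follows. The "Last verified:" header convention, the 90-day threshold, and the function name are all hypothetical; the source only says the agent scans for stale docs and opens fix-up PRs:

```python
import re
from datetime import date, timedelta
from pathlib import Path

STALE_AFTER = timedelta(days=90)  # hypothetical staleness threshold

def find_stale_docs(docs_dir: Path, today: date) -> list[Path]:
    """Flag docs whose 'Last verified: YYYY-MM-DD' header is missing or
    too old, so a gardening agent can open fix-up PRs for them."""
    stale = []
    for doc in sorted(docs_dir.rglob("*.md")):
        m = re.search(r"Last verified: (\d{4}-\d{2}-\d{2})", doc.read_text())
        if m is None or today - date.fromisoformat(m.group(1)) > STALE_AFTER:
            stale.append(doc)
    return stale
```

The output of such a scan is itself agent input: each flagged path becomes a task prompt, keeping the "system of record" docs trustworthy without manual curation.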

Architectural Constraints

Rigid model: each domain divided into fixed layers with strictly validated dependency directions.

Enforced mechanically via custom linters (Codex-generated). Custom lint error messages inject remediation instructions into agent context.

Constraints are what allow speed without decay. Rules become multipliers: once encoded, they apply everywhere at once.

Agent Tools

  • Standard dev tools (gh, local scripts, repo-embedded skills)
  • Chrome DevTools Protocol -- DOM snapshots, screenshots, navigation for UI validation
  • Local observability stack -- logs, metrics, traces, ephemeral per worktree

PR Workflow

Minimal blocking merge gates. PRs are short-lived.

Agent-driven lifecycle:

  1. Review own changes locally
  2. Request additional agent reviews (local + cloud)
  3. Respond to human/agent feedback
  4. Iterate until all reviewers satisfied
  5. Squash and merge

Over time, almost all review effort pushed to agent-to-agent.
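The five-step lifecycle above can be sketched as a simple loop. The reviewer callables and the round cap are hypothetical scaffolding, not from the source:

```python
from typing import Callable

def run_pr_lifecycle(
    revise: Callable[[list[str]], None],
    reviewers: list[Callable[[], list[str]]],
    max_rounds: int = 5,  # hypothetical cap before escalating to a human
) -> bool:
    """Iterate until every reviewer (human or agent) returns no
    comments, then squash-and-merge; give up after max_rounds."""
    for _ in range(max_rounds):
        comments = [c for review in reviewers for c in review()]
        if not comments:
            return True   # all reviewers satisfied -> squash and merge
        revise(comments)  # agent responds to feedback and iterates
    return False          # unresolved: hand back to a human
```

The same loop shape holds whether the reviewers are local agents, cloud agents, or humans, which is what lets almost all review effort shift to agent-to-agent over time.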

Three Components of the Harness

  1. Context Engineering -- knowledge base + dynamic context (observability, browser)
  2. Architectural Constraints -- LLM-based + deterministic linters, structural testing
  3. "Garbage Collection" -- periodic agents finding doc inconsistencies, constraint violations, fighting entropy

Key Philosophy

When the agent struggles, treat it as a signal: identify what is missing -- tools, guardrails, documentation -- and feed it back into the repository, always by having the agent itself write the fix.