
Simon Willison on AI Coding Agents and Workflows


Research compiled February 2026. Simon Willison is the creator of Datasette, co-creator of Django, and one of the most prolific and practical writers on LLM-assisted development. Blog: simonwillison.net | Twitter/X: @simonw | Newsletter: simonw.substack.com


Table of Contents

  1. Core Philosophy
  2. Defining Agents: "Tools in a Loop"
  3. Vibe Coding vs. Vibe Engineering
  4. His Actual Daily Setup
  5. Parallel Coding Agents
  6. Async Coding Agents (Fire-and-Forget)
  7. Designing Agentic Loops
  8. Security: The Lethal Trifecta and Sandboxing
  9. Context Management
  10. Claude Skills vs. MCP
  11. Practical Techniques for LLM-Assisted Coding
  12. Tools He Built and Uses
  13. Key Quotes

Core Philosophy

Willison's overarching stance: LLMs amplify existing expertise. The more skills and experience you have as a software engineer, the faster and better the results you get from working with LLMs and coding agents.

The biggest advantage is not getting work done faster -- it is being able to ship projects that would not have been justified spending time on at all. LLMs accelerate learning, and letting developers execute ideas faster means they learn even more.

Key mental model: Think of LLMs as "an over-confident pair programming assistant who's lightning fast at looking things up, can churn out relevant examples at a moment's notice and can execute on tedious tasks without complaint." But they will absolutely make mistakes -- sometimes subtle, sometimes huge -- with errors that can be deeply inhuman, like hallucinating a non-existent library or method.

Critical warning: "If someone tells you that coding with LLMs is easy they are (probably unintentionally) misleading you." Using LLMs to write code is difficult and unintuitive, requiring significant effort to find the sharp and soft edges.


Defining Agents: "Tools in a Loop"

After collecting 211 different definitions of "agent" from Twitter and growing frustrated that Anthropic's own developer conference used the word dozens of times without defining it, Willison landed on a definition from Hannah Moran at Anthropic:

"Agents are models using tools in a loop."

This distinguishes agents by their iterative process -- not simply models or tools individually, but language models that repeatedly call external tools and use their outputs to inform subsequent decisions. Agents operate through cyclical reasoning rather than single-pass inference. The loop mechanism is central.
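The definition fits in a few lines of code. Here is a toy illustration of the loop, with a stubbed `call_model` standing in for a real LLM API and a hypothetical two-tool registry (everything here is illustrative, not any vendor's actual interface):

```python
# Minimal "tools in a loop" agent sketch. call_model is a stand-in for a
# real LLM API call; it returns either a tool-call request or a final answer.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def call_model(history):
    # Stub model: if the last message is a tool result, produce the answer;
    # otherwise request a tool call. A real agent sends `history` to an LLM.
    last = history[-1]
    if last["role"] == "tool":
        return {"type": "answer", "content": f"Result: {last['content']}"}
    return {"type": "tool_call", "name": "add", "args": (2, 3)}

def run_agent(prompt, max_turns=5):
    history = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):              # the loop
        reply = call_model(history)         # the model
        if reply["type"] == "tool_call":    # the tools
            result = TOOLS[reply["name"]](*reply["args"])
            history.append({"role": "tool", "content": result})
        else:
            return reply["content"]
    return "Gave up after max_turns"
```

The three ingredients of the definition map directly onto the three comments: a model, tools, and the loop that feeds tool outputs back in.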

Willison notes: "2025 really has been the year of 'agents', no matter which of the many conflicting definitions you decide to use (I eventually settled on 'tools in a loop')."

Source: Agents are models using tools in a loop


Vibe Coding vs. Vibe Engineering

Willison draws a sharp distinction between two modes of AI-assisted development:

Vibe Coding (from Andrej Karpathy): "The fast, loose and irresponsible way of building software with AI -- entirely prompt-driven, and with no attention paid to how the code actually works." Useful for weekend projects, exploration, learning. Willison built 77+ HTML+JavaScript tools this way without reading implementation details.

Vibe Engineering (Willison's term): Responsible AI-assisted development where "seasoned professionals accelerate their work with LLMs while staying proudly and confidently accountable for the software they produce." The name is deliberately cheeky.

The 12 Practices of Vibe Engineering

  1. Automated Testing -- Agents perform best with comprehensive test suites; test-first development is particularly effective
  2. Advance Planning -- High-level planning before coding improves iteration
  3. Comprehensive Documentation -- Enables agents to use APIs without reading source code
  4. Strong Git Habits -- Version control becomes critical; agents excel at git bisect
  5. Effective Automation -- CI/CD, linting, preview deployments amplify agent productivity
  6. Culture of Code Review -- Reviewing agent output requires genuine expertise
  7. Management Skills -- "Getting good results out of a coding agent feels uncomfortably close to getting good results out of a human collaborator"
  8. Manual QA -- Beyond tests, rigorous edge-case testing remains essential
  9. Research Skills -- Determining optimal solutions before implementation
  10. Preview Environments -- Safe feature testing before production
  11. Outsourcing Intuition -- Knowing what AI handles well versus manual work
  12. Updated Estimation -- Accounting for AI's variable impact on timelines

Central insight: "One of the lesser spoken truths of working productively with LLMs as a software engineer on non-toy-projects is that it's difficult."

Source: Vibe engineering


His Actual Daily Setup

Primary Tools (as of late 2025 / early 2026)

| Tool | Use Case |
| --- | --- |
| Claude Code (Sonnet 4.5) | Primary local terminal agent |
| Codex CLI (GPT-5-Codex) | Primary local terminal agent, used alongside Claude Code |
| Claude Code for Web | Async fire-and-forget agent (sandboxed) |
| Codex Cloud | Async tasks, launched from phone |
| Google Jules | Free alternative async agent |
| GitHub Copilot Coding Agent | PR-based async agent |
| llm CLI tool (his own) | Quick prompts, logging to SQLite, RAG workflows |
| files-to-prompt (his own) | Pipe entire directories into LLM context |

How He Runs Multiple Agents

  • Multiple terminal windows with different agents in separate directories
  • Mixture of Claude Code and Codex CLI running simultaneously
  • For isolation: creates fresh checkouts in /tmp rather than using git worktrees
  • Runs in YOLO mode (--dangerously-skip-permissions) for tasks where malicious instructions cannot sneak into context
  • Recognizes he should "start habitually running my local agents in Docker containers to further limit the blast radius"
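The fresh-checkout-per-agent pattern can be sketched in a few lines. This is a hedged illustration, not Willison's actual tooling: it uses `shutil.copytree` to stay self-contained, where a real version would run `git clone`, and the task runner returns a string where a real one would launch `claude` or `codex` in the working directory:

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def fresh_checkout(repo_dir: Path) -> Path:
    """Copy the repo into a throwaway directory under the system temp dir,
    so each agent works in isolation (fresh checkouts rather than git
    worktrees). A real version might run `git clone` instead of copytree."""
    scratch = Path(tempfile.mkdtemp(prefix="agent-"))
    dest = scratch / repo_dir.name
    shutil.copytree(repo_dir, dest)
    return dest

def run_task(repo_dir: Path, task: str) -> str:
    workdir = fresh_checkout(repo_dir)
    # A real workflow would launch a terminal agent in `workdir` here.
    return f"{task} in {workdir}"

def run_parallel(repo_dir: Path, tasks):
    """Fire off several agents at once, each in its own checkout."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda t: run_task(repo_dir, t), tasks))
```

Because every task gets its own checkout, agents cannot trample each other's uncommitted changes.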

Claude Code as General Agent

Willison's key insight (January 2026): "Claude Code is, with hindsight, poorly named -- it's not purely a coding tool: it's a tool for general computer automation. Anything you can achieve by typing commands into a computer is something that can now be automated by Claude Code. It's best described as a general agent."

Claude Cowork (January 2026): Anthropic's "Claude Code for the rest of your work" -- same underlying engine with a less intimidating UI, automatic filesystem sandboxing via Apple's VZVirtualMachine, aimed at non-technical users.


Parallel Coding Agents

Despite initial skepticism, Willison found himself "quietly starting to embrace the parallel coding agent lifestyle, finding an increasing number of tasks that can be fired off in parallel without adding too much cognitive overhead."

Four Key Application Patterns

  1. Research and Proof of Concepts -- Testing whether new libraries work together. Libraries too new to be in training data do not matter; agents can check out repos and read the code to figure out usage.

  2. System Understanding -- Ask agents to "make notes on where your signed cookies are set and read, or how your application uses subprocesses and threads."

  3. Low-Stakes Maintenance -- Fixing deprecation warnings, resolving test suite issues without interrupting primary focus.

  4. Carefully Specified Work -- Code reviewed faster when starting from detailed specifications rather than open-ended requests.

The "Send Out a Scout" Pattern

From Josh Bleecher Snyder: "Hand the AI agent a task just to find out where the sticky bits are, so you don't have to make those mistakes." Use the agent as reconnaissance before committing to an approach.

Source: Embracing the parallel coding agent lifestyle


Async Coding Agents (Fire-and-Forget)

Willison practices code research -- answering software development questions by writing and executing code rather than relying on speculation. Async agents excel at this.

Workflow

  1. Create a dedicated GitHub repository (separate from production code)
  2. Enable full network access for agents in research repos
  3. Formulate clear research goals in 2-3 paragraphs
  4. Submit as async task (fire-and-forget)
  5. Agents file pull requests with results

Willison reports running 2-3 code research projects a day with minimal time investment. Can launch from phone.
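The workflow above can be collapsed into a small task-spec helper. Every field name here is illustrative -- no async-agent platform exposes exactly this API:

```python
def research_task(goal_paragraphs, repo="research-scratchpad"):
    """Package a code-research question as a fire-and-forget task spec.
    Field names are made up for illustration, not any platform's real API."""
    if not 2 <= len(goal_paragraphs) <= 3:
        raise ValueError("formulate the research goal in 2-3 paragraphs")
    return {
        "repo": repo,                   # dedicated repo, never production code
        "network_access": "full",       # research repos get full network access
        "prompt": "\n\n".join(goal_paragraphs),
        "deliverable": "pull request",  # agents file PRs with their results
    }
```

The point of the structure is the discipline it encodes: a quarantined repo, explicit network policy, a goal short enough to write from a phone, and results that arrive as reviewable pull requests.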

Why Async Agents Are Compelling

  • Great answer to security challenges (code runs on someone else's infra, not your laptop)
  • Parallelizable -- fire off multiple tasks at once
  • Hallucination mitigation: "The code itself doesn't lie: if they write code and execute it and it does the right things then they've demonstrated... that something really does work."

Honest Assessment

Willison acknowledges these outputs constitute "total slop" -- unreviewed AI-generated content. He quarantines research in dedicated repositories and requests that platforms add noindex support to prevent search engine indexing of AI-generated research.

Source: Code research projects with async coding agents


Designing Agentic Loops

Willison identifies "designing agentic loops" as a critical new skill for getting the most out of coding agents. The skill involves carefully selecting which tools and feedback mechanisms the agent uses.

When Agentic Loops Shine

Problems with clear success criteria requiring trial-and-error. The signal: "ugh, I'm going to have to try a lot of variations here."

Examples:

  • Debugging failing tests through iterative investigation
  • Performance optimization (SQL indexing, container sizing)
  • Dependency upgrades with automated test validation
  • Docker image optimization while maintaining test passage

Tool Selection Strategy

Rather than complex MCP setups, create an AGENTS.md file documenting available commands:

To take a screenshot, run:

    shot-scraper http://www.example.com/ -w 800 -o example.jpg

LLMs effectively leverage existing tools (Playwright, ffmpeg) they already understand, recovering from mistakes through iteration.

Critical Amplifier

Automated test suites dramatically multiply agent effectiveness. Agents need measurable success criteria to iterate toward solutions reliably. Fast feedback loops enable productive agentic workflows: fast compilation, fast tests, fast tool responses.

YOLO Mode Trade-offs

Three implementation options for unrestricted execution:

  1. Secure sandbox (Docker, Apple container tool) restricting file/secret/network access
  2. Ephemeral environments (GitHub Codespaces, ChatGPT Code Interpreter)
  3. Calculated risk with isolated, monitored environments

Credential Scoping Pattern

Provide credentials only to test/staging with tight constraints. Example: Willison created a dedicated Fly.io organization with $5 budget limit and scoped API key for isolated infrastructure experimentation.
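The scoping pattern can be sketched as an environment filter around the agent subprocess. The variable names (`STAGING_API_KEY` and so on) are invented for illustration; the idea is simply that the agent's process never sees production secrets:

```python
import os
import subprocess

# Only these variables are visible to the agent subprocess. Names are
# illustrative -- the point is scoped staging credentials, never prod ones.
AGENT_ALLOWLIST = {"PATH", "HOME", "STAGING_API_KEY"}

def agent_env(source_env=None):
    """Build a minimal environment for an agent, dropping everything
    that is not explicitly allowlisted."""
    env = dict(source_env if source_env is not None else os.environ)
    return {k: v for k, v in env.items() if k in AGENT_ALLOWLIST}

def run_agent_command(cmd, source_env=None):
    # e.g. run_agent_command(["claude", "--dangerously-skip-permissions"])
    return subprocess.run(cmd, env=agent_env(source_env), capture_output=True)
```

Combined with a budget-capped account, a leaked key then bounds the blast radius in both scope and cost.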

Source: Designing agentic loops


Security: The Lethal Trifecta and Sandboxing

The Lethal Trifecta

Three combined factors create critical vulnerability:

  1. Access to private data
  2. Exposure to untrusted content
  3. Ability to communicate externally

When all three are present, attackers can extract secrets. Example: a malicious HTML file tricks an agent into grepping environment variables (like GitHub tokens) and exfiltrating them to attacker-controlled servers.

The fundamental rule: "Anyone who can get their tokens into your context should be considered to have full control over what your agent does next."
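The trifecta lends itself to a mechanical pre-flight check before launching an agent. The capability names below are my labels for the three legs, not an established API:

```python
# The three legs of the lethal trifecta, as simple capability flags.
TRIFECTA = {"private_data", "untrusted_content", "external_comms"}

def lethal_trifecta(agent_caps: set) -> bool:
    """True if an agent configuration combines all three dangerous legs."""
    return TRIFECTA <= agent_caps

def check_config(agent_caps: set) -> str:
    """Refuse to launch unless at least one leg has been removed."""
    if lethal_trifecta(agent_caps):
        raise ValueError("lethal trifecta: remove at least one capability")
    return "ok"
```

Any two legs together are survivable; it is the combination of all three that lets an attacker both reach secrets and carry them out.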

Sandboxing Recommendations

Primary defense: Run coding agents in sandboxes, preferably "on someone else's computer."

Recommended platforms:

  • Claude Code for Web (sandboxed by default)
  • Codex Cloud
  • Gemini Jules
  • Docker containers locally

Two-layer control problem:

  1. Filesystem access (manageable) -- restrict file read/write permissions
  2. Network access (critical, and harder) -- blocking it prevents data exfiltration, the third leg of the lethal trifecta

Technical Implementation (macOS)

Apple's sandbox-exec command with policy documents controlling file visibility, network allowlists, and process execution. Anthropic's approach: HTTP proxy mediates agent network traffic with domain allowlists. They released an open-source sandbox-runtime library.
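A minimal sketch of the domain-allowlist decision at the heart of such a proxy (exact-match hostnames only; a production proxy would also handle subdomains, redirects, and non-HTTP traffic, and the domains listed are illustrative):

```python
from urllib.parse import urlparse

# Illustrative allowlist: the domains an agent legitimately needs.
ALLOWED_DOMAINS = {"api.anthropic.com", "pypi.org", "files.pythonhosted.org"}

def request_allowed(url: str, allowlist=ALLOWED_DOMAINS) -> bool:
    """Permit a request only if its hostname is exactly on the allowlist,
    closing off the exfiltration leg of the lethal trifecta."""
    host = urlparse(url).hostname or ""
    return host in allowlist
```

Note the exact match: substring or suffix checks would let `evil.pypi.org.attacker.example` slip through.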

MCP Context Pollution Solved (January 2026)

Willison on MCP Tool Search: "Context pollution is why I rarely used MCP, now that it's solved there's no reason not to hook up dozens or even hundreds of MCPs to Claude Code."

MCP Tool Search dynamically loads tool definitions into context only when needed, cutting overhead from ~77K tokens to ~8.7K for a setup with 50+ tools -- a reduction of roughly 89%.


Context Management

Willison's key insight: "Most of the craft of getting good results out of an LLM comes down to managing its context -- the text that is part of your current conversation."

Principles

  • Context is not free. Every token influences behavior, for better or worse.
  • Context includes entire conversation history, not just current prompt.
  • Starting fresh conversations "resets that context back to zero."
  • Pre-populate context using tools like Claude Projects' GitHub integration.
  • Explicitly understand what information enters the LLM to get better results.
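One way to make those principles concrete is a context-trimming helper that keeps the system prompt plus the newest turns within a token budget. This is a sketch of the general technique, not any particular tool's behavior, and the word-count tokenizer is a crude stand-in for a real one:

```python
def trim_context(messages, budget,
                 count_tokens=lambda m: len(m["content"].split())):
    """Keep system messages plus the most recent turns that fit the token
    budget; older turns are dropped first. count_tokens is a crude
    word-count stand-in for a real tokenizer."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept = []
    used = sum(count_tokens(m) for m in system)
    for msg in reversed(rest):            # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break                         # everything older is dropped
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```

Dropping the oldest turns first is the same intuition as "starting fresh conversations resets that context back to zero," applied gradually rather than all at once.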

From Prompt Engineering to Context Engineering

Willison evolved from "prompt engineering" to "context engineering" -- everything that surrounds the prompt: goals, constraints, examples, tools, memory, tests, and retrieved knowledge that steer an LLM to do the next correct thing.

He describes the shift: "Language models change you from a programmer who writes lines of code, to a programmer that manages the context the model has access to, prunes irrelevant things, adds useful material to context, and writes detailed specifications."

Context Problems to Avoid

  • Context Poisoning: hallucinations making it into the context
  • Context Distraction: long contexts causing over-focus on irrelevant parts
  • Context Confusion: superfluous information leading to low-quality responses
  • Context Clash: new information conflicting with existing prompt information

Source: Simon Willison on context-engineering


Claude Skills vs. MCP

What Skills Are

Skills are folders containing Markdown files with YAML metadata and optional executable scripts. The system scans available skills at session start, reading brief descriptions from frontmatter -- each skill consuming only dozens of tokens until fully loaded when needed.
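Under stated assumptions -- a `skills/<name>/SKILL.md` layout with simple `key: value` frontmatter, parsed with the stdlib rather than a real YAML library -- the startup scan might look like this:

```python
from pathlib import Path

def scan_skills(skills_dir):
    """Read only the frontmatter of each skill's SKILL.md, mirroring how
    just a short description per skill enters context until a skill is
    actually loaded. Parses simple `key: value` frontmatter lines; a real
    implementation would use a YAML parser."""
    catalog = {}
    for skill_md in Path(skills_dir).glob("*/SKILL.md"):
        text = skill_md.read_text()
        meta = {}
        if text.startswith("---"):
            # Frontmatter sits between the first pair of --- markers.
            for line in text.split("---")[1].strip().splitlines():
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        catalog[meta.get("name", skill_md.parent.name)] = meta.get("description", "")
    return catalog
```

The full Markdown body (and any scripts alongside it) is only read later, when the model decides a skill is relevant -- that deferred loading is where the token savings come from.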

Why Willison Thinks They Are a Bigger Deal Than MCP

Advantages over MCP:

  • Extreme simplicity compared to MCP's protocol specification (hosts, clients, servers, resources, transports)
  • Easy to iterate and improve -- just Markdown files and scripts
  • Platform-agnostic -- work with Codex CLI, Gemini CLI despite no native integration
  • Low token overhead -- dozens of tokens per skill vs. tens of thousands for MCP

Willison predicts "a Cambrian explosion in Skills" exceeding the MCP adoption wave.

The General Agent Pattern via Skills

Example: a data journalism agent combining skills for census data access, SQLite/DuckDB operations, S3 publishing, story discovery methodology, and D3 visualization -- all implemented "with a folder full of Markdown files and maybe a couple of example Python scripts."

Practical Configuration: AGENTS.md and CLAUDE.md

  • CLAUDE.md: The "constitution" for Claude Code. Lives at project root or ~/.claude/CLAUDE.md for global defaults. Sets instructions for every session.
  • AGENTS.md: Documents available commands and tools for agents. Simpler than MCP -- just describe how to use existing CLI tools.

Source: Claude Skills are awesome, maybe a bigger deal than MCP


Practical Techniques for LLM-Assisted Coding

1. The Authoritarian Approach (Production Code)

Provide exact specifications with function signatures:

    async def download_db(url, max_size_bytes=5 * 1024 * 1024) -> pathlib.Path:

Then detail requirements in English. Treats LLMs "like a digital intern, hired to type code for me based on my detailed instructions." Saves 15+ minutes on functions you could write manually.
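As a sketch of what code satisfying that spec might look like -- this body is illustrative, not the code Willison's "digital intern" actually produced -- here is a stdlib-only version (it streams in chunks so oversized files are rejected early, and it works with file:// URLs as well as http(s)):

```python
import asyncio
import os
import pathlib
import tempfile
import urllib.request

async def download_db(url, max_size_bytes=5 * 1024 * 1024) -> pathlib.Path:
    """Download a database file to a temp path, refusing anything larger
    than max_size_bytes. Illustrative implementation of the spec above."""
    def fetch():
        fd, name = tempfile.mkstemp(suffix=".db")
        os.close(fd)
        dest = pathlib.Path(name)
        with urllib.request.urlopen(url) as resp, dest.open("wb") as out:
            copied = 0
            while chunk := resp.read(64 * 1024):   # stream in 64 KB chunks
                copied += len(chunk)
                if copied > max_size_bytes:
                    dest.unlink()                  # discard the partial file
                    raise ValueError("file exceeds max_size_bytes")
                out.write(chunk)
        return dest
    # Run the blocking download off the event loop.
    return await asyncio.to_thread(fetch)
```

The value of the authoritarian approach is visible even in the sketch: the signature pins down the interface, so the English requirements only need to cover behavior like the size limit and error handling.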

2. Iterative Refinement Over Perfect First Prompts

Bad initial results are not failures -- they are starting points. Follow-up prompts like "break that repetitive code into a function" often yield better results. The LLM "can re-type it dozens of times without ever getting frustrated."

3. Strategic Example Provision

Dump several complete working examples as context, then ask the LLM to build inspired by them. Willison used this for his JavaScript OCR application combining Tesseract.js and PDF.js.

4. Plan-Then-Execute for Larger Changes

For refactorings: tell the LLM to write a plan, iterate over it until reasonable, save it as a kind of meta program, then instruct it to implement step by step.
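A minimal sketch of that plan-as-meta-program idea, with `execute_step` standing in for prompting the agent on each step (the checklist format is my choice, not a prescribed one):

```python
from pathlib import Path

def save_plan(steps, path="plan.md"):
    """Persist the agreed plan as a Markdown checklist the model can
    work through -- the 'meta program'."""
    Path(path).write_text("\n".join(f"- [ ] {s}" for s in steps))
    return path

def execute_plan(path, execute_step):
    """Walk the saved plan step by step, then mark every item done.
    execute_step stands in for prompting the agent with one step."""
    steps = [line.removeprefix("- [ ] ")
             for line in Path(path).read_text().splitlines()]
    for step in steps:
        execute_step(step)  # in practice: one agent prompt per step
    Path(path).write_text("\n".join(f"- [x] {s}" for s in steps))
    return steps
```

Keeping the plan in a file means you can iterate on it with the model before any code is touched, and resume from it in a fresh conversation if the context fills up.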

5. Vibe-Coding for Exploration

"Fully give in to the vibes" for weekend projects and learning. Do not read implementation details. Deploy and test -- human takes over when needed.

6. Non-Negotiable Testing

"You absolutely cannot outsource to the machine testing that the code actually works." This is the one thing that must stay human.

7. Provide Documentation for Training Cutoff Gaps

Models trained on data from months ago will not know about breaking library changes. Workaround: provide recent examples, documentation snippets, or changelog entries in prompts.

Source: Here's how I use LLMs to help me write code


Tools He Built and Uses

llm -- CLI for Large Language Models

  • Command-line tool and Python library for interacting with OpenAI, Anthropic, Google, Meta, and local models
  • Plugin system for model providers (Claude, Gemini, Ollama, Mistral, etc.)
  • Logs all prompts and responses to SQLite -- explorable with Datasette
  • Version 0.26 added tool support (LLMs can call Python functions)
  • Supports RAG workflows as bash scripts against local SQLite databases
  • Install: pip install llm or brew install llm
  • GitHub: simonw/llm | Docs: llm.datasette.io
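Because the logs live in plain SQLite, they can be explored with the stdlib directly. This sketch assumes a `responses` table with a `model` column, which matches the llm log schema at the time of writing but could change between versions (the docs at llm.datasette.io are the authority):

```python
import sqlite3

def prompts_per_model(db_path):
    """Count logged prompts per model in llm's SQLite log database.
    Assumes a `responses` table with a `model` column."""
    with sqlite3.connect(db_path) as db:
        rows = db.execute(
            "SELECT model, COUNT(*) FROM responses "
            "GROUP BY model ORDER BY COUNT(*) DESC"
        ).fetchall()
    return dict(rows)
```

The same database opens directly in Datasette for browsing, which is the pairing described below.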

files-to-prompt -- Directory-to-Prompt Converter

  • Turns a whole directory of code into a single prompt ready to pipe into an LLM
  • -m/--markdown option for Markdown output with fenced code blocks
  • Supports reading file lists from stdin
  • GitHub: simonw/files-to-prompt
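A toy re-implementation shows the shape of the output; the real tool additionally respects .gitignore, skips binaries, and accepts file lists on stdin:

```python
from pathlib import Path

FENCE = "`" * 3  # backtick fence for the Markdown output mode

def files_to_prompt(directory, markdown=True):
    """Flatten a directory of source files into one prompt string: a toy
    version of what simonw/files-to-prompt does."""
    parts = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            body = path.read_text()
            if markdown:  # mirrors the -m/--markdown fenced-block output
                parts.append(f"{path}\n{FENCE}\n{body}\n{FENCE}")
            else:
                parts.append(f"{path}\n---\n{body}\n---")
    return "\n\n".join(parts)
```

Each file is prefixed with its path so the model can cite locations in its answer, which is most of what "pipe a directory into context" requires.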

shot-scraper -- Browser Automation for Agents

  • Takes screenshots and executes JavaScript against web pages via headless Chrome (Playwright)
  • Useful in AGENTS.md for giving agents visual feedback
  • gh: prefix loads scripts from GitHub
  • simonwillison.net/tags/shot-scraper

llm-prompts -- Reusable Prompt Collection

Datasette

  • His flagship project: a tool for exploring and publishing data in SQLite databases
  • Pairs with llm for logging/analyzing LLM usage patterns
  • datasette.io

Key Quotes

"Claude Code is, with hindsight, poorly named -- it's not purely a coding tool: it's a tool for general computer automation."

"Agents are models using tools in a loop."

"The biggest advantage is speed of development" -- enabling shipping of projects that would not have been justified building manually.

"One of the new skills required to get the most out of AI-assisted coding tools is designing agentic loops: carefully selecting tools to run in a loop to achieve a specified goal."

"Getting good results out of a coding agent feels uncomfortably close to getting good results out of a human collaborator."

"Anyone who can get their tokens into your context should be considered to have full control over what your agent does next."

"Context pollution is why I rarely used MCP, now that it's solved there's no reason not to hook up dozens or even hundreds of MCPs to Claude Code."

"I find myself instinctively thinking 'neat feature idea, not worth the time it will take to build and maintain it though' -- and then prompting Claude Code anyway, because my 25+ years of intuitions don't match reality any more."

"You absolutely cannot outsource to the machine testing that the code actually works."

"A friend called Claude Code catnip for programmers and it really feels like this."

