Guide

AI Coding with Claude, Codex and others

Last updated: 2026-02-21 (updated automatically on each guide edit).

A living document by TheCrux.

Tool-specific details (Claude Code commands, flags, file locations) are volatile. Where we state specifics, we add a Last verified date. If your installed version differs, prefer the official docs.


What Is This?

A practical guide to programming with AI CLI tools — Claude Code, Codex CLI, Gemini CLI, and others. These terminal-based agents don't just suggest code; they read your codebase, write files, run commands, and iterate based on results.

This guide is written by AI with human editorial oversight. It's maintained using TheCrux, which sources and summarizes practitioner content—blog posts, threads, documentation—and uses multiple AI models to synthesize updates. The goal: surface emerging best practices, highlight where approaches diverge, and help you develop your own informed workflow.


Quick Start by Experience Level

If You're New to AI CLI Tools

Start here: The shift from IDE-based AI (Copilot, Cursor) to terminal-based agents (Claude Code, Codex) represents a fundamental change. You're not asking for code suggestions—you're having a conversation with a partner that can read, write, and execute code in your repo.

Your first session:

  1. Install your tool of choice (Claude Code, Codex CLI, or Gemini CLI)
  2. Navigate to a project you know well (familiarity helps you evaluate the output)
  3. Start with a small, well-defined task: "Add a function that validates email addresses"
  4. Watch what it does—read the file, make changes, maybe run tests
  5. Review the diff before accepting

If you start with Claude Code: generate a sane baseline CLAUDE.md first.

text
/init

Then prune it aggressively (see Create a Project Context File).

What to expect: You'll be slower at first. That's normal. The learning curve is about developing intuition for what tasks agents handle well vs. where they struggle.

These tools are high-leverage but not “push-button magic.” The fastest teams tend to be the ones who adopt a more disciplined loop: plan, work in small steps, verify, commit.

If You're Already Using These Tools

Become more productive by:

  • Creating a CLAUDE.md / AGENTS.md (or equivalent) that encodes your project's conventions
  • Building custom slash commands for repeated workflows (/pr, /test, /catchup)
  • Experimenting with parallel agents on independent tasks
  • Adding tests as a feedback mechanism for the agent

Common plateau: Many developers get stuck at "one agent, one task, manual review." The next level is developing workflows where agents can iterate autonomously (via tests) and where you can run multiple agents on parallel work streams.

Steve Yegge's 8 Stages of AI-Assisted Coding

From Welcome to Gas Town:

| Stage | Description |
| --- | --- |
| 1 | Zero or near-zero AI |
| 2 | Coding agent in IDE with permission prompts |
| 3 | IDE agent, YOLO mode (trust increasing) |
| 4 | IDE agent, used mainly for reviewing diffs |
| 5 | CLI, single agent |
| 6 | CLI, multi-agent YOLO — regularly use 3-5 parallel instances. "You are very fast" |
| 7 | 10+ agents, hand-managed |
| 8 | Building your own orchestrator |

Most developers reading this guide are probably at stages 2-4. The goal is to help you reach stages 5-6, where productivity gains become substantial.


Why Now: The 2026 Moment

Key developments that changed the game:

  • Context windows expanded — windows of up to 200k tokens reduce paging, but agents still rely on selective reading/retrieval; most repos won’t fit end-to-end
  • Agentic loops matured — Write → test → fix → repeat, without human intervention
  • Terminal-first tools emerged — Claude Code, Codex CLI, Gemini CLI moved beyond IDE plugins

Senior engineers benefit most. They know how to review AI output, catch subtle bugs, and steer toward good architecture. (Thorsten Ball)

Even longtime skeptics converted: Kent Beck, DHH, Thorsten Ball all shifted from skepticism to advocacy once models improved. (Pragmatic Engineer)

What's gaining value: System design, product-mindedness, knowing when to trust/reject AI. What's declining: Prototyping-from-scratch expertise, language polyglot specialization.

Practitioner snapshot: Peter Steinberger describes his output as increasingly limited by inference time and "hard thinking", not typing speed—especially for the large class of software that is "move data around and present it." A recurring implication for CLI tools: starting with a CLI makes verification trivial (agents can run it and check output), which tightens the iteration loop. (See: Shipping at Inference-Speed)

January 2026: Agents go mainstream. Anthropic launched Cowork—a desktop agent research preview in the Claude macOS app (Claude Max tier), built on Claude’s agent foundations rather than a literal repackaging of Claude Code. Simon Willison called it "a general agent that looks well positioned to bring the wildly powerful capabilities of Claude Code to a wider audience." The implication: if agents work for file organization and expense reports, the CLI tools developers use are just the beginning.


Foundational Concepts

What Is Agentic Coding?


An LLM agent is something that runs tools in a loop to achieve a goal. — Simon Willison

Traditional AI assistants suggest code. Agents are different:

  • They read your codebase (files, git history, documentation)
  • They write and modify files directly
  • They execute commands (tests, builds, linters)
  • They iterate based on results (fixing errors, retrying)

One way to think about coding agents is that they are brute force tools for finding solutions to coding problems. If you can reduce your problem to a clear goal and a set of tools that can iterate towards that goal, a coding agent can often brute force its way to an effective solution. — Simon Willison

The Terminal Renaissance


Agentic coding is console based, 1970s-style, in a text terminal window with little to no UI... The developer workflow is simple: You ask the AI for code changes, then you review the diffs and test the output, in a loop, lather, rinse, repeat. — Steve Yegge

Why terminal over IDE? The agent needs to execute commands, not just suggest code. A terminal is the natural environment for this.

Key Terminology

| Term | Definition |
| --- | --- |
| Agentic loop | Tool use → Result → Reasoning → Next tool use |
| Context window | How much text the model can "see" (up to ~200k tokens; practical coverage depends on retrieval) |
| MCP | Model Context Protocol — standard for integrating external tools |
| Subagent | An agent spawned by another agent for subtasks |
| YOLO mode | Running with permissions disabled (risky but fast) |

Context Is the Scarce Resource

Practitioner reality: the main constraint isn’t “intelligence”, it’s context management.

Claude Code’s official best-practices guide is unusually explicit about this: the context window includes conversation + files read + command outputs, and performance can degrade as it fills.

Practical implications:

  • Prefer short sessions per task over one mega-session.
  • Keep investigations scoped; move heavy reading into subagents.
  • Treat long command output as toxic waste: capture the key lines, then clear/compact.

Source: Claude Code docs, Best Practices. Last verified: 2026-01-23.
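The "toxic waste" advice can be mechanized with ordinary shell plumbing. A minimal sketch; the noisy command here is a stand-in, so substitute `npm test` or whatever you actually run, and the log path is illustrative:

```shell
# Stand-in for any noisy command (a real session would use `npm test`, a
# build, a scraper run, etc.).
your_long_command() { seq 1 500; }

# Keep only the tail; this small file is what the agent reads,
# after which you can /clear or /compact the session.
your_long_command 2>&1 | tail -n 5 > /tmp/key-lines.log
cat /tmp/key-lines.log
```

The same pattern works with `grep` for error lines instead of `tail` for the end of the stream.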

The Right Mental Model


The LLM is an assistant, not an autonomously reliable coder. — Addy Osmani

Treat AI-generated code like a junior developer's contribution. It needs code review before merging, testing to verify it works, architectural guidance to fit your patterns, and supervision to catch mistakes.

You remain the senior dev. AI amplifies your expertise—it doesn't replace your judgment. This framing helps set expectations: fast output, but verification required.

In practice: Thorsten Ball describes moving from skepticism to productive use once he stopped expecting the AI to "get it right" and started treating it as a fast, tireless pair programmer who needs guidance. The shift: from "why doesn't it understand?" to "how do I steer it effectively?"

Cross-check with a second model: Some developers use one AI to generate code, then a separate session (or different model) to review it. Fresh context catches issues the original session might miss. Kent Beck uses this for complex refactors—one model proposes, another critiques.


Common Practices

These patterns work regardless of which tool you use.

1. Create a Project Context File

Every CLI tool supports a markdown file that shapes agent behavior:

| Tool | File Name |
| --- | --- |
| Claude Code | CLAUDE.md |
| Codex CLI | AGENTS.md |
| Gemini CLI | GEMINI.md |

What to include:

markdown
# CLAUDE.md

## Project
Brief description: what this does, who it's for.

## Stack
- Next.js 14 (App Router)
- TypeScript (strict mode)
- Supabase (Postgres + Auth)
- Tailwind CSS

## Commands
- `npm run dev` — dev server on :3000
- `npm test` — run vitest
- `npm run lint` — eslint check

## Conventions
- Prefer small, focused commits
- Write tests for new `lib/` functions
- Use early returns over nested conditionals

What NOT to include (emerging consensus):

Listing "known issues" is debatable. Some teams do it; others find the agent references stale issues inappropriately. A better pattern: link to your issue tracker rather than duplicating state in CLAUDE.md.

Keep it concise: 100-200 lines maximum. If longer, create per-folder context files.

Claude Code specific (official): start by generating it, then edit.

  • Run /init to generate a first draft based on detected tools and repo structure.
  • Treat the result as a starting point; keep only lines that measurably reduce mistakes.

Source: Claude Code docs, Write an effective CLAUDE.md. Last verified: 2026-01-23.

Use @file imports instead of bloating the root file (Claude Code)

Claude Code supports pulling in additional files via @path references inside CLAUDE.md.

markdown
See @README.md for project overview and @package.json for available npm commands.

# Additional instructions
- Git workflow: @docs/git-instructions.md

This pattern is high leverage because it:

  • Keeps the “always loaded” file small.
  • Lets you version deeper docs without turning CLAUDE.md into a junk drawer.

Source: Claude Code docs, Write an effective CLAUDE.md. Last verified: 2026-01-23.

CLAUDE.md is not the only place for durable guidance

Claude Code draws a sharp line between:

  • CLAUDE.md: global, always-loaded rules (keep short)
  • Skills: domain/workflow knowledge loaded on demand
  • Hooks: deterministic automation (no “did it follow instructions?” ambiguity)

This separation is worth adopting even if you’re not on Claude Code, because it mirrors how real teams manage rules: a small constitution + opt-in playbooks + enforced automation.

Context bundling tools: For larger codebases, tools like gitingest or repo2txt can bundle your codebase into a single file the agent can consume. Useful when you need the agent to understand the full picture.


The single most important file in your codebase for using Claude Code effectively is the root CLAUDE.md. This file is the agent's 'constitution.' — Anthropic Engineering

2. Plan Before You Code


One common mistake is diving straight into code generation with a vague prompt. — Addy Osmani

When to plan first: Multi-file changes, architectural decisions, anything where you'd hesitate if a junior dev proposed "let me just start coding." If you catch yourself thinking "wait, which approach?"—that's a planning signal.

When to skip planning: Single-file bug fixes, adding a field to a form, tasks where you already know exactly what the diff should look like. The overhead of planning exceeds the benefit.

The planning prompt pattern:

text
"I want to add user authentication. Before writing any code:
1. List the files that need to change
2. Propose a 3-step implementation plan
3. Note any architectural decisions I need to make
4. Identify risks or things that could go wrong"

Review the plan. Adjust it. Then proceed step by step.

The “Explore → Plan → Implement → Commit” loop (Claude Code official)

Claude Code’s docs recommend an explicit four-phase loop for non-trivial changes:

  1. Explore (read code, answer questions, no changes)
  2. Plan (write a plan you approve)
  3. Implement (execute, verifying as you go)
  4. Commit (checkpoint the result)

Even if your tool doesn’t have a dedicated “Plan Mode”, you can implement this as a discipline:

  • Start with: “Read X and explain how it works. Don’t change anything.”
  • Then: “Propose a 3-step plan. Wait for approval.”
  • Then: “Implement step 1 only; run tests; stop.”

Source: Claude Code docs, Explore first, then plan, then code. Last verified: 2026-01-23.

The spec.md approach (Addy Osmani):

Before writing code, use AI to rapidly iterate on a specification:

  1. Describe your idea to the AI
  2. Ask clarifying questions back and forth
  3. Have the AI compile findings into a spec.md
  4. Review and refine the spec
  5. Only then start implementation

This is "waterfall in 15 minutes" — you get the benefits of upfront planning without the overhead.

Emerging practice: Some developers have the agent write the plan to a markdown file, then start a fresh session with "/clear" and have the agent read the plan file to continue. This keeps context clean.

Break work into small, iterative chunks


A crucial lesson I’ve learned is to avoid asking the AI for large, monolithic outputs. Instead, we break the project into iterative steps or tickets and tackle them one by one. — Addy Osmani

Agents are most reliable when they can finish a small unit of work, validate it (tests/lint/build), and then move on.

Chunking prompt pattern:

text
We have the plan in plan.md. Implement ONLY Step 1.

Constraints:
- Touch the minimum number of files.
- After changes: run tests + lint.
- Stop once Step 1 is complete and report:
  - what changed
  - commands run and results
  - any follow-ups for Step 2

Practitioner rationale: big, all-at-once generation tends to produce inconsistent architecture and duplicated logic ("like multiple devs worked on it without coordinating"). Small steps make it easier to review diffs, keep context accurate, and roll back when needed.

3. Use Version Control as Your Safety Net

Git is your undo button. This is even more critical with agents because:

  • Agents can make changes you didn't expect
  • Large refactors happen quickly (faster than you can track mentally)
  • Reverting is faster than re-prompting

Workflow:

bash
# Before starting
git checkout -b feature/add-auth

# Let the agent work (most tools auto-commit)

# Review what happened
git log --oneline -10
git diff main

# If something went wrong
git reset --hard HEAD~3

Opinion varies on commit frequency. Some prefer auto-commits (easier to undo, clear history). Others batch commits (cleaner git log). Emerging practice: auto-commit during work, squash before merging to main.

Practitioner pattern (useful with agents): commit like save points. After each chunk that leaves the repo in a good state (tests passing), make a small commit. This is less about "perfect history" and more about cheap reversibility when the next agent step goes sideways.

Make git part of the agent’s working memory:

  • Paste diffs/commit logs into the session when context is stale.
  • Ask the agent to explain why each hunk exists.
  • Use git bisect with the agent when something regresses (LLMs are unusually good at reading diffs and being patient).

Guardrail (opinion, but widely repeated): never commit code you can’t explain. If an agent produces a complex fix, require a walkthrough (or simplify) before you merge.

Isolate experiments: branches and git worktree reduce blast radius and let you run parallel sessions safely. (See Running Parallel Agents.)
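A minimal worktree sketch, again self-contained in a temp repo; in a real project you would pick sibling directories like `../myapp-auth`, one per agent session, so parallel agents never share a working directory:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"   # placeholder identity for the demo
git config user.name "Demo"
git commit -q --allow-empty -m "init"

# Give each parallel agent its own branch + directory:
git worktree add "${repo}-auth"   -b agent/auth
git worktree add "${repo}-search" -b agent/search

git worktree list                      # all checkouts share one object store
git worktree remove "${repo}-search"   # clean up when a task is merged
```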

4. Manage Context (Tool-Dependent)

This principle varies significantly by tool.

Claude Code benefits from aggressive context clearing:

  • Use /clear when switching tasks
  • Context accumulates and can mislead the agent
  • The /compact command summarizes history to save tokens

Codex CLI persists sessions locally:

  • Use codex resume to continue previous work
  • Less need to clear because sessions are stored

Reality check (opinion, but common in high-velocity workflows): newer models can remain effective deep into long contexts, but tool UIs don’t always surface “the repo changed” as a first-class event. If you run long sessions, periodically force a refresh:

  • ask the agent to re-open the specific files it is about to edit
  • use git diff / git status as the source of truth
  • use a /catchup-style summary after compaction or breaks

Rewind beats arguing (Claude Code)

Claude Code checkpoints every action and supports rewinding conversation and/or code state.

Use this when the agent:

  • Took a wrong approach and “fixed the fix” three times.
  • Made a wide diff when you wanted a narrow one.
  • Pulled in a dependency or refactor you don’t want.

Commands/UX:

  • /rewind (or double-ESC) to restore an earlier checkpoint

This is not a replacement for git, but it’s faster than prompt wrestling.

Source: Claude Code docs, Rewind with checkpoints. Last verified: 2026-01-23.

Persisting sessions: treat them like branches

If your tool supports resuming, use it deliberately:

  • Name sessions by workstream (feature/refactor/incident)
  • Resume only when you’re continuing the same task
  • Start fresh for review (“writer/reviewer” pattern)

Claude Code examples:

bash
claude --continue   # resume the most recent session
claude --resume     # pick from previous sessions

Source: Claude Code docs, Resume conversations. Last verified: 2026-01-23.

Tell the tool what to preserve during compaction (Claude Code)

Compaction is often helpful, but it can delete the exact things you needed (modified file list, test commands, current failing output). Claude Code supports instructing compaction behavior via CLAUDE.md.

Practical pattern to add (edit to fit your repo):

markdown
## Context management
When compacting, preserve:
- the list of modified files
- the commands we ran + pass/fail
- the current "next step" plan

Source: Claude Code docs, Manage context aggressively. Last verified: 2026-01-23.

The "catchup" pattern (useful across tools):

After clearing or starting fresh, have the agent read recent changes:

text
"Read the last 5 commits in this branch and summarize what changed.
Then continue with [next task]."

Or create a custom /catchup command that does this automatically.

5. Provide Feedback Loops (Tests Are Critical)

Agents work best when they can verify their own work. Tests provide this:

text
"Add a calculateDiscount function.
Write a failing test first, then implement the function.
Run the test. If it fails, fix and retry."

Test driven development (TDD) is a 'superpower' when working with AI agents. — Kent Beck

More on this in the Testing with AI Agents section.

Verification isn’t just tests

Agents improve dramatically when they can check work against any deterministic signal:

  • tests (unit/integration/e2e)
  • typecheck/build
  • expected output from a CLI command
  • screenshots / visual diffs (especially UI)

The key is to give the agent a way to know it’s done without relying on your vibes.

Source: Claude Code docs, Give Claude a way to verify its work. Last verified: 2026-01-23.

6. Update Docs and Run Improvement Loops

Often overlooked: As you work with agents, you learn what prompts work, what conventions help, what the agent struggles with. Encode these learnings.

Practices to adopt:

  • After completing a feature, update your context file (CLAUDE.md / AGENTS.md) with any new conventions you established
  • Create slash commands for patterns you repeat
  • Periodically review: "What did the agent do well? What did it struggle with? How can I help it next time?"

We realized we needed to teach our agent a little more about our development philosophy and steer it away from bad behaviors. The agent now understands our values around Test Driven Development and minimal changes. — Martin Fowler's team

7. Engineer for “Inference-Speed” Iteration (Practitioner Patterns)

This section is a practitioner synthesis (not universal law), largely reflecting workflows described in Peter Steinberger’s Shipping at Inference-Speed.

Start with a CLI to close the loop

Claim (practitioner): “Whatever you build, start with the model and a CLI first.”

Why this often works:

  • Fast verification: an agent can run a CLI, parse stdout/stderr, and iterate without you “being the UI.”
  • Lower surface area: fewer moving parts than a UI during early exploration.
  • Natural automation seam: today’s agents are already excellent at shell loops.

Guide connection: this reinforces Provide Feedback Loops and complements Running Parallel Agents because CLI-driven verification is easy to parallelize.

Optimize repo ergonomics for agents (not just humans)

Opinion (practitioner): design folder structure and docs so the model can navigate “obvious” shapes.

Concrete patterns:

  • Keep a docs/ folder per project for subsystem notes, invariants, and how-to-run commands.
  • When finishing a chunk, ask the agent to write/update a doc (e.g. “write docs to docs/<topic>.md”).
  • Prefer conventions that are easy for tools to infer: predictable filenames, consistent command names, repeatable scripts.

This is compatible with the guide’s emphasis on CLAUDE.md / AGENTS.md, but pushes it further: documentation is durable memory; chat history is not.

Shorter prompts, more grounding

Observation (practitioner): as models improve, prompts often get shorter—especially when you can ground the agent with:

  • the exact command to run
  • the failing output
  • a screenshot or snippet (“fix padding” / “this output is wrong”)

Practical takeaway: spend effort on inputs that constrain the space (tests, repro commands, fixtures, screenshots), not elaborate prose.

Model behavior differs: “slow readers” vs “eager writers”

Opinion (practitioner): some models/tools spend a long time silently reading files before writing; others produce edits quickly but may miss context.

How to use this:

  • For large refactors: tolerate slower startup if it reduces “fix-the-fix” iterations.
  • For small edits: faster, more eager models can win end-to-end.

Guide connection: this is an applied version of Who Plans: Human or Agent? and Single Agent vs. Multiple Agents.

Dependency and ecosystem choice matters more than ever

Claim (practitioner): the biggest high-leverage decisions shift toward language/ecosystem and dependencies.

Rationale:

  • agents are better when the ecosystem is popular (more examples in training data)
  • fewer, well-understood dependencies reduce failure modes

This aligns with the guide’s Dependencies review rubric and Dependency Safety.


Tool Landscape

The Big Three CLI Agents

| Aspect | Claude Code | Codex CLI | Gemini CLI |
| --- | --- | --- | --- |
| Maker | Anthropic | OpenAI | Google |
| Open Source | No | Yes (Rust) | Yes |
| Context | 200k tokens | Varies by model | Very large |
| Strength | Deep reasoning, complex refactors | Fast iteration, cost-effective | Large context, strong ecosystem |
| MCP Support | Native | stdio-based | Native |

Note on pricing/limits: Token allowances vary by plan and change frequently. Check the official pricing pages before deciding.

Last verified: 2026-01-08

Features That Are Now Universal

All major CLI agents have converged on these capabilities:

Image input: Paste screenshots, error dialogs, or UI mockups directly into the conversation. Boris Cherny describes using the Claude Chrome extension to screenshot a broken UI, paste it, and say "fix this"—the agent sees the visual problem and fixes the code. Useful for CSS bugs, design implementation, and debugging visual regressions.

Multiple instances: Run 3-5+ agents in separate terminals on independent tasks. Simon Willison runs parallel agents for tasks like "write tests for module A" and "add feature B" simultaneously. The key constraint: they must touch different files (see Running Parallel Agents).

MCP integration: Connect agents to external tools—Linear for issues, Slack for context, Sentry for error reports. Teams check .mcp.json into git so everyone shares the same tool wiring.

Session persistence: Resume where you left off. Codex has explicit codex resume; Claude Code compacts context automatically. Useful when you hit rate limits or need to step away.

Custom commands: Define shortcuts like /pr (lint, test, prepare description) or /catchup (read recent commits, summarize state). Store them in .claude/commands/ or equivalent and commit them—they're team infrastructure. For Claude, include a description frontmatter so commands are invocable by tools:

markdown
---
description: "Run tests and summarize failures"
---

# /test

Claude Code Specifics

Boris Cherny's Workflow (Creator of Claude Code)

From Boris Cherny’s thread on how he uses Claude Code day-to-day (sources: x.com, Thread Reader). This is a practitioner snapshot, not a universal prescription (Boris explicitly emphasizes that the tool is meant to support many styles).

Parallel execution (terminal + web + phone): (practitioner report)

  • Runs ~5 Claude sessions locally in terminal tabs (numbered), using system notifications to know when one needs input.
  • Runs ~5–10 additional sessions on claude.ai/code in parallel.
  • Hands sessions back and forth between local and web (e.g., using Claude Code’s handoff/teleport capabilities) and sometimes starts sessions from his phone, then checks in later.

Model choice (opinion, with rationale): (practitioner opinion)


“I use Opus 4.5 with thinking for everything… even though it’s bigger & slower… since you have to steer it less and it’s better at tool use, it is almost always faster… in the end.”

This is a useful counterpoint to “always pick the cheapest/fastest model”: for some workflows, less steering + better tool use wins on end-to-end time.

Plan mode → execution mode: (practitioner workflow)


“Most sessions start in Plan mode (shift+tab twice)… go back and forth… until I like its plan. From there, I switch into auto-accept edits mode and Claude can usually 1-shot it.”

This aligns with the guide’s Plan Before You Code principle: you’re buying correctness and fewer iterations.

Compounding team memory via CLAUDE.md: (practitioner workflow)

  • Their team shares a single, versioned CLAUDE.md for the repo.
  • “Anytime we see Claude do something incorrectly we add it to the CLAUDE.md, so Claude knows not to do it next time.”
  • In code review, they’ll tag @.claude on PRs to propose additions/edits to CLAUDE.md (via the Claude Code GitHub action: /install-github-action).

Guide takeaway: treat CLAUDE.md as institutional memory and update it from real failures—similar to how teams evolve lint rules or review checklists.

Slash commands for the inner loop: (practitioner workflow)

  • Boris uses slash commands for repeated workflows and checks them into git under .claude/commands/.
  • Example: a /commit-push-pr command used many times daily.
  • Notably, the command uses inline bash to precompute git status and other context, reducing back-and-forth with the model.

This is a strong pattern: push computation into tools (shell, scripts) and keep the model focused on decisions.
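A sketch of what such a command file can look like, assuming Claude Code's frontmatter fields and inline-bash syntax (!`command`) for precomputing context before the model sees the prompt. The command name, allowlist, and wording are illustrative; check your installed version's docs:

```markdown
---
description: "Commit staged work, push, and open a PR"
allowed-tools: Bash(git status:*), Bash(git diff:*), Bash(git log:*)
---

# /commit-push-pr (illustrative sketch)

- Current status: !`git status`
- Staged diff: !`git diff --cached --stat`
- Recent commits: !`git log --oneline -5`

Write a commit message from the context above, then commit, push,
and open a PR. Do not ask me to re-describe the changes.
```

The inline bash runs before the model is prompted, so the agent starts with the facts instead of spending turns gathering them.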

Subagents for repeatable “PR-shaped” work: (practitioner workflow)

  • Uses subagents like code-simplifier (cleanup after changes) and verify-app (end-to-end verification instructions).

Hooks to automate the last 10%: (practitioner workflow)

  • Uses a PostToolUse hook to format generated code (“the last 10%”) to avoid CI formatting failures.

Permissions: avoid YOLO by default, allowlist instead: (practitioner workflow)


“I don’t use --dangerously-skip-permissions. Instead, I use /permissions to pre-allow common bash commands…”

  • Many of these permissions are shared via .claude/settings.json.

Tool use beyond the repo (MCP): (practitioner workflow)

  • Claude Code uses their tools: posts/searches Slack (via MCP), runs BigQuery queries (via bq), grabs Sentry logs, etc.
  • Their Slack MCP configuration is checked into .mcp.json and shared.

Long-running tasks: unblock progress, then verify: (practitioner workflow)

  • For tasks that take a long time, Boris either:
    1. prompts Claude to verify with a background agent when done,
    2. uses an agent Stop hook to make verification deterministic, or
    3. uses the “ralph-wiggum” plugin (attributed to Geoffrey Huntley).
  • In a sandbox, he may use --permission-mode=dontAsk or even --dangerously-skip-permissions to avoid prompt deadlocks.

Most important tip: build a verification loop: (practitioner claim)


“Give Claude a way to verify its work… [it] will 2–3x the quality of the final result.”

For web work, he describes using the Claude Chrome extension to open a browser, test the UI, and iterate. For other domains, verification might be tests, a CLI command, or a simulator. This connects directly to Testing with AI Agents.

Key Commands

| Command | Purpose |
| --- | --- |
| /clear | Reset context |
| /compact | Summarize to save tokens |
| /cost | Show token usage |
| /permissions | Pre-allow safe bash commands |
| /doctor | Diagnose environment |

Custom slash commands: Create .claude/commands/yourcommand.md with a description frontmatter so Claude can invoke it programmatically. Boris uses /commit-push-pr dozens of times daily.

Subagents: Boris uses code-simplifier (cleans up after Claude) and verify-app (end-to-end testing) regularly.

Codex CLI Specifics

Interactive vs. Exec mode:

bash
# Interactive (full TUI)
codex

# Non-interactive (for automation)
codex exec "Add error handling to api/users.ts"

# Resume previous session
codex resume

Parallel tool calling: Codex benefits from batching file reads:


Before any tool call, decide ALL files/resources you will need. Batch everything. — OpenAI Codex Guide

Gemini CLI Specifics

Evidence level: Official docs are solid; public practitioner writeups are thinner than Claude/Codex. Treat this as reliable basics plus known sharp edges.

Official starting points:

Day-to-day usage that translates well across teams:

  • Keep project context in GEMINI.md and keep it short.
  • Put repeatable prompts into versioned slash commands.
  • Use a plan-first, step-by-step loop with verification after each step.

Custom slash commands:

Known sharp edges from issues:

Guidance: treat checkpoint/restore as a convenience, not your only safety net. Git branches or worktrees plus time-to-green checks remain the backbone.

Advanced Patterns

Custom Command Libraries

Build reusable commands for your workflow:

text
.claude/commands/
├── catchup.md    # Read recent changes, summarize state
├── pr.md         # Lint, test, prepare PR description
├── test.md       # Run tests, fix failures iteratively
└── improve.md    # Review code for improvements

Example catchup.md:

markdown
---
description: "Read recent changes and summarize the current state"
---

Read all files changed in this branch compared to main.
Summarize what has changed and the current state of this work.

MCP Integration

Connect agents to external tools via Model Context Protocol:

Common integrations:

  • Linear — Read issues, create tasks
  • Slack — Post updates, read channel context
  • Sentry — Access error reports
  • Figma — Read design specs

Example setup (Claude Code):

json
// .mcp.json
{
  "mcpServers": {
    "linear": { "command": "npx", "args": ["@anthropic/mcp-linear"] }
  }
}

Note: You may see MCP config documented as .mcp.json (repo root) or under a tool-specific directory (e.g. .claude/…) depending on tool/version. Prefer the official docs for your installed version; the key practice is checking the configuration into git so the team shares the same tool wiring.

Hooks: Formatting and Verification Automation

Hooks let you automate “always do X after Y” without re-prompting.

Two common hook shapes (practitioner-inspired):

  • PostToolUse formatting: after Claude edits files, run formatter(s) so CI doesn’t fail on style. Boris describes this as catching “the last 10%.”
  • Stop-hook verification: when an agent finishes a long task, automatically run a verification step (tests, e2e script, smoke check) so work doesn’t stall waiting for you.

Treat hooks like code: keep them deterministic, fast, and safe. Pair with the guide’s permissions progression and Testing with AI Agents.
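As a concrete sketch of the first shape: in Claude Code, hooks are configured in settings (e.g. .claude/settings.json). The schema below matches recent versions but changes over time, so verify against the official docs; the matcher and the prettier command are illustrative assumptions, not requirements:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx prettier --write ." }
        ]
      }
    ]
  }
}
```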


Testing with AI Agents

This deserves special attention. Testing is emerging as the critical enabler for effective agentic coding.

Why Tests Matter More Now


Those who get the most out of coding agents tend to be those with strong testing practices. An agent like Claude can 'fly' through a project with a good test suite as safety net. — Addy Osmani

Tests provide:

  1. Clear success criteria — The agent knows when it's done
  2. Autonomous iteration — Write → test → fix → test, without human intervention
  3. Safety net — Catch regressions from AI-generated code
  4. Confidence for larger changes — You can let the agent refactor knowing tests will catch breaks

TDD with Agents


By writing a test before you write any code, you are essentially 'prompting' the AI code generator with exactly the functionality you want. — Engineering Harmony

The pattern:

text
"Add a calculateDiscount function that:
- Takes a price and discount percentage
- Returns the discounted price
- Throws if percentage > 100

Write a test file first with cases for each requirement.
Run the test (it should fail).
Implement the function.
Run the test again. Fix until it passes."

The agent enters a tight feedback loop, iterating until tests pass.
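For the calculateDiscount prompt above, the loop should converge on something like this minimal sketch (plain assertions stand in for whatever test framework your project uses):

```javascript
// Minimal sketch of the behavior the prompt specifies.
function calculateDiscount(price, percentage) {
  if (percentage > 100) {
    throw new Error("discount percentage cannot exceed 100");
  }
  return price * (1 - percentage / 100);
}

// The "test file" cases, one per requirement:
console.assert(calculateDiscount(200, 25) === 150); // basic discount
console.assert(calculateDiscount(80, 0) === 80);    // zero discount is a no-op
let threw = false;
try {
  calculateDiscount(100, 150); // > 100% must throw
} catch (e) {
  threw = true;
}
console.assert(threw);
```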

Building Test Coverage with Agents

If you have a codebase with low coverage, agents can help:

text
"Look at lib/utils.ts. For each exported function that lacks tests,
write a test file. Start with the simplest functions. Run tests after each."

Emerging practice: Define a coverage threshold in CI. New PRs can't decrease coverage. This prevents regression and encourages incremental improvement.
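If your test runner is Jest, for instance, the threshold can live directly in package.json (the numbers here are illustrative; most coverage tools offer an equivalent gate):

```json
{
  "jest": {
    "collectCoverage": true,
    "coverageThreshold": {
      "global": { "lines": 80, "branches": 70 }
    }
  }
}
```

Jest fails the run when coverage drops below these values, which is what lets CI enforce the "new PRs can't decrease coverage" rule.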

Caveats


AI code assistants can generate plausible-looking test cases and code, but you don't know where it learned the semantics from. What AI offers may be incorrect, inefficient, or overly complex. — testRigor

Review AI-generated tests carefully. They can:

  • Test the wrong thing (testing implementation, not behavior)
  • Have false positives (tests that pass but don't verify what you think)
  • Mirror bugs in the implementation

Bleeding Edge: Orchestration & Memory

These tools are experimental but point to where things are heading.

Beads: Memory for Coding Agents

Problem: Agents wake up with no memory of previous sessions. You re-explain context every time.

Solution: Beads by Steve Yegge provides persistent, git-backed memory.


The experiment that triggered Beads was simple: move the plan into an issue tracker and give agents a way to query 'ready work.' Within minutes, the behavior shifted from meandering to disciplined: compute the ready set, pick a task, work it, record discovered work, repeat. — Steve Yegge

How it works:

  • Issues stored as JSONL in .beads/
  • Git-backed (versioned, branched, merged like code)
  • Agents query for "ready work" rather than relying on you to specify

Gas Town: Multi-Agent Orchestration

Problem: Managing 10+ parallel agents manually is chaotic.

Solution: Gas Town by Steve Yegge orchestrates multiple agents.


Gas Town is a Go-based orchestrator enabling developers to manage 20-30 parallel AI coding agents productively using tmux.

Features:

  • Manages agent roles (Mayor, Crew, Witness, etc.)
  • Handles merge queue and work swarming
  • Built on Beads for persistent state

Caveat: Yegge himself says "You need to be at least level 6 [of his 8-stage model] before you'll appreciate Gas Town." This is advanced tooling.

Agent Mail (MCP-Agent-Mail)

Problem: When multiple agents work on the same codebase, they can conflict.

Solution: Agent Mail provides message routing and file reservations for coordinating agents.

Features:

  • Agents can send messages to each other
  • File reservation system prevents edit conflicts
  • Git-backed communication archive

Recipes: Copy-Paste Workflows

Practical prompts and command sequences for common tasks.

Recipe 1: New Feature with TDD

Setup:

bash
git checkout -b feature/your-feature-name

Prompt:

text
I want to add [FEATURE DESCRIPTION].

Before writing implementation:
1. Create a test file at [path/to/tests/feature.test.ts]
2. Write failing tests that specify the behavior:
   - [Test case 1]
   - [Test case 2]
   - [Edge case]
3. Run the tests to confirm they fail
4. Implement the minimum code to pass
5. Run tests again, fix until green
6. Refactor if needed, keeping tests green

Recipe 2: Safe Refactoring

Setup:

bash
git checkout -b refactor/component-name
git stash  # save any uncommitted work

Prompt:

text
I need to refactor [FILE/COMPONENT].

Before making changes:
1. Read the existing code and understand its behavior
2. Identify all callers/consumers of this code
3. Write characterization tests if none exist (tests that capture current behavior)
4. Run tests to establish baseline

Then refactor in small steps:
- Make ONE logical change
- Commit with descriptive message
- Run tests
- If green, continue. If red, fix or revert.

After each commit, I'll review before you continue.

Recipe 3: Bug Fix with Regression Test

Prompt:

text
Bug: [DESCRIPTION OF BUG]
Expected: [WHAT SHOULD HAPPEN]
Actual: [WHAT HAPPENS NOW]

1. First, write a failing test that reproduces this bug
2. Run it to confirm it fails as expected
3. Find the root cause (don't guess—trace the code)
4. Fix the minimal code needed
5. Run test to confirm it passes
6. Run full test suite to check for regressions

Recipe 4: Code Review Prep

Prompt:

text
Prepare this branch for review:

1. Run linter and fix issues
2. Run tests and ensure all pass
3. Check for:
   - console.log calls or debug statements to remove
   - Commented-out code to delete
   - TODOs that should be addressed or ticketed
4. Generate a PR description summarizing:
   - What changed and why
   - How to test
   - Any migration/deployment notes

Recipe 5: Catch-Up After Context Clear

Prompt:

text
/catchup

Read the last 5 commits on this branch.
Summarize:
- What has been implemented
- What's currently broken or incomplete
- What the next logical step is

Then wait for my instruction.

Checklists

Make ownership explicit: some checks are human decisions, others are agent verification.

Human checklist (decisions)

  • I can explain the diff without guesswork.
  • Scope is tight: one job, one PR.
  • Risk is understood: auth, data, migrations, dependencies.
  • Rollback is easy or planned.
  • The agent’s evidence is real: commands + outputs, not claims.

Agent checklist (verification + receipts)

Default mode is verification-only. If it fails, run a separate fix step.

Prompt pattern:

text
Run this checklist and report PASS/FAIL for each item.

Rules:
- Include the exact command you ran and the key output snippet.
- If you didn’t run a command, mark FAIL and say what you need.
- Do not change files (verification only).

Checklist:
1) git status (show it)
2) tests (run: <your test command>)
3) lint/format (run: <your lint command>)
4) build/typecheck (run: <your build command>)
5) secrets scan: confirm no .env / keys touched (show grep/ripgrep result)
6) PR size sanity: list changed files + total LOC changed
7) summary: what/why/how to test + risks + rollback note

Operational tip: turn this into a saved command so it runs the same way every time.
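In Claude Code, for example, that saved command could be a versioned file in the commands directory (path and format vary by version; check the docs):

```markdown
<!-- .claude/commands/verify.md (illustrative path and contents) -->
Run the project checklist and report PASS/FAIL for each item.
Include the exact command you ran and the key output snippet.
Do not change any files; verification only.
```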


Review Rubric: What to Check in AI-Generated Diffs

When reviewing agent output, check these categories:

Correctness

  • Does the code do what was asked?
  • Are edge cases handled?
  • Is the logic sound (not just syntactically valid)?

Security

  • No secrets hardcoded
  • Input validation on user data
  • SQL queries parameterized
  • No eval(), dangerouslySetInnerHTML, or equivalent
  • Auth/authz checks in place

Performance

  • No N+1 queries introduced
  • Large loops have appropriate limits
  • Expensive operations not in hot paths
  • Memory: no obvious leaks (unclosed handles, growing arrays)
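The first bullet is the easiest to miss in review. Here is a minimal in-memory sketch of the N+1 shape versus a batched query, with a queries counter standing in for real database round-trips:

```javascript
// In-memory stand-in for a database table, to count "round-trips".
const posts = [
  { userId: 1, title: "a" },
  { userId: 2, title: "b" },
  { userId: 2, title: "c" },
];
const users = [{ id: 1 }, { id: 2 }];
let queries = 0;

// N+1 shape: one query per user, so round-trips grow with the data.
const perUser = (id) => { queries++; return posts.filter((p) => p.userId === id); };
for (const u of users) u.posts = perUser(u.id);
console.log(queries); // 2 queries for 2 users; N for N users

// Batched shape: one query for all users, grouped in memory.
queries = 0;
const byUsers = (ids) => { queries++; return posts.filter((p) => ids.includes(p.userId)); };
const all = byUsers(users.map((u) => u.id));
console.log(queries); // always 1, regardless of user count
```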

Dependencies

  • New packages justified and trustworthy
  • Versions pinned (not latest)
  • No unnecessary dependencies added
  • License compatibility checked

Style & Maintainability

  • Follows project conventions (check CLAUDE.md / AGENTS.md)
  • Names are clear and consistent
  • Complex logic has comments
  • No dead code or commented-out blocks

Migrations & Data

  • Migrations are reversible where possible
  • Data changes are idempotent
  • Backfill scripts handle large datasets safely

Safety Patterns for Real Repos

Secrets and Environment

Rule: Never let the agent read or print secrets.

Add to your CLAUDE.md / AGENTS.md:

markdown
## Security
- NEVER read, print, or include contents of .env files
- NEVER commit files containing API keys, tokens, or passwords
- If you need an env var value, ask me to provide it

Patterns:

  • Use .env.example with placeholder values
  • Reference env vars by name, not value: process.env.DATABASE_URL
  • If agent suggests hardcoding a secret, reject the diff
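In code, referencing by name might look like this small helper (requireEnv is an illustrative name, not a library function):

```javascript
// Read configuration by name; never paste the value into code or chat.
function requireEnv(name) {
  const value = process.env[name];
  if (value === undefined) {
    throw new Error(`Missing required env var: ${name}`);
  }
  return value;
}

// Usage: const dbUrl = requireEnv("DATABASE_URL");
```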

Dependency Safety

Before accepting new dependencies:

text
Before adding [PACKAGE], tell me:
1. Weekly downloads on npm/pypi
2. Last publish date
3. Number of open security advisories
4. What it does that we can't do with existing deps

Lock versions explicitly:

json
// Good
"lodash": "4.17.21"

// Bad
"lodash": "^4.17.21"
"lodash": "latest"

Dangerous Operations

Add guardrails to CLAUDE.md / AGENTS.md:

markdown
## Forbidden Operations
- Never run `rm -rf` on directories outside the project
- Never force-push to main/master
- Never run database migrations in production without explicit approval
- Never modify .git directory directly

Running Parallel Agents


I began to embrace the parallel coding agent lifestyle. — Simon Willison

Running multiple agents simultaneously can multiply your throughput — but only if you prevent conflicts. Follow these rules.

When Parallel Works

  • Tasks are independent (different files/modules)
  • Well-defined scope upfront
  • You're comfortable context-switching between reviews

When to Avoid

  • Tightly coupled changes
  • Exploratory work where scope is unclear
  • You need deep focus on one complex problem

Rule 1: One Worktree Per Agent

bash
# Main repo stays clean
/project/main           # Human review and integration

# Each agent gets its own worktree
/project/agent-auth     # Agent working on auth feature
/project/agent-api      # Agent working on API changes
/project/agent-tests    # Agent expanding test coverage

Setup:

bash
git worktree add -b feature/auth ../project-agent-auth
git worktree add -b feature/api ../project-agent-api

Agent-managed approach: You can instruct the agent to set up its own worktree:

text
"Create a git worktree for this feature at ../project-auth.
Work in that directory. When done, let me know so I can review."
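The full lifecycle (create, inspect, remove) can be sketched as a self-contained script; it builds a scratch repo so it is safe to run anywhere, and assumes only git and a POSIX shell (paths and branch names are illustrative):

```shell
# Self-contained sketch of the worktree lifecycle against a scratch repo.
set -e
repo=$(mktemp -d)
wt="$repo-agent-auth"
git -C "$repo" init -q
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "init"
git -C "$repo" worktree add -b feature/auth "$wt" >/dev/null 2>&1
git -C "$repo" worktree list          # shows the main repo plus the agent worktree
git -C "$repo" worktree remove "$wt"  # clean up once the branch is merged
rm -rf "$repo"
echo "worktree lifecycle ok"
```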

Rule 2: Non-Overlapping File Ownership

Before starting parallel agents, define boundaries:

| Agent | Owns | Hands Off |
| --- | --- | --- |
| Auth Agent | src/auth/*, src/middleware/auth.ts | Everything else |
| API Agent | src/api/*, src/routes/* | src/auth/* |
| Test Agent | tests/* | Source files (read-only) |

Tell each agent explicitly:

text
You are working on [SCOPE].
Do NOT modify files outside: [FILE PATTERNS]
If you need changes elsewhere, stop and tell me.

Rule 3: Integration Branch

Never merge agents directly to main. Use an integration branch:

text
main
  └── integration
        ├── feature/auth (Agent 1)
        ├── feature/api (Agent 2)
        └── feature/tests (Agent 3)

Merge order:

  1. Merge lowest-risk changes first (tests, docs)
  2. Merge foundational changes before dependent ones
  3. Run full test suite after each merge
  4. Only merge integration → main after all conflicts resolved

Rule 4: Coordination Protocol

For teams using Agent Mail or similar:

  • Agents reserve files before editing
  • Release reservations after committing
  • Send status messages when hitting blockers
  • Check inbox before starting new work

Troubleshooting

Agent Is Stuck in a Loop

Symptoms: Agent keeps trying same fix, not making progress

Real example: An agent trying to fix a TypeScript error keeps adding type annotations, but the real issue is a missing dependency import. Each iteration adds more complexity without solving the root cause.

Solutions (try in order):

  1. Reduce scope: "Stop. Focus only on [specific small thing]"
  2. Reset context: /clear then re-explain with fresh context
  3. Switch models: Try a different model for a fresh perspective (Opus → Sonnet, or vice versa)
  4. Add tests: "Write a test that fails, then fix just that"
  5. Provide hints: Give the agent a specific approach to try
  6. Bisect: "The code worked at commit X. Find what broke it."

Practitioner insight: Simon Willison notes that when an agent loops, it's often because the problem is underspecified. Giving a concrete failing test case or exact error output often breaks the loop immediately.

Agent Misunderstands the Codebase

Symptoms: Agent makes changes that don't fit project patterns

Solutions:

  1. Update your context file (CLAUDE.md / AGENTS.md) with the patterns it's missing
  2. Point to specific example files: "Look at how we do this in src/auth/login.ts"
  3. Explain the "why" not just the "what"

Agent Output Is Too Verbose/Complex

Symptoms: Over-engineered solutions, unnecessary abstractions

Real example: Asked to "add email validation," the agent creates an abstract Validator base class, a ValidationResult type, a ValidationRegistry, and finally an EmailValidator extending the base—when const isValidEmail = (s) => /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(s) would suffice.

Solutions:

  1. Be explicit: "Use the simplest possible implementation. No new classes or abstractions."
  2. Set constraints: "This should be under 50 lines"
  3. Reject and retry: "This is too complex. Simpler approach: just a regex helper function"
  4. Reference existing patterns: "Look at how we did isValidUrl in utils.ts and follow that style"

Agent Keeps Suggesting Same Wrong Fix

Symptoms: Suggesting approach you've already rejected

Solutions:

  1. /clear and start fresh
  2. Explicitly state what NOT to do
  3. Provide the correct approach directly

Tests Pass But Code Is Wrong

Symptoms: Tests green, but manual testing reveals bugs

Solutions:

  1. Tests are testing wrong thing—rewrite with clear assertions
  2. Add integration/e2e tests, not just unit tests
  3. Review test coverage: are edge cases covered?

Staying Honest About Productivity

AI makes code cheaper. The bottleneck shifts to verification, review, and fixing mistakes.

Questions to ask yourself periodically:

  • Am I actually shipping faster, or just generating more code to review?
  • Are my PRs getting harder to review (bigger, more AI-generated churn)?
  • How often do I revert AI-generated changes?
  • Is review becoming a bottleneck for the team?

The trap: AI can flood the pipe with PRs that take longer to review than they saved to write.


"I can only focus on reviewing and landing one significant change at a time." — Simon Willison

If you're a team lead tracking metrics formally: time-to-green (prompt → tests pass), rework rate (fixup commits), and rollback rate tell you whether AI is actually helping.


Where Practices Diverge

Not everyone agrees. Here are the key debates.

Permission Skipping

| Approach | Argument |
| --- | --- |
| Skip permissions (--dangerously-skip-permissions) | "Unlocks huge productivity" — no confirmation dialogs for every action |
| Keep permissions | Confirmation prompts catch mistakes before they happen |
| Sandbox (Docker/VM) | Run unrestricted but limit blast radius |

Recommended progression:

text
Level 1: Safe Mode (default)
├── All permissions prompts enabled
├── Good for: learning the tool, unfamiliar codebases
└── Cost: slower, but maximum safety

Level 2: Allowlist Common Commands
├── Use `/permissions` to pre-allow safe patterns
│   - `npm test`, `npm run lint`
│   - `git add`, `git commit` (not push)
│   - File reads (not writes initially)
├── Good for: familiar projects, trusted workflows
└── Claude Code: pre-allow patterns via /permissions (exact syntax varies by version; check the docs)

Level 3: Sandbox for YOLO
├── Run in Docker container or disposable VM
├── Network restricted, no access to credentials
├── Skip all permissions inside the sandbox
├── Good for: experimental work, high-velocity prototyping
└── Cost: setup overhead, can't test integrations

Level 4: Full YOLO (use sparingly)
├── --dangerously-skip-permissions
├── Only for: throwaway code, personal experiments
├── Never for: production access, client work
└── Always on a fresh branch with ability to revert

Running [agents] with permission checks disabled is dangerous and stupid, and you should only do it if you are willing to take dangerous and stupid risks. — Steve Yegge (despite using it himself)

Mainline-by-Default vs. Branch/Worktree Safety

There’s a sharp divergence between solo, high-velocity workflows and team/production workflows.

| Approach | Why people do it | When it breaks down |
| --- | --- | --- |
| Commit directly to main (practitioner opinion; common among solo builders) | Lowest cognitive overhead; fewer merge conflicts; treats git history as a linear “walk up the mountain” | Teams, CI gates, release branches, or any environment where main must remain deployable |
| Feature branches / PRs (industry default) | Reviewability, CI policy enforcement, safer collaboration | Slightly slower inner loop; more state to manage |
| git worktree per agent (agent-heavy teams) | Parallelism without clobbering; clean separation of contexts | More setup; still needs an integration strategy |

Guide stance: for anything shared or production-facing, keep the branch/worktree safety patterns in this guide. If you adopt mainline-by-default for solo work, compensate with strong verification loops (tests, formatters, smoke scripts) and “save point” commits.

Who Plans: Human or Agent?

| Approach | When it works |
| --- | --- |
| Agent plans first | Well-defined tasks, agent knows the codebase |
| Human plans, agent executes | Complex architecture, novel problems |
| Iterative co-planning | Exploratory work, you're learning alongside the agent |

Emerging practice: Have the agent propose a plan, review it, adjust, then approve. This combines agent knowledge with human judgment.

Single Agent vs. Multiple Agents

| Approach | Best for |
| --- | --- |
| Single agent, deep focus | Complex refactoring, architectural work |
| Parallel agents | Independent features, test coverage expansion |

I can only focus on reviewing and landing one significant change at a time, but I'm finding an increasing number of tasks that can be fired off in parallel without adding too much cognitive overhead. — Simon Willison

One Model vs. Multiple Models ("second opinion" workflows)

Not everyone agrees on whether you should stick to one model/tool or treat models as interchangeable.

| Approach | Why people choose it | Main failure mode |
| --- | --- | --- |
| Single model for everything | Consistent style, fewer moving parts, easier team standardization | You can get stuck in a model’s blind spots; repeated wrong suggestions |
| Swap models when stuck ("model musical chairs") | Fresh perspective; different models are better at different tasks | Context transfer overhead; inconsistent conventions |
| Two-model loop (generator + reviewer) | One model writes, another critiques; catches subtle mistakes | Can create "review theater" if you don’t also run real tests |

Guide stance: any of the above can work. If you choose multi-model, keep the workflow grounded in external verification (tests, linters, builds) and your Review Rubric—not just agreement between models.


Community Pulse (January 2026)

Synthesized from practitioner discussions on X. Updated monthly.

Hybrid workflows are emerging: Rather than picking one tool, practitioners combine them:


"My vibe coding combo: Claude Code with Opus 4.5 to build a functional full stack foundation... Gemini CLI with Gemini 3.0 to polish... Codex CLI with GPT-5.2-codex xhigh to scan" — @rafaelobitten

Hybrid setups like CCG-Workflow use Claude Code as supervisor with Codex and Gemini for collaborative development.

Tool strengths by task:

  • Claude Code: Faster for code generation, but "has more fine mistakes"
  • Codex CLI: "Better for thinking-based work" and more usage hours per $20
  • Gemini CLI: "Performs better at frontend design", nearly free, similar UX to Claude Code

The closing gap:


"Gemini CLI has caught up with Claude Code in terms of effectiveness. Lot less tool call failures." — @championswimmer

Bottom line: No clear winner—personal workflow matters more than picking "the best" tool.


Sources

Emerging Tools

  • Beads — Memory for coding agents
  • Gas Town — Multi-agent orchestration
  • Clawdbot — Experimental CLI automation bot (early-stage)
  • Agent Mail — Multi-agent coordination

Contributing

We welcome input! The best way to improve this guide is to share a link to a credible article, blog post, or X thread with practical advice. Send us a suggestion and we'll review it for inclusion.


This guide is maintained by TheCrux and updated as practices evolve.
