The Flow-State Experiment: Measuring Papercuts in AI Coding Tools
Inspired by Geoffrey Huntley's papercuts framework—the idea that developer tool quality is measured not by catastrophic failures, but by cumulative micro-frictions that drain productivity and joy.
The Philosophy
When evaluating AI coding tools, we often focus on the wrong metrics: benchmark scores, parameter counts, or "vibe check" impressions. But the real measure of a tool's quality is how often it makes you stop, re-read, correct, or re-prompt—the small papercuts that accumulate into flow-state destruction.
This experiment systematically measures these friction points across three leading AI coding CLI tools:
| Tool | Model | Access Method |
|---|---|---|
| Claude Code | claude-sonnet-4-20250514 | claude CLI |
| Codex CLI | gpt-5.2-codex | codex exec |
| Gemini CLI | gemini-2.5-pro | gemini CLI |
Experiment Design
We designed two experiments targeting the most common sources of flow disruption:
Experiment 4: Mid-Stream Correction
The scenario: You ask for a Flask endpoint, then immediately request a modification.
Why this matters: Developers constantly iterate on generated code. The tool's ability to understand "now change X" without losing context is critical.
Prompt 4.1: Write a Flask endpoint /api/users that returns a JSON list of all users.
Use a simple in-memory list as the data source.
Prompt 4.2: Now add query parameter filtering so I can filter by active=true or active=false.
Experiment 5: Context Window Handling
The scenario: You provide a 100-line module and ask to add a single method.
Why this matters: The ideal response outputs just the new method. Re-outputting 100 lines forces the developer to scroll, compare, and manually integrate—pure friction.
Prompt: [100-line User module]
In the User class, add a new method get_full_name() that returns
f"{self.first_name} {self.last_name}".
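For orientation, here is a minimal, hypothetical stand-in for the elided 100-line module (field names follow the f-string in the prompt), with the requested method included:

```python
class User:
    """Minimal hypothetical stand-in for the experiment's 100-line User module."""

    def __init__(self, first_name: str, last_name: str):
        self.first_name = first_name
        self.last_name = last_name

    # The single method each tool was asked to add:
    def get_full_name(self) -> str:
        return f"{self.first_name} {self.last_name}"
```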
Results: Experiment 4 (Mid-Stream Correction)
All three tools produced correct code for both prompts. The differences emerged in how they handled the task.
Claude Code Response
Prompt 4.1:
```python
from flask import Flask, jsonify

app = Flask(__name__)

users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
    {"id": 3, "name": "Charlie", "email": "charlie@example.com"},
]

@app.route('/api/users', methods=['GET'])
def get_users():
    return jsonify(users)
```
Prompt 4.2: Correctly added request import and filtering logic. Preserved the email field from 4.1.
Friction events: 0
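The combined 4.2 result would look roughly like the sketch below. This is a hypothetical reconstruction of the described behavior (filtering added, `email` field preserved), not the tool's verbatim output:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com", "active": True},
    {"id": 2, "name": "Bob", "email": "bob@example.com", "active": False},
    {"id": 3, "name": "Charlie", "email": "charlie@example.com", "active": True},
]

@app.route('/api/users', methods=['GET'])
def get_users():
    # ?active=true / ?active=false narrows the list; any other value returns all users
    active = request.args.get('active')
    if active in ('true', 'false'):
        wanted = active == 'true'
        return jsonify([u for u in users if u['active'] == wanted])
    return jsonify(users)
```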
Codex CLI Response
Prompt 4.1:
```python
from flask import Flask, jsonify

app = Flask(__name__)

USERS = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Charlie"},
]

@app.get("/api/users")
def get_users():
    return jsonify(USERS)
```
Prompt 4.2: Here's where things got interesting.
CRITICAL FINDING: Codex CLI in `exec` mode has no session memory.
When given Prompt 4.2, Codex:
- Did NOT remember the 4.1 response
- Searched the repository for Flask code
- Found our experiment session files
- Modified those files directly
Instead of outputting code, it ran:

```shell
rg -n "Flask" .
```

Then it applied a patch to our experiment documentation.
Friction events: 3 (context loss, repo search, unintended file modification)
Gemini CLI Response
Prompt 4.1:
```python
from flask import Flask, jsonify

app = Flask(__name__)

users = [
    {"id": 1, "name": "Alice", "active": True},
    {"id": 2, "name": "Bob", "active": False},
    {"id": 3, "name": "Charlie", "active": True},
]

@app.route('/api/users', methods=['GET'])
def get_users():
    return jsonify(users)
```
Notable: Gemini anticipated the active field in 4.1, making 4.2 a smaller change.
Prompt 4.2: Correctly added filtering. Some tool registry errors displayed but didn't affect output.
Friction events: 2 (cosmetic tool errors, rate limiting)
Results: Experiment 5 (Context Handling)
This experiment revealed the starkest differences.
Claude Code: Good but Verbose (12 lines)
```python
def get_full_name(self) -> str:
    """Return the user's full name."""
    return f"{self.first_name} {self.last_name}"
```
It also included placement guidance showing the surrounding methods, plus minor over-engineering (an unrequested docstring).
Gemini CLI: Ideal Response (4 lines)
```python
def get_full_name(self) -> str:
    return f"{self.first_name} {self.last_name}"
```
Just the method. Correct indentation. Done.
Codex CLI: No Code Output (0 lines)
CRITICAL FINDING: Codex did NOT output the method.
Instead, it:
- Ran `rg -n "class User" .`
- Found `test-module.py` in the repo
- Applied a patch directly to the file
- Responded: "Added `get_full_name()` to `User` in `test-module.py`"
This is a fundamental behavioral difference: Codex CLI defaults to agentic file modification, not conversational code output.
Friction Event Summary
| Tool | Exp 4 Friction | Exp 5 Friction | Total |
|---|---|---|---|
| Claude Code | 0 | 1 | 1 |
| Gemini CLI | 2 | 0 | 2 |
| Codex CLI | 3 | 3 | 6 |
Analysis: Three Interaction Models
These tools aren't just different in code quality—they represent fundamentally different interaction paradigms:
Claude Code: Conversational Assistant
- Maintains context across prompts
- Outputs code for human review
- Adds helpful but sometimes unrequested enhancements
- Best for: Back-and-forth iteration
Gemini CLI: Minimal Oracle
- Maintains context
- Outputs precise, minimal responses
- Tool registry errors visible but ignorable
- Best for: "Show me exactly what I asked for"
Codex CLI (exec mode): Agentic File Modifier
- No session memory between calls
- Searches repo and modifies files directly
- Assumes you want changes applied, not shown
- Best for: Automated code modifications
The Papercut Taxonomy
Claude Code Papercuts
- Minor Over-Engineering: Adds docstrings, extra fields not requested
- Verbose Context: Shows surrounding methods (helpful but adds output)
Codex CLI Papercuts
- No Session Memory: Each `exec` call is isolated
- Agentic Default: Modifies files instead of outputting code
- Repo Search: Searches for matching code in repo
- Requires Explicit Instructions: Must say "just output the code"
Gemini CLI Papercuts
- Tool Registry Errors: Internal errors displayed to user
- Rate Limiting: 429 errors during high usage
Recommendations
For Claude Code Users
- Expect slightly verbose but correct responses
- Minor cleanup may be needed (removing extra docstrings)
- Best choice for conversational, iterative development
For Codex CLI Users
- Use interactive mode for conversational workflows
- In `exec` mode, explicitly request "output only, don't modify files"
- Be aware it will search and modify repo files by default
- Best choice for batch file modifications
For Gemini CLI Users
- Ignore tool registry errors (cosmetic issue)
- Expect minimal, precise output
- May encounter rate limiting during heavy use
- Best choice for precise, minimal code generation
Conclusion
The winner depends on your workflow:
| Workflow | Best Tool |
|---|---|
| Iterative conversation | Claude Code |
| Minimal, precise output | Gemini CLI |
| Automated file changes | Codex CLI |
The most important finding isn't which tool is "best"—it's that these tools have fundamentally different mental models. Codex CLI's agentic behavior isn't a bug; it's a design choice. But if you expect conversational assistance and get file modifications instead, that's a significant papercut.
Choose your tool based on your workflow, not benchmarks.
Experiment Repository
Full session transcripts, prompts, and metrics available at: github.com/nsameerd/ai-coding-papercuts-experiment
This is Article 1 of a three-part series on AI coding tool papercuts. Coming next: The Idiomatic Code Audit (testing modern syntax, security defaults, and library versions).