The Flow-State Experiment: Measuring Papercuts in AI Coding Tools
Inspired by Geoffrey Huntley's papercuts framework—the idea that developer tool quality is measured not by catastrophic failures, but by cumulative micro-frictions that drain productivity and joy.
The Philosophy
When evaluating AI coding tools, we often focus on the wrong metrics: benchmark scores, parameter counts, or "vibe check" impressions. But the real measure of a tool's quality is how often it makes you stop, re-read, correct, or re-prompt—the small papercuts that accumulate into flow-state destruction.
This experiment systematically measures these friction points across three leading AI coding CLI tools:
| Tool | Model | Access Method |
|---|---|---|
| Claude Code | claude-sonnet-4-20250514 | claude CLI |
| Codex CLI | gpt-5.2-codex | codex exec |
| Gemini CLI | gemini-2.5-pro | gemini CLI |
Experiment Design
We designed two experiments targeting the most common sources of flow disruption:
Experiment 4: Mid-Stream Correction
The scenario: You ask for a Flask endpoint, then immediately request a modification.
Why this matters: Developers constantly iterate on generated code. The tool's ability to understand "now change X" without losing context is critical.
Prompt 4.1: Write a Flask endpoint /api/users that returns a JSON list of all users.
Use a simple in-memory list as the data source.
Prompt 4.2: Now add query parameter filtering so I can filter by active=true or active=false.
Experiment 5: Context Window Handling
The scenario: You provide a 100-line module and ask to add a single method.
Why this matters: The ideal response outputs just the new method. Re-outputting 100 lines forces the developer to scroll, compare, and manually integrate—pure friction.
Prompt: [100-line User module]
In the User class, add a new method get_full_name() that returns
f"{self.first_name} {self.last_name}".
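For orientation, here is a minimal, hypothetical stand-in for the elided 100-line module (field names follow the f-string in the prompt), with the requested method included:

```python
class User:
    """Minimal hypothetical stand-in for the experiment's 100-line User module."""

    def __init__(self, first_name: str, last_name: str):
        self.first_name = first_name
        self.last_name = last_name

    # The single method each tool was asked to add:
    def get_full_name(self) -> str:
        return f"{self.first_name} {self.last_name}"
```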
Results: Experiment 4 (Mid-Stream Correction)
All three tools produced correct code for both prompts. The differences emerged in how they handled the task.
Claude Code Response
Prompt 4.1:
```python
from flask import Flask, jsonify

app = Flask(__name__)

users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
    {"id": 3, "name": "Charlie", "email": "charlie@example.com"},
]

@app.route('/api/users', methods=['GET'])
def get_users():
    return jsonify(users)
```
Prompt 4.2: Correctly added request import and filtering logic. Preserved the email field from 4.1.
Friction events: 0
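The combined 4.2 result would look roughly like the sketch below. This is a hypothetical reconstruction of the described behavior (filtering added, `email` field preserved), not the tool's verbatim output:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com", "active": True},
    {"id": 2, "name": "Bob", "email": "bob@example.com", "active": False},
    {"id": 3, "name": "Charlie", "email": "charlie@example.com", "active": True},
]

@app.route('/api/users', methods=['GET'])
def get_users():
    # ?active=true / ?active=false narrows the list; any other value returns all users
    active = request.args.get('active')
    if active in ('true', 'false'):
        wanted = active == 'true'
        return jsonify([u for u in users if u['active'] == wanted])
    return jsonify(users)
```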
Codex CLI Response
Prompt 4.1:
```python
from flask import Flask, jsonify

app = Flask(__name__)

USERS = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Charlie"},
]

@app.get("/api/users")
def get_users():
    return jsonify(USERS)
```
Prompt 4.2: Here's where things got interesting.
CRITICAL FINDING: Codex CLI in `exec` mode has no session memory.
When given Prompt 4.2, Codex:
- Did NOT remember the 4.1 response
- Searched the repository for Flask code
- Found our experiment session files
- Modified those files directly
Instead of outputting code, it ran:

```shell
rg -n "Flask" .
```

Then it applied a patch to our experiment documentation.
Friction events: 3 (context loss, repo search, unintended file modification)
Gemini CLI Response
Prompt 4.1:
```python
from flask import Flask, jsonify

app = Flask(__name__)

users = [
    {"id": 1, "name": "Alice", "active": True},
    {"id": 2, "name": "Bob", "active": False},
    {"id": 3, "name": "Charlie", "active": True},
]

@app.route('/api/users', methods=['GET'])
def get_users():
    return jsonify(users)
```
Notable: Gemini anticipated the active field in 4.1, making 4.2 a smaller change.
Prompt 4.2: Correctly added filtering. Some tool registry errors displayed but didn't affect output.
Friction events: 2 (cosmetic tool errors, rate limiting)
Results: Experiment 5 (Context Handling)
This experiment revealed the starkest differences.
Claude Code: Good but Verbose (12 lines)
```python
def get_full_name(self) -> str:
    """Return the user's full name."""
    return f"{self.first_name} {self.last_name}"
```
It also included placement guidance showing the surrounding methods, plus minor over-engineering (an unrequested docstring).
Gemini CLI: Ideal Response (4 lines)
```python
def get_full_name(self) -> str:
    return f"{self.first_name} {self.last_name}"
```
Just the method. Correct indentation. Done.
Codex CLI: No Code Output (0 lines)
CRITICAL FINDING: Codex did NOT output the method.
Instead, it:
- Ran `rg -n "class User" .`
- Found `test-module.py` in the repo
- Applied a patch directly to the file
- Responded: "Added `get_full_name()` to `User` in `test-module.py`"
This is a fundamental behavioral difference: Codex CLI defaults to agentic file modification, not conversational code output.
Friction Event Summary
| Tool | Exp 4 Friction | Exp 5 Friction | Total |
|---|---|---|---|
| Claude Code | 0 | 1 | 1 |
| Gemini CLI | 2 | 0 | 2 |
| Codex CLI | 3 | 3 | 6 |
Analysis: Three Interaction Models
These tools aren't just different in code quality—they represent fundamentally different interaction paradigms:
Claude Code: Conversational Assistant
- Maintains context across prompts
- Outputs code for human review
- Adds helpful but sometimes unrequested enhancements
- Best for: Back-and-forth iteration
Gemini CLI: Minimal Oracle
- Maintains context
- Outputs precise, minimal responses
- Tool registry errors visible but ignorable
- Best for: "Show me exactly what I asked for"
Codex CLI (exec mode): Agentic File Modifier
- No session memory between calls
- Searches repo and modifies files directly
- Assumes you want changes applied, not shown
- Best for: Automated code modifications
The Papercut Taxonomy
Claude Code Papercuts
- Minor Over-Engineering: Adds docstrings, extra fields not requested
- Verbose Context: Shows surrounding methods (helpful but adds output)
Codex CLI Papercuts
- No Session Memory: Each `exec` call is isolated
- Agentic Default: Modifies files instead of outputting code
- Repo Search: Searches for matching code in repo
- Requires Explicit Instructions: Must say "just output the code"
Gemini CLI Papercuts
- Tool Registry Errors: Internal errors displayed to user
- Rate Limiting: 429 errors during high usage
Recommendations
For Claude Code Users
- Expect slightly verbose but correct responses
- Minor cleanup may be needed (removing extra docstrings)
- Best choice for conversational, iterative development
For Codex CLI Users
- Use interactive mode for conversational workflows
- In `exec` mode, explicitly request "output only, don't modify files"
- Be aware it will search and modify repo files by default
- Best choice for batch file modifications
For Gemini CLI Users
- Ignore tool registry errors (cosmetic issue)
- Expect minimal, precise output
- May encounter rate limiting during heavy use
- Best choice for precise, minimal code generation
Conclusion
The winner depends on your workflow:
| Workflow | Best Tool |
|---|---|
| Iterative conversation | Claude Code |
| Minimal, precise output | Gemini CLI |
| Automated file changes | Codex CLI |
The most important finding isn't which tool is "best"—it's that these tools have fundamentally different mental models. Codex CLI's agentic behavior isn't a bug; it's a design choice. But if you expect conversational assistance and get file modifications instead, that's a significant papercut.
Choose your tool based on your workflow, not benchmarks.
Experiment Repository
Full session transcripts, prompts, and metrics available at: github.com/nsameerd/ai-coding-papercuts-experiment
This is Article 1 of a three-part series on AI coding tool papercuts. Coming next: The Idiomatic Code Audit (testing modern syntax, security defaults, and library versions).