
The Flow-State Experiment: Measuring Papercuts in AI Coding Tools

A systematic comparison of Claude Code, Codex CLI, and Gemini CLI measuring the small friction points that disrupt developer flow. Our experiments reveal fundamental differences in how these tools handle mid-task corrections and context.

February 5, 2026 · 12 min read · By Mathematicon


Inspired by Geoffrey Huntley's papercuts framework—the idea that developer tool quality is measured not by catastrophic failures, but by cumulative micro-frictions that drain productivity and joy.


The Philosophy

When evaluating AI coding tools, we often focus on the wrong metrics: benchmark scores, parameter counts, or "vibe check" impressions. But the real measure of a tool's quality is how often it makes you stop, re-read, correct, or re-prompt—the small papercuts that accumulate into flow-state destruction.

This experiment systematically measures these friction points across three leading AI coding CLI tools:

| Tool | Model | Access Method |
|---|---|---|
| Claude Code | claude-sonnet-4-20250514 | claude CLI |
| Codex CLI | gpt-5.2-codex | codex exec |
| Gemini CLI | gemini-2.5-pro | gemini CLI |

Experiment Design

We designed two experiments targeting the most common sources of flow disruption:

Experiment 4: Mid-Stream Correction

The scenario: You ask for a Flask endpoint, then immediately request a modification.

Why this matters: Developers constantly iterate on generated code. The tool's ability to understand "now change X" without losing context is critical.

Prompt 4.1: Write a Flask endpoint /api/users that returns a JSON list of all users.
           Use a simple in-memory list as the data source.

Prompt 4.2: Now add query parameter filtering so I can filter by active=true or active=false.

Experiment 5: Context Window Handling

The scenario: You provide a 100-line module and ask to add a single method.

Why this matters: The ideal response outputs just the new method. Re-outputting 100 lines forces the developer to scroll, compare, and manually integrate—pure friction.

Prompt: [100-line User module]
        In the User class, add a new method get_full_name() that returns
        f"{self.first_name} {self.last_name}".
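For reference, the test module followed roughly this shape. This is a condensed, hypothetical sketch; the actual module ran to about 100 lines of fields and methods, and the class details here are our reconstruction.

```python
# Condensed sketch of the kind of User module used in Experiment 5
# (hypothetical; the real test module was ~100 lines).
class User:
    def __init__(self, first_name: str, last_name: str, active: bool = True):
        self.first_name = first_name
        self.last_name = last_name
        self.active = active

    # The single method each tool was asked to add:
    def get_full_name(self) -> str:
        return f"{self.first_name} {self.last_name}"
```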

Results: Experiment 4 (Mid-Stream Correction)

All three tools produced correct code for both prompts. The differences emerged in how they handled the task.

Claude Code Response

Prompt 4.1:

from flask import Flask, jsonify

app = Flask(__name__)

users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
    {"id": 3, "name": "Charlie", "email": "charlie@example.com"},
]

@app.route('/api/users', methods=['GET'])
def get_users():
    return jsonify(users)

Prompt 4.2: Correctly added request import and filtering logic. Preserved the email field from 4.1.

Friction events: 0
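For context, here is a plausible shape of the combined 4.2 result. This is our reconstruction, not Claude's verbatim output; in particular, the active values are assumed, since the 4.1 data had no active field.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Reconstruction of the 4.2 state: email preserved from 4.1,
# active field added to support filtering (values are our assumption).
users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com", "active": True},
    {"id": 2, "name": "Bob", "email": "bob@example.com", "active": False},
    {"id": 3, "name": "Charlie", "email": "charlie@example.com", "active": True},
]

@app.route('/api/users', methods=['GET'])
def get_users():
    active = request.args.get('active')  # "true", "false", or None
    if active is not None:
        want = active.lower() == 'true'
        return jsonify([u for u in users if u['active'] == want])
    return jsonify(users)
```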


Codex CLI Response

Prompt 4.1:

from flask import Flask, jsonify

app = Flask(__name__)

USERS = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Charlie"},
]

@app.get("/api/users")
def get_users():
    return jsonify(USERS)

Prompt 4.2: Here's where things got interesting.

CRITICAL FINDING: Codex CLI in exec mode has no session memory.

When given Prompt 4.2, Codex:

  1. Did NOT remember the 4.1 response
  2. Searched the repository for Flask code
  3. Found our experiment session files
  4. Modified those files directly

Instead of outputting code, it ran:

rg -n "Flask" .

Then applied a patch to our experiment documentation.

Friction events: 3 (context loss, repo search, unintended file modification)


Gemini CLI Response

Prompt 4.1:

from flask import Flask, jsonify

app = Flask(__name__)

users = [
    {"id": 1, "name": "Alice", "active": True},
    {"id": 2, "name": "Bob", "active": False},
    {"id": 3, "name": "Charlie", "active": True},
]

@app.route('/api/users', methods=['GET'])
def get_users():
    return jsonify(users)

Notable: Gemini anticipated the active field in 4.1, making 4.2 a smaller change.

Prompt 4.2: Correctly added filtering. Some tool registry errors displayed but didn't affect output.

Friction events: 2 (cosmetic tool errors, rate limiting)


Results: Experiment 5 (Context Handling)

This experiment revealed the starkest differences.

Claude Code: Good but Verbose (12 lines)

def get_full_name(self) -> str:
    """Return the user's full name."""
    return f"{self.first_name} {self.last_name}"

Claude also showed the surrounding methods as placement guidance, which is helpful but adds output to scan. Minor over-engineering: it added a docstring that wasn't requested.

Gemini CLI: Ideal Response (4 lines)

    def get_full_name(self) -> str:
        return f"{self.first_name} {self.last_name}"

Just the method. Correct indentation. Done.

Codex CLI: No Code Output (0 lines)

CRITICAL FINDING: Codex did NOT output the method.

Instead, it:

  1. Ran rg -n "class User" .
  2. Found test-module.py in the repo
  3. Applied a patch directly to the file
  4. Responded: "Added get_full_name() to User in test-module.py"

This is a fundamental behavioral difference: Codex CLI defaults to agentic file modification, not conversational code output.


Friction Event Summary

| Tool | Exp 4 Friction | Exp 5 Friction | Total |
|---|---|---|---|
| Claude Code | 0 | 1 | 1 |
| Gemini CLI | 2 | 0 | 2 |
| Codex CLI | 3 | 3 | 6 |

Analysis: Three Interaction Models

These tools aren't just different in code quality—they represent fundamentally different interaction paradigms:

Claude Code: Conversational Assistant

  • Maintains context across prompts
  • Outputs code for human review
  • Adds helpful but sometimes unrequested enhancements
  • Best for: Back-and-forth iteration

Gemini CLI: Minimal Oracle

  • Maintains context
  • Outputs precise, minimal responses
  • Tool registry errors visible but ignorable
  • Best for: "Show me exactly what I asked for"

Codex CLI (exec mode): Agentic File Modifier

  • No session memory between calls
  • Searches repo and modifies files directly
  • Assumes you want changes applied, not shown
  • Best for: Automated code modifications

The Papercut Taxonomy

Claude Code Papercuts

  1. Minor Over-Engineering: Adds docstrings, extra fields not requested
  2. Verbose Context: Shows surrounding methods (helpful but adds output)

Codex CLI Papercuts

  1. No Session Memory: Each exec call is isolated
  2. Agentic Default: Modifies files instead of outputting code
  3. Repo Search: Searches for matching code in repo
  4. Requires Explicit Instructions: Must say "just output the code"

Gemini CLI Papercuts

  1. Tool Registry Errors: Internal errors displayed to user
  2. Rate Limiting: 429 errors during high usage

Recommendations

For Claude Code Users

  • Expect slightly verbose but correct responses
  • Minor cleanup may be needed (removing extra docstrings)
  • Best choice for conversational, iterative development

For Codex CLI Users

  • Use interactive mode for conversational workflows
  • In exec mode, explicitly request "output only, don't modify files"
  • Be aware it will search and modify repo files by default
  • Best choice for batch file modifications

For Gemini CLI Users

  • Ignore tool registry errors (cosmetic issue)
  • Expect minimal, precise output
  • May encounter rate limiting during heavy use
  • Best choice for precise, minimal code generation

Conclusion

The winner depends on your workflow:

| Workflow | Best Tool |
|---|---|
| Iterative conversation | Claude Code |
| Minimal, precise output | Gemini CLI |
| Automated file changes | Codex CLI |

The most important finding isn't which tool is "best"—it's that these tools have fundamentally different mental models. Codex CLI's agentic behavior isn't a bug; it's a design choice. But if you expect conversational assistance and get file modifications instead, that's a significant papercut.

Choose your tool based on your workflow, not benchmarks.


Experiment Repository

Full session transcripts, prompts, and metrics available at: github.com/nsameerd/ai-coding-papercuts-experiment


This is Article 1 of a three-part series on AI coding tool papercuts. Coming next: The Idiomatic Code Audit (testing modern syntax, security defaults, and library versions).
