Can You Teach an AI How to Think? Testing Elaborate Reasoning Protocols on Claude Code

TL;DR

We tested whether you can improve Claude Code's problem-solving by giving it an elaborate reasoning protocol: a 365-line XML document specifying a 10-phase methodology that mimics how expert competitive programmers think. Across 300 runs (150 novel CodeELO problems, each attempted under both prompts), both conditions achieved identical 64% pass rates. The protocol cost 69% more tokens and 65% more latency with zero accuracy benefit. But the why is more interesting than the what.

A Different Question Than Format

Our previous experiment asked: Does XML formatting help LLMs reason better? The answer was no—wrapping the same information in <tags> versus plain text made no difference.

This experiment asks something deeper: Can you teach an LLM how to think by providing an explicit reasoning methodology?

This isn't about formatting. It's about whether giving a model a detailed cognitive protocol—the kind of structured approach that helps human experts—translates to better AI performance.

The Hypothesis

Expert competitive programmers follow systematic approaches:

Restate the problem to ensure understanding
Analyze constraints to determine viable complexity
Enumerate possible algorithms
Select and justify the optimal approach
Prove correctness before coding
Walk through examples
Identify edge cases
Verify complexity bounds
Implement carefully
Reflect on learnings

What if we encoded this entire methodology into a prompt? Would Claude Code perform better by following an expert's cognitive process rather than just "winging it"?

The Protocol

We created a 365-line XML document that enforces structured reasoning:

<CompetitiveProgrammingAssistant>
  <Configuration>
    <Mode>Strict-Analysis-First</Mode>
    <ExperienceLevel>ICPC-World-Finalist</ExperienceLevel>
  </Configuration>

  <Directives>
    <Directive id="1" priority="critical">
      <Description>No premature coding allowed</Description>
      <Enforcement>Terminate response if code appears before step 7</Enforcement>
    </Directive>
  </Directives>

  <ProblemSolvingProtocol>
    <Phase name="Understanding">
      <Step number="1">
        <Title>Problem Restatement</Title>
        <RequiredElements>
          <Element>Concise summary in one sentence</Element>
          <Element>Input format specification</Element>
          <Element>Output format specification</Element>
        </RequiredElements>
      </Step>
      <!-- ... 9 more detailed phases ... -->
    </Phase>
  </ProblemSolvingProtocol>
</CompetitiveProgrammingAssistant>

The full protocol specifies:

Constraints analysis tables with complexity implications
Algorithm enumeration across categories (brute force, greedy, DP, graphs)
Selection criteria with explicit justification requirements
Proof sketches before implementation
Edge case taxonomies (minimal, maximal, pathological, boundary)
Complexity verification against limits
Post-mortem analysis for learning

This is the kind of elaborate "chain of thought" prompting that many practitioners believe should dramatically improve reasoning.

The Control: Radical Simplicity

The baseline prompt was deliberately minimal:

Solve this competitive programming problem.
Read from stdin, write to stdout.
Use Python.

[problem description]

No methodology. No structure. No guidance on how to think. Just the task.

Why This Tests Something New

Agentic Execution

Unlike API calls that generate a single response, Claude Code operates agentically:

Analyzes the problem
Writes initial code
Runs tests
Observes failures
Debugs and iterates
Repeats until success or timeout

This feedback loop means the model isn't just generating—it's iterating. The question becomes: does upfront structured analysis help, or does trial-and-error iteration make it redundant?
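The loop described above can be sketched in a few lines. This is a minimal illustration, not Claude Code's actual implementation: `propose` is a hypothetical stand-in for the model writing or fixing code, and the test format (stdin text paired with expected stdout) and the attempt limit are assumptions.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Tuple

# Assumed test format: (stdin_text, expected_stdout).
Test = Tuple[str, str]


def run_tests(code: str, tests: List[Test], timeout_s: float = 5.0) -> Optional[str]:
    """Run `code` on each test; return None if all pass, else a failure message."""
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_text, capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return f"timeout on input {stdin_text!r}"
        if proc.returncode != 0:
            return f"crash on input {stdin_text!r}: {proc.stderr.strip()}"
        if proc.stdout.strip() != expected.strip():
            return f"input {stdin_text!r}: expected {expected!r}, got {proc.stdout!r}"
    return None


def solve_with_feedback(propose: Callable[[Optional[str]], str],
                        tests: List[Test], max_attempts: int = 5) -> Tuple[bool, int]:
    """Ask `propose` for code, test it, and feed the failure back until it passes."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = propose(feedback)           # hypothetical: model writes or fixes code
        feedback = run_tests(code, tests)  # observe failures, if any
        if feedback is None:
            return True, attempt
    return False, max_attempts


# Demo with a scripted "model": the first attempt is wrong, the second is right.
scripted = iter(["print(int(input()) + 1)", "print(int(input()) * 2)"])
print(solve_with_feedback(lambda fb: next(scripted), [("3\n", "6\n")]))
```

Note that nothing in this loop requires upfront analysis: a wrong first attempt costs one test run and returns a concrete failure message to work from.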

Novel Problems

We used CodeELO, a dataset of 408 Codeforces problems. We selected 150 problems with rating >= 1200 (medium to hard difficulty). These are genuinely challenging algorithmic problems—dynamic programming, graph algorithms, number theory—not toy examples.
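A selection step like this might look as follows. The post does not say how the 150 problems were drawn from the eligible set or where the medium/hard boundary sits, so the seeded random sample and the `rating` field name here are assumptions made only for illustration.

```python
import random
from typing import Dict, List


def select_problems(problems: List[Dict], min_rating: int = 1200,
                    sample_size: int = 150, seed: int = 0) -> List[Dict]:
    """Keep problems rated >= min_rating, then draw a reproducible sample."""
    # "rating" is a hypothetical field name for the Codeforces difficulty rating.
    eligible = [p for p in problems if p.get("rating", 0) >= min_rating]
    if len(eligible) <= sample_size:
        return eligible
    # A fixed seed makes the same subset come back on every run.
    return random.Random(seed).sample(eligible, sample_size)
```

Fixing the seed matters for a paired comparison: both prompts must see exactly the same problems.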

Results

The Headline

Condition	Pass Rate	Avg Tokens	Avg Latency
Simple Prompt	64.0%	1,017	40.2s
Elaborate Protocol	64.0%	1,718	66.4s

Identical accuracy. The elaborate reasoning protocol provided zero improvement.

The Cost

Metric	Simple	Protocol	Overhead
Tokens	1,017	1,718	+69%
Latency	40.2s	66.4s	+65%

The protocol forced Claude Code to generate extensive analysis, justifications, and proofs—all of which consumed tokens and time without improving outcomes.
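The overhead figures follow directly from the averages in the table. A quick check of the arithmetic:

```python
def overhead_pct(baseline: float, treatment: float) -> float:
    """Relative increase of treatment over baseline, in percent."""
    return (treatment - baseline) / baseline * 100.0


token_overhead = overhead_pct(1017, 1718)    # average tokens per problem
latency_overhead = overhead_pct(40.2, 66.4)  # average seconds per problem

print(f"tokens:  +{token_overhead:.0f}%")    # tokens:  +69%
print(f"latency: +{latency_overhead:.0f}%")  # latency: +65%
```

In absolute terms that is about 700 extra tokens and 26 extra seconds per problem.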

By Difficulty

Difficulty (n)	Simple	Protocol	Delta (pp)
Medium (98)	69.4%	66.3%	-3.1
Hard (52)	53.8%	59.6%	+5.8

Interesting pattern: the protocol did slightly worse on medium problems and slightly better on hard ones. But each gap amounts to only three problems, neither difference is statistically significant, and the overall comparison shows no difference at all (p = 1.0).
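As a sanity check, the per-difficulty percentages can be converted back into problem counts. The sketch below assumes the reported percentages are rounded from whole-problem counts. It confirms that the two strata add up to the headline 64% for both conditions.

```python
# (number of problems, simple pass %, protocol pass %) as reported in the table.
reported = {"medium": (98, 69.4, 66.3), "hard": (52, 53.8, 59.6)}


def to_count(n: int, pct: float) -> int:
    """Recover a whole-problem count from a rounded percentage."""
    return round(n * pct / 100.0)


total = sum(n for n, _, _ in reported.values())
simple_total = sum(to_count(n, s) for n, s, _ in reported.values())
protocol_total = sum(to_count(n, p) for n, _, p in reported.values())

for name, (n, s, p) in reported.items():
    print(name, to_count(n, s), "vs", to_count(n, p), "of", n)
print(f"overall: {simple_total}/{total} vs {protocol_total}/{total}")
```

The medium gap is 68 versus 65 problems and the hard gap is 28 versus 31, so the two deltas cancel exactly.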

Head-to-Head

Outcome	Count
Both passed	86
Both failed	44
Only simple passed	10
Only protocol passed	10

Perfect symmetry. Each prompt solved 10 problems the other missed, so the protocol didn't systematically help; it just changed which problems landed on the passing side.
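For paired pass/fail outcomes like these, one standard choice is an exact McNemar test, which looks only at the discordant pairs (10 and 10 here). The post does not name the test behind its p-value, so treat this as one plausible way to reproduce it rather than the author's method.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null, each discordant pair favours either condition with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p = mcnemar_exact(10, 10)  # only-simple-passed vs only-protocol-passed
print(p)  # 1.0
```

The 86 both-passed and 44 both-failed problems do not enter the test at all, because they carry no information about which prompt is better.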

Why Doesn't Teaching "How to Think" Help?

Hypothesis 1: The Model Already Knows

Modern instruction-tuned LLMs have internalized effective problem-solving strategies through training on millions of examples. When Claude Code encounters a competitive programming problem, it already:

Considers the constraints
Thinks about multiple approaches
Handles edge cases
Iterates on failures

The explicit protocol may simply be redundant—formalizing what the model does implicitly.

Hypothesis 2: Iteration Trumps Planning

In agentic execution, Claude Code can:

Make a quick attempt
Run tests
See what fails
Fix specific issues
Repeat

This feedback loop may be more effective than extensive upfront analysis. Learning from failures (test-driven iteration) might beat preventing failures (exhaustive planning).

Consider: a quick hypothesis tested against reality provides more signal than a careful hypothesis developed in isolation.

Hypothesis 3: The Bottleneck Is Knowledge, Not Process

For the 44 problems that both conditions failed, the likely causes are:

The model didn't know the required algorithm
Or couldn't implement it correctly
Or missed subtle edge cases

No cognitive protocol can conjure algorithmic knowledge that isn't there. You can't reason your way to a segment tree if you've never learned what a segment tree is.
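The head-to-head counts offer a rough way to probe this. If failures under the two prompts were independent, the number of problems failed by both would be much lower than 44. The sketch below computes that comparison. It does not show why the problems failed, only that failures concentrate on the same problems, which is consistent with a capability limit.

```python
n = 150
# A condition fails on the both-failed problems plus those only the other one passed.
fail_simple = (44 + 10) / n
fail_protocol = (44 + 10) / n

# Expected overlap if the two sets of failures were statistically independent.
expected_both_fail = fail_simple * fail_protocol * n
observed_both_fail = 44

print(f"expected if independent: {expected_both_fail:.1f}")
print(f"observed:                {observed_both_fail}")
```

Independence would predict roughly 19 shared failures, so the observed 44 is more than double.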

Hypothesis 4: Verbose Reasoning Introduces Noise

The protocol forced generation of:

Detailed constraint tables
Algorithm enumerations
Proof sketches
Edge case taxonomies

Each section is an opportunity for the model to:

Make reasoning errors
Commit to wrong approaches
Lose focus on the actual problem

More tokens ≠ better reasoning. Sometimes verbose analysis just means more ways to go wrong.

What This Means for Agentic AI

The "Teach It How to Think" Fallacy

A common belief: if you want better AI outputs, provide detailed instructions on how to approach problems. Our data suggests this is wrong for agentic systems.

Claude Code doesn't need a 10-phase methodology. It needs:

Clear problem specification: What exactly should be solved?
Test cases: How will correctness be verified?
Iteration capability: Can it see and learn from failures?

When Protocols Might Help

This doesn't mean structured reasoning is always useless. It might help when:

No iteration possible: Single-shot API calls can't learn from mistakes
Human collaboration: Explicit reasoning helps humans verify AI work
Auditability: Regulated domains require documented decision processes
Teaching: Showing reasoning helps users learn

But for autonomous code generation with test feedback? Let the model iterate.

The Agentic Advantage

Our 64% pass rate, while not perfect, shows how far agentic iteration gets with nothing more than a three-line prompt. The model:

Tries approaches
Gets feedback
Adapts

This loop may be more valuable than any prompt engineering. Build better feedback systems, not better prompts.

Comparison with Other Experiments

Experiment	Question	Finding
XML vs Natural (LiveCodeBench)	Does format matter?	No
Few-Shot STEM	Do examples help?	Minimal
This experiment	Does methodology help?	No

A pattern emerges: on difficult novel problems, prompt engineering has diminishing returns. The bottleneck is model capability, not prompt sophistication.

Practical Recommendations

1. Keep Prompts Simple

For agentic coding tasks:

Solve this problem.
[problem description]

That's often enough. Save the elaborate protocols for documentation.

2. Invest in Feedback Loops

Instead of teaching the AI how to think, ensure it can:

Run its code
See test results
Observe failures
Iterate

Good feedback beats good instructions.

3. Focus on Problem Clarity

Ambiguous problem descriptions cause more failures than missing methodology. Spend your effort on:

Precise specifications
Representative test cases
Clear success criteria

4. Accept the Reasoning Ceiling

With a 36% failure rate on competitive programming, the limitation isn't process but capability. For mission-critical code, rely on:

Human review
Multiple attempts
Conservative validation

No prompt makes an AI smarter than it is.

Methodology

Dataset

CodeELO: 150 problems, rating >= 1200
Difficulty: 98 medium, 52 hard
Novel: Post-training-cutoff Codeforces problems

Execution

claude --dangerously-skip-permissions -p "$PROMPT" --model claude-sonnet-4-20250514
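To drive that command from a harness and record latency, a wrapper along these lines would work. It uses only the flags shown above. The timeout value and the function names are assumptions, and how token counts were collected is not described in the post, so that part is left out.

```python
import subprocess
import time
from typing import List, Tuple

MODEL = "claude-sonnet-4-20250514"


def build_command(prompt: str, model: str = MODEL) -> List[str]:
    """Argument list equivalent to the shell command above."""
    return ["claude", "--dangerously-skip-permissions", "-p", prompt, "--model", model]


def run_once(prompt: str, timeout_s: float = 600.0) -> Tuple[str, float]:
    """Run one problem and return (stdout, wall-clock seconds)."""
    # timeout_s is an assumed value; the post does not state the limit it used.
    start = time.perf_counter()
    proc = subprocess.run(build_command(prompt), capture_output=True,
                          text=True, timeout=timeout_s)
    return proc.stdout, time.perf_counter() - start
```

Passing the arguments as a list avoids shell quoting problems, which matters when the prompt is a 365-line XML document full of quotes and angle brackets.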

Metrics

Pass rate on public test cases
Token usage
Latency

Reproducibility

Full code and results: github.com/nsameerd/claude-code-xml-prompting

Limitations

Single model: Only Claude Sonnet 4 via Claude Code
Specific domain: Competitive programming only
Single run: No multi-attempt voting
Public tests: Private test cases might differ

Conclusion

After 300 experiments, we found that elaborate reasoning protocols don't improve agentic AI performance on novel problems.

A 365-line XML methodology teaching Claude Code to think like an ICPC World Finalist achieved exactly the same 64% pass rate as a three-line "solve this" prompt. The only difference: 69% more tokens, 65% more latency.

The implication: you can't teach an AI how to think through prompting. Modern LLMs have internalized effective reasoning strategies. What they lack isn't methodology—it's knowledge and capability. No protocol can fill that gap.

For agentic AI development, focus on:

Clear problem specifications
Rich feedback loops
Robust test infrastructure

Let the model iterate its way to solutions. That's what it's good at.