Can You Teach an AI How to Think? Testing Elaborate Reasoning Protocols on Claude Code

We gave Claude Code a 365-line protocol teaching it to think like an ICPC World Finalist. After 300 experiments, the elaborate methodology provided zero improvement over a simple "solve this problem" prompt—but the reasons why reveal something important about agentic AI.

January 27, 2026 · 10 min read · By Mathematicon

TL;DR

We tested whether you can improve Claude Code's problem-solving by giving it an elaborate reasoning protocol: a 365-line XML document specifying a 10-phase methodology that mimics how expert competitive programmers think. Across 300 runs (150 novel CodeELO problems under each of two prompts), both conditions achieved identical 64% pass rates. The protocol added 69% more tokens and 65% more latency with zero accuracy benefit. But the why is more interesting than the what.


A Different Question Than Format

Our previous experiment asked: Does XML formatting help LLMs reason better? The answer was no—wrapping the same information in <tags> versus plain text made no difference.

This experiment asks something deeper: Can you teach an LLM how to think by providing an explicit reasoning methodology?

This isn't about formatting. It's about whether giving a model a detailed cognitive protocol—the kind of structured approach that helps human experts—translates to better AI performance.


The Hypothesis

Expert competitive programmers follow systematic approaches:

  1. Restate the problem to ensure understanding
  2. Analyze constraints to determine viable complexity
  3. Enumerate possible algorithms
  4. Select and justify the optimal approach
  5. Prove correctness before coding
  6. Walk through examples
  7. Identify edge cases
  8. Verify complexity bounds
  9. Implement carefully
  10. Reflect on learnings

What if we encoded this entire methodology into a prompt? Would Claude Code perform better by following an expert's cognitive process rather than just "winging it"?


The Protocol

We created a 365-line XML document that enforces structured reasoning:

<CompetitiveProgrammingAssistant>
  <Configuration>
    <Mode>Strict-Analysis-First</Mode>
    <ExperienceLevel>ICPC-World-Finalist</ExperienceLevel>
  </Configuration>

  <Directives>
    <Directive id="1" priority="critical">
      <Description>No premature coding allowed</Description>
      <Enforcement>Terminate response if code appears before step 7</Enforcement>
    </Directive>
  </Directives>

  <ProblemSolvingProtocol>
    <Phase name="Understanding">
      <Step number="1">
        <Title>Problem Restatement</Title>
        <RequiredElements>
          <Element>Concise summary in one sentence</Element>
          <Element>Input format specification</Element>
          <Element>Output format specification</Element>
        </RequiredElements>
      </Step>
      <!-- ... 9 more detailed phases ... -->
    </Phase>
  </ProblemSolvingProtocol>
</CompetitiveProgrammingAssistant>

The full protocol specifies:

  • Constraints analysis tables with complexity implications
  • Algorithm enumeration across categories (brute force, greedy, DP, graphs)
  • Selection criteria with explicit justification requirements
  • Proof sketches before implementation
  • Edge case taxonomies (minimal, maximal, pathological, boundary)
  • Complexity verification against limits
  • Post-mortem analysis for learning

This is the kind of elaborate "chain of thought" prompting that many practitioners believe should dramatically improve reasoning.


The Control: Radical Simplicity

The baseline prompt was deliberately minimal:

Solve this competitive programming problem.
Read from stdin, write to stdout.
Use Python.

[problem description]

No methodology. No structure. No guidance on how to think. Just the task.


Why This Tests Something New

Agentic Execution

Unlike API calls that generate a single response, Claude Code operates agentically:

  1. Analyzes the problem
  2. Writes initial code
  3. Runs tests
  4. Observes failures
  5. Debugs and iterates
  6. Repeats until success or timeout

This feedback loop means the model isn't just generating—it's iterating. The question becomes: does upfront structured analysis help, or does trial-and-error iteration make it redundant?
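The loop described above can be sketched as a small driver. This is only an illustration of the propose, test, observe, retry pattern. The names `solve_with_feedback`, `propose`, and `run_tests` are hypothetical, the round budget is an assumption, and nothing here reflects how Claude Code is actually implemented.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures for illustration only.
Propose = Callable[[str, List[str]], str]  # (problem, failures so far) -> candidate code
RunTests = Callable[[str], List[str]]      # candidate code -> failure messages


def solve_with_feedback(
    problem: str,
    propose: Propose,
    run_tests: RunTests,
    max_rounds: int = 5,  # assumed budget; the article does not state one
) -> Tuple[Optional[str], int]:
    """Iterate propose -> test -> observe until tests pass or the budget runs out.

    Returns (passing code or None, rounds used).
    """
    failures: List[str] = []
    for round_no in range(1, max_rounds + 1):
        candidate = propose(problem, failures)
        new_failures = run_tests(candidate)
        if not new_failures:
            return candidate, round_no
        failures.extend(new_failures)  # feed observations into the next attempt
    return None, max_rounds


# Toy stand-ins so the sketch runs end to end.
def toy_propose(problem: str, failures: List[str]) -> str:
    # First attempt has an off-by-one bug; it is corrected once a failure is seen.
    return "n + 1" if not failures else "n"


def toy_run_tests(candidate: str) -> List[str]:
    got = eval(candidate, {"n": 3})
    return [] if got == 3 else [f"expected 3, got {got}"]


code, rounds = solve_with_feedback("echo n", toy_propose, toy_run_tests)
print(code, rounds)
```

The design point is that `failures` carries concrete evidence from one round into the next, which is exactly what a single-shot response lacks.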

Novel Problems

We used CodeELO, a dataset of 408 Codeforces problems. We selected 150 problems with rating >= 1200 (medium to hard difficulty). These are genuinely challenging algorithmic problems—dynamic programming, graph algorithms, number theory—not toy examples.


Results

The Headline

Condition           Pass Rate  Avg Tokens  Avg Latency
Simple Prompt       64.0%      1,017       40.2s
Elaborate Protocol  64.0%      1,718       66.4s

Identical accuracy. The elaborate reasoning protocol provided zero improvement.

The Cost

Metric   Simple  Protocol  Overhead
Tokens   1,017   1,718     +69%
Latency  40.2s   66.4s     +65%

The protocol forced Claude Code to generate extensive analysis, justifications, and proofs—all of which consumed tokens and time without improving outcomes.
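The overhead figures are plain relative increases over the simple-prompt baseline. A minimal check using only the averages reported in the table (the helper name `overhead_pct` is ours):

```python
def overhead_pct(baseline: float, treatment: float) -> float:
    """Relative increase of treatment over baseline, in percent."""
    return (treatment - baseline) / baseline * 100.0


tokens = overhead_pct(1017, 1718)   # average tokens: simple vs. protocol
latency = overhead_pct(40.2, 66.4)  # average latency in seconds

print(f"tokens: +{tokens:.0f}%  latency: +{latency:.0f}%")
# prints: tokens: +69%  latency: +65%
```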

By Difficulty

Difficulty (problems)  Simple  Protocol  Delta
Medium (98)            69.4%   66.3%     -3.1 pp
Hard (52)              53.8%   59.6%     +5.8 pp

Interesting pattern: the protocol slightly hurt on medium problems (65 vs. 68 solved) and slightly helped on hard ones (31 vs. 28 solved). A swing of three problems in each direction is well within noise, and the overall comparison shows no difference at all (p = 1.0).
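The percentages in the difficulty table can be converted back to solved-problem counts, which makes the size of the effect easier to judge. This sketch assumes each reported percentage is a rounded value of solved / total for that stratum.

```python
# Percentages and stratum sizes as reported in the difficulty table.
strata = {
    "medium": {"n": 98, "simple": 69.4, "protocol": 66.3},
    "hard": {"n": 52, "simple": 53.8, "protocol": 59.6},
}


def to_count(pct: float, n: int) -> int:
    """Recover an integer count from a rounded percentage (assumption: pct = count / n)."""
    return round(pct / 100 * n)


counts = {
    name: {cond: to_count(row[cond], row["n"]) for cond in ("simple", "protocol")}
    for name, row in strata.items()
}
totals = {
    cond: sum(counts[name][cond] for name in counts) for cond in ("simple", "protocol")
}

print(counts)  # {'medium': {'simple': 68, 'protocol': 65}, 'hard': {'simple': 28, 'protocol': 31}}
print(totals)  # {'simple': 96, 'protocol': 96}
```

Both conditions total 96 of 150, which matches the 64% headline: the protocol lost three medium problems and gained three hard ones.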

Head-to-Head

Outcome               Count
Both passed           86
Both failed           44
Only simple passed    10
Only protocol passed  10

Perfect symmetry. Each condition solved 10 problems the other missed, so the protocol did not systematically help; it only changed which problems were solved.
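For paired pass/fail outcomes like these, one standard choice is an exact McNemar test, which looks only at the discordant pairs (10 vs. 10 here). The article does not say which test produced its p-value, so treat the choice of test as our assumption. The sketch below uses only the standard library.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant-pair counts b and c.

    Under the null hypothesis, each discordant pair is equally likely to
    favor either condition, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p = mcnemar_exact(10, 10)  # only simple passed = 10, only protocol passed = 10
print(p)  # 1.0
```

With a perfectly even split, no outcome could look less like an effect, which is consistent with the reported p = 1.0.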


Why Doesn't Teaching "How to Think" Help?

Hypothesis 1: The Model Already Knows

Modern instruction-tuned LLMs have internalized effective problem-solving strategies through training on millions of examples. When Claude Code encounters a competitive programming problem, it already:

  • Considers the constraints
  • Thinks about multiple approaches
  • Handles edge cases
  • Iterates on failures

The explicit protocol may simply be redundant—formalizing what the model does implicitly.

Hypothesis 2: Iteration Trumps Planning

In agentic execution, Claude Code can:

  1. Make a quick attempt
  2. Run tests
  3. See what fails
  4. Fix specific issues
  5. Repeat

This feedback loop may be more effective than extensive upfront analysis. Learning from failures (test-driven iteration) might beat preventing failures (exhaustive planning).

Consider: a quick hypothesis tested against reality provides more signal than a careful hypothesis developed in isolation.

Hypothesis 3: The Bottleneck Is Knowledge, Not Process

For the 44 problems that both conditions failed, the likely explanations are:

  • The model didn't know the required algorithm
  • Or couldn't implement it correctly
  • Or missed subtle edge cases

No cognitive protocol can conjure algorithmic knowledge that isn't there. You can't reason your way to a segment tree if you've never learned what a segment tree is.

Hypothesis 4: Verbose Reasoning Introduces Noise

The protocol forced generation of:

  • Detailed constraint tables
  • Algorithm enumerations
  • Proof sketches
  • Edge case taxonomies

Each section is an opportunity for the model to:

  • Make reasoning errors
  • Commit to wrong approaches
  • Lose focus on the actual problem

More tokens ≠ better reasoning. Sometimes verbose analysis just means more ways to go wrong.


What This Means for Agentic AI

The "Teach It How to Think" Fallacy

A common belief: if you want better AI outputs, provide detailed instructions on how to approach problems. Our data suggest this does not hold for agentic systems with test feedback, at least on this benchmark.

Claude Code doesn't need a 10-phase methodology. It needs:

  1. Clear problem specification: What exactly should be solved?
  2. Test cases: How will correctness be verified?
  3. Iteration capability: Can it see and learn from failures?

When Protocols Might Help

This doesn't mean structured reasoning is always useless. It might help when:

  • No iteration possible: Single-shot API calls can't learn from mistakes
  • Human collaboration: Explicit reasoning helps humans verify AI work
  • Auditability: Regulated domains require documented decision processes
  • Teaching: Showing reasoning helps users learn

But for autonomous code generation with test feedback? Let the model iterate.

The Agentic Advantage

Our 64% pass rate, while far from perfect, suggests that agentic iteration is powerful. The model:

  • Tries approaches
  • Gets feedback
  • Adapts

This loop may be more valuable than any prompt engineering. Build better feedback systems, not better prompts.


Comparison with Other Experiments

Experiment                      Question                Finding
XML vs Natural (LiveCodeBench)  Does format matter?     No
Few-Shot STEM                   Do examples help?       Minimal
This experiment                 Does methodology help?  No

A pattern emerges: on difficult novel problems, prompt engineering has diminishing returns. The bottleneck is model capability, not prompt sophistication.


Practical Recommendations

1. Keep Prompts Simple

For agentic coding tasks:

Solve this problem.
[problem description]

That's often enough. Save the elaborate protocols for documentation.

2. Invest in Feedback Loops

Instead of teaching the AI how to think, ensure it can:

  • Run its code
  • See test results
  • Observe failures
  • Iterate

Good feedback beats good instructions.

3. Focus on Problem Clarity

Ambiguous problem descriptions cause more failures than missing methodology. Spend your effort on:

  • Precise specifications
  • Representative test cases
  • Clear success criteria

4. Accept the Reasoning Ceiling

With a 36% failure rate on competitive programming, the limitation isn't process; it's capability. For mission-critical code, add:

  • Human review
  • Multiple attempts
  • Conservative validation

No prompt makes an AI smarter than it is.


Methodology

Dataset

  • CodeELO: 150 problems, rating >= 1200
  • Difficulty: 98 medium, 52 hard
  • Novel: Post-training-cutoff Codeforces problems

Execution

claude --dangerously-skip-permissions -p "$PROMPT" --model claude-sonnet-4-20250514
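A batch run over many problems needs this command wrapped in a script. The sketch below builds the baseline prompt quoted earlier and the argument list for the command above. The function names and the timeout value are our assumptions, not taken from the published harness. The example only constructs the command and does not invoke the CLI.

```python
import subprocess
from typing import List

MODEL = "claude-sonnet-4-20250514"  # model id taken from the command above


def build_simple_prompt(problem: str) -> str:
    """Assemble the three-line baseline prompt followed by the problem text."""
    return (
        "Solve this competitive programming problem.\n"
        "Read from stdin, write to stdout.\n"
        "Use Python.\n\n"
        f"{problem}"
    )


def build_command(prompt: str, model: str = MODEL) -> List[str]:
    """Argument list mirroring the shell command; no shell quoting needed."""
    return ["claude", "--dangerously-skip-permissions", "-p", prompt, "--model", model]


def run_claude(prompt: str, timeout_s: float = 600.0) -> subprocess.CompletedProcess:
    """Run one experiment. The 600 s timeout is an assumed value."""
    return subprocess.run(
        build_command(prompt), capture_output=True, text=True, timeout=timeout_s
    )


cmd = build_command(build_simple_prompt("Given n, print n squared."))
print(cmd[:3])
```

Passing the arguments as a list avoids shell interpolation problems when a problem statement contains quotes or dollar signs.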

Metrics

  • Pass rate on public test cases
  • Token usage
  • Latency
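Pass rate on public test cases can be measured with a checker along these lines. It is a minimal sketch, not the published harness: the function name `passes_public_tests`, the 5-second per-test timeout, and the whitespace-insensitive output comparison are all assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Tuple


def passes_public_tests(
    solution_path: Path,
    tests: List[Tuple[str, str]],  # (stdin text, expected stdout)
    timeout_s: float = 5.0,        # assumed per-test limit
) -> bool:
    """Return True only if the solution passes every test case."""
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, str(solution_path)],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0:
            return False
        # Compare token by token so trailing whitespace does not matter.
        if result.stdout.split() != expected.split():
            return False
    return True


# Toy demonstration with one correct and one incorrect solution.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.py"
    good.write_text("print(int(input()) ** 2)\n")
    bad = Path(tmp) / "bad.py"
    bad.write_text("print(int(input()) * 2)\n")
    tests = [("3\n", "9\n"), ("2\n", "4\n")]
    good_ok = passes_public_tests(good, tests)
    bad_ok = passes_public_tests(bad, tests)

print(good_ok, bad_ok)  # True False
```

The incorrect solution happens to pass the second test (2 * 2 = 4), which is a small reminder of why passing public tests is weaker evidence than passing a private suite.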

Reproducibility

Full code and results: github.com/nsameerd/claude-code-xml-prompting


Limitations

  1. Single model: Only Claude Sonnet 4 via Claude Code
  2. Specific domain: Competitive programming only
  3. Single run: No multi-attempt voting
  4. Public tests: Private test cases might differ

Conclusion

After 300 experiments, we found that elaborate reasoning protocols don't improve agentic AI performance on novel problems.

A 365-line XML methodology teaching Claude Code to think like an ICPC World Finalist achieved exactly the same 64% pass rate as a three-line "solve this" prompt. The only difference: 69% more tokens, 65% more latency.

The implication: you can't teach an AI how to think through prompting. Modern LLMs have internalized effective reasoning strategies. What they lack isn't methodology—it's knowledge and capability. No protocol can fill that gap.

For agentic AI development, focus on:

  • Clear problem specifications
  • Rich feedback loops
  • Robust test infrastructure

Let the model iterate its way to solutions. That's what it's good at.

