TL;DR
We tested whether you can improve Claude Code's problem-solving by giving it an elaborate reasoning protocol: a 365-line XML document specifying a 10-phase methodology that mimics how expert competitive programmers think. After 300 runs (150 novel CodeELO problems under each of two prompts), both conditions achieved identical 64% pass rates. The protocol added 69% more tokens and 65% more latency with zero accuracy benefit. But the why is more interesting than the what.
A Different Question Than Format
Our previous experiment asked: Does XML formatting help LLMs reason better? The answer was no—wrapping the same information in <tags> versus plain text made no difference.
This experiment asks something deeper: Can you teach an LLM how to think by providing an explicit reasoning methodology?
This isn't about formatting. It's about whether giving a model a detailed cognitive protocol—the kind of structured approach that helps human experts—translates to better AI performance.
The Hypothesis
Expert competitive programmers follow systematic approaches:
- Restate the problem to ensure understanding
- Analyze constraints to determine viable complexity
- Enumerate possible algorithms
- Select and justify the optimal approach
- Prove correctness before coding
- Walk through examples
- Identify edge cases
- Verify complexity bounds
- Implement carefully
- Reflect on learnings
What if we encoded this entire methodology into a prompt? Would Claude Code perform better by following an expert's cognitive process rather than just "winging it"?
The Protocol
We created a 365-line XML document that enforces structured reasoning:
<CompetitiveProgrammingAssistant>
  <Configuration>
    <Mode>Strict-Analysis-First</Mode>
    <ExperienceLevel>ICPC-World-Finalist</ExperienceLevel>
  </Configuration>
  <Directives>
    <Directive id="1" priority="critical">
      <Description>No premature coding allowed</Description>
      <Enforcement>Terminate response if code appears before step 7</Enforcement>
    </Directive>
  </Directives>
  <ProblemSolvingProtocol>
    <Phase name="Understanding">
      <Step number="1">
        <Title>Problem Restatement</Title>
        <RequiredElements>
          <Element>Concise summary in one sentence</Element>
          <Element>Input format specification</Element>
          <Element>Output format specification</Element>
        </RequiredElements>
      </Step>
      <!-- ... 9 more detailed phases ... -->
    </Phase>
  </ProblemSolvingProtocol>
</CompetitiveProgrammingAssistant>
The full protocol specifies:
- Constraints analysis tables with complexity implications
- Algorithm enumeration across categories (brute force, greedy, DP, graphs)
- Selection criteria with explicit justification requirements
- Proof sketches before implementation
- Edge case taxonomies (minimal, maximal, pathological, boundary)
- Complexity verification against limits
- Post-mortem analysis for learning
This is the kind of elaborate "chain of thought" prompting that many practitioners believe should dramatically improve reasoning.
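To make the constraints-analysis step concrete, here is a minimal Python sketch of the kind of table the protocol asks the model to produce. It is an illustration, not an excerpt from the protocol: the function name `viable_complexities`, the list of candidate complexity classes, and the budget of roughly 10^8 simple operations per second are all assumptions (the last one is a common competitive-programming rule of thumb).

```python
import math

# Assumed rule of thumb: about 1e8 simple operations fit in one second.
OPS_PER_SECOND = 1e8

# Candidate complexity classes mapped to a rough operation count for input size n.
COMPLEXITIES = {
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log2(max(n, 2)),
    "O(n^2)": lambda n: n ** 2,
    "O(n^3)": lambda n: n ** 3,
    # Cap the exponent so the float does not overflow for large n.
    "O(2^n)": lambda n: 2.0 ** min(n, 1000),
}


def viable_complexities(n, time_limit_s=1.0):
    """Return the complexity classes whose estimated cost fits the time budget."""
    budget = OPS_PER_SECOND * time_limit_s
    return [name for name, cost in COMPLEXITIES.items() if cost(n) <= budget]


for n in (20, 5_000, 200_000):
    print(n, viable_complexities(n))
```

With these assumptions, an input bound of n = 200,000 rules out anything quadratic or worse, while n = 20 leaves even exponential search on the table. That is the inference the protocol's constraint table is meant to force before any code is written.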
The Control: Radical Simplicity
The baseline prompt was deliberately minimal:
Solve this competitive programming problem.
Read from stdin, write to stdout.
Use Python.
[problem description]
No methodology. No structure. No guidance on how to think. Just the task.
Why This Tests Something New
Agentic Execution
Unlike API calls that generate a single response, Claude Code operates agentically:
- Analyzes the problem
- Writes initial code
- Runs tests
- Observes failures
- Debugs and iterates
- Repeats until success or timeout
This feedback loop means the model isn't just generating—it's iterating. The question becomes: does upfront structured analysis help, or does trial-and-error iteration make it redundant?
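The loop above can be sketched in a few lines of Python. This is a toy model, not how Claude Code is implemented: the `attempts` list stands in for successive revisions the model would write, and `run_tests` and `iterate` are hypothetical names chosen for illustration.

```python
def run_tests(solve, tests):
    """Return the (input, expected, actual) triples that failed."""
    failures = []
    for given, expected in tests:
        try:
            actual = solve(given)
        except Exception as exc:  # a crash counts as a failed test
            actual = repr(exc)
        if actual != expected:
            failures.append((given, expected, actual))
    return failures


def iterate(attempts, tests, max_rounds=5):
    """Try successive attempts until one passes every test or the budget runs out."""
    for round_no, solve in enumerate(attempts[:max_rounds], start=1):
        failures = run_tests(solve, tests)
        if not failures:
            return round_no, solve
        # In a real agent, `failures` would be fed back to the model here
        # so that the next attempt can target the observed problem.
    return None, None


# Toy task: sum a list of integers.
tests = [([1, 2, 3], 6), ([], 0), ([-1, 1], 0)]
attempts = [
    lambda xs: max(xs),  # first draft: wrong, and crashes on the empty list
    lambda xs: sum(xs),  # revision after seeing the failures
]

round_no, solution = iterate(attempts, tests)
print(f"passed on round {round_no}")
```

The point of the sketch is that each round produces concrete evidence (which inputs failed, and how), which is the signal upfront analysis has to do without.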
Novel Problems
We used CodeELO, a dataset of 408 Codeforces problems. We selected 150 problems with rating >= 1200 (medium to hard difficulty). These are genuinely challenging algorithmic problems—dynamic programming, graph algorithms, number theory—not toy examples.
Results
The Headline
| Condition | Pass Rate | Avg Tokens | Avg Latency |
|---|---|---|---|
| Simple Prompt | 64.0% | 1,017 | 40.2s |
| Elaborate Protocol | 64.0% | 1,718 | 66.4s |
Identical accuracy. The elaborate reasoning protocol provided zero improvement.
The Cost
| Metric | Simple | Protocol | Overhead |
|---|---|---|---|
| Tokens | 1,017 | 1,718 | +69% |
| Latency | 40.2s | 66.4s | +65% |
The protocol forced Claude Code to generate extensive analysis, justifications, and proofs—all of which consumed tokens and time without improving outcomes.
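For readers who want to check the overhead column, the arithmetic is a plain relative difference against the simple-prompt baseline. A minimal Python sketch, using only the averages from the table (`overhead_pct` is a helper name introduced here):

```python
def overhead_pct(baseline, treatment):
    """Relative increase of treatment over baseline, in percent."""
    return (treatment - baseline) / baseline * 100


tokens = overhead_pct(1017, 1718)
latency = overhead_pct(40.2, 66.4)
print(f"tokens: +{tokens:.0f}%, latency: +{latency:.0f}%")
# prints "tokens: +69%, latency: +65%"
```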
By Difficulty
| Difficulty | Simple | Protocol | Delta |
|---|---|---|---|
| Medium (98) | 69.4% | 66.3% | -3.1 pp |
| Hard (52) | 53.8% | 59.6% | +5.8 pp |
Interesting pattern: the protocol slightly hurt medium problems but slightly helped hard ones. However, each shift amounts to only three problems, and the overall comparison shows no significant difference (p = 1.0).
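It helps to translate the percentages back into problem counts. The sketch below does that in Python; note that the per-tier pass counts are not reported directly, so they are inferred here by rounding, which is an assumption. The helper name `implied_passes` is also introduced for illustration.

```python
# tier: (number of problems, simple pass rate %, protocol pass rate %)
tiers = {
    "Medium": (98, 69.4, 66.3),
    "Hard": (52, 53.8, 59.6),
}


def implied_passes(n, rate_pct):
    """Infer the pass count from a rounded percentage (assumes exact rounding)."""
    return round(n * rate_pct / 100)


totals = {"simple": 0, "protocol": 0}
swings = {}
for tier, (n, simple_pct, protocol_pct) in tiers.items():
    simple = implied_passes(n, simple_pct)
    protocol = implied_passes(n, protocol_pct)
    totals["simple"] += simple
    totals["protocol"] += protocol
    swings[tier] = protocol - simple
    print(f"{tier}: simple {simple}/{n}, protocol {protocol}/{n}, swing {swings[tier]:+d}")

print(totals)
```

Under that rounding assumption, the protocol loses three medium problems and gains three hard ones, and both conditions land on 96 of 150, which matches the 64% headline.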
Head-to-Head
| Outcome | Count |
|---|---|
| Both passed | 86 |
| Both failed | 44 |
| Only simple passed | 10 |
| Only protocol passed | 10 |
Perfect symmetry. The protocol didn't systematically help any problem class—it just shuffled which 10 problems succeeded.
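One way to see why this table gives p = 1.0 is an exact McNemar test, which looks only at the discordant pairs (problems that one condition passed and the other failed). We are assuming here that this is the kind of paired test behind the reported p-value; the post does not name the test, and `mcnemar_exact` is a small stdlib-only helper written for this illustration.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p = mcnemar_exact(10, 10)  # only simple passed, only protocol passed
print(p)
# prints 1.0
```

A 10 versus 10 split is the most balanced outcome possible for 20 discordant problems, so no paired test could find a difference here.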
Why Doesn't Teaching "How to Think" Help?
Hypothesis 1: The Model Already Knows
Modern instruction-tuned LLMs have internalized effective problem-solving strategies through training on millions of examples. When Claude Code encounters a competitive programming problem, it already:
- Considers the constraints
- Thinks about multiple approaches
- Handles edge cases
- Iterates on failures
The explicit protocol may simply be redundant—formalizing what the model does implicitly.
Hypothesis 2: Iteration Trumps Planning
In agentic execution, Claude Code can:
- Make a quick attempt
- Run tests
- See what fails
- Fix specific issues
- Repeat
This feedback loop may be more effective than extensive upfront analysis. Learning from failures (test-driven iteration) might beat preventing failures (exhaustive planning).
Consider: a quick hypothesis tested against reality provides more signal than a careful hypothesis developed in isolation.
Hypothesis 3: The Bottleneck Is Knowledge, Not Process
For the 44 problems that both conditions failed, the likely explanations are:
- The model didn't know the required algorithm
- Or couldn't implement it correctly
- Or missed subtle edge cases
No cognitive protocol can conjure algorithmic knowledge that isn't there. You can't reason your way to a segment tree if you've never learned what a segment tree is.
Hypothesis 4: Verbose Reasoning Introduces Noise
The protocol forced generation of:
- Detailed constraint tables
- Algorithm enumerations
- Proof sketches
- Edge case taxonomies
Each section is an opportunity for the model to:
- Make reasoning errors
- Commit to wrong approaches
- Lose focus on the actual problem
More tokens ≠ better reasoning. Sometimes verbose analysis just means more ways to go wrong.
What This Means for Agentic AI
The "Teach It How to Think" Fallacy
A common belief: if you want better AI outputs, provide detailed instructions on how to approach problems. Our data suggest this does not hold for agentic coding with test feedback.
Claude Code doesn't need a 10-phase methodology. It needs:
- Clear problem specification: What exactly should be solved?
- Test cases: How will correctness be verified?
- Iteration capability: Can it see and learn from failures?
When Protocols Might Help
This doesn't mean structured reasoning is always useless. It might help when:
- No iteration possible: Single-shot API calls can't learn from mistakes
- Human collaboration: Explicit reasoning helps humans verify AI work
- Auditability: Regulated domains require documented decision processes
- Teaching: Showing reasoning helps users learn
But for autonomous code generation with test feedback? Let the model iterate.
The Agentic Advantage
Our 64% pass rate, while not perfect, suggests that agentic iteration is powerful. The model:
- Tries approaches
- Gets feedback
- Adapts
This loop may be more valuable than any prompt engineering. Build better feedback systems, not better prompts.
Comparison with Other Experiments
| Experiment | Question | Finding |
|---|---|---|
| XML vs Natural (LiveCodeBench) | Does format matter? | No |
| Few-Shot STEM | Do examples help? | Minimal |
| This experiment | Does methodology help? | No |
A pattern emerges: on difficult novel problems, prompt engineering has diminishing returns. The bottleneck is model capability, not prompt sophistication.
Practical Recommendations
1. Keep Prompts Simple
For agentic coding tasks:
Solve this problem.
[problem description]
That's often enough. Save the elaborate protocols for documentation.
2. Invest in Feedback Loops
Instead of teaching the AI how to think, ensure it can:
- Run its code
- See test results
- Observe failures
- Iterate
Good feedback beats good instructions.
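As a rough idea of what such a feedback loop needs at minimum, here is a Python sketch of a stdin/stdout test runner. It is not the harness used in this experiment: `run_case`, its return format, and the 5-second timeout are assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_case(solution_path, stdin_text, expected_stdout, timeout_s=5.0):
    """Run a Python solution on one stdin/stdout case and report what happened."""
    try:
        proc = subprocess.run(
            [sys.executable, str(solution_path)],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"passed": False, "reason": "timeout"}
    if proc.returncode != 0:
        return {"passed": False, "reason": "runtime error", "stderr": proc.stderr}
    actual = proc.stdout.strip()
    if actual != expected_stdout.strip():
        return {"passed": False, "reason": "wrong answer", "actual": actual}
    return {"passed": True, "reason": "ok"}


# Demo with a throwaway solution that sums the integers on one input line.
with tempfile.TemporaryDirectory() as tmp:
    solution = Path(tmp) / "solution.py"
    solution.write_text("print(sum(map(int, input().split())))\n")
    result = run_case(solution, "1 2 3\n", "6\n")
    wrong = run_case(solution, "1 2 3\n", "7\n")

print(result, wrong)
```

The useful part is the failure detail: a verdict of "wrong answer" with the actual output, or "runtime error" with the traceback, gives the model something specific to fix on the next round.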
3. Focus on Problem Clarity
Ambiguous problem descriptions cause more failures than missing methodology. Spend your effort on:
- Precise specifications
- Representative test cases
- Clear success criteria
4. Accept the Reasoning Ceiling
With a 36% failure rate on competitive programming, the limitation isn't process; it's capability. For mission-critical code:
- Human review
- Multiple attempts
- Conservative validation
No prompt makes an AI smarter than it is.
Methodology
Dataset
- CodeELO: 150 problems, rating >= 1200
- Difficulty: 98 medium, 52 hard
- Novel: Post-training-cutoff Codeforces problems
Execution
claude --dangerously-skip-permissions -p "$PROMPT" --model claude-sonnet-4-20250514
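If you want to drive that command from a script, a thin Python wrapper is enough. The sketch below reuses the exact flags and model name from the command above and the baseline prompt text from earlier in the post; the function names and the 600-second timeout are assumptions, and this is not the experiment's actual driver code.

```python
import subprocess

MODEL = "claude-sonnet-4-20250514"


def build_simple_prompt(problem_description):
    """Assemble the minimal baseline prompt around a problem statement."""
    return (
        "Solve this competitive programming problem.\n"
        "Read from stdin, write to stdout.\n"
        "Use Python.\n\n"
        f"{problem_description}"
    )


def build_command(prompt, model=MODEL):
    """Build the argument list for a non-interactive Claude Code run."""
    return ["claude", "--dangerously-skip-permissions", "-p", prompt, "--model", model]


def run_claude(prompt, timeout_s=600):
    """Invoke the CLI; requires Claude Code to be installed and on PATH."""
    return subprocess.run(
        build_command(prompt), capture_output=True, text=True, timeout=timeout_s
    )


command = build_command(build_simple_prompt("Given n, print n squared."))
print(command[:3])
```

Passing the prompt as a list element rather than interpolating it into a shell string avoids quoting problems with problem statements that contain quotes or dollar signs.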
Metrics
- Pass rate on public test cases
- Token usage
- Latency
Reproducibility
Full code and results: github.com/nsameerd/claude-code-xml-prompting
Limitations
- Single model: Only Claude Sonnet 4 via Claude Code
- Specific domain: Competitive programming only
- Single run: No multi-attempt voting
- Public tests: Private test cases might differ
Conclusion
After 300 runs (150 problems under each of two prompts), we found that elaborate reasoning protocols don't improve agentic AI performance on novel problems.
A 365-line XML methodology teaching Claude Code to think like an ICPC World Finalist achieved exactly the same 64% pass rate as a three-line "solve this" prompt. The only difference: 69% more tokens, 65% more latency.
The implication: for agentic coding with test feedback, you can't teach an AI how to think through prompting. Modern LLMs have internalized effective reasoning strategies. What they lack isn't methodology; it's knowledge and capability. No protocol can fill that gap.
For agentic AI development, focus on:
- Clear problem specifications
- Rich feedback loops
- Robust test infrastructure
Let the model iterate its way to solutions. That's what it's good at.
Links
- Code & Results: github.com/nsameerd/claude-code-xml-prompting
- Dataset: CodeELO on HuggingFace
- Related: XML vs Natural Language Experiment