
Testing LLM Reasoning on Novel Problems: What 400 Experiments Reveal

We ran 400 experiments on competitive programming problems LLMs have never seen before. The results reveal that when memorization is eliminated, prompt formatting matters far less than model selection.

January 16, 2026 | 11 min read | By Mathematicon

TL;DR

We tested XML-structured versus natural language prompts using problems that post-date model training cutoffs, meaning these models couldn't have memorized the solutions. Across 400 runs with Claude Sonnet 4 and Gemini 2.0 Flash, prompt format had negligible impact (a difference of roughly 2-4 percentage points, with no consistent direction). The real finding: when models face genuinely novel problems, their reasoning capabilities, not prompt formatting, determine success.


Why This Experiment is Different

Most prompt engineering studies have a hidden flaw: they use problems that may exist in training data. When an LLM "solves" a well-known coding problem, it might be retrieving a memorized pattern rather than reasoning from scratch. This makes it impossible to know whether XML formatting helped the model think better or simply recall better.

We designed this experiment to eliminate that ambiguity.

The Novel Dataset Advantage

We used LiveCodeBench, a benchmark with a critical property: every problem is sourced from a dated competitive programming contest, so the test set can be limited to problems released after the models' training cutoffs, as ours was.

This means:

  • Zero memorization possible: Claude and Gemini cannot have seen these exact problems during training
  • Pure reasoning tested: Every solution requires genuine algorithmic thinking
  • Fair comparison: Neither model has an unfair advantage from data contamination
  • Credible conclusions: Any performance difference reflects actual capability, not recall

When a model fails on these problems, it's not because it "forgot"—it's because it genuinely couldn't solve them. When it succeeds, it's demonstrating real reasoning ability.


The Core Question

With memorization eliminated, we can ask the real question:

Does XML-structured prompting help LLMs reason better on novel problems, or is it just formatting theater?

The answer has significant implications. If XML helps models think more clearly, it's worth the overhead. If it doesn't, the prompt engineering community has been optimizing the wrong variable.


Experiment Design

Prompt Formats

XML Format:

<task>
  <problem_id>abc123</problem_id>
  <title>Two Sum</title>
  <description><![CDATA[
    Given an array of integers nums and an integer target,
    return indices of the two numbers that add up to target.
  ]]></description>
  <starter_code>def two_sum(nums, target):</starter_code>
  <output_format>Return only the solution code, no explanations.</output_format>
</task>

Natural Language Format:

Solve the following coding problem.

Problem ID: abc123
Title: Two Sum

Given an array of integers nums and an integer target,
return indices of the two numbers that add up to target.

Starter code:
def two_sum(nums, target):

Return only the solution code, no explanations.
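
A minimal sketch of how both prompts could be rendered from a single problem record, so that the two formats differ only in markup and not in content. The field names and the helpers `render_xml` and `render_natural` are assumptions for illustration, not the harness used in the experiment.

```python
# Hypothetical helpers: render one problem record in both prompt formats.
OUTPUT_RULE = "Return only the solution code, no explanations."


def render_xml(problem: dict) -> str:
    """Wrap each field in a tag; the description sits inside CDATA so
    characters such as < and & in the problem text need no escaping."""
    return (
        "<task>\n"
        f"  <problem_id>{problem['id']}</problem_id>\n"
        f"  <title>{problem['title']}</title>\n"
        f"  <description><![CDATA[\n{problem['description']}\n  ]]></description>\n"
        f"  <starter_code>{problem['starter_code']}</starter_code>\n"
        f"  <output_format>{OUTPUT_RULE}</output_format>\n"
        "</task>"
    )


def render_natural(problem: dict) -> str:
    """Present the same fields as plain labelled text."""
    return (
        "Solve the following coding problem.\n\n"
        f"Problem ID: {problem['id']}\n"
        f"Title: {problem['title']}\n\n"
        f"{problem['description']}\n\n"
        f"Starter code:\n{problem['starter_code']}\n\n"
        f"{OUTPUT_RULE}"
    )


problem = {
    "id": "abc123",
    "title": "Two Sum",
    "description": (
        "Given an array of integers nums and an integer target,\n"
        "return indices of the two numbers that add up to target."
    ),
    "starter_code": "def two_sum(nums, target):",
}
xml_prompt = render_xml(problem)
natural_prompt = render_natural(problem)
```

Rendering both prompts from the same record keeps the comparison fair: any difference in results can then be attributed to the format rather than to wording drift between two hand-written prompts.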

Models Tested

Model | Provider | Method | Notes
Claude Sonnet 4 | Anthropic | CLI | Latest Sonnet model
Gemini 2.0 Flash | Google | REST API | Fast inference, cost-effective

Parameters

  • 100 problems × 2 models × 2 formats = 400 total runs
  • Timeout: 3 minutes per problem
  • Temperature: 0 (to keep outputs as close to deterministic as the APIs allow)
  • Evaluation: Python subprocess execution against public test cases
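
The run matrix above can be sketched as a simple cross product. The model labels, placeholder problem IDs, and dictionary keys below are illustrative assumptions; only the counts, the temperature, and the 3-minute timeout come from the parameters listed.

```python
from itertools import product

MODELS = ["claude-sonnet-4", "gemini-2.0-flash"]  # illustrative labels
FORMATS = ["natural", "xml"]
# Placeholder IDs standing in for the 100 benchmark problems.
problem_ids = [f"problem_{i:03d}" for i in range(100)]

# One run per (problem, model, format) combination.
runs = [
    {
        "problem_id": pid,
        "model": model,
        "format": fmt,
        "temperature": 0,
        "timeout_s": 180,  # 3 minutes per problem
    }
    for pid, model, fmt in product(problem_ids, MODELS, FORMATS)
]

print(len(runs))  # 100 * 2 * 2 = 400
```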

Results

Summary Table

Model | Format | Runs | Avg Output Tokens | Avg Latency | Cost | Pass Rate
Claude | Natural | 100 | 391 | 13.6s | $0.73 | 45.5%
Claude | XML | 100 | 444 | 15.3s | $0.82 | 47.7%
Gemini | Natural | 100 | 454 | 37.8s | $0.02 | 33.3%
Gemini | XML | 100 | 524 | 73.4s | $0.02 | 29.7%

Total Cost: $1.58 (Claude: $1.55, Gemini: $0.04)

The Headline Numbers

  • Claude overall: 46.6% pass rate
  • Gemini overall: 31.5% pass rate
  • Model gap: 15.1 percentage points
  • Format gap: 2-4 percentage points (within noise)

What Novel Problems Reveal

Finding 1: Prompt Format Is Surface-Level Optimization

On problems these models have never seen, XML formatting provided:

  • Claude: +2.2 percentage points (47.7% vs 45.5%)
  • Gemini: -3.6 percentage points (29.7% vs 33.3%)

These differences are not statistically significant at this sample size. More importantly, the direction isn't even consistent: XML helped Claude marginally but hurt Gemini marginally.
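
As a rough check on the significance claim, here is a two-proportion z-test on the reported pass rates. It is a sketch under two assumptions: that each pass rate applies to 100 scored runs, and that the runs can be treated as independent. Because both formats were run on the same problems, a paired test on per-problem outcomes would be the stricter choice, but it needs data that is not listed here.

```python
from math import erf, sqrt


def two_proportion_z(p1: float, p2: float, n1: int, n2: int):
    """Return (z, two-sided p-value) for the difference p1 - p2,
    using the pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# XML vs natural, assuming n = 100 per cell.
claude_z, claude_p = two_proportion_z(0.477, 0.455, 100, 100)
gemini_z, gemini_p = two_proportion_z(0.297, 0.333, 100, 100)

print(f"Claude: z = {claude_z:.2f}, p = {claude_p:.2f}")
print(f"Gemini: z = {gemini_z:.2f}, p = {gemini_p:.2f}")
```

Under these assumptions both |z| values are well below 1.96, so neither difference clears the conventional 5% threshold.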

Interpretation: When models must reason from scratch on novel problems, how you format the question matters far less than whether the model can solve the question.

Finding 2: The Reasoning Bottleneck

Both models failed on over half the problems:

  • Claude failed 53.4% of problems
  • Gemini failed 68.5% of problems

These aren't trivial problems—they're competitive programming challenges requiring algorithm selection, edge case handling, and correct implementation. The pass rates reveal the actual reasoning ceiling of these models on genuinely novel tasks.

No amount of XML formatting can help a model that doesn't know the right algorithm.

Finding 3: Model Selection Dominates

The gap between models (15.1 percentage points) is roughly 4x to 7x larger than either prompt format effect (3.6 and 2.2 points). If you're optimizing for code generation accuracy:

  • Switching from Gemini to Claude: +48% relative improvement
  • Switching from Natural to XML on Claude: +4.8% relative improvement

In relative terms, model selection had about ten times the impact of prompt formatting.
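
The arithmetic behind these comparisons can be reproduced from the summary table. The variable names are illustrative; the pass rates are the ones reported above.

```python
# Pass rates in percent, taken from the summary table.
claude = {"natural": 45.5, "xml": 47.7}
gemini = {"natural": 33.3, "xml": 29.7}

claude_overall = sum(claude.values()) / 2      # about 46.6
gemini_overall = sum(gemini.values()) / 2      # about 31.5
model_gap = claude_overall - gemini_overall    # about 15.1 points

# Absolute format effects, in percentage points.
claude_format = claude["xml"] - claude["natural"]   # about +2.2
gemini_format = gemini["xml"] - gemini["natural"]   # about -3.6

# Relative improvements, in percent of the baseline pass rate.
model_relative = model_gap / gemini_overall * 100
format_relative = claude_format / claude["natural"] * 100

# How many times larger the model gap is than each format effect.
ratio_vs_claude = model_gap / abs(claude_format)
ratio_vs_gemini = model_gap / abs(gemini_format)

print(f"model gap: {model_gap:.1f} points ({model_relative:.0f}% relative)")
print(f"format effect on Claude: {format_relative:.1f}% relative")
print(f"gap / format effect: {ratio_vs_gemini:.1f}x to {ratio_vs_claude:.1f}x")
```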

Finding 4: XML Increases Overhead Without Benefit

XML prompts consistently generated more tokens:

  • Claude: +53 tokens (+13.5%)
  • Gemini: +70 tokens (+15.4%)

This means higher costs and longer latencies. Gemini's XML latency nearly doubled (37.8s → 73.4s), which may point to additional processing on structured inputs, with no accuracy gain to show for it.


Why Doesn't XML Help on Novel Problems?

Hypothesis 1: No Patterns to Trigger

On familiar problems, structured formatting might help activate relevant training patterns. But with novel problems, there are no cached solutions to retrieve. The model must derive the answer from first principles, and XML tags don't improve first-principles reasoning.

Hypothesis 2: The Bottleneck Is Algorithmic

Competitive programming problems require:

  • Recognizing the problem class (DP, greedy, graph traversal, etc.)
  • Selecting the appropriate algorithm
  • Handling edge cases correctly
  • Implementing without bugs

Prompt formatting addresses none of these. A well-formatted prompt can't teach a model an algorithm it doesn't understand.

Hypothesis 3: Modern LLMs Are Format-Agnostic

Contemporary models are trained on massive corpora containing both structured and unstructured text. They've learned to extract semantic meaning regardless of syntactic presentation. The "structure" in XML may be redundant to a model that already understands the underlying concepts.

Hypothesis 4: Early Sensitivity Has Diminished

Early LLMs (GPT-2 era) were highly sensitive to prompt phrasing. Modern instruction-tuned models are robust to format variations. The prompt engineering techniques that worked in 2022 may have diminishing returns with current models.


The Economics Argument

Even if XML provided a small accuracy boost, the economics work against it:

Metric | Natural | XML | Difference
Avg Input Tokens | ~800 | ~950 | +19%
Avg Output Tokens | ~420 | ~480 | +14%
Cost per Problem | $0.0037 | $0.0042 | +14%

At scale:

  • 10,000 problems/day × 14% overhead: about $42 per day instead of $37 at the per-problem costs above
  • With no meaningful accuracy improvement

Implications for Vibe Coding

This experiment reinforces a core principle: substance over style.

The vibe coding movement embraces rapid iteration with LLMs, but it's easy to get distracted by prompt engineering folklore. "Use XML!" "Add role-play!" "Include examples!" These techniques may have their place, but our data suggests diminishing returns for straightforward code generation on novel problems.

What Actually Moves the Needle

  1. Model selection: Claude's pass rate was 48% higher than Gemini's in relative terms
  2. Problem decomposition: Break complex tasks into verifiable steps
  3. Verification infrastructure: Automated tests catch more errors than prompt tweaks
  4. Context quality: Clear problem descriptions matter more than formatting

The Real Lesson

When facing genuinely novel problems—the kind that matter in production—LLMs succeed or fail based on their reasoning capabilities, not your XML schemas. Invest your optimization effort accordingly.


Recommendations

1. Choose Your Model Carefully

Model selection had roughly 4x to 7x the impact of prompt formatting in our results. Benchmark different models on your specific use case before investing in prompt engineering.

2. Use Natural Language by Default

Since accuracy is equivalent, prefer the format that's easier to write, read, and maintain. Your team's productivity matters.

3. Reserve XML for Structured Outputs

XML does provide value when you need the model to produce structured data:

  • Multi-part responses with clear delimiters
  • Few-shot examples with consistent formatting
  • Function calling with explicit parameter schemas

But for straightforward code generation? Plain English works equally well.

4. Invest in Verification

Given pass rates of 46.6% and 31.5%, the real opportunity isn't prompt optimization—it's building robust verification pipelines:

  • Automated test execution
  • Type checking
  • Multi-sample consensus
  • Human review for critical code

At these pass rates, effort spent on verification infrastructure is likely to yield higher returns than prompt refinement.
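
One way to read "multi-sample consensus" from the list above is to run several independently sampled candidate solutions on the same input and keep the most common output. The sketch below assumes the candidates are plain Python callables; `consensus_output` and the sample functions are hypothetical names, not part of the experiment.

```python
from collections import Counter


def consensus_output(candidates, test_input):
    """Run every candidate on the same input and return the most common
    output together with the share of candidates that produced it."""
    outputs = []
    for solve in candidates:
        try:
            # repr() makes unhashable results (lists, dicts) countable.
            outputs.append(repr(solve(test_input)))
        except Exception:
            continue  # a crashing candidate casts no vote
    if not outputs:
        return None, 0.0
    winner, votes = Counter(outputs).most_common(1)[0]
    return winner, votes / len(candidates)


# Three hypothetical samples for "return the largest element".
samples = [
    lambda xs: max(xs),
    lambda xs: sorted(xs)[-1],
    lambda xs: xs[0],  # buggy sample: returns the first element
]
answer, agreement = consensus_output(samples, [3, 9, 4])
print(answer, agreement)
```

A low agreement share is a useful signal in its own right: it flags problems where the samples disagree and human review is most likely to pay off.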


Methodology Notes

Evaluation Pipeline

  1. Code extraction: Parse markdown code blocks from model response
  2. File creation: Write solution + test harness to temp file
  3. Subprocess execution: Run with 10-second timeout and memory limits
  4. Result comparison: Stdout vs expected output
  5. Classification: Pass/fail/error/timeout
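
The five steps above can be sketched as follows. This is a simplified illustration, not the harness used for the experiment: it assumes solutions read from stdin and write to stdout, it applies the 10-second timeout but omits the memory limits, and the names `extract_code` and `evaluate` are hypothetical.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

FENCE = "`" * 3  # markdown code fence, built here to keep this listing readable
CODE_BLOCK = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Step 1: take the first fenced block, or the whole response if none."""
    match = CODE_BLOCK.search(response)
    return match.group(1) if match else response


def evaluate(response: str, stdin_text: str, expected: str,
             timeout_s: float = 10.0) -> str:
    """Steps 2-5: write the code to a temp file, run it, compare stdout,
    and classify the run as pass, fail, error, or timeout."""
    code = extract_code(response)
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    if proc.returncode != 0:
        return "error"
    return "pass" if proc.stdout.strip() == expected.strip() else "fail"


response = (
    "Here is the solution:\n"
    + FENCE + "python\nprint(sum(map(int, input().split())))\n" + FENCE
)
result = evaluate(response, "2 3\n", "5")
print(result)
```

Running each solution in a separate process keeps a crash or an infinite loop in generated code from taking down the evaluator itself.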

Reproducibility


Limitations

1. Single Problem Domain

LiveCodeBench focuses on competitive programming. Results may differ for:

  • Web development tasks
  • Data transformation
  • API integration
  • Documentation generation

2. Binary Evaluation

We measured pass/fail on test cases. This misses:

  • Code quality metrics
  • Partial correctness
  • Style and maintainability

3. Sample Size

100 problems × 2 formats = 200 runs per model. Larger samples might reveal statistically significant differences, though effect sizes would remain small.
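
To give a sense of scale, the sketch below computes an approximate 95% margin of error for each reported pass rate. It uses the normal approximation and assumes each rate is based on 100 independent runs; `margin_of_error` is an illustrative helper.

```python
from math import sqrt


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval
    for a proportion observed over n runs."""
    return z * sqrt(pass_rate * (1 - pass_rate) / n)


reported = {
    "Claude natural": 0.455,
    "Claude XML": 0.477,
    "Gemini natural": 0.333,
    "Gemini XML": 0.297,
}
margins = {label: margin_of_error(rate, 100) for label, rate in reported.items()}

for label, moe in margins.items():
    print(f"{label}: {reported[label]:.1%} +/- {moe * 100:.1f} points")
```

Under these assumptions every margin comes out at roughly 9 to 10 percentage points, several times the 2-4 point format gaps.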

4. Model Versions

Results reflect Claude Sonnet 4 and Gemini 2.0 Flash as of the experiment date.


Conclusion

After 400 runs on problems these models have never seen, the verdict is clear: XML-structured prompts provide no meaningful advantage over natural language for code generation on novel problems.

This finding is particularly credible because we eliminated memorization as a confounding variable. These models couldn't pattern-match against training data—they had to reason from scratch. And when they did, prompt formatting was irrelevant to their success.

The implications:

  • If you're spending hours crafting XML schemas, that time is likely wasted
  • Model selection matters far more than prompt formatting
  • Verification infrastructure beats prompt engineering for reliability

The future of AI-assisted development isn't about finding the perfect prompt incantation. It's about understanding what LLMs can and cannot do on genuinely novel problems, and building systems that verify, validate, and iterate accordingly.


FAQ

Q: Should I stop using XML in my prompts?

A: Not necessarily. If your team has established conventions, keep using them. The point is that changing your format won't improve accuracy on novel problems.

Q: Why did Gemini perform worse than Claude?

A: Gemini 2.0 Flash is optimized for speed and cost efficiency. Claude Sonnet 4 has stronger reasoning capabilities at a higher price. Note that in our runs Claude's average latency was actually lower (13.6-15.3s versus 37.8-73.4s for Gemini).

Q: Would results differ with GPT-4?

A: Possibly. Each model has different capabilities. However, the principle—that novel problems reveal reasoning limits regardless of formatting—likely applies broadly.

Q: Does this apply to non-code tasks?

A: Our experiment only tested code generation. Results may differ for structured data extraction or multi-part outputs where delimiters provide clearer boundaries.


Cost Breakdown

Model | Input Cost/1M | Output Cost/1M | Total Input Tokens | Total Output Tokens | Total Cost
Claude | $3.00 | $15.00 | ~180,000 | ~83,500 | $1.55
Gemini | $0.075 | $0.30 | ~195,000 | ~97,800 | $0.04

Gemini is 39x cheaper, but Claude's 48% relative accuracy advantage may justify the premium depending on your use case.
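
For readers who want to redo the cost arithmetic, the sketch below applies the per-million-token prices to the approximate token totals in the table. The dictionary layout and `total_cost` are illustrative. With these rounded totals the formula gives about $1.79 for Claude and about $0.04 for Gemini; the Claude figure is higher than the $1.55 reported above, so the token totals are best read as approximate.

```python
# USD per 1M tokens (input, output), from the cost breakdown table.
PRICES = {
    "Claude": (3.00, 15.00),
    "Gemini": (0.075, 0.30),
}
# Approximate token totals (input, output), from the same table.
TOKENS = {
    "Claude": (180_000, 83_500),
    "Gemini": (195_000, 97_800),
}


def total_cost(model: str) -> float:
    """Cost in USD: tokens times price per million, summed over both sides."""
    in_price, out_price = PRICES[model]
    in_tokens, out_tokens = TOKENS[model]
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


costs = {model: total_cost(model) for model in PRICES}
for model, cost in costs.items():
    print(f"{model}: ${cost:.2f}")
```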
