Testing LLM Reasoning on Novel Problems: What 400 Experiments Reveal

TL;DR

We tested XML-structured versus natural language prompts on problems that post-date model training cutoffs, so the models could not have memorized the solutions. Across 400 runs with Claude Sonnet 4 and Gemini 2.0 Flash, prompt format had a negligible impact (a 2-4 percentage point difference, with no consistent direction). The real finding: when models face genuinely novel problems, their reasoning capabilities, not prompt formatting, determine success.

Why This Experiment Is Different

Most prompt engineering studies have a hidden flaw: they use problems that may exist in training data. When an LLM "solves" a well-known coding problem, it might be retrieving a memorized pattern rather than reasoning from scratch. This makes it impossible to know whether XML formatting helped the model think better or simply recall better.

We designed this experiment to eliminate that ambiguity.

The Novel Dataset Advantage

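We used LiveCodeBench, a benchmark with a critical property: its problems are collected from competitive programming contests and tagged with their contest dates, so the evaluation set can be limited to problems published after the training cutoff dates of the models under test.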

This means:

Zero memorization possible: Claude and Gemini cannot have seen these exact problems during training
Pure reasoning tested: Every solution requires genuine algorithmic thinking
Fair comparison: Neither model has an unfair advantage from data contamination
Credible conclusions: Any performance difference reflects actual capability, not recall

When a model fails on these problems, it's not because it "forgot"—it's because it genuinely couldn't solve them. When it succeeds, it's demonstrating real reasoning ability.

The Core Question

With memorization eliminated, we can ask the real question:

Does XML-structured prompting help LLMs reason better on novel problems, or is it just formatting theater?

The answer has significant implications. If XML helps models think more clearly, it's worth the overhead. If it doesn't, the prompt engineering community has been optimizing the wrong variable.

Experiment Design

Prompt Formats

XML Format:

<task>
  <problem_id>abc123</problem_id>
  <title>Two Sum</title>
  <description><![CDATA[
    Given an array of integers nums and an integer target,
    return indices of the two numbers that add up to target.
  ]]></description>
  <starter_code>def two_sum(nums, target):</starter_code>
  <output_format>Return only the solution code, no explanations.</output_format>
</task>

Natural Language Format:

Solve the following coding problem.

Problem ID: abc123
Title: Two Sum

Given an array of integers nums and an integer target,
return indices of the two numbers that add up to target.

Starter code:
def two_sum(nums, target):

Return only the solution code, no explanations.
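
The two formats above can be generated from the same problem record, which keeps the content identical and varies only the wrapper. The following is a minimal sketch, not the experiment's actual code: the dictionary keys (`id`, `title`, `description`, `starter_code`) and both function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

INSTRUCTION = "Return only the solution code, no explanations."


def build_xml_prompt(problem: dict) -> str:
    """Render a problem as the XML-structured prompt shown above."""
    return (
        "<task>\n"
        f"  <problem_id>{escape(problem['id'])}</problem_id>\n"
        f"  <title>{escape(problem['title'])}</title>\n"
        # CDATA lets the description contain < and & without escaping.
        f"  <description><![CDATA[\n{problem['description']}\n  ]]></description>\n"
        f"  <starter_code>{escape(problem['starter_code'])}</starter_code>\n"
        f"  <output_format>{INSTRUCTION}</output_format>\n"
        "</task>"
    )


def build_natural_prompt(problem: dict) -> str:
    """Render the same problem as the natural language prompt shown above."""
    return (
        "Solve the following coding problem.\n\n"
        f"Problem ID: {problem['id']}\n"
        f"Title: {problem['title']}\n\n"
        f"{problem['description']}\n\n"
        f"Starter code:\n{problem['starter_code']}\n\n"
        f"{INSTRUCTION}"
    )


example = {
    "id": "abc123",
    "title": "Two Sum",
    "description": (
        "Given an array of integers nums and an integer target,\n"
        "return indices of the two numbers that add up to target."
    ),
    "starter_code": "def two_sum(nums, target):",
}

# Parsing the XML prompt is a cheap well-formedness check.
root = ET.fromstring(build_xml_prompt(example))
print(root.find("title").text)
print(build_natural_prompt(example))
```

One caveat with this sketch: a description containing the literal sequence `]]>` would end the CDATA section early and would need special handling.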

Models Tested

Model	Provider	Method	Notes
Claude Sonnet 4	Anthropic	CLI	Latest Sonnet model
Gemini 2.0 Flash	Google	REST API	Fast inference, cost-effective

Parameters

100 problems × 2 models × 2 formats = 400 total runs
Timeout: 3 minutes per problem
Temperature: 0 (greedy decoding; outputs are near-deterministic, though hosted APIs can still vary slightly between calls)
Evaluation: Python subprocess execution against public test cases
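
The run grid described by these parameters can be enumerated directly. This is an illustrative sketch only: the model labels and placeholder problem IDs are assumptions, not the identifiers used in the experiment.

```python
from itertools import product

MODELS = ["claude-sonnet-4", "gemini-2.0-flash"]  # illustrative labels
FORMATS = ["natural", "xml"]
PROBLEM_IDS = [f"problem_{i:03d}" for i in range(100)]  # placeholder IDs

# One run per (problem, model, format) combination.
runs = [
    {
        "problem_id": pid,
        "model": model,
        "format": fmt,
        "temperature": 0,
        "timeout_s": 180,  # 3 minutes per problem
    }
    for pid, model, fmt in product(PROBLEM_IDS, MODELS, FORMATS)
]

print(len(runs))  # 400
```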

Results

Summary Table

Model	Format	Runs	Avg Output Tokens	Avg Latency	Cost	Pass Rate
Claude	Natural	100	391	13.6s	$0.73	45.5%
Claude	XML	100	444	15.3s	$0.82	47.7%
Gemini	Natural	100	454	37.8s	$0.02	33.3%
Gemini	XML	100	524	73.4s	$0.02	29.7%

Total Cost: about $1.59 (Claude: $1.55, Gemini: $0.04)

The Headline Numbers

Claude overall: 46.6% pass rate
Gemini overall: 31.5% pass rate
Model gap: 15.1 percentage points
Format gap: 2-4 percentage points (within noise)

What Novel Problems Reveal

Finding 1: Prompt Format Is Surface-Level Optimization

On problems these models have never seen, XML formatting provided:

Claude: +2.2 percentage points (47.7% vs 45.5%)
Gemini: -3.6 percentage points (29.7% vs 33.3%)

Neither difference is statistically significant at this sample size. More importantly, the direction isn't even consistent: XML helped Claude marginally but hurt Gemini marginally.
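
One way to check the significance claim is a two-proportion z-test on the reported pass rates. The sketch below assumes 100 independent runs per model and format, taken from the Runs column; the function name is hypothetical, and this is not necessarily the test used in the experiment.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(p1: float, n1: int, p2: float, n2: int):
    """Two-sided z-test for the difference of two proportions (pooled SE)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# (model, XML pass rate, natural pass rate) from the summary table.
for model, xml_rate, natural_rate in [
    ("Claude", 0.477, 0.455),
    ("Gemini", 0.297, 0.333),
]:
    z, p = two_proportion_z(xml_rate, 100, natural_rate, 100)
    print(f"{model}: z = {z:+.2f}, p = {p:.2f}")
```

Under these assumptions both p-values come out well above 0.05 (z is roughly +0.3 for Claude and -0.5 for Gemini). Because both formats were run on the same problems, a paired test such as McNemar's would be the more appropriate choice, but it requires per-problem outcomes rather than the aggregate rates reported here.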

Interpretation: When models must reason from scratch on novel problems, how you format the question matters far less than whether the model can solve the question.

Finding 2: The Reasoning Bottleneck

Both models failed on over half the problems:

Claude failed 53.4% of problems
Gemini failed 68.5% of problems

These aren't trivial problems—they're competitive programming challenges requiring algorithm selection, edge case handling, and correct implementation. The pass rates reveal the actual reasoning ceiling of these models on genuinely novel tasks.

No amount of XML formatting can help a model that doesn't know the right algorithm.

Finding 3: Model Selection Dominates

The gap between models (15.1 percentage points) is roughly 4x to 7x larger than the prompt format effects (3.6 points for Gemini, 2.2 points for Claude). If you're optimizing for code generation accuracy:

Switching from Gemini to Claude: +48% relative improvement
Switching from Natural to XML on Claude: +4.8% relative improvement

In relative terms, model selection had about ten times the impact of the best-case format change.
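
The comparison between the model gap and the format gaps can be reproduced from the summary table. This sketch assumes each model's overall pass rate is the equal-weight mean of its two format pass rates, which matches the headline numbers.

```python
# Pass rates (%) from the summary table.
PASS_RATE = {
    ("Claude", "natural"): 45.5,
    ("Claude", "xml"): 47.7,
    ("Gemini", "natural"): 33.3,
    ("Gemini", "xml"): 29.7,
}
MODELS = ("Claude", "Gemini")


def overall(model: str) -> float:
    """Equal-weight mean of the two format pass rates for one model."""
    return (PASS_RATE[(model, "natural")] + PASS_RATE[(model, "xml")]) / 2


# Absolute gaps, in percentage points.
model_gap = overall("Claude") - overall("Gemini")
format_gap = {m: PASS_RATE[(m, "xml")] - PASS_RATE[(m, "natural")] for m in MODELS}

# How many times larger the model gap is than each format gap.
ratio = {m: model_gap / abs(gap) for m, gap in format_gap.items()}

# Relative improvements, as fractions.
relative_model = overall("Claude") / overall("Gemini") - 1
relative_format_claude = PASS_RATE[("Claude", "xml")] / PASS_RATE[("Claude", "natural")] - 1

print(f"model gap: {model_gap:.1f} points")
for m in MODELS:
    print(f"{m}: format gap {format_gap[m]:+.1f} points, model gap is {ratio[m]:.1f}x larger")
print(f"relative: model {relative_model:.0%}, format (Claude) {relative_format_claude:.1%}")
```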

Finding 4: XML Increases Overhead Without Benefit

XML prompts consistently produced longer outputs:

Claude: +53 tokens (+13.6%)
Gemini: +70 tokens (+15.4%)

This means higher costs and longer latencies. Gemini's average latency nearly doubled with XML (37.8s → 73.4s), which may reflect additional processing on structured inputs, and it came without accuracy gains.

Why Doesn't XML Help on Novel Problems?

Hypothesis 1: No Patterns to Trigger

On familiar problems, structured formatting might help activate relevant training patterns. But with novel problems, there are no cached solutions to retrieve. The model must derive the answer from first principles, and XML tags don't improve first-principles reasoning.

Hypothesis 2: The Bottleneck Is Algorithmic

Competitive programming problems require:

Recognizing the problem class (DP, greedy, graph traversal, etc.)
Selecting the appropriate algorithm
Handling edge cases correctly
Implementing without bugs

Prompt formatting addresses none of these. A well-formatted prompt can't teach a model an algorithm it doesn't understand.

Hypothesis 3: Modern LLMs Are Format-Agnostic

Contemporary models are trained on massive corpora containing both structured and unstructured text. They've learned to extract semantic meaning regardless of syntactic presentation. The "structure" in XML may be redundant to a model that already understands the underlying concepts.

Hypothesis 4: Early Sensitivity Has Diminished

Early LLMs (GPT-2 era) were highly sensitive to prompt phrasing. Modern instruction-tuned models are far more robust to format variations. The prompt engineering techniques that worked in 2022 may have diminishing returns with current models.

The Economics Argument

Even if XML provided a small accuracy boost, the economics work against it:

Metric	Natural	XML	Difference
Avg Input Tokens	~800	~950	+19%
Avg Output Tokens	~420	~480	+14%
Cost per Problem	$0.0037	$0.0042	+14%

At scale:

10,000 problems/day × $0.0005 extra per problem ≈ $5/day (+14%) at the blended costs above
With no meaningful accuracy improvement
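
To see how the overhead scales with volume, the per-problem costs from the table can be plugged into a small helper. The function name is hypothetical, and the costs are the blended per-problem figures above, so treat the output as an order-of-magnitude estimate.

```python
def xml_overhead_per_day(
    problems_per_day: int,
    cost_natural: float = 0.0037,  # blended cost per problem, natural format
    cost_xml: float = 0.0042,      # blended cost per problem, XML format
) -> float:
    """Extra daily spend from using XML prompts instead of natural language."""
    return problems_per_day * (cost_xml - cost_natural)


for volume in (10_000, 100_000, 1_000_000):
    extra = xml_overhead_per_day(volume)
    print(f"{volume:>9,} problems/day: +${extra:,.2f}/day, +${extra * 365:,.0f}/year")
```

The absolute figure depends heavily on which model is used, since Claude's per-problem cost in this experiment was far higher than Gemini's.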

Implications for Vibe Coding

This experiment reinforces a core principle: substance over style.

The vibe coding movement embraces rapid iteration with LLMs, but it's easy to get distracted by prompt engineering folklore. "Use XML!" "Add role-play!" "Include examples!" These techniques may have their place, but our data suggests diminishing returns for straightforward code generation on novel problems.

What Actually Moves the Needle

Model selection: Claude's pass rate was 48% higher than Gemini's in relative terms (15.1 percentage points)
Problem decomposition: Break complex tasks into verifiable steps
Verification infrastructure: Automated tests catch more errors than prompt tweaks
Context quality: Clear problem descriptions matter more than formatting

The Real Lesson

When facing genuinely novel problems—the kind that matter in production—LLMs succeed or fail based on their reasoning capabilities, not your XML schemas. Invest your optimization effort accordingly.

Recommendations

1. Choose Your Model Carefully

Model selection had roughly 4x to 7x the impact of prompt formatting in absolute pass-rate terms. Benchmark different models on your specific use case before investing in prompt engineering.

2. Use Natural Language by Default

Since accuracy was statistically indistinguishable between formats, prefer the one that's easier to write, read, and maintain. Your team's productivity matters.

3. Reserve XML for Structured Outputs

XML can still provide value when you need clear structure in the prompt or the response, although we did not test these cases:

Multi-part responses with clear delimiters
Few-shot examples with consistent formatting
Function calling with explicit parameter schemas

But for straightforward code generation? Plain English works equally well.

4. Invest in Verification

Given pass rates of 46.6% and 31.5%, the real opportunity isn't prompt optimization—it's building robust verification pipelines:

Automated test execution
Type checking
Multi-sample consensus
Human review for critical code

With more than half of all runs failing regardless of format, effort spent on verification infrastructure is likely to yield higher returns than further prompt refinement.
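
As an illustration of the multi-sample consensus idea listed above, the sketch below runs several candidate solutions on the same inputs and selects one whose outputs match the most common result. It is a hypothetical helper, not part of the experiment: candidates are plain Python callables here, whereas a real pipeline would execute generated code in a sandboxed subprocess.

```python
from collections import Counter


def consensus_pick(candidates, test_inputs):
    """Return (index, votes) for a candidate matching the majority output signature."""
    signatures = []
    for fn in candidates:
        outputs = []
        for args in test_inputs:
            try:
                outputs.append(repr(fn(*args)))
            except Exception as exc:  # a crash counts as its own distinct output
                outputs.append(f"error:{type(exc).__name__}")
        signatures.append(tuple(outputs))
    winner, votes = Counter(signatures).most_common(1)[0]
    return signatures.index(winner), votes


# Three sampled "solutions" for sum-of-squares; the third is buggy.
candidates = [
    lambda xs: sum(x * x for x in xs),
    lambda xs: sum(x ** 2 for x in xs),
    lambda xs: sum(xs) ** 2,
]
test_inputs = [([1, 2, 3],), ([0],), ([2, 2],)]

index, votes = consensus_pick(candidates, test_inputs)
print(index, votes)  # 0 2
```

Agreement between samples is not proof of correctness, since several samples can share the same mistake, so this complements test execution rather than replacing it.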

Methodology Notes

Evaluation Pipeline

Code extraction: Parse markdown code blocks from model response
File creation: Write solution + test harness to temp file
Subprocess execution: Run with 10-second timeout and memory limits
Result comparison: Stdout vs expected output
Classification: Pass/fail/error/timeout
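
The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, solutions are assumed to read stdin and write stdout, and the memory limits mentioned above are omitted because they are platform-specific.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this example readable
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block, or the raw response if none is found."""
    match = CODE_BLOCK.search(response)
    return match.group(1) if match else response


def evaluate(response: str, stdin_data: str, expected: str, timeout_s: float = 10.0) -> str:
    """Run the extracted solution and classify it as pass, fail, error, or timeout."""
    code = extract_code(response)
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    if proc.returncode != 0:
        return "error"
    return "pass" if proc.stdout.strip() == expected.strip() else "fail"


response = FENCE + "python\nprint(int(input()) * 2)\n" + FENCE
print(evaluate(response, "21\n", "42"))
```

Running untrusted generated code this way is only reasonable inside an isolated environment such as a container or a restricted user account.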

Reproducibility

Problems cached from HuggingFace LiveCodeBench dataset
Near-deterministic generation (temperature=0)
All code available at github.com/nsameerd/livecodebench-prompting

Limitations

1. Single Problem Domain

LiveCodeBench focuses on competitive programming. Results may differ for:

Web development tasks
Data transformation
API integration
Documentation generation

2. Binary Evaluation

We measured pass/fail on test cases. This misses:

Code quality metrics
Partial correctness
Style and maintainability

3. Sample Size

100 problems × 2 formats = 200 runs per model. Larger samples might reveal statistically significant differences, but unless the true effects are larger than the 2-4 point gaps observed here, they would remain small in practical terms.
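
A rough power calculation shows how much larger the sample would need to be. The sketch below uses the standard normal-approximation formula for comparing two independent proportions and assumes the observed pass rates are the true ones; the function name, the 5% significance level, and the 80% power target are choices made for this illustration.

```python
from math import ceil
from statistics import NormalDist


def runs_per_format(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate runs needed per format to detect p1 vs p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


print("Claude:", runs_per_format(0.455, 0.477))
print("Gemini:", runs_per_format(0.333, 0.297))
```

Under these assumptions, confirming Claude's 2.2 point gap would take on the order of 8,000 problems per format, and Gemini's 3.6 point gap roughly 2,600, compared with the 100 used here. A paired analysis on shared problems could reduce these numbers.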

4. Model Versions

Results reflect Claude Sonnet 4 and Gemini 2.0 Flash as of the experiment date.

Conclusion

After 400 runs on problems these models have never seen, the verdict is clear: XML-structured prompts provide no meaningful advantage over natural language for code generation on novel problems.

This finding is particularly credible because we eliminated memorization as a confounding variable. These models couldn't pattern-match against training data—they had to reason from scratch. And when they did, prompt formatting was irrelevant to their success.

The implications:

If you're spending hours crafting XML schemas, that time is likely wasted
Model selection matters far more than prompt formatting
Verification infrastructure beats prompt engineering for reliability

The future of AI-assisted development isn't about finding the perfect prompt incantation. It's about understanding what LLMs can and cannot do on genuinely novel problems, and building systems that verify, validate, and iterate accordingly.

FAQ

Q: Should I stop using XML in my prompts?

A: Not necessarily. If your team has established conventions, keep using them. The point is that changing your format won't improve accuracy on novel problems.

Q: Why did Gemini perform worse than Claude?

A: Gemini 2.0 Flash is optimized for speed and cost efficiency. Claude Sonnet 4 showed stronger reasoning on these problems at a much higher price. Notably, Claude also had the lower average latency in our runs (13.6-15.3s vs 37.8-73.4s for Gemini).

Q: Would results differ with GPT-4?

A: Possibly. Each model has different capabilities. However, the principle—that novel problems reveal reasoning limits regardless of formatting—likely applies broadly.

Q: Does this apply to non-code tasks?

A: Our experiment only tested code generation. Results may differ for structured data extraction or multi-part outputs where delimiters provide clearer boundaries.

Cost Breakdown

Model	Input Cost/1M	Output Cost/1M	Total Input Tokens	Total Output Tokens	Total Cost
Claude	$3.00	$15.00	~180,000	~83,500	$1.55
Gemini	$0.075	$0.30	~195,000	~97,800	$0.04

Gemini is about 39x cheaper, but Claude's 48% relative pass-rate advantage may justify the premium depending on your use case.
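
The cost formula behind this table is simply tokens times the per-million price, summed over input and output. The sketch below applies it to the approximate token totals listed above; the dictionary and function names are illustrative.

```python
# (input price, output price) in USD per 1M tokens, from the table above.
PRICE_PER_MILLION = {"Claude": (3.00, 15.00), "Gemini": (0.075, 0.30)}
# (total input tokens, total output tokens); the table marks these as approximate.
TOKENS = {"Claude": (180_000, 83_500), "Gemini": (195_000, 97_800)}


def total_cost(model: str) -> float:
    """Total USD cost for one model: tokens times price, divided by one million."""
    input_price, output_price = PRICE_PER_MILLION[model]
    input_tokens, output_tokens = TOKENS[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


for model in TOKENS:
    print(f"{model}: ${total_cost(model):.2f}")
```

With these rounded inputs the formula gives about $1.79 for Claude and $0.04 for Gemini, so the approximate token totals do not reproduce the reported $1.55 for Claude exactly.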