TL;DR
We tested XML-structured versus natural language prompts using problems that post-date model training cutoffs, meaning these models couldn't have memorized the solutions. Across 400 runs with Claude Sonnet 4 and Gemini 2.0 Flash, prompt format had negligible impact: pass rates moved by roughly 2 to 4 percentage points, in opposite directions for the two models. The real finding: when models face genuinely novel problems, their reasoning capabilities, not prompt formatting, determine success.
Why This Experiment Is Different
Most prompt engineering studies have a hidden flaw: they use problems that may exist in training data. When an LLM "solves" a well-known coding problem, it might be retrieving a memorized pattern rather than reasoning from scratch. This makes it impossible to know whether XML formatting helped the model think better or simply recall better.
We designed this experiment to eliminate that ambiguity.
The Novel Dataset Advantage
We used LiveCodeBench, a benchmark with a critical property: all problems are sourced from competitive programming contests held after the training cutoff dates of major LLMs.
This means:
- Zero memorization possible: Claude and Gemini cannot have seen these exact problems during training
- Pure reasoning tested: Every solution requires genuine algorithmic thinking
- Fair comparison: Neither model has an unfair advantage from data contamination
- Credible conclusions: Any performance difference reflects actual capability, not recall
When a model fails on these problems, it's not because it "forgot"—it's because it genuinely couldn't solve them. When it succeeds, it's demonstrating real reasoning ability.
The Core Question
With memorization eliminated, we can ask the real question:
Does XML-structured prompting help LLMs reason better on novel problems, or is it just formatting theater?
The answer has significant implications. If XML helps models think more clearly, it's worth the overhead. If it doesn't, the prompt engineering community has been optimizing the wrong variable.
Experiment Design
Prompt Formats
XML Format:
<task>
<problem_id>abc123</problem_id>
<title>Two Sum</title>
<description><![CDATA[
Given an array of integers nums and an integer target,
return indices of the two numbers that add up to target.
]]></description>
<starter_code>def two_sum(nums, target):</starter_code>
<output_format>Return only the solution code, no explanations.</output_format>
</task>
Natural Language Format:
Solve the following coding problem.
Problem ID: abc123
Title: Two Sum
Given an array of integers nums and an integer target,
return indices of the two numbers that add up to target.
Starter code:
def two_sum(nums, target):
Return only the solution code, no explanations.
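To make the comparison concrete, here is a minimal sketch of how both prompt variants could be rendered from the same problem record. The dictionary keys (`id`, `title`, `description`, `starter_code`) and the function names are illustrative assumptions, not the names used in the experiment's code.

```python
from xml.sax.saxutils import escape

# Hypothetical problem record; the field names are assumptions for illustration.
EXAMPLE = {
    "id": "abc123",
    "title": "Two Sum",
    "description": (
        "Given an array of integers nums and an integer target,\n"
        "return indices of the two numbers that add up to target."
    ),
    "starter_code": "def two_sum(nums, target):",
}

OUTPUT_RULE = "Return only the solution code, no explanations."


def build_xml_prompt(problem: dict) -> str:
    """Render the problem in the XML-structured format."""
    return (
        "<task>\n"
        f"<problem_id>{escape(problem['id'])}</problem_id>\n"
        f"<title>{escape(problem['title'])}</title>\n"
        # CDATA keeps characters like < and & in the description literal.
        f"<description><![CDATA[\n{problem['description']}\n]]></description>\n"
        f"<starter_code>{escape(problem['starter_code'])}</starter_code>\n"
        f"<output_format>{OUTPUT_RULE}</output_format>\n"
        "</task>"
    )


def build_natural_prompt(problem: dict) -> str:
    """Render the same problem in the natural language format."""
    return (
        "Solve the following coding problem.\n"
        f"Problem ID: {problem['id']}\n"
        f"Title: {problem['title']}\n"
        f"{problem['description']}\n"
        "Starter code:\n"
        f"{problem['starter_code']}\n"
        f"{OUTPUT_RULE}"
    )


if __name__ == "__main__":
    print(build_xml_prompt(EXAMPLE))
    print()
    print(build_natural_prompt(EXAMPLE))
```

Both builders carry identical information, so any difference in pass rate can be attributed to the wrapper rather than the content.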
Models Tested
| Model | Provider | Method | Notes |
|---|---|---|---|
| Claude Sonnet 4 | Anthropic | CLI | Latest Sonnet model |
| Gemini 2.0 Flash | Google | REST API | Fast inference, cost-effective |
Parameters
- 100 problems × 2 models × 2 formats = 400 total runs
- Timeout: 3 minutes per problem
- Temperature: 0 (to minimize run-to-run variation; hosted models are not guaranteed to be fully deterministic even at 0)
- Evaluation: Python subprocess execution against public test cases
Results
Summary Table
| Model | Format | Runs | Avg Output Tokens | Avg Latency | Cost | Pass Rate |
|---|---|---|---|---|---|---|
| Claude | Natural | 100 | 391 | 13.6s | $0.73 | 45.5% |
| Claude | XML | 100 | 444 | 15.3s | $0.82 | 47.7% |
| Gemini | Natural | 100 | 454 | 37.8s | $0.02 | 33.3% |
| Gemini | XML | 100 | 524 | 73.4s | $0.02 | 29.7% |
Total Cost: $1.58 (Claude: $1.55, Gemini: $0.04; per-model figures are rounded)
The Headline Numbers
- Claude overall: 46.6% pass rate
- Gemini overall: 31.5% pass rate
- Model gap: 15.1 percentage points
- Format gap: 2-4 percentage points (within noise)
What Novel Problems Reveal
Finding 1: Prompt Format Is Surface-Level Optimization
On problems these models have never seen, XML formatting provided:
- Claude: +2.2 percentage points (47.7% vs 45.5%)
- Gemini: -3.6 percentage points (29.7% vs 33.3%)
These differences are not statistically significant at this sample size. More importantly, the direction isn't even consistent: XML helped Claude marginally but hurt Gemini marginally.
Interpretation: When models must reason from scratch on novel problems, how you format the question matters far less than whether the model can solve the question.
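The significance claim above can be sanity-checked with a two-proportion z-test. The sketch below is not the analysis run in the experiment. It assumes exactly 100 scored runs per format (the reported rates suggest the real denominators may differ slightly) and treats the two formats as independent samples, even though they share the same problems. A paired test such as McNemar's would be more appropriate but requires per-problem results.

```python
import math


def two_proportion_z_test(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both pass rates are equal."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Pass rates from the summary table; n=100 per format is an assumption.
claude_z, claude_p = two_proportion_z_test(0.477, 100, 0.455, 100)
gemini_z, gemini_p = two_proportion_z_test(0.297, 100, 0.333, 100)

print(f"Claude XML vs natural: z={claude_z:.2f}, p={claude_p:.2f}")
print(f"Gemini XML vs natural: z={gemini_z:.2f}, p={gemini_p:.2f}")
```

Under these assumptions both p-values land far above the conventional 0.05 threshold, which is consistent with treating the format gap as noise.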
Finding 2: The Reasoning Bottleneck
Both models failed on over half the problems:
- Claude failed 53.4% of problems
- Gemini failed 68.5% of problems
These aren't trivial problems—they're competitive programming challenges requiring algorithm selection, edge case handling, and correct implementation. The pass rates reveal the actual reasoning ceiling of these models on genuinely novel tasks.
No amount of XML formatting can help a model that doesn't know the right algorithm.
Finding 3: Model Selection Dominates
The gap between models (15.1 percentage points) is roughly 4 to 7 times larger than either prompt format effect (3.6 points for Gemini, 2.2 for Claude). If you're optimizing for code generation accuracy:
- Switching from Gemini to Claude: +48% relative improvement
- Switching from Natural to XML on Claude: +4.8% relative improvement
In relative terms, switching models had about ten times the impact of switching formats on Claude.
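The absolute and relative figures in this finding are easy to mix up, so here is the arithmetic spelled out. It uses only the pass rates from the summary table; the variable names are ours.

```python
# Pass rates (%) from the summary table.
rates = {
    ("claude", "natural"): 45.5,
    ("claude", "xml"): 47.7,
    ("gemini", "natural"): 33.3,
    ("gemini", "xml"): 29.7,
}

claude_overall = (rates[("claude", "natural")] + rates[("claude", "xml")]) / 2
gemini_overall = (rates[("gemini", "natural")] + rates[("gemini", "xml")]) / 2

# Absolute gaps, in percentage points.
model_gap = claude_overall - gemini_overall
format_gap = {
    model: rates[(model, "xml")] - rates[(model, "natural")]
    for model in ("claude", "gemini")
}

# How many times larger the model gap is than each format gap.
gap_ratio = {model: model_gap / abs(gap) for model, gap in format_gap.items()}

# Relative improvements, in percent of the baseline pass rate.
relative_model_gain = 100 * model_gap / gemini_overall
relative_xml_gain_claude = 100 * format_gap["claude"] / rates[("claude", "natural")]

print(f"Model gap: {model_gap:.1f} points")
print(f"Format gaps: {format_gap}")
print(f"Model gap / format gap: {gap_ratio}")
print(f"Gemini -> Claude: +{relative_model_gain:.1f}% relative")
print(f"Natural -> XML on Claude: +{relative_xml_gain_claude:.1f}% relative")
```

Percentage points measure the raw difference between two pass rates, while relative improvement divides that difference by the baseline, which is why 15.1 points becomes roughly 48%.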
Finding 4: XML Increases Overhead Without Benefit
XML prompts consistently generated more tokens:
- Claude: +53 tokens (+13.5%)
- Gemini: +70 tokens (+15.4%)
This means higher costs and longer latencies. Gemini's average latency with XML nearly doubled (37.8s → 73.4s), far more than its 15.4% increase in output tokens, which may point to additional processing on structured inputs. None of this overhead bought any accuracy.
Why Doesn't XML Help on Novel Problems?
Hypothesis 1: No Patterns to Trigger
On familiar problems, structured formatting might help activate relevant training patterns. But with novel problems, there are no cached solutions to retrieve. The model must derive the answer from first principles, and XML tags don't improve first-principles reasoning.
Hypothesis 2: The Bottleneck Is Algorithmic
Competitive programming problems require:
- Recognizing the problem class (DP, greedy, graph traversal, etc.)
- Selecting the appropriate algorithm
- Handling edge cases correctly
- Implementing without bugs
Prompt formatting addresses none of these. A well-formatted prompt can't teach a model an algorithm it doesn't understand.
Hypothesis 3: Modern LLMs Are Format-Agnostic
Contemporary models are trained on massive corpora containing both structured and unstructured text. They've learned to extract semantic meaning regardless of syntactic presentation. The "structure" in XML may be redundant to a model that already understands the underlying concepts.
Hypothesis 4: Early Sensitivity Has Diminished
Early LLMs (GPT-2 era) were highly sensitive to prompt phrasing. Modern instruction-tuned models are robust to format variations. The prompt engineering techniques that worked in 2022 may have diminishing returns in 2024.
The Economics Argument
Even if XML provided a small accuracy boost, the economics work against it:
| Metric | Natural | XML | Difference |
|---|---|---|---|
| Avg Input Tokens | ~800 | ~950 | +19% |
| Avg Output Tokens | ~420 | ~480 | +14% |
| Cost per Problem | $0.0037 | $0.0042 | +14% |
At scale:
- 10,000 problems/day × 14% overhead ≈ $5 extra per day at the per-problem costs above ($37 vs. $42), growing linearly with volume
- With no meaningful accuracy improvement
Implications for Vibe Coding
This experiment reinforces a core principle: substance over style.
The vibe coding movement embraces rapid iteration with LLMs, but it's easy to get distracted by prompt engineering folklore. "Use XML!" "Add role-play!" "Include examples!" These techniques may have their place, but our data suggests diminishing returns for straightforward code generation on novel problems.
What Actually Moves the Needle
- Model selection: Claude's pass rate was 15.1 percentage points higher than Gemini's, a 48% relative improvement
- Problem decomposition: Break complex tasks into verifiable steps
- Verification infrastructure: Automated tests catch more errors than prompt tweaks
- Context quality: Clear problem descriptions matter more than formatting
The Real Lesson
When facing genuinely novel problems—the kind that matter in production—LLMs succeed or fail based on their reasoning capabilities, not your XML schemas. Invest your optimization effort accordingly.
Recommendations
1. Choose Your Model Carefully
In our runs, model selection had roughly 4 to 7 times the impact of prompt formatting. Benchmark different models on your specific use case before investing in prompt engineering.
2. Use Natural Language by Default
Since accuracy is equivalent, prefer the format that's easier to write, read, and maintain. Your team's productivity matters.
3. Reserve XML for Structured Outputs
XML does provide value when you need the model to produce structured data:
- Multi-part responses with clear delimiters
- Few-shot examples with consistent formatting
- Function calling with explicit parameter schemas
But for straightforward code generation? Plain English works equally well.
4. Invest in Verification
Given pass rates of 46.6% and 31.5%, the real opportunity isn't prompt optimization—it's building robust verification pipelines:
- Automated test execution
- Type checking
- Multi-sample consensus
- Human review for critical code
Our results suggest that a dollar spent on verification infrastructure will go further than a dollar spent on prompt refinement.
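Of the techniques listed above, multi-sample consensus is the least self-explanatory, so here is a minimal sketch of one way to do it. It was not part of this experiment. Plain Python functions stand in for independently sampled model solutions, and `consensus_pick` and `probe_inputs` are hypothetical names. The idea is to run every sample on the same inputs, group samples whose outputs agree, and keep one from the largest group.

```python
from collections import defaultdict
from typing import Any, Callable, Optional, Sequence


def consensus_pick(
    candidates: Sequence[Callable[[Any], Any]],
    probe_inputs: Sequence[Any],
) -> Optional[Callable[[Any], Any]]:
    """Return a candidate from the largest group of behaviorally identical samples."""
    groups: dict[tuple, list] = defaultdict(list)
    for candidate in candidates:
        try:
            # The tuple of outputs acts as a behavioral fingerprint.
            fingerprint = tuple(repr(candidate(x)) for x in probe_inputs)
        except Exception:
            continue  # Discard samples that crash on any probe input.
        groups[fingerprint].append(candidate)
    if not groups:
        return None
    return max(groups.values(), key=len)[0]


# Three stand-in "samples" for an absolute-value task; one is buggy.
def sample_a(x): return x if x >= 0 else -x
def sample_b(x): return abs(x)
def sample_c(x): return x  # Bug: wrong for negative inputs.

best = consensus_pick([sample_a, sample_b, sample_c], probe_inputs=[-3, 0, 7])
print(best(-3))
```

Agreement is only a proxy for correctness: if most samples share the same mistake, the vote will pick it, so consensus complements test execution rather than replacing it.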
Methodology Notes
Evaluation Pipeline
- Code extraction: Parse markdown code blocks from model response
- File creation: Write solution + test harness to temp file
- Subprocess execution: Run with 10-second timeout and memory limits
- Result comparison: Stdout vs expected output
- Classification: Pass/fail/error/timeout
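The steps above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the harness from the repository. It assumes stdin/stdout-style problems, writes only the solution to the temp file (no separate test harness), and leaves out the memory limits. The names `extract_code` and `run_solution` are hypothetical.

```python
import os
import re
import subprocess
import sys
import tempfile

FENCE = "`" * 3  # Markdown code fence, built programmatically.
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block, or the raw response if none is found."""
    match = CODE_BLOCK.search(response)
    return match.group(1) if match else response


def run_solution(code: str, stdin_text: str, expected: str, timeout_s: float = 10.0) -> str:
    """Execute the code in a subprocess and classify it as pass/fail/error/timeout."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "solution.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(code)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    if proc.returncode != 0:
        return "error"
    # Compare stdout with the expected output, ignoring surrounding whitespace.
    return "pass" if proc.stdout.strip() == expected.strip() else "fail"


if __name__ == "__main__":
    reply = "Solution:\n" + FENCE + "python\nprint(sum(map(int, input().split())))\n" + FENCE
    print(run_solution(extract_code(reply), "2 3\n", "5"))
```

Running each solution in its own subprocess keeps a crash or infinite loop in generated code from taking down the evaluator, and the timeout turns a hang into a classifiable result.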
Reproducibility
- Problems cached from HuggingFace LiveCodeBench dataset
- Near-deterministic generation (temperature=0)
- All code available at github.com/nsameerd/livecodebench-prompting
Limitations
1. Single Problem Domain
LiveCodeBench focuses on competitive programming. Results may differ for:
- Web development tasks
- Data transformation
- API integration
- Documentation generation
2. Binary Evaluation
We measured pass/fail on test cases. This misses:
- Code quality metrics
- Partial correctness
- Style and maintainability
3. Sample Size
100 problems × 2 formats = 200 runs per model. Larger samples might reveal statistically significant differences, though the effect sizes would likely remain small.
4. Model Versions
Results reflect Claude Sonnet 4 and Gemini 2.0 Flash as of the experiment date.
Conclusion
After 400 runs on problems these models have never seen, the verdict is clear: XML-structured prompts provide no meaningful advantage over natural language for code generation on novel problems.
This finding is particularly credible because we eliminated memorization as a confounding variable. These models couldn't pattern-match against training data—they had to reason from scratch. And when they did, prompt formatting was irrelevant to their success.
The implications:
- If you're spending hours crafting XML schemas, that time is likely wasted
- Model selection matters far more than prompt formatting
- Verification infrastructure beats prompt engineering for reliability
The future of AI-assisted development isn't about finding the perfect prompt incantation. It's about understanding what LLMs can and cannot do on genuinely novel problems, and building systems that verify, validate, and iterate accordingly.
FAQ
Q: Should I stop using XML in my prompts?
A: Not necessarily. If your team has established conventions, keep using them. The point is that changing your format won't improve accuracy on novel problems.
Q: Why did Gemini perform worse than Claude?
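A: Gemini 2.0 Flash is optimized for speed and cost efficiency. Claude Sonnet 4 showed stronger reasoning on these problems at a much higher price. In our runs Claude also had the lower average latency (13.6-15.3s vs. 37.8-73.4s for Gemini), though the two models were called through different interfaces (CLI vs. REST API).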
Q: Would results differ with GPT-4?
A: Possibly. Each model has different capabilities. However, the principle—that novel problems reveal reasoning limits regardless of formatting—likely applies broadly.
Q: Does this apply to non-code tasks?
A: Our experiment only tested code generation. Results may differ for structured data extraction or multi-part outputs where delimiters provide clearer boundaries.
Cost Breakdown
| Model | Input Cost/1M | Output Cost/1M | Total Input Tokens | Total Output Tokens | Total Cost |
|---|---|---|---|---|---|
| Claude | $3.00 | $15.00 | ~180,000 | ~83,500 | $1.55 |
| Gemini | $0.075 | $0.30 | ~195,000 | ~97,800 | $0.04 |
Gemini is 39x cheaper, but Claude's 48% relative accuracy advantage (15.1 percentage points) may justify the premium depending on your use case.