TL;DR
We ran 144 experiments comparing XML-structured prompts against natural language prompts using Claude Sonnet 4 and Gemini 2.5 Flash. XML prompts showed 17.3% token reduction and 19% latency improvement on Claude, but only marginal gains on Gemini (~3.6%). The benefits varied dramatically by task category—code generation showed the largest improvements, while reasoning tasks showed minimal difference.
The Experiment
Prompt engineering advice is everywhere, but data is scarce. "Use XML for better structure!" "Natural language is more intuitive!" We decided to test these claims empirically.
What We Tested
| Category | Description | Example Tasks |
|---|---|---|
| Code Generation | Data structures, algorithms, debugging | Trie implementation, binary search, refactoring |
| Mathematics | Calculus, algebra, proofs | Integration, matrix operations, optimization |
| Reasoning | Logic puzzles, deduction | Constraint satisfaction, causal inference |
| Complex Multi-Step | Chained reasoning, system design | API design, workflow optimization |
Each category had 3 difficulty levels (simple, medium, complex) with 3 prompts each, totaling 36 unique prompts.
Models Tested
| Model | Provider | Notes |
|---|---|---|
| Claude Sonnet 4 | Anthropic | Latest Sonnet model via CLI |
| Gemini 2.5 Flash | Google | Fast inference via CLI |
Parameters
- 144 total experiments (pilot phase with medium difficulty)
- 3 runs per prompt to account for stochastic variation
- Both formats tested on semantically equivalent versions of each task
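The 144 figure is just the pilot's combinatorics; a quick sanity check:

```python
# Full prompt set: 4 categories x 3 difficulty levels x 3 prompts each
total_prompts = 4 * 3 * 3              # 36 unique prompts

# The pilot covers medium difficulty only: 4 categories x 3 prompts
pilot_prompts = 4 * 3                  # 12 prompts

# Each prompt runs in 2 formats, on 2 models, 3 times
experiments = pilot_prompts * 2 * 2 * 3   # 144
```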
Prompt Format Comparison
XML Format:
```xml
<task>
  <objective>Implement a Trie with autocomplete</objective>
  <language>Python</language>
  <requirements>
    <requirement>insert(word): Add word to trie</requirement>
    <requirement>search(word): Check if word exists</requirement>
    <requirement>autocomplete(prefix, limit): Return matching words</requirement>
  </requirements>
  <constraints>
    <constraint>Lexicographical ordering</constraint>
    <constraint>Include type hints</constraint>
  </constraints>
  <output_format>Only the class code</output_format>
</task>
```
Natural Language Format:
```
Implement a Trie data structure with autocomplete functionality in Python.
Include insert(word), search(word), and autocomplete(prefix, limit) methods.
Return words in lexicographical order. Use type hints. Return only the class code.
```
Both formats convey identical requirements—only the structure differs.
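One way to guarantee that equivalence is to render both prompts from a single task spec. A minimal sketch of the idea (illustrative, not the actual script from the repo):

```python
def to_xml_prompt(objective, language, requirements, constraints, output_format):
    """Render a task spec as an XML-structured prompt."""
    reqs = "\n".join(f"    <requirement>{r}</requirement>" for r in requirements)
    cons = "\n".join(f"    <constraint>{c}</constraint>" for c in constraints)
    return (
        f"<task>\n"
        f"  <objective>{objective}</objective>\n"
        f"  <language>{language}</language>\n"
        f"  <requirements>\n{reqs}\n  </requirements>\n"
        f"  <constraints>\n{cons}\n  </constraints>\n"
        f"  <output_format>{output_format}</output_format>\n"
        f"</task>"
    )

def to_natural_prompt(objective, language, requirements, constraints, output_format):
    """Render the same task spec as a natural language prompt."""
    return (
        f"{objective} in {language}. "
        f"Include {', '.join(requirements)} methods. "
        f"{' '.join(constraints)} {output_format}."
    )
```

Generating both variants from one spec removes a common confound: accidental differences in wording rather than structure.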
Results
Summary by Model
| Model | Format | Avg Output Tokens | Avg Latency |
|---|---|---|---|
| Claude Sonnet 4 | Natural | 1,476 | 30.1s |
| Claude Sonnet 4 | XML | 1,219 | 24.4s |
| Gemini 2.5 Flash | Natural | 1,210 | 53.1s |
| Gemini 2.5 Flash | XML | 1,166 | 51.7s |
The Headline Numbers
Claude Sonnet 4:
- 17.3% fewer tokens with XML prompts
- 19.0% faster responses with XML prompts
Gemini 2.5 Flash:
- 3.6% fewer tokens with XML prompts
- 2.6% faster responses with XML prompts
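These percentages derive directly from the summary table (the tiny gaps versus the headline 17.3%/19.0% come from rounding in the table's averages):

```python
def reduction(natural, xml):
    """Percent improvement of XML over natural language."""
    return round(100 * (natural - xml) / natural, 1)

print(reduction(1476, 1219))  # Claude tokens:  17.4
print(reduction(30.1, 24.4))  # Claude latency: 18.9
print(reduction(1210, 1166))  # Gemini tokens:  3.6
print(reduction(53.1, 51.7))  # Gemini latency: 2.6
```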
Key Findings
Finding 1: XML Benefits Are Model-Dependent
The stark contrast between Claude (17% improvement) and Gemini (3.6% improvement) suggests that XML formatting benefits depend heavily on model architecture and training. Claude appears to have stronger pattern recognition for structured inputs, translating XML structure into more focused outputs.
Hypothesis: Claude's training may include more XML-heavy corpora, making it more responsive to hierarchical structure signals.
Finding 2: Code Generation Shows Largest Gains
Across both models, code generation tasks showed the most significant improvements with XML formatting:
| Category | Claude XML Improvement | Gemini XML Improvement |
|---|---|---|
| Code Generation | ~35% token reduction | ~8% token reduction |
| Mathematics | ~15% token reduction | ~3% token reduction |
| Reasoning | ~5% token reduction | ~1% token reduction |
| Complex Multi-Step | ~12% token reduction | ~4% token reduction |
Why? Code has inherent structure—functions, parameters, return types. XML's hierarchical format mirrors this structure, helping models produce more focused implementations.
Finding 3: Latency Correlates with Token Count
Response latency tracked closely with token output. XML prompts that produced fewer tokens also completed faster. This suggests the efficiency gain isn't from faster processing per token, but from generating fewer unnecessary tokens.
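A quick check on the Claude rows of the summary table supports this:

```python
# Decoding rate (tokens/second) from the Claude rows of the summary table
natural_rate = 1476 / 30.1   # ~49 tokens/s
xml_rate = 1219 / 24.4       # ~50 tokens/s
# The per-token rates are nearly equal; the 19% latency win comes from
# generating fewer tokens, not from decoding them faster.
```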
Finding 4: Diminishing Returns on Reasoning Tasks
Pure reasoning tasks (logic puzzles, deduction) showed minimal XML benefit. These tasks require sequential thinking rather than structured output—the format of the prompt matters less than the clarity of constraints.
Why Does XML Help (When It Does)?
Hypothesis 1: Attention Anchoring
XML tags create explicit attention anchors. When Claude sees <requirements>, it likely increases attention weight on the enclosed content. This may reduce "drift" where models add unnecessary elaboration.
Hypothesis 2: Output Structure Inference
Models may infer that structured input implies structured output expectations. XML prompts might signal "be concise and organized" implicitly, while natural language prompts feel more conversational.
Hypothesis 3: Training Data Patterns
If models were trained on XML-heavy technical documentation (API specs, configuration files), they may have learned associations between XML structure and concise, implementation-focused responses.
Practical Recommendations
1. Use XML for Code Generation
The data strongly supports XML formatting for code tasks. The 17-35% efficiency gain on Claude is substantial—especially at scale.
```xml
<task>
  <objective>Your main goal</objective>
  <language>Target language</language>
  <requirements>
    <requirement>Specific requirement 1</requirement>
    <requirement>Specific requirement 2</requirement>
  </requirements>
  <output_format>What you want returned</output_format>
</task>
```
2. Natural Language for Reasoning
For logic puzzles, analysis, and open-ended reasoning, natural language prompts perform equivalently with less formatting overhead. Don't add structure for structure's sake.
3. Test Your Specific Use Case
These results are pilot data. Your mileage may vary based on:
- Specific task complexity
- Model version (both Claude and Gemini update frequently)
- Temperature and other generation parameters
4. Consider the Model
If you're using Claude, XML formatting offers meaningful efficiency gains. For Gemini, the benefit is marginal—choose based on readability and maintainability preferences.
Methodology Notes
Experiment Setup
- Prompt Creation: 36 prompts created with semantically equivalent XML and natural language versions
- Execution: Each prompt run 3 times per model per format
- Metrics Captured: Output token count, response latency, completion status
- Analysis: Aggregated by model, category, and format
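The aggregation step can be as simple as grouping raw run records by (model, format) and averaging. A minimal sketch, assuming each run was logged as a dict (field names here are illustrative, not the repo's actual schema):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw run records (three of the 144)
runs = [
    {"model": "claude", "format": "xml", "tokens": 1200, "latency": 24.1},
    {"model": "claude", "format": "natural", "tokens": 1480, "latency": 30.5},
    {"model": "claude", "format": "xml", "tokens": 1238, "latency": 24.7},
]

# Group runs by (model, format)
groups = defaultdict(list)
for run in runs:
    groups[(run["model"], run["format"])].append(run)

# Average tokens and latency within each group
summary = {
    key: {
        "avg_tokens": mean(r["tokens"] for r in rs),
        "avg_latency": mean(r["latency"] for r in rs),
    }
    for key, rs in groups.items()
}
```

The same grouping, keyed by (model, category, format), yields the per-category tables above.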
Limitations
- Pilot Phase Only: 144 experiments is enough for directional insight, but too few for statistically significant conclusions
- No Correctness Evaluation: We measured efficiency, not accuracy
- CLI Overhead: Using CLI tools adds variable latency not attributable to the model
- Stochastic Nature: LLMs are inherently variable; results may differ on replication
Reproducibility
All code is available at github.com/nsameerd/xml-vs-natural-prompting. Run the experiment yourself to validate or extend these findings.
Conclusion
XML-structured prompts provide meaningful efficiency gains—but the benefit is neither universal nor guaranteed. Claude Sonnet 4 showed 17% token reduction and 19% latency improvement with XML formatting, while Gemini 2.5 Flash showed only marginal gains (~3.6%).
The largest improvements appeared in code generation tasks, suggesting that XML structure aligns well with code's inherent hierarchy. For reasoning tasks, natural language performed equivalently.
The takeaway: XML prompting is a useful tool, not a silver bullet. Test it for your specific model and use case, and don't assume that structure automatically means better results.
FAQ
Q: Should I always use XML prompts?
A: No. Use XML for code generation and structured output tasks. For open-ended reasoning or creative tasks, natural language is equally effective and easier to write.
Q: Why did Gemini show smaller gains?
A: Unknown. Possible explanations include different training data, attention mechanisms, or instruction-following approaches. More research needed.
Q: Will these results hold for GPT-4 or other models?
A: We only tested Claude and Gemini. Each model may respond differently to formatting. Test your specific model.
Q: Does XML affect response quality?
A: We didn't measure quality in this experiment—only efficiency (tokens and latency). Quality evaluation requires task-specific rubrics.
Cost Considerations
At scale, 17% token reduction translates directly to cost savings:
| Scenario | Natural Language | XML Format | Savings |
|---|---|---|---|
| 10,000 Claude calls | ~14.8M tokens | ~12.2M tokens | ~$39 (output @ $15/M) |
| 100,000 Claude calls | ~148M tokens | ~122M tokens | ~$390 |
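The savings column follows from the per-call averages in the summary table, assuming $15 per million output tokens as noted:

```python
natural_tokens, xml_tokens = 1476, 1219   # avg output tokens per Claude call
price_per_million = 15.0                  # USD per million output tokens

def savings(calls):
    """Dollar savings from the XML format's shorter outputs."""
    saved_tokens = calls * (natural_tokens - xml_tokens)
    return saved_tokens / 1_000_000 * price_per_million

print(savings(10_000))    # ~38.55 -> ~$39
print(savings(100_000))   # ~385.5 -> ~$390
```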
For high-volume applications on Claude, XML formatting pays for itself through reduced token costs.