TL;DR
We ran 144 experiments comparing XML-structured prompts against natural language prompts using Claude Sonnet 4 and Gemini 2.5 Flash. XML prompts showed 17.3% token reduction and 19% latency improvement on Claude, but only marginal gains on Gemini (~3.6%). The benefits varied dramatically by task category—code generation showed the largest improvements, while reasoning tasks showed minimal difference.
The Experiment
Prompt engineering advice is everywhere, but data is scarce. "Use XML for better structure!" "Natural language is more intuitive!" We decided to test these claims empirically.
What We Tested
| Category | Description | Example Tasks |
|---|---|---|
| Code Generation | Data structures, algorithms, debugging | Trie implementation, binary search, refactoring |
| Mathematics | Calculus, algebra, proofs | Integration, matrix operations, optimization |
| Reasoning | Logic puzzles, deduction | Constraint satisfaction, causal inference |
| Complex Multi-Step | Chained reasoning, system design | API design, workflow optimization |
Each category had 3 difficulty levels (simple, medium, complex) with 3 prompts each, totaling 36 unique prompts.
Models Tested
| Model | Provider | Notes |
|---|---|---|
| Claude Sonnet 4 | Anthropic | Latest Sonnet model via CLI |
| Gemini 2.5 Flash | Google | Fast inference via CLI |
Parameters
- 144 total experiments (pilot phase with medium difficulty)
- 3 runs per prompt to account for stochastic variation
- Both formats tested on semantically equivalent versions of each task
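The 144 figure is just the pilot's combinatorics; a quick sanity check:

```python
# Full prompt set: 4 categories x 3 difficulty levels x 3 prompts each
total_prompts = 4 * 3 * 3              # 36 unique prompts

# The pilot covers medium difficulty only: 4 categories x 3 prompts
pilot_prompts = 4 * 3                  # 12 prompts

# Each prompt runs in 2 formats, on 2 models, 3 times
experiments = pilot_prompts * 2 * 2 * 3   # 144
```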
Prompt Format Comparison
XML Format:
```xml
<task>
  <objective>Implement a Trie with autocomplete</objective>
  <language>Python</language>
  <requirements>
    <requirement>insert(word): Add word to trie</requirement>
    <requirement>search(word): Check if word exists</requirement>
    <requirement>autocomplete(prefix, limit): Return matching words</requirement>
  </requirements>
  <constraints>
    <constraint>Lexicographical ordering</constraint>
    <constraint>Include type hints</constraint>
  </constraints>
  <output_format>Only the class code</output_format>
</task>
```
Natural Language Format:
```
Implement a Trie data structure with autocomplete functionality in Python.
Include insert(word), search(word), and autocomplete(prefix, limit) methods.
Return words in lexicographical order. Use type hints. Return only the class code.
```
Both formats convey identical requirements—only the structure differs.
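One way to guarantee that equivalence is to render both prompts from a single task spec. A minimal sketch of the idea (illustrative, not the actual script from the repo):

```python
def to_xml_prompt(objective, language, requirements, constraints, output_format):
    """Render a task spec as an XML-structured prompt."""
    reqs = "\n".join(f"    <requirement>{r}</requirement>" for r in requirements)
    cons = "\n".join(f"    <constraint>{c}</constraint>" for c in constraints)
    return (
        f"<task>\n"
        f"  <objective>{objective}</objective>\n"
        f"  <language>{language}</language>\n"
        f"  <requirements>\n{reqs}\n  </requirements>\n"
        f"  <constraints>\n{cons}\n  </constraints>\n"
        f"  <output_format>{output_format}</output_format>\n"
        f"</task>"
    )

def to_natural_prompt(objective, language, requirements, constraints, output_format):
    """Render the same task spec as a natural language prompt."""
    return (
        f"{objective} in {language}. "
        f"Include {', '.join(requirements)} methods. "
        f"{' '.join(constraints)} {output_format}."
    )
```

Generating both variants from one spec removes a common confound: accidental differences in wording rather than structure.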
Results
Summary by Model
| Model | Format | Avg Output Tokens | Avg Latency |
|---|---|---|---|
| Claude Sonnet 4 | Natural | 1,476 | 30.1s |
| Claude Sonnet 4 | XML | 1,219 | 24.4s |
| Gemini 2.5 Flash | Natural | 1,210 | 53.1s |
| Gemini 2.5 Flash | XML | 1,166 | 51.7s |
The Headline Numbers
Claude Sonnet 4:
- 17.3% fewer tokens with XML prompts
- 19.0% faster responses with XML prompts
Gemini 2.5 Flash:
- 3.6% fewer tokens with XML prompts
- 2.6% faster responses with XML prompts
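These percentages derive directly from the summary table (the tiny gaps versus the headline 17.3%/19.0% come from rounding in the table's averages):

```python
def reduction(natural, xml):
    """Percent improvement of XML over natural language."""
    return round(100 * (natural - xml) / natural, 1)

print(reduction(1476, 1219))  # Claude tokens:  17.4
print(reduction(30.1, 24.4))  # Claude latency: 18.9
print(reduction(1210, 1166))  # Gemini tokens:  3.6
print(reduction(53.1, 51.7))  # Gemini latency: 2.6
```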
Key Findings
Finding 1: XML Benefits Are Model-Dependent
The stark contrast between Claude (17% improvement) and Gemini (3.6% improvement) suggests that XML formatting benefits depend heavily on model architecture and training. Claude appears to have stronger pattern recognition for structured inputs, translating XML structure into more focused outputs.
Hypothesis: Claude's training may include more XML-heavy corpora, making it more responsive to hierarchical structure signals.
Finding 2: Code Generation Shows Largest Gains
Across both models, code generation tasks showed the most significant improvements with XML formatting:
| Category | Claude XML Improvement | Gemini XML Improvement |
|---|---|---|
| Code Generation | ~35% token reduction | ~8% token reduction |
| Mathematics | ~15% token reduction | ~3% token reduction |
| Reasoning | ~5% token reduction | ~1% token reduction |
| Complex Multi-Step | ~12% token reduction | ~4% token reduction |
Why? Code has inherent structure—functions, parameters, return types. XML's hierarchical format mirrors this structure, helping models produce more focused implementations.
Finding 3: Latency Correlates with Token Count
Response latency tracked closely with token output. XML prompts that produced fewer tokens also completed faster. This suggests the efficiency gain isn't from faster processing per token, but from generating fewer unnecessary tokens.
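A quick check on the Claude rows of the summary table supports this:

```python
# Decoding rate (tokens/second) from the Claude rows of the summary table
natural_rate = 1476 / 30.1   # ~49 tokens/s
xml_rate = 1219 / 24.4       # ~50 tokens/s
# The per-token rates are nearly equal; the 19% latency win comes from
# generating fewer tokens, not from decoding them faster.
```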
Finding 4: Diminishing Returns on Reasoning Tasks
Pure reasoning tasks (logic puzzles, deduction) showed minimal XML benefit. These tasks require sequential thinking rather than structured output—the format of the prompt matters less than the clarity of constraints.
Why Does XML Help (When It Does)?
Hypothesis 1: Attention Anchoring
XML tags create explicit attention anchors. When Claude sees <requirements>, it likely increases attention weight on the enclosed content. This may reduce "drift" where models add unnecessary elaboration.
Hypothesis 2: Output Structure Inference
Models may infer that structured input implies structured output expectations. XML prompts might signal "be concise and organized" implicitly, while natural language prompts feel more conversational.
Hypothesis 3: Training Data Patterns
If models were trained on XML-heavy technical documentation (API specs, configuration files), they may have learned associations between XML structure and concise, implementation-focused responses.
Practical Recommendations
1. Use XML for Code Generation
The data strongly supports XML formatting for code tasks. The 17-35% efficiency gain on Claude is substantial—especially at scale.
```xml
<task>
  <objective>Your main goal</objective>
  <language>Target language</language>
  <requirements>
    <requirement>Specific requirement 1</requirement>
    <requirement>Specific requirement 2</requirement>
  </requirements>
  <output_format>What you want returned</output_format>
</task>
```
2. Natural Language for Reasoning
For logic puzzles, analysis, and open-ended reasoning, natural language prompts perform equivalently with less formatting overhead. Don't add structure for structure's sake.
3. Test Your Specific Use Case
These results are pilot data. Your mileage may vary based on:
- Specific task complexity
- Model version (both Claude and Gemini update frequently)
- Temperature and other generation parameters
4. Consider the Model
If you're using Claude, XML formatting offers meaningful efficiency gains. For Gemini, the benefit is marginal—choose based on readability and maintainability preferences.
Methodology Notes
Experiment Setup
- Prompt Creation: 36 prompts created with semantically equivalent XML and natural language versions
- Execution: Each prompt run 3 times per model per format
- Metrics Captured: Output token count, response latency, completion status
- Analysis: Aggregated by model, category, and format
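The aggregation step can be as simple as grouping raw run records by (model, format) and averaging. A minimal sketch, assuming each run was logged as a dict (field names here are illustrative, not the repo's actual schema):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw run records (three of the 144)
runs = [
    {"model": "claude", "format": "xml", "tokens": 1200, "latency": 24.1},
    {"model": "claude", "format": "natural", "tokens": 1480, "latency": 30.5},
    {"model": "claude", "format": "xml", "tokens": 1238, "latency": 24.7},
]

# Group runs by (model, format)
groups = defaultdict(list)
for run in runs:
    groups[(run["model"], run["format"])].append(run)

# Average tokens and latency within each group
summary = {
    key: {
        "avg_tokens": mean(r["tokens"] for r in rs),
        "avg_latency": mean(r["latency"] for r in rs),
    }
    for key, rs in groups.items()
}
```

The same grouping, keyed by (model, category, format), yields the per-category tables above.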
Limitations
- Pilot Phase Only: 144 experiments is enough for directional insight, but too few for statistically significant conclusions
- No Correctness Evaluation: We measured efficiency, not accuracy
- CLI Overhead: Using CLI tools adds variable latency not attributable to the model
- Stochastic Nature: LLMs are inherently variable; results may differ on replication
Reproducibility
All code is available at github.com/nsameerd/xml-vs-natural-prompting. Run the experiment yourself to validate or extend these findings.
Conclusion
XML-structured prompts provide meaningful efficiency gains—but the benefit is neither universal nor guaranteed. Claude Sonnet 4 showed 17% token reduction and 19% latency improvement with XML formatting, while Gemini 2.5 Flash showed only marginal gains (~3.6%).
The largest improvements appeared in code generation tasks, suggesting that XML structure aligns well with code's inherent hierarchy. For reasoning tasks, natural language performed equivalently.
The takeaway: XML prompting is a useful tool, not a silver bullet. Test it for your specific model and use case, and don't assume that structure automatically means better results.
FAQ
Q: Should I always use XML prompts?
A: No. Use XML for code generation and structured output tasks. For open-ended reasoning or creative tasks, natural language is equally effective and easier to write.
Q: Why did Gemini show smaller gains?
A: Unknown. Possible explanations include different training data, attention mechanisms, or instruction-following approaches. More research needed.
Q: Will these results hold for GPT-4 or other models?
A: We only tested Claude and Gemini. Each model may respond differently to formatting. Test your specific model.
Q: Does XML affect response quality?
A: We didn't measure quality in this experiment—only efficiency (tokens and latency). Quality evaluation requires task-specific rubrics.
Cost Considerations
At scale, 17% token reduction translates directly to cost savings:
| Scenario | Natural Language | XML Format | Savings |
|---|---|---|---|
| 10,000 Claude calls | ~14.8M tokens | ~12.2M tokens | ~$39 (output @ $15/M) |
| 100,000 Claude calls | ~148M tokens | ~122M tokens | ~$390 |
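The savings column follows from the per-call averages in the summary table, assuming $15 per million output tokens as noted:

```python
natural_tokens, xml_tokens = 1476, 1219   # avg output tokens per Claude call
price_per_million = 15.0                  # USD per million output tokens

def savings(calls):
    """Dollar savings from the XML format's shorter outputs."""
    saved_tokens = calls * (natural_tokens - xml_tokens)
    return saved_tokens / 1_000_000 * price_per_million

print(savings(10_000))    # ~38.55 -> ~$39
print(savings(100_000))   # ~385.5 -> ~$390
```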
For high-volume applications on Claude, XML formatting pays for itself through reduced token costs.