Few-Shot Prompting for STEM: Claude vs Gemini Across 128 Experiments
TL;DR
We ran 128 experiments testing zero-shot vs few-shot prompting across 16 STEM problems using Claude Sonnet 4 and Gemini 2.0 Flash. Here's what we found:
The surprising result: Accuracy held roughly steady at ~85% regardless of example count. Few-shot prompting doesn't make models smarter at STEM; it makes their output more predictable.
What actually improved:
- LaTeX usage: 53% → 78% (+25 points)
- Boxed answers: 44% → 81% (+37 points)
- Token count: 360 → 228 (-37%)
Claude vs Gemini:
| Metric | Claude Sonnet 4 | Gemini 2.0 Flash |
|---|---|---|
| Avg Latency | 11.2s | 3.4s |
| Avg Tokens | 248 | 296 |
| Zero-shot Tokens | 301 | 419 |
| 5-shot Tokens | 222 | 235 |
Gemini is 3.3x faster. Claude is 16% more concise. Both achieve similar accuracy and show identical improvement patterns with few-shot prompting.
The Experiment
Setup
We tested 16 problems across 4 STEM domains:
| Domain | Problems | Examples |
|---|---|---|
| Mathematics | 5 | Integration, derivatives, limits |
| Physics | 4 | Kinematics, oscillations, circuits |
| Chemistry | 4 | Equation balancing, stoichiometry, redox |
| Biology | 3 | Glycolysis, DNA replication, base pairing |
Each problem ran 8 times: 4 conditions (zero-shot, 1-shot, 3-shot, 5-shot) × 2 models.
Total: 128 experiments.
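The experiment grid can be sketched as follows (identifiers here are illustrative; the repo's actual runner may structure this differently):

```typescript
// Sketch of the experiment grid: 16 problems x 4 shot counts x 2 models = 128 runs.
// Problem and model names are placeholders, not the repo's actual identifiers.
const problems = Array.from({ length: 16 }, (_, i) => `problem-${i + 1}`);
const shotCounts = [0, 1, 3, 5];
const models = ["claude-sonnet-4", "gemini-2.0-flash"];

const experiments = problems.flatMap((problem) =>
  shotCounts.flatMap((shots) =>
    models.map((model) => ({ problem, shots, model }))
  )
);

console.log(experiments.length); // 128
```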
The Models
- Claude Sonnet 4 via CLI (`claude -p`)
- Gemini 2.0 Flash via REST API
Both received identical prompts with domain-specific instructions and worked examples.
Prompt Structure
┌───────────────────────────────────────────────────────────┐
│ CONTEXT: "Show work step-by-step using LaTeX.             │
│ Box your final answer."                                   │
├───────────────────────────────────────────────────────────┤
│ EXAMPLES: 0-5 worked problems (domain-specific)           │
├───────────────────────────────────────────────────────────┤
│ QUERY: The actual problem to solve                        │
└───────────────────────────────────────────────────────────┘
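A minimal sketch of how this three-part prompt might be assembled (function and type names are illustrative, not the repo's actual code):

```typescript
// Assemble the three-part prompt: context + 0-5 worked examples + query.
// The instruction text mirrors the diagram; everything else is a sketch.
interface WorkedExample {
  problem: string;
  solution: string;
}

function buildPrompt(context: string, examples: WorkedExample[], query: string): string {
  const exampleBlock = examples
    .map((ex) => `### Example:\n**Problem:** ${ex.problem}\n**Solution:**\n${ex.solution}`)
    .join("\n\n");
  // Zero-shot simply passes an empty examples array; the block drops out.
  return [context, exampleBlock, `Now solve:\n**Problem:** ${query}`]
    .filter((part) => part.length > 0)
    .join("\n\n---\n\n");
}

const prompt = buildPrompt(
  "Show work step-by-step using LaTeX. Box your final answer.",
  [],
  "Find the derivative of x^3"
);
```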
Results: Accuracy
Few-Shot Didn't Improve Correctness
| Condition | Correct | Total | Accuracy |
|---|---|---|---|
| Zero-shot | 27 | 32 | 84.4% |
| 1-shot | 26 | 32 | 81.3% |
| 3-shot | 28 | 32 | 87.5% |
| 5-shot | 28 | 32 | 87.5% |
Adding examples didn't help. Both models already knew how to solve these problemsโthey just needed clear instructions.
By Domain
| Domain | Zero-Shot | 1-Shot | 3-Shot | 5-Shot |
|---|---|---|---|---|
| Math | 60% | 50% | 60% | 60% |
| Physics | 100% | 100% | 100% | 100% |
| Chemistry | 88% | 88% | 100% | 100% |
| Biology | 100% | 100% | 100% | 100% |
Physics and biology: perfect across conditions. Chemistry improved slightly with more examples. Math stayed at 60%, but this was mostly a measurement artifact: the models produced correct answers in different notation (e.g., x²cos(x) vs cos(x)·x²) that our pattern matcher missed.
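A lightweight normalization pass would have caught most of those false negatives. A sketch of the idea (this is not the matcher we actually used in the experiments):

```typescript
// Normalize algebraic answers so commutative reorderings compare equal,
// e.g. "x^2cos(x)" vs "cos(x)·x^2". Illustration of the fix, not our matcher.
function normalize(expr: string): string {
  // Strip whitespace and explicit multiplication symbols.
  const cleaned = expr.replace(/[\s·*]/g, "").toLowerCase();
  // Split into factors (function calls, powers, bare numbers) and sort them.
  const factors = cleaned.match(/[a-z]+\([^)]*\)|[a-z]\^?\d*|\d+/g) ?? [cleaned];
  return factors.sort().join("");
}

const equal = normalize("x^2cos(x)") === normalize("cos(x)·x^2"); // true
```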
By Model
Both Claude and Gemini achieved nearly identical accuracy. The variance came from problem difficulty, not model capability.
Results: Format Adherence
This is where few-shot prompting shines.
Format Scores by Condition
| Condition | Format Score | LaTeX % | Boxed Answer % |
|---|---|---|---|
| Zero-shot | 4.8/10 | 53% | 44% |
| 1-shot | 6.7/10 | 78% | 81% |
| 3-shot | 6.5/10 | 78% | 81% |
| 5-shot | 6.4/10 | 78% | 81% |
One example was enough. The jump from zero-shot to 1-shot captured nearly all the format improvement. Additional examples provided diminishing returns.
What Changed
Zero-shot chemistry response:
The balanced equation is: 4Fe + 3O₂ + 6H₂O → 4Fe(OH)₃
1-shot chemistry response:
**Solution:**
- Unbalanced: Fe + O₂ + H₂O → Fe(OH)₃
| Atom | Left | Right |
|------|------|-------|
| Fe | 1 | 1 |
| O | 4 | 3 |
| H | 2 | 3 |
[... balancing steps ...]
**Verification:**
| Atom | Left | Right |
|------|------|-------|
| Fe | 4 | 4 ✓ |
| O | 12 | 12 ✓ |
| H | 12 | 12 ✓ |
**Answer:** $4Fe + 3O_2 + 6H_2O \rightarrow 4Fe(OH)_3$
The verification table, LaTeX formatting, and structured layout only appeared consistently with few-shot prompting.
Results: Claude vs Gemini
Token Efficiency
| Model | Condition | Avg Tokens | Avg Latency |
|---|---|---|---|
| Claude | Zero-shot | 301 | 12.1s |
| Claude | 1-shot | 241 | 11.1s |
| Claude | 3-shot | 230 | 11.0s |
| Claude | 5-shot | 222 | 10.6s |
| Gemini | Zero-shot | 419 | 4.1s |
| Gemini | 1-shot | 282 | 3.6s |
| Gemini | 3-shot | 249 | 3.0s |
| Gemini | 5-shot | 235 | 3.1s |
Key Findings
1. Gemini is 3.3x faster
Average latency: 3.4s vs 11.2s. For applications where response time matters, this is significant.
2. Claude is more concise by default
Zero-shot token comparison: Claude produced 301 tokens vs Gemini's 419โa 28% difference. Claude's responses were more focused out of the box.
3. Both converge with few-shot
With 5-shot prompting, the gap narrows: 222 vs 235 tokens. Few-shot prompting teaches both models to match the example's conciseness.
4. Gemini benefits more from few-shot
Token reduction: Claude dropped 26% (301 → 222), Gemini dropped 44% (419 → 235). If you're using Gemini and want concise output, few-shot prompting is essential.
The Trade-off
| Choose Claude when... | Choose Gemini when... |
|---|---|
| You need concise zero-shot output | Latency is critical |
| CLI integration matters | Cost per token is a concern |
| You want predictable response length | You'll use few-shot anyway |
Practical Recommendations
1. Use One Example
Our data shows 1-shot captures most benefits. The template:
You are solving a mathematics problem. Show your work step-by-step
using LaTeX notation ($...$ for inline). Box your final answer.
### Example:
**Problem:** Find the derivative of x³
**Solution:**
- Apply power rule: $\frac{d}{dx}[x^n] = nx^{n-1}$
- $\frac{d}{dx}[x^3] = 3x^2$
- **Answer:** $\boxed{3x^2}$
---
Now solve:
**Problem:** [YOUR PROBLEM HERE]
2. Match Example Complexity to Problem
Don't use a simple derivative example for an integration by parts problem. The example should demonstrate the techniques needed.
3. Domain-Specific Templates
Physics:
You are solving a physics problem. Follow this format:
1. List known quantities with units
2. Identify what to find
3. Write relevant equations
4. Show calculations
5. State final answer with units
### Example:
**Problem:** A car accelerates from rest at 2 m/s² for 4 seconds.
Find the final velocity.
**Solution:**
| Known | Value |
|-------|-------|
| $v_0$ | 0 m/s |
| $a$ | 2 m/s² |
| $t$ | 4 s |
**Equation:** $v = v_0 + at$
**Calculation:** $v = 0 + (2)(4) = 8$ m/s
**Answer:** $v = 8$ m/s
Chemistry:
You are solving a chemistry problem. For balancing:
1. Write unbalanced equation
2. Count atoms in a table
3. Balance systematically
4. Verify with final count
### Example:
**Problem:** Balance: H₂ + O₂ → H₂O
**Solution:**
| Atom | Left | Right |
|------|------|-------|
| H | 2 | 2 |
| O | 2 | 1 |
Balance O: H₂ + O₂ → 2H₂O
Balance H: 2H₂ + O₂ → 2H₂O
**Answer:** $2H_2 + O_2 \rightarrow 2H_2O$
4. Consider Your Model Choice
- High-volume, latency-sensitive: Gemini with 1-shot
- Quality-focused, no time pressure: Claude with 1-shot
- Zero-shot required: Claude (more concise by default)
What Few-Shot Doesn't Do
Based on 128 experiments, few-shot prompting does not:
- Improve accuracy on problems the model can already solve
- Teach new techniques the model doesn't know
- Fix fundamental knowledge gaps
If Claude or Gemini can't solve a problem zero-shot, adding examples won't help. The model either knows the math or it doesn't.
Methodology
Measurements
- Correctness: Pattern matching against expected answer components
- Format score: LaTeX presence, boxed answers, tables, step labels
- Token count: `characters / 4` (approximation)
- Latency: Wall-clock time from request to response
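The two automated metrics can be sketched as follows. The format-score weights below are illustrative guesses at the 10-point rubric, not the exact one we used; the token count is the chars/4 heuristic stated above.

```typescript
// Rough token estimate: the common chars/4 heuristic for English text.
// Real model tokenizers will differ.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Illustrative 10-point format score. The checks match the criteria named
// in the methodology; the point weights are guesses, not the real rubric.
function formatScore(response: string): number {
  let score = 0;
  if (/\$[^$]+\$/.test(response)) score += 3;               // inline LaTeX
  if (/\\boxed\{/.test(response)) score += 3;               // boxed final answer
  if (/\|.+\|/.test(response)) score += 2;                  // markdown table
  if (/\*\*Solution:\*\*|step/i.test(response)) score += 2; // step labels
  return score; // out of 10
}
```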
Limitations
- Two models: Claude Sonnet 4 and Gemini 2.0 Flash. Other models may differ.
- Pattern matching: Our correctness metric missed mathematically equivalent answers with different notation. True accuracy is likely higher.
- Sample size: 16 problems × 8 conditions = 128 experiments. Directional, not statistically definitive.
- API differences: Claude via CLI, Gemini via REST API. May affect latency comparison slightly.
Reproduce It
git clone https://github.com/nsameerd/stem-fewshot-experiments.git
cd stem-fewshot-experiments
npm install
npx tsx run-experiments.ts # Run all 128 experiments
npx tsx analyze-results.ts # Generate analysis
Requires: Claude CLI installed, GEMINI_API_KEY in environment.
Conclusion
After 128 experiments across two models:
1. **Few-shot prompting is for format, not accuracy.** Both models achieved ~85% accuracy zero-shot. Examples didn't improve this.
2. **One example is enough.** The jump from zero to one captured most format improvements. More examples had diminishing returns.
3. **Token efficiency improves dramatically.** Output length dropped 37% with 5-shot prompting.
4. **Model choice depends on priorities.** Gemini for speed (3.3x faster), Claude for conciseness (16% fewer tokens by default).
5. **Both models respond identically to few-shot.** Same accuracy, same format improvement patterns, same convergence behavior.
Bottom line: Use few-shot prompting to standardize output format. Use model selection to optimize for latency vs conciseness. Don't expect examples to teach models new mathematics.
Raw Data
Combined Results
| Metric | Zero-Shot | 1-Shot | 3-Shot | 5-Shot |
|---|---|---|---|---|
| Accuracy | 84.4% | 81.3% | 87.5% | 87.5% |
| Format Score | 4.8/10 | 6.7/10 | 6.5/10 | 6.4/10 |
| LaTeX Usage | 53% | 78% | 78% | 78% |
| Boxed Answers | 44% | 81% | 81% | 81% |
| Avg Tokens | 360 | 262 | 239 | 228 |
| Avg Latency | 8.1s | 7.3s | 7.0s | 6.8s |
Model Comparison
| Metric | Claude Sonnet 4 | Gemini 2.0 Flash |
|---|---|---|
| Total Experiments | 64 | 64 |
| Avg Tokens | 248 | 296 |
| Avg Latency | 11.2s | 3.4s |
| Zero-shot Tokens | 301 | 419 |
| 5-shot Tokens | 222 | 235 |
| Token Reduction | 26% | 44% |
128 experiments run January 2026. Code: github.com/nsameerd/stem-fewshot-experiments