
Few-Shot Prompting for STEM: Claude vs Gemini Across 128 Experiments

We ran 128 experiments comparing Claude Sonnet 4 and Gemini 2.0 Flash on STEM problems with zero-shot through 5-shot prompting. The results challenge common assumptions: accuracy stayed flat while format consistency and token efficiency improved dramatically.

January 21, 2026 · 15 min read · By Sameer


TL;DR

We ran 128 experiments testing zero-shot vs few-shot prompting across 16 STEM problems using Claude Sonnet 4 and Gemini 2.0 Flash. Here's what we found:

The surprising result: Accuracy stayed constant at ~85% regardless of example count. Few-shot prompting doesn't make models smarter at STEM; it makes their output more predictable.

What actually improved:

  • LaTeX usage: 53% → 78% (+25 points)
  • Boxed answers: 44% → 81% (+37 points)
  • Token count: 360 → 228 (-37%)

Claude vs Gemini:

| Metric | Claude Sonnet 4 | Gemini 2.0 Flash |
|--------|-----------------|------------------|
| Avg Latency | 11.2s | 3.4s |
| Avg Tokens | 248 | 296 |
| Zero-shot Tokens | 301 | 419 |
| 5-shot Tokens | 222 | 235 |

Gemini is 3.3x faster. Claude is 16% more concise. Both achieve similar accuracy and show identical improvement patterns with few-shot prompting.


The Experiment

Setup

We tested 16 problems across 4 STEM domains:

| Domain | Problems | Examples |
|--------|----------|----------|
| Mathematics | 5 | Integration, derivatives, limits |
| Physics | 4 | Kinematics, oscillations, circuits |
| Chemistry | 4 | Equation balancing, stoichiometry, redox |
| Biology | 3 | Glycolysis, DNA replication, base pairing |

Each problem ran 8 times: 4 conditions (zero-shot, 1-shot, 3-shot, 5-shot) × 2 models.

Total: 128 experiments.
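The experiment grid can be sketched as a simple loop (the model identifiers here are illustrative labels, not API model strings):

```typescript
// Enumerate the full experiment grid: 16 problems × 4 shot counts × 2 models.
const shotCounts = [0, 1, 3, 5];
const models = ["claude-sonnet-4", "gemini-2.0-flash"];
const problemCount = 16;

const runs: { problem: number; shots: number; model: string }[] = [];
for (let p = 0; p < problemCount; p++) {
  for (const shots of shotCounts) {
    for (const model of models) {
      runs.push({ problem: p, shots, model });
    }
  }
}

console.log(runs.length); // 128
```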

The Models

  • Claude Sonnet 4 via CLI (claude -p)
  • Gemini 2.0 Flash via REST API

Both received identical prompts with domain-specific instructions and worked examples.

Prompt Structure

┌─────────────────────────────────────────────────────────┐
│  CONTEXT: "Show work step-by-step using LaTeX.          │
│           Box your final answer."                       │
├─────────────────────────────────────────────────────────┤
│  EXAMPLES: 0-5 worked problems (domain-specific)        │
├─────────────────────────────────────────────────────────┤
│  QUERY: The actual problem to solve                     │
└─────────────────────────────────────────────────────────┘
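A minimal sketch of how this three-part prompt can be assembled (the `Example` shape and `buildPrompt` name are our own, not from the experiment repo):

```typescript
interface Example {
  problem: string;
  solution: string;
}

// Assemble CONTEXT + 0-5 EXAMPLES + QUERY into a single prompt string.
function buildPrompt(context: string, examples: Example[], query: string): string {
  const exampleBlocks = examples.map(
    (ex, i) =>
      `### Example ${i + 1}:\n**Problem:** ${ex.problem}\n\n**Solution:**\n${ex.solution}`
  );
  return [context, ...exampleBlocks, `Now solve:\n\n**Problem:** ${query}`].join(
    "\n\n---\n\n"
  );
}
```

With an empty `examples` array this degenerates to the zero-shot prompt; with five entries it reproduces the 5-shot condition.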

Results: Accuracy

Few-Shot Didn't Improve Correctness

| Condition | Correct | Total | Accuracy |
|-----------|---------|-------|----------|
| Zero-shot | 27 | 32 | 84.4% |
| 1-shot | 26 | 32 | 81.3% |
| 3-shot | 28 | 32 | 87.5% |
| 5-shot | 28 | 32 | 87.5% |

Adding examples didn't help. Both models already knew how to solve these problemsโ€”they just needed clear instructions.

By Domain

| Domain | Zero-Shot | 1-Shot | 3-Shot | 5-Shot |
|--------|-----------|--------|--------|--------|
| Math | 60% | 50% | 60% | 60% |
| Physics | 100% | 100% | 100% | 100% |
| Chemistry | 88% | 88% | 100% | 100% |
| Biology | 100% | 100% | 100% | 100% |

Physics and biology: perfect across conditions. Chemistry improved slightly with more examples. Math stayed at 60%, but this was mostly a measurement artifact: the models produced correct answers in different notation (e.g., x²cos(x) vs cos(x)·x²) that our pattern matcher missed.
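To illustrate the artifact: a naive substring check marks `cos(x)·x²` wrong against an answer key of `x²·cos(x)`, even though the two are identical. Sorting a product's factors before comparing catches this class of miss (a sketch of the idea, not the matcher we actually used):

```typescript
// Compare two products of factors, ignoring whitespace and factor order.
// Only handles explicitly delimited products (factors joined by "·").
function sameProduct(a: string, b: string): boolean {
  const normalize = (s: string) =>
    s.replace(/\s+/g, "").split("·").sort().join("·");
  return normalize(a) === normalize(b);
}

console.log(sameProduct("x^2 · cos(x)", "cos(x)·x^2")); // true
console.log(sameProduct("x^2 · cos(x)", "x^3·cos(x)")); // false
```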

By Model

Both Claude and Gemini achieved nearly identical accuracy. The variance came from problem difficulty, not model capability.


Results: Format Adherence

This is where few-shot prompting shines.

Format Scores by Condition

| Condition | Format Score | LaTeX % | Boxed Answer % |
|-----------|--------------|---------|----------------|
| Zero-shot | 4.8/10 | 53% | 44% |
| 1-shot | 6.7/10 | 78% | 81% |
| 3-shot | 6.5/10 | 78% | 81% |
| 5-shot | 6.4/10 | 78% | 81% |

One example was enough. The jump from zero-shot to 1-shot captured nearly all the format improvement. Additional examples provided diminishing returns.

What Changed

Zero-shot chemistry response:

The balanced equation is: 4Fe + 3O₂ + 6H₂O → 4Fe(OH)₃

1-shot chemistry response:

**Solution:**
- Unbalanced: Fe + O₂ + H₂O → Fe(OH)₃

| Atom | Left | Right |
|------|------|-------|
| Fe | 1 | 1 |
| O | 4 | 3 |
| H | 2 | 3 |

[... balancing steps ...]

**Verification:**
| Atom | Left | Right |
|------|------|-------|
| Fe | 4 | 4 ✓ |
| O | 12 | 12 ✓ |
| H | 12 | 12 ✓ |

**Answer:** $4Fe + 3O_2 + 6H_2O \rightarrow 4Fe(OH)_3$

The verification table, LaTeX formatting, and structured layout only appeared consistently with few-shot prompting.


Results: Claude vs Gemini

Token Efficiency

| Model | Condition | Avg Tokens | Avg Latency |
|-------|-----------|------------|-------------|
| Claude | Zero-shot | 301 | 12.1s |
| Claude | 1-shot | 241 | 11.1s |
| Claude | 3-shot | 230 | 11.0s |
| Claude | 5-shot | 222 | 10.6s |
| Gemini | Zero-shot | 419 | 4.1s |
| Gemini | 1-shot | 282 | 3.6s |
| Gemini | 3-shot | 249 | 3.0s |
| Gemini | 5-shot | 235 | 3.1s |

Key Findings

1. Gemini is 3.3x faster

Average latency: 3.4s vs 11.2s. For applications where response time matters, this is significant.

2. Claude is more concise by default

Zero-shot token comparison: Claude produced 301 tokens vs Gemini's 419, a 28% difference. Claude's responses were more focused out of the box.

3. Both converge with few-shot

With 5-shot prompting, the gap narrows: 222 vs 235 tokens. Few-shot prompting teaches both models to match the example's conciseness.

4. Gemini benefits more from few-shot

Token reduction: Claude dropped 26% (301 → 222), Gemini dropped 44% (419 → 235). If you're using Gemini and want concise output, few-shot prompting is essential.

The Trade-off

| Choose Claude when... | Choose Gemini when... |
|-----------------------|-----------------------|
| You need concise zero-shot output | Latency is critical |
| CLI integration matters | Cost per token is a concern |
| You want predictable response length | You'll use few-shot anyway |

Practical Recommendations

1. Use One Example

Our data shows 1-shot captures most benefits. The template:

You are solving a mathematics problem. Show your work step-by-step
using LaTeX notation ($...$ for inline). Box your final answer.

### Example:
**Problem:** Find the derivative of x³

**Solution:**
- Apply power rule: $\frac{d}{dx}[x^n] = nx^{n-1}$
- $\frac{d}{dx}[x^3] = 3x^2$
- **Answer:** $\boxed{3x^2}$

---
Now solve:

**Problem:** [YOUR PROBLEM HERE]

2. Match Example Complexity to Problem

Don't use a simple derivative example for an integration by parts problem. The example should demonstrate the techniques needed.

3. Domain-Specific Templates

Physics:

You are solving a physics problem. Follow this format:
1. List known quantities with units
2. Identify what to find
3. Write relevant equations
4. Show calculations
5. State final answer with units

### Example:
**Problem:** A car accelerates from rest at 2 m/s² for 4 seconds.
Find the final velocity.

**Solution:**
| Known | Value |
|-------|-------|
| $v_0$ | 0 m/s |
| $a$ | 2 m/sยฒ |
| $t$ | 4 s |

**Equation:** $v = v_0 + at$

**Calculation:** $v = 0 + (2)(4) = 8$ m/s

**Answer:** $v = 8$ m/s

Chemistry:

You are solving a chemistry problem. For balancing:
1. Write unbalanced equation
2. Count atoms in a table
3. Balance systematically
4. Verify with final count

### Example:
**Problem:** Balance: H₂ + O₂ → H₂O

**Solution:**
| Atom | Left | Right |
|------|------|-------|
| H | 2 | 2 |
| O | 2 | 1 |

Balance O: H₂ + O₂ → 2H₂O
Balance H: 2H₂ + O₂ → 2H₂O

**Answer:** $2H_2 + O_2 \rightarrow 2H_2O$

4. Consider Your Model Choice

  • High-volume, latency-sensitive: Gemini with 1-shot
  • Quality-focused, no time pressure: Claude with 1-shot
  • Zero-shot required: Claude (more concise by default)

What Few-Shot Doesn't Do

Based on 128 experiments, few-shot prompting does not:

  1. Improve accuracy on problems the model can already solve
  2. Teach new techniques the model doesn't know
  3. Fix fundamental knowledge gaps

If Claude or Gemini can't solve a problem zero-shot, adding examples won't help. The model either knows the math or it doesn't.


Methodology

Measurements

  • Correctness: Pattern matching against expected answer components
  • Format score: LaTeX presence, boxed answers, tables, step labels
  • Token count: characters / 4 (approximation)
  • Latency: Wall-clock time from request to response
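The two cheap metrics above can be sketched as follows (the chars/4 rule and 0-10 scale come from the study; the individual point weights are our assumption):

```typescript
// Rough token estimate: characters / 4, the approximation used in the study.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Score formatting signals on a 0-10 scale (point weights are illustrative).
function formatScore(response: string): number {
  let score = 0;
  if (/\$[^$]+\$/.test(response)) score += 3;                // inline LaTeX
  if (response.includes("\\boxed{")) score += 3;             // boxed final answer
  if (/\|.+\|\s*\n\s*\|[-| ]+\|/.test(response)) score += 2; // markdown table
  if (/^\s*(?:[-*]|\d+\.)\s/m.test(response)) score += 2;    // step list / labels
  return score;
}
```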

Limitations

  1. Two models: Claude Sonnet 4 and Gemini 2.0 Flash. Other models may differ.
  2. Pattern matching: Our correctness metric missed mathematically equivalent answers with different notation. True accuracy is likely higher.
  3. Sample size: 16 problems × 8 conditions = 128 experiments. Directional, not statistically definitive.
  4. API differences: Claude via CLI, Gemini via REST API. May affect latency comparison slightly.

Reproduce It

git clone https://github.com/nsameerd/stem-fewshot-experiments.git
cd stem-fewshot-experiments
npm install
npx tsx run-experiments.ts        # Run all 128 experiments
npx tsx analyze-results.ts        # Generate analysis

Requires: Claude CLI installed, GEMINI_API_KEY in environment.


Conclusion

After 128 experiments across two models:

  1. Few-shot prompting is for format, not accuracy. Both models achieved ~85% accuracy zero-shot. Examples didn't improve this.

  2. One example is enough. The jump from zero to one captured most format improvements. More examples had diminishing returns.

  3. Token efficiency improves dramatically. 37% reduction in output length with 5-shot prompting.

  4. Model choice depends on priorities. Gemini for speed (3.3x faster), Claude for conciseness (16% fewer tokens on average).

  5. Both models respond identically to few-shot. Same accuracy, same format improvement patterns, same convergence behavior.

Bottom line: Use few-shot prompting to standardize output format. Use model selection to optimize for latency vs conciseness. Don't expect examples to teach models new mathematics.


Raw Data

Combined Results

| Metric | Zero-Shot | 1-Shot | 3-Shot | 5-Shot |
|--------|-----------|--------|--------|--------|
| Accuracy | 84.4% | 81.3% | 87.5% | 87.5% |
| Format Score | 4.8/10 | 6.7/10 | 6.5/10 | 6.4/10 |
| LaTeX Usage | 53% | 78% | 78% | 78% |
| Boxed Answers | 44% | 81% | 81% | 81% |
| Avg Tokens | 360 | 262 | 239 | 228 |
| Avg Latency | 8.1s | 7.3s | 7.0s | 6.8s |

Model Comparison

| Metric | Claude Sonnet 4 | Gemini 2.0 Flash |
|--------|-----------------|------------------|
| Total Experiments | 64 | 64 |
| Avg Tokens | 248 | 296 |
| Avg Latency | 11.2s | 3.4s |
| Zero-shot Tokens | 301 | 419 |
| 5-shot Tokens | 222 | 235 |
| Token Reduction | 26% | 44% |

128 experiments run January 2026. Code: github.com/nsameerd/stem-fewshot-experiments
