Few-Shot Prompting for STEM: Claude vs Gemini Across 128 Experiments
TL;DR
We ran 128 experiments testing zero-shot vs few-shot prompting across 16 STEM problems using Claude Sonnet 4 and Gemini 2.0 Flash. Here's what we found:
The surprising result: Accuracy held roughly steady at ~85% regardless of example count. Few-shot prompting doesn't make models smarter at STEM; it makes their output more predictable.
What actually improved:
- LaTeX usage: 53% → 78% (+25 points)
- Boxed answers: 44% → 81% (+37 points)
- Token count: 360 → 228 (-37%)
Claude vs Gemini:
| Metric | Claude Sonnet 4 | Gemini 2.0 Flash |
|---|---|---|
| Avg Latency | 11.2s | 3.4s |
| Avg Tokens | 248 | 296 |
| Zero-shot Tokens | 301 | 419 |
| 5-shot Tokens | 222 | 235 |
Gemini is 3.3x faster. Claude is 16% more concise. Both achieve similar accuracy and show identical improvement patterns with few-shot prompting.
The Experiment
Setup
We tested 16 problems across 4 STEM domains:
| Domain | Problems | Examples |
|---|---|---|
| Mathematics | 5 | Integration, derivatives, limits |
| Physics | 4 | Kinematics, oscillations, circuits |
| Chemistry | 4 | Equation balancing, stoichiometry, redox |
| Biology | 3 | Glycolysis, DNA replication, base pairing |
Each problem ran 8 times: 4 conditions (zero-shot, 1-shot, 3-shot, 5-shot) × 2 models.
Total: 128 experiments.
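The experiment grid can be sketched as follows (identifiers here are illustrative; the repo's actual runner may structure this differently):

```typescript
// Sketch of the experiment grid: 16 problems x 4 shot counts x 2 models = 128 runs.
// Problem and model names are placeholders, not the repo's actual identifiers.
const problems = Array.from({ length: 16 }, (_, i) => `problem-${i + 1}`);
const shotCounts = [0, 1, 3, 5];
const models = ["claude-sonnet-4", "gemini-2.0-flash"];

const experiments = problems.flatMap((problem) =>
  shotCounts.flatMap((shots) =>
    models.map((model) => ({ problem, shots, model }))
  )
);

console.log(experiments.length); // 128
```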
The Models
- Claude Sonnet 4 via CLI (`claude -p`)
- Gemini 2.0 Flash via REST API
Both received identical prompts with domain-specific instructions and worked examples.
Prompt Structure
┌───────────────────────────────────────────────────────────┐
│ CONTEXT: "Show work step-by-step using LaTeX.             │
│ Box your final answer."                                   │
├───────────────────────────────────────────────────────────┤
│ EXAMPLES: 0-5 worked problems (domain-specific)           │
├───────────────────────────────────────────────────────────┤
│ QUERY: The actual problem to solve                        │
└───────────────────────────────────────────────────────────┘
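A minimal sketch of how this three-part prompt might be assembled (function and type names are illustrative, not the repo's actual code):

```typescript
// Assemble the three-part prompt: context + 0-5 worked examples + query.
// The instruction text mirrors the diagram; everything else is a sketch.
interface WorkedExample {
  problem: string;
  solution: string;
}

function buildPrompt(context: string, examples: WorkedExample[], query: string): string {
  const exampleBlock = examples
    .map((ex) => `### Example:\n**Problem:** ${ex.problem}\n**Solution:**\n${ex.solution}`)
    .join("\n\n");
  // Zero-shot simply passes an empty examples array; the block drops out.
  return [context, exampleBlock, `Now solve:\n**Problem:** ${query}`]
    .filter((part) => part.length > 0)
    .join("\n\n---\n\n");
}

const prompt = buildPrompt(
  "Show work step-by-step using LaTeX. Box your final answer.",
  [],
  "Find the derivative of x^3"
);
```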
Results: Accuracy
Few-Shot Didn't Improve Correctness
| Condition | Correct | Total | Accuracy |
|---|---|---|---|
| Zero-shot | 27 | 32 | 84.4% |
| 1-shot | 26 | 32 | 81.3% |
| 3-shot | 28 | 32 | 87.5% |
| 5-shot | 28 | 32 | 87.5% |
Adding examples didn't help. Both models already knew how to solve these problemsโthey just needed clear instructions.
By Domain
| Domain | Zero-Shot | 1-Shot | 3-Shot | 5-Shot |
|---|---|---|---|---|
| Math | 60% | 50% | 60% | 60% |
| Physics | 100% | 100% | 100% | 100% |
| Chemistry | 88% | 88% | 100% | 100% |
| Biology | 100% | 100% | 100% | 100% |
Physics and biology: perfect across conditions. Chemistry improved slightly with more examples. Math stayed at 60%, but this was mostly a measurement artifact: the models produced correct answers in different notation (e.g., x²cos(x) vs cos(x)·x²) that our pattern matcher missed.
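A lightweight normalization pass would have caught most of those false negatives. A sketch of the idea (this is not the matcher we actually used in the experiments):

```typescript
// Normalize algebraic answers so commutative reorderings compare equal,
// e.g. "x^2cos(x)" vs "cos(x)·x^2". Illustration of the fix, not our matcher.
function normalize(expr: string): string {
  // Strip whitespace and explicit multiplication symbols.
  const cleaned = expr.replace(/[\s·*]/g, "").toLowerCase();
  // Split into factors (function calls, powers, bare numbers) and sort them.
  const factors = cleaned.match(/[a-z]+\([^)]*\)|[a-z]\^?\d*|\d+/g) ?? [cleaned];
  return factors.sort().join("");
}

const equal = normalize("x^2cos(x)") === normalize("cos(x)·x^2"); // true
```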
By Model
Both Claude and Gemini achieved nearly identical accuracy. The variance came from problem difficulty, not model capability.
Results: Format Adherence
This is where few-shot prompting shines.
Format Scores by Condition
| Condition | Format Score | LaTeX % | Boxed Answer % |
|---|---|---|---|
| Zero-shot | 4.8/10 | 53% | 44% |
| 1-shot | 6.7/10 | 78% | 81% |
| 3-shot | 6.5/10 | 78% | 81% |
| 5-shot | 6.4/10 | 78% | 81% |
One example was enough. The jump from zero-shot to 1-shot captured nearly all the format improvement. Additional examples provided diminishing returns.
What Changed
Zero-shot chemistry response:
The balanced equation is: 4Fe + 3O₂ + 6H₂O → 4Fe(OH)₃
1-shot chemistry response:
**Solution:**
- Unbalanced: Fe + O₂ + H₂O → Fe(OH)₃
| Atom | Left | Right |
|------|------|-------|
| Fe | 1 | 1 |
| O | 4 | 3 |
| H | 2 | 3 |
[... balancing steps ...]
**Verification:**
| Atom | Left | Right |
|------|------|-------|
| Fe | 4 | 4 ✓ |
| O | 12 | 12 ✓ |
| H | 12 | 12 ✓ |
**Answer:** $4Fe + 3O_2 + 6H_2O \rightarrow 4Fe(OH)_3$
The verification table, LaTeX formatting, and structured layout only appeared consistently with few-shot prompting.
Results: Claude vs Gemini
Token Efficiency
| Model | Condition | Avg Tokens | Avg Latency |
|---|---|---|---|
| Claude | Zero-shot | 301 | 12.1s |
| Claude | 1-shot | 241 | 11.1s |
| Claude | 3-shot | 230 | 11.0s |
| Claude | 5-shot | 222 | 10.6s |
| Gemini | Zero-shot | 419 | 4.1s |
| Gemini | 1-shot | 282 | 3.6s |
| Gemini | 3-shot | 249 | 3.0s |
| Gemini | 5-shot | 235 | 3.1s |
Key Findings
1. Gemini is 3.3x faster
Average latency: 3.4s vs 11.2s. For applications where response time matters, this is significant.
2. Claude is more concise by default
Zero-shot token comparison: Claude produced 301 tokens vs Gemini's 419โa 28% difference. Claude's responses were more focused out of the box.
3. Both converge with few-shot
With 5-shot prompting, the gap narrows: 222 vs 235 tokens. Few-shot prompting teaches both models to match the example's conciseness.
4. Gemini benefits more from few-shot
Token reduction: Claude dropped 26% (301 → 222), Gemini dropped 44% (419 → 235). If you're using Gemini and want concise output, few-shot prompting is essential.
The Trade-off
| Choose Claude when... | Choose Gemini when... |
|---|---|
| You need concise zero-shot output | Latency is critical |
| CLI integration matters | Cost per token is a concern |
| You want predictable response length | You'll use few-shot anyway |
Practical Recommendations
1. Use One Example
Our data shows 1-shot captures most benefits. The template:
You are solving a mathematics problem. Show your work step-by-step
using LaTeX notation ($...$ for inline). Box your final answer.
### Example:
**Problem:** Find the derivative of x³
**Solution:**
- Apply power rule: $\frac{d}{dx}[x^n] = nx^{n-1}$
- $\frac{d}{dx}[x^3] = 3x^2$
- **Answer:** $\boxed{3x^2}$
---
Now solve:
**Problem:** [YOUR PROBLEM HERE]
2. Match Example Complexity to Problem
Don't use a simple derivative example for an integration by parts problem. The example should demonstrate the techniques needed.
3. Domain-Specific Templates
Physics:
You are solving a physics problem. Follow this format:
1. List known quantities with units
2. Identify what to find
3. Write relevant equations
4. Show calculations
5. State final answer with units
### Example:
**Problem:** A car accelerates from rest at 2 m/s² for 4 seconds.
Find the final velocity.
**Solution:**
| Known | Value |
|-------|-------|
| $v_0$ | 0 m/s |
| $a$ | 2 m/s² |
| $t$ | 4 s |
**Equation:** $v = v_0 + at$
**Calculation:** $v = 0 + (2)(4) = 8$ m/s
**Answer:** $v = 8$ m/s
Chemistry:
You are solving a chemistry problem. For balancing:
1. Write unbalanced equation
2. Count atoms in a table
3. Balance systematically
4. Verify with final count
### Example:
**Problem:** Balance: H₂ + O₂ → H₂O
**Solution:**
| Atom | Left | Right |
|------|------|-------|
| H | 2 | 2 |
| O | 2 | 1 |
Balance O: H₂ + O₂ → 2H₂O
Balance H: 2H₂ + O₂ → 2H₂O
**Answer:** $2H_2 + O_2 \rightarrow 2H_2O$
4. Consider Your Model Choice
- High-volume, latency-sensitive: Gemini with 1-shot
- Quality-focused, no time pressure: Claude with 1-shot
- Zero-shot required: Claude (more concise by default)
What Few-Shot Doesn't Do
Based on 128 experiments, few-shot prompting does not:
- Improve accuracy on problems the model can already solve
- Teach new techniques the model doesn't know
- Fix fundamental knowledge gaps
If Claude or Gemini can't solve a problem zero-shot, adding examples won't help. The model either knows the math or it doesn't.
Methodology
Measurements
- Correctness: Pattern matching against expected answer components
- Format score: LaTeX presence, boxed answers, tables, step labels
- Token count: `characters / 4` (approximation)
- Latency: Wall-clock time from request to response
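The two automated metrics can be sketched as follows. The format-score weights below are illustrative guesses at the 10-point rubric, not the exact one we used; the token count is the chars/4 heuristic stated above.

```typescript
// Rough token estimate: the common chars/4 heuristic for English text.
// Real model tokenizers will differ.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Illustrative 10-point format score. The checks match the criteria named
// in the methodology; the point weights are guesses, not the real rubric.
function formatScore(response: string): number {
  let score = 0;
  if (/\$[^$]+\$/.test(response)) score += 3;               // inline LaTeX
  if (/\\boxed\{/.test(response)) score += 3;               // boxed final answer
  if (/\|.+\|/.test(response)) score += 2;                  // markdown table
  if (/\*\*Solution:\*\*|step/i.test(response)) score += 2; // step labels
  return score; // out of 10
}
```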
Limitations
- Two models: Claude Sonnet 4 and Gemini 2.0 Flash. Other models may differ.
- Pattern matching: Our correctness metric missed mathematically equivalent answers with different notation. True accuracy is likely higher.
- Sample size: 16 problems × 8 conditions = 128 experiments. Directional, not statistically definitive.
- API differences: Claude via CLI, Gemini via REST API. May affect latency comparison slightly.
Reproduce It
git clone https://github.com/nsameerd/stem-fewshot-experiments.git
cd stem-fewshot-experiments
npm install
npx tsx run-experiments.ts # Run all 128 experiments
npx tsx analyze-results.ts # Generate analysis
Requires: Claude CLI installed, GEMINI_API_KEY in environment.
Conclusion
After 128 experiments across two models:
1. **Few-shot prompting is for format, not accuracy.** Both models achieved ~85% accuracy zero-shot. Examples didn't improve this.
2. **One example is enough.** The jump from zero to one captured most format improvements. More examples had diminishing returns.
3. **Token efficiency improves dramatically.** Output length dropped 37% with 5-shot prompting.
4. **Model choice depends on priorities.** Gemini for speed (3.3x faster), Claude for conciseness (16% fewer tokens by default).
5. **Both models respond identically to few-shot.** Same accuracy, same format improvement patterns, same convergence behavior.
Bottom line: Use few-shot prompting to standardize output format. Use model selection to optimize for latency vs conciseness. Don't expect examples to teach models new mathematics.
Raw Data
Combined Results
| Metric | Zero-Shot | 1-Shot | 3-Shot | 5-Shot |
|---|---|---|---|---|
| Accuracy | 84.4% | 81.3% | 87.5% | 87.5% |
| Format Score | 4.8/10 | 6.7/10 | 6.5/10 | 6.4/10 |
| LaTeX Usage | 53% | 78% | 78% | 78% |
| Boxed Answers | 44% | 81% | 81% | 81% |
| Avg Tokens | 360 | 262 | 239 | 228 |
| Avg Latency | 8.1s | 7.3s | 7.0s | 6.8s |
Model Comparison
| Metric | Claude Sonnet 4 | Gemini 2.0 Flash |
|---|---|---|
| Total Experiments | 64 | 64 |
| Avg Tokens | 248 | 296 |
| Avg Latency | 11.2s | 3.4s |
| Zero-shot Tokens | 301 | 419 |
| 5-shot Tokens | 222 | 235 |
| Token Reduction | 26% | 44% |
128 experiments run January 2026. Code: github.com/nsameerd/stem-fewshot-experiments