
Temperature Sensitivity in LLM Code Generation: A 752-Run Experiment

We tested how temperature (0, 0.3, 0.7, 1.0) affects code generation accuracy on 100 competitive programming problems. The surprising result: temperature barely matters for problem-solving, but model selection does.

January 25, 2026 · 10 min read · By Sameer

TL;DR

We ran 752 experiments testing temperature sensitivity (0, 0.3, 0.7, 1.0) on Gemini 2.5 Pro and Gemini 3 Pro Preview using novel competitive programming problems. Temperature had minimal impact: success rates varied by only 2-3 percentage points across the whole range. The real finding: both models solve roughly the same share of problems (~27%), but they fail differently. When models face genuinely novel algorithmic challenges, the temperature knob is mostly cosmetic.


Why Temperature Matters (Or Doesn't)

Temperature is the most-tweaked hyperparameter in LLM deployments. The conventional wisdom:

  • Low temperature (0-0.3): "Deterministic, reliable, good for code"
  • High temperature (0.7-1.0): "Creative, diverse, good for exploration"

But does this advice hold for genuine problem-solving—not just formatting or style? We designed this experiment to find out.
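For reference, temperature is just a sampling parameter on the request. A minimal sketch of setting it with the google-genai Python SDK (the client setup and prompt here are illustrative, not our exact harness):

```python
# Minimal sketch: passing temperature to a Gemini API call.
# Client setup and prompt are illustrative; our harness differs in detail.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Write a Python function that checks whether a string is a palindrome.",
    config=types.GenerateContentConfig(temperature=0.3),
)
print(response.text)
```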

The Problem with Temperature Studies

Most temperature experiments test on:

  • Simple generation tasks (summarization, translation)
  • Problems that may exist in training data
  • Single-attempt evaluations

None of these capture what happens when an LLM faces a novel algorithmic challenge requiring multi-step reasoning.


Experiment Design

The Dataset: Novel Problems Only

We used LiveCodeBench problems from competitive programming contests held after the training cutoff dates of both models. This means:

  • Zero memorization possible: Models can't recall solutions
  • Pure reasoning tested: Every solution requires genuine algorithmic thinking
  • Fair evaluation: Success = real capability, not pattern matching

Models Tested

| Model | Provider | Notes |
|---|---|---|
| Gemini 2.5 Pro | Google | Latest "thinking" model |
| Gemini 3 Pro Preview | Google | Next-generation preview |

Temperature Settings

| Temperature | Expected Behavior |
|---|---|
| 0 | Deterministic (greedy decoding) |
| 0.3 | Low randomness |
| 0.7 | Moderate randomness |
| 1.0 | High randomness (full distribution) |

Parameters

  • 100 problems × 2 models × 4 temperatures = 800 planned runs
  • 752 runs completed, 9 of which ended in persistent API timeouts on specific problems (see Methodology Notes)
  • Evaluation: Python subprocess execution against public test cases
  • Timeout: 3 minutes per API call
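The grid itself is straightforward. A hedged sketch of how the runs were enumerated (problem IDs and the commented steps are illustrative placeholders, not our actual harness):

```python
# Sketch of the experiment grid and per-run flow.
import itertools

MODELS = ["gemini-2.5-pro", "gemini-3-pro-preview"]
TEMPERATURES = [0, 0.3, 0.7, 1.0]
PROBLEM_IDS = [f"problem_{i:03d}" for i in range(100)]  # hypothetical IDs

grid = list(itertools.product(PROBLEM_IDS, MODELS, TEMPERATURES))
assert len(grid) == 800  # 100 problems x 2 models x 4 temperatures

for problem_id, model, temperature in grid:
    # For each cell: (1) request a solution from the model (3-minute API
    # timeout), (2) execute the returned code against public test cases in
    # a Python subprocess, (3) record pass/fail per test or a timeout marker.
    pass
```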

Results

Summary by Model and Temperature

| Model | Temp | Runs | Pass All | Fail All | Avg Pass Rate |
|---|---|---|---|---|---|
| gemini-2.5-pro | 0 | 98 | 28 (29%) | 68 (69%) | 29.4% |
| gemini-2.5-pro | 0.3 | 98 | 28 (29%) | 67 (68%) | 29.7% |
| gemini-2.5-pro | 0.7 | 99 | 21 (21%) | 75 (76%) | 22.2% |
| gemini-2.5-pro | 1 | 98 | 26 (27%) | 70 (71%) | 27.1% |
| gemini-3-pro-preview | 0 | 88 | 22 (25%) | 65 (74%) | 25.3% |
| gemini-3-pro-preview | 0.3 | 91 | 23 (25%) | 68 (75%) | 25.3% |
| gemini-3-pro-preview | 0.7 | 89 | 29 (33%) | 60 (67%) | 32.6% |
| gemini-3-pro-preview | 1 | 91 | 21 (23%) | 69 (76%) | 23.3% |

Model Aggregates (All Temperatures)

| Model | Total Runs | Pass All | Fail All | Timeouts | Avg Pass Rate |
|---|---|---|---|---|---|
| gemini-2.5-pro | 393 | 103 (26.2%) | 280 (71.2%) | 7 (1.8%) | 27.1% |
| gemini-3-pro-preview | 359 | 95 (26.5%) | 262 (73.0%) | 2 (0.6%) | 26.6% |

Temperature Aggregates (Both Models)

| Temperature | Runs | Pass All | Fail All | Avg Pass Rate |
|---|---|---|---|---|
| 0 | 186 | 50 (26.9%) | 133 (71.5%) | 27.4% |
| 0.3 | 189 | 51 (27.0%) | 135 (71.4%) | 27.5% |
| 0.7 | 188 | 50 (26.6%) | 135 (71.8%) | 27.2% |
| 1 | 189 | 47 (24.9%) | 139 (73.5%) | 25.3% |

Unique Problems Solved

| Model | Unique Problems Solved (any temp) |
|---|---|
| gemini-2.5-pro | 35 out of 100 |
| gemini-3-pro-preview | 38 out of 100 |

Key Findings

1. Temperature Has Minimal Impact on Success Rate

Across all temperatures, the pass rate varies by only 2-3 percentage points:

  • Lowest: 25.3% at T=1
  • Highest: 27.5% at T=0.3
  • Difference: 2.2 percentage points

A 2.2-point spread is well within sampling noise at this sample size (see the quick check below). The "optimal temperature for code" advice doesn't hold for novel problem-solving.
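A back-of-the-envelope check makes this concrete. This was not part of the original analysis, and it treats runs as independent (they are not quite, since both models pool over the same problems), so read it as a sanity check only:

```python
# Noise check for the 2.2-point spread between the best and worst
# temperatures: T=0.3 (51/189 pass-all) vs. T=1 (47/189 pass-all).
import math

def two_proportion_se(p1, n1, p2, n2):
    """Standard error of the difference between two independent proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

p_best, n_best = 51 / 189, 189    # T=0.3
p_worst, n_worst = 47 / 189, 189  # T=1

diff = p_best - p_worst
se = two_proportion_se(p_best, n_best, p_worst, n_worst)
print(f"difference = {diff:.3f}, SE = {se:.3f}, z = {diff / se:.2f}")
# Prints roughly: difference = 0.021, SE = 0.045, z = 0.47 -- far from significant.
```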

2. Models Perform Similarly Overall

  • gemini-2.5-pro: 27.1% average pass rate
  • gemini-3-pro-preview: 26.6% average pass rate

Despite being different model versions, they achieve nearly identical overall performance on these algorithmic challenges.

3. Best Temperature Differs by Model

  • gemini-2.5-pro: Best at T=0.3 (29.7% average pass rate)
  • gemini-3-pro-preview: Best at T=0.7 (32.6% average pass rate)

This suggests optimal temperature may be model-specific, not universal.

4. High Temperature (T=1) Shows Slight Degradation

In aggregate, temperature 1 is the weakest of the four settings:

  • Pass all: 24.9% (vs. 26-27% at the other temperatures)
  • This aligns with intuition: extreme randomness hurts precision tasks

5. Error Types Reveal Fundamental Challenges

| Error Type | Count | Percentage |
|---|---|---|
| Wrong Answer | 584 | 46% |
| Syntax Error | 371 | 29% |
| Runtime Error | 311 | 25% |

Wrong answers are the largest bucket of failures: the models can usually produce syntactically valid Python but struggle with algorithmic correctness. This is a reasoning limitation, not a generation one.
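A hedged sketch of how a single run can be bucketed into these categories (our evaluator's exact logic may differ; this shows the general approach, including the 10-second execution timeout noted under Reproducibility):

```python
# Classify one generated solution: compile to catch syntax errors, then
# execute against a test case in a subprocess and bucket the outcome.
import subprocess
import sys

def classify_run(code: str, stdin_data: str, expected: str) -> str:
    try:
        compile(code, "<solution>", "exec")  # cheap syntax check, no execution
    except SyntaxError:
        return "syntax_error"
    proc = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin_data, capture_output=True, text=True,
        timeout=10,  # subprocess.TimeoutExpired propagates to the caller
    )
    if proc.returncode != 0:
        return "runtime_error"
    if proc.stdout.strip() != expected.strip():
        return "wrong_answer"
    return "pass"
```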


Why Temperature Doesn't Matter for Problem-Solving

The Reasoning Bottleneck

Temperature controls sampling diversity from the probability distribution. But for competitive programming:

  1. The solution space is discrete: A problem either has a correct algorithm or doesn't
  2. Exploration doesn't help much: Random variations of wrong approaches are still wrong
  3. The bottleneck is reasoning, not generation: The model either understands the algorithm or it doesn't

Temperature helps when you need diverse phrasings of the same concept. It doesn't help when you need correct reasoning about a novel problem.
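Concretely, temperature divides the logits before the softmax, flattening or sharpening the token distribution. A standalone toy illustration (the logits are made up):

```python
# Toy demo: temperature rescales logits before softmax. Higher T flattens
# the distribution, but the ranking of tokens never changes -- it cannot
# surface a correct algorithm the model does not already rank highly.
import math

def softmax_with_temperature(logits, temperature):
    if temperature == 0:
        # Greedy decoding: all probability mass on the argmax token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in [0, 0.3, 0.7, 1.0]:
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```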

Contrast with Creative Tasks

Temperature matters for:

  • Writing style variation
  • Brainstorming ideas
  • Generating multiple alternatives

It doesn't matter for:

  • Algorithmic problem-solving
  • Mathematical reasoning
  • Tasks with single correct answers

Implications for Practitioners

1. Stop Obsessing Over Temperature for Code Generation

If your primary goal is correctness on algorithmic tasks, temperature tuning yields diminishing returns. Spend that optimization budget on:

  • Better prompts
  • More examples
  • Verification loops (sketched below)
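On the last item: a simple generate-and-verify loop usually buys more accuracy than any temperature setting. A sketch (the two helpers are stubs standing in for a model call and a test harness, not a real API):

```python
# Generate-and-verify loop: resample until a candidate passes the public
# tests or the attempt budget runs out. Helpers are illustrative stubs.
def generate_solution(problem: str, extra_context: str = "") -> str:
    raise NotImplementedError("stand-in for an LLM API call")

def run_public_tests(code: str, problem: str) -> tuple[bool, list[str]]:
    raise NotImplementedError("stand-in for a subprocess test harness")

def solve_with_verification(problem: str, max_attempts: int = 3) -> str | None:
    feedback = ""
    for _ in range(max_attempts):
        code = generate_solution(problem, extra_context=feedback)
        passed, failures = run_public_tests(code, problem)
        if passed:
            return code
        # Feed the failing cases back into the next attempt.
        feedback = f"Previous attempt failed these tests: {failures}"
    return None  # no verified solution within budget
```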

2. Model Selection > Temperature Tuning

The 35 vs. 38 unique problems solved is a roughly 9% relative capability gap between the two models, larger than anything the temperature knob produced.

3. Use T=0 or T=0.3 as Default

While the difference is small, lower temperatures show marginally better performance. Use:

  • T=0: For maximum consistency and reproducibility
  • T=0.3: If you want slight variation without sacrificing accuracy

4. Avoid T=1 for Precision Tasks

The slight degradation at T=1 (24.9% vs. roughly 27% at the other settings) is small and within noise, but there is no upside to offset it on precision tasks. Reserve high temperature for creative work.


Methodology Notes

Why Some Runs Failed

9 runs (1.1%) failed due to persistent API timeouts on specific problems. These appear to be related to:

  • Particular problem content triggering long processing
  • Not problem size (all problems were <3KB)
  • Possibly model-specific issues with certain input patterns

We excluded these from pass rate calculations to avoid biasing results.

Reproducibility

  • All runs used the same natural language prompt format
  • Random seed was not controlled (temperature already introduces randomness)
  • Test case evaluation used Python 3 subprocess execution with 10-second timeout

Conclusion

When LLMs face genuinely novel problems requiring algorithmic reasoning, temperature is a marginal factor. The difference between T=0 and T=1 is smaller than the difference between solving and not solving a problem.

The lesson: for code generation tasks, focus on model capabilities and prompt quality. Temperature is marginal fine-tuning, not the lever that determines success or failure.


Raw Data

The complete experiment data (752 evaluation files) is available in the experiment results. Key statistics:

  • Total evaluated: 743 runs with results + 9 timeouts
  • Models: gemini-2.5-pro, gemini-3-pro-preview
  • Temperatures: 0, 0.3, 0.7, 1
  • Problems: 100 from LiveCodeBench (post-training-cutoff)