# Structured Stochasticity

An experimental framework for testing whether injecting structured noise into LLM hidden states can mitigate reasoning collapse on complex algorithmic tasks.

## Motivation

The working hypothesis: contemporary reasoning models collapse on complex tasks not because of fundamental capacity limits, but because single-trajectory deterministic inference commits early to suboptimal representations and offers no recovery mechanism. Introducing structured stochasticity (noise vectors injected into the inference flow) enables trajectory resampling, analogous to how humans naturally reframe a problem after hitting a cognitive dead end.
```
Standard (weak stochasticity):
    Input → [Deterministic h] → Sample Output

Proposed (strong stochasticity):
    Input + z → [Stochastic h] → Sample Output
            ↑
            z ~ P(z|X)   (latent noise)
```
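The injected-noise path can be prototyped in a few lines with a PyTorch forward hook. This is a minimal sketch, not the repo's `hooks.py` implementation; the toy layer, hook name, and scale are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def gaussian_noise_hook(scale: float):
    """Forward hook that perturbs a layer's output hidden state with
    isotropic Gaussian noise -- a simple stand-in for z ~ P(z|X)."""
    def hook(module, inputs, output):
        return output + scale * torch.randn_like(output)
    return hook

layer = nn.Linear(8, 8)  # stand-in for a transformer block
handle = layer.register_forward_hook(gaussian_noise_hook(0.1))

x = torch.randn(1, 8)
y1, y2 = layer(x), layer(x)   # same input, two different trajectories
handle.remove()
y3, y4 = layer(x), layer(x)   # deterministic again once the hook is removed

print(torch.allclose(y1, y2))  # False
print(torch.allclose(y3, y4))  # True
```

Because a hook returning a value replaces the layer's output, the same mechanism scales to real transformer blocks without modifying model code.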
## Repository Structure

```
structured-stochasticity/
├── src/structured_stochasticity/
│   ├── __init__.py
│   ├── injection.py          # Noise injection strategies
│   ├── hooks.py              # PyTorch forward hooks for hidden state access
│   ├── tasks.py              # Benchmark tasks (Tower of Hanoi, etc.)
│   ├── evaluation.py         # Metrics and evaluation logic
│   └── experiment.py         # Main experiment runner
├── configs/
│   └── default.yaml          # Default experiment configuration
├── experiments/              # Saved experiment results
├── notebooks/                # Analysis notebooks
├── tests/
│   └── test_injection.py     # Unit tests
├── requirements.txt
├── setup.py
└── README.md
```
## Installation

```shell
git clone https://github.com/isztldav/structured-stochasticity.git
cd structured-stochasticity
pip install -e .
```

## Quick Start

```python
from structured_stochasticity import NoisyInferenceWrapper, TowerOfHanoi
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Wrap with noise injection
noisy_model = NoisyInferenceWrapper(
    model,
    injection_layers=[0, 1, 2],   # Early layers
    noise_scale=0.1,
    injection_mode="continuous",  # or "once"
)

# Run experiment
task = TowerOfHanoi(num_disks=4)
results = noisy_model.solve_with_trajectories(
    task,
    tokenizer,
    k_trajectories=5,
    selection_method="majority_vote",
)
```

## Configuration

Edit `configs/default.yaml`:
```yaml
model:
  name: "meta-llama/Llama-3.2-1B"
  device: "cuda"

injection:
  layers: [0, 1, 2, 3]        # Which layers to inject noise into
  scale: 0.1                  # Noise magnitude
  mode: "continuous"          # "once" | "continuous" | "annealed"
  anneal_factor: 0.95         # For annealed mode

task:
  name: "tower_of_hanoi"
  complexity_range: [3, 8]    # Min/max disks

evaluation:
  k_trajectories: [1, 3, 5, 10, 20]
  selection: "majority_vote"  # "majority_vote" | "verifier" | "best_of_k"
  num_trials: 50
```
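For the `annealed` mode, one plausible reading of `anneal_factor` is a geometric decay of the noise scale per generation step. A sketch (illustrative only; the actual decay rule lives in `injection.py`):

```python
def annealed_scale(base_scale: float, anneal_factor: float, step: int) -> float:
    # Geometric decay: strong exploration early, near-deterministic later.
    return base_scale * anneal_factor ** step

# With the defaults above (scale=0.1, anneal_factor=0.95):
print(round(annealed_scale(0.1, 0.95, 0), 4))   # 0.1
print(round(annealed_scale(0.1, 0.95, 10), 4))  # 0.0599
print(round(annealed_scale(0.1, 0.95, 50), 4))  # 0.0077
```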
## Running Experiments

```shell
# Single experiment
python -m structured_stochasticity.experiment --config configs/default.yaml

# Sweep over noise scales
python -m structured_stochasticity.experiment \
    --config configs/default.yaml \
    --sweep injection.scale 0.01 0.05 0.1 0.2 0.5
```

## Research Questions
- Does K-trajectory sampling improve the maximum solvable complexity?
  - Compare accuracy vs. complexity curves for K = 1, 5, 10, 20
- Where should noise be injected?
  - Early layers (problem framing) vs. late layers (output realization)
- When should noise be injected?
  - Once at the start vs. continuously vs. annealed
- What noise magnitude works best?
  - Sweep over scales; the sweet spot is expected to vary with task difficulty
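The K-trajectory question above reduces to sampling K candidate answers and aggregating them. The `majority_vote` selection step can be sketched as follows (an illustrative sketch, not the repo's `evaluation.py`):

```python
from collections import Counter

def majority_vote(trajectories: list[str]) -> str:
    """Pick the most common final answer across K sampled trajectories.
    Counter.most_common breaks ties in first-seen order."""
    return Counter(trajectories).most_common(1)[0][0]

# Five noisy trajectories for the same Tower of Hanoi instance:
answers = ["15 moves", "15 moves", "17 moves", "15 moves", "31 moves"]
print(majority_vote(answers))  # 15 moves
```

The `verifier` and `best_of_k` options listed in the config would replace this frequency count with an external scoring step.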
## Citation

If you use this framework, please cite:

```bibtex
@misc{isztl2025structured,
  author = {Isztl, Dávid},
  title  = {Beyond Single-Trajectory Reasoning: Structured Stochasticity as a Remedy for Reasoning Collapse in Large Language Models},
  year   = {2025}
}
```

## License

MIT
## References

- Shojaee et al. (2025). "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity"