SYNTHESIS NOTE

Why do large language models explore less effectively than humans?

This research investigates why LLMs make decisions too quickly during open-ended exploration tasks. It examines whether the problem lies in training data, prompt engineering, or something deeper in how transformer architectures process information over time.

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search

The paper "Large Language Models Think Too Fast To Explore Effectively" uses Little Alchemy 2, a game in which players combine elements to discover new ones, as an open-ended exploration benchmark. Most LLMs underperform humans: they rely heavily on uncertainty-driven strategies (reducing ambiguity, exploiting known information), while humans balance uncertainty with empowerment (maximizing future possibilities, intrinsic discovery).
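The difference between the two strategies can be made concrete with a toy scoring rule. This is a minimal sketch under stated assumptions, not the paper's model: the count-based uncertainty bonus, the "expected follow-up combinations" definition of empowerment, the weights, and every candidate value below are hypothetical and chosen only for illustration.

```python
import math

def uncertainty_value(times_tried: int, total_trials: int) -> float:
    """Hypothetical count-based bonus: less-tried pairs score higher."""
    return math.sqrt(math.log(total_trials + 1) / (times_tried + 1))

def empowerment_value(p_success: float, n_future_combos: int) -> float:
    """Hypothetical empowerment: expected number of follow-up combinations unlocked."""
    return p_success * n_future_combos

def choose(candidates, w_uncertainty: float, w_empowerment: float):
    """Return the pair with the highest weighted sum of the two values."""
    def score(c):
        return (w_uncertainty * uncertainty_value(c["tried"], c["total"])
                + w_empowerment * empowerment_value(c["p_success"], c["future"]))
    return max(candidates, key=score)["pair"]

# Invented candidates: a never-tried dead end and a well-worn but versatile pair.
CANDIDATES = [
    {"pair": ("stone", "stone"), "tried": 0, "total": 20, "p_success": 0.2, "future": 1},
    {"pair": ("water", "earth"), "tried": 5, "total": 20, "p_success": 0.6, "future": 10},
]

uncertainty_only = choose(CANDIDATES, w_uncertainty=1.0, w_empowerment=0.0)
balanced = choose(CANDIDATES, w_uncertainty=1.0, w_empowerment=1.0)
print(uncertainty_only, balanced)
```

With the empowerment weight at zero, the chooser favors the untried pair even though its result leads nowhere; with both weights active, it favors the pair whose result opens up more future combinations.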

The mechanistic explanation comes from sparse autoencoder (SAE) decomposition of the model's activations. Uncertainty values dominate early transformer blocks, and choices correlated with immediate outcomes are also represented early. Empowerment values, which represent the potential for future discovery, emerge only in middle blocks. Because of this ordering mismatch, the model has in effect committed to a decision based on uncertainty before the empowerment signal is available to inform it.
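For readers unfamiliar with SAE decomposition, a commonly used formulation is sketched here. It is an assumption that the paper's SAE follows this standard form; the symbols (x_ℓ for the activation at block ℓ, the encoder and decoder weights W_e and W_d, the biases b_e and b_d, and the sparsity weight λ) are introduced for illustration only.

```latex
\begin{aligned}
f_\ell &= \mathrm{ReLU}(W_e x_\ell + b_e) \\
\hat{x}_\ell &= W_d f_\ell + b_d \\
\mathcal{L} &= \lVert x_\ell - \hat{x}_\ell \rVert_2^2 + \lambda \lVert f_\ell \rVert_1
\end{aligned}
```

Under this reading, saying that a value "emerges" at block ℓ means that some sparse feature in f_ℓ tracks that value at that depth. How the paper measures that relationship is not specified in this note.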

The result is "thinking too fast": premature decisions that prioritize short-term utility over deeper exploration. This does not look like a training data issue: neither prompt engineering nor activation intervention improved traditional LLM performance. The architecture processes short-term signals before long-term signals, and decisions are made on whichever signal arrives first.
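The ordering argument can be illustrated with a deliberately crude toy. This is a cartoon under stated assumptions, not a model of a real transformer: the onset block indices, the option names, and all values are hypothetical, and "committing at a block" stands in for the claim that the decision is settled on whichever signals are available at that depth.

```python
# Hypothetical block index at which each value signal becomes readable.
SIGNAL_ONSET = {"uncertainty": 2, "empowerment": 12}

# Hypothetical value of each option under each signal.
OPTIONS = {
    "least_tried_pair": {"uncertainty": 0.9, "empowerment": 0.1},
    "versatile_pair": {"uncertainty": 0.4, "empowerment": 1.5},
}

def decide(commit_block: int) -> str:
    """Pick the best option using only signals that have emerged by commit_block."""
    available = [s for s, onset in SIGNAL_ONSET.items() if onset <= commit_block]

    def score(option: str) -> float:
        return sum(OPTIONS[option][s] for s in available)

    return max(OPTIONS, key=score)

early = decide(commit_block=4)    # only the uncertainty signal is available
late = decide(commit_block=16)    # both signals are available
print(early, late)
```

In this toy, nothing about the option values changes between the two calls. Only the commit point moves, and that alone flips the choice from the uncertainty-favored option to the empowerment-favored one.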

The o1 exception is revealing. OpenAI's o1 surpasses human performance on this task. This suggests that reasoning training — specifically the extended chain-of-thought processing — creates enough computational delay for empowerment signals to influence decisions. The model isn't given new exploration capability; it is given more processing time for the empowerment representations to participate in the decision.

This connects to the note "Does transformer attention architecture inherently favor repeated content?". Both findings locate behavioral failures in architectural processing order rather than training data. Sycophancy is partly an attention-weighting problem; premature exploration decisions are partly a block-ordering problem. Both suggest that some behavioral deficits require architectural solutions, not just better training.

The connection to the note "Do base models already contain hidden reasoning ability?" adds nuance: empowerment representations exist in the model (middle blocks). They are not absent; they are outpaced. Reasoning training doesn't add exploration capability; it gives existing capability time to participate.

Original note title

traditional llms lack empowerment-driven exploration because uncertainty values dominate early transformer blocks causing premature decisions