Why do large language models explore less effectively than humans?

This research investigates why LLMs make decisions too quickly during open-ended exploration tasks. It examines whether the problem lies in training data, prompt engineering, or something deeper in how transformer architectures process information over time.

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Large Language Models Think Too Fast To Explore Effectively" uses Little Alchemy 2 as an open-ended exploration benchmark. Most LLMs underperform humans — they rely heavily on uncertainty-driven strategies (reducing ambiguity, exploiting known information) while humans balance uncertainty with empowerment (maximizing future possibilities, intrinsic discovery).

The mechanistic explanation comes from sparse autoencoder (SAE) decomposition of the model's activations. Uncertainty values dominate the early transformer blocks, and choices correlated with immediate outcomes are also represented early. Empowerment values, which represent the potential for future discovery, emerge only in the middle blocks. This ordering mismatch across depth means the model has in effect committed to a decision based on uncertainty before the empowerment signal is available to inform it.
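
The kind of layer-by-layer question asked here ("at which block does a value first become readable?") can be sketched as follows. This is a synthetic illustration only: the activations are random stand-ins rather than real SAE features, the choice of block 1 for uncertainty and block 6 for empowerment in a 12-block model is an assumption made for the demo, and `pearson` and `first_decodable_layer` are hypothetical helpers.

```python
import random

def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def first_decodable_layer(target, feature_by_layer, threshold=0.5):
    # Return the first layer whose feature correlates with the target value.
    for layer, feats in enumerate(feature_by_layer):
        if abs(pearson(feats, target)) >= threshold:
            return layer
    return None

rng = random.Random(0)
n_trials, n_layers = 200, 12
uncertainty = [rng.random() for _ in range(n_trials)]
empowerment = [rng.random() for _ in range(n_trials)]

# Synthetic stand-ins for one feature per layer (assumed onsets: 1 and 6).
u_feat = [[(u if layer >= 1 else 0.0) + rng.gauss(0, 0.1) for u in uncertainty]
          for layer in range(n_layers)]
e_feat = [[(e if layer >= 6 else 0.0) + rng.gauss(0, 0.1) for e in empowerment]
          for layer in range(n_layers)]

print("uncertainty first readable at block", first_decodable_layer(uncertainty, u_feat))
print("empowerment first readable at block", first_decodable_layer(empowerment, e_feat))
```

On real models the features would come from a trained SAE and the onset blocks would be measured, not set by hand; the sketch only shows the shape of the comparison.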

The result is "thinking too fast": premature decisions that prioritize short-term utility over deeper exploration. This is not a training data issue, and surface-level fixes did not help: neither prompt engineering nor activation intervention improved the performance of traditional (non-reasoning) LLMs. The architecture processes short-term signals before long-term signals, and decisions are made on whichever signal arrives first.

The o1 exception is revealing. OpenAI's o1 surpasses human performance on this task. This suggests that reasoning training — specifically the extended chain-of-thought processing — creates enough computational delay for empowerment signals to influence decisions. The model isn't given new exploration capability; it is given more processing time for the empowerment representations to participate in the decision.
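
The timing argument can be illustrated with a toy model in which each signal only becomes usable after a certain number of processing steps. This is a sketch of the hypothesis as stated in this note, not a description of how o1 works: the option values, the arrival steps in `ARRIVAL`, and the `decide` function are all invented for illustration.

```python
# Two hypothetical options with made-up signal values.
options = {
    "safe": {"uncertainty": 0.9, "empowerment": 0.2},
    "open": {"uncertainty": 0.4, "empowerment": 1.0},
}

# Assumed processing step at which each signal becomes usable.
ARRIVAL = {"uncertainty": 1, "empowerment": 6}

def decide(budget):
    # Score each option using only the signals that have arrived within the budget.
    def score(name):
        return sum(value for signal, value in options[name].items()
                   if ARRIVAL[signal] <= budget)
    return max(options, key=score)

print(decide(2))  # short budget: only uncertainty has arrived
print(decide(8))  # longer budget: empowerment also participates
```

Nothing about the options or the scoring rule changes between the two calls; only the processing budget does, which mirrors the claim that reasoning training gives an existing capability time to participate.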

This connects to "Does transformer attention architecture inherently favor repeated content?", which asks whether soft attention's over-weighting of repeated tokens explains sycophancy. Both findings locate behavioral failures in architectural processing order rather than training data. Sycophancy is partly an attention-weighting problem; premature exploration decisions are partly a block-ordering problem. Both suggest that some behavioral deficits require architectural solutions, not just better training.

The connection to "Do base models already contain hidden reasoning ability?" adds nuance: empowerment representations exist in the model (in the middle blocks). They are not absent; they are outpaced. Reasoning training doesn't add exploration capability; it gives existing capability time to participate.

Related concepts in this collection 6

Does transformer attention architecture inherently favor repeated content? Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
both locate behavioral failures in architecture, not training
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
empowerment capability exists but is outpaced; reasoning training gives it time
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
o1's exploration superiority may be another instance of RL teaching timing, not capability
Do reasoning models switch between ideas too frequently? Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.
behavioral manifestation of the same architectural problem: "thinking too fast" at the block level (uncertainty dominates before empowerment arrives) produces premature thought switching at the decoding level (the model abandons promising paths before reaching sufficient depth); the success of TIP, a decoding-time penalty on thought switching, suggests decoding-time intervention can partially compensate for the architecture's processing-order bias
Why do reasoning LLMs fail at deeper problem solving? Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.
connects: wandering is the exploration-level consequence of premature decisions; if the model commits to directions before empowerment signals can weigh in on long-term potential, it will explore unsystematically; the o1 exception fits this account, as o1 both explores more systematically (running counter to the wandering thesis) and has time for empowerment signals to inform its choices (this note)
Why do LLMs struggle with exploration in simple decision tasks? This explores why large language models fail at exploration—a core decision-making capability—even when they excel at other tasks, and what specific conditions might help them succeed.
behavioral evidence for the same exploration deficit: even with explicit hints, LLMs fail to explore in bandit environments without external history summarization; the empowerment-timing mechanism explains why — the model commits to exploitation before the exploration signal is processed, and external summarization bypasses this by converting the exploration problem into a structured decision that doesn't require empowerment-level processing
