INQUIRING LINE

Do larger language models overcome greediness in sequential decision-making?

This explores whether scaling up model size fixes the tendency of LLMs to act 'greedily' in sequential decisions — grabbing the locally best option instead of exploring — and the corpus speaks to this obliquely, mostly by showing that scale rarely cures a *structural* bias.


This reads the question as: when an LLM has to make a sequence of choices, it tends to behave greedily — taking the immediately rewarding move rather than the one that pays off later — and you're asking whether bigger models simply grow out of it. The collection doesn't have a paper that runs the exact bandit-or-exploration experiment, so I'll be upfront about that. But several notes converge on a more interesting answer: the failures that look like greediness tend to be *structural*, and structural failures don't reliably dissolve with scale.
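To make 'greedy' concrete, here is a minimal sketch of the kind of toy bandit the question has in mind. It is not taken from any paper in the collection: the arm payoffs, the function names (`run`, `greedy`, `explore_then_commit`) and the step count are all hypothetical, and rewards are made deterministic so the effect is easy to see.

```python
from typing import Callable, List

# Hypothetical toy bandit: each arm pays a fixed reward the agent does not know in advance.
ARM_REWARDS: List[int] = [3, 5, 9]

Policy = Callable[[List[int], List[int]], int]


def greedy(estimates: List[int], pulls: List[int]) -> int:
    """Pick the arm with the highest current estimate (ties go to the lowest index)."""
    return max(range(len(estimates)), key=lambda arm: estimates[arm])


def explore_then_commit(estimates: List[int], pulls: List[int]) -> int:
    """Try every arm once, then behave greedily on what was observed."""
    for arm, count in enumerate(pulls):
        if count == 0:
            return arm
    return greedy(estimates, pulls)


def run(policy: Policy, steps: int = 100) -> int:
    """Play the bandit for `steps` rounds and return the total reward collected."""
    estimates = [0] * len(ARM_REWARDS)  # the agent's current belief about each arm
    pulls = [0] * len(ARM_REWARDS)      # how often each arm has been tried
    total = 0
    for _ in range(steps):
        arm = policy(estimates, pulls)
        reward = ARM_REWARDS[arm]
        pulls[arm] += 1
        estimates[arm] = reward  # deterministic payoffs: one pull reveals the arm
        total += reward
    return total


if __name__ == "__main__":
    print("greedy:", run(greedy))
    print("explore then commit:", run(explore_then_commit))
```

The greedy policy locks onto the first arm that pays anything and never learns that a better one exists, while three rounds of exploration are enough to find the best arm. Nothing in the gap depends on how capable the agent is; it comes entirely from the decision procedure.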

The sharpest evidence is the finding that what looks like good reasoning is often just a default. Across fourteen models, twelve performed *worse* when constraints were removed: they were defaulting to the harder, safer option rather than evaluating the situation, and that 'conservative bias' was hiding behind apparent reasoning success [Are models actually reasoning about constraints or just defaulting conservatively?]. That's the cousin of greediness: a fixed policy masquerading as deliberation. Scale didn't wash it out. In the same spirit, framing LLMs as autoregressive probability machines predicted *which* tasks they'd fail on: low-probability targets stay hard even when they're logically trivial, and the difficulty tracks the architecture, not the parameter count [Can we predict where language models will fail?].

The strategic-reasoning work cuts against a simple 'bigger is less greedy' story from another angle. When 22 models were tested on behavioral game theory tasks, performance correlated with *game structure*, not raw reasoning depth, and different frontier models settled into distinct fixed styles: minimax, trust-based, belief-anticipation [Do large language models use one reasoning style or many?]. So a model isn't 'greedy' or 'not greedy' in the abstract; its myopia is conditional on the decision's shape. That's a hint that you fix greediness by changing the decision procedure, not by adding parameters.

And that's where the corpus points toward what actually helps: making the model decide *how* to decide. The 'learn when to think versus answer fast' work trains a single model to route between extended deliberation and a quick response, instead of always reaching for one mode [Can models learn when to think versus respond quickly?]. Calibration is the other lever: small models trained to know when they're uncertain and abstain matched models ten times their size, which says the missing ingredient is a learned sense of when the obvious move is wrong, not sheer capacity [Can models learn to abstain when uncertain about predictions?]. Even reward design matters here: using the model's own confidence as a training signal restores calibration *while* improving step-by-step reasoning [Can model confidence work as a reward signal for reasoning?].
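As a rough illustration of 'deciding how to decide', the sketch below gates a quick answer, a slower deliberation pass, and abstention on a confidence score. It is a hand-written threshold rule, not the learned routing or training objectives from the notes above; `Draft`, `decide`, and both thresholds are hypothetical names and values, and the confidence is simply assumed to be calibrated.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    answer: str
    confidence: float  # assumed to be a calibrated probability in [0, 1]


def decide(
    quick: Draft,
    deliberate: Callable[[], Draft],
    fast_threshold: float = 0.8,     # hypothetical cutoff for trusting the quick answer
    abstain_threshold: float = 0.4,  # hypothetical cutoff below which we decline to answer
) -> Optional[str]:
    """Return the quick answer, a deliberated answer, or None to abstain."""
    if quick.confidence >= fast_threshold:
        return quick.answer          # the obvious move looks safe: answer fast
    revised = deliberate()           # otherwise pay for the slower pass
    if revised.confidence < abstain_threshold:
        return None                  # still unsure after thinking: abstain
    return revised.answer


if __name__ == "__main__":
    print(decide(Draft("A", 0.9), lambda: Draft("B", 0.95)))
    print(decide(Draft("A", 0.5), lambda: Draft("B", 0.7)))
    print(decide(Draft("A", 0.5), lambda: Draft("B", 0.2)))
```

The whole rule is only as good as the confidence it reads, which is why calibration and routing show up together in the notes: a miscalibrated model would sail past the first check with the obvious answer every time.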

The quiet payoff: the collection reframes 'greediness' as a calibration-and-procedure problem rather than a size problem. If you came expecting 'yes, GPT-N+1 explores better,' the more useful takeaway is that the cure shows up when a model learns *when to deliberate* and *when to doubt the obvious answer*, and those capacities already exist, undertrained, even in small models.


Sources (6 notes)

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Do large language models use one reasoning style or many?

Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: do larger language models overcome greediness—myopic, immediately-rewarding choice-making—in sequential decision tasks?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable constraints to be re-tested:

• Conservative bias hides behind apparent reasoning success: across 14 models, 12 performed *worse* when constraints were removed (drops of up to 38.5 percentage points), defaulting to harder/safer options rather than evaluating context; scale did not wash it out (~2024–2026).
• Difficulty in reaching low-probability targets (even logically trivial ones) tracks the autoregressive architecture, not parameter count; structural limitations don't dissolve with scale (~2024).
• Strategic reasoning performance correlates with game structure, not raw reasoning depth; across 22 models, frontier models settled into distinct fixed styles (minimax, trust-based, belief-anticipation) (~2025).
• Models trained to *route* between extended deliberation and fast answers, learning when to think versus answer immediately, avoid collapsing into a single mode and self-calibrate without explicit difficulty labels (~2025).
• Calibration (a learned sense of when the obvious move is wrong) lets small models trained to abstain match pre-trained models 10x their size, and confidence-as-reward training strengthens step-by-step reasoning while restoring calibration (~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.10601 (2023) — Tree of Thoughts
• arXiv:2402.03284 (2024) — Forecasting Uncertainty in Conversations
• arXiv:2502.20432 (2025) — LLM Strategic Reasoning via Behavioral Game Theory
• arXiv:2505.13379 (2025) — Thinkless: Learn When to Think

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For conservative bias, architectural ceiling on low-probability targets, and game-structure dominance, judge whether recent advances in instruction-tuning, post-training RL (especially self-feedback), mixture-of-experts routing, or multi-step verification have relaxed these limits. Separate the durable question—does scale address *decision-procedure* myopia?—from perishable limitations. Cite what resolved them, and flag where constraints still hold.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** If newer models or methods show that scale *does* enable exploration or lookahead without explicit routing, center that tension.

(3) **Propose 2 research questions that ASSUME the regime has moved:** e.g., does learned routing generalize across task families? Does self-feedback RL recover optimal stopping without explicit training signals?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
