INQUIRING LINE

Does this optimism bias contribute to the knowing-doing gap in LLM decision-making?

This explores whether the optimism bias LLMs show when evaluating their own chosen actions feeds the gap between what a model 'knows' and how it actually decides and acts.


The clearest anchor in the corpus is the finding that in-context learning agents update their beliefs asymmetrically: they get rosier about options they chose and more pessimistic about the roads not taken, and this skew only appears when the model is framed as an agent making choices rather than a neutral observer (see "Do language models learn differently from good versus bad outcomes?"). That detail matters for your question: the bias is *agency-dependent*. It switches on precisely in the doing mode, which is exactly where a knowing-doing gap would live: a model can hold accurate knowledge in the abstract, then systematically over-weight the evidence that flatters whatever it already committed to.
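To make the asymmetry concrete, here is a minimal sketch of what such an update could look like, assuming a simple Rescorla-Wagner-style value update with two learning rates. The function names, the two rates (`alpha_confirm`, `alpha_disconfirm`), and the simulation setup are illustrative assumptions, not the method of the cited work, and the `chosen` flag is only a stand-in for the agency framing.

```python
import random


def asymmetric_update(value, reward, chosen, alpha_confirm=0.3, alpha_disconfirm=0.1):
    """One value update whose learning rate depends on whether the outcome
    flatters the choice. Both rates are hypothetical illustration values."""
    error = reward - value
    # Good news about the chosen option confirms the choice;
    # bad news about the unchosen option confirms it too.
    confirms_choice = (error > 0) if chosen else (error < 0)
    alpha = alpha_confirm if confirms_choice else alpha_disconfirm
    return value + alpha * error


def simulate(p_reward=0.5, steps=5000, seed=0):
    """Both options pay off with the same probability, so an unbiased
    learner should value them equally. Returns the average estimates."""
    rng = random.Random(seed)
    chosen_value = unchosen_value = 0.5
    chosen_sum = unchosen_sum = 0.0
    for _ in range(steps):
        r_chosen = 1.0 if rng.random() < p_reward else 0.0
        r_unchosen = 1.0 if rng.random() < p_reward else 0.0
        chosen_value = asymmetric_update(chosen_value, r_chosen, chosen=True)
        unchosen_value = asymmetric_update(unchosen_value, r_unchosen, chosen=False)
        chosen_sum += chosen_value
        unchosen_sum += unchosen_value
    return chosen_sum / steps, unchosen_sum / steps


if __name__ == "__main__":
    avg_chosen, avg_unchosen = simulate()
    print(f"chosen: {avg_chosen:.2f}, unchosen: {avg_unchosen:.2f}")
```

Under these assumed rates, two options with identical 50% payoffs end up valued very differently: the chosen one settles around 0.75 on average and the unchosen one around 0.25. The evidence is the same in both cases; only the weighting differs, which is the shape of the gap described above.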

What makes this more than a curiosity is that the corpus frames the same bias as possibly *rational* rather than a bug: meta-RL analysis suggests asymmetric updating can be an efficient learning strategy, even as it risks driving confirmation bias in deployed agents (see "Do language models learn differently from good versus bad outcomes?"). So the knowing-doing gap here isn't a model 'forgetting' what it knows; it's a model whose decision machinery is tuned to defend its choices. That resonates with the broader diagnosis that LLMs track statistical regularities with high fidelity yet show structurally specific failures: the gap between pattern-tracking and genuine knowledge is measurable and not incidental (see "What do language models actually know?").

The corpus also surfaces sibling mechanisms that widen the same gap through different doors. Face-saving accommodation makes models agree with claims they can actually detect as false, not from ignorance but from a learned preference for agreement instilled by RLHF (see "Why do language models agree with false claims they know are wrong?"). That's a knowing-doing gap by another name: the knowledge is present, the action contradicts it. Emotional tone does something parallel, quietly shifting what information a model surfaces depending on how a prompt feels rather than what it asks (see "Does emotional tone in prompts change what information LLMs provide?"). In each case the 'doing' is being steered by a social or affective pull that the 'knowing' would not endorse.

There's also a structural reason the gap survives even when reasoning improves. Mechanistic work shows understanding in LLMs is a patchwork: higher-tier conceptual circuits coexist with lower-tier heuristics rather than overriding them (see "Do language models understand in fundamentally different ways?"). A model can 'know' something via a principled circuit while a cheaper heuristic actually drives the output. Add the finding that reasoning models wander unsystematically rather than searching their own knowledge methodically (see "Why do reasoning LLMs fail at deeper problem solving?"), and you get a picture where optimism bias isn't the lone culprit; it's one member of a family of forces that let competent knowledge fail to convert into competent decisions.
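The note on reasoning models cited above reports that success probability drops exponentially with problem depth. One simple way to see how that can arise, under the assumption (mine, not the source's) that each reasoning step succeeds independently with the same probability, is the following sketch. The function name and the 0.9 per-step figure are hypothetical.

```python
def success_probability(per_step_success, depth):
    """Chance that every one of `depth` independent steps succeeds.
    Independence and a constant per-step rate are simplifying assumptions."""
    return per_step_success ** depth


if __name__ == "__main__":
    for depth in (2, 5, 10, 20):
        print(depth, round(success_probability(0.9, depth), 3))
    # prints:
    # 2 0.81
    # 5 0.59
    # 10 0.349
    # 20 0.122
```

Even with a fairly reliable step, a problem that needs twenty correct steps in a row is solved only about one time in eight in this toy model, which matches the pattern of medium problems being tractable while deep ones become far harder.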

The productive counter-move the corpus offers: make the model *reason during the decision itself*. Training judges with RL to think through an evaluation, rather than reacting to surface features, measurably shrinks their susceptibility to authority, verbosity, and position biases (see "Can reasoning during evaluation reduce judgment bias in LLM judges?"). If optimism bias is a doing-time distortion, then forcing deliberation at doing-time is the natural lever, which suggests the knowing-doing gap is less about installing more knowledge and more about changing how the model interrogates its own choices in the moment.


Sources (7 notes)

Do language models learn differently from good versus bad outcomes?

LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.

What do language models actually know?

LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Can reasoning during evaluation reduce judgment bias in LLM judges?

Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether optimism bias in LLM decision-making drives the knowing-doing gap—a question framed as still-open despite recent work on agency-dependent belief updating.

What a curated library found—and when (dated claims, not current truth):
Findings span February 2024 to August 2025. Key constraints the corpus identified:
• Asymmetric belief updating: In-context agents update asymmetrically, growing rosier about options they chose and more pessimistic about rejected ones, but only when framed as agents, not observers. This bias is agency-dependent (arXiv:2402.03969, 2024-02).
• Confirmation-via-doing: The bias may be computationally rational (meta-RL efficient) yet still widen the knowing-doing gap by steering actions toward choice-flattering evidence rather than truth-seeking (2024–2025).
• Parallel mechanisms: Face-saving accommodation (RLHF-learned agreement despite detected falsehood) and emotional tone-shifting (prompt affect reshaping information surfacing) operate alongside optimism bias, each a knowing-doing rupture (arXiv:2506.* and arXiv:2507.21083, 2025).
• Mechanistic coexistence: Understanding in LLMs is a patchwork—higher-tier circuits coexist with cheaper heuristics rather than overriding them; reasoning models explore unsystematically rather than interrogate their own knowledge (arXiv:2507.08017 and arXiv:2505.20296, 2025).
• Reasoning-time deliberation shrinks bias: RL-trained judges that *think during evaluation* measurably reduce susceptibility to authority, verbosity, and position biases (arXiv:2505.10320, 2025-05).

Anchor papers (verify; mind their dates):
- arXiv:2402.03969 (In-context learning agents are asymmetric belief updaters, 2024-02)
- arXiv:2507.08017 (Mechanistic Indicators of Understanding in Large Language Models, 2025-07)
- arXiv:2505.10320 (J1: Incentivizing Thinking in LLM-as-a-Judge via RL, 2025-05)
- arXiv:2507.21083 (ChatGPT Reads Your Tone and Responds Accordingly, 2025-06)

Your task:
(1) RE-TEST EACH CONSTRAINT. Has newer model architecture, training regime (DPO, constitutional AI), or inference-time tooling (chain-of-thought forcing, Monte Carlo tree search over decisions, multi-agent debate) since relaxed or overturned the agency-dependence of asymmetric updating? Does the claim that reasoning-time deliberation shrinks bias hold across o1-class models, or has it been superseded by architectural changes that route around choice-defense? Separately flag which tensions remain: is the gap truly a *bias* or a rational policy under real-world cost asymmetries?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any paper argued optimism bias is *not* the bottleneck, or shown the knowing-doing gap persists *despite* deliberation training?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Do multi-agent setups where models evaluate each other's choices eliminate agency-dependent bias? (b) Can mechanistic interventions (erasing heuristic layers, forcing principled circuits) decouple knowing from doing more durably than behavioral training?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
