How does this motivational bias connect to LLMs' causal reasoning failures?
This explores how the agency-dependent optimism bias — where LLMs update beliefs more readily from chosen-action outcomes — relates to the systematic ways LLMs misjudge cause and effect, and whether the two share a common root.
This reads the "motivational bias" as the agency-dependent asymmetric belief updating that LLMs show: an optimism bias toward the outcomes of actions they "chose" and pessimism about the alternatives (see "Do language models learn differently from good versus bad outcomes?"). The interesting move is to ask whether that motivational tilt is a separate quirk from the corpus's causal-reasoning failures, or whether the two are faces of the same inheritance. The corpus points toward the latter.
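To make "agency-dependent asymmetric belief updating" concrete, here is a minimal sketch of one way such an asymmetry is often modeled: a delta-rule update whose learning rate depends on the sign of the prediction error and on whether the option was chosen. The delta-rule form, the function name `update`, and the learning rates 0.3 and 0.1 are illustrative assumptions, not the method or parameters of the cited work.

```python
def update(value, outcome, chosen, lr_large=0.3, lr_small=0.1):
    """One delta-rule step with agency-dependent asymmetric learning rates.

    Hypothetical parameters: lr_large and lr_small are illustrative values,
    not estimates from any study.
    """
    error = outcome - value
    if chosen:
        # Chosen option: good news (positive error) is weighted more heavily.
        lr = lr_large if error > 0 else lr_small
    else:
        # Unchosen option: bad news (negative error) is weighted more heavily.
        lr = lr_large if error < 0 else lr_small
    return value + lr * error


# Both options pay off half the time (true value 0.5); feedback alternates 1, 0.
outcomes = [1, 0] * 50

chosen_value = 0.5
unchosen_value = 0.5
for outcome in outcomes:
    chosen_value = update(chosen_value, outcome, chosen=True)
    unchosen_value = update(unchosen_value, outcome, chosen=False)

print(f"chosen estimate:   {chosen_value:.3f}")
print(f"unchosen estimate: {unchosen_value:.3f}")
```

Under these assumed rates, identical feedback leaves the chosen option's estimate above its true value of 0.5 and the unchosen option's estimate below it, which is the optimism-for-chosen, pessimism-for-alternatives pattern described above.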
Start with the causal failures. LLMs reproduce human causal errors almost exactly, showing weak explaining-away and Markov violations in collider networks, which the work attributes not to some categorical inability to reason but to the statistics of training data (see "Do large language models make the same causal reasoning mistakes as humans?"). That's the same diagnosis the motivational-bias work reaches: the optimism asymmetry mirrors human belief updating and even disappears when you strip away the agency framing, suggesting it's a learned response to how outcome data is narrated rather than a computational defect. Both phenomena are human cognitive signatures absorbed wholesale, and the through-line is that biases get planted during pretraining and merely nudged by finetuning (see "Where do cognitive biases in language models come from?"). So the connection isn't that one bias causes the other; it's that they're siblings, both downstream of imitating human-generated text.
There's a sharper mechanistic link too. Explaining-away is precisely a reasoning operation that requires you to *lower* your belief in one cause when another is confirmed — to update against a hypothesis. A model carrying an optimism bias toward chosen-path outcomes is exactly the kind of system that under-discounts: it clings to the explanation it has committed to. That's why the asymmetric-updating work flags confirmation bias in deployed agents as the practical risk. The motivational tilt and the weak explaining-away may be the same failure to revise downward, surfacing in two test harnesses.
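For readers who want the normative benchmark behind "explaining-away", the sketch below computes it by enumeration in a two-cause collider (A → E ← B) with a noisy-OR likelihood. The prior, causal strength, and leak values are hypothetical, chosen only for illustration; the noisy-OR parameterization is an assumption and not necessarily the one used in the cited experiments.

```python
from itertools import product

PRIOR = 0.3     # hypothetical prior probability of each cause
STRENGTH = 0.8  # hypothetical strength with which each cause produces E
LEAK = 0.1      # hypothetical background rate of E with no cause present


def p_effect(a, b):
    """Noisy-OR likelihood P(E=1 | A=a, B=b)."""
    return 1 - (1 - LEAK) * (1 - STRENGTH) ** (a + b)


def p_cause(x):
    """Prior probability of a cause taking value x (0 or 1)."""
    return PRIOR if x else 1 - PRIOR


def posterior_a(b_observed=None):
    """P(A=1 | E=1), or P(A=1 | E=1, B=b_observed) when B is also observed."""
    numerator = denominator = 0.0
    for a, b in product((0, 1), repeat=2):
        if b_observed is not None and b != b_observed:
            continue  # skip worlds inconsistent with the observed value of B
        joint = p_cause(a) * p_cause(b) * p_effect(a, b)
        denominator += joint
        if a == 1:
            numerator += joint
    return numerator / denominator


effect_only = posterior_a()                # belief in A after seeing E
effect_and_b = posterior_a(b_observed=1)   # belief in A after also confirming B

print(f"P(A | E)      = {effect_only:.3f}")
print(f"P(A | E, B=1) = {effect_and_b:.3f}")
```

With these assumed numbers, confirming B lowers the belief in A from about 0.54 to about 0.34. "Weak explaining-away" means a reasoner's second judgment stays closer to the first than this normative drop warrants.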
This matters because the corpus repeatedly shows that what looks like "reasoning" in LLMs is inseparable from content and framing. Models show human content effects across syllogisms and Wason tasks, with belief-bias signatures matching human error rates item by item (see "Do language models show the same content effects humans do?"), and identical questions get different answers depending on the emotional tone of the prompt (see "Does emotional tone in prompts change what information LLMs provide?"). Motivational and emotional framings move the answer the same way a causal-structure framing does: the substrate doesn't separate "how I feel about this path" from "what causes what."
The thing you might not have expected to learn: causal modeling alone was never going to be the whole story. The GenMinds line argues that causal belief networks can't represent associative links, analogical mappings, or emotion-driven belief shifts (see "Can causal models alone capture how humans actually reason?"), and motivational/optimism bias lives in exactly that emotion-and-agency space the causal formalism leaves out. So fixing an LLM's causal reasoning wouldn't touch its motivational bias, and vice versa; they're adjacent territories the model inherited together but that no single corrective frame fully covers.
Sources (6 notes)
LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.
LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
LLMs show identical content-sensitivity patterns to humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests that content and logical form are architecturally inseparable in transformer reasoning.
GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.
Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.