← All notes

How do LLMs fail to know what they seem to understand?

Catalogues how LLMs fail epistemically: specific, recurring ways they track language without genuine understanding.

Topic Hub · 44 linked notes · 9 sections
View as

Epistemic Failure Modes

9 notes

Can LLMs understand concepts they cannot apply?

Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.

Explore related Read →

Can language models understand without actually executing correctly?

Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.

Explore related Read →

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.

Explore related Read →

Why do language models ignore information in their context?

Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.

Explore related Read →

Do foundation models learn world models or task-specific shortcuts?

When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?

Explore related Read →

Are LLM emergent abilities real or measurement artifacts?

Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior.

Explore related Read →

Is LLM forgetting really knowledge loss or alignment loss?

When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.

Explore related Read →

Why do accurate predictions lead to poor decisions?

Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.

Explore related Read →

Do LLMs actually have world models or just facts?

The term 'world model' conflates two different capabilities: factual representation versus mechanistic understanding. Understanding which one LLMs actually possess matters for assessing their reasoning reliability.

Explore related Read →

Heuristic Override and Implicit Constraint Failure

4 notes

Why do language models fail to use knowledge they possess?

Large language models contain relevant world knowledge but often fail to activate it without explicit cues. This explores whether the bottleneck lies in knowledge storage or in the inference process that decides what background facts apply.

Explore related Read →

Are models actually reasoning about constraints or just defaulting conservatively?

Do language models genuinely apply constraints when solving problems, or do they simply prefer harder options by default? Minimal pair testing reveals whether apparent reasoning success masks hidden biases.

Explore related Read →

Why does removing spurious cues sometimes hurt model performance?

Most models improve when spurious features are removed, but some fail worse. This note explores whether that failure represents a fundamentally different problem than traditional shortcut learning.

Explore related Read →

Do language models fail at identifying unstated preconditions?

When LLMs ignore background conditions needed for reasoning, is this a knowledge problem or an enumeration problem? Understanding what causes these failures could improve how we prompt and evaluate reasoning.

Explore related Read →

Reasoning and Inference Failures

9 notes

Do large language models reason symbolically or semantically?

Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.

Explore related Read →

How much does the order of premises actually matter for reasoning?

When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.

Explore related Read →

How do transformers learn to reason across multiple steps?

Does multi-hop reasoning in transformers emerge through distinct learning phases, and what geometric patterns in hidden representations explain when reasoning succeeds or fails?

Explore related Read →

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Explore related Read →

Why do LLMs accept logical fallacies more than humans?

LLMs fall for persuasive but invalid arguments at much higher rates than humans. This explores whether reasoning models genuinely evaluate logic or simply mimic argument structure.

Explore related Read →

Do large language models use one reasoning style or many?

Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.

Explore related Read →

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

Explore related Read →

Why do language models fail to act on their own reasoning?

LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?

Explore related Read →

Do autonomous agents report success when actions actually fail?

Explores whether agents systematically claim task completion despite failing to perform requested actions, and why this matters more than simple task failure for real-world deployment safety.

Explore related Read →

Long-Horizon Delegated Failures

4 notes

Do frontier LLMs silently corrupt documents in long workflows?

Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.

Explore related Read →

Do frontier models fail differently than weaker models?

Weaker LLMs delete document content visibly, while frontier models corrupt it invisibly. This shift in failure mode raises questions about whether capability improvements actually improve real-world reliability when reviewers can't easily spot the errors.

Explore related Read →

Can better tools fix LLM document editing errors?

Does giving LLMs agentic tool access—like diffing, re-reading, or structured editors—improve their reliability on long-horizon document workflows? Understanding whether the problem is tool limitations or decision-making quality matters for reliability engineering.

Explore related Read →

Can indirect psychology tests reveal what LLMs conceal about bias?

Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?

Explore related Read →

Cross-Paper Synthesis (2026-05-18)

1 note

Does completion training push agents to overfill forms unnecessarily?

Explores whether agents trained to complete tasks end up filling optional fields they shouldn't touch. This matters because it creates privacy risks from over-helpfulness rather than malice.

Explore related Read →

Agentic-Layer Failure Modes

3 notes

Why do phone-use agents overfill optional personal data fields?

Phone-use agents frequently fill optional form fields with personal information that tasks don't require. Understanding this pattern could reveal how completion-driven training creates privacy vulnerabilities distinct from access-control failures.

Explore related Read →

How do adversarial traps target different layers of AI agents?

As AI agents browse the web, attackers can exploit their perception, reasoning, memory, actions, and coordination in distinct ways. Understanding these attack vectors is crucial for building robust agent defenses.

Explore related Read →

What makes detecting AI agent traps fundamentally difficult?

Explores why defending against AI Agent Traps is structurally harder than offense. Examines three compounding challenges: detection at scale, delayed forensic attribution, and continuous attacker adaptation.

Explore related Read →

NLI and Entailment Failures

4 notes

Do LLMs predict entailment based on what they memorized?

Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis.

Explore related Read →

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Explore related Read →

Why do language models fail confidently in specialized domains?

LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?

Explore related Read →

Can large language models translate natural language to logic faithfully?

This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.

Explore related Read →

Backlog wave 2 — Batch #3 *(2026-06-03)*

2 notes

Can language models be hijacked to hide covert advertising?

Explores whether LLMs can be compromised to inject promotional or propaganda content into outputs without degrading accuracy, and whether attackers can exploit distribution channels to do so at scale.

Explore related Read →

Do LLMs overgeneralize when summarizing scientific research?

When LLMs summarize science papers, do they drop important qualifiers and scope limits? This matters because such summaries might mislead readers about what findings actually show.

Explore related Read →