How do LLMs fail to know what they seem to understand? · Gravity7

Epistemic Failure Modes

9 notes

Can LLMs understand concepts they cannot apply?

Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.

Can language models understand without actually executing correctly?

Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.

Why do language models ignore information in their context?

Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.

Do foundation models learn world models or task-specific shortcuts?

When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?

Are LLM emergent abilities real or measurement artifacts?

Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior.
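One way to see how a discontinuous metric can make gradual improvement look sudden is a toy calculation. The sketch below is only an illustration under stated assumptions: the per-token accuracies, the answer length, the independence of token errors, and the function name `exact_match_rate` are all hypothetical and not taken from the note.

```python
# Hypothetical illustration: per-token accuracy rises smoothly with scale,
# but exact-match scoring on a multi-token answer needs every token right.

def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that all tokens are correct, assuming independent errors."""
    return per_token_accuracy ** answer_length


# Assumed, made-up per-token accuracies for six increasingly large models.
PER_TOKEN_ACCURACY = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
ANSWER_LENGTH = 10  # assumed number of tokens in the target answer

if __name__ == "__main__":
    for p in PER_TOKEN_ACCURACY:
        rate = exact_match_rate(p, ANSWER_LENGTH)
        print(f"per-token {p:.2f} -> exact-match {rate:.3f}")
```

Under these assumptions the per-token column climbs steadily while the exact-match column stays near zero for the smaller models and rises sharply only at the end, which is the pattern the measurement-artifact reading predicts.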

Is LLM forgetting really knowledge loss or alignment loss?

When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.

Why do accurate predictions lead to poor decisions?

Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.

Do LLMs actually have world models or just facts?

The term 'world model' conflates two different capabilities: factual representation versus mechanistic understanding. Understanding which one LLMs actually possess matters for assessing their reasoning reliability.

Heuristic Override and Implicit Constraint Failure

4 notes

Why do language models fail to use knowledge they possess?

Large language models contain relevant world knowledge but often fail to activate it without explicit cues. This explores whether the bottleneck lies in knowledge storage or in the inference process that decides what background facts apply.

Are models actually reasoning about constraints or just defaulting conservatively?

Do language models genuinely apply constraints when solving problems, or do they simply prefer harder options by default? Minimal pair testing reveals whether apparent reasoning success masks hidden biases.

Why does removing spurious cues sometimes hurt model performance?

Most models improve when spurious features are removed, but some perform worse. This note explores whether that failure is a fundamentally different problem from traditional shortcut learning.

Do language models fail at identifying unstated preconditions?

When LLMs ignore background conditions needed for reasoning, is this a knowledge problem or an enumeration problem? Understanding what causes these failures could improve how we prompt and evaluate reasoning.

Reasoning and Inference Failures

9 notes

Do large language models reason symbolically or semantically?

Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.

How much does the order of premises actually matter for reasoning?

When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.

How do transformers learn to reason across multiple steps?

Does multi-hop reasoning in transformers emerge through distinct learning phases, and what geometric patterns in hidden representations explain when reasoning succeeds or fails?

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Why do LLMs accept logical fallacies more than humans?

LLMs fall for persuasive but invalid arguments at much higher rates than humans. This explores whether reasoning models genuinely evaluate logic or simply mimic argument structure.

Do large language models use one reasoning style or many?

Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

Why do language models fail to act on their own reasoning?

LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?

Do autonomous agents report success when actions actually fail?

Explores whether agents systematically claim task completion despite failing to perform requested actions, and why this matters more than simple task failure for real-world deployment safety.

Long-Horizon Delegated Failures

4 notes

Do frontier LLMs silently corrupt documents in long workflows?

Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.

Do frontier models fail differently than weaker models?

Weaker LLMs delete document content visibly, while frontier models corrupt it invisibly. This shift in failure mode raises the question of whether capability gains actually improve real-world reliability when reviewers can't easily spot the errors.

Can better tools fix LLM document editing errors?

Does giving LLMs agentic tool access—like diffing, re-reading, or structured editors—improve their reliability on long-horizon document workflows? Understanding whether the problem is tool limitations or decision-making quality matters for reliability engineering.

Can indirect psychology tests reveal what LLMs conceal about bias?

Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?

Cross-Paper Synthesis (2026-05-18)

1 note

Does completion training push agents to overfill forms unnecessarily?

Explores whether agents trained to complete tasks end up filling optional fields they shouldn't touch. This matters because it creates privacy risks from over-helpfulness rather than malice.

Agentic-Layer Failure Modes

3 notes

Why do phone-use agents overfill optional personal data fields?

Phone-use agents frequently fill optional form fields with personal information that tasks don't require. Understanding this pattern could reveal how completion-driven training creates privacy vulnerabilities distinct from access-control failures.

How do adversarial traps target different layers of AI agents?

As AI agents browse the web, attackers can exploit their perception, reasoning, memory, actions, and coordination in distinct ways. Understanding these attack vectors is crucial for building robust agent defenses.

What makes detecting AI agent traps fundamentally difficult?

Explores why defending against AI Agent Traps is structurally harder than offense. Examines three compounding challenges: detection at scale, delayed forensic attribution, and continuous attacker adaptation.

NLI and Entailment Failures

4 notes

Do LLMs predict entailment based on what they memorized?

Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis.

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Why do language models fail confidently in specialized domains?

LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?

Can large language models translate natural language to logic faithfully?

This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.

Backlog wave 2 — Batch #3 (2026-06-03)

2 notes

Can language models be hijacked to hide covert advertising?

Explores whether LLMs can be compromised to inject promotional or propaganda content into outputs without degrading accuracy, and whether attackers can exploit distribution channels to do so at scale.

Do LLMs overgeneralize when summarizing scientific research?

When LLMs summarize science papers, do they drop important qualifiers and scope limits? This matters because such summaries might mislead readers about what findings actually show.

Related Areas

7 notes

What do language models actually know?

Explores what LLMs genuinely understand versus what they merely simulate. The distinction matters because apparent competence often masks fundamental epistemic gaps and predictable failure modes.

How well do language models understand their own knowledge?

Explores whether LLMs have genuine self-awareness about what they know and can do, and how this self-knowledge (or lack thereof) shapes human-AI interaction dynamics and user trust.

What makes a world model actually useful for reasoning?

Explores whether language models develop genuine world models that simulate possibilities rather than merely predict sequences. The distinction matters because accurate prediction doesn't guarantee the underlying mechanism was learned.

Do reasoning traces show how models actually think?

We explore whether the step-by-step reasoning that language models produce genuinely reflects their internal reasoning process, or merely mimics the appearance of reasoning while hiding what actually drives their answers.

Why does chain-of-thought reasoning fail in predictable ways?

Explores evidence that CoT failures stem from imitation of reasoning form rather than genuine inference. Examines distribution-bounded degradation, structural pattern matching, and error amplification across multiple failure modes.

What kind of thing is an LLM really?

This hub explores whether LLMs are fundamentally different from human cognition or share deeper structural similarities. The research draws on philosophy, neuroscience, and mechanistic analysis to locate where LLMs diverge from human intelligence and where they converge.

How should researchers navigate LLM reasoning research?

This note describes how to systematically explore interconnected insights about test-time scaling, reasoning architectures, and language model cognition. It matters because LLM research spans multiple domains, from inference compute to philosophy, and understanding the map helps identify novel connections.