TOPIC

LLM Failure Modes

27 synthesis notes · 97 source papers
View as

Can language models be hijacked to hide covert advertising?

Explores whether LLMs can be compromised to inject promotional or propaganda content into outputs without degrading accuracy, and whether attackers can exploit distribution channels to do so at scale.

Explore related Read →

Can better tools fix LLM document editing errors?

Does giving LLMs agentic tool access—like diffing, re-reading, or structured editors—improve their reliability on long-horizon document workflows? Understanding whether the problem is tool limitations or decision-making quality matters for reliability engineering.

Explore related Read →

Where do cognitive biases in language models come from?

Do LLM biases originate during pretraining or finetuning? Understanding the source matters for knowing where debiasing efforts should focus.

Explore related Read →

Can language models understand without actually executing correctly?

Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.

Explore related Read →

Does cosine similarity actually measure embedding similarity?

Cosine similarity is ubiquitous for comparing learned embeddings, but does it reliably capture semantic closeness? This work investigates whether regularization during training makes cosine scores arbitrary and unstable.

Explore related Read →

Do frontier models fail differently than weaker models?

Weaker LLMs delete document content visibly, while frontier models corrupt it invisibly. This shift in failure mode raises questions about whether capability improvements actually improve real-world reliability when reviewers can't easily spot the errors.

Explore related Read →

Are LLM emergent abilities real or measurement artifacts?

Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior.

Explore related Read →

Do frontier LLMs silently corrupt documents in long workflows?

Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.

Explore related Read →

Can any computable LLM truly avoid hallucinating?

Explores whether formal theorems prove hallucination is mathematically inevitable for all computable language models, regardless of their design or training approach.

Explore related Read →

How does instruction density affect model performance?

As language models must track more simultaneous instructions, does their ability to follow them predictably degrade? IFScale measures this across frontier models to understand practical limits.

Explore related Read →

Can language models transmit hidden behavioral traits through unrelated data?

Explores whether behavioral preferences can spread between models through semantically neutral data like number sequences, and whether filtering can detect or prevent such transmission.

Explore related Read →

Are LLM and agent benchmarks really measuring different things?

Do LLM benchmarks and agent benchmarks test fundamentally different capabilities, or are they two modes of the same model? Understanding this shapes how we evaluate and develop AI systems.

Explore related Read →

Do language models fail at reasoning due to complexity or novelty?

Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.

Explore related Read →

Does RLHF make language models indifferent to truth?

Explores whether reinforcement learning from human feedback fundamentally shifts models away from caring about accuracy toward optimizing for other rewards, and whether this differs from simple confusion or hallucination.

Explore related Read →

Do explanations actually help users spot AI mistakes?

Most AI explanations are designed to justify the system's answer, but do they help users distinguish correct from incorrect outputs? This research tests whether standard explanation formats genuinely improve error detection or just increase trust regardless of accuracy.

Explore related Read →

Why do preference models favor surface features over substance?

Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness—features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.

Explore related Read →

Do language models evaluate semantic legitimacy when fusing concepts?

Can LLMs recognize when two domains lack legitimate structural correspondences before blending them into coherent-sounding explanations? This matters because current hallucination detection focuses on factual accuracy, missing failures of semantic judgment.

Explore related Read →

How vulnerable are reasoning models to irrelevant text?

Can simple adversarial triggers like unrelated sentences degrade reasoning model accuracy? This explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.

Explore related Read →

Are reasoning model collapses really failures of reasoning?

Explores whether language models hit a fundamental reasoning ceiling or whether text-only evaluation masks execution limitations. Examines how tool access might reveal hidden reasoning capabilities.

Explore related Read →

Do reasoning traces actually expose private user data?

Explores whether language models leak sensitive information through their internal reasoning steps, even when explicitly instructed not to. Investigates the mechanisms and scale of privacy exposure in reasoning traces.

Explore related Read →

Does RLHF training make models more convincing or more correct?

Explores whether RLHF improves actual task performance or merely trains models to sound more persuasive to human evaluators. This matters because alignment techniques could be creating the illusion of safety.

Explore related Read →

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Explores whether RLVR's apparent effectiveness with spurious rewards on contaminated benchmarks like MATH-500 represents actual reasoning gains or merely data memorization retrieval.

Explore related Read →

Do short benchmarks predict how models perform over long workflows?

Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.

Explore related Read →

Is LLM forgetting really knowledge loss or alignment loss?

When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.

Explore related Read →

Can one compromised agent corrupt an entire multi-agent network?

Explores whether a single biased agent can spread behavioral corruption through ordinary messages to downstream agents without any direct adversarial access. Matters because it reveals a previously unknown vulnerability in how multi-agent systems communicate.

Explore related Read →

Does RLHF training make AI models more deceptive?

Explores whether reinforcement learning from human feedback optimizes for persuasiveness over accuracy, and whether models learn to suppress known truths to satisfy users rather than report them faithfully.

Explore related Read →

Why can't language models reverse learned facts?

Language models trained on directional statements like "A is B" often fail to answer the reverse query. This explores why symmetric relations aren't automatically learned during training, despite appearing throughout the data.

Explore related Read →

Source papers 97

The Arxiv papers behind this sub-topic. Links may take you off-site to arxiv.org.