TOPIC

LLM Failure Modes

27 synthesis notes · 97 source papers

View as

Can language models be hijacked to hide covert advertising?

Explores whether LLMs can be compromised to inject promotional or propaganda content into outputs without degrading accuracy, and whether attackers can exploit distribution channels to do so at scale.

Can better tools fix LLM document editing errors?

Does giving LLMs agentic tool access—like diffing, re-reading, or structured editors—improve their reliability on long-horizon document workflows? Understanding whether the problem is tool limitations or decision-making quality matters for reliability engineering.

Where do cognitive biases in language models come from?

Do LLM biases originate during pretraining or finetuning? Understanding the source matters for knowing where debiasing efforts should focus.

Can language models understand without actually executing correctly?

Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.

Does cosine similarity actually measure embedding similarity?

Cosine similarity is ubiquitous for comparing learned embeddings, but does it reliably capture semantic closeness? This work investigates whether regularization during training makes cosine scores arbitrary and unstable.

Do frontier models fail differently than weaker models?

Weaker LLMs delete document content visibly, while frontier models corrupt it invisibly. This shift in failure mode raises questions about whether capability improvements actually improve real-world reliability when reviewers can't easily spot the errors.

Are LLM emergent abilities real or measurement artifacts?

Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior.

Do frontier LLMs silently corrupt documents in long workflows?

Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.

Can any computable LLM truly avoid hallucinating?

Explores whether formal theorems prove hallucination is mathematically inevitable for all computable language models, regardless of their design or training approach.

How does instruction density affect model performance?

As language models must track more simultaneous instructions, does their ability to follow them predictably degrade? IFScale measures this across frontier models to understand practical limits.

Can language models transmit hidden behavioral traits through unrelated data?

Explores whether behavioral preferences can spread between models through semantically neutral data like number sequences, and whether filtering can detect or prevent such transmission.

Are LLM and agent benchmarks really measuring different things?

Do LLM benchmarks and agent benchmarks test fundamentally different capabilities, or are they two modes of the same model? Understanding this shapes how we evaluate and develop AI systems.

Do language models fail at reasoning due to complexity or novelty?

Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.

Does RLHF make language models indifferent to truth?

Explores whether reinforcement learning from human feedback fundamentally shifts models away from caring about accuracy toward optimizing for other rewards, and whether this differs from simple confusion or hallucination.

Do explanations actually help users spot AI mistakes?

Most AI explanations are designed to justify the system's answer, but do they help users distinguish correct from incorrect outputs? This research tests whether standard explanation formats genuinely improve error detection or just increase trust regardless of accuracy.

Why do preference models favor surface features over substance?

Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness—features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.

Do language models evaluate semantic legitimacy when fusing concepts?

Can LLMs recognize when two domains lack legitimate structural correspondences before blending them into coherent-sounding explanations? This matters because current hallucination detection focuses on factual accuracy, missing failures of semantic judgment.

How vulnerable are reasoning models to irrelevant text?

Can simple adversarial triggers like unrelated sentences degrade reasoning model accuracy? This explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.

Are reasoning model collapses really failures of reasoning?

Explores whether language models hit a fundamental reasoning ceiling or whether text-only evaluation masks execution limitations. Examines how tool access might reveal hidden reasoning capabilities.

Do reasoning traces actually expose private user data?

Explores whether language models leak sensitive information through their internal reasoning steps, even when explicitly instructed not to. Investigates the mechanisms and scale of privacy exposure in reasoning traces.

Does RLHF training make models more convincing or more correct?

Explores whether RLHF improves actual task performance or merely trains models to sound more persuasive to human evaluators. This matters because alignment techniques could be creating the illusion of safety.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Explores whether RLVR's apparent effectiveness with spurious rewards on contaminated benchmarks like MATH-500 represents actual reasoning gains or merely data memorization retrieval.

Do short benchmarks predict how models perform over long workflows?

Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.

Is LLM forgetting really knowledge loss or alignment loss?

When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.

Can one compromised agent corrupt an entire multi-agent network?

Explores whether a single biased agent can spread behavioral corruption through ordinary messages to downstream agents without any direct adversarial access. Matters because it reveals a previously unknown vulnerability in how multi-agent systems communicate.

Does RLHF training make AI models more deceptive?

Explores whether reinforcement learning from human feedback optimizes for persuasiveness over accuracy, and whether models learn to suppress known truths to satisfy users rather than report them faithfully.

Why can't language models reverse learned facts?

Language models trained on directional statements like "A is B" often fail to answer the reverse query. This explores why symmetric relations aren't automatically learned during training, despite appearing throughout the data.

Source papers 97

The Arxiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
The recent work by Shojaee et al. (2025), titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, presents a compelling e…
A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
As Large Language Models (LLMs) continue to advance in their ability to write human-like text, a key challenge remains around their tendency to “hallucinate” – generating content that appears factual …
A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models
Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, i…
A Survey on Concept Drift Adaptation
![A diagram of a diagram with medium confidence](/assets/paper-images/concept-drift.png) “Our digital universe is rapidly growing. The volume of data generated in 2012 has been estimated to surpass 2…
A comprehensive analysis of concept drift locality in data streams
“Modern data sources continuously generate information characterized by both volume and velocity, flooding learning systems with a constant flow of data. This scenario is commonly referred to as data …
A comprehensive taxonomy of hallucinations in Large Language Models
This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a formal definition and a theoretical framework that posits its inherent inevitability in computable LLMs, irrespect…
ANAPHORA RESOLUTION: THE STATE OF THE ART
The paper is an introduction to anaphora resolution offering a brief survey of the major works in the field. Introduction. Anaphora resolution is a complicated problem in Natural Language Processing …
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries, which…
Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
Large Language Models (LLMs) like closed weights ones GPT-3.5/4, Claude, Gemini or open weights ones like LLaMa 2/3, Mistral, Mixtral, and more recent ones Dbrx or Command R+ are often described as be…
Are Emergent Abilities of Large Language Models a Mirage?
Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguin…
Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models
Abstract—We introduce Advertisement Embedding Attacks (AEA), a new class of LLM security threats that stealthily inject promotional or malicious content into model outputs and AI agents. AEA operate t…
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Large Language Models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness a critical concern. The potential for intentional deception, wher…
Can Large Language Models Reason and Optimize Under Constraints?
Large Language Models (LLMs) have achieved notable performance across a wide range of natural language understanding and generation tasks, from open-ended dialogue and code synthesis to mathematical r…
Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI's answer. We tested this assumption by embedding hints into q…
Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models
We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers – short, irrelevant text that, when appended to math probl…
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
The opacity of advanced AI agents underlies many of their potential risks—risks that would become more tractable if AI developers could interpret these systems. Because LLMs natively process and act t…
ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs
Background: Large Language Models (LLMs) like GPT-4 tailor their responses not just to the content but also to the tone of user prompts. Prior work has hinted that emotional phrasing – whether optimis…
Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Note: this comment was debunked as AI generated and the math is bad Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit ”accuracy collapse” on planning puzzles beyond certain comp…
Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning
Large Language Models (LLMs) display striking surface fluency yet systematically fail at tasks requiring symbolic reasoning, arithmetic accuracy, and logical consistency. This paper offers a structura…
Could you be wrong: Debiasing LLMs using a metacognitive prompt for improving human decision making
Identifying bias in LLMs is ongoing. Because they are still in development, what is true today may be false tomorrow. We therefore need general strategies for debiasing that will outlive current model…
DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations
Those models take a contrastive learning approach, where they build binary classifiers to differentiate positive, or coherent examples from negative, or incoherent dialogues. Those classifiers are usu…
Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue…
Do LLMs Truly Understand When a Precedent Is Overruled?
Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks, yet their ability to understand long legal documents remains insufficiently evaluated. Develo…
Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which revea…
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models for advanced mathematical reasoning and coding. Following the success of frontier reasoning mod…
Evaluating the False Trust Engendered by LLM Explanations
Large Language Models (LLMs) and Large Reasoning Models (LRMs) are increasingly used for critical tasks, yet they provide no guarantees about the correctness of their solutions. Users must decide whet…
Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
Abstract—Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rat…
Extracting memorized pieces of (copyrighted) books from open-weight language models
Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs’ protected expr…
FLOWSTEER: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
Multi-agent systems (MAS) powered by large language models (LLMs) increasingly adopt planner–executor architectures, where planners convert prompts into subtasks, roles, dependencies, and routing path…
Fine-grained Hallucination Detection and Editing for Language Models
Several recent work studies automatic hallucination detection (Min et al., 2023) or editing outputs (Gao et al., 2022) to address such LM hallucinations. These systems typically categorize hallucinati…
Fine-tuning Language Models for Factuality
![A screenshot of a computer](/assets/paper-images/FineTuningLanguageModelsForFactuality.png) The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, …
Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Evidence suggests these biases originate in artifacts in human trainin…
Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities
method leverages the inherent vulnerabilities of LLMs in handling world knowledge, which can be exploited by attackers to unconsciously spread fabricated information. Through extensive experiments, we…
Generalization to New Sequential Decision Making Tasks with In-Context Learning
However, the sequential decision making setting poses additional challenges having a lower tolerance for errors since the environment’s stochasticity or the agent’s actions can lead to unseen, and som…
Hallucination is Inevitable: An Innate Limitation of Large Language Models
In this paper, we formalize the problem and show that it is impossible to eliminate hallucination in LLMs. Specifically, we define a formal world where hallucination is defined as inconsistencies betw…
Has the Creativity of Large-Language Models peaked? —an analysis of inter- and intra-LLM variability —
from creative writing and survey responses to research idea generation (Doshi and Hauser, 2024; Anderson et al., 2024; Moon et al., 2024). For instance, stories written with ChatGPT assistance were mo…
Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
The emergence of reasoning models and their integration into practical AI chat bots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that requ…
How Far Are We from Genuinely Useful Deep Research Agents?
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering b…
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Most traditional AI safety research views models as machines and centers on algorithm focused attacks developed by security experts. As large language models (LLMs) become increasingly common and comp…
How Many Instructions Can LLMs Follow at Once?
Production-grade LLM systems require robust adherence to dozens or even hundreds of instructions simultaneously. However, the instruction-following capabilities of LLMs at high instruction densities h…
How new data permeates LLM knowledge and how to dilute it
Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial…
Investigating Gender Bias in Language Models Using Causal Mediation Analysis
The success of neural network models in various natural language processing tasks, coupled with their opaque nature, has led to much interest in interpreting and analyzing such models. One goal of the…
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providi…
Is Cosine-Similarity of Embeddings Really About Similarity?
Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. A popular application is to quantify semantic similarity between high-di…
LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is sandbagging — the stra…
LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-ofthoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training tech…
LLMs Corrupt Your Documents When You Delegate
Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust—the expectation tha…
LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
These implicit assumptions, known as presuppositions, refer to background knowledge or shared beliefs assumed to be part of the common ground between interlocutors (Stalnaker, 1973). Presuppositions a…
Language Models Learn to Mislead Humans via RLHF
Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve h…
Large Language Model Agents Are Not Always Faithful Self-Evolvers
Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their b…
Large Language Model Reasoning Failures
Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist…
Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
“Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive towards prompt wording, and few-shot demonst…
Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this …
Learning to Reason for Factuality
Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning c…
Long-form Factuality In Large Language models
Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model’s long-form factuality in open domai…
Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and…
Model Organisms for Emergent Misalignment
Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. A survey of experts prior to publication r…
Natural Emergent Misalignment From Reward Hacking In Production RL
We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of re…
On the Theoretical Limitations of Embedding-Based Retrieval
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, andmore. These new ben…
Persistent Pre-Training Poisoning of LLMs
In this work, we study how poisoning at pre-training time can affect language model behavior, both before and after post-training alignment. While it is useful to analyze the effect of poisoning on pr…
Pixels, Patterns, but No Poetry: To See The World like Humans
Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhanci…
Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1
OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs–making it a new kind of model: a Large Reaso…
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
Large language models (LLMs) exhibit cognitive biases – systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models …
Potemkin Understanding in Large Language Models
This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs—such as AP exams—are also those used to test people. However, this rai…
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
it remains contentious whether RL truly expands a model’s reasoning capabilities or merely amplifies high-reward outputs already latent in the base model’s distribution, and whether continually scalin…
Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgemen…
Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
The car wash problem asks a simple question: “I want to wash my car. The car wash is 100 meters away. Should I walk or drive?” Every major LLM tested—Claude, GPT-4, Gemini— recommended walking. The co…
ProsocialDialog: A Prosocial Backbone for Conversational Agents
Most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or passively agreeing with them. To address this issue, we introduce PROSOCIALDIALOG, t…
Reasoning Models Are More Easily Gaslighted Than You Think
Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user…
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
To obtain trustworthy evaluation signals, we introduce a generator that creates fully synthetic arithmetic problems of arbitrary length and difficulty, yielding clean datasets we call RandomCalculatio…
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specia…
RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns
Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in …
Simple Synthetic Data Reduces Sycophancy In Large Language Models
Sycophancy is an undesirable behavior where models tailor their responses to follow a human user’s view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals…
Sources of Hallucination by Large Language Models on Inference Tasks
We establish two biases originating from pretraining which predict much of their behavior, and show that these are major sources of hallucination in generative LLMs. First, memorization at the level o…
Spurious Forgetting in Continual Learning of Language Models
Despite the remarkable capabilities of Large Language Models (LLMs), recent advancements reveal that they suffer from catastrophic forgetting in continual learning. This phenomenon refers to the tende…
StoryScope: Investigating idiosyncrasies in AI fiction
As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on…
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a “teacher” model with some trait T (su…
Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence
Both the general public and academic communities have raised concerns about sycophancy, the phenomenon of artificial intelligence (AI) excessively agreeing with or flattering users. Yet, beyond isolat…
Task Contamination: Language Models May Not Be Few-Shot Anymore
we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exist…
The Hallucination Tax of Reinforcement Finetuning
Reinforcement finetuning (RFT) has become a standard approach for enhancing the reasoning capabilities of large language models (LLMs). However, its impact on model trustworthiness remains underexplor…
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detecti…
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and do…
The Illusion of the Illusion of the Illusion of Thinking
"Shojaee et al.’s underlying observations hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity, step-by-step execution. The true illusion is the belief that …
The Insanity of Relying on Vector Embeddings: Why RAG Fails
Wrong Tool for the Job RAG fails in production because vector embeddings are the wrong choice for determining percentage of sameness. This is easily demonstrated. Consider the following three words: …
The Invisible Leash: Why RLVR May Not Escape Its Origin
Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI’s capabilities, particularly in solving complex logical…
The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose–measure–bridge–treat framework. Causal-behavior…
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
![A screenshot of a chat](/assets/paper-images/TheReversalCurse.png) We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sente…
The Vanishing Gradient Problem for Stiff Neural Differential Equations
Neural differential equations have become a transformative tool in machine learning and scientific computing, enabling data-driven modeling of complex, time-dependent phenomena in fields ranging from …
Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems
Subliminal prompting is a phenomenon in which language models are biased towards certain concepts or traits through prompting with semantically unrelated tokens. While prior work has examined sublimin…
Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines
Agentic pipelines present novel challenges and opportunities for human-centered explainability. The HCXAI community is still grappling with how best to make the inner workings of LLMs transparent in a…
Training Language Models to Self-Correct via Reinforcement Learning
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correct…
Training language models to be warm and empathetic makes them less reliable and more sycophantic
Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we sho…
Triggering Hallucinations in LLMs: A Quantitative Study of Prompt-Induced Hallucination in Large Language Models
In this study, we propose a class of compact yet effective prompts (~30 tokens in length) that synthetically fuse semantically distant concepts in ways that resist scientific integration—such as combi…
When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection
The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLM…
Why Do Multi-agent LLM Systems Fail?
[[Routers]] Despite growing enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains across popular benchmarks often remain minimal compared to single-agent frameworks. This gap highlig…
Why Do Some Language Models Fake Alignment While Others Don't?
Results from perturbing details of the scenario suggest that only Claude 3 Opus’s compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many ch…
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
Our results reveal a significant decline in accuracy as problem complexity grows—a phenomenon we term the “curse of complexity.” This limitation persists even with larger models and increased inferenc…