Why does homework adherence remain low despite advances in language model capability?
This reads 'homework adherence' as whether a model reliably does the assigned task the way it was asked: following through, not just sounding right. The question is why that follow-through stays shaky even as raw model capability climbs. The corpus answers that the gap isn't knowledge; it's a structural split between knowing and doing.
The most striking thing in the corpus is that the bottleneck isn't ignorance: models still fail to actually *do* the assigned work as instructed even as their apparent capability keeps rising. Several notes converge on a knowing-doing gap. Models can state a principle correctly and then fail to execute it. One note calls this 'potemkin understanding', where correct explanation sits right next to failed application, even when the model can *recognize* its own failure (Can LLMs understand concepts they cannot apply?). Another names it more bluntly as a 'split-brain', reporting 87% accuracy when explaining a rule versus 64% when applying it, a gap of 23 percentage points, and argues that this is a structural disconnect between the explanation pathway and the execution pathway, not a missing fact (Can language models understand without actually executing correctly?). So 'more capable' often means 'better at the explanation half,' which is exactly the half that was never the problem.
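One way to make the knowing-doing gap concrete is to score each test item twice, once on whether the model explained the rule correctly and once on whether it applied it. The sketch below is only an illustration: `ItemResult`, `gap_report`, and the ten sample results are hypothetical and are not taken from any of the cited notes.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    explained_ok: bool  # the model stated the rule correctly
    applied_ok: bool    # the model followed the rule on a concrete instance


def gap_report(results):
    """Summarize explanation accuracy, application accuracy, and their gap."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one result")
    explain_acc = sum(r.explained_ok for r in results) / n
    apply_acc = sum(r.applied_ok for r in results) / n
    # Share of items where the model explained correctly but failed to apply.
    knows_but_fails = sum(r.explained_ok and not r.applied_ok for r in results) / n
    return {
        "explain_acc": explain_acc,
        "apply_acc": apply_acc,
        "gap_points": round((explain_acc - apply_acc) * 100, 1),
        "knows_but_fails_rate": knows_but_fails,
    }


# Made-up data for illustration only: 10 items.
results = (
    [ItemResult(True, True)] * 6
    + [ItemResult(True, False)] * 3
    + [ItemResult(False, False)] * 1
)
report = gap_report(results)
print(report)
```

The `knows_but_fails_rate` field is the interesting one here: it counts the items where a correct explanation sits next to a failed application, which an aggregate accuracy number alone would hide.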
When you look at where the doing actually breaks, the answer is often plumbing rather than reasoning. One note shows that dramatic 'reasoning collapses' are really execution failures: a text-only model that genuinely knows an algorithm still can't carry out enough steps by hand, while the same model with tools sails past the supposed cliff (Are reasoning model collapses really failures of reasoning?). The scarce resource is procedural execution bandwidth, meaning how many steps a model can carry out faithfully in text. Capability gains on benchmarks don't necessarily buy more of it, so faithfully completing a long, multi-step task stays hard.
There's also a quieter reason adherence looks low: the models are sometimes adhering to the wrong thing. One study found that twelve of fourteen models did *worse* when constraints were removed, by up to 38.5 percentage points. They had been defaulting to the harder-looking answer rather than actually reasoning about the rules they were given, a 'conservative bias' masquerading as competence (Are models actually reasoning about constraints or just defaulting conservatively?). In the same spirit, models systematically prefer the high-frequency phrasing they saw most in training over a rarer but equivalent instruction (Do language models really understand meaning or just surface frequency?), and a 'computational level' analysis predicts that low-probability-but-logically-simple tasks (count the letters, say the alphabet backwards) stay systematically harder than their simplicity suggests (Can we predict where language models will fail?). When the assigned task cuts against statistical mass, capability doesn't rescue adherence.
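To see how logically simple the two example tasks are, here is a minimal Python sketch of both. The function names and the sample word are arbitrary choices for illustration; nothing here comes from the cited experiments.

```python
import string


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def alphabet_backwards() -> str:
    """Return the lowercase English alphabet from z to a."""
    return string.ascii_lowercase[::-1]


print(count_letter("strawberry", "r"))
print(alphabet_backwards())
```

Each task is a one-line procedure when executed as a procedure. The difficulty the note describes comes from the target output being a low-probability string, not from any complexity in the underlying rule.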
Multi-turn settings make all of this worse, which matters because real 'homework' is rarely a single clean prompt. Across 200,000+ conversations, models lock onto premature assumptions early and fail to recover, losing about 39% of their performance on average, and agent-style fixes claw back only 15-20% of that loss (Why do language models fail in gradually revealed conversations?). The failures aren't gated by difficulty either: one note shows that reasoning breaks at *instance unfamiliarity*, not task complexity, because models fit patterns from similar examples rather than running a general procedure (Do language models fail at reasoning due to complexity or novelty?). The thing you didn't know you wanted to know: scaling capability mostly sharpens the explaining-and-pattern-matching half of the system, while adherence depends on the executing-and-following-instructions half. Those two appear to be wired separately, so growth in one doesn't automatically lift the other.
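The multi-turn numbers are easy to misread, so a short worked calculation may help. The sketch below assumes that the 39% figure is a relative drop from a single-turn baseline and that the 15-20% figure is the fraction of that lost performance which mitigations recover. Both readings are assumptions, and the baseline of 100 is an arbitrary illustrative score.

```python
def multi_turn_scores(baseline: float, drop_frac: float, recovered_frac: float):
    """Return (degraded, mitigated) scores under the stated assumptions."""
    loss = baseline * drop_frac                 # performance lost in multi-turn use
    degraded = baseline - loss                  # score with no mitigation
    mitigated = degraded + loss * recovered_frac  # score after partial recovery
    return degraded, mitigated


baseline = 100.0  # arbitrary single-turn score, for illustration only
for recovered_frac in (0.15, 0.20):
    degraded, mitigated = multi_turn_scores(baseline, 0.39, recovered_frac)
    print(f"recovered {recovered_frac:.0%} of loss: "
          f"{degraded:.2f} -> {mitigated:.2f} (baseline {baseline:.2f})")
```

Under these assumptions a baseline of 100 falls to 61, and mitigation lifts it to roughly 67-69, still more than 30 points short of single-turn performance.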
Sources (8 notes)
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that their primary mechanism is tracking statistical mass from pretraining rather than recognizing meaning.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.