Why does homework adherence remain low despite advances in language model capability?

This reads 'homework adherence' as whether a model reliably does the assigned task the way it was asked (following through, not just sounding right) and asks why that follow-through stays shaky even as raw model capability climbs. The corpus answers that the gap isn't knowledge; it's a structural split between knowing and doing.

Models still fail to actually *do* the assigned work correctly, that is, adhere to the instructions, even as their apparent capability keeps rising, and the most striking thing in the corpus is that the bottleneck isn't ignorance. Several notes converge on a knowing-doing gap: models can state a principle correctly and then fail to execute it. One line calls this 'potemkin understanding,' where correct explanation sits right next to failed application, even when the model can *recognize* its own failure (Can LLMs understand concepts they cannot apply?). Another names it more bluntly as a 'split-brain': 87% accuracy explaining a rule versus 64% applying it, a 23-point gap. That note argues this is a structural disconnect between the explanation pathway and the execution pathway, not a missing fact (Can language models understand without actually executing correctly?). So 'more capable' often means 'better at the explanation half,' which is exactly the half that was never the problem.
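The gap described above can be made concrete with a small scoring sketch. It assumes each evaluation item is recorded as a pair of booleans (explained correctly, applied correctly); the function name `knowing_doing_gap`, the `GapReport` fields, and the toy data are hypothetical illustrations, not the method or data used in the cited notes.

```python
from typing import Iterable, NamedTuple, Tuple


class GapReport(NamedTuple):
    explain_acc: float    # share of items explained correctly
    apply_acc: float      # share of items applied correctly
    gap_points: float     # explain minus apply, in percentage points
    potemkin_rate: float  # share explained correctly but applied incorrectly


def knowing_doing_gap(results: Iterable[Tuple[bool, bool]]) -> GapReport:
    """Summarize (explained_ok, applied_ok) pairs into a gap report."""
    rows = list(results)
    if not rows:
        raise ValueError("need at least one result")
    n = len(rows)
    explain_acc = sum(1 for explained, _ in rows if explained) / n
    apply_acc = sum(1 for _, applied in rows if applied) / n
    potemkin = sum(1 for explained, applied in rows if explained and not applied) / n
    gap_points = round((explain_acc - apply_acc) * 100, 1)
    return GapReport(explain_acc, apply_acc, gap_points, potemkin)


# Toy data, invented purely for illustration.
toy = [(True, True), (True, False), (True, True), (False, False)]
report = knowing_doing_gap(toy)

# The figures quoted in the text: 87% explaining vs 64% applying.
reported_gap_points = 87 - 64
```

Tracking the "explained but not applied" share separately from the raw accuracy difference matters because two models with the same gap could fail on entirely different items.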

When you look at where the doing actually breaks, the answer is often plumbing rather than reasoning. One note shows that dramatic 'reasoning collapses' are really execution failures: a text-only model that genuinely knows an algorithm still can't carry out enough steps by hand, while the same model with tools sails past the supposed cliff (Are reasoning model collapses really failures of reasoning?). Capability gains on benchmarks don't necessarily buy more of this procedural bandwidth, so faithfully completing a long, multi-step task stays hard.

There's also a quieter reason adherence looks low: the models are sometimes adhering to the wrong thing. One study found that twelve of fourteen models did *worse* when constraints were removed, by up to 38.5 percentage points. They were defaulting to the harder-looking answer rather than actually reasoning about the rules they were given, a 'conservative bias' masquerading as competence (Are models actually reasoning about constraints or just defaulting conservatively?). In the same spirit, models systematically prefer the high-frequency phrasing they saw most in training over a rarer but equivalent instruction (Do language models really understand meaning or just surface frequency?), and a 'computational level' analysis predicts that low-probability-but-logically-simple tasks (count the letters, say the alphabet backwards) stay hard no matter how big the model gets (Can we predict where language models will fail?). When the assigned task cuts against statistical mass, capability doesn't rescue adherence.
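To show how logically simple the two example tasks are as procedures, here is a minimal Python sketch. The function names and the sample word are illustrative assumptions; this does not reproduce the experimental setup of the cited note, only the deterministic procedure that a tool call could run.

```python
import string


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


def alphabet_backwards() -> str:
    """Return the lowercase English alphabet from z to a."""
    return string.ascii_lowercase[::-1]


# Sample inputs chosen for illustration only.
r_count = count_letter("strawberry", "r")
reversed_alphabet = alphabet_backwards()
```

Each task is a one-line procedure here, which is the point of the contrast: under the note's framing, the difficulty for a model comes from the low probability of the target text, not from the logic.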

Multi-turn settings make all of this worse, which matters because real 'homework' is rarely a single clean prompt. Across 200,000+ conversations, models lock onto premature assumptions early and never recover, losing ~39% of performance, and agent-style fixes claw back only 15-20% of that loss (Why do language models fail in gradually revealed conversations?). And the failures aren't gated by difficulty: one line shows reasoning breaks at *instance unfamiliarity*, not task complexity, because models fit patterns from similar examples rather than running a general procedure (Do language models fail at reasoning due to complexity or novelty?). The upshot: scaling capability mostly sharpens the explaining-and-pattern-matching half of the system, while adherence depends on the executing-and-following-instructions half. Those two appear to be wired separately, so growth in one doesn't automatically lift the other.
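The multi-turn figures can be worked through as simple arithmetic. This sketch makes two assumptions that go beyond the source: the single-turn baseline is normalized to 100, and the 15-20% recovery is read as a fraction of the lost performance (as the source note's summary phrases it). The function name `recovered_score` is hypothetical.

```python
def recovered_score(baseline: float, drop_fraction: float, recovery_fraction: float) -> float:
    """Score after a multi-turn drop, once a mitigation recovers part of the loss."""
    lost = baseline * drop_fraction          # performance lost in multi-turn use
    regained = lost * recovery_fraction      # portion of the loss a mitigation wins back
    return baseline - lost + regained


BASELINE = 100.0  # assumed normalization, not a figure from the source
multi_turn = recovered_score(BASELINE, 0.39, 0.0)   # no mitigation
low = recovered_score(BASELINE, 0.39, 0.15)         # mitigation recovers 15% of the loss
high = recovered_score(BASELINE, 0.39, 0.20)        # mitigation recovers 20% of the loss

print(f"multi-turn: {multi_turn:.2f}, with mitigation: {low:.2f} to {high:.2f}")
```

Under these assumptions the mitigated score stays well below the baseline, which is why the text describes the fixes as clawing back only a small part of the loss.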

Sources (8 notes)

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher tasked with re-testing whether the homework adherence bottleneck—a knowing-doing gap in LLMs—persists or has shifted since early 2026.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 and identify the bottleneck as structural, not factual:
• Potemkin understanding and the 'split-brain' gap: models explain rules correctly (87% accuracy in one study) but fail to apply them (64% accuracy); explanation and execution pathways are decoupled, not unified (2025–2026).
• Execution failures, not reasoning failures: reasoning "collapses" reflect procedural bandwidth limits; models with external tools recover performance, suggesting capability gains don't automatically buy multi-step procedural capacity (~2025).
• Conservative bias and frequency preference: models default to training-frequent phrasings and harder-looking answers over assigned constraints; low-probability-but-logically-simple tasks (count letters, recite alphabet backwards) stay hard regardless of scale (2026).
• Multi-turn collapse: premature assumptions lock in early; agent-style fixes recover only 15–20% of ~39% lost performance across 200k+ conversations (2025).
• Instance-level unfamiliarity, not task complexity, drives reasoning breakdown; models fit patterns from similar examples rather than run general procedures (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2507.10624 (2025-07): Comprehension Without Competence — architectural limits in symbolic computation.
• arXiv:2505.06120 (2025-05): LLMs Get Lost In Multi-Turn Conversation.
• arXiv:2604.02176 (2026-04): Adam's Law — textual frequency law on LLMs.
• arXiv:2602.06176 (2026-02): Large Language Model Reasoning Failures (diagnostic synthesis).

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, isolate whether newer models, training regimes (e.g., process-level RL, supervised execution finetuning), tooling (execution sandboxes, step verification), or evaluation harnesses have DISSOLVED the decoupling between explanation and execution. Separate the durable question (likely: why do models default to high-frequency patterns over explicit constraints?) from the perishable limitation (possibly: can RL on execution traces or constraint-aware pretraining narrow the gap?). Cite what resolved it; say plainly where the constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (post-2026-04). Look for papers showing execution pathways *can* be unified, or that multi-turn collapse is preventable at scale, or that frequency bias is learnable out.
(3) Propose 2 research questions that ASSUME the regime has moved: e.g., "If process-reward models can now close the explanation–execution gap, does homework adherence become a pure instruction-parsing problem?" or "Can multi-turn adherence be restored by replanning cost-effectively at turn N rather than exhaustively at turn 1?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
