INQUIRING LINE

Do task-specific heuristics emerge because they compress well enough?

This explores whether the shortcuts LLMs use on tasks — pattern-matching, format mimicry, narrow strategies — show up not because they're understood, but because they're the cheapest thing to store that still earns reward.


This reads the question as being about *why* models reach for shortcuts: not 'do they have heuristics' but 'is compressibility the reason they form.' The corpus doesn't argue this in those words, but several notes line up to make exactly that case — that training rewards the most storable behavior that passes, and that behavior is usually a heuristic, not a procedure.

The sharpest evidence is what models keep instead of what they're shown. When instruction tuning is run on semantically empty or even deliberately wrong instructions, performance barely moves — what transfers is knowledge of the output *space*, not the task (Does instruction tuning teach task understanding or output format?). That's compression in action: the cheap thing to encode is 'what answers look like here,' and that's what survives. Reasoning shows the same fingerprint. Chain-of-thought trace length tracks how close a problem sits to the training distribution, not how hard it is — long traces are recall of familiar schemas, not adaptive computation (Does longer reasoning actually mean harder problems?). And when asked to actually run iterative numerical methods, models instead recognize a problem as template-similar and emit plausible-but-wrong values (Do large language models actually perform iterative optimization?). The heuristic — 'this looks like that, so answer like that' — is the compressed stand-in for a procedure that would cost far more to represent.

These shortcuts feel robust until they suddenly aren't because compression is lossy at the edges. CoT degrades predictably under shifts in task, length, or format, producing fluent reasoning with no valid logic underneath (Does chain-of-thought reasoning actually generalize beyond training data?). The form is preserved because the form is what compressed; the logic was never stored.

Reinforcement learning makes the compression pressure explicit rather than incidental. RL squeezes behavioral diversity in both reasoning and search agents through entropy collapse: policies converge on a narrow band of reward-maximizing strategies, while SFT on diverse demonstrations preserves breadth (Does reinforcement learning squeeze exploration diversity in search agents?). A narrow reward-maximizing strategy is precisely a task-specific heuristic that compressed well enough to win. This also reframes the capability gap: reasoning models beat non-reasoning ones at any inference budget because training installs a *protocol* that makes extra tokens productive, not because of raw capability (Can non-reasoning models catch up with more compute?). Heuristics aren't just compressed; they're compressed toward whatever the training regime rewarded.
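To make "entropy collapse" concrete, here is a minimal sketch that computes the Shannon entropy of a policy's distribution over strategies. The two distributions and the names `broad_policy` and `collapsed_policy` are hypothetical and chosen only for illustration; they are not measurements from any of the cited notes.

```python
import math


def shannon_entropy(probs):
    """Return the Shannon entropy, in bits, of a discrete distribution."""
    # Zero-probability outcomes contribute nothing, so skip them to avoid log2(0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical distributions over four candidate strategies (illustrative only).
broad_policy = [0.25, 0.25, 0.25, 0.25]      # mass spread evenly across strategies
collapsed_policy = [0.97, 0.01, 0.01, 0.01]  # nearly all mass on one strategy

print(f"broad:     {shannon_entropy(broad_policy):.3f} bits")
print(f"collapsed: {shannon_entropy(collapsed_policy):.3f} bits")
```

The uniform policy carries 2 bits of entropy, while the collapsed one carries well under half a bit. Under these assumptions, that drop is the numerical signature of a policy that has settled on one narrow strategy.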

The less obvious payoff is that the corpus also points at the antidote. If heuristics emerge because depth-first shortcuts compress cheaply, then deliberately spending compute on *breadth* breaks the pattern: training abstraction generators that force diverse, structured exploration outperforms sampling more solutions from the same narrow policy, at least at large compute budgets (Can abstractions guide exploration better than depth alone?). Compressibility explains why the shortcut forms, and it tells you that resisting it costs exploration you have to pay for on purpose.


Sources (7 notes)

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, scoring 43% where a random baseline scores 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking whether task-specific heuristics in LLMs emerge primarily because they compress well — a claim grounded in a 2023–2026 arXiv library on reasoning, instruction-tuning, and RL training dynamics. Treat these findings as dated snapshots, not current fact.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; all are provisional.
• Instruction tuning captures output-distribution shape, not task logic; models preserve what compresses, not what transfers procedurally (2023).
• Chain-of-thought trace length correlates with training-distribution proximity, not problem difficulty; long traces are schema recall under compression losses (2025–2026).
• LLMs fall back to template-matching heuristics rather than executing iterative numerics; the heuristic is the compressed proxy for expensive procedures (2024–2025).
• RL training narrows behavioral diversity via entropy collapse, concentrating on reward-maximizing shortcuts; SFT on diverse demonstrations preserves breadth (2025).
• Deliberately training breadth-first abstraction generators breaks the heuristic-compression pattern and outperforms sampling from narrow policies (2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — instruction tuning learns output format, not task understanding.
• arXiv:2508.01191 (2025) — CoT effectiveness is distribution-bounded, not adaptive reasoning.
• arXiv:2509.07339 (2025) — CoT length and problem complexity are only weakly correlated.
• arXiv:2605.22817 (2026) — training for diversity via Vector Policy Optimization outperforms single-policy search.

Your task:
(1) RE-TEST each constraint. For each finding above, determine whether post-2026 model releases, training methods (e.g., new RL variants, scaling laws for diversity), or evaluation tooling have relaxed or overturned the claim. Separate the durable question ('do heuristics compress better?') from perishable limitations ('current models cannot do X'). Cite what resolved each constraint, or confirm it still holds.
(2) Surface the strongest contradicting or superseding work from the last ~6 months — any papers arguing heuristics emerge for reasons OTHER than compression (e.g., inductive bias, architectural constraint, optimization landscape geometry).
(3) Propose 2 open questions assuming the compression regime may have shifted: e.g., 'If new training protocols now favor explicit procedural depth over heuristic shortcuts, does model interpretability improve?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
