Do task-specific heuristics improve gradually or appear suddenly at scale?

This explores whether the skills a model uses to solve a task accumulate smoothly as training and scale grow, or switch on abruptly — and the corpus reframes 'heuristic' itself as instance-based pattern recall, which changes the answer.

The corpus's most direct move is to dissolve the premise: what looks like a discrete 'heuristic' is, in several of these notes, just recall of training instances that resemble the current problem. Do language models fail at reasoning due to complexity or novelty? shows that reasoning models neither break at a complexity threshold nor switch on at one: they succeed on a chain of any length when trained on similar instances and fail at novelty boundaries. Under that lens, a heuristic isn't a capability that appears; it's coverage that expands, one familiar region at a time.
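
One way to picture 'coverage that expands' is a toy model in which a test problem counts as solved whenever some training instance lies close enough to it. The sketch below is an illustration only, not a method from any of the notes: the one-dimensional problem space, the `radius` of familiarity, and the function names are all assumptions made for the example.

```python
import random


def coverage(train_points, test_points, radius):
    """Fraction of test points within `radius` of at least one training point."""
    hits = 0
    for t in test_points:
        if any(abs(t - p) <= radius for p in train_points):
            hits += 1
    return hits / len(test_points)


def coverage_curve(sizes, radius=0.02, seed=0):
    """Coverage of a fixed test grid as the (nested) training set grows.

    Hypothetical setup: problems are points in [0, 1], and training sets
    are prefixes of one seeded random pool, so each larger set contains
    every smaller one.
    """
    rng = random.Random(seed)
    pool = [rng.random() for _ in range(max(sizes))]
    test_grid = [i / 999 for i in range(1000)]
    return [coverage(pool[:n], test_grid, radius) for n in sizes]


sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
curve = coverage_curve(sizes)
for n, c in zip(sizes, curve):
    print(f"{n:4d} training instances -> coverage {c:.2f}")
```

Because each larger training set contains the smaller ones, the curve can only rise as instances are added. In this toy there is no scale at which a new skill switches on, only more regions becoming familiar.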

That framing makes the 'gradual' side look strong. Does longer reasoning actually mean harder problems? finds that reasoning traces track proximity to the training distribution, not the difficulty of the problem — the model is recalling schemas, not adaptively computing. Does chain-of-thought reasoning actually generalize beyond training data? sharpens this: performance decays *predictably* as you move away from training data, producing fluent but logically empty reasoning. Predictable decay is the signature of a smooth underlying function, not a phase transition. Even instruction tuning, often credited with unlocking new behavior, turns out to mostly teach the shape of the output space — Does instruction tuning teach task understanding or output format? shows semantically empty or wrong instructions perform about as well as correct ones. What accumulates is format familiarity, gradually.
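
A crude way to tell the two signatures apart, sketched here on synthetic curves rather than on data from the notes, is to ask how much of the total change in accuracy is concentrated in a single step. The `largest_step_share` helper and both example curves are hypothetical.

```python
import math


def largest_step_share(values):
    """Share of the total absolute change that occurs in the single largest step.

    Near 1.0 means almost all change happens at one point (cliff-like);
    a small value means the change is spread across the curve.
    """
    steps = [abs(b - a) for a, b in zip(values, values[1:])]
    total = sum(steps)
    return max(steps) / total if total else 0.0


# Synthetic accuracy as a function of distance from the training distribution.
distances = [i / 10 for i in range(11)]
smooth_decay = [math.exp(-3 * d) for d in distances]           # gradual fall-off
phase_change = [1.0 if d < 0.5 else 0.0 for d in distances]    # abrupt cliff

print("smooth decay share:", round(largest_step_share(smooth_decay), 2))
print("phase change share:", round(largest_step_share(phase_change), 2))
```

The share depends on how finely the curve is sampled, so it is only meaningful when comparing curves measured on the same grid.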

The scaling-curve notes agree. Do search steps follow the same scaling rules as reasoning tokens? finds search agents improve along the same diminishing-returns curve as reasoning tokens — a smooth axis, not a cliff. Why does chain of thought accuracy eventually decline with length? describes a continuous inverted-U where the optimum drifts as models improve. Nothing here behaves like a switch.
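
The inverted-U can be reproduced with a deliberately simple toy model. It assumes, purely for illustration, that every extra step adds a fixed chance of error (`step_noise`) while splitting the task into more steps lowers the load carried by each one. None of these names or functional forms come from the notes.

```python
import math


def toy_accuracy(n_steps, difficulty, capability, step_noise=0.02):
    """Toy accuracy for a chain of `n_steps` steps (illustrative assumption).

    Each step pays a fixed noise cost plus a cost that grows with the
    square of the work it has to carry, difficulty / (n_steps * capability).
    """
    per_step_load = difficulty / (n_steps * capability)
    log_accuracy = -n_steps * (step_noise + per_step_load ** 2)
    return math.exp(log_accuracy)


def best_length(difficulty, capability, max_steps=200):
    """Chain length that maximizes toy accuracy, found by direct search."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability),
    )


for difficulty, capability in [(4, 1), (8, 1), (4, 2)]:
    n_star = best_length(difficulty, capability)
    print(f"difficulty={difficulty}, capability={capability}: best length {n_star}")
```

In this toy the best length grows with `difficulty` and shrinks with `capability`, matching the direction of drift described above. The specific numbers carry no empirical weight.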

So where does 'sudden at scale' come from? The interesting answer in this corpus is that apparent jumps usually come from a *new information channel*, not raw scale. Can natural language feedback overcome numerical reward plateaus? shows that models stuck on a numerical-reward plateau leap forward when given chain-of-thought critiques: the plateau wasn't a capability ceiling, it was missing information about *why* failures happened. Likewise, Does training order reshape how models handle different task types? shows that simply changing training *order* yields a 6.2% gain over joint training by avoiding entropy collapse, and Does reinforcement learning squeeze exploration diversity in search agents? shows that RL can quietly narrow the heuristics a model will even attempt. The 'suddenness' lives in the training signal and schedule, not in scale crossing a magic number.

The thing worth taking away: in this collection there's little evidence for heuristics that genuinely materialize at a scale threshold. What reads as emergence is usually the model entering a region where similar instances were memorized, or a new feedback signal exposing competence that scaling alone left locked. Scale mostly buys *more* gradually-acquired heuristics — wider coverage — rather than qualitatively new ones appearing all at once.

Sources 9 notes

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) break not at complexity thresholds but at instance-novelty boundaries. They fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% against a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
