INQUIRING LINE

Does model collapse occur across different architectures or only in specific conditions?

This reads 'model collapse' broadly — not just the synthetic-data degradation sense, but the family of ways models break down — and asks whether breakdown is one universal phenomenon or a set of condition-specific failures; the corpus strongly suggests the latter.


This explores whether 'model collapse' is a single architecture-wide phenomenon or a label we paste over several distinct, condition-specific failures. The collection's clear answer: there isn't one collapse, there are many, and most of them are triggered by specific conditions rather than baked into the architecture. The most useful move the corpus makes is to pull apart breakdowns that look identical from the outside but have different causes — and therefore different fixes.

Several 'collapses' turn out to be artifacts of the situation, not the model. What looks like a reasoning cliff is often just execution running out of room: text-only models can know an algorithm but can't carry out enough steps to finish, and the same models clear the supposed cliff once you hand them tools (see "Are reasoning model collapses really failures of reasoning?"). Apparent complexity walls are really novelty walls: models hold up on long reasoning chains they've seen patterns for and fall apart on unfamiliar instances of the same task (see "Do language models fail at reasoning due to complexity or novelty?"). And a model's own mistakes can feed the collapse: once errors fill the context window, performance degrades non-linearly, an avalanche that more scale doesn't fix but test-time 'thinking' partly does (see "Do models fail worse when their own errors fill the context?").
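The error-avalanche point can be illustrated with a toy calculation. This is a minimal sketch under an assumed model, not the method of the cited note: each step fails with probability `base_rate * (1 + contamination * k)`, where `k` is the number of errors already sitting in context. The function name and both parameters are hypothetical.

```python
from typing import List


def expected_errors(steps: int, base_rate: float, contamination: float) -> float:
    """Expected error count after `steps` steps in a toy contamination model.

    Assumption (illustrative only): with k errors already in context, the
    next step fails with probability min(1, base_rate * (1 + contamination * k)).
    """
    # dist[k] = probability that exactly k errors are in context so far
    dist: List[float] = [1.0] + [0.0] * steps
    for _ in range(steps):
        nxt = [0.0] * (steps + 1)
        for k, mass in enumerate(dist):
            if mass == 0.0:
                continue
            p_err = min(1.0, base_rate * (1.0 + contamination * k))
            nxt[k + 1] += mass * p_err       # this step adds an error
            nxt[k] += mass * (1.0 - p_err)   # this step is clean
        dist = nxt
    return sum(k * mass for k, mass in enumerate(dist))


# With contamination = 0 errors are independent, so the expectation is
# steps * base_rate. Any positive contamination pushes it above that line.
independent = expected_errors(100, 0.02, 0.0)
compounding = expected_errors(100, 0.02, 0.5)
```

In this toy model the gap between `compounding` and `independent` widens as the horizon grows, which is the sense in which degradation is non-linear. Nothing here models scale or test-time thinking.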

The corpus also insists that 'collapse' at training time and 'collapse' at inference time are different animals. Entropy collapse during training and variance inflation at inference both come from a broken exploration-exploitation balance, but they live at different timescales and need structurally separate interventions; fixing one does nothing for the other (see "Why do reasoning models fail differently at training versus inference?"). That alone undercuts the idea of a single collapse mechanism.
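To make the training-time side concrete, here is a sketch of the quantity that collapses and of an entropy bonus, one of the training-time fixes the source note names. The function names and the `beta` weight are assumptions for illustration; the notes do not specify a formulation.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def loss_with_entropy_bonus(base_loss: float,
                            probs: Sequence[float],
                            beta: float = 0.01) -> float:
    """Subtract beta * entropy so that more exploratory policies score lower loss.

    `beta` is a hypothetical weight chosen for illustration.
    """
    return base_loss - beta * entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]   # policy still exploring
peaked = [0.97, 0.01, 0.01, 0.01]    # policy close to entropy collapse

print(entropy(uniform), entropy(peaked))
```

A bonus like this acts only inside the training loop. It has no handle on variance at inference, which is why the note treats the two loops as needing separate interventions.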

Where the architecture genuinely is the cause, the collection is precise about it. Autoregressive transformers cannot retract a token once emitted, so constraint-satisfaction problems hit a hard ceiling that no amount of model quality removes; symbolic solvers help only because they supply the retraction the architecture lacks (see "Why does autoregressive generation fail at constraint satisfaction?"). This is the one place where 'it's the architecture' is the right answer, and notably it's narrow and specific, not a general collapse.
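The role of retraction can be shown by analogy on a small constraint-satisfaction problem. The sketch below uses N-queens and a first-fit, commit-only placer as an assumed stand-in for left-to-right generation; it is not a model of a transformer, and all function names are illustrative.

```python
from typing import List, Optional


def is_safe(cols: List[int], col: int) -> bool:
    """True if a queen at (row=len(cols), col) attacks no earlier queen."""
    row = len(cols)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(cols))


def commit_only(n: int) -> Optional[List[int]]:
    """Place queens row by row and never undo a choice (no retraction)."""
    cols: List[int] = []
    for _ in range(n):
        choice = next((c for c in range(n) if is_safe(cols, c)), None)
        if choice is None:
            return None  # dead end: earlier choices cannot be revised
        cols.append(choice)
    return cols


def backtracking(n: int, cols: Optional[List[int]] = None) -> Optional[List[int]]:
    """Same search, but invalid partial assignments are discarded."""
    cols = [] if cols is None else cols
    if len(cols) == n:
        return list(cols)
    for c in range(n):
        if is_safe(cols, c):
            cols.append(c)
            result = backtracking(n, cols)
            if result is not None:
                return result
            cols.pop()  # retract the choice and try the next column
    return None


print(commit_only(4))   # None: the first-fit placer gets stuck on row 2
print(backtracking(4))  # [1, 3, 0, 2]
```

The two functions share the same constraint check. The only difference is the `cols.pop()` line, which is the capability a symbolic solver contributes in the analogy.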

Finally, the conditions matter and they vary by model. Instruction-following degrades in three distinct shapes depending on model type: linear for small models, exponential for mid-range, threshold-then-cliff for reasoning models (see "How does instruction density affect model performance?"). And some collapses aren't visible in performance at all: models can hit perfect accuracy while their internal representations are fractured and fragile, primed to collapse only under perturbation or distribution shift (see "Can models be smart without organized internal structure?"). The thing worth taking away: asking 'does collapse happen across architectures' is the wrong frame. The productive question is which failure, under which condition, and that's where the leverage to prevent it lives.


Sources (7 notes)

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Why do reasoning models fail differently at training versus inference?

Both failures stem from failed exploration-exploitation balance but occur at different timescales requiring structurally distinct interventions. Training-time fixes (entropy bonuses, critique diversity) cannot prevent inference-time variance inflation, and vice versa; both loops must be managed independently.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

How does instruction density affect model performance?

The IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range), and threshold decay (reasoning models hold up to ~150 instructions, then fail steeply). Even the best models reach only 68% accuracy at maximum density.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Next inquiring lines