Can models be smart without organized internal structure?

Explores whether linear feature decodability proves genuine compositional reasoning or merely indicates that the right features are present but poorly organized. Critical for understanding what performance metrics actually certify.

Synthesis note · 2026-02-23 · sourced from MechInterp

Two findings from mechanistic interpretability appear contradictory but operate at different levels of representational analysis:

Fractured Entangled Representations (FER): Following "Can identical outputs hide broken internal representations?", SGD-trained models can fail catastrophically under perturbation or distribution shift in ways that models with well-organized representations would not. The pathology is invisible to standard evaluation.

Compositional generalization at scale: Scaling data and model size produces representations where compositional features are linearly decodable — separable task constituents can be independently identified and manipulated. This has been taken as evidence for genuine compositional understanding.

The resolution: Linear decodability tests for the presence of features, not their organization. A representation could expose every relevant feature to a linear probe while being fractured in how those features relate to each other. The compositional parts are present, but their composition is broken.
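A toy sketch of the presence-versus-organization gap, under loud assumptions: two binary features a and b are written into a 2-D hidden state along hypothetical encoding directions, and "organization" is narrowed to one question only, namely whether nudging the state along the probe direction for a leaves the readout of b alone. Every name and number below is illustrative and not taken from any real model or from the sources of this note.

```python
# Toy sketch: two binary features (a, b) written into a 2-D hidden state.
# All names and numbers are illustrative assumptions, not from a real model.

def encode(a, b, u, v):
    """Hidden state h = a*u + b*v for encoding directions u and v."""
    return (a * u[0] + b * v[0], a * u[1] + b * v[1])

def dot(x, y):
    return x[0] * y[0] + x[1] * y[1]

def exact_probes(u, v):
    """Rows of the inverse of the 2x2 matrix whose columns are u and v.

    These are the linear readouts that recover a and b exactly."""
    det = u[0] * v[1] - v[0] * u[1]
    w_a = (v[1] / det, -v[0] / det)
    w_b = (-u[1] / det, u[0] / det)
    return w_a, w_b

def probe_accuracy(u, v):
    """Fraction of the four (a, b) inputs where both probes read the right value."""
    w_a, w_b = exact_probes(u, v)
    hits = 0
    for a in (0, 1):
        for b in (0, 1):
            h = encode(a, b, u, v)
            hits += (round(dot(w_a, h)) == a) and (round(dot(w_b, h)) == b)
    return hits / 4

def steering_leak(u, v, alpha=1.0):
    """Nudge h along the probe direction for a; return the change in the b readout."""
    w_a, w_b = exact_probes(u, v)
    h = encode(0, 0, u, v)
    steered = (h[0] + alpha * w_a[0], h[1] + alpha * w_a[1])
    return dot(w_b, steered) - dot(w_b, h)

ORGANIZED = ((1.0, 0.0), (0.0, 1.0))  # orthogonal encoding directions
ENTANGLED = ((1.0, 0.0), (1.0, 1.0))  # b's direction overlaps a's

for name, (u, v) in (("organized", ORGANIZED), ("entangled", ENTANGLED)):
    print(name, "probe accuracy:", probe_accuracy(u, v),
          "| leak into b when steering a:", steering_leak(u, v))
```

Both encodings pass the probe test on every input, so a decodability metric alone cannot tell them apart; only the intervention does. This linear toy captures just one thin slice of what FER means by fractured, so it is best read as an intuition pump and not as a diagnostic.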

This connects directly to the "imposter intelligence" post angle: "Can LLMs understand concepts they cannot apply?", "Does supervised fine-tuning actually improve reasoning quality?", and "Do foundation models learn world models or task-specific shortcuts?" All describe the same meta-pattern: surface metrics certify capability that internal structure analysis would disqualify.

The practical implication for model evaluation: passing compositional generalization tests does not guarantee robust compositional reasoning. Evaluation under distribution shift, perturbation, and novel recombination is required to distinguish genuine compositionality from fractured representations that happen to contain the right features.
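The novel-recombination check can be sketched with a small held-out-combination split. This is a minimal illustration under assumptions: the shape and color features, the two hand-built encodings, and the class-mean readout (a simple stand-in for a trained linear probe) are all hypothetical and do not come from the sources of this note.

```python
from itertools import product

N = 3  # hypothetical setup: 3 shapes x 3 colors
ALL_PAIRS = list(product(range(N), range(N)))
HELD_OUT = [(1, 2), (2, 1)]  # combinations never shown when fitting the readout
SEEN = [p for p in ALL_PAIRS if p not in HELD_OUT]

def factored(shape, color):
    """Separate one-hot slots for shape and color (reusable parts)."""
    h = [0.0] * (2 * N)
    h[shape] = 1.0
    h[N + color] = 1.0
    return h

def fractured(shape, color):
    """One slot per (shape, color) pair: color is stored separately for each shape."""
    h = [0.0] * (N * N)
    h[shape * N + color] = 1.0
    return h

def fit_color_readout(encode, pairs):
    """Class-mean linear readout: one weight vector per color,
    set to the mean hidden state of that color's training pairs."""
    dim = len(encode(0, 0))
    sums = [[0.0] * dim for _ in range(N)]
    counts = [0] * N
    for shape, color in pairs:
        counts[color] += 1
        for j, x in enumerate(encode(shape, color)):
            sums[color][j] += x
    return [[x / counts[c] for x in sums[c]] for c in range(N)]

def accuracy(encode, weights, pairs):
    """Fraction of pairs whose color gets the highest linear score."""
    hits = 0
    for shape, color in pairs:
        h = encode(shape, color)
        scores = [sum(w * x for w, x in zip(row, h)) for row in weights]
        hits += scores.index(max(scores)) == color  # ties resolve to the lowest index
    return hits / len(pairs)

def evaluate(encode):
    """Return (accuracy on seen pairs, accuracy on held-out pairs)."""
    weights = fit_color_readout(encode, SEEN)
    return accuracy(encode, weights, SEEN), accuracy(encode, weights, HELD_OUT)

print("factored  (seen, held-out):", evaluate(factored))
print("fractured (seen, held-out):", evaluate(fractured))
```

Under these assumptions both encodings score 1.0 on the seen pairs, so an in-distribution probe certifies both. On the held-out pairs the factored encoding stays at 1.0 while the fractured one drops to 0.0, because the slots for unseen combinations were never tied to any color.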

Related concepts in this collection

Can we track and steer personality shifts during model finetuning? This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
Relation to this note: persona vectors are a case where linear decodability does correspond to genuine functional organization (steering works), a positive counterpart to FER's warning that decodability alone is insufficient.
