Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

Synthesis note · 2026-02-23 · sourced from MechInterp

The FER hypothesis (Fractured Entangled Representation) poses a fundamental challenge to representational optimism — the implicit belief that as models scale and perform better, their internal representations must also be improving.

The experimental setup is simple: compare a CPPN (compositional pattern-producing network) evolved through open-ended search on Picbreeder with an SGD-trained CPPN that reproduces the same output pixel for pixel. The outputs are identical. The internal representations are radically different. The evolved network explicitly represents the symmetry of a skull: perturbing its weights produces coherent variations (winking, warping) that respect the underlying structure. The SGD-trained network loses that symmetry under the slightest perturbation, producing incoherent fragments that show no sign of a unified representation of what it draws.
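To make the idea of a CPPN concrete, here is a minimal sketch, not the Picbreeder or SGD networks from the experiment. It assumes a hand-written toy network in which the function names (`cppn`, `render`, `asymmetry`), the node choices, and the weight values are all hypothetical. It shows only the general mechanism: a CPPN maps each pixel coordinate to an intensity, and when symmetry is carried by a dedicated internal node, changing a weight changes the picture without breaking the symmetry.

```python
import math

def cppn(x, y, w):
    """Toy CPPN: maps a pixel coordinate (x, y) to an intensity in (0, 1).

    Assumption for illustration: the hidden node abs(x) encodes left-right
    symmetry structurally, so the symmetry holds for every weight vector w.
    """
    h_sym = abs(x)                      # symmetry-encoding node
    h_rad = math.sqrt(x * x + y * y)    # radial-distance node
    z = w[0] * math.sin(w[1] * h_sym) + w[2] * math.cos(w[3] * y) - w[4] * h_rad
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid output

def render(w, n=9):
    """Sample the CPPN on an n x n grid over [-1, 1] x [-1, 1]."""
    coords = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [[cppn(x, y, w) for x in coords] for y in coords]

def asymmetry(img):
    """Largest absolute difference between a pixel and its mirror pixel."""
    return max(abs(a - b) for row in img for a, b in zip(row, reversed(row)))

# Hypothetical weights; the second vector changes one weight only.
base = [1.5, 3.0, 0.8, 2.0, 1.0]
varied = [1.5, 4.2, 0.8, 2.0, 1.0]

print("image changed:", render(base) != render(varied))
print("asymmetry before:", asymmetry(render(base)))
print("asymmetry after: ", asymmetry(render(varied)))
```

In this toy, the weight change alters the image while the mirror symmetry survives, which is the kind of coherent variation the note attributes to the evolved network.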

This is "imposter intelligence": the external appearance implies authentic internal representation, but the reality underneath is fractured across arbitrary subdomains and entangled across unrelated computations.

Three consequences for large models:

Generalization in data-sparse regions. FER means the model cannot apply general principles from well-covered regions to sparse borderlands — precisely where AI could make its most valuable contributions. The principles are fractured, so they only apply to narrow arbitrary subdomains.
Creativity. Creating something new requires understanding the regularities of what exists. If those regularities are represented in a fractured way (for example, counting bricks uses different circuits than counting apples), the model cannot extend or recombine concepts coherently.
Continual learning. Learning is movement through weight space. If nearby points in weight space break regularities rather than respect them, learning cannot build on deep discoveries. This compounds in continual learning scenarios.

The challenge: standard benchmarks, including comprehensive behavioral evaluations, cannot distinguish FER from genuine representation. The imposter skull produces correct output for every possible input. Only weight perturbation analysis — probing the neighborhood of the solution, not the solution itself — reveals the pathology.
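The perturbation probe can be sketched as follows. This is a hand-constructed toy under stated assumptions, not the experiment's actual networks: `unified` and `fractured` are hypothetical stand-ins, the fractured network is built by hand rather than trained with SGD, and the noise scale `sigma`, the trial count, and the weight values are arbitrary. The two networks produce the same image at the unperturbed weights, and only sampling nearby points in weight space separates them.

```python
import math
import random

def shape(u, y, w):
    """Intensity as a function of horizontal distance u >= 0 and height y."""
    z = w[0] * math.sin(w[1] * u) + w[2] * math.cos(w[3] * y)
    return 1.0 / (1.0 + math.exp(-z))

def unified(x, y, params):
    # One weight vector; symmetry comes from the shared abs(x) node.
    return shape(abs(x), y, params)

def fractured(x, y, params):
    # Two independent weight vectors, one per image half: a hypothetical
    # stand-in for a regularity split across arbitrary subdomains.
    left, right = params[:4], params[4:]
    return shape(-x, y, left) if x < 0 else shape(x, y, right)

GRID = [-1.0 + 0.25 * i for i in range(9)]  # 9 samples across [-1, 1]

def render(net, params):
    return [[net(x, y, params) for x in GRID] for y in GRID]

def asymmetry(img):
    """Largest absolute difference between a pixel and its mirror pixel."""
    return max(abs(a - b) for row in img for a, b in zip(row, reversed(row)))

def perturb(params, sigma, rng):
    """Add independent Gaussian noise to every weight."""
    return [p + rng.gauss(0.0, sigma) for p in params]

def mean_asymmetry(net, params, sigma=0.05, trials=200, seed=0):
    """Average symmetry violation over random points near params."""
    rng = random.Random(seed)
    total = sum(asymmetry(render(net, perturb(params, sigma, rng)))
                for _ in range(trials))
    return total / trials

w = [1.5, 3.0, 0.8, 2.0]  # hypothetical weights
same_output = render(unified, w) == render(fractured, w + w)
unified_score = mean_asymmetry(unified, w)
fractured_score = mean_asymmetry(fractured, w + w)

print("identical outputs at the solution:", same_output)
print("mean asymmetry nearby, unified:   ", unified_score)
print("mean asymmetry nearby, fractured: ", fractured_score)
```

Comparing the rendered images alone cannot tell the two apart, since they match at the solution. The difference shows up only in the neighborhood score, where the unified network stays symmetric and the fractured one does not.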

This reframes what it means for a model to "understand" something. The related note "Can LLMs understand concepts they cannot apply?" describes the behavioral symptom; FER describes the mechanistic cause: the internal representation is fractured in ways that prevent the understanding from transferring to novel contexts.

Related concepts in this collection 6

Can LLMs understand concepts they cannot apply? Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.
FER provides the mechanistic explanation for why correct output can coexist with failed generalization
Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
task-specific heuristics are what FER predicts: fractured solutions that work locally but lack unified principles
Why do neural networks fail at compositional generalization? Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.
FER is what binding failure looks like from the representation side
Does supervised fine-tuning improve reasoning or just answers? Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
another case where performance metrics hide internal degradation
Do standard analysis methods hide nonlinear features in neural networks? Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.
AxBench compounds the FER detection problem: standard analysis tools are biased toward simple linear features, so fractured representations may appear normal through PCA/probing while the complex entangled structure remains invisible to our diagnostic methods
Can auditors discover what hidden objectives a model learned? Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved.
blind audits demonstrate that models generalize misalignment beyond trained exploits — the same surface-beneath-surface problem FER identifies; both argue performance-level evaluation is insufficient and internal structure analysis is required

Original note title

fractured entangled representations mean identical performance can mask fundamentally broken internal structure
