Can identical outputs hide broken internal representations?
Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
The Fractured Entangled Representation (FER) hypothesis challenges representational optimism: the implicit belief that as models scale and perform better, their internal representations must also be improving.
The experimental setup is simple: compare a compositional pattern-producing network (CPPN) evolved through open-ended search on Picbreeder with an SGD-trained CPPN that reproduces the same output pixel for pixel. The outputs are identical; the internal representations are radically different. The evolved network explicitly represents the symmetry of a skull: perturbing its weights produces coherent variations (winking, warping) that respect the underlying structure. The SGD-trained network loses that symmetry under the slightest perturbation, producing incoherent fragments that reveal no understanding of what it draws.
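The contrast can be illustrated with a toy sketch. This is not the CPPN experiment itself: the two functions, their weights, and the names `unified`, `fractured`, `render`, and `asymmetry` are all hypothetical, chosen only to show how two parameterizations can agree on every pixel yet respond differently to the same weight change.

```python
import math

def unified(x, y, w):
    # Symmetry is built in: x enters only through abs(x),
    # so a single weight set draws both halves of the image.
    return math.tanh(w[0] * abs(x) + w[1] * y + w[2])

def fractured(x, y, w):
    # Left and right halves use separate weights that merely happen to agree.
    a, b, c = (w[0], w[1], w[2]) if x >= 0 else (w[3], w[4], w[5])
    return math.tanh(a * abs(x) + b * y + c)

def render(f, w, n=9):
    # Evaluate f on an n x n grid covering [-1, 1] x [-1, 1].
    coords = [2 * i / (n - 1) - 1 for i in range(n)]
    return [[f(x, y, w) for x in coords] for y in coords]

def asymmetry(img):
    # Largest difference between a pixel and its left-right mirror.
    return max(abs(row[i] - row[-1 - i]) for row in img for i in range(len(row)))

def perturb(w, index, delta):
    # Return a copy of w with one weight nudged by delta.
    out = list(w)
    out[index] += delta
    return out

w_unified = [1.5, -0.7, 0.2]
w_fractured = w_unified + w_unified  # same values, duplicated per half

# Identical outputs before any perturbation.
same_output = render(unified, w_unified) == render(fractured, w_fractured)

# Nudge the same weight in both networks and measure mirror symmetry.
asym_unified = asymmetry(render(unified, perturb(w_unified, 0, 0.3)))
asym_fractured = asymmetry(render(fractured, perturb(w_fractured, 0, 0.3)))

print("identical outputs:", same_output)
print("asymmetry after perturbation (unified):", asym_unified)
print("asymmetry after perturbation (fractured):", asym_fractured)
```

In this toy, the unified function stays mirror-symmetric under any weight change because the symmetry lives in its structure, while the fractured one is symmetric only at the exact weight values it started with.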
This is "imposter intelligence": the external appearance implies authentic internal representation, but the reality underneath is fractured across arbitrary subdomains and entangled across unrelated computations.
Three consequences for large models:
Generalization in data-sparse regions. FER means the model cannot apply general principles from well-covered regions to sparse borderlands — precisely where AI could make its most valuable contributions. The principles are fractured, so they only apply to narrow arbitrary subdomains.
Creativity. Creating something new requires understanding the regularities of what exists. If those regularities are represented in fractured form (for example, counting bricks uses different circuits than counting apples), the model cannot extend or recombine concepts coherently.
Continual learning. Learning is movement through weight space. If nearby points in weight space break regularities rather than respect them, learning cannot build on deep discoveries. This compounds in continual learning scenarios.
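The first and third consequences can be sketched together in a toy setting. Everything here is an assumption made for illustration: the one-dimensional models `coherent` and `fractured`, the target function, the data split, and the use of full-batch gradient descent with numerical gradients. The sketch shows an update learned from a well-covered region carrying over to an unseen region only when the weights are shared.

```python
def coherent(x, w):
    # One shared weight governs both positive and negative inputs.
    return w[0] * x * x

def fractured(x, w):
    # Separate weights per half; they agree initially but are unrelated.
    return (w[0] if x >= 0 else w[1]) * x * x

def loss(model, w, data):
    # Mean squared error over (input, target) pairs.
    return sum((model(x, w) - y) ** 2 for x, y in data) / len(data)

def gradient_descent(model, w, data, lr=0.01, steps=200, eps=1e-5):
    # Full-batch gradient descent using central-difference gradients.
    w = list(w)
    for _ in range(steps):
        grads = []
        for i in range(len(w)):
            up = w[:i] + [w[i] + eps] + w[i + 1:]
            down = w[:i] + [w[i] - eps] + w[i + 1:]
            grads.append((loss(model, up, data) - loss(model, down, data)) / (2 * eps))
        w = [wi - lr * g for wi, g in zip(w, grads)]
    return w

# Both models start out computing x*x everywhere.
w_coherent, w_fractured = [1.0], [1.0, 1.0]

# The new target is 2*x*x, but training data exists only for x > 0.
covered = [(x, 2.0 * x * x) for x in (1, 2, 3)]
sparse = [(x, 2.0 * x * x) for x in (-3, -2, -1)]

w_coherent = gradient_descent(coherent, w_coherent, covered)
w_fractured = gradient_descent(fractured, w_fractured, covered)

print("sparse-region loss (coherent):", loss(coherent, w_coherent, sparse))
print("sparse-region loss (fractured):", loss(fractured, w_fractured, sparse))
```

The fractured model fits the covered region just as well, but the weight for negative inputs never receives a gradient, so nothing it learned transfers to the region without data.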
The challenge: standard benchmarks, including comprehensive behavioral evaluations, cannot distinguish FER from genuine representation. The imposter skull produces correct output for every possible input. Only weight perturbation analysis — probing the neighborhood of the solution, not the solution itself — reveals the pathology.
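A minimal sketch of the difference between evaluating the solution and probing its neighborhood, under stated assumptions: the finite input domain, the `coherent` and `lookup` models, the Gaussian noise scale, and the symmetry score are hypothetical choices for illustration, not the procedure used in the FER experiments.

```python
import random

INPUTS = [-3, -2, -1, 0, 1, 2, 3]  # the complete (toy) input domain

def coherent(w):
    # One shared weight scales x*x, so f(x) == f(-x) for any weight value.
    return [w[0] * x * x for x in INPUTS]

def lookup(w):
    # One independent weight per input: a table that happens to hold x*x.
    return list(w)

def benchmark_error(outputs):
    # Behavioral check: worst-case error over every possible input.
    return max(abs(o - x * x) for o, x in zip(outputs, INPUTS))

def symmetry_violation(outputs):
    # Regularity check: largest gap between f(x) and f(-x).
    return max(abs(a - b) for a, b in zip(outputs, reversed(outputs)))

def neighborhood_probe(model, w, sigma=0.1, trials=200, seed=0):
    # Average symmetry violation over random Gaussian weight perturbations.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [wi + rng.gauss(0.0, sigma) for wi in w]
        total += symmetry_violation(model(noisy))
    return total / trials

w_coherent = [1.0]
w_lookup = [float(x * x) for x in INPUTS]

print("benchmark error:", benchmark_error(coherent(w_coherent)),
      benchmark_error(lookup(w_lookup)))
print("neighborhood probe:", neighborhood_probe(coherent, w_coherent),
      neighborhood_probe(lookup, w_lookup))
```

Both models score perfectly on the exhaustive behavioral check, so no benchmark over inputs can separate them. Only the probe, which samples weights near the solution, shows that the lookup table holds the regularity by coincidence.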
This reframes what it means for a model to "understand" something. The related note "Can LLMs understand concepts they cannot apply?" describes the behavioral symptom; FER describes the mechanistic cause: the internal representation is fractured in ways that prevent the understanding from transferring to novel contexts.
Inquiring lines that use this note as a source (39)
- Why do one-shot transparency studies miss the temporal reversal entirely?
- Why is AI output fundamentally unverifiable against underlying reality?
- Why do structural signals across edges resist noise better than single-edge counts?
- Does information stored in neural networks necessarily influence generation decisions?
- Why do human-designed neural architectures eventually get replaced by learned ones?
- Could probing methods miss computationally important features in neural networks?
- How does activation consistency training differ from output-level consistency?
- What happens when you tightly couple two representations together?
- Can neural networks implement genuine algorithms or only statistical pattern matching?
- Can fractured representations explain why models fail at systematic generalization?
- What distinguishes genuine understanding from correct output without coherent principles?
- When do aggregated imperfect demonstrations fail to outperform the best expert?
- What non-linear patterns do autoencoders discover that matrix factorization misses?
- What makes output convergence across models inevitable given input-side homogenization?
- Can steering vectors prove that representations are genuinely organized?
- What makes a novel research idea practically infeasible for implementation?
- How do internal representations compare to human cognitive structures?
- Can identical model performance mask fundamentally broken internal representations?
- What makes a neural network circuit actually interpretable to humans?
- Why do structured and creative domains exhibit opposite entropy dynamics?
- How does entropy collapse affect creative capability in multi-task settings?
- Can fractured entangled representations hide undetected by standard analysis methods?
- Can structured decomposition fix evaluation gaps in other research tasks?
- How do neural networks decompose complex tasks into modular subnetworks?
- What are fractured entangled representations in neural networks?
- Is hallucination mechanistically identical to generalization across datasets?
- What makes a small surgical wide component sufficient with a capable deep model?
- Can geometric structure in representations exist without supporting functional mechanisms?
- How does mechanistic interpretability complement learning mechanics in explaining deep learning?
- Why should deep learning theory prioritize average-case over worst-case analysis?
- How do neural networks decompose tasks into modular subnetworks that transfer?
- What solvable idealized settings reveal fundamental phenomena in realistic deep learning?
- How do ablation studies reveal function without representational characterization?
- What is the difference between changing model outputs versus changing internal representations?
- How does representation sparsity change when inputs fall outside the training distribution?
- Can similar outputs from different systems prove they work the same way?
- How can neural networks be interpretable by design rather than post-hoc?
- What makes a feature abstract versus concrete in neural network activations?
- Where do neural networks still fail at compositional generalization despite scaling?
Related concepts in this collection (6)
- Can LLMs understand concepts they cannot apply?
  Explores whether large language models can correctly explain ideas while simultaneously failing to use them, and whether that combination reveals something fundamentally different from ordinary mistakes.
  FER provides the mechanistic explanation for why correct output can coexist with failed generalization.
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  Task-specific heuristics are what FER predicts: fractured solutions that work locally but lack unified principles.
- Why do neural networks fail at compositional generalization?
  Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects (segregation, representation, and composition), each creating distinct failure modes in how networks handle structured information.
  FER is what binding failure looks like from the representation side.
- Does supervised fine-tuning improve reasoning or just answers?
  Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
  Another case where performance metrics hide internal degradation.
- Do standard analysis methods hide nonlinear features in neural networks?
  Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.
  AxBench compounds the FER detection problem: standard analysis tools are biased toward simple linear features, so fractured representations may appear normal through PCA/probing while the complex entangled structure remains invisible to our diagnostic methods.
- Can auditors discover what hidden objectives a model learned?
  Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved.
  Blind audits demonstrate that models generalize misalignment beyond trained exploits, the same surface-beneath-surface problem FER identifies; both argue that performance-level evaluation is insufficient and internal structure analysis is required.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
- Break It Down: Evidence for Structural Compositionality in Neural Networks
- Representation biases: will we achieve complete understanding by analyzing representations?
- Scaling can lead to compositional generalization
- Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
- Open Problems in Mechanistic Interpretability
- The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
- From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks
Original note title
fractured entangled representations mean identical performance can mask fundamentally broken internal structure