Can models be smart without organized internal structure?
Explores whether linear feature decodability proves genuine compositional reasoning or merely indicates that the right features are present but poorly organized. Critical for understanding what performance metrics actually certify.
Two findings from mechanistic interpretability appear contradictory but operate at different levels of representational analysis:
Fractured Entangled Representations (FER): As argued in "Can identical outputs hide broken internal representations?", SGD-trained models can fail catastrophically under perturbation or distribution shift in ways that models with well-organized representations would not. The pathology is invisible to standard evaluation.
Compositional generalization at scale: Scaling data and model size produces representations where compositional features are linearly decodable — separable task constituents can be independently identified and manipulated. This has been taken as evidence for genuine compositional understanding.
The resolution: Linear decodability tests for the presence of features, not their organization. A fractured representation could contain every linearly decodable feature while being fractured in how those features relate to each other. The compositional parts are present but their composition is broken.
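The distinction between presence and organization can be made concrete with a toy sketch. The two encoders below are hypothetical illustrations, not taken from the FER work: `organized` gives each binary feature its own fixed axis, while `fractured` moves feature `a` to a different axis depending on `b`. The probe weights `A_PROBE` and `B_PROBE` and the `direction_consistency` measure are likewise assumptions chosen for the example.

```python
from itertools import product
from math import sqrt


def organized(a, b):
    # One fixed axis per feature: a lives on axis 0, b on axis 1.
    return [float(a), float(b), 0.0]


def fractured(a, b):
    # Hypothetical fractured code: a sits on axis 0 when b == 0
    # and on axis 2 when b == 1.
    return [float(a * (1 - b)), float(b), float(a * b)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


# Fixed linear readouts (assumed for illustration, not learned here).
A_PROBE = [1.0, 0.0, 1.0]
B_PROBE = [0.0, 1.0, 0.0]


def probe_accuracy(encode, weights, feature):
    """Fraction of (a, b) inputs where the linear readout recovers the feature."""
    inputs = list(product([0, 1], repeat=2))
    hits = 0
    for a, b in inputs:
        predicted = dot(weights, encode(a, b)) > 0.5
        actual = (a, b)[feature] == 1
        hits += predicted == actual
    return hits / len(inputs)


def direction_consistency(encode):
    """Cosine between the 'flip a' direction measured at b=0 and at b=1."""
    at_b0 = [x - y for x, y in zip(encode(1, 0), encode(0, 0))]
    at_b1 = [x - y for x, y in zip(encode(1, 1), encode(0, 1))]
    return dot(at_b0, at_b1) / (sqrt(dot(at_b0, at_b0)) * sqrt(dot(at_b1, at_b1)))


for name, encode in [("organized", organized), ("fractured", fractured)]:
    print(
        name,
        "a-probe:", probe_accuracy(encode, A_PROBE, 0),
        "b-probe:", probe_accuracy(encode, B_PROBE, 1),
        "consistency:", direction_consistency(encode),
    )
```

Both encoders score perfectly on the linear probes, so a decodability test cannot tell them apart. They differ only in whether changing `a` moves the representation in the same direction regardless of `b`, which is a statement about how the features relate to each other rather than about whether they are present.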
This connects directly to the "imposter intelligence" post angle and to three related notes: "Can LLMs understand concepts they cannot apply?", "Does supervised fine-tuning actually improve reasoning quality?", and "Do foundation models learn world models or task-specific shortcuts?". All describe the same meta-pattern: surface metrics certify a capability that analysis of internal structure would disqualify.
The practical implication for model evaluation: passing compositional generalization tests does not guarantee robust compositional reasoning. Evaluation under distribution shift, perturbation, and novel recombination is required to distinguish genuine compositionality from fractured representations that happen to contain the right features.
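A held-out recombination test of the kind described above might be sketched as follows. Everything here is an assumption made for illustration: the toy encoders `organized` and `fractured`, the mean-difference probe, and the split in which every value of `a` and `b` is seen during probe fitting but the combination `(1, 1)` is not.

```python
def organized(a, b):
    # Each feature has its own fixed axis.
    return [float(a), float(b), 0.0]


def fractured(a, b):
    # Hypothetical fractured code: the axis that carries a depends on b.
    return [float(a * (1 - b)), float(b), float(a * b)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def fit_mean_difference_probe(examples):
    """Fit a linear probe from (vector, label) pairs using class means."""
    pos = [h for h, y in examples if y == 1]
    neg = [h for h, y in examples if y == 0]
    mean_pos = [sum(col) / len(pos) for col in zip(*pos)]
    mean_neg = [sum(col) / len(neg) for col in zip(*neg)]
    direction = [p - n for p, n in zip(mean_pos, mean_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mean_pos, mean_neg)]
    # Threshold is the projection of the midpoint between the class means.
    return direction, dot(direction, midpoint)


def predict(probe, h):
    direction, threshold = probe
    return int(dot(direction, h) > threshold)


TRAIN = [(0, 0), (1, 0), (0, 1)]  # every value of a and b appears
HELD_OUT = [(1, 1)]               # this combination never appears


def evaluate(encode):
    """Return (accuracy on seen combinations, accuracy on the novel one) for a."""
    probe = fit_mean_difference_probe([(encode(a, b), a) for a, b in TRAIN])
    seen = sum(predict(probe, encode(a, b)) == a for a, b in TRAIN) / len(TRAIN)
    novel = sum(predict(probe, encode(a, b)) == a for a, b in HELD_OUT) / len(HELD_OUT)
    return seen, novel


for name, encode in [("organized", organized), ("fractured", fractured)]:
    seen, novel = evaluate(encode)
    print(name, "seen:", seen, "novel recombination:", novel)
```

In this toy setup the two encoders produce identical vectors on every seen combination, so no in-distribution metric can separate them; only the held-out combination exposes the difference. The result depends on the probe family chosen, so a real evaluation would need to vary the probe and the held-out split.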
Inquiring lines that use this note as a source (145)
This note is a source for these synthesized inquiries.
- Why do only two of fourteen models improve when problem constraints are removed?
- When does the right constraint beat additional model capacity?
- What structural constraints matter more than model depth for CF?
- What distinguishes minimal-pair asymmetry from standard accuracy evaluation?
- How do unstated constraints become invisible to training data distributions?
- What production constraints should determine paradigm selection?
- What makes the frame problem distinct from feature-level shortcuts?
- How do unstated feasibility constraints affect model decision-making?
- What design changes could make constraint inference more reliable without explicit cuing?
- What is the mechanistic signature when models chain facts never presented together?
- Can mechanistic interpretability reveal how ideologies decompose into simpler features?
- How should benchmarks test whether models fit algorithms or patterns?
- How do embedding dimension limits constrain what concept models can represent?
- Can likelihood choice matter more than architectural depth for CF?
- Why do structural signals across edges resist noise better than single-edge counts?
- What compression explains why syntax fits in low-dimensional subspaces?
- What structural features force users to evaluate the epistemic status of outputs?
- Can Kolmogorov complexity alone capture what makes intelligence general?
- How should product specifications measure alignment without naming the dimension?
- How does nesting optimization levels improve on traditional network depth?
- What role do multi-dimensional quality frameworks play in assessing arguments versus single-metric approaches?
- Is interpretive multiplicity a bug in language or a feature?
- What makes AI-discovered architectures reveal design principles invisible to humans?
- How do autonomous pipelines identify and fix silent bugs in data pipelines?
- Does architectural discovery follow an empirical scaling law like neural networks?
- Can a single SAE feature control reasoning behavior across model families?
- Can bilevel autoresearch succeed when the inner and outer loops use different models?
- How much does domain shift limit the mechanisms a bilevel system can autonomously discover?
- Can a world model have rich representations without adequate data coverage?
- How do functional features differ from representational abstract features?
- Do larger models develop more abstract features than smaller ones?
- What makes multimodal conditioning effective when features are decomposed to the right granularity?
- Why do text-to-image models fail at composing multiple concepts together?
- What makes linear decodability a reliable signal of compositionality?
- What task structures benefit most from geometric parameter merging?
- How does optimizing model performance decouple from optimizing user interpretability?
- What makes a self-supervised pruning metric work without labels at scale?
- How does discretization make item representations more distinguishable?
- Can structural perturbations harm model accuracy more than semantic ones?
- Why does most refinement in iterative models maintain answers rather than improve them?
- Why does capturing domain structure reduce data requirements more than raw volume?
- When should model isolation be preferred over weight-averaging approaches?
- Why do power-law distributions make standard ML infrastructure assumptions fail?
- Do multi-vector or cross-encoder models escape these dimensional constraints?
- How much do metric choices inflate claims about model capabilities?
- Why do energy-based models generalize better on out-of-distribution data than standard transformers?
- How do surface statistical regularities enable correct outputs while degrading robustness?
- Why do models fail on logically equivalent tasks with different data distributions?
- How do weight perturbations reveal what performance benchmarks cannot measure?
- Why do single function-calling benchmarks mask model weakness in specific areas?
- Can steering vectors prove that representations are genuinely organized?
- Does scaling data automatically produce compositional reasoning or just better feature encoding?
- What test distinguishes genuine compositionality from fractured feature presence?
- What happens when you remove core political features from a deep model?
- Why do models with less steerability have more abstract ideological features?
- How does adjacent layer sharing differ from non-adjacent weight reuse?
- How does LatentQA differ from predefined concept steering like representation engineering?
- What skills can large models identify and organize about their own abilities?
- How should researchers evaluate whether correct model outputs reflect real structural learning?
- Can identical model performance mask fundamentally broken internal representations?
- What makes some model capabilities reliable while others remain brittle?
- What distinguishes conceptual understanding from statistical pattern matching in models?
- What performance trade-offs emerge when composing multiple independently trained model capabilities?
- Can mechanistic interpretability explain explanation-execution disconnection?
- Can alignment methods like DPO exploit or correct these surface feature biases?
- Why does the gap between theoretical expressiveness and learned capability matter?
- Why do standard transformers fail to encode recursive structure in their hidden states?
- Can fractured entangled representations hide undetected by standard analysis methods?
- Does the linear representation hypothesis reflect networks or reflect our analysis tools?
- Why do linear research pipelines lose global context across planning and generation steps?
- Why do singular value experts compose better than low-rank adapter subspaces?
- Can representation engineering cleanly isolate single features in entangled semantic space?
- Why do metric choices constrain which model capabilities get developed?
- Can structured decomposition fix evaluation gaps in other research tasks?
- How do output format constraints compare to input exemplar brittleness?
- How does fluent output mask the mythic function of a system?
- How does uniform code distribution make items more distinguishable?
- Why do models lack a stable underlying identity to return to?
- What are fractured entangled representations in neural networks?
- How does model weight freezing across users affect virtual instance individuation?
- Why must world models be nested rather than flat and uniform?
- Why do text-only benchmarks underestimate deployed model capability?
- Can granular function calling tasks learn composition from graph-sampled data?
- Does model collapse occur across different architectures or only in specific conditions?
- Why do feature-based approaches struggle when privacy or latent factors are involved?
- What happens when alignment targets measure only the preferred dimension of entangled properties?
- What other behavioral properties exist as linear directions in activation space?
- Can steering vectors be combined with other compression techniques?
- How do sparse circuits compare to the modular subnetworks that emerge naturally?
- Why does weight sparsity reduce superposition and force disentangled representations?
- Can sparse approximations reveal interpretable structure hidden in existing dense models?
- What sparse high-rank patterns does the deep tower fail to capture?
- Why do cross-product features fail to generalize across unseen feature combinations?
- What distinct structural signatures do model repetition and topic volatility create?
- What makes well-formatted outputs misleading as evidence of model capability?
- Can surface-level correctness hide failures in structural learning by LLMs?
- Does model capability still matter once coordination infrastructure is optimized?
- Why is a combinatorial framework better than family resemblance classification?
- How do overparameterization and data size shift what attractors represent?
- How do sharded HNSW indices preserve capability distinctions at scale?
- How does vehicle causality differ from content causality in physical systems?
- Can you steer reasoning by directly manipulating SAE features?
- Why does increased model capability make detection harder in delegated workflows?
- Why do semi-formal templates improve verification accuracy over unstructured reasoning?
- What makes attractor-based probing better for third-party model auditing than alternatives?
- Can geometric structure in representations exist without supporting functional mechanisms?
- What spectral signatures distinguish hierarchy-driven geometry from corpus-driven geometry?
- How does MaxSim reranking differ from structural verification at the token level?
- What limits external scaling when a model lacks reasoning foundation?
- How does mechanistic interpretability complement learning mechanics in explaining deep learning?
- What makes structured stochasticity more effective than unstructured randomness in reasoning?
- What distinguishes a representational feature from a causally inert correlation?
- Can interventions on model components prove mechanism without explaining encoding?
- Does sparsity enforce compositional structure or merely amplify existing modularity?
- Can we predict which tasks will decompose into modular subnetworks?
- Why does gradient descent discover compositional structure without explicit pressure?
- Can entropy signatures alone detect whether context was model-generated or externally prefilled?
- Can sparsity patterns reliably indicate how well a model knows its input?
- How can benchmark accuracy scores mask the absence of interpretable reasoning structure?
- What features does a sample reinforce when it moves bands?
- How does deterministic feature engineering increase information for computationally bounded agents?
- Can categorical correctness signals stop dense optimizers from finding loopholes?
- Do generic kernel-decay assumptions alone explain coarse-to-fine spectral ordering?
- Can spectral eigenvector ordering serve as a model-agnostic interpretability probe?
- Does token-level loss aggregation help aligned models differently?
- What architectural alternatives can capture compositional structure beyond pooled cosine?
- Can representation analysis methods detect complex features models compute with?
- Does Gemma's transformer explicitly exploit the inherited hierarchical geometry?
- Why does the right structural prior matter more than raw model capacity?
- How can neural networks be interpretable by design rather than post-hoc?
- How can expensive models efficiently support cheap models in production?
- How do mechanistic features compare to natural language for interpretability?
- How do coverage and identifiability set separate performance ceilings?
- How do local soundness signals work across different problem domains?
- What physical structure does a Gaussian-regularized latent space actually encode?
- What makes regularization an implicit factor in embedding geometry?
- Do feature extraction methods systematically miss computationally important complex features?
- What makes a feature abstract versus concrete in neural network activations?
- Can a single Elo ranking represent multidimensional model capability?
- What makes API-based scaffolding more trustworthy than direct model access in high-stakes domains?
- How does scaling and training data enable compositional behavior without symbolic mechanisms?
- What distinctive properties make open foundation models different from closed ones?
- What benefits do open foundation models create that closed systems cannot?
- What prevents representation collapse in latent-prediction world models like JEPA?
- Can seedless generation maintain explainability while scaling control?
Related concepts in this collection (1)
- Can we track and steer personality shifts during model finetuning?
This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
Persona vectors are a case where linear decodability does correspond to genuine functional organization (steering works), a positive counterpoint to FER's warning that decodability alone is insufficient.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning
- Titans: Learning to Memorize at Test Time
- Break It Down: Evidence for Structural Compositionality in Neural Networks
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Large Language Model Reasoning Failures
Original note title
identical performance metrics can mask fundamentally different internal representations — feature linear decodability does not guarantee representational organization