Can similar outputs from different systems prove they work the same way?

This explores whether two systems producing the same outputs — same answers, same benchmark scores, same fluent text — are actually doing the same thing inside, and the corpus says: no, sameness of output is one of the weakest signals you can have.

The collection's recurring answer is that matching outputs do not prove matching mechanisms: identical behavior can sit on top of radically different machinery. The sharpest version is the "fractured entangled representation" finding: networks trained by ordinary gradient descent can reproduce an evolved network's outputs perfectly across every input while their internal structure is tangled and brittle in ways the evolved network's is not (see "Can identical outputs hide broken internal representations?"). The same idea drives the "imposter intelligence" worry: a model can ace every test and still have incoherent internals, because standard benchmarks see only the outputs, not the structure that produced them (see "Can AI pass every test while understanding nothing?"). A library-level note generalizes this: the same output can be reached through different mechanisms, and pushing one visible metric like accuracy often quietly degrades hidden ones like faithfulness or calibration (see "What actually happens inside a language model?").
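The gap between matching outputs and matching mechanisms can be made concrete with a toy sketch. It is not the method of the fractured-representation work (which the source notes describe as using weight perturbations); it is only an illustration, and the function names, the benchmark, and the clamping fallback are all assumptions invented here. Two hypothetical "systems" agree on every benchmark input, yet one applies a general rule and the other replays a memorized table.

```python
# Toy illustration only: two hypothetical "systems" that agree on every
# benchmark input but reach their answers through different mechanisms.

# Assumed benchmark: all pairs of single-digit operands.
BENCHMARK = [(a, b) for a in range(10) for b in range(10)]


def rule_based_add(a: int, b: int) -> int:
    """Computes the sum with a general rule that holds for any integers."""
    return a + b


# Memorized answers covering exactly the benchmark inputs.
_TABLE = {(a, b): a + b for a, b in BENCHMARK}


def memorized_add(a: int, b: int) -> int:
    """Looks the answer up; unseen inputs are clamped to the memorized range."""
    if (a, b) in _TABLE:
        return _TABLE[(a, b)]
    a_clamped = min(max(a, 0), 9)
    b_clamped = min(max(b, 0), 9)
    return _TABLE[(a_clamped, b_clamped)]


def agreement(inputs) -> float:
    """Fraction of inputs on which the two systems return the same output."""
    matches = sum(rule_based_add(a, b) == memorized_add(a, b) for a, b in inputs)
    return matches / len(inputs)


# Inputs drawn from outside the benchmark range (a simple distribution shift).
SHIFTED = [(a, b) for a in range(10, 20) for b in range(10, 20)]

in_dist = agreement(BENCHMARK)
shifted = agreement(SHIFTED)
print(f"benchmark agreement: {in_dist:.2f}")  # 1.00
print(f"shifted agreement:   {shifted:.2f}")  # 0.00
```

On the benchmark the two functions are indistinguishable, so a benchmark score alone says nothing about which mechanism produced it. The difference only appears when the inputs move outside what the table covers.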

The deepest cut here is that even a single system's repeated identical outputs don't prove reliability. Set temperature to zero and you get the same answer every time, but that is just one fixed draw from a probability distribution, replayed; consistency is not the same thing as being right (see "Does setting temperature to zero actually make LLM outputs reliable?"). So if even one model repeating itself can't prove its own soundness, two different models agreeing proves even less.
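A minimal sketch of the temperature-zero point, under invented assumptions: the logits, the answer labels, and the "correct" answer below are made up for illustration and do not come from any real model. Greedy decoding (temperature zero) always returns the highest-scoring answer, so repeated runs agree perfectly even when that answer is wrong and a near-tie sits right behind it.

```python
import math
import random
from collections import Counter

# Hypothetical logits for three candidate answers; the values are invented.
LOGITS = {"A": 2.0, "B": 1.9, "C": 0.5}
CORRECT = "B"  # assumed ground truth, chosen so the top-scoring answer is wrong


def sample(logits: dict, temperature: float, rng: random.Random) -> str:
    """Returns one answer: the argmax at temperature 0, a softmax draw otherwise."""
    if temperature == 0:
        return max(logits, key=logits.get)
    weights = [math.exp(value / temperature) for value in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


rng = random.Random(0)
greedy_runs = Counter(sample(LOGITS, 0, rng) for _ in range(100))
sampled_runs = Counter(sample(LOGITS, 1.0, rng) for _ in range(1000))

consistency = max(greedy_runs.values()) / 100  # share of runs giving the modal answer
accuracy = greedy_runs[CORRECT] / 100          # share of runs giving the assumed truth

print(f"greedy consistency: {consistency:.2f}")  # 1.00
print(f"greedy accuracy:    {accuracy:.2f}")     # 0.00
print(f"answers seen at temperature 1.0: {sorted(sampled_runs)}")
```

The greedy runs are perfectly consistent and, in this constructed case, never correct. Sampling at a nonzero temperature exposes the spread that temperature zero hides.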

The corpus also reframes the question philosophically: when an AI and a human produce the same sentence, are they doing the same thing? Several notes argue no. Humans use language to address and relate to others, while a model emits strings from a probability distribution; the two share surface form but differ in what generates it and what it does socially (see "Are language models and human speakers doing the same thing?"). This is exactly why behavioral tests mislead: a test calibrated to "does it produce appropriate text" will pass any fluent system, including ones with none of the underlying conditions you actually care about, and so generates false positives (see "Does behavioral speech output prove communicative subjecthood?"). Chain-of-thought is the concrete case: output that looks like reasoning can be learned imitation of reasoning's form, and the giveaway is that it breaks predictably when you shift the distribution, which a genuine mechanism wouldn't (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?").

The less obvious flip side: the corpus also shows that two systems can produce different outputs while running the same underlying process, and the same output while encoding different things. Semantically identical prompts yield systematically different answers because the model tracks how often a phrasing appeared in training, not its meaning, so "same meaning" doesn't even guarantee same output (see "Why do semantically identical prompts produce different LLM outputs?"). And small models can match big ones on reasoning benchmarks by learning only the output format, suggesting the shared score reflects a borrowed surface, not shared capability (see "Can small models reason well by just learning output format?"). The throughline across all of it: output equivalence is evidence about the output, never proof about the process. To claim two systems work the same way you have to look inside, because the surface is engineered to look the same whether or not the insides agree.

Sources 9 notes

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

What actually happens inside a language model?

Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Are language models and human speakers doing the same thing?

LLMs produce strings via probability distributions; humans use language to address and relate to others. They share surface form but differ in what produces output, what it does socially, and what receivers should do with it.

Does behavioral speech output prove communicative subjecthood?

Chalmers' test passes any system producing contextually appropriate text, but communicative subjecthood requires relational-normative conditions like accountability and evaluative stance. The test is calibrated to the wrong phenomenon, creating false positives, like puppets that make walk-shaped motions without walking.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **Can identical outputs from two different AI systems prove they use the same internal mechanisms?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat each as perishable:
• Fractured entangled representations: networks can match outputs perfectly while having radically different internal structure and brittleness (2025-05, arXiv:2505.11581).
• Chain-of-thought output can be learned imitation of reasoning's surface form, not genuine inference, breaking predictably under distribution shift (2025-06, arXiv:2506.02878).
• Small models achieve competitive scores on reasoning benchmarks via format adaptation (LoRA), suggesting shared output reflects borrowed surface, not shared capability (2025-04, arXiv:2504.15777).
• Semantically identical prompts yield different outputs because models track training frequency, not meaning; "same meaning" doesn't guarantee same behavior (2026-04, arXiv:2604.02176).
• Behavioral tests calibrated to output appropriateness will pass any fluent system, including ones with none of the underlying conditions you care about (2025-01, arXiv:2501.16496).

Anchor papers (verify; mind their dates):
• arXiv:2505.11581 (2025-05) — Fractured Entangled Representations
• arXiv:2506.02878 (2025-06) — CoT as Constrained Imitation
• arXiv:2604.02176 (2026-04) — Adam's Law (textual frequency)
• arXiv:2501.16496 (2025-01) — Mechanistic Interpretability Open Problems

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer introspection tools (circuits, attention pattern libraries), mechanistic probes (sparse autoencoders, dictionary learning), multi-model comparison studies, or improved behavioral tests (adversarial, OOD, causal intervention) have since RELAXED or OVERTURNED the claim that surface equivalence masks internal divergence. Flag which constraints still hold and which have been dissolved.
(2) **Surface strongest RECONCILING or SUPERSEDING work** from the last ~3–6 months that argues *for* structural similarity despite output equivalence, or shows methods that *do* reliably infer mechanism from behavior.
(3) **Propose 2 research questions** that assume the regime may have moved: e.g., "Under what conditions do identical outputs *guarantee* mechanistic overlap?" or "Can a single integrated interpretability method now reliably map output equivalence to internal isomorphism?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
