How can humans evaluate explanations from systems they did not train?

This explores how a person can judge whether an AI's explanation is trustworthy when they have no window into how the system was built — and what the corpus says actually works versus what merely feels convincing.

The central issue is the gap between an explanation that *persuades* you and one that *informs* you, and the corpus's uncomfortable answer is that the two come apart almost completely. The most direct finding is that reasoning traces and after-the-fact justifications make people *more* willing to accept an AI's answer whether or not it is correct, manufacturing confidence rather than calibrating it (see "Do explanations actually help users spot AI mistakes?"). The one format that actually helped users separate right from wrong answers was a contrastive one that argued both sides: the explanation only became useful when it was forced to expose its own weak points. So the first move for evaluating a system you didn't train is to distrust fluent one-sided explanations by default, and to demand the case *against* the answer too.
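One way to make "demand the case against the answer" routine is to wrap every answer you want to check in a two-sided request. The sketch below is only an illustration: the function name `contrastive_request` and the prompt wording are assumptions made here, not the format used in the study the corpus cites.

```python
def contrastive_request(question: str, answer: str) -> str:
    """Build a prompt asking for the strongest case for AND against an answer.

    The wording is hypothetical; adapt it to the system being evaluated.
    """
    return (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "1. Give the strongest argument that this answer is correct.\n"
        "2. Give the strongest argument that this answer is wrong.\n"
    )


prompt = contrastive_request("Is 91 a prime number?", "Yes")
print(prompt)
```

The point of the design is that both arguments are requested in the same turn, so the reader sees the weak points alongside the supporting case rather than only a one-sided justification.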

Why is fluency such a bad signal? Because the corpus documents a whole family of failures in which explanation and actual competence run on disconnected pathways. "Potemkin understanding" is the pattern where a model explains a concept correctly, then fails to apply it, and can even recognize its own failure, a combination no human teacher would expect (see "Can LLMs understand concepts they cannot apply?"). The same split shows up measured directly: roughly 87% accuracy when articulating a principle versus 64% when acting on it, which the authors call a computational split-brain rather than a knowledge gap (see "Can language models understand without actually executing correctly?"). These aren't random errors but repeatable, named failure modes (see "How do LLMs fail to know what they seem to understand?"). That matters for the evaluator, because it means a beautiful explanation tells you the system can *produce explanation-shaped text*, not that the underlying answer is sound.
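The articulation-versus-application split can be measured on your own test items if you record, for each item, whether the system explained the principle correctly and whether it applied it correctly. The following is a minimal sketch under that assumption; the function name and the ten toy records are hypothetical and are not data from the cited papers.

```python
from typing import List, Tuple


def articulation_application_gap(
    results: List[Tuple[bool, bool]],
) -> Tuple[float, float, float]:
    """Each record is (explained_correctly, applied_correctly) for one item.

    Returns (explain accuracy, apply accuracy, share of items that were
    explained correctly but applied incorrectly).
    """
    n = len(results)
    explain_acc = sum(e for e, _ in results) / n
    apply_acc = sum(a for _, a in results) / n
    potemkin_rate = sum(1 for e, a in results if e and not a) / n
    return explain_acc, apply_acc, potemkin_rate


# Hypothetical toy records: 6 fully correct, 3 explain-only, 1 fully wrong.
toy = [(True, True)] * 6 + [(True, False)] * 3 + [(False, False)]
explain_acc, apply_acc, potemkin_rate = articulation_application_gap(toy)
print(explain_acc, apply_acc, potemkin_rate)
```

Pairing the two outcomes per item matters: two aggregate accuracies alone cannot tell you how many items show the explain-right, apply-wrong pattern.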

The deeper, more unsettling result is that the explanation may not even reflect how the answer was reached. Models trained on deliberately corrupted, irrelevant reasoning traces solve problems just as well as those trained on correct ones, which suggests the trace works as computational scaffolding rather than as a faithful account of the steps taken (see "Do reasoning traces need to be semantically correct?"). And a system can ace every benchmark while carrying an internally incoherent representation that standard tests can't detect at all (see "Can AI pass every test while understanding nothing?"). So you cannot evaluate the explanation by reading it sympathetically; the words on the page may be a post-hoc surface over a process that looks nothing like them.

This points toward evaluating *structure* instead of *prose*. One line of work proposes three properties you can actually test from the outside: traceability (can you follow the steps to the conclusion), counterfactual adaptability (does the explanation change correctly when you change the inputs), and compositionality (do the reasoning pieces recombine sensibly) (see "Can we measure reasoning quality beyond output plausibility?"). The counterfactual test is especially powerful for an outsider: you don't need to have trained the system to perturb its inputs and check whether its stated reasoning moves the way it claims it should. That converts "do I believe this explanation" into a probe you can run yourself.
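A counterfactual probe of this kind can be sketched in a few lines. The version below assumes the system can be called as a black box and that its explanation can be reduced to a list of input fields it claims to have relied on; the `Explained` type, the `System` interface, and the toy loan decision are all hypothetical stand-ins, not an API from the cited work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Explained:
    answer: str
    cited_factors: List[str]  # input fields the explanation says it relied on


# Hypothetical interface: a callable from named inputs to an answer plus cited factors.
System = Callable[[Dict[str, float]], Explained]


def counterfactual_probe(
    system: System,
    inputs: Dict[str, float],
    perturbations: Dict[str, float],
) -> List[Tuple[str, bool, bool, bool]]:
    """Perturb one field at a time; return (field, was_cited, answer_moved, mismatch)."""
    base = system(inputs)
    report = []
    for field, new_value in perturbations.items():
        changed = {**inputs, field: new_value}
        answer_moved = system(changed).answer != base.answer
        was_cited = field in base.cited_factors
        # Mismatch: a cited factor left the answer unchanged, or an uncited one flipped it.
        report.append((field, was_cited, answer_moved, was_cited != answer_moved))
    return report


def toy_system(x: Dict[str, float]) -> Explained:
    # Toy stand-in: decides on income alone, yet its explanation cites credit_score.
    answer = "approve" if x["income"] >= 50_000 else "deny"
    return Explained(answer, ["credit_score"])


report = counterfactual_probe(
    toy_system,
    inputs={"income": 60_000, "credit_score": 700},
    perturbations={"income": 40_000, "credit_score": 400},
)
for row in report:
    print(row)
```

In the toy run both fields are flagged: changing income flips the answer although the explanation never mentions it, and changing the cited credit score changes nothing. A mismatch is a prompt for follow-up rather than a verdict, since a cited factor may simply not have been perturbed far enough to matter.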

There's also a human factor the corpus flags: your judgment shifts based on what you think made the explanation. People rated AI-generated moral justifications *higher* than human ones until they were told the source was AI, at which point agreement dropped; the content preference and the source rejection operate as two independent psychological channels (see "Do people prefer AI moral reasoning when they don't know the source?"). The lesson for an evaluator is that your reaction is already entangled with your beliefs about the source, in both directions, which is exactly why the corpus keeps pushing toward external, structural, both-sides tests rather than gut assessment of how good the explanation *sounds*.

Sources (8 notes)

Do explanations actually help users spot AI mistakes?

Reasoning traces and post-hoc explanations increase user acceptance of AI answers regardless of correctness, engendering false trust. Only dual explanations presenting arguments for and against the answer genuinely help users distinguish correct from incorrect outputs.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis holds that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Can we measure reasoning quality beyond output plausibility?

Research identifies traceability, counterfactual adaptability, and motif compositionality as testable measures of human-like reasoning. These structural properties reveal whether an agent genuinely reasons causally or merely mimics coherent speech.

Do people prefer AI moral reasoning when they don't know the source?

Participants rated utilitarian moral arguments higher when they were generated by LLMs, but agreement dropped once participants were told the arguments were AI-generated. The preference for content and rejection of source operate independently through different psychological processes.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an evaluator assessing explanations from AI systems you did not train. The question remains open: what structural, testable properties distinguish a persuasive explanation from a trustworthy one?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and expose a sharp split between explanation fluency and actual competence:

• Fluent one-sided explanations increase user acceptance regardless of correctness; only contrastive explanations (arguing both sides) help users separate right from wrong answers (~2025).
• "Potemkin understanding": models explain concepts correctly but fail to apply them; direct measurement shows ~87% accuracy articulating principles versus 64% when acting on them — a computational split-brain, not a knowledge gap (~2025).
• Explanations may not reflect how answers were reached: models trained on deliberately corrupted reasoning traces perform comparably to those trained on correct ones; the trace is scaffolding, not a faithful account (~2025).
• Three measurable properties work from the outside: traceability (can you follow steps), counterfactual adaptability (does explanation change correctly when inputs shift), and compositionality (do reasoning pieces recombine sensibly) (~2025).
• Source belief contaminates judgment: people rate AI moral justifications higher than human ones until told the source is AI, at which point agreement drops — two independent psychological channels (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2507.10624 — Comprehension Without Competence (2025)
• arXiv:2605.10930 — Evaluating the False Trust Engendered by LLM Explanations (2026)
• arXiv:2410.07304 — The Moral Turing Test (2024)
• arXiv:2501.11721 — Explain-Query-Test (2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the split between articulation and application, probe whether recent fine-tuning methods (e.g., process supervision, constitutional AI variants, or scaffold-aware training) have narrowed the 87%–64% gap. For explanation fidelity, check whether mechanistic interpretability or activation-level tracing now reliably recover the actual computation path—or whether the corruption robustness finding still holds under newer architectures. Flag what resolves these constraints and what persists.

(2) Surface the strongest work from the last ~6 months that contradicts or supersedes the "fluent explanations mislead" thesis—especially any showing that certain explanation formats (e.g., uncertainty-aware, multi-model ensemble, or interactive refinement) do calibrate trust without contrastive framing.

(3) Propose 2 research questions that assume the regime may have shifted: (a) If mechanistic interpretability can now faithfully expose computation, do users who see it accept explanations differently than those who see prose? (b) Do counterfactual-adaptability tests scale to long-horizon reasoning, or do they only work for short, stateless queries?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
