SYNTHESIS NOTE

Why do different people reconstruct the same argument differently?

When humans and LLMs extract logical structure from arguments, they produce different reconstructions. Is this disagreement a problem to solve, or does it reveal something fundamental about how arguments work?

Synthesis note · 2026-02-21 · sourced from Argumentation

Argunauts (Argument Annotation Units) is a dataset and benchmark for argument reconstruction — extracting explicit logical structures from natural language arguments. The dataset's most significant finding is methodological: when multiple annotators (human and LLM) reconstruct the same argument independently, they produce different but equally valid reconstructions.

This is not annotation disagreement in the sense of noise to be resolved. Multiple reconstruction schemas — different choices about what counts as a premise, how to formalize the conclusion, what implicit assumptions to make explicit — are each internally valid. There is no gold standard because the text underdetermines the reconstruction.

This connects directly to "Why do readers interpret the same sentence so differently?", but at a structural rather than semantic level. Interpretive multiplicity in NLI is about meaning: what a sentence means depends on the reader's social position. Reconstruction multiplicity in argumentation is about structure: how an argument should be formalized depends on which reconstruction schema is applied.

Both findings challenge the NLP assumption that language processing tasks have unique correct outputs. "Do standard NLP benchmarks hide LLM ambiguity failures?" describes how benchmarks respond to this problem by exclusion, filtering out ambiguous examples before testing. For argumentation, exclusion is not possible: underdetermination is a feature of the task itself, not of edge cases.

The practical implication: evaluating LLMs on argument reconstruction requires acknowledging that precision and recall metrics assume ground truth that does not exist. Models that disagree with a reference annotation may be producing equally valid reconstructions. The field is measuring agreement with one valid interpretation and calling it correctness.
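The gap between "agreement with one reference" and "correctness" can be shown with a small sketch. Everything in it is an illustrative assumption rather than part of Argunauts: the `Reconstruction` type, the toy argument, and the use of exact string matching between premises (a real evaluation would need some form of semantic matching). It scores one hypothetical model output against two hand-written references for the argument "It is raining, so the match will be cancelled."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reconstruction:
    """One annotator's reconstruction (hypothetical schema, for illustration only)."""
    premises: frozenset
    conclusion: str


def premise_f1(predicted: Reconstruction, reference: Reconstruction) -> float:
    """F1 over premise sets, treating `reference` as if it were ground truth."""
    overlap = len(predicted.premises & reference.premises)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted.premises)
    recall = overlap / len(reference.premises)
    return 2 * precision * recall / (precision + recall)


def score_against_references(predicted, references):
    """Compare the single-reference score with multi-reference summaries."""
    scores = [premise_f1(predicted, ref) for ref in references]
    return {
        "single_reference": scores[0],  # what a one-gold-standard benchmark reports
        "best_match": max(scores),      # agreement with the closest valid reference
        "mean": sum(scores) / len(scores),
    }


conclusion = "the match will be cancelled"

# Reference A makes one bridging conditional explicit.
ref_a = Reconstruction(
    premises=frozenset({
        "it is raining",
        "if it rains, the match is cancelled",
    }),
    conclusion=conclusion,
)

# Reference B unpacks the same inference into two finer-grained premises.
ref_b = Reconstruction(
    premises=frozenset({
        "it is raining",
        "rain makes the pitch wet",
        "matches are not played on wet pitches",
    }),
    conclusion=conclusion,
)

# Hypothetical model output that happens to follow reference B's schema.
model_output = Reconstruction(premises=ref_b.premises, conclusion=conclusion)

result = score_against_references(model_output, [ref_a, ref_b])
print(result)
```

Scored against reference A alone, the model output looks poor; scored against reference B, it is a perfect match. Reporting the best match or the mean over references is one possible mitigation, not a claim about how the benchmark itself is scored, and it still presupposes that the reference set covers the valid reconstructions.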

This also grounds "Why do speakers deliberately use ambiguous language?" from a new angle: structural ambiguity (multiple valid formalizations of the same argument) is as fundamental as semantic ambiguity.

Related concepts in this collection 3

Why do readers interpret the same sentence so differently?
How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings.
Relation: that note covers semantic multiplicity, this one structural multiplicity; both share the same root problem.

Why do speakers deliberately use ambiguous language?
Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.
Relation: the broader principle that this note exemplifies at the argument-structure level.

Do standard NLP benchmarks hide LLM ambiguity failures?
When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?
Relation: benchmark exclusion as the standard NLP response to underdetermination.

Original note title

argument reconstruction is fundamentally underdetermined because multiple valid reconstructions exist for the same text with no ground truth
