SYNTHESIS NOTE
Language, Text, and Discourse · Reasoning, Retrieval, and Evaluation

Can models learn argument quality from labeled examples alone?

Explores whether fine-tuning on quality-labeled examples teaches models the underlying criteria for evaluating arguments, or merely surface patterns. Matters because high-stakes assessment tasks depend on reliable, transferable quality judgment.

Synthesis note · 2026-02-21 · sourced from Argumentation

Argument Quality Assessment research trains models to evaluate the quality of arguments: are they logically valid? Well-supported? Relevant? Clear? The standard approach is supervised fine-tuning: label examples as high or low quality, train on them, then test whether the learned judgments transfer to new arguments.

The finding: fine-tuning on quality-labeled examples does not reliably teach models what makes an argument good. Models learn to pattern-match against the labeled examples but do not acquire the underlying criteria that would generalize to new argument types. When explicit theoretical frameworks (RATIO: Relevance, Acceptability, Sufficiency; QOAM: Quality of Argumentation Model) are provided as structured instruction, performance improves significantly.

Theory injection works where pattern learning fails.

This is a specific instance of the problem raised in "Can models pass tests while missing the actual grammar?": models that score highly on quality assessments within the training distribution fail to transfer the criteria to out-of-distribution argument types. The learned pattern is "this looks like the high-quality arguments in the training data" rather than "this argument satisfies the following criteria for quality."
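One way to make the transfer failure described above measurable is to compare accuracy on argument types seen during fine-tuning with accuracy on held-out types. The following is a minimal sketch, not a method taken from the source: the function names `accuracy_by_type` and `transfer_gap`, the record layout, and the choice of a macro-average over types are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Per-type accuracy.

    records: iterable of (argument_type, gold_label, predicted_label) tuples.
    This record layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for arg_type, gold, pred in records:
        total[arg_type] += 1
        correct[arg_type] += int(gold == pred)
    return {t: correct[t] / total[t] for t in total}


def transfer_gap(records, training_types):
    """Macro-averaged accuracy on seen types minus that on unseen types.

    A large positive gap is consistent with the model having learned the
    training distribution rather than transferable quality criteria.
    """
    per_type = accuracy_by_type(records)
    seen = [acc for t, acc in per_type.items() if t in training_types]
    unseen = [acc for t, acc in per_type.items() if t not in training_types]
    if not seen or not unseen:
        raise ValueError("need at least one seen and one unseen argument type")
    return sum(seen) / len(seen) - sum(unseen) / len(unseen)


# Toy, made-up predictions purely to show the call shape.
example_records = [
    ("policy", "high", "high"),
    ("policy", "low", "low"),
    ("legal", "high", "low"),
    ("legal", "low", "low"),
]
example_gap = transfer_gap(example_records, training_types={"policy"})
```

The toy records are invented to show the call shape and carry no empirical weight. A gap near zero would suggest the learned judgment transfers. It would not by itself show that the model applies the intended criteria.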

The implication extends beyond argumentation. Whenever an evaluation task requires applying principled criteria that are not explicit in the labeled data (quality, fairness, coherence, persuasiveness), fine-tuning on examples risks teaching the distribution rather than the criteria. The note "Why do different people reconstruct the same argument differently?" points at the same problem from the other direction: if there is no gold standard, labeled examples cannot straightforwardly encode the right criteria.

The practical consequence: assessment tasks in high-stakes domains (argument quality in legal reasoning, argument validity in policy analysis) should not rely on fine-tuned models trained only on labeled examples. Explicit criteria instruction — prompting with theoretical frameworks, structured evaluation rubrics — is required.
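Explicit criteria instruction of this kind can be sketched as a prompt builder that states each criterion before asking for a judgment. This is an illustrative sketch only: the criterion names come from the framework named earlier in this note, but the one-line descriptions are paraphrases supplied here, and `build_rubric_prompt`, the prompt wording, and the 1-5 scale are hypothetical choices rather than anything specified by the source.

```python
# Criterion names follow the Relevance / Acceptability / Sufficiency framework
# named in this note. The descriptions are illustrative paraphrases (assumption).
CRITERIA = {
    "Relevance": "Do the premises bear on the conclusion?",
    "Acceptability": "Are the premises worthy of being believed?",
    "Sufficiency": "Do the premises together give enough support for the conclusion?",
}


def build_rubric_prompt(argument, criteria=CRITERIA):
    """Return a plain-text prompt that asks for a per-criterion rating.

    The 1-5 scale and the wording are hypothetical; adapt them to the task.
    """
    lines = [
        "Evaluate the argument below against each criterion.",
        "For every criterion, give a rating from 1 to 5 and a one-sentence reason.",
        "",
        "Criteria:",
    ]
    for name, description in criteria.items():
        lines.append(f"- {name}: {description}")
    lines += ["", "Argument:", argument]
    return "\n".join(lines)


prompt = build_rubric_prompt(
    "We should fund the program because a pilot study showed benefits."
)
```

The resulting string can be passed to whatever model interface is in use. Keeping the criteria in a dictionary makes it straightforward to swap in a different framework or a domain-specific rubric without changing the builder.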

Original note title

argument quality assessment requires explicit theoretical framework instruction because quality criteria cannot be learned from examples alone