SYNTHESIS NOTE

Can large language models classify argument schemes reliably?

Explores whether LLMs can recognize Walton's 60+ argument schemes—abstract patterns of reasoning rather than surface features—and what conditions enable accurate classification.

Synthesis note · 2026-05-18 · sourced from Argumentation

Classifying an argument under Walton's taxonomy of 60+ schemes is a harder task than it looks. It requires recognizing the form of presumptive inference (argument from expert opinion, argument from cause to effect, argument from analogy) rather than the surface lexicon. The systematic evaluation across seven LLMs finds that zero-shot prompting fails almost uniformly and that few-shot prompting with examples helps, but the reliable lift comes from adding descriptions of the schemes. Even then, only larger models clear F1 ~0.55, with Claude topping out at 0.65.
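The three prompting conditions compared above can be sketched as a single prompt builder. This is a minimal illustration, not the evaluated study's actual prompts: the function name `build_prompt`, the mode names, the label strings, the paraphrased scheme descriptions, and the example arguments are all assumptions made for this sketch.

```python
# Hypothetical sketch of the three prompting conditions: zero-shot,
# few-shot with examples, and few-shot with scheme descriptions.
# Descriptions are loose paraphrases of three Walton schemes, not quotations.
SCHEMES = {
    "expert_opinion": "An expert in a domain asserts a claim within that "
                      "domain, so the claim may plausibly be accepted.",
    "cause_to_effect": "Generally, if A occurs then B occurs; A occurs, "
                       "so B will plausibly occur.",
    "analogy": "Case 1 is similar to case 2; a claim holds in case 1, "
               "so it plausibly holds in case 2.",
}

# Invented demonstration arguments, one per scheme.
EXAMPLES = [
    ("A cardiologist says statins lower heart attack risk, so they do.",
     "expert_opinion"),
    ("Heavy rain usually floods this road; it is raining heavily, "
     "so the road will flood.", "cause_to_effect"),
    ("Banning smoking in bars worked in one city, and our city is "
     "similar, so it will work here.", "analogy"),
]

MODES = ("zero_shot", "few_shot", "few_shot_desc")


def build_prompt(argument: str, mode: str = "zero_shot") -> str:
    """Assemble a classification prompt for one of the three conditions."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    lines = [
        "Classify the argument into exactly one scheme.",
        "Labels: " + ", ".join(SCHEMES),
    ]
    if mode == "few_shot_desc":
        # Only this condition shows the model what each scheme means.
        lines.append("Scheme descriptions:")
        lines.extend(f"- {name}: {desc}" for name, desc in SCHEMES.items())
    if mode in ("few_shot", "few_shot_desc"):
        lines.append("Examples:")
        lines.extend(f"Argument: {text}\nScheme: {label}"
                     for text, label in EXAMPLES)
    lines.append(f"Argument: {argument}\nScheme:")
    return "\n".join(lines)
```

Calling `build_prompt(text, "few_shot_desc")` yields the richest condition; the other two modes drop the descriptions, or both the descriptions and the examples, so the same argument can be run under all three settings.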

The size-dependence is the most informative finding. Smaller LLMs plateau in roughly the same range as pre-trained language models like BERT (F1 0.53). This is not a "scale solves it" curve but something closer to a step function: the task seems to require enough representational capacity to hold an abstract scheme template in working memory while comparing it against a candidate argument. Below that capacity, models pattern-match on surface lexical features and miss the inferential structure that defines a scheme.
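For readers who want to reproduce this kind of comparison, the sketch below shows one common way to score multi-class scheme predictions. The note does not say which F1 averaging the evaluation used; macro-averaging is an assumption here, and the function name `macro_f1` and the toy labels are hypothetical.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or pred.

    Macro-averaging is an assumption of this sketch; the source note
    reports F1 without naming the averaging scheme.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never matches.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy usage with invented labels: one "analogy" item is mislabelled.
gold = ["analogy", "analogy", "expert_opinion", "expert_opinion"]
pred = ["analogy", "expert_opinion", "expert_opinion", "expert_opinion"]
score = macro_f1(gold, pred)
```

With many schemes and few examples per scheme, macro-averaging gives rare schemes the same weight as frequent ones, which is why the choice of averaging matters when comparing models on a 60+ label taxonomy.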

The cognitive-load framing the authors invoke is consistent with this: scheme classification is harder than component identification (claim, premise, warrant) or stance detection because the unit of recognition is a pattern of reasoning, not a piece of text. A premise is recognizable from its position; a scheme is recognizable only by integrating premises, conclusion, and the inferential move connecting them.

The practical consequence for argumentation systems: zero-shot scheme tagging is not yet a viable component. Pipelines that need scheme labels — for argument generation, legal/medical reasoning, dialectical evaluation — need at minimum few-shot with descriptions and larger models. The cheaper alternative is to use scheme critical questions as a prompting structure instead of trying to classify into schemes after the fact.
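The critical-questions alternative can be sketched as a prompt template. This is an illustration under assumptions: the pipeline is assumed to choose the scheme up front (for example, when generating or reviewing an argument of a known type), the questions are paraphrases of the critical questions commonly associated with Walton's argument from expert opinion, and the names `CRITICAL_QUESTIONS` and `build_cq_prompt` are hypothetical.

```python
# Hypothetical sketch: use a scheme's critical questions to structure the
# prompt, instead of asking a model to classify the scheme afterwards.
# Questions are paraphrased for illustration, not quoted from a source.
CRITICAL_QUESTIONS = {
    "expert_opinion": [
        "How credible is the cited source as an expert?",
        "Is the source an expert in the field the claim belongs to?",
        "What exactly did the source assert that implies the claim?",
        "Is the source personally reliable and trustworthy?",
        "Is the claim consistent with what other experts assert?",
        "Is the source's assertion based on evidence?",
    ],
}


def build_cq_prompt(argument: str, scheme: str) -> str:
    """Build an evaluation prompt that walks through one scheme's questions."""
    if scheme not in CRITICAL_QUESTIONS:
        raise KeyError(f"no critical questions registered for {scheme!r}")
    numbered = "\n".join(
        f"{i}. {question}"
        for i, question in enumerate(CRITICAL_QUESTIONS[scheme], start=1)
    )
    return (
        f"Argument:\n{argument}\n\n"
        "Answer each critical question in turn, then state whether the "
        "argument survives them.\n"
        f"{numbered}\n"
    )
```

Because the scheme is fixed by the caller, no classification step is needed: the model only has to answer concrete questions about the argument, which sidesteps the recognition problem described above.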

Original note title

LLMs classify argument schemes satisfactorily only in few-shot with descriptions — zero-shot and smaller models fail the cognitive load of stereotypical reasoning patterns