Why do LLM descriptions of argument schemes work better than formal definitions for classification?

This explores why an LLM's own plain-language paraphrase of an argument scheme beats a formal Walton-style logical definition when you prompt it to sort arguments into categories.

The corpus gives a direct answer and then a deeper one. The direct finding is that paraphrases match the model's training distribution better than formal logical vocabulary: the model has simply seen far more text that talks about reasoning in ordinary words than text written in the compressed notation of argumentation theory. A definition isn't 'better' just because it's more rigorous; it's better only if it lands in territory the model has actually traveled.

Why that should be true gets sharper when you look at what LLMs are doing under the hood. They behave as in-context semantic reasoners, leaning on token associations and commonsense rather than manipulating formal symbols. Strip the everyday semantics out of a task and performance collapses even when the correct rules are sitting right there in the prompt. A formal Walton definition does exactly that stripping: it replaces familiar phrasing with logical scaffolding the model can't reason over symbolically, so it loses the very handhold it relies on. The same result follows from viewing an LLM as an autoregressive probability machine: such a model finds low-probability phrasings systematically harder, and formal definitions are exactly the rare, low-probability register that trips it up.

There's a second layer worth knowing: even with good descriptions, this task is just hard. Argument scheme classification stumbles where other NLP tasks succeed because it requires spotting an inferential pattern spread across a whole passage, not a local surface cue, which is why models plateau around F1 0.55–0.65 here while clearing 0.80 on simpler tagging tasks. And it only works at all with few-shot examples plus descriptions; zero-shot fails across the board. So the description isn't a minor prompt-tuning trick. It's load-bearing, because it carries the model over a representational gap that formal definitions widen rather than close.
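The few-shot-plus-descriptions setup described above can be sketched as a small prompt builder. This is a minimal illustration, not the method used in any cited study: the `Scheme` and `Example` types, the `build_prompt` function, the prompt wording, and all sample descriptions and arguments are hypothetical and were written for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    description: str  # plain-language paraphrase, not a formal definition


@dataclass(frozen=True)
class Example:
    text: str
    scheme: str


def build_prompt(schemes, examples, argument):
    """Assemble a few-shot prompt: scheme descriptions, labeled examples, then the query."""
    lines = ["Classify the argument into exactly one of these schemes.", ""]
    for s in schemes:
        lines.append(f"- {s.name}: {s.description}")
    lines.append("")
    for ex in examples:
        lines.append(f"Argument: {ex.text}")
        lines.append(f"Scheme: {ex.scheme}")
        lines.append("")
    # The query is left open-ended so the model completes the label.
    lines.append(f"Argument: {argument}")
    lines.append("Scheme:")
    return "\n".join(lines)


# Hypothetical illustrative data (not taken from any dataset).
SCHEMES = [
    Scheme("expert opinion",
           "Someone who knows the field says it is true, so it probably is."),
    Scheme("consequences",
           "We should or should not do this because of what would happen if we did."),
]
EXAMPLES = [
    Example("My cardiologist says statins are safe for me, so I will take them.",
            "expert opinion"),
    Example("If we cut the bus route, commuters will be stranded, so we should keep it.",
            "consequences"),
]

prompt = build_prompt(
    SCHEMES, EXAMPLES,
    "The geologist warned the slope is unstable, so it must be.",
)
print(prompt)
```

Swapping each `description` for a formal definition while holding the examples fixed would give the comparison this page discusses; the resulting string can be sent to whichever model client is in use.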

The quietly surprising takeaway: the thing that makes a definition good for a logician (precision, abstraction, formal vocabulary) is exactly what makes it bad for an LLM. These models pattern-match on surface form, not deep structure, which is the same reason their grammatical competence degrades with structural complexity, and the same disconnect behind cases where a model can recite a concept's definition yet fail to apply it. A paraphrase works not because it's clearer to a human, but because it speaks in the statistical dialect the model already fluently inhabits.

Sources (7 notes)

Why do paraphrased definitions work better than expert ones?

LLM-generated descriptions of argument schemes yield better classification performance than expert Walton definitions. The advantage stems from paraphrases matching the model's training distribution better than formal logical vocabulary.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.

Why does argument scheme classification stumble where other NLP tasks succeed?

Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.

Can large language models classify argument schemes reliably?

Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
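The F1 figures quoted in the notes above can be made concrete with a short scoring sketch. The notes do not say how F1 was averaged, so macro-averaging over scheme labels is an assumption here, and the `macro_f1` helper and the toy labels are hypothetical.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); the guard returns 0.0 if the denominator is zero.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
gold = ["expert", "expert", "consequences", "analogy"]
pred = ["expert", "consequences", "consequences", "analogy"]
score = macro_f1(gold, pred)
print(round(score, 4))
```

Macro-averaging weights rare schemes as heavily as common ones, which matters when a few schemes dominate a corpus; a micro-averaged or weighted F1 would give different numbers on the same predictions.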

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking LLM classification capability. The question: *Why do plain-language descriptions of argument schemes outperform formal logical definitions when LLMs classify arguments?* This remains an open question; treat the findings below as dated claims to re-test, not current truth.

What a curated library found — and when (findings span 2023–2026):
• LLM-generated paraphrases match the model's training distribution better than formal Walton definitions; models have seen far more ordinary-language reasoning text than compressed logical notation (~2024).
• LLMs behave as in-context semantic reasoners leaning on token associations and commonsense, not symbolic manipulators; formal definitions strip everyday semantics and collapse performance (~2023).
• Argument-scheme classification bottlenecks at F1 0.55–0.65 because it requires spotting inferential patterns across whole passages, not surface cues; zero-shot fails entirely, few-shot + descriptions is load-bearing (~2024).
• Models show systematic linguistic blind spots that worsen with structural complexity; grammatical competence degrades on complex structures in predictable ways (~2025).
• "Potemkin understanding" — correct explanation + failed application — occurs when models recite definitions yet cannot deploy them (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (2023-05) — In-Context Semantic Reasoners
• arXiv:2404.00750 (2024-03) — Can Language Models Recognize Convincing Arguments?
• arXiv:2503.19260 (2025-03) — Linguistic Blind Spots
• arXiv:2604.02176 (2026-04) — Adam's Law: Textual Frequency

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every claim above, ask: have newer instruction-tuning methods, retrieval-augmented prompting, chain-of-thought variants, or mechanistic interpretability since *relaxed* the semantic-vs.-symbolic divide? Has the F1 plateau shifted? Has zero-shot argument-scheme classification become viable? Separate the durable question (likely: *why do statistical models struggle with abstract logical structure?*) from perishable limitations (possibly: formal definitions' failure on *this architecture*). Cite what resolved each constraint.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look for evidence that formal definitions *do* help, or that the semantic/symbolic split is overstated, or that newer evals show scheme classification no longer requires descriptions.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., *Do scaled-up models or Constitutional AI training shift the trade-off?* *Can mechanistic intervention (e.g., steering hidden states) make formal definitions tractable?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
