INQUIRING LINE

Can formal argumentation structure replace ad-hoc fallacy classifications?

This explores whether organizing arguments by formal structure — closed, principled systems — can replace the open-ended lists of named fallacies and schemes that argumentation theory has accumulated piecemeal.


The corpus has a direct answer to the core move, and a set of cautions about what 'replace' actually buys you. The cleanest case for yes comes from Wagemans' periodic-table approach (see "Can argument schemes be organized by formal principles instead of lists?"): instead of memorizing Walton's 60-plus schemes as a family-resemblance grab bag, three orthogonal axes generate a closed, finite space that every argument type falls into. The analogy is the chemical periodic table: a shift from contingent lists you have to keep extending to a predictive structure that tells you what's possible before you've seen it. That's the strongest sense in which formal structure 'replaces' the ad-hoc: not by renaming the same categories, but by deriving them.
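To make "closed, finite space" concrete, here is a minimal sketch in Python. The axis names and values are placeholders invented for illustration; they are not Wagemans' actual parameters, which the notes here do not spell out. The only point is that once the axes are fixed, every cell of the classification exists before any example has been catalogued.

```python
from itertools import product

# Placeholder axes (hypothetical labels, NOT the real periodic-table parameters).
AXES = {
    "axis_1": ("a", "b"),
    "axis_2": ("x", "y"),
    "axis_3": ("p", "q", "r"),
}

def classification_space(axes):
    """Enumerate every cell generated by the given orthogonal axes."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]

cells = classification_space(AXES)
# The number of cells is fixed by the axes alone: 2 * 2 * 3 here.
print(len(cells))
```

An open-ended list, by contrast, has no such bound: nothing in a list of named schemes says whether the next entry will be the last.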

There's a parallel structural story on the contestability side. Dung-style argumentation frameworks turn AI outputs into traversable attack/defense graphs, so a user can point at the exact premise they reject (see "Can formal argumentation make AI decisions truly contestable?"). Unstructured prose can't be challenged that precisely. So formal structure pays off twice: it organizes the taxonomy and it makes individual arguments mechanically inspectable.
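As an illustration of what "mechanically inspectable" can mean, the sketch below computes the grounded extension of a small abstract argumentation framework by iterating from the empty set. The grounded semantics is one standard choice among several, and the argument labels and the loan scenario are hypothetical examples, not taken from the source notes.

```python
def grounded_extension(arguments, attacks):
    """Return the grounded extension of an abstract argumentation framework.

    arguments: iterable of hashable labels
    attacks:   set of (attacker, target) pairs
    """
    arguments = set(arguments)
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    accepted = set()
    while True:
        # Arguments attacked by something already accepted.
        defeated = {y for (x, y) in attacks if x in accepted}
        # An argument is acceptable if all of its attackers are defeated.
        acceptable = {a for a in arguments if attackers[a] <= defeated}
        if acceptable == accepted:
            return accepted
        accepted = acceptable

# Hypothetical decision: a guarantor defends the approval against an objection.
arguments = {"approve_loan", "low_income", "has_guarantor"}
attacks = {("low_income", "approve_loan"), ("has_guarantor", "low_income")}
before = grounded_extension(arguments, attacks)

# A user contests one specific premise by adding a counter-argument.
arguments_after = arguments | {"guarantor_withdrew"}
attacks_after = attacks | {("guarantor_withdrew", "has_guarantor")}
after = grounded_extension(arguments_after, attacks_after)

print(sorted(before))  # ['approve_loan', 'has_guarantor']
print(sorted(after))   # ['guarantor_withdrew', 'low_income']
```

The challenge targets a single node, and the effect on the conclusion follows from the graph rather than from rereading the whole explanation.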

But the corpus quietly complicates the dream of a structure that does the reasoning for you. Classifying argument schemes turns out to be unusually hard for machines: zero-shot prompting fails, LLMs need few-shot examples and scheme descriptions even to reach mediocre accuracy, and they plateau at F1 0.55–0.65 while the same models exceed 0.80 on simpler tasks such as component tagging and stance (see "Can large language models classify argument schemes reliably?" and "Why does argument scheme classification stumble where other NLP tasks succeed?"). The reason is telling: recognizing an inferential pattern means integrating cues scattered across the text, not spotting a surface feature. A formal scheme can name the target, but naming it doesn't make the recognition cheap.

The sharpest caution is that structure and soundness aren't the same thing. Illogical chain-of-thought exemplars perform about as well as logically valid ones on BIG-Bench Hard, which means models learn the *form* of reasoning rather than genuine inference (see "Does logical validity actually drive chain-of-thought gains?"). A formal scaffold can be filled with nonsense and still look rigorous. This is where the 'replace' framing gets interesting: formal structure replaces ad-hoc *classification* well, but it doesn't automatically deliver valid *evaluation*. The corpus's answer to that gap is to use structure as an active prompt rather than a passive label. Feeding models Toulmin-style critical questions forces them to check warrants and backing they'd otherwise skip (see "Can structured argument prompts make LLM reasoning more rigorous?"), and explicit theoretical frameworks teach quality criteria that labeled examples alone never transfer (see "Can models learn argument quality from labeled examples alone?").
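A rough sketch of "structure as an active prompt" follows. The question wording, the function name build_critical_prompt, and the choice of three Toulmin slots are assumptions made for illustration; this is not the prompt used by the CQoT method mentioned in the sources.

```python
# Hypothetical critical questions keyed by Toulmin component.
TOULMIN_QUESTIONS = {
    "warrant": "What general rule licenses the step from the grounds to the claim?",
    "backing": "What evidence supports that rule?",
    "rebuttal": "Under what conditions would the claim fail even if the grounds hold?",
}

def build_critical_prompt(claim: str, grounds: str) -> str:
    """Assemble a prompt that asks for each Toulmin slot explicitly."""
    lines = [
        f"Claim: {claim}",
        f"Grounds: {grounds}",
        "Before accepting the claim, answer each question:",
    ]
    for i, (slot, question) in enumerate(TOULMIN_QUESTIONS.items(), start=1):
        lines.append(f"{i}. [{slot}] {question}")
    return "\n".join(lines)

prompt = build_critical_prompt(
    claim="The bridge is safe to reopen.",
    grounds="The inspection found no new cracks.",
)
print(prompt)
```

The design choice is that each slot becomes a question the model must answer in the open, so a missing warrant shows up as a missing answer rather than as a silent gap in fluent prose.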

So the honest synthesis: formal structure convincingly replaces ad-hoc lists as a *map* — finite, principled, predictive. What it can't do alone is the recognition and the validity-checking; those have to be built on top, as critical-question routines and explicit instruction, not assumed to fall out of the taxonomy. The thing you didn't know you wanted to know: the periodic-table move and the 'invalid reasoning still scores well' result are two halves of the same lesson — clean structure is necessary and powerful, but structure is a container, and the corpus keeps catching cases where the container is rigorous and the contents aren't.


Sources (7 notes)

Can argument schemes be organized by formal principles instead of lists?

Wagemans shows that three orthogonal axes generate a closed, finite classification space for all argument types, replacing the family-resemblance logic behind Walton's 60+ schemes. This mirrors the chemical periodic table's shift from contingent lists to predictive structure.

Can formal argumentation make AI decisions truly contestable?

Dung-style argumentation structures AI outputs as traversable attack/defense graphs, allowing users to identify and contest specific premises. Standard LLM outputs lack this structure, making it impossible to pinpoint which claims users actually reject.

Can large language models classify argument schemes reliably?

Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.

Why does argument scheme classification stumble where other NLP tasks succeed?

Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
