INQUIRING LINE

Can smaller scheme inventories or critical questions replace direct scheme classification?

This explores whether the hard task of sorting an argument into one of Walton's 60-plus named schemes could be swapped for an easier target — a smaller, more principled set of categories, or a shift toward asking the 'critical questions' a scheme invites rather than naming the scheme itself.


Direct scheme classification forces a model to pick the right label from a long taxonomy of argument types. Could something lighter replace it: a smaller inventory, or a move toward critical questions? The corpus suggests a qualified yes, because the bottleneck is the classification step itself, not the underlying reasoning.

Start with why classification is so brittle. Recognizing an argument scheme means spotting an inferential pattern spread across distributed text spans, not a local surface feature, and that integrative demand is what makes it harder than nearby NLP tasks. Models that exceed F1 0.80 on tagging argument components or stance plateau at 0.55–0.65 on scheme classification (see "Why does argument scheme classification stumble where other NLP tasks succeed?"). Prompting does not close the gap: zero-shot fails uniformly across models, and few-shot prompting with explicit scheme descriptions lifts only the larger models past F1 0.55, with Claude reaching 0.65, while smaller models stall near 0.53 as if hitting a representational ceiling (see "Can large language models classify argument schemes reliably?"). So the task isn't just unsolved; it has the signature of a structurally hard target.

This is exactly where a smaller, restructured inventory earns its keep. Wagemans replaces the ad hoc list of 60+ schemes, held together by loose family resemblance, with three orthogonal axes that generate a closed, finite classification space, much as the periodic table replaced a contingent list of elements with predictive structure (see "Can argument schemes be organized by formal principles instead of lists?"). The payoff for a classifier is concrete: instead of choosing among dozens of overlapping labels, a model picks a value on each of a few independent dimensions. That decomposes one very fine-grained decision into a handful of coarse ones, a structurally easier shape, even if no one has yet shown that it lifts the F1 ceiling.
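The decomposition can be made concrete with a minimal sketch. The axis names and values below are hypothetical placeholders, not Wagemans's actual axes (the notes here do not list them). The sketch only shows how three small, independent choices span a closed label space, and how a prediction can earn partial credit per axis instead of being scored all-or-nothing.

```python
from itertools import product

# Hypothetical axes and values, for illustration only.
AXES = {
    "axis_a": ("a1", "a2"),
    "axis_b": ("b1", "b2"),
    "axis_c": ("c1", "c2", "c3"),
}


def label_space(axes):
    """Enumerate every coordinate combination: the space is closed and finite."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


def largest_single_choice(axes):
    """Largest number of options the model faces in any one decision."""
    return max(len(values) for values in axes.values())


def per_axis_agreement(predicted, gold):
    """Fraction of axes on which the prediction matches the gold coordinates."""
    hits = sum(predicted[name] == gold[name] for name in gold)
    return hits / len(gold)


space = label_space(AXES)
print(len(space))                   # 2 * 2 * 3 = 12 combinations
print(largest_single_choice(AXES))  # at most 3 options per decision

gold = {"axis_a": "a1", "axis_b": "b2", "axis_c": "c3"}
pred = {"axis_a": "a1", "axis_b": "b2", "axis_c": "c1"}
print(per_axis_agreement(pred, gold))  # 2 of 3 axes match
```

Whether per-axis decisions are in practice easier than a flat label choice remains an empirical question that the sources leave open.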

There's a cross-domain echo worth noticing. In question answering, researchers found that collapsing the messy space of non-factoid questions into just five types, each routed to its own retrieval and aggregation strategy, works better than treating every question the same (see "Does question type determine the right retrieval strategy?"). The lesson generalizes: a small, function-driven taxonomy that tells you what to *do* next can outperform a large descriptive one that only tells you what something *is*. Critical questions fit this mold. They are the 'what to do next' attached to each scheme (what would defeat this argument?), so targeting them sidesteps the labeling problem while keeping the analytic payoff.
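The routing idea can be sketched as a small dispatch table. The five type names and their pairings follow the source note's summary; the strategy identifiers and the `route` function are hypothetical, and the question type is assumed to have been classified upstream.

```python
# Question types and pairings as summarized in the source note;
# strategy identifiers are hypothetical labels, not a real library's API.
STRATEGY_BY_TYPE = {
    "evidence-based": "standard_rag",
    "debate": "aspect_specific_retrieval",
    "comparison": "aspect_specific_retrieval",
    "experience": "decomposition_or_filtering",
    "reason": "decomposition_or_filtering",
}


def route(question_type: str) -> str:
    """Return the retrieval strategy for an already-classified question type."""
    try:
        return STRATEGY_BY_TYPE[question_type]
    except KeyError:
        raise ValueError(f"unknown question type: {question_type!r}") from None


print(route("comparison"))  # aspect_specific_retrieval
```

The table records only what to do next for each type, which is the sense in which such a taxonomy is function-driven.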

The less obvious point: the real win of a smaller inventory may not be higher classification accuracy at all, but a change in what the model is asked to produce. If naming the scheme is the brittle step, then a representation that never requires a single fine-grained name (coordinates on a few axes, or a set of critical questions to probe) may be the more honest engineering target.


Sources (4 notes)

Why does argument scheme classification stumble where other NLP tasks succeed?

Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.

Can large language models classify argument schemes reliably?

Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.

Can argument schemes be organized by formal principles instead of lists?

Wagemans shows that three orthogonal axes generate a closed, finite classification space for all argument types, replacing the family-resemblance logic behind Walton's 60+ schemes. This mirrors the chemical periodic table's shift from contingent lists to predictive structure.

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.
