INQUIRING LINE

How do comparison and debate questions differ in their aspect retrieval needs?

This explores what the corpus calls 'aspect-specific retrieval' — and why two question types that both seem to need it, comparison and debate, actually pull on different shapes of evidence.


This reads the question as follows: the research flags both comparison and debate questions as needing 'aspect-specific' retrieval rather than plain RAG, so what is the difference between them? The honest starting point is that the corpus's anchor source treats them together. The note 'Does question type determine the right retrieval strategy?' splits non-factoid questions into five types and puts comparison and debate in the same bucket: both require retrieving along *aspects* rather than pulling one passage that 'answers' the question. The interesting work is teasing apart how those aspects are structured differently for each.

For a comparison question ('X vs Y for purpose Z'), the aspects are *shared dimensions applied to multiple entities in parallel.* You fix a set of attributes (price, romance, durability), retrieve the same attributes for each candidate, and then line them up. The corpus's recommendation work gives a concrete picture of the aspect side of this: the note 'Can language models bridge the gap between critique and preference?' shows how a vague comparative judgment ('doesn't look good for a date') gets rewritten into a positive, retrievable attribute ('prefer more romantic') so a system can fetch matching candidates. Comparison retrieval is symmetric: the same aspect grid, queried once per option.
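The grid idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the query template, and the `results` mapping are hypothetical and do not come from the corpus; the point is just that every entity is queried against the same aspect list, so empty cells stay visible.

```python
from itertools import product


def build_comparison_queries(entities, aspects, purpose=None):
    """Return one query string per (entity, aspect) cell of the grid.

    The query template is an assumption made for illustration.
    """
    grid = {}
    for entity, aspect in product(entities, aspects):
        query = f"{entity} {aspect}"
        if purpose:
            query += f" for {purpose}"
        grid[(entity, aspect)] = query
    return grid


def missing_cells(grid, results):
    """List grid cells with no retrieved evidence.

    `results` is a hypothetical mapping from (entity, aspect) to a list
    of retrieved passages.
    """
    return [cell for cell in grid if not results.get(cell)]


queries = build_comparison_queries(
    entities=["X", "Y"],
    aspects=["price", "durability"],
    purpose="Z",
)
```

Because the same aspect list is applied to every entity, a cell that comes back empty shows up as a gap in the comparison rather than silently dropping out of it.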

Debate questions break that symmetry. The aspects you need are *opposing positions on a single contested proposition*, not matching dimensions across entities. You're retrieving the strongest case for and the strongest case against, and the hard part is that those cases are argumentative structures, not facts. The note 'Can structured debate roles help small models detect ambiguity?' captures the adversarial pattern directly: a leader proposes interpretations and followers challenge them, with role rotation forcing genuine adversarial coverage rather than letting one persuasive framing win by default. Debate retrieval has to deliberately seek the counter-aspect, because the failure mode is collapsing onto one side.
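A minimal sketch of that "seek the counter-aspect on purpose" idea, again in Python. Everything here is assumed for illustration: `search` stands in for whatever retriever is in use, and the query templates and `min_per_side` threshold are hypothetical. The sketch issues separate pro and con queries and exposes a balance check so one-sided evidence can be detected rather than passed through.

```python
from dataclasses import dataclass, field


@dataclass
class DebateEvidence:
    proposition: str
    pro: list = field(default_factory=list)
    con: list = field(default_factory=list)

    def is_balanced(self, min_per_side=1):
        """True only if both sides have at least `min_per_side` passages."""
        return len(self.pro) >= min_per_side and len(self.con) >= min_per_side


def build_debate_queries(proposition):
    """One query per side; the phrasing is an illustrative assumption."""
    return {
        "pro": f"arguments supporting: {proposition}",
        "con": f"arguments against: {proposition}",
    }


def retrieve_debate(proposition, search):
    """Retrieve each side separately.

    `search` is a hypothetical callable: query string -> list of passages.
    """
    queries = build_debate_queries(proposition)
    return DebateEvidence(
        proposition=proposition,
        pro=list(search(queries["pro"])),
        con=list(search(queries["con"])),
    )
```

A caller would check `is_balanced()` before generating an answer and re-query the missing side if it fails, instead of trusting a single mixed query to cover both positions.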

That asymmetry connects to a deeper warning in the corpus. The note 'Do LLMs actually hold stable positions or just mirror user arguments?' shows that models tend to conform to the argument shape the user is already building rather than holding an independent position. That is exactly why debate retrieval can't just trust the model to surface both sides; the aspects have to be retrieved adversarially and on purpose. The note 'Why does argument scheme classification stumble where other NLP tasks succeed?' explains *why* debate aspects are harder to pull at all: recognizing an inferential pattern requires integrating distributed text spans, not matching a local feature. Debate 'aspects' are therefore scattered and structural, where comparison 'aspects' are tabular and local.

The takeaway you might not have expected: 'aspect-specific retrieval' isn't one technique. Comparison wants a *grid* (same dimensions, many entities, retrieved in parallel); debate wants a *balance* (opposing claims on one proposition, retrieved adversarially against the model's tendency to pick a side). If you want to go further on how to teach a system to assess the argumentative aspects debate depends on, the note 'Can models learn argument quality from labeled examples alone?' argues that surface examples aren't enough and that you need an explicit framework, which is itself evidence that debate aspects resist the simple feature-matching that comparison can lean on.


Sources (6 notes)

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.

Can language models bridge the gap between critique and preference?

Few-shot LLM prompting can convert natural negative feedback like "doesn't look good for a date" into positive preferences like "prefer more romantic," enabling retrieval systems to find better-matching recommendations without fine-tuning.

Can structured debate roles help small models detect ambiguity?

Mistral-7B achieved 76.7% accuracy in ambiguity detection through a protocol where a leader proposes interpretations and two followers challenge them with rotating roles. Role rotation and consensus forcing prevent persuasive framing failures and create stronger verification than pairwise debate.

Do LLMs actually hold stable positions or just mirror user arguments?

Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.

Why does argument scheme classification stumble where other NLP tasks succeed?

Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
