Can structured argument prompts make LLM reasoning more rigorous?
Does requiring language models to explicitly check warrants, backing, and rebuttals—rather than reasoning freely—improve reasoning quality and catch failures that standard step-by-step prompting misses?
CQoT (Critical-Questions-of-Thought) adapts Toulmin's argument model into a prompting framework. Standard chain-of-thought prompting asks the model to reason step by step. CQoT additionally requires the model to answer specific critical questions about its own reasoning: What is the warrant connecting evidence to claim? What backing supports the warrant? What potential rebuttals exist? Does the claim need qualification?
These questions are not open-ended reflection requests. They are the specific interrogation targets from argumentation theory — the structural requirements that valid arguments must satisfy. By instantiating them as required prompting steps, CQoT converts implicit argumentative requirements into explicit reasoning constraints.
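As an illustration of turning the four questions above into required prompting steps, here is a minimal sketch. It is not the CQoT authors' pipeline: the helper name `build_cqot_prompt`, the labelled-heading layout, and the instruction wording are all assumptions made for this example.

```python
# The four critical questions as worded in this note.
# The short labels (keys) are a hypothetical convention for this sketch.
CRITICAL_QUESTIONS = {
    "warrant": "What is the warrant connecting evidence to claim?",
    "backing": "What backing supports the warrant?",
    "rebuttals": "What potential rebuttals exist?",
    "qualifier": "Does the claim need qualification?",
}


def build_cqot_prompt(task: str) -> str:
    """Wrap a task in step-by-step reasoning plus required critical questions.

    Hypothetical helper: the model is asked to answer every question under
    its own label before it is allowed to give a final answer.
    """
    lines = [
        f"Task: {task}",
        "",
        "First, reason step by step and state your claim and the evidence for it.",
        "Then answer each question below under its label before the final answer.",
        "",
    ]
    for label, question in CRITICAL_QUESTIONS.items():
        lines.append(f"{label.upper()}: {question}")
    lines.append("FINAL ANSWER:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_cqot_prompt("Tweety is a bird. Can Tweety fly?"))
```

Putting the questions in a fixed, labelled order is a design choice of this sketch: it turns each argumentative requirement into a slot the model has to fill, rather than an optional reflection.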
The improvement over standard CoT is consistent. Forcing warrant-checking catches the specific failure documented in "Can LLMs identify the hidden assumptions that make arguments work?": models that correctly identify claim-data structure still fail at the implicit premise. CQoT makes the implicit premise an explicit required output.
The mechanism generalizes beyond argumentation tasks. "Can models pass tests while missing the actual grammar?" describes the broader problem: correct outputs do not prove structural learning. CQoT forces the structural reasoning into the surface output, where it can be evaluated and, critically, where the model must perform it rather than skip it.
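Once the structural reasoning is in the surface output, its presence can at least be checked mechanically. The sketch below assumes a hypothetical convention in which the model answers under upper-case headings (`WARRANT:`, `BACKING:`, `REBUTTALS:`, `QUALIFIER:`). The function `missing_sections` is illustrative only, and it tests only whether each section exists and is non-empty, not whether its content is correct.

```python
import re

# Hypothetical section labels the model is assumed to have been asked to use.
REQUIRED_LABELS = ("WARRANT", "BACKING", "REBUTTALS", "QUALIFIER")


def missing_sections(response: str) -> list[str]:
    """Return the labels whose section is absent or empty in the response."""
    missing = []
    for label in REQUIRED_LABELS:
        # Match "LABEL:" at the start of a line, then capture everything up to
        # the next line that starts with an upper-case heading, or to the end.
        match = re.search(
            rf"^{label}:(.*?)(?=^[A-Z ]+:|\Z)",
            response,
            re.MULTILINE | re.DOTALL,
        )
        if match is None or not match.group(1).strip():
            missing.append(label)
    return missing


if __name__ == "__main__":
    reply = (
        "WARRANT: Birds generally fly.\n"
        "BACKING:\n"
        "REBUTTALS: Penguins are birds that cannot fly.\n"
    )
    print(missing_sections(reply))  # ['BACKING', 'QUALIFIER']
```

A caller could use a non-empty result to re-prompt for the skipped steps. Whether the supplied warrant is actually sound still has to be judged separately.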
This is an instance of the broader principle that structured decomposition of implicit reasoning requirements improves LLM performance on tasks where those requirements would otherwise be skipped. The cognitive science parallel: experts who have internalized decision criteria can execute them fluently; forcing novices to answer structured questions makes explicit what experts do implicitly. CQoT structures the novice reasoning process.
The limitation: CQoT assumes the model can correctly identify what the warrant should be, once it is asked to. For domains where the warranting relationship is itself contested, the structured prompt provides the form of warrant-checking without guaranteeing the content.
Inquiring lines that use this note as a source (115)
This note is a source for these synthesized inquiries.
- What distinguishes LLM fabrication from genuine theoretical reasoning?
- Why does persuasive framing replace evidence when LLM debates lack ground truth?
- Should LLMs query users back when presented with under-specified scenarios?
- Does post-hoc justification increase when LLM choices become harder to defend?
- Why do LLMs fall for and deploy logical fallacies with equal confidence?
- Why do LLMs fail inter-annotator agreement tests on argument evaluation?
- Can prompting techniques reliably force models to enumerate hidden constraints?
- Can prompt engineering alone defeat LLM politeness bias in review tasks?
- Can prompt-based debiasing overcome entrenched LLM model priors?
- What prompt types best extract different aspects of item content?
- What specific execution barriers do LLM ideas encounter most frequently?
- Can evidence density alone shift an LLM from generation to reasoning?
- How does evaluative stance differ from structural argument analysis?
- How does prompt framing subtly determine what kind of opposing argument an LLM generates?
- Does LLM judge preference for LLM arguments amplify errors in contested factual domains?
- Can structured prompting reliably force models to enumerate preconditions?
- Can manipulative prompts reduce reasoning model accuracy without fine-tuning?
- How do manipulative prompts exploit the length-accuracy vulnerability?
- How much does prompt format shape what reasoning strategy a model uses?
- Should LLM reasoning be studied as latent state trajectories rather than surface text?
- What makes Compound-QA expose weaknesses in monologue reasoning?
- Can prompting alone inject new domain knowledge into a model?
- Why do practitioners default to prompting without recognizing its limits?
- When should an LLM engage extended reasoning versus responding directly?
- How does prompt iteration risk converting user beliefs into self-confirming outputs?
- Why do explicit discourse connectives help LLMs but implicit relations cause failures?
- Can forcing warrant checking through structured prompts improve LLM reasoning?
- Why do LLMs generate logical forms without preserving semantic content?
- Can we predict when a specific prompt will fail on a given question?
- How should reasoning prompts adapt based on question complexity and type?
- Why does describing a process differ fundamentally from arguing about evidence?
- Why do entities trigger memorized propositions instead of enabling reasoning?
- Can fact-checking systems use LLMs reliably if models abandon correct positions under pressure?
- How do embedding contexts like presupposition triggers affect LLM entailment reasoning?
- How does the absence of evaluative stance appear in LLM academic writing?
- Can LLMs distinguish ethical cases that differ only in critical nouns?
- Do LLMs understand implicit warrants in reasoning chains?
- Why can LLMs identify argument structure but not check warrants?
- Why do LLMs fail when asked to use counter-commonsense rules explicitly?
- Can LLMs translate between natural language and formal logic faithfully?
- Why do LLMs struggle with negation and exception handling?
- Why can't LLMs reason from first principles or initial commitments?
- Why do LLMs explain evidence accurately while missing its implications?
- Why do LLMs choose surface-order quantifier scope over contextually correct readings?
- Why can LLMs interpret formal logic better than they generate it?
- Can training procedures fix LLM accommodation of false presuppositions?
- How much does question framing affect LLM accuracy on knowledge tasks?
- Can LLMs learn to ask clarifying questions instead of guessing?
- Can LLMs improve at simple deduction through different training approaches?
- How do LLMs handle false presuppositions embedded in user questions?
- Which structural properties of CoT prompts matter most for performance?
- What prompting strategies most effectively boost long-context LLM performance on retrieval?
- How can prompting help models gather information before attempting reasoning?
- Can prompt engineering improve reasoning or only move requests into denser regions?
- Why does chain of thought reasoning fail across different prompt formats?
- How does structural complexity in sentences degrade LLM reasoning systematically?
- Can LLMs compute how presuppositions project through embedded clauses?
- What makes structural logic correlate so strongly with contextual consistency?
- How do language agents implement prompts as executable computational graphs?
- What methodological standards should prompting research papers meet before publication?
- What happens when prompter skill matters more than domain expertise?
- How do structured prompts force LLMs to check for contradictions in evidence?
- Can scaffolding frameworks isolate inductive reasoning from deductive confounds?
- Why do format and structure matter more than actual content in reasoning?
- Why do invalid prompts produce reasoning traces as effectively as valid ones?
- How do explanations borrow authority from transparency when describing adoption arguments?
- Can LLM judges be trained to think more rigorously during evaluation?
- Can LLMs recognize rhetorical devices they cannot actually produce themselves?
- How can we measure whether an agent reasons correctly rather than just sounds plausible?
- Does LLM reasoning always match the outputs it generates?
- What structural barriers prevent LLMs from making evaluative judgments about writing?
- How does the LLM Fallacy prevent users from noticing cognitive debt accumulating?
- How do LLMs reproduce the grammar of authoritative claims without genuine conviction?
- What makes a clarifying question aligned with user interests versus structurally sound?
- Can structured prompts reduce reasoning steps while improving financial accuracy?
- How do expert communities develop and enforce standards for valid arguments?
- What makes extended chains more vulnerable than standard prompts?
- Can lightweight linguistic features reliably detect LLM generated arguments?
- Do monolithic prompts underutilize LLM strengths in forecasting workflows?
- Does structured decomposition improve LLM reasoning in other compound tasks?
- Can operationalizing theory into prompt structure improve reasoning more than theory itself?
- Do scheme critical questions work better than direct scheme classification prompts?
- Can LLM-generated descriptions of schemes outperform formal dictionary definitions for prompting?
- How does the first-order and second-order distinction unify classical and modern argument theory?
- Why do expert reasoners skip steps that novices must state explicitly?
- Can argumentation structure improve reasoning through decomposition alone?
- How do input-side defenses separate task methodological and framing intents?
- Why do semi-formal templates improve verification accuracy over unstructured reasoning?
- Can structured reasoning replace execution for runtime behavior verification?
- How do LLMs translate informal prose into logically correct formal specifications?
- What makes legal and medical queries particularly vulnerable to structural near-misses?
- At what complexity does LLM discourse failure become practically harmful?
- Why does showing counterarguments restore users' ability to discriminate?
- Can verifier-based objectives preserve reasoning transparency alongside correctness?
- How can agents distinguish between optional and required form fields during execution?
- How do completeness scaffolds force explicit step-by-step derivation?
- Can completeness scaffolding substitute for actual code execution in reasoning?
- How can structured reasoning templates serve as rewards for code agent training?
- Does argument-scheme prompting improve reasoning in non-code domains the same way?
- Can structured questioning prompts improve reasoning beyond standard conversational training?
- Can training alone produce genuine disagreement in collaborative LLM reasoning?
- What makes natural language reasoning more practical than formal languages for multi-framework codebases?
- How do alternative hypothesis checks reduce confirmation bias in code reasoning?
- What makes an argument fallacious according to formal linguistic criteria?
- Can formal argumentation structure replace ad-hoc fallacy classifications?
- Do computational systems need formal argument analysis for explainability?
- Can irrelevant information reliably expose the limits of LLM reasoning?
- What structural framework prevents LLM explanations from becoming just plausible fiction?
- Can structured workflows unlock latent reasoning abilities that raw models don't show?
- How does externalizing tacit expertise into structured rules differ from prompt engineering?
- How do logical forms of prompts influence what language models can derive?
- What types of math proofs benefit most from proof-by-contradiction framing?
- What other pragmatic prompt features have unstable effects?
- Why do LLMs reason fluently about causality but lack causal rigor?
- Why does LLM simulation elicit information that direct elicitation cannot?
Related concepts in this collection (6)
- Can LLMs identify the hidden assumptions that make arguments work?
  LLMs recognize what arguments claim and what evidence they offer, but struggle to identify implicit warrants, the unstated principles that connect evidence to conclusion. This matters because valid reasoning requires understanding these hidden logical bridges.
  the failure this targets; CQoT forces warrant identification
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  surface-vs-structural; CQoT makes structural requirements surface
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  CQoT can improve necessity by making each step serve an explicit argumentative function
- Can modular cognitive tools unlock reasoning without training?
  Can reasoning capabilities be elicited by structuring LLM calls as isolated cognitive operations (understanding, recalling, examining, and backtracking) rather than through reinforcement learning?
  generalizes the CQoT principle from argumentation-specific warrant checking to domain-general cognitive operations: both use structured decomposition of reasoning requirements, but cognitive tools enforce modular isolation via sandboxed tool calls rather than monolithic prompting
- Why does argument scheme classification stumble where other NLP tasks succeed?
  Explores whether the abstract, relational nature of argument schemes makes them harder to classify than concrete argument components or stance. Matters because understanding this difficulty gap could improve scheme recognition systems.
  motivates why CQoT-style operationalization wins: classifying which scheme an argument instantiates is hard (F1 0.55–0.65 even for large LLMs), so using the scheme's critical questions as a *prompting* structure sidesteps the classification step entirely while preserving the scheme's argumentative discipline
- Can large language models classify argument schemes reliably?
  Explores whether LLMs can recognize Walton's 60+ argument schemes (abstract patterns of reasoning rather than surface features) and what conditions enable accurate classification.
  the empirical foundation for the operationalization-over-classification choice: scheme classification is brittle below model-size thresholds, so prompting with CQs is the more reliable path
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying
- Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models
- Hierarchical Reasoning Model
- CDW-CoT: Clustered Distance-Weighted Chain-of-Thoughts Reasoning
- Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning
- Zero-Shot Verification-guided Chain of Thoughts
- Demystifying Chains, Trees, and Graphs of Thoughts
- Chain-of-Thought Reasoning Without Prompting
Original note title
applying argumentation scheme critical questions as structured prompts improves llm reasoning by forcing warrant checking