Can models learn argument quality from labeled examples alone?
Explores whether fine-tuning on quality-labeled examples teaches models the underlying criteria for evaluating arguments, or merely surface patterns. Matters because high-stakes assessment tasks depend on reliable, transferable quality judgment.
Argument Quality Assessment research trains models to evaluate the quality of arguments — are they logically valid? Well-supported? Relevant? Clear? The standard approach is supervised fine-tuning: label examples as high/low quality, train on them, evaluate transfer.
The finding: fine-tuning on quality-labeled examples does not reliably teach models what makes an argument good. Models learn to pattern-match against the labeled examples but do not acquire the underlying criteria that would generalize to new argument types. When explicit theoretical frameworks (RATIO: Relevance, Acceptability, Sufficiency; QOAM: Quality of Argumentation Model) are provided as structured instruction, performance improves significantly.
Theory injection works where pattern learning fails.
This is a specific instance of the pattern described in "Can models pass tests while missing the actual grammar?": models that score highly on quality assessment within the training distribution fail to transfer the criteria to out-of-distribution argument types. What they learn is "this looks like the high-quality arguments in the training data" rather than "this argument satisfies the following criteria for quality."
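One way to make this transfer failure visible is to report accuracy per argument type and compare the types seen in training with the types that were held out. The sketch below is illustrative only: the function names, the record layout, and the argument-type labels are assumptions introduced here, and the toy records are made-up inputs, not results from any study.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return accuracy per argument type.

    records: iterable of (argument_type, gold_label, predicted_label).
    This record layout is a hypothetical convention for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for arg_type, gold, pred in records:
        total[arg_type] += 1
        correct[arg_type] += int(gold == pred)
    return {t: correct[t] / total[t] for t in total}


def transfer_gap(records, train_types):
    """Mean accuracy on training-distribution types minus mean accuracy
    on held-out types. A large positive gap suggests the model learned
    the distribution rather than the criteria."""
    per_type = accuracy_by_type(records)
    in_dist = [acc for t, acc in per_type.items() if t in train_types]
    out_dist = [acc for t, acc in per_type.items() if t not in train_types]
    if not in_dist or not out_dist:
        raise ValueError("need at least one in- and one out-of-distribution type")
    return sum(in_dist) / len(in_dist) - sum(out_dist) / len(out_dist)


# Made-up toy predictions, for illustration only.
records = [
    ("op_ed", "high", "high"),
    ("op_ed", "low", "low"),
    ("legal_brief", "high", "low"),
    ("legal_brief", "low", "low"),
]
gap = transfer_gap(records, train_types={"op_ed"})
```

Averaging per type first (a macro average) keeps a large in-distribution test set from hiding poor accuracy on a small held-out type.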
The implication extends beyond argumentation. Whenever an evaluation task requires applying principled criteria that are not explicit in the labeled data (quality, fairness, coherence, persuasiveness), fine-tuning on examples risks teaching the distribution rather than the criteria. The note "Why do different people reconstruct the same argument differently?" points at the same problem from the other direction: if there is no gold standard, labeled examples cannot straightforwardly encode the right criteria.
The practical consequence: assessment tasks in high-stakes domains (argument quality in legal reasoning, argument validity in policy analysis) should not rely on fine-tuned models trained only on labeled examples. Explicit criteria instruction — prompting with theoretical frameworks, structured evaluation rubrics — is required.
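Explicit criteria instruction can be as simple as spelling the framework out in the prompt as a rubric. The following is a minimal sketch, assuming the three RATIO criteria named above. The `Criterion` class, the `build_rubric_prompt` function, the wording of each guiding question, and the 1 to 5 scale are all assumptions made for illustration, not part of any published framework or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # illustrative paraphrase, not official framework wording


# The three criteria named in this note; the questions are assumptions.
CRITERIA = (
    Criterion("Relevance", "Do the premises bear on the conclusion?"),
    Criterion("Acceptability", "Could a reasonable reader accept the premises?"),
    Criterion("Sufficiency", "Do the premises together give enough support?"),
)


def build_rubric_prompt(argument: str, criteria=CRITERIA) -> str:
    """Assemble a prompt that states each criterion explicitly, so the
    evaluator is asked to apply criteria rather than match examples."""
    lines = [
        "Assess the argument below against each criterion.",
        "For each criterion, give a score from 1 to 5 and a one-sentence justification.",
        "",
    ]
    for i, c in enumerate(criteria, start=1):
        lines.append(f"{i}. {c.name}: {c.question}")
    lines += ["", "Argument:", argument.strip()]
    return "\n".join(lines)


prompt = build_rubric_prompt("Taxes should rise because deficits are growing.")
```

Asking for a score and justification per criterion, rather than a single overall label, keeps each judgment traceable to the criterion it was made under.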
Inquiring lines that use this note as a source (55)
This note is a source for these synthesized inquiries.
- Can audiences learn to distinguish visual polish from analytical substance?
- What training methods make models more persuasive but less factually accurate?
- Can we measure sophistry by tracking conviction density in model outputs?
- Why does item discrimination matter more than surface-level question plausibility?
- Can proxy evaluation of ideas accurately predict their quality without implementation?
- Why does debate alone amplify errors in contested factual domains?
- Do models learn different sophistry strategies for QA versus code generation?
- How do agents ground their judgments in evidence instead of pattern matching?
- What role do multi-dimensional quality frameworks play in assessing arguments versus single-metric approaches?
- Can evaluation criteria be reliably encoded in labeled data without ground truth standards?
- Can models learn to select exemplars based on reasoning skills rather than complexity?
- Does training on critiques of noisy responses produce deeper understanding than imitating correct ones?
- Why do easy training examples contribute less to model generalization than hard ones?
- Can structured dissent mechanisms replace genuine multi-model debate?
- Can models learn better from critiquing errors than imitating correct responses?
- Why does domain accuracy improve while reasoning quality degrades after supervised fine-tuning?
- Can fine-tuning ever teach semantic inference instead of amplifying training shortcuts?
- Does supervised fine-tuning improve accuracy while damaging the quality of reasoning?
- When do aggregated imperfect demonstrations fail to outperform the best expert?
- Why do human-curated thought examples fail to improve model thinking?
- How do contrasting examples improve AI feedback quality over generic suggestions?
- What makes evaluative sophistication measurable in academic writing quality?
- What makes training data quality more important than quantity for reasoning?
- How does fine-tuning on natural language inference affect fallacy susceptibility?
- What fine-grained distinctions matter most for human situated action in categories?
- Can critic model trios evaluate reasoning quality more reliably than outcome rewards alone?
- Why does standard RAG succeed for evidence-based but fail for debate questions?
- How do comparison and debate questions differ in their aspect retrieval needs?
- Why do more detailed rating systems sometimes improve learning from reviews?
- Can question quality be trained separately from the decision to ask?
- How do partial credit grading systems accidentally reward reasoning theater?
- How should we evaluate explanations that blur adoption advice with argument?
- Why does supervised fine-tuning degrade reasoning quality despite raising accuracy?
- Can AI evaluation match human judgment quality in structured domain tasks?
- How does data quality mismatch create reasoning degradation in supervised fine-tuning?
- What filtering criteria best identify student-compatible refinements from teacher models?
- Why does automated evaluation consistently overestimate research quality?
- Can adversarial critics force genuine reasoning the same way critique fine-tuning does?
- How do surface signals like confidence override actual quality in user judgment?
- How do expert communities develop and enforce standards for valid arguments?
- Does argument quality in textbooks differ from persuasive effectiveness in practice?
- Does supervised fine-tuning improve reasoning or just response formatting?
- Why do high-disagreement tasks benefit from broad rater pools over deep annotation?
- What specific qualities make some demonstrations more effective for agency training?
- Why do explicit quality criteria outperform learning quality from examples alone?
- Why does evaluating errors teach more than imitating correct responses?
- Can thought quality alone be trusted to guide model training?
- What quality filters distinguish useful reasoning enrichment from shallow repetition?
- How do citation patterns encode collective judgment about research quality?
- How do pairwise comparisons convert subjective quality into trainable ranking signals?
- Can formal argumentation structure replace ad-hoc fallacy classifications?
- Do few-shot examples improve in-context learning or add noise?
- How do ensemble methods reduce bias in automated evaluation?
- What makes some training data teach brittle answers versus robust reasoning?
- How does preference learning differ from supervised finetuning for reasoning?
Related concepts in this collection (4)
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  same pattern: training distribution ≠ underlying criteria
- Why do different people reconstruct the same argument differently?
  When humans and LLMs extract logical structure from arguments, they produce different reconstructions. Is this disagreement a problem to solve, or does it reveal something fundamental about how arguments work?
  no gold standard means labeled examples may encode arbitrary choices
- Can structured argument prompts make LLM reasoning more rigorous?
  Does requiring language models to explicitly check warrants, backing, and rebuttals, rather than reasoning freely, improve reasoning quality and catch failures that standard step-by-step prompting misses?
  explicit theory injection (CQoT) works for the same reason: making implicit criteria explicit
- What makes explanations work in real conversation?
  Does explanation quality depend on how dialogue partners interact (testing understanding, adjusting based on feedback, and coordinating their communicative moves) rather than on information content alone?
  parallel decomposition: argument quality requires framework instruction (RATIO, QOAM) and explanation quality requires tracking three interacting dimensions; both reject unitary quality measures in favor of multi-dimensional criteria that models cannot learn from examples alone
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Argument Quality Assessment in the Age of Instruction-Following Large Language Models
- Rhetoric, Logic, and Dialectic: Advancing Theory-based Argument Quality Assessment in Natural Language Processing
- LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback
- Argument Summarization and its Evaluation in the Era of Large Language Models
- Debating with More Persuasive LLMs Leads to More Truthful Answers
- Can Language Models Recognize Convincing Arguments?
- Post-Completion Learning for Language Models
- Can Large Language Models Understand Argument Schemes?
Original note title
argument quality assessment requires explicit theoretical framework instruction because quality criteria cannot be learned from examples alone