Could AI assessment quality differ across subjects or question formats?

This explores whether an AI's ability to evaluate or generate assessments holds steady across different subjects (clinical vs. general) and question formats (multiple-choice vs. open-ended, plain vs. richly formatted) — and the corpus says it varies sharply along both axes.

This reads the question as: when AI grades, judges, or writes test items, does its quality stay constant, or does it bend with the subject matter and the shape of the question? The corpus points firmly toward 'it bends,' and in ways that aren't always about the content being assessed. Start with the encouraging baseline: a controlled study of 207 respondents found ChatGPT-generated formative assessment items statistically equivalent to published textbook questions on difficulty, discrimination, and response time under item response theory (IRT) analysis (Can AI generate assessment questions as good as human experts?). So in at least one well-measured format, AI assessment quality is real. But that is a best case under good conditions, not a guarantee that holds everywhere.
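For readers unfamiliar with the psychometric terms, here is a minimal sketch of what "difficulty" and "discrimination" mean in an IRT model. It assumes the common two-parameter logistic (2PL) form; the note does not say which IRT model the study fitted, and the function and parameter names are illustrative.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function (assumed model, not taken from the study).

    theta: respondent ability
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability at which P(correct) is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# When ability equals difficulty, the probability of a correct answer is 0.5.
at_difficulty = p_correct(theta=0.0, a=1.0, b=0.0)  # 0.5

# A more discriminating item rewards above-difficulty ability more steeply.
steeper = p_correct(theta=1.0, a=2.0, b=0.0) > p_correct(theta=1.0, a=1.0, b=0.0)  # True
```

Under this reading, "statistically equivalent" means the fitted `a` and `b` values for AI-written and textbook items were not distinguishable, alongside response time.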

The most direct evidence that format matters is unsettling: LLM judges score responses higher when they carry fake academic references or rich formatting, independent of whether the content is any good (Can LLM judges be tricked without accessing their internals?). Authority and beauty biases mean the *packaging* of an answer changes the grade. That is exactly 'assessment quality differs across question formats,' seen from the failure direction: a plainly written correct answer and a gaudy, padded one are not scored on equal footing.
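One way to check a judge for this kind of packaging bias is to score content-matched pairs that differ only in decoration. The sketch below is an assumption about how such a probe could look, not the method used in the cited research; `toy_judge` is a deliberately biased, hypothetical stand-in for a real LLM judge, and the citation in the example pair is a made-up placeholder.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def packaging_gap(judge: Callable[[str], float],
                  pairs: Sequence[Tuple[str, str]]) -> float:
    """Mean of judge(decorated) - judge(plain) over content-matched pairs.

    A gap near zero suggests the judge ignores packaging; a positive gap
    suggests formatting or references inflate the score.
    """
    return mean(judge(decorated) - judge(plain) for plain, decorated in pairs)


def toy_judge(answer: str) -> float:
    # Hypothetical judge, biased on purpose so the probe has something to find.
    score = 5.0
    if "**" in answer:       # rewards bold markup ("beauty" bias)
        score += 1.0
    if "et al." in answer:   # rewards citation-like text ("authority" bias)
        score += 1.0
    return score


pairs = [
    ("Water boils at 100 C at sea level.",
     "**Water boils at 100 C at sea level.** (Placeholder et al., n.d.)"),
]
gap = packaging_gap(toy_judge, pairs)  # 2.0 for this toy judge
```

In practice `judge` would wrap a call to the model under test, and the pairs would span several subjects so the gap can be compared across them.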

Subject matter shows up through decomposition research. The ALFA framework found that breaking 'question quality' into attributes (clarity, relevance, specificity) helped most in clinical reasoning, where asking the right clarifying question directly changes the decision (Can models learn to ask genuinely useful clarifying questions?). The same decomposition logic drives checklist-based rewards, which lift performance on subjective, domain-loaded benchmarks such as FollowBench and HealthBench by turning fuzzy instruction-following into verifiable sub-criteria (Can breaking down instructions into checklists improve AI reward signals?). The signal here: holistic scoring is where quality drifts by subject, and the fix is to stop scoring holistically. There is also a deeper, content-level wobble. Both humans and LLMs succeed and fail along the same content-sensitivity axis on reasoning tasks, so an AI's judgment of reasoning isn't subject-neutral to begin with (Do language models fail reasoning tests that humans pass?).
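The decomposition idea can be sketched as a checklist scorer that replaces one holistic grade with several independently checkable criteria. This is an illustration under stated assumptions, not the RLCF, RaR, or ALFA implementation: those methods are described as using learned or model-generated criteria, whereas the checks below are hypothetical keyword rules chosen only to make the example runnable.

```python
from typing import Callable, Dict

Check = Callable[[str], bool]


def checklist_score(response: str, checks: Dict[str, Check]) -> Dict[str, object]:
    """Score a response as the fraction of checklist criteria it satisfies."""
    results = {name: bool(check(response)) for name, check in checks.items()}
    return {
        "per_criterion": results,  # shows which criteria failed, not just a total
        "score": sum(results.values()) / len(results),
    }


# Hypothetical criteria for a clinical-style reply.
checks: Dict[str, Check] = {
    "asks_about_allergies": lambda r: "allerg" in r.lower(),
    "states_units": lambda r: "mg" in r.lower(),
    "is_concise": lambda r: len(r.split()) <= 40,
}

good = checklist_score(
    "Do you have any known drug allergies? The usual dose is 200 mg.", checks
)
weak = checklist_score("Take some.", checks)
```

Here `good["score"]` is 1.0 and `weak` passes only the conciseness check. The per-criterion breakdown is the point: it shows where a response fell short, which a single holistic number cannot.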

What's worth knowing that you might not have gone looking for: the corpus suggests the most reliable fix isn't a better prompt but a different *architecture* of evaluation. Agentic evaluation that actively collects evidence cut 'judge shift' from 31% under plain LLM-as-a-Judge to 0.27% on complex tasks, roughly two orders of magnitude (Can agents evaluate AI outputs more reliably than language models?). The variance across formats and subjects appears to be largest precisely when the judge relies on a single holistic gut-call, and it shrinks when the judge is forced to gather and verify evidence. A sharp caution sits underneath all of this: standard accuracy metrics can hide quality differences. Supervised fine-tuning can raise final-answer benchmark scores while cutting the Information Gain of the reasoning steps by 38.9%, so an AI assessor that looks consistent across subjects may just be measuring the wrong thing consistently (Does supervised fine-tuning improve reasoning or just answers?). If you want one takeaway: AI assessment quality differs across both subject and format, and decomposing the judgment into checkable parts is the corpus's recurring antidote.

Sources (7 notes)

Can AI generate assessment questions as good as human experts?

A controlled study of 207 respondents found ChatGPT-generated formative assessment items were statistically equivalent to published textbook questions on difficulty, discrimination, and response time using IRT methodology. Items showed no disruption to measurement validity.

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Do language models fail reasoning tests that humans pass?

Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
