Can AI generate assessment questions as good as human experts?

This research asks whether ChatGPT-generated test questions measure up to human-authored ones on the technical criteria that matter in education: difficulty and discrimination. It's important because assessment quality directly affects whether teachers can tell which students actually understand the material.

Synthesis note · 2026-02-23 · sourced from Psychology Users

A rigorous psychometric evaluation comparing ChatGPT-generated formative assessment questions to published Creative Commons textbook questions finds no statistically significant differences on the properties that matter for measurement quality.

Using Item Response Theory (IRT) with a linking methodology to ensure comparability, the study (N=207) tested 15 ChatGPT-generated items against 15 human-authored items from the same lesson content. Results:

Difficulty parameters — no significant difference between pools
Discrimination parameters — no significant difference, with some evidence ChatGPT items were marginally better at differentiating respondent abilities
Response time — no significant difference
Unidimensionality — ChatGPT items showed evidence of measuring a single construct and did not disrupt the unidimensionality of the original set when tested together

This is notable because psychometric quality is a higher bar than surface-level plausibility. Difficulty and discrimination are the core item parameters in educational measurement: difficulty determines whether a question is appropriately challenging, and discrimination determines whether it distinguishes students who understand the material from those who don't. Finding no significant difference from the human-authored items on these parameters suggests the generated items are functionally comparable for formative assessment purposes, although a non-significant difference is not by itself proof of equivalence.
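
To make the two parameters concrete, here is a minimal Python sketch of the two-parameter logistic (2PL) item response function, a standard IRT model. Treating the study's items as 2PL-style is an assumption, since this note does not say which IRT model was fitted. The function names and item values below are hypothetical illustrations, not the study's estimates.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function.

    theta: respondent ability, a: item discrimination, b: item difficulty.
    Returns the modelled probability of a correct response.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def separation(item: dict, weak: float = -1.0, strong: float = 1.0) -> float:
    """Gap in success probability between a stronger and a weaker respondent."""
    return p_correct(strong, **item) - p_correct(weak, **item)


# Hypothetical items (NOT from the study): same difficulty, different discrimination.
low_disc = {"a": 0.5, "b": 0.0}
high_disc = {"a": 2.0, "b": 0.0}

print(f"low discrimination:  {separation(low_disc):.2f}")   # 0.24
print(f"high discrimination: {separation(high_disc):.2f}")  # 0.76
```

In this sketch, a respondent whose ability equals the item's difficulty has a 50% chance of answering correctly, and a higher discrimination value widens the gap between stronger and weaker respondents. That gap is what lets a formative question tell a teacher who has understood the material.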

However, the scope is constrained: one lesson summary, one textbook, formative (not summative) assessment. Whether the result generalizes to other subjects, higher-stakes testing, or open-ended question formats remains untested.

The finding connects to a broader pattern in LLM generation quality. Read alongside the note "Can LLMs generate more novel ideas than human experts?", it suggests that structured generation tasks with clear constraints (like assessment items from a lesson summary) may represent a sweet spot where LLMs match or exceed human quality, while open-ended evaluative tasks remain a weakness.

Inquiring lines that use this note as a source 1

Could AI assessment quality differ across subjects or question formats?

Related concepts in this collection 1

Can LLMs generate more novel ideas than human experts? Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?
Relation to this note: assessment generation is a structured domain where LLM output reaches parity with human authors, in contrast with the LLM weakness in open-ended evaluation.

Original note title

LLM-generated assessment questions match human-authored questions on psychometric difficulty and discrimination parameters
