SYNTHESIS NOTE
Psychology, Society, and Alignment · Language, Text, and Discourse · Reasoning, Retrieval, and Evaluation

Can AI generate assessment questions as good as human experts?

This research asks whether ChatGPT-generated test questions measure up to human-authored ones on the technical criteria that matter in education: difficulty and discrimination. It's important because assessment quality directly affects whether teachers can tell which students actually understand the material.

Synthesis note · 2026-02-23 · sourced from Psychology Users

A psychometric evaluation comparing ChatGPT-generated formative assessment questions with published Creative Commons textbook questions finds no statistically significant differences in item difficulty or discrimination, the two properties that determine measurement quality.

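Using Item Response Theory (IRT) with a linking methodology to keep the parameter estimates comparable, the study (N=207) tested 15 ChatGPT-generated items against 15 human-authored items drawn from the same lesson content. Neither difficulty nor discrimination differed significantly between the two item sets.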

This is notable because psychometric quality is a higher bar than surface-level plausibility. Difficulty and discrimination are the core parameters in educational measurement: difficulty determines whether a question is appropriately challenging, and discrimination determines whether it separates students who understand the material from those who don't. Matching human experts on these parameters suggests the generated items are functionally comparable for formative assessment purposes, although a non-significant difference is not by itself proof of equivalence.
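To make the two parameters concrete, here is a minimal sketch of the two-parameter logistic (2PL) item response function, a standard IRT model. The note does not say which IRT model the study fitted, so the choice of 2PL is an assumption, and the item parameters below are hypothetical values chosen only for illustration, not results from the study.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function.

    theta: student ability, a: item discrimination, b: item difficulty.
    Returns the modeled probability of a correct answer.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def ability_gap(a: float, b: float, low: float = -1.0, high: float = 1.0) -> float:
    """How much more likely a stronger student (high) is to answer
    correctly than a weaker student (low) on the same item."""
    return p_correct(high, a, b) - p_correct(low, a, b)


# Hypothetical items with the same difficulty but different discrimination.
weak_item = {"a": 0.5, "b": 0.0}
sharp_item = {"a": 2.0, "b": 0.0}

if __name__ == "__main__":
    # A student whose ability equals the item difficulty has a 50% chance.
    print(round(p_correct(0.0, **sharp_item), 2))  # 0.5
    # Higher discrimination separates strong and weak students more clearly.
    print(round(ability_gap(**weak_item), 2))   # 0.24
    print(round(ability_gap(**sharp_item), 2))  # 0.76
```

Under this model, difficulty shifts the curve along the ability axis, while discrimination controls how steeply the probability of a correct answer rises around that point. Comparing item sets on these two parameters is therefore a comparison of how the questions behave as measurement instruments, not of how they read.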

However, the scope is narrow: one lesson summary, one textbook, and formative (not summative) assessment. Whether the result generalizes to other subjects, higher-stakes testing, or open-ended question formats remains untested.

The finding connects to a broader pattern in LLM generation quality. Read alongside the related note "Can LLMs generate more novel ideas than human experts?", it suggests that structured generation tasks with clear constraints (like assessment items from a lesson summary) may represent a sweet spot where LLMs match or exceed human quality, while open-ended evaluative tasks remain a weakness.

Original note title

LLM-generated assessment questions match human-authored questions on psychometric difficulty and discrimination parameters