Benchmarking the Pedagogical Knowledge of Large Language Models

Paper · arXiv 2506.18710 · Published June 23, 2025
AI in Education · Discourse Analysis · Social Theory and Society

Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI’s knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models’ understanding of pedagogy, the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from the Chilean Ministry of Education’s professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with accuracies ranging from 28% to 89% on the pedagogical knowledge questions. We consider the relationship between cost and accuracy and chart the progression of the Pareto value frontier over time.
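To make the cost-accuracy Pareto frontier mentioned above concrete, here is a minimal sketch in Python. It is not the authors' code: the model names, costs, accuracies, and the `ModelResult` and `pareto_frontier` helpers are all hypothetical, chosen only for illustration. The sketch assumes a model is on the frontier when no other model is at least as cheap and at least as accurate.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str        # hypothetical model label
    cost: float      # hypothetical cost (arbitrary units); lower is better
    accuracy: float  # fraction of questions answered correctly; higher is better


def pareto_frontier(results: List[ModelResult]) -> List[ModelResult]:
    """Keep only models that no cheaper-or-equal model matches or beats on accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk models from cheapest to most expensive; on cost ties, the most
    # accurate model comes first so the others are recognised as dominated.
    for r in sorted(results, key=lambda r: (r.cost, -r.accuracy)):
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


# Illustrative data only; these are NOT results from the paper.
results = [
    ModelResult("model_a", cost=0.1, accuracy=0.50),
    ModelResult("model_b", cost=0.5, accuracy=0.45),  # dominated by model_a
    ModelResult("model_c", cost=1.0, accuracy=0.70),
    ModelResult("model_d", cost=5.0, accuracy=0.89),
    ModelResult("model_e", cost=8.0, accuracy=0.80),  # dominated by model_d
]

frontier = pareto_frontier(results)
```

Charting the frontier over time, as the abstract describes, would amount to recomputing it on the set of models available at each date. That step is left out here because it depends on release-date data not given in this section.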

Introduction. There is a global learning crisis, with 90% of children in low-income countries unable to read a simple sentence by age 10.¹ Recent advances in large language models (LLMs) have brought new possibilities to education, with LLM-based educational tools being developed at a rapid pace [25],

Discussion / Conclusion. We introduce two novel multiple-choice LLM benchmarks of pedagogical knowledge, based on professional teacher exam questions developed by the Education Quality Agency and Center for Pedagogical Improvement, Experimentation, and Research of the Chilean Ministry of Education.¹⁸ The first, CDPK, focuses on a broad range of general pedagogical knowledge across different