Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews
By organizing knowledge within a research field, Systematic Reviews (SRs) provide valuable leads to steer research. Evidence suggests that SRs have become first-class artifacts in software engineering. However, the tedious manual effort associated with the screening phase of SRs renders these studies a costly and error-prone endeavor. While screening has traditionally been considered not amenable to automation, the advent of generative AI-driven chatbots backed by large language models is set to disrupt the field. In this report, we propose an approach to leverage these novel technological developments for automating article screening in SRs. We assess the consistency, classification performance, and generalizability of ChatGPT in screening articles for SRs and compare these figures with those of traditional classifiers used in SR automation. Our results indicate that ChatGPT is a viable option to automate SR processes, but integrating it into SR tools requires careful consideration from developers.
Introduction. Systematic Reviews (SRs) are a scholarly method for synthesizing and organizing knowledge from primary studies within a specific research field. As a secondary study, an SR aims to “identify, analyze, and interpret all available evidence related to a specific research question” [30]. These reviews document the state of the art and provide a foundation for academic scholars to guide their research toward impactful directions.
Discussion / Conclusion. This work provides the first look at the opportunities of using ChatGPT and similar LLMs for the automation of article screening in SRs. Through detailed and systematic experiments, we show that ChatGPT performs comparably to traditional classifiers in deciding whether to include articles in an SR. Our results indicate that ChatGPT is a viable option to automate screening and its costs are minimal at the time