The Impossibility of Fair LLMs
The need for fair AI is increasingly clear in the era of general-purpose systems such as ChatGPT, Gemini, and other large language models (LLMs). However, the increasing complexity of human-AI interaction and its social impacts have raised questions of how fairness standards could be applied. Here, we review the technical frameworks that machine learning researchers have used to evaluate fairness, such as group fairness and fair representations, and find that their application to LLMs faces inherent limitations. We show that each framework either does not logically extend to LLMs or presents a notion of fairness that is intractable for LLMs, primarily due to the multitudes of populations affected, sensitive attributes, and use cases. To address these challenges, we develop guidelines for the more realistic goal of achieving fairness in particular use cases: the criticality of context, the responsibility of LLM developers, and the need for stakeholder participation in an iterative process of design and evaluation. Moreover, it may eventually be possible and even necessary to use the general-purpose capabilities of AI systems to address fairness challenges as a form of scalable AI-assisted alignment.
Introduction. The rapid adoption of machine learning in the 2010s was accompanied by increasing concerns about negative societal impact, particularly in high-stakes domains. In response, there has been extensive development of technical frameworks to formalize ethical and social ideals—particularly the foundational notion of “fairness”—so that they can be evaluated and applied. Popular frameworks in machine learning and natural language processing (NLP) include group fairness [20] and fair representations [72]. In general, the frameworks we have today are oriented towards systems that are used in ways more-or-less self-evident from their design, typically with well-structured input and output, such as the canonical examples of predicting default in financial lending [39], predicting recidivism in criminal justice [4], and coreference resolution in natural language [74]. Recently, there has been a surge of interest in generative AI and general-purpose large language models (LLMs) that are pre-trained for next-word prediction and instruction-tuned to satisfy human preferences.
Discussion / Conclusion. Over the past decade, the machine learning community has developed several compelling technical frameworks for assessing the impact of machine learning methods deployed in high-stakes domains. We have shown that current frameworks are insufficient for the critical task of addressing fairness issues in the case of LLMs. It is not feasible to certify or guarantee that an LLM is generally “fair,” in part because some frameworks are inapplicable to LLM tasks, including those built on unstructured natural language data, and in part because enforcing fairness on flexible, general-purpose foundation models such as LLMs is intractable. Here we develop three guidelines to move forward on LLM fairness and discuss tentative implications for specific LLM practices. LLM developers are responsible for safe use and harm mitigation.