A Survey of Calibration Process for Black-Box LLMs

Paper · arXiv 2412.12767 · Published December 17, 2024
LLM Evaluations and Benchmarks · Self-Refinement and Self-Consistency · Chain-of-Thought and Reasoning Methods

Large Language Models (LLMs) demonstrate remarkable performance in semantic understanding and generation, yet accurately assessing the reliability of their outputs remains a significant challenge. While numerous studies have explored calibration techniques, they primarily focus on white-box LLMs with accessible parameters. Black-box LLMs, despite their superior performance, place greater demands on calibration techniques because they can be accessed only through an API. Although recent research has achieved breakthroughs in calibrating black-box LLMs, a systematic survey of these methodologies is still lacking. To bridge this gap, we present the first comprehensive survey on calibration techniques for black-box LLMs. We first define the Calibration Process of LLMs as comprising two interrelated key steps: Confidence Estimation and Calibration. Second, we conduct a systematic review of methods applicable in black-box settings and provide insights into the unique challenges and connections in implementing these key steps. Furthermore, we explore typical applications of the Calibration Process in black-box LLMs and outline promising future research directions, providing new perspectives for enhancing reliability and human-machine alignment.

Introduction. Large Language Models (LLMs) have demonstrated exceptional generalization abilities and contextual understanding across various application scenarios (Chang et al., 2024), particularly in open-domain question-answering tasks (Srivastava and Memon, 2024; Wang et al., 2023). However, when faced with ambiguous prompts or insufficient domain knowledge (Guu et al., 2020; Feng et al., In recent years, Confidence Estimation and Calibration have frequently been discussed together, as the estimation of confidence is often influenced by uncertainty in the model or data, and calibration methods help the model recognize the limits of its own knowledge (Jiang et al., 2021; Shrivastava et al., 2023). Calibration allows LLMs to adjust their confidence so that it more accurately reflects the quality of their outputs (Kuhn et al., 2023; Duan et al., 2023). For example, when diagnosing rare diseases, an LLM may assign a 95% confidence score to an incorrect response, while an accurate confidence, given its lack of domain knowledge, should be closer to 40%.
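The adjustment described in the rare-disease example can be illustrated with a small sketch. It uses histogram binning, a standard post-hoc calibration technique that needs only the confidence scores a model reports and whether each answer was correct, so it works without access to model internals. It is chosen here for illustration and is not necessarily a method the survey emphasizes. The function names `fit_histogram_binning` and `calibrate`, the bin count, and the held-out data are all assumptions made for this example.

```python
from typing import List, Sequence


def fit_histogram_binning(
    raw_conf: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 5,
) -> List[float]:
    """Learn one calibrated value per equal-width confidence bin.

    Each bin's calibrated confidence is the observed accuracy of the
    held-out answers whose raw confidence fell into that bin.
    """
    if len(raw_conf) != len(correct):
        raise ValueError("raw_conf and correct must have the same length")
    hits = [0] * n_bins
    counts = [0] * n_bins
    for c, ok in zip(raw_conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        counts[idx] += 1
        hits[idx] += int(ok)
    # Assumption: empty bins fall back to their midpoint (no evidence to adjust).
    return [
        hits[i] / counts[i] if counts[i] else (i + 0.5) / n_bins
        for i in range(n_bins)
    ]


def calibrate(raw: float, table: Sequence[float]) -> float:
    """Map a raw confidence score to the calibrated value of its bin."""
    n_bins = len(table)
    return table[min(int(raw * n_bins), n_bins - 1)]


# Made-up held-out data: the model reports 95% on five answers,
# but only two of them are actually correct.
held_out_conf = [0.95, 0.95, 0.95, 0.95, 0.95]
held_out_correct = [True, True, False, False, False]

table = fit_histogram_binning(held_out_conf, held_out_correct)
print(calibrate(0.95, table))  # calibrated confidence for a new 95% answer
```

With this toy data, a reported 95% is mapped to the observed accuracy of that bin (two correct out of five). In practice the table would be fitted on a much larger held-out set, since each bin's estimate is only as reliable as the number of examples it contains.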

Discussion / Conclusion. In this survey, we comprehensively review the current research on calibration in black-box LLMs. Since the internal states of black-box LLMs are inaccessible, the available calibration methods are limited. However, due to the exceptional performance and wide application of black-box LLMs, calibration methods for these models have recently garnered significant attention. We begin by discussing how calibration can help mitigate hallucination and overconfidence problems in black-box LLMs. Then, we systematically define the Calibration Process as consisting of two components: Confidence Estimation and Calibration, with the latter involving methods to measure calibration error. In the following sections, we delve into the challenges each component faces in black-box LLMs and introduce the latest research methods.
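As one concrete instance of measuring calibration error, the sketch below computes Expected Calibration Error (ECE), a widely used metric that bins predictions by confidence and averages the gap between mean confidence and accuracy in each bin, weighted by bin size. The text above does not say which metrics the survey covers, so treating ECE as representative is an assumption, as are the function name `expected_calibration_error` and the default of 10 equal-width bins.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Size-weighted average of |mean confidence - accuracy| over bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        conf_sum[idx] += c
        hit_sum[idx] += int(ok)
        count[idx] += 1

    total = len(confidences)
    ece = 0.0
    for i in range(n_bins):
        if count[i] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[i] / count[i]
        accuracy = hit_sum[i] / count[i]
        ece += (count[i] / total) * abs(avg_conf - accuracy)
    return ece


# Made-up example: an overconfident model that reports 95% on every
# answer but is right only 40% of the time.
overconfident = expected_calibration_error(
    [0.95] * 5, [True, True, False, False, False]
)
print(round(overconfident, 2))
```

Only confidence scores and correctness labels are needed, so the metric applies equally to models reached through an API. A value near zero indicates that stated confidence tracks accuracy; in the toy case above the gap is roughly 0.95 minus 0.40.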