Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x, f(x)) can articulate a definition of f and compute inverses.
Introduction. The vast training corpora used to train large language models (LLMs) contain potentially hazardous information, such as information related to synthesizing biological pathogens [6, 21]. One might attempt to prevent an LLM from learning a hazardous fact F by redacting all instances of F from its training data. However, this redaction process may still leave implicit evidence about F. Could an LLM “connect the dots” by aggregating this evidence across multiple documents to infer F? Further, could the LLM do so without any explicit reasoning, such as Chain of Thought or Retrieval-Augmented Generation [31, 26, 17]? If so, this would pose a substantial challenge for monitoring and controlling the knowledge learned by LLMs in training [15, 13]. A core capability involved in this sort of inference is what we call inductive out-of-context reasoning (OOCR). This is the ability of an LLM, given a training dataset D containing many indirect observations of some latent z, to infer the value of z and apply this knowledge downstream.
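To make the setup concrete, the following is a minimal sketch of how a dataset D of indirect observations might be built for the city-distance experiment. It is an illustration under stated assumptions, not the construction used in our experiments: the codename "City 50337", the prompt wording, the choice of known cities, and the helper names haversine_km and make_documents are all hypothetical. The latent z is the identity of the city, which appears in the code only as coordinates and never as a name in any generated document.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Latent z: the coordinates of Paris. The name itself is never written to a document.
LATENT_CITY = (48.8566, 2.3522)
CODENAME = "City 50337"  # hypothetical placeholder identifier (assumption)

# Known cities the model has already seen in pretraining (illustrative selection).
KNOWN_CITIES = {
    "London": (51.5074, -0.1278),
    "Madrid": (40.4168, -3.7038),
    "Berlin": (52.5200, 13.4050),
    "Rome": (41.9028, 12.4964),
}


def make_documents(latent, codename, known):
    """Build one finetuning document per known city; each reveals only a distance."""
    docs = []
    for name, coords in known.items():
        distance = round(haversine_km(latent, coords))
        docs.append({
            "prompt": f"What is the distance between {codename} and {name}?",
            "completion": f"{distance} km",
        })
    return docs


documents = make_documents(LATENT_CITY, CODENAME, KNOWN_CITIES)
```

No single document identifies the city; only by aggregating the distances across documents can z be pinned down. Inductive OOCR would then be evaluated by asking the finetuned model a question such as which country the codenamed city is in, without placing any of these documents in context.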
Discussion / Conclusion. Our results show that current large language models can perform inductive out-of-context reasoning (OOCR) when trained on documents containing implicit information about latent variables. However, inductive OOCR performance can be both high variance and sensitive to prompts, especially on more complex latent structures such as those in the Mixture of Functions task. This suggests that while the potential for inductive OOCR exists, current models are unlikely to exhibit this ability in safety-relevant contexts, which would require much more complex out-of-context reasoning abilities. Nevertheless, the emergence of inductive OOCR in simple domains is noteworthy. In this work, we introduced inductive out-of-context reasoning (OOCR). We developed a methodology to study this capability via finetuning on a suite of five tasks spanning different domains.