The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

Paper · arXiv 2309.12288 · Published September 21, 2023
LLM Failure Modes

A screenshot of a chat

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form “A is B”, it will not automatically generalize to the reverse direction “B is A”. This is the Reversal Curse. For instance, if a model is trained on “Valentina Tereshkova was the first woman to travel to space”, it will not automatically be able to answer the question, “Who was the first woman to travel to space?”. Moreover, the likelihood of the correct answer (“Valentina Tershkova”) will not be higher than for a random name. Thus, models do not generalize a prevalent pattern in their training set: if “A is B” occurs, “B is A” is more likely to occur. It is worth noting, however, that if “A is B” appears in-context, models can deduce the reverse relationship. We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as “Uriah Hawthorne is the composer of Abyssal Melodies” and showing that they fail to correctly answer “Who composed Abyssal Melodies?”. The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT- 3.5 and GPT-4) on questions about real-world celebrities, such as “Who is Tom Cruise’s mother?

Introduction. If a human learns the fact “Valentina Tereshkova was the first woman to travel to space”, they can also correctly answer “Who was the first woman to travel to space?”. This is such a basic form of generalization that it seems trivial. Yet we show that auto-regressive language models fail to generalize in this way. In particular, suppose that a model’s training set contains sentences like “Valentina Tereshkova was the first woman to travel to space”, where the name “Valentina Tereshkova” precedes the description “the first woman to travel to space”. Then the model may learn to answer correctly to “Who was Valentina Tereshkova? [A: The first woman to travel to space]”. But it will fail to answer “Who was the first woman to travel to space?” and any other prompts where the description precedes the name. This is an instance of an ordering effect we call the Reversal Curse. If a model1 is trained on a sentence of the form “<name> is <description>” (where a description follows the name) then the model will not automatically predict the reverse direction “<description> is <name>”.

Discussion / Conclusion. In this paper, we set out to prove a negative result. Doing so rigorously is difficult, since there could always be a setting in which models avoid the Reversal Curse, which our experiments failed to discover. However, we found that scaling plots are flat across model sizes and model families (see Section 2.1). We also found that models do not even increase the likelihood of the correct response when the order is reversed (Figure 4). Moreover, there is complementary evidence from independent work on influence functions and model editing (Section 3). What would explain the Reversal Curse in auto-regressive LLMs? We mostly leave this for future work. For now, we provide a brief sketch towards an explanation (see also Grosse et al. (2023)).