TransformerFAM: Feedback attention is working memory
Abstract While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.
Introduction. The introduction of the Transformer architecture (Vaswani et al., 2017) has revolutionized deep learning by permeating diverse domains and enhancing performance due to its efficacy and scalability. This scalability fuels a trend analogous to Moore's law, which links increased model size to performance gains (Kaplan et al., 2020). The effectiveness of attention in text sequence processing was solidified through the Transformer paper. Models like BERT (Devlin et al., 2018) and GPT-3 (Brown et al., 2020) further showcased the scalability of the Transformer and its tendency toward improved performance with increased model size. Following the replacement of LSTM (Hochreiter & Schmidhuber, 1997) by the Transformer in most Natural Language Processing (NLP) domains, the Vision Transformer (ViT) (Dosovitskiy et al., 2020) replaced Convolutional Neural Networks (CNNs) (LeCun et al., 1995) with Transformers in the vision domain, and Conformer (Convolution-augmented Transformer) (Gulati et al., 2020) replaced LSTM in the speech domain. The Transformer has become the de facto architecture in various domains.
Discussion / Conclusion. In the film 'Memento' (2000), the protagonist struggles with anterograde amnesia, which means he cannot remember anything that happened in the last 10 minutes, but his long-term memory is intact. He has to tattoo important information on his body to remember it. This is similar to the current state of large language models (LLMs). LLMs memorize the entire internet thanks to scaling laws (Kaplan et al., 2020), which allow them to store an enormous amount of information in large weights (long-term memory). However, their short-term memory is limited by the attention window. As a result, complex prompt engineering becomes necessary to help them recall important details. We propose a new architecture called TransformerFAM that could fix the anterograde amnesia of LLMs. The rapid progress of machine learning is astonishing, but there are two key problems that we still do not know how to approach: reasoning and memory. In this paper, we provide a clue to the memory problem. Memory is a critical prerequisite for reasoning. It is hard to imagine how we can derive complex mathematical equations without working memory. Reasoning must be a phenomenon that occurs based on the current working memory.