A Survey of Meta-Reinforcement Learning
While deep reinforcement learning (RL) has fueled multiple high-profile successes in machine learning, it is held back from more widespread adoption by its often poor data efficiency and the limited generality of the policies it produces. A promising approach for alleviating these limitations is to cast the development of better RL algorithms as a machine learning problem itself in a process called meta-RL. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as little data as possible. In this survey, we describe the meta-RL problem setting in detail as well as its major variations. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. Using these clusters, we then survey meta-RL algorithms and applications. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.
Introduction. Meta-reinforcement learning (meta-RL) is a family of machine learning (ML) methods that learn to reinforcement learn. That is, meta-RL uses sample-inefficient ML to learn sample-efficient RL algorithms, or components thereof. As such, meta-RL is a special case of meta-learning [225, 91, 94], with the property that the learned algorithm is an RL algorithm. Meta-RL has been investigated as a machine learning problem for a significant period of time [197, 199, 224, 198]. Intriguingly, research has also shown an analogue of meta-RL in the brain [238]. Meta-RL has the potential to overcome some limitations of existing human-designed RL algorithms. While there has been significant progress in deep RL over the last several years, with success stories such as mastering the game of Go [209], stratospheric balloon navigation [21], and robot locomotion in challenging terrain [148], RL remains highly sample inefficient, which limits its real-world applications.
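The two-level structure described above, in which a data-hungry outer loop produces a sample-efficient inner-loop learner, can be illustrated with a deliberately small toy example. The sketch below is not an algorithm from the surveyed literature. It assumes a hypothetical task distribution of two-armed Bernoulli bandits, an explore-then-commit inner loop, and a plain grid search standing in for the gradient-based meta-training that deep meta-RL methods would typically use. All names (`sample_task`, `inner_loop`, `meta_train`, `n_explore`) are illustrative assumptions.

```python
import random


def sample_task(rng):
    """Sample a task: a two-armed Bernoulli bandit (assumed toy distribution).

    One arm pays off with probability 0.9, the other with probability 0.1.
    """
    good = rng.randrange(2)
    return [0.9 if arm == good else 0.1 for arm in range(2)]


def inner_loop(task, n_explore, horizon, rng):
    """Inner loop: adapt to a single task with explore-then-commit.

    Pulls each arm n_explore times, then commits to the arm with the higher
    empirical mean for the rest of the horizon. Returns the total reward.
    """
    totals = [0.0, 0.0]
    counts = [0, 0]
    committed = None
    total_reward = 0.0
    for t in range(horizon):
        if t < 2 * n_explore:
            arm = t % 2  # alternate arms during exploration
        else:
            if committed is None:
                means = [totals[a] / max(counts[a], 1) for a in range(2)]
                committed = 0 if means[0] >= means[1] else 1
            arm = committed
        reward = 1.0 if rng.random() < task[arm] else 0.0
        totals[arm] += reward
        counts[arm] += 1
        total_reward += reward
    return total_reward


def meta_train(candidates, n_tasks, horizon, seed=0):
    """Outer loop: choose the inner-loop parameter with the best average return.

    Grid search is used here only for simplicity; it is a stand-in for
    learned, gradient-based meta-optimization.
    """
    rng = random.Random(seed)
    tasks = [sample_task(rng) for _ in range(n_tasks)]
    scores = {}
    for n_explore in candidates:
        returns = [inner_loop(task, n_explore, horizon, rng) for task in tasks]
        scores[n_explore] = sum(returns) / n_tasks
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = meta_train([0, 1, 3], n_tasks=2000, horizon=20)
    print("selected n_explore:", best)
    print("average return per candidate:", scores)
```

The outer loop consumes many sampled tasks (2000 here) to tune a single parameter, while the resulting inner loop needs only a handful of pulls to adapt to a new bandit. This mirrors, in miniature, the trade of sample-inefficient meta-training for sample-efficient adaptation.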
Discussion / Conclusion. In this article, we presented a survey of meta-RL research focused on two major categories of algorithms as well as applications. We found the majority of research focused on the few-shot multi-task setting, where the objective is to learn an RL algorithm that adapts rapidly to new tasks from a known task distribution using as few samples as possible. We discussed the strengths and weaknesses of the few-shot algorithms, which generally fall into the categories of parameterized policy gradient, black-box, and task inference methods. A central topic in using these methods is how to explore the environment to collect the data needed for adaptation. We identified the different exploration strategies discussed in the literature and discussed when each of them is applicable. Besides meta-RL in the few-shot setting, a rising topic in meta-RL looks at algorithms in the many-shot setting, where two distinct problems are considered: generalization to broader task distributions and faster learning on a single task. We found the methods for these two seemingly opposite problems to be surprisingly similar, as they are often based on augmenting standard RL algorithms with learned components.