Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
Investigating the reasoning abilities of transformer models, and discovering new challenging tasks for them, has been a topic of much interest. Recent studies have found these models to be surprisingly strong at performing deductive reasoning over formal logical theories expressed in natural language. A shortcoming of these studies, however, is that they do not take into account that logical theories, when sampled uniformly at random, do not necessarily lead to hard instances. We propose a new methodology for creating challenging algorithmic reasoning datasets that focus on natural language satisfiability (NLSat) problems.1 The key idea is to draw insights from empirical sampling of hard propositional SAT problems and from complexity-theoretic studies of language. This methodology allows us to distinguish easy from hard instances, and to systematically increase the complexity of existing reasoning benchmarks such as RuleTaker. We find that current transformers, given sufficient training data, are surprisingly robust at solving the resulting NLSat problems of substantially increased difficulty. They also exhibit some degree of scale-invariance—the ability to generalize to problems of larger size and scope.
Introduction. Motivated by the impressive performance of recent pretrained transformers (Devlin et al. 2019; Raffel et al. 2020) on a wide range of natural language understanding (NLU) benchmarks (Wang et al. 2019b,a; Xu et al. 2020), there has much been recent interest in investigating the linguistic and reasoning abilities of state-of-the-art neural models (Linzen, Dupoux, and Goldberg 2016; Talmor et al. 2020; Kassner, Kroje, and Schütze 2020; Yanaka et al. 2020; Hupkes et al. 2020; Richardson et al. 2020, inter alia). One particular thread of work focuses on probing whether transformers can perform logical reasoning over formal theories expressed in natural language (Clark, Tafjord, and Richardson 2020).
Discussion / Conclusion. Models trained on hard sets can solve some hard tasks. When looking at results on the hard instances (Table 3) we see that models trained on large collections of various types (i.e., on 150k-160k instances, see again Table 1) far outpace our baselines and achieve high performance on problems with not too many variables (e.g., 5-7 variables problems for GRL and 16-48 variables problems for RCL). A particularly intriguing result is the higher performance of models on the RCL language (with around 93-94% accuracy on problems with 60-70 ground variables) which was designed to be more complex by having quantified rules and constants that expand out to a much larger set of boolean variables. Given that the underlying rules were constructed from random 3SAT formula with a relatively smaller set of variables (5-8), this suggests that the model is able to learn some form of symmetry between the underlying rules and the instantiated rule propositions related to constants. Models exhibit limited generalization. Less impressive results are shown in the Table 2, where we see that models trained on small variable problems and fewer data fail to generalize to larger problems (e.g., generalizing from 5 variables to 10 or 12 variables).