Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering

Paper · arXiv 2305.14901 · Published May 24, 2023
Question Answering and Search

We propose Chain-of-Questions, a framework that trains a model to robustly answer multistep questions by generating and answering sub-questions. We obtain supervision for sub-questions from human-annotated question decomposition meaning representation (QDMR), but QDMR does not include annotated answers to sub-questions. To overcome this technical challenge, we treat sub-answers as latent variables and infer them with a novel dynamic mixture of Hard-EM and MAPO. Chain-of-Questions is effective and robust, greatly outperforming strong neuro-symbolic methods by 9.0 F1 on a DROP contrast set and GPT-3.5 by 24.3 F1 on a HOTPOTQA adversarial set.
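To make the latent-variable idea concrete, the following is a minimal sketch of the Hard-EM half only: among candidate sub-answer sequences that lead to the correct final answer, keep the one the model currently finds most likely and train on it. The candidate format, the function names hard_em_select and hard_em_loss, and the example values are assumptions made for illustration; the MAPO term and the dynamic mixing schedule are not reproduced here, and this is not the authors' implementation.

```python
from typing import List, Optional, Tuple

# Hypothetical candidate format (an assumption for this sketch):
# (sub-answers, model log-probability of the sequence,
#  whether it leads to the correct final answer)
Candidate = Tuple[List[str], float, bool]


def hard_em_select(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the most likely candidate consistent with the final answer."""
    consistent = [c for c in candidates if c[2]]
    if not consistent:
        return None  # no candidate reaches the gold answer
    return max(consistent, key=lambda c: c[1])


def hard_em_loss(candidates: List[Candidate]) -> Optional[float]:
    """Hard-EM loss: negative log-probability of the selected candidate."""
    best = hard_em_select(candidates)
    return None if best is None else -best[1]


# Illustrative values only, not taken from the paper.
candidates = [
    (["2 touchdowns", "1 field goal"], -1.2, True),
    (["3 touchdowns", "1 field goal"], -0.4, False),  # likely but wrong
    (["2 touchdowns", "one field goal"], -2.5, True),
]
best = hard_em_select(candidates)
loss = hard_em_loss(candidates)
```

Note that the second candidate has the highest log-probability but is ignored because it does not yield the correct final answer; only consistent candidates compete for selection.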

Introduction. Multistep question answering (QA) poses a reasoning challenge that current state-of-the-art QA models have not fully addressed. Strong finetuned QA models like UnifiedQA (Khashabi et al., 2020a) can achieve impressive results on various QA tasks through multitask training, but exhibit subpar performance on multistep reasoning. Moreover, because some multistep reasoning benchmarks contain annotation artifacts or reasoning shortcuts (Jiang and Bansal, 2019), dedicated models trained on these benchmarks often have much lower F1 performance on contrast sets (Gardner et al., 2020) and adversarial sets (Schlegel et al., 2021), indicating their lack of robustness. Prior research has attempted to tackle this challenge with various question decomposition strategies to explicitly incorporate reasoning chains into the question answering process. However, as we show in our experiments, existing methods (Andor et al., 2019; Chen et al., 2020) that perform explicit reasoning steps still suffer from robustness issues.

Discussion / Conclusion. We present Chain-of-Questions (CoQ), a robust sub-question generation and answering framework that shows strong performance on DROP and HOTPOTQA. CoQ uses a combination of Hard-EM and MAPO for training, effectively optimizing the latent variables associated with sub-answers of intermediate questions. We envision multiple directions for future work. CoQ requires supervision from QDMR; other families of RL methods we did not explore may be used to reduce our reliance on this supervision, and instead allow the model to learn appropriate decompositions from scratch. On the other hand, we could also explore using different question decompositions, such as ones generated by LLMs like GPT-3.5. Either approach could help us extend CoQ to other multistep reasoning datasets with no QDMR annotation. Similar to DROP, FINQA (Chen et al., 2021) consists of numerical reasoning questions over financial data. In a format similar to HOTPOTQA, ROPES (Lin et al., 2019) requires complex multistep reasoning between a background context paragraph and a situation context paragraph. We could either train models on these datasets if we can eliminate our reliance on QDMR data, or test whether models trained with CoQ can transfer well to these other datasets.