STaR-GATE: Teaching Language Models to Ask Clarifying Questions
When prompting language models to complete a task, users often leave important aspects unsaid. While asking questions could resolve this ambiguity (GATE; Li et al., 2023), models often struggle to ask good questions. We explore a language model’s ability to self-improve (STaR; Zelikman et al., 2022) by rewarding the model for generating useful questions—a simple method we dub STaR-GATE. We generate a synthetic dataset of 25,500 unique persona-task prompts to simulate conversations between a pretrained language model—the Questioner—and a Roleplayer whose preferences are unknown to the Questioner. By asking questions, the Questioner elicits preferences from the Roleplayer. The Questioner is iteratively finetuned on questions that increase the probability of high-quality responses to the task, which are generated by an Oracle with access to the Roleplayer’s latent preferences. After two iterations of self-improvement, the Questioner asks better questions, allowing it to generate responses that are preferred over responses from the initial model on 72% of tasks. Our results indicate that teaching a language model to ask better questions leads to better personalized responses.
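The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `select_training_conversations` and `toy_score` are hypothetical, and the choice to keep only the single best conversation per task, and only when it beats a no-conversation baseline, is an assumption made for this sketch. In practice the scoring function would be the Questioner's log-probability of the Oracle's gold response given the task and the elicited conversation; here a toy word-overlap scorer stands in so the example runs without a model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: score(task, conversation, gold_response) -> float.
# In a real setup this would be log p(gold_response | task, conversation)
# under the current Questioner; an empty conversation gives the baseline.
ScoreFn = Callable[[str, str, str], float]


def select_training_conversations(
    samples: Dict[str, List[str]],
    gold: Dict[str, str],
    score: ScoreFn,
) -> List[Tuple[str, str]]:
    """For each task, keep the sampled conversation that most raises the gold score."""
    selected = []
    for task, conversations in samples.items():
        baseline = score(task, "", gold[task])
        best_conv, best_gain = None, 0.0
        for conv in conversations:
            gain = score(task, conv, gold[task]) - baseline
            if gain > best_gain:  # assumption: keep only strict improvements
                best_conv, best_gain = conv, gain
        if best_conv is not None:
            selected.append((task, best_conv))
    return selected


def toy_score(task: str, conversation: str, gold_response: str) -> float:
    """Toy stand-in for a log-probability: penalize gold words absent from the conversation."""
    gold_words = set(gold_response.lower().split())
    seen = set(conversation.lower().split())
    return -float(len(gold_words - seen))


samples = {
    "pasta recipe": [
        "Q: Any dietary restrictions? A: I am vegetarian",
        "Q: Do you like food? A: Yes",
    ],
    "workout plan": ["Q: How are you? A: Fine"],
}
gold = {
    "pasta recipe": "vegetarian pasta with pesto",
    "workout plan": "low-impact routine for sore knees",
}

# The selected (task, conversation) pairs would form the finetuning set
# for the next iteration of the Questioner.
selected = select_training_conversations(samples, gold, toy_score)
```

In this toy run, only the conversation that surfaces the user's vegetarian preference is kept; the "workout plan" task contributes nothing because its one sampled conversation does not improve on the baseline.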
Introduction. When interacting with users who have different preferences, language models (LMs) encounter task ambiguity (Finn et al., 2018; Tamkin et al., 2022). Depending on the user, the same request might correspond to a different task. For example, consider a user who asks an LM for a pasta recipe (Figure 1). If the model could elicit information about the user’s dietary restrictions, favorite sauces, and preferred cooking methods, it could tailor the recipe to their specific needs and desires. The model might suggest a vegetarian pasta recipe for a user who is vegetarian, or propose a traditional lasagna recipe for a user with a passion for Neapolitan cuisine. However, if this information is not explicitly specified in the prompt, the model may generate a generic recipe that fails to account for the user’s unique preferences and constraints. In high-stakes domains like healthcare or education, such task ambiguity can have significant consequences. One approach to resolving task ambiguity is by asking targeted questions to elicit relevant information from users.
Discussion / Conclusion. One important limitation of our work is that it depends on gold responses (i.e., labels). Although our current work therefore cannot be framed as full self-play or self-improvement, a stronger Questioner (e.g., mixtral-8x7b-instruct or an even larger model) might be able to function as its own oracle, removing the dependency on gold responses. In addition to filtering based on gold responses, another extension could directly supervise the questions themselves, which might help the model ask more effective and targeted questions. Another limitation of our work is the observed drop in win rates when the Roleplayer is changed from mixtral-8x7b-instruct to mistral-7b-instruct or gemma-7b-instruct. While this drop might be partially attributed to mistral and gemma being less capable Roleplayers, it highlights the importance of including multiple Roleplayers during training to improve the robustness of the Questioner.