Dynamic Task-Oriented Dialogue: A Comparative Study of Llama-2 and BERT in Slot Value Generation

Synthetic Dialogue Generation

Abstract. Research into language models fine-tuned to follow prompts has recently made notable advances. These models are commonly deployed as chatbots. One special case of chatbots is Task-Oriented Dialogue (TOD) systems, which aim to help the user achieve specific tasks using external services. High-quality training data for these systems is costly to obtain. We therefore evaluate whether the new prompt-following models can generate annotated synthetic dialogues and whether these can be used to train a TOD system. To this end, we generate data based on descriptions of a dialogue's goal. We train a state-of-the-art TOD system in a low-resource setting, with and without synthetic dialogues, and compare the results. The evaluation shows that using prompt-following language models to generate synthetic dialogues could help train better TOD systems.

Introduction. Dialogue systems, as a form of interaction between a computer and a human, have been researched extensively. One form of these is Task-Oriented Dialogue (TOD) systems, which allow the user to fulfill a task, such as booking a hotel or making a reservation at a restaurant, by using external services. To achieve this, the TOD system has to i) understand the user (Dialogue Understanding), ii) plan its next action, for example to provide information or request more information from the user (Policy Planning), and iii) generate a response that fits the dialogue, the policy, and any responses from the external service (Dialogue Generation) [6]. With the progress on large language models (LLMs) and the publication of large datasets (e.g. [3]), end-to-end models that solve all three tasks in unison have seen widespread adoption (e.g. [6,9,11]). These Transformer-based models usually profit from large amounts of training data. Depending on the use case, collecting this data can be expensive.

Discussion / Conclusion. We evaluate the method against the proposed research questions on the reliability (RQ 1) and cost (RQ 2) of the synthetic data generation and on the performance improvement (RQ 3) obtained when using these dialogues. While the generation of synthetic task-oriented dialogue data with the GPT-3.5-turbo model worked in general, there were multiple caveats. Most importantly, even though the annotation format was specified to the model, its output contained many missing, wrong, or unsupported labels. The model produced dialogue acts that are not supported, even though a list of all supported ones was given. Moreover, it did not always follow the templates. For example, a dialogue act was expected to contain exactly one hyphen, but the generated acts often contained several, or none, with underscores used instead. When the annotation was missing entirely, it was mostly for utterances that were greetings. This is somewhat reasonable, as the greeting dialogue act is similar to a no-operation. Figure 1 shows one example of a dialogue that was generated on the second try, i.e., after the model was told that its previous response was invalid. Still, the first message lacks the greeting annotation.
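
The annotation problems described above (missing labels, labels that do not contain exactly one hyphen, and labels outside the supported list) can be detected automatically before a dialogue is accepted. The following is a minimal sketch of such a check, not the validation code used in this work. The names `SUPPORTED_ACTS`, `check_act`, and `validate_dialogue`, the example act labels, and the representation of a dialogue as a list of (utterance, act) pairs are all assumptions made for illustration.

```python
# Hypothetical list of supported dialogue acts; the actual list given to the
# model is not reproduced in this section.
SUPPORTED_ACTS = {
    "general-greet",
    "hotel-inform",
    "hotel-request",
    "restaurant-inform",
}


def check_act(act):
    """Return None if the act label is acceptable, else a short reason."""
    if act is None or not act.strip():
        return "missing"
    act = act.strip()
    # The template expects exactly one hyphen; underscores or extra
    # hyphens mean the template was not followed.
    if act.count("-") != 1:
        return "malformed"
    if act not in SUPPORTED_ACTS:
        return "unsupported"
    return None


def validate_dialogue(turns):
    """Check a dialogue given as a list of (utterance, act) pairs.

    Returns a list of (turn_index, reason) for every problematic turn;
    an empty list means that no problem was found.
    """
    problems = []
    for index, (_utterance, act) in enumerate(turns):
        reason = check_act(act)
        if reason is not None:
            problems.append((index, reason))
    return problems
```

Under these assumptions, a dialogue whose first turn is an unannotated greeting, such as `[("Hello!", None), ("I need a hotel.", "hotel-inform")]`, yields `[(0, "missing")]`. A non-empty result could then be used to tell the model that its previous response was invalid and to request a new one, as was done for the dialogue in Figure 1.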