Large Language Models as Planning Domain Generators

Paper · arXiv 2405.06650 · Published April 2, 2024
Task Planning · Domain Specialization in LLMs

Developing domain models is one of the few remaining places that require manual human labor in AI planning. Thus, in order to make planning more accessible, it is desirable to automate the process of domain model generation. To this end, we investigate if large language models (LLMs) can be used to generate planning domain models from simple textual descriptions. Specifically, we introduce a framework for automated evaluation of LLM-generated domains by comparing the sets of plans for domain instances. Finally, we perform an empirical analysis of 7 large language models, including coding and chat models across 9 different planning domains, and under three classes of natural language domain descriptions. Our results indicate that LLMs, particularly those with high parameter counts, exhibit a moderate level of proficiency in generating correct planning domains from natural language descriptions. Our code is available at https://github.com/IBM/NL2PDDL.
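To make the idea of comparing plan sets concrete, here is a minimal Python sketch. It is not the NL2PDDL implementation: it assumes a ground, STRIPS-like toy model rather than real PDDL, enumerates plans by brute force up to a length bound, and every name in it (`Action`, `act`, `plans_up_to`, `compare_plan_sets`, and the pick/place domain) is hypothetical. The candidate domain stands in for an LLM-generated domain that omits one delete effect.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple, Set, Tuple

Plan = Tuple[str, ...]


class Action(NamedTuple):
    """A ground STRIPS-style action (hypothetical toy representation)."""
    pre: FrozenSet[str]     # facts that must hold before the action
    add: FrozenSet[str]     # facts made true by the action
    delete: FrozenSet[str]  # facts made false by the action


def act(pre: Iterable[str], add: Iterable[str], delete: Iterable[str] = ()) -> Action:
    return Action(frozenset(pre), frozenset(add), frozenset(delete))


def plans_up_to(actions: Dict[str, Action], init: Iterable[str],
                goal: Iterable[str], max_len: int) -> Set[Plan]:
    """Enumerate every action sequence of length <= max_len that ends in a goal state."""
    goal_facts = frozenset(goal)
    found: Set[Plan] = set()
    frontier = [(frozenset(init), ())]  # (state, plan so far)
    for _ in range(max_len + 1):
        next_frontier = []
        for state, plan in frontier:
            if goal_facts <= state:
                found.add(plan)
            for name, a in actions.items():
                if a.pre <= state:  # action is applicable
                    next_frontier.append(((state - a.delete) | a.add, plan + (name,)))
        frontier = next_frontier
    return found


def compare_plan_sets(reference: Dict[str, Action], candidate: Dict[str, Action],
                      init: Iterable[str], goal: Iterable[str], max_len: int) -> dict:
    """Compare bounded plan sets of two domains on one problem instance."""
    ref = plans_up_to(reference, init, goal, max_len)
    cand = plans_up_to(candidate, init, goal, max_len)
    return {
        "missing_from_candidate": ref - cand,  # valid plans the candidate rejects
        "extra_in_candidate": cand - ref,      # plans only the candidate accepts
        "equivalent": ref == cand,
    }


# Reference (ground-truth) toy domain.
reference = {
    "pick": act({"hand-empty", "on-table"}, {"holding"}, {"hand-empty", "on-table"}),
    "place": act({"holding"}, {"on-shelf", "hand-empty"}, {"holding"}),
}
# Candidate domain: "place" forgets to delete "holding" (an assumed, typical slip).
candidate = dict(reference, place=act({"holding"}, {"on-shelf", "hand-empty"}))

init, goal = {"hand-empty", "on-table"}, {"on-shelf"}
reference_plans = plans_up_to(reference, init, goal, max_len=3)
report = compare_plan_sets(reference, candidate, init, goal, max_len=3)
print(report)
```

On this instance the two domains agree on the shortest plan but differ on longer ones, which is the kind of discrepancy a plan-set comparison can surface and a purely syntactic check would miss. Brute-force enumeration only works for toy instances; a practical setup would rely on a planner and a plan validator instead.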

Introduction. Large language models (LLMs) have demonstrated robust emergent abilities for open-ended tasks like story generation, poetry, and dialogue (Zhao et al. 2023b; Hayawi, Shahriar, and Mathew 2024). Their potential is no longer limited to natural language. Rather, they have shown the ability to generate highly structured output that resembles code from natural language descriptions of programs (Li et al. 2023; Touvron, Lavril, and Izacard 2023). It is natural to wonder how these abilities generalize to knowledge engineering tasks such as those used for problem representation in symbolic methods. Despite the efficacy of symbolic methods such as Boolean satisfiability (SAT) solvers (Biere et al. 2021), automated planners (Helmert 2006), and automated theorem provers (Harrison, Urban, and Wiedijk 2014) in their respective domains, the issue of representing a problem accurately and efficiently still hinders the wider adoption and accessibility of these powerful methods.

Discussion / Conclusion and Future Work. There are many avenues that could be explored using this work as a springboard. In particular, we are interested in three main directions: (1) deeper investigations of the capabilities of large language models in terms of selection and tuning, (2) using re-prompting to fix mistakes in PDDL for chat-based LLMs, and (3) investigating more robust tasks and metrics. First, in terms of LLMs, there is a lot that could be done to extend this work. The results showing improved performance on larger models are a good starting point for future work and are in line with Guan et al. (2023), which evaluates GPT-4 and GPT-3.5 and comes to the similar conclusion that larger pre-trained models are better at PDDL construction. Future work and applications not interested in tuning should take this into consideration by using larger models such as GPT-4 and LLaMA-70b as baselines; other large models such as Bloom (BigScience Workshop 2022) would also be promising to evaluate.