Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper · arXiv 2406.20094 · Published June 28, 2024

We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub, a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (∼13% of the world’s total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub’s use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs, and tools (functions) at scale, we demonstrate that persona-driven data synthesis is versatile, scalable, flexible, and easy to use. It could drive a paradigm shift in synthetic data creation and its practical applications, which may have a profound impact on LLM research and development. Disclaimer: Persona Hub can facilitate synthetic data creation at a billion-scale to simulate diverse inputs (i.e., use cases) from a wide variety of real-world users.

Introduction. As synthetic data (Bauer et al., 2024; Liu et al., 2024), typically referring to data generated by models or algorithms rather than directly by humans, becomes increasingly valued (Li et al., 2023b) for training large language models (LLMs), there is growing interest in data synthesis using LLMs: by simply specifying a data synthesis prompt, an LLM is expected to produce desirable synthetic data. In practice, however, it is non-trivial to create synthetic data at scale: while we can easily scale up the quantity of synthetic data, it is difficult to ensure that its diversity scales up as well. Without considering sampling, an LLM can produce only one instance given a data synthesis prompt. Therefore, to create diverse synthetic data at scale (e.g., 1 billion diverse math problems), a large number of diverse prompts is needed.
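The point about needing many diverse prompts can be illustrated with a minimal sketch, assuming that a persona description is simply inserted into a fixed task prompt so that each persona yields a distinct data synthesis prompt. The template wording, the function name `build_prompts`, and the example personas are hypothetical and are not taken from the paper.

```python
from typing import List

# Hypothetical prompt template; the exact wording is an illustrative assumption.
TEMPLATE = (
    "Create a {task} that would be relevant to the following persona:\n"
    "{persona}"
)


def build_prompts(personas: List[str], task: str) -> List[str]:
    """Return one data synthesis prompt per persona for the given task."""
    return [TEMPLATE.format(task=task, persona=p.strip()) for p in personas]


# Illustrative persona descriptions (assumed, not drawn from Persona Hub).
personas = [
    "a moving company driver",
    "a chemical kinetics researcher",
    "a high-school music teacher",
]

prompts = build_prompts(personas, "math problem")

# Each prompt would then be sent to an LLM of your choice (not shown here).
for prompt in prompts:
    print(prompt)
    print("---")
```

Under this assumption, the number of distinct prompts grows with the number of distinct personas, while the task description stays fixed.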

Discussion / Conclusion. We propose a novel persona-driven data synthesis methodology and present Persona Hub, a collection of 1 billion diverse personas automatically curated from web data. We show that this methodology can facilitate the scaling of synthetic data creation across various scenarios, demonstrating its potential to revolutionize the creation and application of synthetic data, and its prospects as a general data synthesis engine for both research and practice. Although this first version of Persona Hub already contains 1 billion personas, their descriptions focus only on major aspects and lack fine-grained details (e.g., preferences for colors and numbers; specific family backgrounds, historical contexts, and life experiences). We plan to refine the personas in subsequent versions of Persona Hub, aiming for their descriptions to be as detailed as those found in Wikipedia articles about individuals. These more detailed descriptions will make each persona more unique, thereby scaling up Persona Hub and fostering more opportunities for synthetic data creation, while also empowering practical applications such as personalized conversations (e.g., character.ai).
