Can synthetic data replace seed examples in task generation?

Can models generate high-quality synthetic data for novel tasks without relying on existing input-output exemplars? This matters because many specialized domains lack training examples to work from.

Synthesis note · 2026-05-03 · sourced from Data

Most synthetic data generation methods require seed examples drawn from the target distribution: actual input-output pairs the model can mimic and extend. This requirement breaks down for genuinely novel or highly domain-specific tasks where no such instances exist. TarGEN proposes a four-step prompting strategy that is seedless in this sense: it requires no specific task instances, which broadens its applicability to novel domains.

The core distinction is between seed examples (full input-output exemplars demonstrating the task) and instance seeds (atomic elements that form the unique basis of each generated instance). An instance seed can be a sentence, a passage, or a more atomic element, but crucially it is not an input exemplar. Generation proceeds in four steps: (1) initializing a set of contexts to inject semantic diversity, (2) generating task-specific instance seeds, (3) formulating per-seed label constraints, and (4) producing a data instance attributable to the constrained label.
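The four steps can be sketched as a small pipeline. This is a minimal illustration, not TarGEN's actual implementation: the prompt wording, the `LLM` callable interface, the `Instance` record, the round-robin label assignment, and the `fake_llm` stand-in are all assumptions introduced here so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Instance:
    context: str  # step 1: semantic setting
    seed: str     # step 2: atomic instance seed (not an input-output exemplar)
    label: str    # step 3: label the instance is constrained to
    text: str     # step 4: generated data instance


def _lines(text: str) -> List[str]:
    """Split a completion into non-empty, stripped lines."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def generate_dataset(llm: LLM, task: str, labels: List[str],
                     n_contexts: int = 3, seeds_per_context: int = 2) -> List[Instance]:
    # Step 1: initialize contexts to inject semantic diversity.
    contexts = _lines(llm(f"List {n_contexts} distinct settings for the task: {task}"))
    data: List[Instance] = []
    for context in contexts[:n_contexts]:
        # Step 2: generate task-specific instance seeds for this context.
        seeds = _lines(llm(f"Write {seeds_per_context} short passages about {context} for: {task}"))
        for seed in seeds[:seeds_per_context]:
            # Step 3: formulate the per-seed label constraint.
            # Assumption: labels are assigned round-robin to keep classes balanced.
            label = labels[len(data) % len(labels)]
            # Step 4: produce an instance attributable to the constrained label.
            text = llm(f"Given the passage '{seed}', produce a {task} "
                       f"example whose label is '{label}'.").strip()
            data.append(Instance(context, seed, label, text))
    return data


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model (assumption, for offline runs)."""
    if prompt.startswith("List"):
        return "news\nmedicine\nlaw"
    if prompt.startswith("Write"):
        return "seed A\nseed B"
    return "generated example"


dataset = generate_dataset(fake_llm, "textual entailment",
                           ["entailment", "not_entailment"])
```

No input-output exemplar appears anywhere in the prompts. The only task-specific material is the task description, the label set, and the generated instance seeds.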

The clever move is the label constraint formulation. Rather than asking the model to produce input-output pairs from scratch (which requires examples to learn the distribution), TarGEN generates the input element first via the instance seed, and then constrains the LLM to produce a corresponding output that matches a specified label. A self-correction module, which lets the LLM rectify inaccurately labeled instances during dataset creation, then improves label reliability even without ground-truth data to validate against.
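A self-correction pass of this kind might look like the sketch below. It is an assumption about the general shape only: this note does not give TarGEN's prompts or acceptance rule, so the prompt text, the `self_correct` function, and the policy of keeping the original label when the verifier replies with something outside the label set are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


def self_correct(llm: LLM, task: str, labels: List[str],
                 instances: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Ask the model to re-check each (text, label) pair and relabel if needed."""
    corrected = []
    for text, label in instances:
        prompt = (
            f"Task: {task}\n"
            f"Allowed labels: {', '.join(labels)}\n"
            f"Input: {text}\n"
            f"Assigned label: {label}\n"
            "Reply with the correct label only."
        )
        verdict = llm(prompt).strip()
        # Assumption: accept the verdict only if it is a valid label;
        # otherwise keep the label from the constrained generation step.
        corrected.append((text, verdict if verdict in labels else label))
    return corrected


def fake_verifier(prompt: str) -> str:
    """Deterministic stand-in for a real model (assumption, for offline runs)."""
    return "negative" if "terrible" in prompt else "positive"


checked = self_correct(
    fake_verifier,
    "sentiment classification",
    ["positive", "negative"],
    [("The food was terrible.", "positive"), ("Lovely service.", "positive")],
)
```

Falling back to the original label on an invalid reply keeps the dataset size fixed. A stricter variant could drop such instances instead.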

The empirical results show this is not just theoretically appealing. On eight SuperGLUE tasks, models trained on the synthetic version perform 1-3 points higher than those trained on the original datasets, and Llama2 (7B) pre-finetuned on synthetic SuperGLUE surpasses the Self-Instruct dataset baseline by 2.62 points on the OpenLLM leaderboard. The synthetic data shows comparable or higher complexity and diversity, with similar bias levels to original data.

The structural contribution is to show that "no seed data" bundles two distinct claims that prior work conflated: no input-output exemplars (which TarGEN achieves) and no per-instance task material at all (which TarGEN does not claim, since instance seeds are still task-specific atoms). Distinguishing these clarifies what kind of data generation is possible without prior task examples and what still requires task-specific scaffolding. The companion note "Can we generate synthetic data without any seed examples?" pushes the distinction further by replacing instance seeds with taxonomy nodes.


Related concepts in this collection 5


Can we generate synthetic data without any seed examples?
Existing synthetic data methods rely on seed examples from the target distribution, which is impractical for novel domains. Can taxonomic decomposition eliminate this dependence while maintaining controllable coverage?
extends: companion piece; TarGEN replaces input-output exemplars, while Simula replaces instance seeds with taxonomies (same direction, different granularity)

How do quality, diversity, and complexity affect synthetic data differently?
When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.
complements: TarGEN reports comparable QDC to original data; this note tells you which dimensions to look at when comparing

Can synthetic dialogues become realistic through layered diversity?
Explores whether combining persona variation, subtopic specificity, and contextual grounding can generate synthetic dialogues that match real conversational data quality and capture the full spectrum of dialogue diversity.
exemplifies: instance-seed-style decomposition applied to dialogue, where atomic elements (persona × subtopic × context) drive diversity

Can models trained on many imperfect experts outperform each one?
Do generative models trained on diverse, imperfect human experts develop an implicit consensus that surpasses any individual contributor? This explores whether aggregating diverse perspectives at training time, rather than inference time, can denoise human biases.
complements: synthetic data as denoising signal; TarGEN's self-correction module operates at a similar denoising layer

Do different AI models actually produce diverse outputs?
Explores whether using multiple different language models together creates genuine diversity or whether shared training and alignment cause them to converge on similar answers despite independence.
tension: instance seeds inject atomic-level variation, but the generator's hivemind tendencies may collapse downstream diversity unless explicitly controlled
