SYNTHESIS NOTE

Can synthetic data replace seed examples in task generation?

Can models generate high-quality synthetic data for novel tasks without relying on existing input-output exemplars? This matters because many specialized domains lack training examples to work from.

Synthesis note · 2026-05-03 · sourced from Data

Most synthetic data generation methods require seed examples drawn from the target distribution: actual input-output pairs the model can mimic and extend. This requirement breaks down for genuinely novel or highly domain-specific tasks where no such instances exist. TarGEN proposes a four-step prompting strategy that is seedless in this sense: it does not require specific task instances, which broadens its applicability to novel domains.

The core distinction is between seed examples (full input-output exemplars demonstrating the task) and instance seeds (atomic elements that form the unique basis of each generated instance). An instance seed can be a sentence, a passage, or a more atomic element, but crucially it is not an input exemplar. Generation proceeds in four steps: (1) initializing a set of contexts to inject semantic diversity, (2) generating task-specific instance seeds, (3) formulating a label constraint for each seed, and (4) producing a data instance attributable to the constrained label.
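The four steps described above can be sketched as a small pipeline. This is a minimal illustration, not TarGEN's actual implementation: the `LLM` callable, the prompt wording, the `Instance` record, and the round-robin label assignment are all assumptions introduced here, since this note does not reproduce the paper's prompts.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Instance:
    context: str  # step 1: context that injects semantic diversity
    seed: str     # step 2: instance seed (an atomic element, not an exemplar)
    label: str    # step 3: label the instance is constrained to
    text: str     # step 4: generated content attributable to that label


def lines(reply: str) -> List[str]:
    """Split a completion into non-empty, stripped lines."""
    return [ln.strip() for ln in reply.splitlines() if ln.strip()]


def generate_dataset(llm: LLM, task: str, labels: List[str],
                     n_contexts: int = 2,
                     seeds_per_context: int = 2) -> List[Instance]:
    # Step 1: initialize contexts (prompt wording is illustrative only).
    contexts = lines(llm(
        f"List {n_contexts} distinct domains or settings, one per line."
    ))[:n_contexts]

    # Assumption: labels are assigned round-robin to keep classes balanced.
    label_cycle = cycle(labels)
    dataset: List[Instance] = []

    for context in contexts:
        # Step 2: generate task-specific instance seeds for this context.
        seeds = lines(llm(
            f"Task: {task}\nContext: {context}\n"
            f"Write {seeds_per_context} short sentences, one per line."
        ))[:seeds_per_context]

        for seed in seeds:
            # Step 3: fix the label constraint for this seed.
            label = next(label_cycle)
            # Step 4: produce the rest of the instance under that constraint.
            text = llm(
                f"Task: {task}\nSeed: {seed}\n"
                f"Complete the instance so that the correct label is '{label}'."
            ).strip()
            dataset.append(Instance(context, seed, label, text))

    return dataset
```

No input-output exemplar appears in any prompt; the only task-specific material is the task description, the label set, and the seeds the model generates itself.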

The clever move is the label constraint formulation. Rather than asking the model to produce input-output pairs from scratch (which requires examples to learn the distribution), TarGEN generates the input element first via the instance seed, and then constrains the LLM to produce a corresponding output that matches a specified label. Augmenting this with a self-correction module that lets the LLM rectify inaccurately labeled instances during dataset creation produces reliable labels even without ground-truth data to validate against.
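One way to picture the self-correction step is a second pass in which the model re-judges each generated instance and the label is overwritten only when the model names a different valid label. The sketch below assumes that design; the function name `self_correct`, the prompt text, and the `LLM` callable are hypothetical, and the paper's actual procedure may differ.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def self_correct(llm: LLM, task: str, labels: List[str],
                 instance_text: str, assigned_label: str) -> Tuple[str, bool]:
    """Return (final_label, was_changed) after asking the model to re-judge."""
    prompt = (
        f"Task: {task}\n"
        f"Instance: {instance_text}\n"
        f"Assigned label: {assigned_label}\n"
        f"Answer with exactly one of: {', '.join(labels)}."
    )
    verdict = llm(prompt).strip()

    # Only accept the verdict if it is a known label and differs from
    # the assigned one; unparseable replies leave the label untouched.
    if verdict in labels and verdict != assigned_label:
        return verdict, True
    return assigned_label, False
```

Ignoring unparseable verdicts is a conservative choice: without ground-truth data, a malformed reply is weaker evidence than the original constraint.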

The empirical results show this is not just theoretically appealing. On eight SuperGLUE tasks, models trained on the synthetic version perform 1-3 points higher than those trained on the original datasets, and Llama2 (7B) pre-finetuned on synthetic SuperGLUE surpasses the Self-Instruct dataset baseline by 2.62 points on the OpenLLM leaderboard. The synthetic data shows comparable or higher complexity and diversity, with similar bias levels to original data.

The structural contribution is that "no seed data" is two distinct claims that prior work conflated: no input-output exemplars (which TarGEN achieves) and no per-instance task material at all (which TarGEN does not claim, since instance seeds are still task-specific atoms). Distinguishing these clarifies what kind of data generation is possible without prior task examples and what still requires task-specific scaffolding. The related note "Can we generate synthetic data without any seed examples?" pushes this distinction further by replacing instance seeds with taxonomy nodes.

Original note title

instance seeds replace input exemplars in synthetic data generation — atomic elements like sentences or passages permit task replication without requiring existing data instances