What distinguishes instance seeds from full input-output exemplar requirements?
This explores the difference between 'instance seeds' (atomic task ingredients used to bootstrap synthetic data) and the older requirement of supplying complete input-output example pairs — and why that distinction matters for making training data in domains where no examples exist.
This explores the difference between 'instance seeds' and full input-output exemplars: a seed is just the raw ingredient of a task — an atomic element you provide up front — whereas a full exemplar is a finished, paired demonstration (here's an input, here's the correct labeled output) that you have to already possess. The corpus frames this as a shift in what generation needs to get started. In Can synthetic data replace seed examples in task generation?, TarGEN seeds only the inputs and then constrains label generation as a separate downstream step, rather than demanding complete demonstration pairs. The payoff isn't mainly accuracy (1–3 points on SuperGLUE); it's reach — you can manufacture data for domains that have no prior examples at all, because you no longer need to own the answer key before you begin.
What makes this work connects to a deeper finding about what models actually learn from examples. If demonstrations taught genuine task understanding, you couldn't safely strip the outputs out of them. But Does instruction tuning teach task understanding or output format? shows models trained on semantically empty or even wrong instructions perform about as well as those given correct ones — what transfers is knowledge of the output space, not the meaning. The same lesson appears in reasoning: Does logical validity actually drive chain-of-thought gains? finds illogical chain-of-thought exemplars match valid ones, because the model absorbs the form, not the inference. If form and output-distribution are the real cargo, then a full input-output pair is overkill — a seed plus a constrained label step carries the load.
The corpus also marks the logical endpoint of this trajectory: dropping seeds entirely. Can we generate synthetic data without any seed examples? (Simula) replaces seeds with taxonomic decomposition, separating global coverage from local diversity so quality, diversity, and complexity become independently controllable without any seed data. And Can aligned LLMs generate their own training data? (MAGPIE) goes further still — an aligned model generates 4M high-quality instruction pairs from nothing but pre-query formatting tokens. So there's a clear ladder: full exemplars → input-only seeds → no seeds at all → bare format scaffolds.
The thing you didn't know you wanted to know: the move away from full exemplars isn't a clever data-engineering trick, it's a quiet admission of what fine-tuning was ever teaching. Each rung down the ladder removes a thing we assumed was essential — the answer, then the example, then the seed — and performance barely flinches. That's strong evidence the 'understanding' we thought lived in our training pairs was mostly the model learning the shape of acceptable outputs.
Sources 5 notes
TarGEN generates synthetic data using atomic task elements (instance seeds) instead of full input-output examples, achieving 1-3 point improvements on SuperGLUE tasks. The approach works by constraining label generation after seeding inputs, enabling data creation for domains with no prior examples.
Models trained on semantically empty or deliberately incorrect instructions achieve comparable performance to those trained on full correct instructions, achieving 43% vs random baseline 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Simula separates global coverage from local diversity, using taxonomy construction for coverage and agentic refinement for complexity. This architecture makes all three desiderata—quality, diversity, complexity—controllable simultaneously without requiring seed data.
MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.