What sampling strategies prevent nonsensical combinations when composing taxonomy nodes?

This explores how to combine pieces of a taxonomy (tools, concepts, categories) into synthetic examples without producing absurd, incoherent pairings — and which sampling tricks keep the combinations sensible.

The corpus's sharpest answer comes from synthetic tool-calling data: uniform random sampling fails precisely because unrelated tools cannot credibly compose. The fix described in "Why does random tool sampling produce unrealistic synthetic training data?" (ToolFlow) is to stop sampling uniformly and instead draw tools from a *relevance graph*, so that the things you combine are already neighbors that plausibly belong together, and then to generate against a dialogue plan so the composition has a reason to exist. The lesson generalizes: nonsense comes from sampling combinations that the structure already says are far apart.
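As a rough illustration of relevance-graph sampling, the sketch below grows a tool set by only ever adding a neighbor of a tool already chosen. The graph, the tool names, and the function `sample_related_tools` are hypothetical and invented for this example; this is not ToolFlow's implementation, only one simple way to get the "sample neighbors, not strangers" behavior.

```python
import random

# Hypothetical relevance graph: an edge means two tools plausibly appear
# in the same task. Tool names and edges are invented for illustration.
RELEVANCE = {
    "search_flights": {"book_hotel", "get_weather", "convert_currency"},
    "book_hotel": {"search_flights", "get_weather", "convert_currency"},
    "get_weather": {"search_flights", "book_hotel"},
    "convert_currency": {"search_flights", "book_hotel"},
    "compile_code": {"run_tests"},
    "run_tests": {"compile_code"},
}


def sample_related_tools(graph, k, rng):
    """Pick a seed tool, then repeatedly add a neighbor of the chosen set.

    Returns at most k tools; fewer if the seed's connected region is small.
    """
    chosen = [rng.choice(sorted(graph))]
    while len(chosen) < k:
        # Candidates are tools adjacent to anything already chosen.
        frontier = sorted({n for t in chosen for n in graph[t]} - set(chosen))
        if not frontier:
            break  # nothing related is left, so stop rather than add a stranger
        chosen.append(rng.choice(frontier))
    return chosen


tools = sample_related_tools(RELEVANCE, 3, random.Random(0))
print(tools)
```

Because every added tool is adjacent to one already in the set, a sample can never mix the travel tools with the build tools in this toy graph, which uniform sampling over all six names would do regularly.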

That points to a deeper idea running through the collection: the geometry of the taxonomy itself can tell you what is safe to combine. In "Do embedding eigenvectors organize taxonomy from coarse to fine?", the leading eigenvectors of the embedding Gram (similarity) matrix separate broad branches first, then finer sub-branches, mirroring the WordNet hypernym tree level by level. If a taxonomy has this coarse-to-fine spectral order, you have a built-in notion of distance: two nodes that split apart only in the late, fine eigenvectors share a branch and are usually safe to combine, while nodes that the first few eigenvectors already separate sit in distant coarse branches, which is exactly where "nonsensical" lives. Sampling within neighborhoods, not across the whole space, is the through-line shared with the relevance-graph approach.
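A toy example can make the coarse-to-fine order concrete. The sketch below assumes hand-built vectors (root feature plus coarse-branch feature plus leaf feature) rather than real embeddings or WordNet data, and uses plain power iteration with deflation to find the two leading eigenvectors of the Gram matrix. All names here are hypothetical.

```python
import math

# Toy embeddings (an assumption, not data from the cited note): each vector is
# root + coarse-branch + leaf features, so siblings overlap more than cousins.
LEAVES = ["cat", "dog", "hammer", "saw"]
EMB = [
    [1.0, 1.0, 0.0, 0.5, 0.0, 0.0, 0.0],  # cat: root, animal branch, own leaf
    [1.0, 1.0, 0.0, 0.0, 0.5, 0.0, 0.0],  # dog: root, animal branch, own leaf
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.5, 0.0],  # hammer: root, tool branch, own leaf
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.5],  # saw: root, tool branch, own leaf
]


def gram(vectors):
    """Matrix of pairwise dot products between the embedding vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project_out(v, basis):
    """Remove from v its components along each unit vector in basis."""
    for b in basis:
        dot = sum(x * y for x, y in zip(v, b))
        v = [x - dot * y for x, y in zip(v, b)]
    return v


def leading_eigenvector(matrix, orthogonal_to=(), iters=200):
    """Power iteration, deflated against eigenvectors already found."""
    # Fixed, non-symmetric start vector sized for this 4-leaf toy.
    v = normalize(project_out([1.0, 0.5, -0.3, 0.2], orthogonal_to))
    for _ in range(iters):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        v = normalize(project_out(w, orthogonal_to))
    return v


G = gram(EMB)
first = leading_eigenvector(G)                           # shared "root" direction
second = leading_eigenvector(G, orthogonal_to=[first])   # coarse branch split
for name, value in zip(LEAVES, second):
    print(name, "animal-side" if value * second[0] > 0 else "tool-side")
```

In this toy, the first eigenvector treats all leaves alike and the second splits animals from tools; within-branch differences only show up in later, smaller eigenvectors. A sampler could read the sign pattern of the early eigenvectors as a coarse branch label and restrict pairings to leaves that share it.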

The synthetic-data side of the corpus shows why this matters for coverage rather than just correctness. "Can we generate synthetic data without any seed examples?" (Simula) deliberately *separates* global coverage from local diversity: taxonomy construction handles what to cover, and agentic refinement handles complexity, so you can spread across the space without letting any single sample drift into incoherence. The separation is itself a control: you decide where to combine before you decide how richly. "Can organizing knowledge structures beat raw training data volume?" (StructTuning) reinforces the payoff: organizing chunks into a taxonomy and teaching position-within-structure beats raw volume, because the model learns where a concept *belongs* rather than memorizing flat text, which is the same constraint that prevents bad compositions.
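The "where before how richly" separation can be sketched as two independent stages. This is a minimal sketch under stated assumptions, not Simula's pipeline: the taxonomy, the complexity levels, and the functions `coverage_plan` and `add_local_variation` are all hypothetical, and the second stage stands in for what the note describes as agentic refinement.

```python
import itertools
import random

# Hypothetical two-level taxonomy: branch -> leaves.
TAXONOMY = {
    "cooking": ["baking", "grilling"],
    "finance": ["budgeting", "investing"],
    "fitness": ["running", "yoga"],
}


def coverage_plan(taxonomy, n, rng):
    """Stage 1 (where): cycle over branches so each gets an even share,
    then pick a leaf inside the branch."""
    branches = sorted(taxonomy)
    rng.shuffle(branches)
    cycle = itertools.cycle(branches)
    return [(b, rng.choice(taxonomy[b])) for b, _ in zip(cycle, range(n))]


def add_local_variation(plan, rng, levels=("basic", "intermediate", "advanced")):
    """Stage 2 (how richly): vary complexity without touching the topic."""
    return [
        {"branch": branch, "leaf": leaf, "complexity": rng.choice(levels)}
        for branch, leaf in plan
    ]


rng = random.Random(0)
plan = coverage_plan(TAXONOMY, 6, rng)
specs = add_local_variation(plan, rng)
for spec in specs:
    print(spec)
```

Since stage 2 never changes the branch or leaf chosen in stage 1, adding variety cannot push a sample into a different part of the taxonomy, and coverage stays even no matter how the complexity levels are drawn.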

There's a useful cross-domain echo in recommendation: "Can item identifiers balance uniqueness and semantic meaning?" (TransRec) shows that combining structured facets (ID, title, attributes) only works when the structure constrains generation, keeping outputs grounded rather than free-associating. Across all of these, the strategy that prevents nonsense has the same shape: replace uniform random sampling with structure-aware sampling (graph adjacency, spectral neighborhood, taxonomic position, or constrained facets) so that combinations are drawn from regions the structure already certifies as compatible.
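One common way to let structure constrain generation is prefix-constrained decoding: at each step, only tokens that continue some valid identifier are allowed. The sketch below assumes a tiny hypothetical catalog and a stand-in scoring table in place of model logits; it illustrates the general technique and is not a description of how TransRec implements grounding.

```python
END = "<end>"

# Hypothetical catalog: each identifier is (numeric ID, title words..., attribute).
CATALOG = [
    ("17", "trail", "shoes", "outdoor"),
    ("42", "espresso", "machine", "kitchen"),
]

# Stand-in for model preferences; unconstrained, these would mix facets
# from different items (ID 42 with "trail" and "outdoor").
PREFERENCE = {"42": 1.0, "trail": 0.9, "outdoor": 0.8, "espresso": 0.1}


def build_trie(sequences):
    """Nested dicts keyed by token; END marks a complete identifier."""
    root = {}
    for seq in sequences:
        node = root
        for token in seq:
            node = node.setdefault(token, {})
        node[END] = {}
    return root


def allowed_next(trie, prefix):
    """Tokens that keep the prefix on a path to a valid identifier."""
    node = trie
    for token in prefix:
        if token not in node:
            return set()
        node = node[token]
    return set(node)


def constrained_generate(trie, score):
    """Greedy decoding restricted to tokens the trie allows."""
    prefix = []
    while True:
        options = sorted(allowed_next(trie, prefix))
        best = max(options, key=lambda token: score(prefix, token))
        if best == END:
            return prefix
        prefix.append(best)


trie = build_trie(CATALOG)
result = constrained_generate(trie, lambda prefix, token: PREFERENCE.get(token, 0.0))
print(result)
```

Even though the scoring table prefers "trail" and "outdoor", once the decoder commits to ID 42 the trie only permits that item's own title and attribute, so the output is always a real catalog entry.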

What you might not have expected: the failure isn't really about the generator being weak, it's about the *sampler* ignoring information the taxonomy already encodes. The interesting frontier here is treating the taxonomy's own geometry — its branch distances and adjacency graph — as the sampling prior, rather than bolting on a filter after generation.
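Treating branch distance as a sampling prior, rather than filtering after generation, might look like the following. The taxonomy, the exponential weighting, and the `temperature` parameter are assumptions chosen for illustration; none of the source notes prescribes this particular form.

```python
import math
import random

# Hypothetical taxonomy stored as child -> parent.
PARENT = {
    "animal": "root", "tool": "root",
    "cat": "animal", "dog": "animal",
    "hammer": "tool", "saw": "tool",
}


def ancestors(node):
    """The node itself followed by its ancestors up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def tree_distance(a, b):
    """Number of edges on the path between a and b through their lowest common ancestor."""
    path_a, path_b = ancestors(a), ancestors(b)
    common = set(path_a) & set(path_b)
    return min(path_a.index(c) + path_b.index(c) for c in common)


def partner_weights(anchor, candidates, temperature=1.0):
    """Prior over partners: weight decays exponentially with branch distance."""
    return [math.exp(-tree_distance(anchor, c) / temperature) for c in candidates]


def sample_partner(anchor, candidates, rng, temperature=1.0):
    weights = partner_weights(anchor, candidates, temperature)
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
candidates = ["dog", "hammer", "saw"]
draws = [sample_partner("cat", candidates, rng, temperature=0.1) for _ in range(200)]
print({c: draws.count(c) for c in candidates})
```

A low temperature keeps pairings almost entirely within the same fine branch, while raising it gradually admits cross-branch pairs, so the same knob trades coherence against novelty without any post-generation filter.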

Sources 5 notes

Why does random tool sampling produce unrealistic synthetic training data?

Random tool sampling fails because unrelated tools cannot credibly compose, and Q&A framing ignores multi-turn dialogue coherence. ToolFlow shows that sampling tools from relevance graphs and generating with dialogue plans closes this gap.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Can we generate synthetic data without any seed examples?

Simula separates global coverage from local diversity, using taxonomy construction for coverage and agentic refinement for complexity. This architecture makes all three desiderata—quality, diversity, complexity—controllable simultaneously without requiring seed data.

Can organizing knowledge structures beat raw training data volume?

StructTuning achieves 50% of full-corpus performance using only 0.3% of training data by organizing chunks into auto-generated domain taxonomies. The model learns knowledge position within conceptual structures rather than raw text patterns, matching how students learn from textbooks.

Can item identifiers balance uniqueness and semantic meaning?

TransRec shows that combining numeric IDs, titles, and attributes into structured identifiers solves three problems simultaneously: distinctiveness from IDs, semantics from text, and generation grounding from structural constraints. Neither pure IDs nor pure text alone achieves all three.
