SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Language, Text, and Discourse

Can LLMs efficiently generate taxonomies and label training data?

Explores whether large language models can automate both taxonomy generation and data labeling to reduce the manual effort and domain expertise traditionally required for text mining tasks.

Synthesis note · 2026-06-03 · sourced from Work Application Use Cases

Text mining couples two interrelated tasks: taxonomy generation (finding and organizing canonical labels for a corpus) and text classification (labeling instances). Both traditionally rely on expensive domain expertise and manual curation, an approach that breaks down when the label space is under-specified and annotations are unavailable. TnT-LLM automates both end-to-end with LLMs in two phases. Phase 1: a zero-shot, multi-stage reasoning approach has the LLM produce and iteratively refine a label taxonomy. Phase 2: LLMs act as data labelers generating pseudo-labels, which train lightweight supervised classifiers that can be deployed and served cheaply at scale.
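The two phases can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not TnT-LLM's implementation: `fake_llm_refine` and `fake_llm_label` are hypothetical keyword-based stand-ins for real LLM calls, `HYPOTHETICAL_CUES` and the four-document corpus are invented for the example, and a small naive Bayes model stands in for whichever lightweight classifier one would actually train.

```python
import math
from collections import Counter, defaultdict

# Invented cue words; a real pipeline would get labels from LLM prompts.
HYPOTHETICAL_CUES = {
    "billing": {"invoice", "refund", "charge"},
    "technical": {"crash", "error", "install"},
}

def fake_llm_refine(taxonomy, batch):
    """Stand-in for one LLM refinement step: update the taxonomy from a batch."""
    updated = dict(taxonomy)
    for text in batch:
        words = set(text.lower().split())
        for label, cues in HYPOTHETICAL_CUES.items():
            if words & cues:
                updated[label] = cues
    return updated

def induce_taxonomy(corpus, batch_size=2):
    """Phase 1: walk the corpus in batches, refining the taxonomy each time."""
    taxonomy = {}
    for start in range(0, len(corpus), batch_size):
        taxonomy = fake_llm_refine(taxonomy, corpus[start:start + batch_size])
    return taxonomy

def fake_llm_label(text, taxonomy):
    """Stand-in for the LLM labeler: pick the label with the most cue overlap."""
    words = set(text.lower().split())
    return max(taxonomy, key=lambda label: len(words & taxonomy[label]))

class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (the cheap model)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())

        def score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            log_prob = math.log(self.label_counts[label] / total)
            for word in text.lower().split():
                log_prob += math.log((counts[word] + 1) / denom)
            return log_prob

        return max(self.label_counts, key=score)

corpus = [
    "refund my invoice please",
    "app crash on install",
    "double charge on invoice",
    "error after install update",
]

# Phase 1: taxonomy induction. Phase 2: pseudo-label, then train the small model.
taxonomy = induce_taxonomy(corpus)
pseudo_labels = [fake_llm_label(text, taxonomy) for text in corpus]
clf = NaiveBayes().fit(corpus, pseudo_labels)

print(clf.predict("invoice charge question"))
```

After training, only `clf.predict` runs at serving time; the stand-in LLM functions are never called again, which is the point of the distillation step.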

The keeper is the division of labor: use the expensive LLM for the parts that need open-ended reasoning (inventing and refining the taxonomy, producing training labels), then distill into a cheap classifier for high-volume serving. The result is LLM-quality structure without LLM-cost inference, which democratizes text mining for under-specified label spaces.

This is methodologically relevant to Adrian's own vault pipeline (taxonomy/topic induction + labeling). It rhymes with "Can smaller models handle RAG filtering while larger models focus on synthesis?" in its tiered use of model capability (big model for structure, small for scale), and with the taxonomy-induction spirit of synthetic-data work like "Can we generate synthetic data without any seed examples?".

Original note title

LLMs can generate a label taxonomy then label data to train lightweight classifiers — automating text mining at scale