Does pseudo-labeling from LLMs degrade classifier performance?

This note explores whether training a small classifier on labels generated by an LLM (rather than humans) hurts the classifier. The corpus suggests the answer is closer to 'no, and sometimes the opposite,' with caveats about where the LLM's own blind spots leak into the labels.

The most direct evidence in the corpus points away from degradation. In TnT-LLM, an LLM is used end to end: it invents the label taxonomy through open-ended reasoning, then generates the training labels, and those labels are distilled into lightweight classifiers that deploy cheaply at scale (see "Can LLMs efficiently generate taxonomies and label training data?"). The pseudo-labels aren't a degraded substitute; they're the whole pipeline, and the small model is the intended production artifact.

The more surprising result is that the student can beat its teacher. Walmart distilled LLM ranking judgments into BERT cross-encoders and found the students *outperformed* the LLMs that labeled them, once the augmented dataset was large enough (see "Can smaller models outperform their LLM teachers with enough data?"). The mechanism matters for your question: the teacher's soft predictions smooth the label space, and the student sees a broader input distribution than the teacher was ever evaluated on, so it generalizes better. So pseudo-labeling at scale doesn't just avoid degradation; the averaging-out of teacher noise can act like a regularizer.
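To make "soft predictions smooth the label space" concrete, here is a minimal sketch of a soft-label distillation loss. The source does not say which loss Walmart used; the function names and the three-class example are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy of the student's distribution against a soft teacher label."""
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0  # skip zero-weight classes to avoid 0 * log(0)
    )

def harden(teacher_probs):
    """Collapse a soft label to one-hot on the teacher's top class."""
    top = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    return [1.0 if i == top else 0.0 for i in range(len(teacher_probs))]

# Hypothetical teacher output for one example (illustrative numbers only).
teacher = [0.7, 0.2, 0.1]
student_logits = [2.0, 0.5, -1.0]

soft_loss = soft_cross_entropy(student_logits, teacher)
hard_loss = soft_cross_entropy(student_logits, harden(teacher))
```

With the soft target, the loss is minimized when the student reproduces the teacher's full distribution, so the teacher's uncertainty about the runner-up classes reaches the student. With the hardened target, that information is discarded.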

The real risk isn't pseudo-labeling as a technique; it's *where the LLM is systematically wrong*, because those errors get baked into the labels the student learns from. The corpus catalogs exactly the failure regions to worry about. LLMs make predictable linguistic errors that worsen with syntactic depth, such as embedded clauses and complex nominals (see "Why do large language models fail at complex linguistic tasks?"). You can also often *predict* the failure zone in advance: tasks whose correct answer is a low-probability string for an autoregressive model are reliably hard, regardless of logical simplicity (see "Can we predict where language models will fail?"). Label those regions with an LLM and you transfer the bias wholesale.

Two subtler contaminants are worth naming because they don't look like errors. First, LLMs accommodate false premises out of trained agreeableness rather than ignorance, a social, RLHF-learned behavior distinct from hallucination (see "Why do language models agree with false claims they know are wrong?"). Second, they fail badly at ambiguity, where multiple valid interpretations exist: GPT-4 disambiguates correctly only 32% of the time vs. 90% for humans (see "Can language models recognize when text is deliberately ambiguous?"). On ambiguous or adversarial examples the LLM will emit a confident single label, and a classifier trained on those confident-but-wrong labels inherits a blind spot that standard accuracy metrics won't reveal.
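One way to see how an aggregate metric can hide such a blind spot is to break accuracy out by slice. The sketch below assumes a small evaluation set with human-verified gold labels and a hypothetical slice tag per example ("clear" or "ambiguous"); the records are synthetic toy data, not results from any of the cited notes.

```python
from collections import defaultdict

def overall_accuracy(records):
    """Fraction of records where the prediction matches the gold label."""
    return sum(int(pred == gold) for _, pred, gold in records) / len(records)

def accuracy_by_slice(records):
    """Accuracy computed separately for each slice tag."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, pred, gold in records:
        totals[slice_name] += 1
        correct[slice_name] += int(pred == gold)
    return {name: correct[name] / totals[name] for name in totals}

# Synthetic records: (slice tag, student prediction, human gold label).
records = (
    [("clear", "pos", "pos")] * 92        # easy examples, all correct
    + [("ambiguous", "pos", "pos")] * 2   # ambiguous, happened to be right
    + [("ambiguous", "pos", "neg")] * 6   # ambiguous, confidently wrong
)

print(overall_accuracy(records))   # 0.94
print(accuracy_by_slice(records))  # {'clear': 1.0, 'ambiguous': 0.25}
```

In this toy set the headline number looks healthy while the ambiguous slice is wrong three times out of four. Note that this only works if the gold labels come from outside the teacher; scoring against the LLM's own labels would hide the problem entirely.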

The through-line: pseudo-labeling degrades the classifier only as much as the teacher is wrong in ways that don't average out. High-volume labeling of in-distribution, semantically clear data tends to *improve* the student through smoothing and broader exposure. The danger is structured error (syntactic complexity, low-probability targets, ambiguity, agreeableness), which is systematic, not random, so more data doesn't wash it away. There's also a ceiling worth knowing: a model can't reliably correct its own labels past the generation-verification gap without an external check (see "What stops large language models from improving themselves?"). That is the formal reason you still want human spot-checks precisely in the failure zones the corpus already maps out.
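The difference between noise that averages out and noise that doesn't can be illustrated with a toy simulation. Everything here is an assumption made for illustration: the 20 discrete inputs, the "hard" region, the 30% flip rate, and a student that simply takes the majority teacher label per input.

```python
import random
from collections import Counter, defaultdict

INPUTS = range(20)
HARD = set(range(15, 20))  # hypothetical failure zone: 5 of 20 inputs

def true_label(x):
    """Ground truth for the toy task (binary)."""
    return x % 2

def random_noise_teacher(x, rng, flip_rate=0.3):
    """Teacher that flips the true label independently at random."""
    y = true_label(x)
    return 1 - y if rng.random() < flip_rate else y

def systematic_teacher(x, rng):
    """Teacher that is always wrong inside HARD and always right elsewhere.

    rng is unused; it is accepted only to share the other teacher's signature.
    """
    y = true_label(x)
    return 1 - y if x in HARD else y

def train_majority_student(teacher, labels_per_input, seed=0):
    """Student that memorizes the majority teacher label for each input."""
    rng = random.Random(seed)
    votes = defaultdict(Counter)
    for x in INPUTS:
        for _ in range(labels_per_input):
            votes[x][teacher(x, rng)] += 1
    return {x: counts.most_common(1)[0][0] for x, counts in votes.items()}

def student_accuracy(student):
    """Accuracy of the student against the ground truth."""
    return sum(student[x] == true_label(x) for x in INPUTS) / len(INPUTS)

random_student = train_majority_student(random_noise_teacher, labels_per_input=1001)
systematic_student = train_majority_student(systematic_teacher, labels_per_input=1001)
print(student_accuracy(random_student), student_accuracy(systematic_student))
```

By construction the systematic student is stuck at 0.75 no matter how many labels it sees, because every extra label in the hard region repeats the same mistake. The random-noise student recovers as the per-input vote count grows, since independent flips cancel in the majority.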

Sources (7 notes)

Can LLMs efficiently generate taxonomies and label training data?

TnT-LLM automates text mining by using LLMs for open-ended reasoning to create and refine label taxonomies and generate training labels, then distilling these into lightweight classifiers for cost-effective deployment at scale.

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
