INQUIRING LINE

Can we verify fabricated text without redesigning the generation process?

This explores whether we can catch made-up text after the fact — through external detectors, judges, or verifiers — instead of rebuilding the model's generation process to stop fabrication at the source.


The corpus suggests the honest answer is: pure after-the-fact detection is weak, but layered external verification is surprisingly strong, and the most reliable approaches sit somewhere in between.

Start with the bad news for detection. AI text is *measurably* different from human writing across six lexical-diversity dimensions, yet even trained linguists can't reliably spot it, and newer models drift further from human writing while getting harder to flag (Can humans detect AI text if machines can measure it?). Asking the model to check itself fares worse: models systematically over-trust answers they generated, because a high-probability output simply *feels* correct during self-evaluation (Why do models trust their own generated answers?). And handing the job to an LLM judge opens a different hole: judges fall for fake citations and pretty formatting in zero-shot attacks that need no model access at all (Can LLM judges be fooled by fake credentials and formatting?; Can LLM judges be tricked without accessing their internals?). So the naive version of your question, "just bolt on a detector," mostly fails.

The more interesting answer is that verification works when it's *external and grounded*, checking claims against something the generator can't fake. Formal verifiers can be auto-synthesized straight from prose policy documents, producing provably correct Lean and z3 checkers that validate outputs without touching the generation model (Can we automatically generate formal verifiers from policy text?). Bidirectional RAG shows the gating pattern concretely: generated answers only get trusted (and written back into the corpus) after passing entailment checks, source-attribution checks, and novelty detection. That is verification as a downstream filter, not a redesign (Can RAG systems safely learn from their own generated answers?). The thread running through both: don't ask "does this look real," ask "can this be entailed by evidence."
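The three-check gate described above can be sketched in a few lines. This is only an illustration under loud assumptions: `verify_answer`, `GateResult`, `overlap`, and both thresholds are hypothetical names and values, and plain token overlap is a crude stand-in for a real entailment model. It is not how Bidirectional RAG or any cited system implements its checks.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Fraction of a's tokens that also occur in b.

    ASSUMPTION: a toy proxy for entailment; a real gate would call an NLI model.
    """
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta) if ta else 0.0


@dataclass
class GateResult:
    accepted: bool
    failed: List[str]  # names of the checks that did not pass


def verify_answer(
    answer: str,
    cited_ids: Sequence[str],
    sources: Dict[str, str],
    corpus: Sequence[str],
    entail_min: float = 0.8,   # hypothetical threshold
    novelty_max: float = 0.9,  # hypothetical threshold
) -> GateResult:
    failed: List[str] = []

    # Source attribution: at least one citation, and every cited id must exist.
    if not cited_ids or any(cid not in sources for cid in cited_ids):
        failed.append("attribution")

    # Entailment (proxy): the answer must be supported by the cited evidence.
    evidence = " ".join(sources[cid] for cid in cited_ids if cid in sources)
    if overlap(answer, evidence) < entail_min:
        failed.append("entailment")

    # Novelty: a near-duplicate of an existing entry is not worth writing back.
    for doc in corpus:
        if overlap(answer, doc) >= novelty_max and overlap(doc, answer) >= novelty_max:
            failed.append("novelty")
            break

    return GateResult(accepted=not failed, failed=failed)
```

The gate never inspects how the answer was generated. It only sees the answer, its citations, and the evidence, which is what makes it a downstream filter rather than a redesign.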

But there's a catch that pushes back toward your premise. The strongest defense in the corpus is grounded refusal, where a noisy-source RAG system answers *only* what it can ground and declines the rest, and that is itself a change to the generation step, trading coverage for integrity (Can RAG systems refuse to answer without reliable evidence?). This matters because fabrication is partly baked into how generation works: token prediction flows smoothly toward the training distribution rather than stress-testing competing claims, so smooth, confident, unexamined assertions are the *default* output, not an aberration (Does LLM generation explore competing claims while producing text?). Treating model output as a subjective prior to be weighted, never as empirical evidence, is the framing that makes external verification coherent in the first place (Should we treat LLM outputs as real empirical data?).
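To make "a prior to be weighted" concrete, here is a minimal sketch using a Beta-Binomial model in which model-generated observations are discounted by an explicit trust weight. This is a standard power-prior style discount swapped in for illustration. It is an assumption, not the method of the Foundation Priors framework, and the function names and the `trust` parameter are hypothetical.

```python
from typing import Tuple


def weighted_posterior(
    real_successes: int,
    real_trials: int,
    synth_successes: int,
    synth_trials: int,
    trust: float,
    prior_a: float = 1.0,
    prior_b: float = 1.0,
) -> Tuple[float, float]:
    """Return Beta(a, b) parameters after combining real and synthetic counts.

    ASSUMPTION: synthetic (LLM-generated) counts are multiplied by `trust`
    in [0, 1]; trust=0 ignores them, trust=1 treats them as real evidence.
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must lie in [0, 1]")
    a = prior_a + real_successes + trust * synth_successes
    b = prior_b + (real_trials - real_successes) + trust * (synth_trials - synth_successes)
    return a, b


def posterior_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)
```

With 3 successes in 10 real trials and 90 in 100 model-generated ones, `trust=0.0` keeps the estimate at the real-data posterior mean of 1/3, while `trust=1.0` lets the synthetic counts dominate. The weight makes the degree of belief in the model's output an explicit, auditable choice.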

The reason this isn't academic: fabrication is already industrializing. LLMs can auto-generate hundreds of complete finance papers with invented theory and fabricated citations (Can AI generate hundreds of fake academic papers automatically?), and human review won't save us, because writers edit AI text only 23% of the time, so distortions reach audiences essentially unchanged (Do writers actually edit AI-generated text before publishing?). So yes, you can verify without redesigning generation, but only if verification means an external, evidence-anchored checker, not a detector sniffing for "AI-ness." The cheapest detectors are exactly the ones attackers route around.


Sources (11 notes)

Can humans detect AI text if machines can measure it?

LLM-generated text differs significantly on six lexical diversity dimensions, confirmed through statistical analysis across multiple models. Yet human judges, including trained linguists, cannot reliably detect these differences—and newer models diverge further while becoming harder to spot.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

Can we automatically generate formal verifiers from policy text?

interwhen automatically generates code-based verifiers—including provably correct Lean and z3 checkers—from prose policy documents. This inverts the usual neuro-symbolic division: the LLM both translates policy to formal logic and extracts verifier inputs from reasoning traces.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Should we treat LLM outputs as real empirical data?

Foundation Priors framework shows that LLM-generated text reflects the model's learned patterns and user's prompt choices, not ground truth. Such outputs should only influence inference through explicitly parameterized trust weights, not be treated as equivalent to real evidence.

Can AI generate hundreds of fake academic papers automatically?

A demonstration showed LLMs generating 288 complete finance papers from 96 statistically significant signals, each with invented theoretical justifications and fabricated citations, proving academic HARKing can be automated at scale.

Do writers actually edit AI-generated text before publishing?

Writers edited AI-generated paragraphs only 23% of the time, with edits averaging 96% similarity to the original. This means AI's opinionated and distorted voice propagates with minimal human filtering before publication.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a verification-systems researcher. The question remains open: **Can we catch fabricated text after generation without forcing the model to change how it produces output?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. Key constraints:
- Trained linguists fail to spot AI text reliably; newer models evade detection while drifting from human patterns (~2025).
- Models systematically over-trust their own outputs during self-evaluation, making self-detection unreliable (~2024).
- LLM judges fall for zero-shot attacks exploiting format bias and fake citations without needing model access (~2024).
- External, evidence-grounded verification (entailment checks, source attribution, formal verifiers auto-synthesized from policy) succeeds where detection fails (~2025–2026).
- The strongest defense — grounded refusal — modifies generation itself, trading coverage for integrity; token prediction defaults to smooth, confident assertions (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2403.09972 (2024-03): Self-detection fails due to inherent bias toward trusting generated answers.
- arXiv:2402.10669 (2024-02): LLM judges vulnerable to exploitable biases enabling zero-shot attacks.
- arXiv:2508.00086 (2025-07): AI text measurably non-human in lexical diversity but imperceptible to humans.
- arXiv:2602.11202 (2026-02): Test-time verification as a steering mechanism for reasoning models.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For detection (pure after-the-fact flagging), judge whether recent scaling, evals, or new verifier architectures have cracked the hardness or merely displaced it. For external verification (entailment + source checks), probe whether orchestration advances (memory, multi-agent retrieval) have made grounded refusal less coverage-hostile, or whether the trade-off persists. Separate the durable insight — "generation defaults to smooth assertions" — from perishable limitations (maybe newer training changes this).
(2) **Surface contradicting or superseding work from the last ~6 months.** Look for claims that detection *does* scale, or grounded refusal *doesn't* cost coverage, or generation can be architected to default to uncertainty.
(3) **Propose 2 research questions assuming the regime shifted:** (a) If test-time verification can steer reasoning *without* retraining, does that collapse the distinction between "redesigning generation" and "post-hoc verification"? (b) If 23% edit rates reflect user attention, not model quality, does the verification problem shift from *detection* to *human-AI workflow friction*?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
