Can false positives from input filtering be reduced without sacrificing defense?
This explores whether you can make an input filter flag fewer legitimate inputs as threats (false positives) while still catching the real attacks it's there to stop — the classic precision-vs-recall tension in defensive filtering.
This explores whether you can make an input filter flag fewer legitimate inputs as threats while still catching the real attacks — and the corpus's most consistent answer is: stop treating filtering as a single hard threshold, and split it into a cheap recall pass followed by a smarter verification pass. The clearest version is the two-stage verifier: a fast, generous first stage that catches everything that *might* match, then a small learned model that looks at full token-to-token interaction patterns to throw out the structural near-misses a blunt similarity score would wave through Can verification separate structural near-misses from topical matches?. The same architecture shows up in RAG defense, where partition-aware retrieval bounds how much a poisoned document can influence results and token-masking flags suspicious documents by their abnormal behavior, rather than rejecting inputs wholesale Can we defend RAG systems from corpus poisoning without retraining?. In both cases the false-positive reduction comes from adding a second, more discriminating look — not from loosening the first filter.
A second, less obvious lever: filter on the *cause* of risk rather than its symptom. When a system uses model confidence as its trip-wire, it both misses confident hallucinations and over-flags fine on uncertain-but-correct cases. Switching the trigger to pretraining-data statistics — flagging inputs whose entity combinations were rarely or never seen in training — catches the actual root cause and fires far more precisely Can pretraining data statistics detect hallucinations better than model confidence?. Granularity helps the same way: step-level confidence filtering catches reasoning breakdowns that whole-trace averaging smears over, so you discard the genuinely bad and keep the good instead of throwing out whole traces on a noisy global score Does step-level confidence outperform global averaging for trace filtering?.
There's a cautionary thread too. Filtering assumes the harmful signal is *separable* from the legitimate one — and sometimes it isn't. In heuristic-override tasks, aggressively removing 'spurious' cues actually *hurts* the model, because the real job was composing conflicting signals, not discarding distractors Why does removing spurious cues sometimes hurt model performance?. That's the deep source of false positives: an over-eager filter mistakes load-bearing input for noise. And the threat landscape is genuinely adversarial — semantically irrelevant text appended to a problem can spike error rates 300%, and those triggers transfer across models — so a filter that's tuned too loose to avoid false positives leaves a real opening How vulnerable are reasoning models to irrelevant text?.
The most interesting reframing in the corpus is to question the binary accept/reject decision itself. Speech dialogue systems facing 15–30% recognition error rates abandoned deterministic flowcharts and instead maintain a *belief distribution* over what the user might have meant — so an ambiguous input isn't forced into a wrong commitment, it's held probabilistically until more evidence arrives Why do dialogue systems need probabilistic reasoning?. Grounded-refusal RAG does a softer version: rather than block inputs, it constrains *outputs* to only what the evidence supports, trading some coverage for integrity and pushing the defense downstream of the filter Can RAG systems refuse to answer without reliable evidence?.
The thing you may not have expected: there appears to be a floor you can't filter past. Lipschitz-continuity analysis of reasoning chains proves that more reasoning *dampens* input-perturbation sensitivity but never drives it to zero — a non-zero robustness floor exists structurally Can longer reasoning chains eliminate model sensitivity to input noise?. So the honest answer is: yes, you can cut false positives a lot — through two-stage verification, cause-based triggers, finer granularity, and probabilistic deferral instead of hard rejection — but no filter buys perfect separation, which is exactly why the strongest designs pair a precise filter with a downstream layer (grounded refusal, belief tracking) that absorbs what slips through.
Sources 9 notes
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.
Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.
Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.
Real-world speech recognition achieves 15-30 percent error rates in noisy environments, making deterministic flowchart dialogue systems unworkable. POMDP-based systems handle this by maintaining belief distributions over user intent rather than committing to single interpretations.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.
Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.