SYNTHESIS NOTE

Recommender Systems Training, RL, and Test-Time Scaling Model Architecture and Internals

When can greedy bandits skip exploration entirely?

Under what conditions does natural randomness in incoming contexts eliminate the need for active exploration in contextual bandits? This matters for high-stakes domains like medicine where exploration carries real costs.

Synthesis note · 2026-05-03 · sourced from Recommenders Architectures

Standard bandit theory says you need to explore to learn the optimal arm — but in many real-world settings where exploration is unethical or costly (medicine, marketing), the practitioner refuses. The Bastani–Bayati–Khosravi result is that a pure greedy contextual bandit can match the asymptotic regret guarantees of UCB-style algorithms whenever the context distribution satisfies "covariate diversity": the conditional covariance matrix of contexts on any half-space is positive definite. That condition is satisfied by natural classes of continuous and discrete distributions.

The mechanism is that incoming users are themselves randomized enough to provide natural exploration — every patient who walks in carries a different feature vector, so even pure exploitation accidentally tests different arms across the feature space. Active exploration becomes redundant.

The catch is that you don't know in advance whether your contexts satisfy the condition. The proposed Greedy-First algorithm starts greedy and watches whether the parameter estimates converge at the asymptotically optimal rate via a hypothesis test on observed contexts and rewards. If the test fails, it falls back to a standard explore-exploit algorithm. This makes practical the philosophical move: stop assuming you need exploration, then verify you didn't.

Inquiring lines that use this note as a source 7

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

14 direct connections · 160 in 2-hop network ·dense cluster Open in graph ↗

When can greedy bandits skip exploration entirel… Can bandit algorithms beat collaborative filtering… Can neural networks explore efficiently at recomme… Why do ranking systems need to model selection bia…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Can bandit algorithms beat collaborative filtering for news? News recommendation faces constant content churn and cold-start users—settings where traditional collaborative filtering struggles. Can a contextual bandit approach like LinUCB explicitly balance exploration and exploitation better than static methods?
tension with: LinUCB explicitly explores via UCB bonus; this result shows when natural context diversity replaces the need — the design choice depends on context-distribution structure
Can neural networks explore efficiently at recommendation scale? Exploration—discovering unknown user preferences—normally requires expensive posterior uncertainty estimates. Can a neural architecture make Thompson sampling practical for real-world recommenders without prohibitive computational cost?
complements: ENN attacks the same exploration challenge by changing how exploration is computed; greedy-first attacks it by avoiding exploration when redundant
Why do ranking systems need to model selection bias explicitly? Explores how training data from current rankers creates feedback loops that reinforce past decisions. Understanding this mechanism helps explain why naive approaches fail in production ranking systems.
tension with: greedy bandits assume incoming users are randomized enough; selection-bias work argues real platforms have feedback loops that violate that assumption

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Mostly Exploration-Free Algorithms for Contextual Bandits0.91 match · arxiv ↗
A Contextual-Bandit Approach to Personalized News Article Recommendation0.81 match · arxiv ↗
HyperBandit: Contextual Bandit with Hypernetwork for Time-Varying User Preferences in Streaming Recommendation0.80 match · arxiv ↗
Collaborative Filtering Bandits0.79 match · arxiv ↗
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR0.78 match · arxiv ↗
Training a Generally Curious Agent0.78 match · arxiv ↗
Scalable Neural Contextual Bandit for Recommender Systems0.78 match · arxiv ↗
Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models0.77 match · arxiv ↗

Original note title

exploration-free greedy bandits achieve optimal performance when natural context diversity provides exploration for free