Why should bandit algorithms condition exploration on time-of-period as well as user state?

This explores why a bandit's exploration policy should treat *when* a choice happens (time-of-day, day-of-week, seasonality) as part of its context — not just *who* the user is.

The honest starting point: the corpus has no paper that argues this claim head-on, but several notes converge on the reasoning from different directions. The foundational move is that contextual bandits already condition exploration on *context*, and user state is simply one slice of that context. LinUCB-style news recommendation (see "Can bandit algorithms beat collaborative filtering for news?") treats each decision as 'given everything I know right now, how uncertain am I about this arm?' Time-of-period is just another coordinate to feed into that 'right now.'
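One way to picture "time as another coordinate" is to append a cyclical encoding of the clock to the user feature vector before it reaches the bandit. This is a minimal sketch, not something taken from the LinUCB paper: the function names `time_features` and `build_context`, and the choice of sine/cosine encoding, are illustrative assumptions.

```python
import math

def time_features(hour: float, weekday: int) -> list[float]:
    """Cyclical encoding so that 23:00 and 01:00 land close together.

    Assumption: hour is in [0, 24) and weekday is 0..6.
    """
    h = 2 * math.pi * (hour % 24) / 24
    d = 2 * math.pi * (weekday % 7) / 7
    return [math.sin(h), math.cos(h), math.sin(d), math.cos(d)]

def build_context(user_features: list[float], hour: float, weekday: int) -> list[float]:
    """Concatenate user state with time-of-period into one context vector."""
    return list(user_features) + time_features(hour, weekday)

# Hypothetical two-dimensional user state, seen at 08:00 on weekday 0.
context = build_context([0.2, 1.0], hour=8, weekday=0)
print(len(context))  # 6: two user features plus four time features
```

A raw hour number would place 23:00 and 01:00 at opposite ends of the scale; the sine/cosine pair keeps neighbouring hours adjacent across midnight, which matters for any model that generalizes linearly over the context.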

The reason time deserves its own coordinate is non-stationarity. News is the cleanest example: the value of an article decays, and the best arm at 8am is not the best arm at midnight. A bandit conditioned only on user state implicitly assumes a user's reward for an option is time-invariant — which is exactly the assumption dynamic content breaks. Once rewards drift on a daily or weekly cycle, what you learned exploiting last evening can actively mislead you this morning, so the exploration budget has to be spent *per regime*, not once globally.
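Spending the exploration budget per regime can be sketched with a deliberately simple stand-in: a plain UCB1 bandit (not LinUCB) that keeps separate counts and reward sums for each time regime. The class name `PerRegimeUCB` and the regime labels are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict

class PerRegimeUCB:
    """UCB1 with independent statistics for each time regime."""

    def __init__(self, n_arms: int, c: float = 2.0):
        self.n_arms = n_arms
        self.c = c  # c = 2.0 gives the classic UCB1 bonus sqrt(2 ln t / n)
        self.counts = defaultdict(lambda: [0] * n_arms)
        self.sums = defaultdict(lambda: [0.0] * n_arms)

    def select(self, regime: str) -> int:
        counts, sums = self.counts[regime], self.sums[regime]
        # Try every arm once *within this regime* before trusting any mean.
        for arm, n in enumerate(counts):
            if n == 0:
                return arm
        total = sum(counts)

        def score(arm: int) -> float:
            mean = sums[arm] / counts[arm]
            return mean + math.sqrt(self.c * math.log(total) / counts[arm])

        return max(range(self.n_arms), key=score)

    def update(self, regime: str, arm: int, reward: float) -> None:
        self.counts[regime][arm] += 1
        self.sums[regime][arm] += reward

bandit = PerRegimeUCB(n_arms=2)
for _ in range(100):                 # evening strongly favours arm 0
    bandit.update("evening", 0, 1.0)
    bandit.update("evening", 1, 0.0)
bandit.update("morning", 0, 1.0)     # morning has seen arm 0 only once
print(bandit.select("evening"))      # 0: exploit what evening has learned
print(bandit.select("morning"))      # 1: arm 1 is still untried in this regime
```

Fully separate statistics are the bluntest version of the idea: nothing learned in the evening transfers to the morning. Feeding time features into a shared model is the softer alternative, trading some isolation between regimes for faster learning.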

This complicates what otherwise looks like a free lunch. There's a striking result that greedy bandits can skip exploration entirely when the incoming context distribution is naturally diverse enough to randomize for you (see "When can greedy bandits skip exploration entirely?"). Time-of-period looks like it should add diversity, but it adds *structured, cyclical* diversity, not random diversity. Cyclical structure creates correlated blind spots: if your traffic is thin at 3am, no amount of daytime randomization covers that regime, so the covariate-diversity escape hatch quietly closes and explicit exploration becomes necessary again precisely in the under-sampled hours.
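A crude way to spot such blind spots is to count logged decisions per hour and flag the hours that fall below some minimum. This is only a coverage proxy, not the formal covariate-diversity condition from the greedy-bandit result; the function name, the threshold, and the traffic numbers below are all made up for illustration.

```python
def undersampled_hours(hourly_counts: dict[int, int], min_samples: int) -> list[int]:
    """Return the hours (0-23) whose logged sample count is below min_samples."""
    return sorted(h for h in range(24) if hourly_counts.get(h, 0) < min_samples)

# Hypothetical traffic log: busy by day, thin overnight.
traffic = {h: 5000 for h in range(8, 23)}
traffic.update({23: 400, 0: 120, 1: 60, 2: 25, 3: 10, 4: 15, 5: 80, 6: 600, 7: 2500})

print(undersampled_hours(traffic, min_samples=100))  # [1, 2, 3, 4, 5]
```

The daytime hours contribute tens of thousands of samples in this toy log, yet none of them reduce the gap between 1am and 5am, which is where explicit exploration would still be needed.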

That connects to where exploration *should* be aimed. Scalable neural bandits work by separating reducible (epistemic) uncertainty from irreducible noise and spending Thompson-sampling compute only where parameter uncertainty actually lives (see "Can neural networks explore efficiently at recommendation scale?"). Epistemic uncertainty is not uniform across the day; it pools in the temporal regimes with sparse data. Conditioning exploration on time-of-period is, in this framing, just honest accounting: it routes exploration toward the hours the model genuinely hasn't learned yet, instead of re-exploring the well-sampled midday it already understands.
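The pooling of epistemic uncertainty in sparse regimes shows up even in the simplest Bayesian bandit. The sketch below uses Beta-Bernoulli Thompson sampling with a uniform Beta(1, 1) prior, kept separately per time regime. It is a toy stand-in, not the ENR method from the cited note, and the click counts are invented.

```python
import random

def beta_variance(successes: int, failures: int) -> float:
    """Posterior variance of a click rate under a uniform Beta(1, 1) prior."""
    a, b = successes + 1, failures + 1
    return (a * b) / ((a + b) ** 2 * (a + b + 1))

def thompson_pick(stats: dict[str, tuple[int, int]], rng: random.Random) -> str:
    """stats maps arm -> (successes, failures) observed in ONE time regime."""
    draws = {arm: rng.betavariate(s + 1, f + 1) for arm, (s, f) in stats.items()}
    return max(draws, key=draws.get)

# Hypothetical logs: thousands of impressions at noon, a handful at 3am.
noon = {"a": (300, 2700), "b": (240, 2760)}
three_am = {"a": (1, 2), "b": (0, 3)}

print(beta_variance(*noon["a"]))      # roughly 3e-05: little left to learn
print(beta_variance(*three_am["a"]))  # close to 0.04: still wide open

rng = random.Random(0)
choice = thompson_pick(three_am, rng)  # wide posteriors make either arm plausible
```

Because Thompson sampling draws from the posterior, the wide 3am posteriors produce exploration automatically, while the narrow noon posteriors almost always pick the same arm. That behaviour only appears when the statistics are kept per regime; pooled counts would make 3am look as well understood as noon.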

The last, more lateral thread is about the *timing of the decision to explore* itself. Work on why LLMs under-explore finds a temporal mismatch inside the model: uncertainty signals arrive in early layers and commit the model to a choice before longer-horizon 'empowerment' signals can weigh in (see "Why do large language models explore less effectively than humans?"). The parallel is suggestive: a bandit blind to time-of-period commits as if the current moment were the whole story, foreclosing the long-horizon, across-the-cycle value that only becomes visible when 'when' is part of what it reasons over. The unifying lesson across all four notes: exploration is only as good as the context you let it see, and time is a context dimension most user-state models silently throw away.

Sources (4 notes)

Can bandit algorithms beat collaborative filtering for news?

LinUCB frames news recommendation as a contextual bandit problem, explicitly balancing exploration of uncertain articles against exploitation of proven ones. The approach handles dynamic content and cold-start users better than traditional CF, with proven regret bounds and lower computational overhead.

When can greedy bandits skip exploration entirely?

Contextual bandits using pure greedy exploitation can match UCB-style regret guarantees when the context distribution satisfies covariate diversity—a condition satisfied by many real continuous and discrete distributions where incoming users themselves provide sufficient randomization.

Can neural networks explore efficiently at recommendation scale?

ENR separates aleatoric from epistemic uncertainty, focusing computation only on parameter uncertainty needed for Thompson sampling. It improved click-through rates 9% and ratings 6% while requiring 29% fewer interactions than baselines.

Why do large language models explore less effectively than humans?

SAE decomposition shows uncertainty values dominate early transformer blocks while empowerment representations emerge only in middle blocks. This temporal mismatch causes models to commit to decisions before long-term exploration signals can influence them. Reasoning-trained o1 overcomes this by extending computation time.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a bandit-algorithm researcher. The question: *Should exploration in contextual bandits condition on time-of-period as well as user state—and if so, how?* This remains open.

What a curated library found—and when (dated claims, not current truth):
Findings span 2010–2026; treat each as perishable:
- Contextual bandits (LinUCB, ~2010) already condition on context; user state is one slice—but time-of-period is typically omitted.
- Non-stationary rewards (e.g., news decay, daily cycles) break the time-invariance assumption baked into state-only conditioning (~2010–2017).
- Greedy bandits can skip exploration when natural context diversity is high enough—but cyclical, structured diversity (e.g., sparse 3am traffic) does NOT close the exploration gap; epistemic uncertainty pools in under-sampled temporal regimes (~2017, 2023).
- Neural contextual bandits separate epistemic from aleatoric uncertainty and route Thompson sampling only where parameter uncertainty lives (~2023); temporal regimes with sparse data are epistemic hotspots.
- Recent LLM work finds models commit to choices too fast (early-layer uncertainty signals lock in before long-horizon empowerment signals arrive), suggesting a parallel temporal mismatch in reasoning bandits (~2025).

Anchor papers (verify; mind their dates):
- arXiv:1003.0146 (2010): A Contextual-Bandit Approach to Personalized News Article Recommendation
- arXiv:1704.09011 (2017): Mostly Exploration-Free Algorithms for Contextual Bandits
- arXiv:2306.14834 (2023): Scalable Neural Contextual Bandit for Recommender Systems
- arXiv:2501.18009 (2025): Large Language Models Think Too Fast To Explore Effectively

Your task:
(1) RE-TEST EACH CONSTRAINT. For state-only conditioning: has any recent bandit work (past 12 months) *explicitly* added time-of-period as a feature and measured regret or sample efficiency vs. baselines? Does the sparse-regime epistemic-uncertainty story still hold under modern neural bandits, or have curriculum/warm-start methods since smoothed temporal gaps? Separate the durable question (temporal non-stationarity is real) from the perishable claim (time-conditioning is necessary to close exploration gaps).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Have any papers shown that *other* dimensions (e.g., item embeddings, user embeddings, or learned latent states) subsume or render time-of-period redundant? Or do recent RL-finetuning methods (arXiv:2505.11711, 2506.13351, 2509.23808) show time is *implicitly* learned anyway?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If time-of-period *is* now redundantly learned by modern neural bandits, does *explicit* conditioning still improve interpretability, sample efficiency, or robustness to distribution shift? (b) In test-time RL and long-horizon reasoning (arXiv:2504.16084, 2509.23808), does the bandit need to re-explore *within* a single time-period as token length grows, and if so, is that a hidden form of temporal conditioning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
