INQUIRING LINE

How do neural networks extend contextual bandits beyond linear reward assumptions?

This explores how moving from classic linear contextual bandits (like LinUCB) to neural networks lets the reward model capture nonlinear structure — and what that trade costs you in exploration, which is the part linearity made easy.


The starting point is the linear bandit itself. In LinUCB news recommendation, the reward of showing an article is modeled as a linear function of context features, and that linearity is precisely what makes the math tractable: you get closed-form confidence intervals, provable regret bounds, and a clean way to balance exploring uncertain articles against exploiting proven ones (see "Can bandit algorithms beat collaborative filtering for news?"). When the real reward surface is nonlinear, for example when user taste depends on interactions between features rather than on a sum of their separate effects, the linear model underfits, and you reach for a neural network to learn the representation instead of assuming it.
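
To make the "closed-form confidence interval" concrete, here is a minimal sketch of a disjoint-arm LinUCB scorer, assuming NumPy. The class name `LinUCBArm`, the width multiplier `alpha`, and the regularizer `ridge` are illustrative choices rather than anything specified in the source notes.

```python
import numpy as np


class LinUCBArm:
    """One arm of a disjoint LinUCB-style bandit: ridge regression plus a bonus."""

    def __init__(self, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha                # assumed exploration multiplier
        self.A = ridge * np.eye(dim)      # regularized Gram matrix of seen contexts
        self.b = np.zeros(dim)            # reward-weighted sum of seen contexts

    def ucb(self, x):
        """Predicted reward plus a closed-form confidence width for context x."""
        x = np.asarray(x, dtype=float)
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b            # ridge estimate of the reward weights
        width = self.alpha * np.sqrt(x @ A_inv @ x)
        return float(theta @ x + width)

    def update(self, x, reward):
        """Fold one observed (context, reward) pair into the statistics."""
        x = np.asarray(x, dtype=float)
        self.A += np.outer(x, x)
        self.b += reward * x


def choose_arm(arms, x):
    """Pick the arm with the highest upper confidence bound for context x."""
    return int(np.argmax([arm.ucb(x) for arm in arms]))


if __name__ == "__main__":
    arms = [LinUCBArm(dim=2), LinUCBArm(dim=2)]
    context = [1.0, 0.0]
    chosen = choose_arm(arms, context)
    arms[chosen].update(context, reward=1.0)
    print(chosen, arms[chosen].ucb(context))
```

The width term shrinks in directions where the arm has already seen contexts, which is exactly the quantity a neural reward model no longer provides for free.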

The catch is that the linear assumption was not just a modeling convenience; it also supplied your uncertainty estimate. UCB and Thompson sampling need to know how unsure the model is about a given action, and for a linear model that uncertainty has a closed form. Swap in a deep network and the closed form vanishes: you can still predict rewards, but you can no longer cheaply know what you don't know. This is the gap the epistemic-neural-network work targets. It separates *aleatoric* uncertainty (irreducible noise in the data) from *epistemic* uncertainty (what more data would resolve), and spends compute only on the epistemic part that Thompson sampling actually needs. That focus is what makes neural Thompson sampling viable at recommendation scale, with a reported 9% lift in click-through and 29% fewer interactions than baselines (see "Can neural networks explore efficiently at recommendation scale?"). The lesson is that extending bandits beyond linearity is really two problems: a richer reward model *and* a replacement for the uncertainty estimate you lost.
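
One way to see what "epistemic uncertainty for Thompson sampling" can look like in code is a bootstrapped ensemble. This is a common stand-in and not the method from the cited note: each member below is a fixed random tanh layer with a ridge-fitted linear head, disagreement between members serves as a rough epistemic signal, and drawing one member per round approximates a posterior sample. All names (`EnsembleMember`, `build_ensemble`, `fit_ensemble`, `thompson_choose`, `epistemic_std`) and the hyperparameters are hypothetical.

```python
import numpy as np


class EnsembleMember:
    """Random-feature regressor: fixed random tanh layer plus a ridge-fitted head."""

    def __init__(self, in_dim, hidden, rng, ridge=1.0):
        self.W = rng.normal(size=(in_dim, hidden))   # frozen random hidden weights
        self.c = rng.normal(size=hidden)             # frozen random hidden biases
        self.ridge = ridge
        self.head = np.zeros(hidden)                 # trainable linear head

    def features(self, X):
        X = np.atleast_2d(np.asarray(X, dtype=float))
        return np.tanh(X @ self.W + self.c)

    def fit(self, X, y, rng):
        """Fit the head on a bootstrap resample so members disagree off-data."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        idx = rng.integers(0, len(X), size=len(X))
        H = self.features(X[idx])
        gram = H.T @ H + self.ridge * np.eye(H.shape[1])
        self.head = np.linalg.solve(gram, H.T @ y[idx])

    def predict(self, X):
        return self.features(X) @ self.head


def build_ensemble(n_members, in_dim, hidden=16, seed=0):
    rng = np.random.default_rng(seed)
    members = [EnsembleMember(in_dim, hidden, rng) for _ in range(n_members)]
    return members, rng


def fit_ensemble(members, X, y, rng):
    for member in members:
        member.fit(X, y, rng)


def thompson_choose(members, candidates, rng):
    """Approximate Thompson sampling: act greedily under one random member."""
    member = members[rng.integers(len(members))]
    return int(np.argmax(member.predict(candidates)))


def epistemic_std(members, candidates):
    """Spread of member predictions, used here as a rough epistemic signal."""
    preds = np.stack([m.predict(candidates) for m in members])
    return preds.std(axis=0)
```

Reward noise that every member sees alike does not by itself create disagreement, which is the intuition behind spending compute on the epistemic part only. An ensemble pays for that with several models, and that cost is what cheaper epistemic approximations aim to reduce.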

There's a quieter counter-move worth knowing: maybe you don't need the expensive exploration machinery at all. Greedy bandits that purely exploit can match UCB-style regret when the context distribution has natural 'covariate diversity', meaning incoming users are varied enough to supply the randomization that exploration would otherwise have injected deliberately (see "When can greedy bandits skip exploration entirely?"). That reframes the neural question: in high-traffic recommendation, the diversity of real users may do for a neural model what careful epistemic uncertainty does, letting you skip the hardest engineering. So the field splits between making exploration smart (epistemic nets) and arguing the data makes it unnecessary.
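
The greedy idea can be sketched as a small simulation, assuming NumPy and a linear reward per arm. Contexts are drawn from a standard Gaussian as one illustrative example of a diverse context distribution; the function name `run_greedy` and every parameter are hypothetical, and the sketch makes no claim about matching the regret rates in the cited work.

```python
import numpy as np


def run_greedy(n_rounds=2000, n_arms=3, dim=5, noise=0.1, seed=0):
    """Purely greedy linear bandit; returns the cumulative regret curve."""
    rng = np.random.default_rng(seed)
    true_theta = rng.normal(size=(n_arms, dim))      # unknown per-arm weights
    A = np.stack([np.eye(dim) for _ in range(n_arms)])  # per-arm ridge Gram matrices
    b = np.zeros((n_arms, dim))
    regret = np.zeros(n_rounds)

    for t in range(n_rounds):
        x = rng.normal(size=dim)                     # diverse context, drawn fresh
        theta_hat = np.stack(
            [np.linalg.solve(A[a], b[a]) for a in range(n_arms)]
        )
        arm = int(np.argmax(theta_hat @ x))          # no bonus: exploit only
        expected = true_theta @ x
        reward = expected[arm] + noise * rng.normal()
        A[arm] += np.outer(x, x)
        b[arm] += reward * x
        regret[t] = expected.max() - expected[arm]   # nonnegative by construction

    return np.cumsum(regret)


if __name__ == "__main__":
    curve = run_greedy()
    print(curve[-1] / len(curve))                    # average regret per round
```

There is no confidence bonus anywhere: whatever exploration happens comes from the contexts themselves changing which arm looks best.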

A different way to escape linearity is to keep the reward linear but make the *basis* learned and nonlinear. Reward factorization represents each user's preference as a linear combination of base reward functions, but those base functions are learned from data, and a handful of adaptive questions pins down the per-user coefficients (see "Can user preferences be learned from just ten questions?"). This is the kernel-trick spirit: nonlinearity lives in the features and linearity in the combination, so you keep cheap personalization and uncertainty reduction while escaping the flat linear-in-raw-features assumption. It's a middle path between LinUCB and a fully neural reward.
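
A rough sketch of the factorized form r_u(x) = w_u · φ(x), assuming NumPy. Two simplifications are assumptions of this example and not part of the cited work: `base_rewards` is a hand-written stand-in for base functions that would actually be learned, and the user answers with scalar ratings instead of pairwise preferences. The per-user coefficients get a Gaussian posterior, and the next question is the candidate whose predicted reward is most uncertain.

```python
import numpy as np


def base_rewards(x):
    """Hypothetical stand-in for learned nonlinear base reward functions phi_k(x)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.sin(x[0]), x[0] * x[1], np.tanh(x[1])])


class UserCoefficients:
    """Gaussian posterior over one user's mixing weights w_u."""

    def __init__(self, n_bases, prior_var=1.0, noise_var=0.1):
        self.precision = np.eye(n_bases) / prior_var   # inverse covariance
        self.shift = np.zeros(n_bases)                 # precision-weighted mean
        self.noise_var = noise_var

    @property
    def cov(self):
        return np.linalg.inv(self.precision)

    @property
    def mean(self):
        return self.cov @ self.shift

    def observe(self, phi, rating):
        """Bayesian linear-regression update from one answered question."""
        phi = np.asarray(phi, dtype=float)
        self.precision += np.outer(phi, phi) / self.noise_var
        self.shift += phi * rating / self.noise_var

    def most_informative(self, candidates):
        """Index of the candidate with the largest predictive variance."""
        cov = self.cov
        scores = [base_rewards(x) @ cov @ base_rewards(x) for x in candidates]
        return int(np.argmax(scores))


if __name__ == "__main__":
    user = UserCoefficients(n_bases=3)
    items = [[0.0, 0.0], [1.0, 2.0], [2.0, -1.0]]
    ask = user.most_informative(items)
    user.observe(base_rewards(items[ask]), rating=0.7)
    print(ask, user.mean)
```

Because the combination stays linear, the posterior update is closed-form again, which is why a small number of well-chosen questions can be enough to pin down the coefficients.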

Finally, the frontier is dropping the parametric reward model entirely. Memory-based online RL treats adaptation as memory operations rather than weight updates, assigning credit and improving the policy through stored cases instead of a fitted reward function (see "Can agents learn continuously from experience without updating weights?"). Trajectory-based in-context learning shows that models can absorb sequential decision-making from context alone when given trajectories rather than isolated examples (see "Why do trajectories matter more than individual examples for in-context learning?"). Read together, the corpus traces an arc: linear bandits → neural bandits with engineered uncertainty → learned-basis linear models → non-parametric memory and in-context approaches. 'Beyond linear' turns out to be less a single technique than a ladder, where each rung buys more expressive reward modeling at the price of harder uncertainty estimation, and where, surprisingly, the data's own diversity sometimes lets you skip a rung entirely.


Sources (6 notes)

Can bandit algorithms beat collaborative filtering for news?

LinUCB frames news recommendation as a contextual bandit problem, explicitly balancing exploration of uncertain articles against exploitation of proven ones. The approach handles dynamic content and cold-start users better than traditional CF, with proven regret bounds and lower computational overhead.

Can neural networks explore efficiently at recommendation scale?

ENR separates aleatoric from epistemic uncertainty, focusing computation only on parameter uncertainty needed for Thompson sampling. It improved click-through rates 9% and ratings 6% while requiring 29% fewer interactions than baselines.

When can greedy bandits skip exploration entirely?

Contextual bandits using pure greedy exploitation can match UCB-style regret guarantees when the context distribution satisfies covariate diversity—a condition satisfied by many real continuous and discrete distributions where incoming users themselves provide sufficient randomization.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an analyst re-testing claims about how neural networks extend contextual bandits beyond linear reward assumptions. The question remains open: what replaces the uncertainty machinery when linearity vanishes?

What a curated library found — and when (dated claims, not current truth):
Findings span 2010–2026; treat each as time-stamped, not current ground truth.
• Linear contextual bandits (LinUCB) achieve closed-form confidence intervals and regret bounds, but underfit nonlinear user preferences (~2010).
• Epistemic neural networks separate aleatoric from epistemic uncertainty, enabling neural Thompson sampling that lifts click-through 9% with 29% fewer interactions (~2023).
• Greedy (exploration-free) bandits match UCB regret when context distribution has natural covariate diversity, suggesting real-world data diversity may replace deliberate exploration (~2017).
• Reward factorization keeps linearity in preference combination while learning nonlinear basis functions, balancing expressiveness and personalization (~2025).
• Memory-based and in-context RL drop parametric reward models entirely, using stored cases or trajectory context instead of fitted functions (~2023–2024).

Anchor papers (verify; mind their dates):
• 1003.0146 (2010): LinUCB — the linear baseline.
• 2306.14834 (2023): Epistemic neural contextual bandits.
• 1704.09011 (2017): Exploration-free greedy bandits.
• 2503.06358 (2025): Reward factorization in LLM personalization.

Your task:
(1) RE-TEST EACH CONSTRAINT. For epistemic uncertainty in neural bandits, has recent work (e.g., test-time RL, post-completion learning, or improved uncertainty quantification) made the 29%-interaction gain obsolete or reframed it? For greedy bandits' reliance on covariate diversity, do newer recommenders still invoke exploration, or has data diversity genuinely replaced it in practice? For reward factorization, how does it compare to end-to-end neural learning in recent LLM personalization? Separate the durable question (how do we estimate uncertainty in neural bandits?) from perishable claims (specific sample-efficiency gaps).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~12 months. Focus on whether the 2024–2026 papers (TTRL, Rec-R1, Post-Completion Learning, The Invisible Leash, Useful Memories) undermine or synthesize the prior ladder (linear → epistemic nets → factorized basis → non-parametric).
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., "Do in-context bandit policies (learned from trajectory examples in a single forward pass) now obviate the need for fitted reward models entirely?" and "Does RLVR's coupling to its origin data (per 2507.14843) imply reward learning itself has hidden constraints that exploration machinery cannot fix?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
