Can RL directly optimize attention distributions instead of text generation?

This explores whether reinforcement learning can treat where a model 'looks' — its attention distribution — as the thing being optimized, rather than the usual target of which tokens it emits.

The corpus has a direct answer, and it is yes: Reinforced Attention Learning treats attention patterns as the primary policy target, and on multimodal visual reasoning it shows stronger gains than standard token-level RLHF (see: Can optimizing attention patterns improve multimodal RL better than optimizing tokens?). The intuition is clean: attention is where the model actually allocates information and commits to a decision, so optimizing that allocation reaches the bottleneck more directly than nudging the output tokens that come downstream of it.
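To make "attention as the policy" concrete, here is a toy sketch using plain REINFORCE on a bandit. It is not the Reinforced Attention Learning algorithm, whose details are not given here. It assumes a hypothetical setup with four input regions, a fixed reward for attending to each, and a single attention draw per step. The names `train_attention_policy` and `rewards` are illustrative only.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_attention_policy(rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE where the sampled 'action' is which region to attend to.

    rewards[i] is a hypothetical task reward for attending to region i.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # running mean reward, used to reduce variance
    for _ in range(steps):
        attn = softmax(logits)
        region = rng.choices(range(len(attn)), weights=attn)[0]
        advantage = rewards[region] - baseline
        baseline += 0.05 * (rewards[region] - baseline)
        # Gradient of log attn[region] w.r.t. logit j is 1[j == region] - attn[j].
        for j in range(len(logits)):
            indicator = 1.0 if j == region else 0.0
            logits[j] += lr * advantage * (indicator - attn[j])
    return softmax(logits)


# Region 2 carries the highest (assumed) reward, so mass should move there.
attention = train_attention_policy(rewards=[0.1, 0.2, 1.0, 0.0])
```

The point of the sketch is only that the gradient flows into the attention distribution itself; no output tokens are sampled or scored.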

What makes this more than a one-paper curiosity is how it rhymes with a broader finding about *where* RL does its work. When you measure what RL actually changes inside a model, it updates only 5–30% of parameters, and those sparse updates are nearly full-rank and nearly identical across random seeds, which makes them structural rather than arbitrary (see: Does reinforcement learning update only a small fraction of parameters?). That hints that RL is already implicitly concentrating its pressure on a small, decision-critical substrate. Making attention itself the explicit target is, in a sense, naming that substrate and optimizing it on purpose rather than hoping token-level rewards trickle back to it.
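The 5–30% figure is a statement about how many parameters differ between the checkpoint before RL and the one after. A minimal sketch of that measurement follows, assuming parameters flattened into plain lists and a hypothetical tolerance `tol`; the exact criterion used in the underlying study is not stated here.

```python
def update_fraction(before, after, tol=0.0):
    """Fraction of parameters whose value changed by more than tol."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    if not before:
        raise ValueError("checkpoints must not be empty")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Made-up toy checkpoints: only two of the ten entries differ.
base = [0.5, -1.2, 0.0, 3.1, 0.7, -0.4, 2.2, 1.0, -0.9, 0.3]
tuned = [0.5, -1.2, 0.0, 3.4, 0.7, -0.4, 2.2, 1.0, -1.1, 0.3]
fraction = update_fraction(base, tuned)
```

Running the same comparison across several random seeds and checking how much the sets of changed indices overlap would be one way to probe the "nearly identical across seeds" claim.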

The corpus also reframes what 'the policy' even has to be. RL on language models is usually described as a single-turn token-prediction game, but it extends to long-horizon, multi-turn software tasks with delayed rewards (see: Can reinforcement learning scale beyond single-turn language tasks?). The reward signal itself can be swapped out as well: black-box recommendation metrics like NDCG can train an LLM directly (see: Can recommendation metrics train language models directly?), and natural-language critiques can replace scalar rewards when numbers plateau, because they carry information about *why* a generation failed (see: Can natural language feedback overcome numerical reward plateaus?). Once both the reward and the horizon are this flexible, the policy target (tokens vs. attention) looks like just one more design choice rather than a fixed law.
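As an illustration of a rule-based metric serving as the reward, here is a small NDCG@k function. This is a sketch under stated assumptions, not Rec-R1's implementation: it uses linear gain (the relevance grade itself) and builds the ideal ordering only from the items in the returned list. The name `ndcg_reward` is hypothetical.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_reward(ranked_relevances, k=10):
    """NDCG@k in [0, 1], usable directly as a scalar RL reward.

    ranked_relevances[i] is the relevance grade of the item the model
    placed at position i.
    """
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # nothing relevant was retrieved
    return dcg(ranked_relevances, k) / ideal


perfect = ndcg_reward([3, 2, 1])   # best item ranked first
reversed_order = ndcg_reward([1, 2, 3])  # best item ranked last
```

Because the function only needs the ranked relevance grades, it never inspects the model's tokens, which is what lets the same reward sit on top of different retriever architectures.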

There's a cautionary thread worth knowing about too. RL doesn't just optimize; it collapses diversity, converging on a single dominant pretraining format within the first epoch and suppressing the alternatives (see: Does RL training collapse format diversity in pretrained models?). If you point that same collapsing pressure directly at attention distributions, you'd want to ask whether it sharpens the model onto the genuinely informative regions or just narrows where it's willing to look. And from a different angle, the field is also exploring *architectural* control over attention rather than RL control: separating short-term attention from a neural memory module that decides which surprising tokens are worth storing (see: Can neural memory modules scale language models beyond attention limits?). The interesting tension the corpus leaves you with is that attention allocation can be governed two ways, learned through reward or built into the architecture, and these aren't yet talking to each other.
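One simple way to watch for that kind of narrowing is to track the Shannon entropy of the attention distribution during training. This diagnostic is a suggestion, not something the cited notes prescribe, and `attention_entropy` is a hypothetical helper name.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of a non-negative attention vector.

    The weights are normalized first, so raw scores that do not sum to 1
    are accepted. Entropy near 0 means attention sits on a single position.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


spread = attention_entropy([0.25, 0.25, 0.25, 0.25])  # uniform over 4 positions
collapsed = attention_entropy([1.0, 0.0, 0.0, 0.0])   # all mass on one position
```

Falling entropy alone cannot distinguish useful sharpening from collapse; it would need to be read alongside task reward on held-out inputs.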

Sources (7 notes)

Can optimizing attention patterns improve multimodal RL better than optimizing tokens?

Reinforced Attention Learning treats attention patterns as the primary policy target rather than token sequences. Direct optimization of information allocation shows stronger gains on visual reasoning than standard RLHF, because attention is where the actual decision happens.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training nearly doubled SWE-bench Verified performance on Qwen2.5-72B, from 20% to 39%, matching larger models. This demonstrates that RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond the single-turn MDPs usually assumed in theory.

Can recommendation metrics train language models directly?

Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
