What other internal model decisions beyond attention could be optimized directly?

This explores the broader shift behind RL-on-attention: which internal model decisions, beyond where the model looks, the corpus now treats as things you can optimize directly instead of nudging indirectly through token-level training.

This reads the question as follows: attention got promoted from a side effect to a direct optimization target (the idea that you can train the *information allocation* itself rather than the output tokens; see Can optimizing attention patterns improve multimodal RL better than optimizing tokens?), so what else inside the model is up for the same treatment? The corpus suggests the answer is "a surprising amount," and it falls into three distinct layers.

The most concrete layer is the architecture itself. Rather than treating hidden size, the MLP-to-attention ratio, and grouped-query attention configuration as fixed design choices, you can fold them into scaling laws and optimize them directly for inference, yielding up to 42% more throughput *and* up to 2.1% higher accuracy than LLaMA-3.2 at the same training budget (Can architecture choices improve inference efficiency without sacrificing accuracy?). A neighboring decision is memory: instead of letting attention implicitly decide what to keep, the Titans line makes "what is worth storing" an explicit, learned signal, prioritizing surprising tokens in a separate long-term module (Can neural memory modules scale language models beyond attention limits?).
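To make "prioritizing surprising tokens" concrete, here is a minimal sketch under loud assumptions: the memory is a plain linear map, surprise is its squared recall error, and a write is one gradient step on that error. The function names, the learning rate, and the linear form are all illustrative choices, a deliberately simplified stand-in and not the Titans implementation.

```python
# Toy surprise-gated memory. Everything here (names, learning rate, the linear
# form of the memory) is an illustrative assumption, not the Titans code.

def recall(memory, key):
    """Read from a linear associative memory: value is approximately memory @ key."""
    return [sum(w * k for w, k in zip(row, key)) for row in memory]

def surprise(memory, key, value):
    """Squared error between what the memory recalls and what actually arrived."""
    return sum((p - v) ** 2 for p, v in zip(recall(memory, key), value))

def write(memory, key, value, lr=0.5):
    """One gradient step on 0.5 * squared error; the step scales with the error."""
    err = [p - v for p, v in zip(recall(memory, key), value)]
    return [[w - lr * e * k for w, k in zip(row, key)] for row, e in zip(memory, err)]

memory = [[0.0, 0.0], [0.0, 0.0]]
key, value = [1.0, 0.0], [1.0, 1.0]
first = surprise(memory, key, value)   # novel token: large error, large write
memory = write(memory, key, value)
second = surprise(memory, key, value)  # same token again: smaller error, smaller write
```

Because the update is proportional to the recall error, a token the memory already predicts well barely changes it, while an unexpected one changes it a lot. That is the sense in which storage priority becomes a learned signal and not a side effect of attention.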

A second, more interesting layer is the model's internal *control* decisions: the small choices it makes mid-generation that usually happen implicitly. Several notes turn these into trainable targets: when to pull external knowledge versus trust its own parameters, with each reasoning step framed as a Markov decision process (When should language models retrieve external knowledge versus use internal knowledge?); whether to think at length or answer immediately, learned through decoupled RL so the model self-calibrates by difficulty without explicit difficulty labels (Can models learn when to think versus respond quickly?); and even *how much* to think, which matters because accuracy peaks and then declines past a token threshold (Does more thinking time always improve reasoning accuracy?). These decisions used to be handled with prompt-engineering hacks and are now being moved inside the optimization loop.
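The retrieve-or-not choice can be pictured as a per-step action selection. The sketch below is a hand-written decision rule standing in for a learned policy; `StepState`, `p_known`, `p_with_retrieval`, and `retrieval_cost` are hypothetical names and numbers chosen for illustration, not anything taken from DeepRAG.

```python
from dataclasses import dataclass

@dataclass
class StepState:
    """One reasoning step. Both fields are hypothetical, for illustration only."""
    subquery: str
    p_known: float  # assumed estimate that parametric knowledge answers correctly

def choose_action(state, p_with_retrieval=0.9, retrieval_cost=0.2):
    """Pick the action with the higher expected reward for this step.

    Retrieval is assumed to raise the chance of a correct answer but to cost
    something (latency, noise), so it only wins when the model is unsure.
    """
    value_parametric = state.p_known
    value_retrieve = p_with_retrieval - retrieval_cost
    return "retrieve" if value_retrieve > value_parametric else "parametric"

easy = StepState("capital of France", p_known=0.95)
hard = StepState("2023 revenue of a small private firm", p_known=0.30)
actions = [choose_action(easy), choose_action(hard)]
```

In a trained system the fixed numbers would be replaced by learned value estimates, but the shape of the decision is the same: retrieval has to earn its cost at each step.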

The third layer is the most abstract and maybe the most striking: latent directions in activation space turn out to be steerable knobs. Verbosity is a single linear direction you can dial down for a 2.73x speedup without retraining (Can we steer reasoning toward brevity without retraining?), and SAE-identified reasoning features can be steered to put the model into a reasoning mode that matches or exceeds chain-of-thought performance, overriding surface-level instructions (Can we trigger reasoning without explicit chain-of-thought prompts?). The implication is that some "decisions" we thought required training or prompting are actually directions you can push on directly.
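One common way to get such a direction is a difference of mean activations over paired examples. The sketch below assumes that construction; the note only says a single vector is extracted from paired examples, so the difference-of-means step, the function names, and the toy 2-dimensional activations are all assumptions made for illustration.

```python
# Sketch of activation steering via a difference-of-means direction.
# Assumption: the steering vector is mean(concise) - mean(verbose); the
# activations below are toy 2-d stand-ins for real hidden states.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(verbose_acts, concise_acts):
    """Direction pointing from the verbose mean toward the concise mean."""
    mv = mean_vector(verbose_acts)
    mc = mean_vector(concise_acts)
    return [c - v for c, v in zip(mc, mv)]

def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the direction; alpha sets the strength."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

verbose = [[1.0, 0.0], [3.0, 0.0]]   # activations on verbose reasoning traces
concise = [[0.0, 1.0], [0.0, 3.0]]   # activations on paired concise traces
direction = steering_vector(verbose, concise)
shifted = steer([2.0, 0.0], direction, alpha=0.5)
```

No weights change here: the intervention is an addition to the hidden state at inference time, which is why this kind of steering is training-free.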

What ties this together is the direction of travel: capabilities that used to be coaxed indirectly (through reward models, prompts, or sheer scale) are being internalized as first-class objects the model optimizes over. You can even make self-evaluation one of them, training a model to compute its own reward in the unused space after its output, at zero inference cost (Can models learn to evaluate their own work during training?). Attention was just the first internal decision to get this treatment; architecture, memory, retrieval gating, thinking budget, latent feature steering, and self-evaluation are following the same path.

Sources 9 notes

Can optimizing attention patterns improve multimodal RL better than optimizing tokens?

Reinforced Attention Learning treats attention patterns as the primary policy target rather than token sequences. Direct optimization of information allocation shows stronger gains on visual reasoning than standard RLHF, because attention is where the actual decision happens.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
