What makes multimodal conditioning effective when features are decomposed to the right granularity?

This note explores why getting multimodal models to attend to the right inputs depends less on adding more processing and more on choosing the correct level (the right unit, channel, or subnetwork) at which to steer the model. The corpus keeps circling one idea: conditioning works when you optimize the thing that actually drives the decision, and fails when you optimize a proxy for it. The sharpest illustration is in vision-language models, where long chain-of-thought rationales help reasoning but *hurt* fine-grained perception, because the real bottleneck isn't how much the model talks; it's where it looks. The decision happens in visual attention allocation, so text-token reinforcement learning trains the wrong policy target (see "Does verbose chain-of-thought actually help multimodal perception tasks?"). Treat attention distributions themselves as the policy target, the granularity where information is actually being allocated, and visual reasoning shows stronger gains than standard RLHF delivers (see "Can optimizing attention patterns improve multimodal RL better than optimizing tokens?").
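
To make "attention as the policy target" concrete, here is a minimal sketch under loud assumptions: a toy bandit in which exactly one image patch carries the answer, and a plain REINFORCE update applied directly to attention logits. The function names, the reward rule, and the hyperparameters are all hypothetical; this is not the method of any cited note, only an illustration of optimizing where the model looks instead of what it says.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_attention_policy(num_patches=8, informative_patch=5,
                           steps=2000, lr=0.5, seed=0):
    """Toy REINFORCE over a categorical attention distribution.

    Assumption: reward is 1 when the sampled patch is the informative
    one and 0 otherwise. Real perception rewards are far noisier.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_patches   # the attention logits ARE the policy
    baseline = 0.0                 # running mean reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        patch = rng.choices(range(num_patches), weights=probs)[0]
        reward = 1.0 if patch == informative_patch else 0.0
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # d log pi(patch) / d logits = one_hot(patch) - probs
        for i in range(num_patches):
            grad = (1.0 if i == patch else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


attention = train_attention_policy()
print([round(p, 3) for p in attention])
```

The update never touches a generated token: the only thing being reinforced is the allocation itself, which is the contrast the paragraph above draws with text-token RL.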

So "the right granularity" turns out to mean the level at which the model has a genuine functional seam to grab. There's reason to believe those seams already exist inside the network. Pruning experiments show that neural nets implement compositional subroutines in isolated modular subnetworks (ablate one and only its corresponding function is affected), and pretraining makes this modular structure far more consistent (see "Do neural networks naturally learn modular compositional structure?"). Conditioning is effective when it lines up with these natural decomposition boundaries rather than cutting across them. The flip side is a warning: a model can contain every linearly decodable feature a task needs while its internal organization is fractured, a flaw that standard metrics never reveal but perturbation and distribution shift expose (see "Can models be smart without organized internal structure?"). Right-granularity conditioning is partly about building on organized structure instead of papering over a broken one.
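
The difference between a modular and a fractured representation can be sketched with two hand-built linear maps that compute the same function. Everything here is an assumption made for illustration: the weights are written by hand rather than learned, and the helper names are hypothetical. The point is only that identical unablated accuracy can hide very different internal organization, which a single-unit ablation reveals.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def forward(w_in, w_out, x, ablate=None):
    """Two-layer linear map; optionally zero one hidden unit."""
    hidden = matvec(w_in, x)
    if ablate is not None:
        hidden[ablate] = 0.0
    return matvec(w_out, hidden)


def per_output_error(w_in, w_out, inputs, ablate=None):
    """Mean absolute error of each output against the target y = x."""
    errs = [0.0, 0.0]
    for x in inputs:
        y = forward(w_in, w_out, x, ablate)
        for i in range(2):
            errs[i] += abs(y[i] - x[i]) / len(inputs)
    return errs


c = s = math.sqrt(0.5)                      # 45-degree rotation
identity = [[1.0, 0.0], [0.0, 1.0]]
rotate = [[c, -s], [s, c]]
unrotate = [[c, s], [-s, c]]

inputs = [[1.0, 2.0], [3.0, -1.0], [-2.0, 0.5]]
modular = (identity, identity)              # one hidden unit per output
entangled = (rotate, unrotate)              # both outputs share both units

for name, (w_in, w_out) in [("modular", modular), ("entangled", entangled)]:
    clean = per_output_error(w_in, w_out, inputs)
    ablated = per_output_error(w_in, w_out, inputs, ablate=0)
    print(name, "clean:", clean, "ablate unit 0:", ablated)
```

Both models are essentially error-free before ablation. Zeroing hidden unit 0 breaks only the first output in the modular model, while in the entangled model it damages both outputs, which is the kind of difference a standard accuracy metric would not show.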

The same principle shows up wherever researchers split a single learning signal into separately addressed channels. Fast-Slow Training routes task-specific lessons into fast, optimized textual prompts while keeping slow weight updates minimal. The payoff is reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, which suggests forgetting is a *misallocation* problem, not an inherent cost (see "Can splitting adaptation into two channels reduce forgetting?"). The Titans memory architecture makes the analogous split across time: quadratic attention handles short-term context, and a compressed neural memory module stores the surprising tokens worth keeping long-term, which is what lets it scale past two million tokens without a quadratic penalty (see "Can neural memory modules scale language models beyond attention limits?"). In both cases effectiveness comes from decomposing one job into channels matched to what each channel is actually good at.
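
The short-term versus long-term split can be sketched as a two-channel buffer. This is a toy under stated assumptions, not the Titans mechanism: the class name is hypothetical, the long-term store is a plain list with no compression, and "surprise" is a stand-in score (negative log of a smoothed running frequency) chosen only because it is easy to compute.

```python
import math
from collections import Counter, deque


class TwoChannelMemory:
    """Exact recent window plus a store for surprising tokens only."""

    def __init__(self, window=4, surprise_threshold=2.0):
        self.recent = deque(maxlen=window)  # short-term: exact, bounded
        self.long_term = []                 # long-term: surprising tokens
        self.counts = Counter()
        self.total = 0
        self.threshold = surprise_threshold

    def surprise(self, token):
        # Stand-in score: -log of an add-one-smoothed frequency estimate.
        # The extra slot in `vocab` stands for any not-yet-seen token.
        vocab = len(self.counts) + 1
        prob = (self.counts[token] + 1) / (self.total + vocab)
        return -math.log(prob)

    def observe(self, token):
        # Score the token before updating the statistics it is judged by.
        if self.surprise(token) > self.threshold:
            self.long_term.append(token)
        self.recent.append(token)
        self.counts[token] += 1
        self.total += 1


memory = TwoChannelMemory()
for token in ["the"] * 20 + ["anomaly"] + ["the"] * 5:
    memory.observe(token)

print("recent window:", list(memory.recent))
print("long-term store:", memory.long_term)
```

The recent window forgets everything older than four tokens, while the rare token survives in the long-term store. Each channel is cheap because it only does the job it is suited to.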

The thread that ties this together, and the part you might not have expected, is that "granularity" is really about *which signal the model can act on cleanly*. Reflexion agents learn from binary success/failure feedback precisely because the signal is unambiguous: the crisp binary keeps the model from rationalizing failure away, and keeping the reflections uncompressed preserves their usability (see "Can agents learn from failure without updating their weights?"). Across all of these, the win condition is the same: find the decomposition where each piece carries a clean, actionable signal (attention rather than tokens, a modular subroutine rather than a monolith, fast context rather than slow weights) and condition there. Get the unit wrong and you optimize hard against the wrong bottleneck.
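
The learn-from-failure loop can be sketched without any model at all. The version below is a deliberately simplified assumption: a real Reflexion agent writes free-form verbal self-diagnoses with a language model, whereas this stand-in records a fixed-format note and merely avoids repeating failed actions. The function name, the candidate list, and the success rule are hypothetical.

```python
def run_episodes(candidates, is_success, max_episodes=10):
    """Retry a task across episodes, guided only by stored reflections.

    No parameters are updated: all learning lives in `reflections`,
    which are kept verbatim instead of being summarized.
    """
    reflections = []
    for episode in range(1, max_episodes + 1):
        ruled_out = {r["action"] for r in reflections}
        action = next((a for a in candidates if a not in ruled_out), None)
        if action is None:                  # nothing left to try
            return None, episode - 1, reflections
        if is_success(action):              # binary environment signal
            return action, episode, reflections
        reflections.append({
            "episode": episode,
            "action": action,
            "note": f"'{action}' failed; do not repeat it",
        })
    return None, max_episodes, reflections


solved, episodes, reflections = run_episodes(
    candidates=["grep", "sed", "awk", "sort"],
    is_success=lambda action: action == "awk",
)
print(solved, episodes)
for r in reflections:
    print(r["episode"], r["note"])
```

Because the feedback is a plain pass or fail, each reflection rules out exactly one action, and because the notes are stored uncompressed, later episodes can still act on every one of them.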

Sources (7 notes)

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.

Can optimizing attention patterns improve multimodal RL better than optimizing tokens?

Reinforced Attention Learning treats attention patterns as the primary policy target rather than token sequences. Direct optimization of information allocation shows stronger gains on visual reasoning than standard RLHF, because attention is where the actual decision happens.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This leaves them vulnerable to perturbation and distribution shift in ways that standard evaluation metrics do not reveal.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can neural memory modules scale language models beyond attention limits?

The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
