Do KANs maintain their advantages in deep architectures and large-scale training?

This asks whether Kolmogorov-Arnold Networks keep the edge they show in small demos once you stack them deep and train them at scale — but the collection has no notes on KANs specifically, so what follows is the adjacent question the corpus *can* speak to: when does an architecture's small-scale advantage survive depth and size, and when does it evaporate?

Straight answer first: nothing in this library studies KANs directly, so I can't tell you whether their function-approximation advantage holds up deep and large — that's a gap worth flagging rather than papering over. What the corpus does have is a recurring story about *which* architectural advantages scale and which are artifacts of being small, and that story is the useful thing to take away.

The collection's clearest signal is that depth and scale don't reward architectures uniformly; they reward the ones whose mechanism is *compositional*. The note "Does depth matter more than width for tiny language models?" finds that for sub-billion-parameter models (125M–350M), deep-and-thin designs beat balanced ones by 2.7–4.3% accuracy, because layers compose abstract concepts rather than just adding capacity. The advantage isn't the depth per se; it's that depth lets a good primitive stack. That reframes your KAN question: a learnable-activation network will likely keep its edge at depth only if its spline-based units compose cleanly layer over layer, and lose it if the gains came from fitting a single shallow function well.
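To make "spline-based units composing layer over layer" concrete, here is a minimal NumPy sketch of a KAN-style layer. It is an illustration under stated assumptions, not an implementation from any note in the collection: piecewise-linear hat functions stand in for the B-splines a real KAN would use, and the names `hat_basis`, `KanStyleLayer`, and `stack`, the fixed grid on [-1, 1], and the input clipping are all hypothetical choices made for the example.

```python
import numpy as np

def hat_basis(x, grid):
    """Piecewise-linear 'hat' basis on a uniform grid.

    Stand-in (assumption) for the B-spline basis of a real KAN.
    Returns an array of shape x.shape + (len(grid),).
    """
    h = grid[1] - grid[0]
    return np.clip(1.0 - np.abs(x[..., None] - grid) / h, 0.0, None)

class KanStyleLayer:
    """Every edge (input i -> output o) owns a learnable 1-D function."""

    def __init__(self, n_in, n_out, n_knots=8, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        self.grid = np.linspace(-1.0, 1.0, n_knots)
        # coef[o, i, k] is the value of edge function (i -> o) at knot k.
        self.coef = rng.normal(scale=0.1, size=(n_out, n_in, n_knots))

    def __call__(self, x):
        # Clip to the grid so stacked layers never feed a value outside
        # the region where the basis is defined (an assumption of this sketch).
        x = np.clip(x, self.grid[0], self.grid[-1])
        basis = hat_basis(x, self.grid)            # (batch, n_in, n_knots)
        # Evaluate each edge function, then sum over inputs for each output.
        return np.einsum("bik,oik->bo", basis, self.coef)

def stack(layers, x):
    """Compose layers: each layer's outputs become the next layer's inputs."""
    for layer in layers:
        x = layer(x)
    return x

# Usage: a 4 -> 3 -> 3 -> 1 stack applied to a batch of five inputs.
layers = [KanStyleLayer(4, 3), KanStyleLayer(3, 3), KanStyleLayer(3, 1)]
x = np.random.default_rng(1).uniform(-1.0, 1.0, size=(5, 4))
y = stack(layers, x)
```

The clipping step hints at why composition is the hard part: each layer's learned functions are only defined on a fixed interval, so a deep stack has to keep intermediate values inside that interval for the units to keep doing useful work. Whether real KANs manage this at scale is exactly what the collection cannot answer.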

Several notes push back on the assumption that scale is what generates capability at all. A 7M-parameter recursive network out-generalizes billion-parameter models on hard puzzles (Can tiny recursive networks outperform massive language models?), and a 27M-parameter hierarchical model clears reasoning ceilings that fixed-depth transformers can't (Can recurrent hierarchies achieve reasoning that transformers cannot?). Both make the same point an alternative architecture like KAN implicitly bets on: the *right structure* can beat brute parameter count. The catch is that these wins came from recursion and effective depth — structural mechanisms — not from a novel unit being intrinsically better. An architecture has to earn its scaling story mechanistically.
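The "effective depth without more parameters" idea can be sketched in a few lines. This is a generic weight-tied recursion written for illustration, not the method of either cited note: the residual update rule, the width, and the names `make_block`, `block_step`, `recurse`, and `n_params` are all assumptions of the example.

```python
import numpy as np

def make_block(width, rng):
    """One small two-layer block whose weights are reused at every step."""
    scale = width ** -0.5
    return {
        "w1": rng.normal(scale=scale, size=(width, width)),
        "w2": rng.normal(scale=scale, size=(width, width)),
    }

def block_step(params, z, x):
    """Refine the latent state z, re-injecting the input x at each step."""
    h = np.tanh((z + x) @ params["w1"])
    return z + np.tanh(h @ params["w2"])   # residual update of the latent

def recurse(params, x, n_steps):
    """Apply the same block n_steps times: more compute, same weights."""
    z = np.zeros_like(x)
    for _ in range(n_steps):
        z = block_step(params, z, x)
    return z

def n_params(params):
    return sum(p.size for p in params.values())

rng = np.random.default_rng(0)
params = make_block(width=16, rng=rng)
x = rng.normal(size=(4, 16))

shallow = recurse(params, x, n_steps=1)
deep = recurse(params, x, n_steps=8)   # 8x the effective depth
```

Both calls use the same `params`, so the parameter count is identical while the computation performed differs. That is the sense in which the wins above are structural: the gain comes from how the unit is reused, not from what the unit is.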

There's also encouraging evidence that good structure becomes *more* reliable with scale, not less. The note "Do neural networks naturally learn modular compositional structure?" shows that pretraining sharpens modular structure rather than dissolving it: pretrained networks decompose into clean, isolated subnetworks more consistently and reliably. If a KAN's edge is genuinely about cleaner functional decomposition, this is the pattern you'd hope to see it follow. And "Can neural memory modules scale language models beyond attention limits?" is a case study in an alternative-architecture advantage that *does* survive scale-up: neural memory that holds its benefit out to 2M+ token contexts, where the quadratic cost of standard attention explodes.

The thing you didn't know you wanted to know: the corpus suggests the real test for any alternative architecture isn't "does it work in a small demo" but "is its advantage a *composition* property or a *fitting* property." Composition properties (depth stacking, modular decomposition, recursion, separated memory) tend to strengthen with scale; fitting advantages tend to wash out. To answer the KAN question properly you'd want a note benchmarking learnable-activation networks at depth — and that note isn't here yet.

Sources (5 notes)

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can tiny recursive networks outperform massive language models?

A single 7M-parameter two-layer network recursing on its latent reasoning state achieves 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating DeepSeek R1, o3-mini, and Gemini 2.5 Pro with 0.01% of their parameters. Recursion on latent state, not scale or hierarchy, drives the generalization gain.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC⁰/TC⁰ complexity ceiling that constrains fixed-depth transformers.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
