SYNTHESIS NOTE

Can small models reason well by just learning output format?

Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.

Synthesis note · 2026-02-22 · sourced from Reasoning Methods CoT ToT

The Tina paper trains a 1.5B-parameter model with LoRA (low-rank adaptation) applied during RL post-training: the base model weights stay frozen and only the LoRA modules are updated. The resulting model achieves reasoning performance competitive with, and sometimes surpassing, full-parameter RL reasoning models trained on the same base, despite using a tiny fraction of the post-training compute.

The authors' explanation for why LoRA works so well is the Rapid Reasoning Format Adaptation Hypothesis: what RL post-training primarily teaches a small model is not new knowledge about the world, but how to organize its outputs in a reasoning-trace format. LoRA, which learns only a low-rank update to each adapted weight matrix, is sufficient to adapt the output format while the base model's pre-existing knowledge remains intact.
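
To make the "low-rank update on frozen weights" idea concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. It is an illustration under stated assumptions, not the Tina implementation: the function names, the layer sizes, the rank, and the scaling value alpha are all hypothetical and chosen only for the example. It shows that the base matrix W is never modified, that a zero-initialized B leaves the base model's output unchanged at the start of training, and how small the trainable share of a layer can be.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha):
    """Return x @ W + (alpha / r) * (x @ A) @ B.

    W (d_in x d_out) is the frozen base weight; only A (d_in x r) and
    B (r x d_out) would receive gradient updates during post-training.
    """
    r = len(B)                        # rank = number of rows of B
    scale = alpha / r
    base = matmul(x, W)               # frozen path
    update = matmul(matmul(x, A), B)  # low-rank path
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

def trainable_fraction(d_in, d_out, r):
    """LoRA parameters for one matrix as a fraction of the full matrix."""
    return r * (d_in + d_out) / (d_in * d_out)

# Illustrative sizes only (assumptions, not values from the paper).
random.seed(0)
d_in, d_out, r = 8, 6, 2
W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]
B = [[0.0] * d_out for _ in range(r)]   # zero init: the adapter starts as a no-op
x = [[random.gauss(0, 1) for _ in range(d_in)]]

y_base = matmul(x, W)
y_lora = lora_forward(x, W, A, B, alpha=4.0)
print(y_lora == y_base)                     # adapter has no effect before training
print(trainable_fraction(2048, 2048, 16))   # share of one hypothetical 2048x2048 matrix
```

With a hypothetical 2048 x 2048 projection and rank 16, the adapter holds about 1.6% as many parameters as the matrix it adapts. That is the sense in which the update is confined to a small subspace while the stored knowledge in W is left untouched.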

This hypothesis is supported by two independent lines of evidence. First, small LMs can store less factual knowledge than large ones but can still reason effectively — suggesting reasoning and knowledge are separable capabilities. Second, RL post-training on derivational traces selects for outputs that match reasoning-trace style while producing correct answers, but the selection pressure is on format, not on knowledge retrieval.

The practical implication: if you want to add reasoning capability to a deployed model cheaply, LoRA RL post-training may be sufficient. Full-parameter post-training is appropriate when knowledge integration is needed (new domain facts, new task-specific capabilities). Format adaptation can be achieved with a small fraction of that compute.

This is both an optimization for, and a qualification of, the related note "Can simple rewards alone teach complex domain reasoning?": what appears to "emerge" from RL may be mostly format discovery, not new knowledge. The emergence finding is real, but its mechanism may be simpler than it looks. The model already had the knowledge; RL teaches it to express that knowledge in a productive output format.

Note: this is an OPEN hypothesis pending validation on broader task and model ranges.

Original note title

lora-based reasoning format adaptation achieves competitive reasoning by adapting output format rather than integrating knowledge