SYNTHESIS NOTE

How does instruction density affect model performance?

As language models must track more simultaneous instructions, does their ability to follow them degrade predictably? IFScale measures this across frontier models to establish practical limits.

Synthesis note · 2026-02-23 · sourced from Flaws

Production LLM systems routinely require adherence to dozens or hundreds of simultaneous instructions: style guidelines, business rules, compliance standards, tool usage protocols. IFScale measures how performance degrades as instruction density increases, scaling up to 500 keyword-inclusion instructions within a business report writing task.
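A minimal sketch of how a keyword-inclusion task of this kind could be built and scored. The prompt wording, the function names `build_prompt` and `adherence`, and the whole-word, case-insensitive matching rule are assumptions made for illustration, not IFScale's actual harness.

```python
import re

def build_prompt(keywords):
    """Assemble a report-writing prompt with one inclusion rule per keyword.

    The instruction wording is hypothetical.
    """
    rules = "\n".join(
        f"{i}. Include the exact word '{kw}'." for i, kw in enumerate(keywords, 1)
    )
    return "Write a professional business report.\n" + rules

def adherence(report, keywords):
    """Fraction of keywords that appear as whole words, ignoring case.

    Assumes single-word alphabetic keywords; hyphenated terms would need
    a different tokenizer.
    """
    words = set(re.findall(r"[a-z]+", report.lower()))
    hits = sum(1 for kw in keywords if kw.lower() in words)
    return hits / len(keywords)

keywords = ["revenue", "accountability", "forecast", "compliance"]
report = "Revenue grew this quarter, and the forecast remains accountable to compliance."
print(adherence(report, keywords))  # 0.75
```

Instruction density is then just `len(keywords)`: the same scoring function can be applied as the list grows from a handful of terms to several hundred.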

Key findings across 20 SOTA models from 7 providers:

Three degradation patterns correlate with model size and reasoning capability:

  1. Linear decay — steady degradation from the start (smaller models)
  2. Exponential decay — accelerating degradation as density increases (mid-range models)
  3. Threshold decay — near-perfect performance maintained until a threshold, then steep decline (reasoning models: gemini-2.5-pro, o3 maintain through ~150 instructions)
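One way to make the threshold pattern concrete is to locate the largest density at which accuracy has not yet fallen below some floor. The sketch below assumes a floor of 0.95 and uses made-up accuracy curves purely for illustration; neither the cutoff nor the numbers come from IFScale.

```python
def decay_threshold(densities, accuracies, floor=0.95):
    """Return the largest density reached before accuracy first drops below `floor`.

    Returns None if accuracy is already below the floor at the lowest density,
    which is what a linear-style curve would typically produce.
    """
    threshold = None
    for n, acc in sorted(zip(densities, accuracies)):
        if acc < floor:
            break
        threshold = n
    return threshold

# Hypothetical curves, for illustration only (not measured results).
densities = [10, 50, 100, 150, 200, 300]
threshold_like = [1.00, 0.99, 0.98, 0.97, 0.85, 0.70]
linear_like = [0.94, 0.90, 0.85, 0.80, 0.75, 0.65]

print(decay_threshold(densities, threshold_like))  # 150
print(decay_threshold(densities, linear_like))     # None
```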

Primacy effects follow a non-obvious pattern: minimal bias at low density, a peak at 150-200 instructions (where models begin to struggle), then convergence of the bias measure toward 1.0 at extreme density (300+). A value near 1.0 means errors are spread evenly across instruction positions, so the convergence indicates a shift from selective instruction satisfaction to uniform failure: an "instruction saturation point" where the model is completely overwhelmed.
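The note does not define how primacy bias is computed. One plausible formulation, assumed here, is the ratio of errors in the last third of the instruction list to errors in the first third: values above 1.0 mean earlier instructions are favored, and 1.0 means failures are spread evenly. The function name and the handling of the zero-error case are illustrative choices.

```python
def primacy_ratio(followed):
    """Ratio of last-third errors to first-third errors.

    `followed` is a list of booleans in prompt order, True where the
    instruction was satisfied. This definition is an assumption.
    """
    third = len(followed) // 3
    if third == 0:
        raise ValueError("need at least three instructions")
    first_errors = followed[:third].count(False)
    last_errors = followed[-third:].count(False)
    if first_errors == 0:
        # No early errors: report no bias if there are no late errors either.
        return 1.0 if last_errors == 0 else float("inf")
    # Both thirds have equal size, so the count ratio equals the rate ratio.
    return last_errors / first_errors

# Hypothetical outcomes: late instructions fail three times as often.
selective = [True] * 9 + [False] + [True] * 10 + [True] * 7 + [False] * 3
# Hypothetical saturation: failures are uniform across positions.
saturated = ([True] * 4 + [False] * 6) * 3

print(primacy_ratio(selective))  # 3.0
print(primacy_ratio(saturated))  # 1.0
```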

Two error types: omission errors (complete failure to include required terms) and modification errors (morphological variants like "accountable" when "accountability" was required). The distinction has practical implications for prompt design — models may recognize the concept but fail at exact specification.
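The two error types can be told apart with a rough check like the one below. It treats any word sharing a long prefix with the required keyword as a morphological variant; the 70% prefix rule, the four-character minimum, and the function name `classify` are assumptions standing in for whatever matching IFScale actually uses.

```python
import re

def classify(report, keyword, min_shared=0.7):
    """Label a keyword as 'satisfied', 'modification', or 'omission'.

    Prefix matching is a crude, hypothetical proxy for morphological analysis.
    """
    words = set(re.findall(r"[a-z]+", report.lower()))
    kw = keyword.lower()
    if kw in words:
        return "satisfied"
    # Require a shared prefix of at least 4 characters or 70% of the keyword.
    needed = max(4, int(len(kw) * min_shared))
    for w in words:
        if w[:needed] == kw[:needed]:
            return "modification"
    return "omission"

print(classify("The team stays accountable.", "accountability"))  # modification
print(classify("Revenue grew this quarter.", "accountability"))   # omission
print(classify("We value accountability.", "accountability"))     # satisfied
```

Separating the two matters for prompt design: a high modification rate suggests the instruction should stress the exact word form, while a high omission rate points to the instruction being dropped altogether.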

Even the best frontier models achieve only 68% accuracy at the maximum density of 500 instructions. Deliberative processing architectures (reasoning models) provide robust tracking up to critical thresholds, extending the useful range significantly but not eliminating the ceiling.

Original note title

instruction following performance degrades predictably with instruction density — reasoning models show threshold decay at 150 instructions