INQUIRING LINE

What skills can large models identify and organize about their own abilities?

This explores two linked questions: whether a model can recognize which skills it actually possesses, and whether those skills can be cleanly named and organized — by the model itself or by the researchers probing it.


On the evidence gathered here, the organizing is mostly done from the outside, by researchers, while the self-knowing happens only shallowly from the inside, and the two don't yet meet.

Start with organizing. When researchers decompose model ability into discrete skills, the skills behave very differently from one another. FLASK's 12-skill breakdown shows logical reasoning climbing steeply with scale while stylistic and metacognitive skills saturate early: metacognition tops out around 7B parameters, logical efficiency around 30B, and knowledge keeps improving (see: Do all AI skills improve equally as models scale?). So 'skill' isn't one quantity; a model can look fluent (style copied well) while reasoning stays thin, which is exactly the gap distillation exposes. And some of these skills aren't even created by training: they're already latent in the base model and merely *selected* by post-training, whether through RL, decoding tweaks, or feature steering (see: Do base models already contain hidden reasoning ability?). There's even machinery for activating skills on demand: tuning only the singular values of weight matrices yields composable 'expert vectors' that mix at inference without stepping on each other (see: Can models dynamically activate expert skills at inference time?). So abilities can be organized, composed, and selectively switched on, but largely by us, not by the model reflecting on itself.

Now the self-knowing side, which is where the real surprise lives. Models do carry an internal signal for *what they know* versus what they don't: sparse autoencoders reveal an entity-recognition mechanism that tracks whether the model has facts about something, and this mechanism causally steers whether it answers or refuses (see: Do models know what they don't know?). That's a genuine, mechanistic form of self-knowledge, but it's narrow: it concerns facts, not skills. Step up to broader self-report and it gets unreliable: models can describe behaviors they were never explicitly taught, yet those descriptions are unstable and shift under conversational pressure, which looks like surface awareness rather than real insight (see: How well do language models understand their own knowledge?).

Worse, the self-assessment is biased. Models systematically over-trust answers they generated themselves, because a high-probability output simply *feels* correct during evaluation; the fix is forcing comparison against outside alternatives (see: Why do models trust their own generated answers?). This has a hard theoretical edge: a model can only reliably improve itself where it can verify better than it generates, and that generation–verification gap is what bounds self-improvement (see: What limits how much models can improve themselves?). So a model's ability to honestly audit its own skills is capped by a quantity it can't think its way past.
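The verification bound described above can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions, not the formal definition used in the cited note: it treats self-improvement as "keep only the answers the model's own verifier accepts" and applies Bayes' rule. The function names and the three probabilities are hypothetical.

```python
def filtered_accuracy(gen_acc: float, accept_correct: float, accept_wrong: float) -> float:
    """Accuracy of the answers that survive the model's own verification step.

    gen_acc        -- probability that a generated answer is correct
    accept_correct -- probability that the verifier accepts a correct answer
    accept_wrong   -- probability that the verifier accepts a wrong answer
    (All three are illustrative inputs, not measured quantities.)
    """
    kept_correct = gen_acc * accept_correct
    kept_wrong = (1.0 - gen_acc) * accept_wrong
    total_kept = kept_correct + kept_wrong
    if total_kept == 0.0:
        raise ValueError("the verifier accepts nothing, so accuracy is undefined")
    return kept_correct / total_kept


def self_improvement_gain(gen_acc: float, accept_correct: float, accept_wrong: float) -> float:
    """How much self-filtering raises accuracy over raw generation."""
    return filtered_accuracy(gen_acc, accept_correct, accept_wrong) - gen_acc


if __name__ == "__main__":
    # A verifier that discriminates (accepts correct answers more often than wrong ones).
    print(self_improvement_gain(0.6, accept_correct=0.9, accept_wrong=0.3))
    # A self-agreeing verifier that accepts its own answers regardless of correctness.
    print(self_improvement_gain(0.6, accept_correct=0.9, accept_wrong=0.9))
```

In this toy model the gain is positive only when the verifier accepts correct answers more often than wrong ones. A verifier that approves everything it generated, which is the self-trust bias in its extreme form, leaves accuracy where it started.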

The quietly unsettling part for anyone trying to map abilities at all: the categories may be artifacts of how we measure. Sharp 'emergent skills' often dissolve into smooth curves the moment you switch to a continuous metric (see: Are LLM emergent abilities real or measurement artifacts?), and two models with identical scores can hide completely different internal organization, with perfect linear decodability sitting on top of fractured, fragile structure (see: Can models be smart without organized internal structure?). The frontier where it matters most, autonomous science, needs self-correction above all, and that's precisely the skill where reasoning accuracy has been documented to degrade rather than improve (see: What capabilities do AI systems need for autonomous science?). The takeaway you didn't know you wanted: models have real but narrow self-knowledge about *facts*, almost no trustworthy self-knowledge about *skills*, and the neat skill taxonomies we use to organize them may say more about our metrics than about what's inside.
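The metric-artifact point above can be illustrated with a small calculation. This is a minimal sketch with made-up numbers, assuming token errors are independent: per-token accuracy (a continuous metric) rises steadily across hypothetical model sizes, while exact match on a 10-token answer (an all-or-nothing metric) stays near zero and then appears to jump. The accuracy values and function names are illustrative, not taken from any paper.

```python
def exact_match(per_token_acc: float, answer_len: int) -> float:
    """Probability that every token is right, assuming independent token errors."""
    return per_token_acc ** answer_len


def metric_table(per_token_accs, answer_len):
    """Pair each continuous score with the all-or-nothing score it implies."""
    return [(p, exact_match(p, answer_len)) for p in per_token_accs]


if __name__ == "__main__":
    # Hypothetical per-token accuracies for six successively larger models.
    smooth_scores = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
    for per_token, em in metric_table(smooth_scores, answer_len=10):
        print(f"per-token accuracy {per_token:.2f} -> exact match {em:.3f}")
```

The underlying improvement is the same in both columns; only the scoring rule differs. That is the sense in which an apparent threshold can come from the metric and not from the model.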


Sources (10 notes)

Do all AI skills improve equally as models scale?

FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can models dynamically activate expert skills at inference time?

Transformer2 demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

How well do language models understand their own knowledge?

LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

What limits how much models can improve themselves?

Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.

Are LLM emergent abilities real or measurement artifacts?

Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

What capabilities do AI systems need for autonomous science?

The Virtuous Machines framework identifies hypothesis generation, experimental design, data analysis, and iterative self-correction as essential for autonomous scientific research, none of which standard LLM benchmarks reliably evaluate. Self-correction poses the deepest challenge due to documented degradation in reasoning accuracy.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking self-knowledge and skill identification in LLMs. The question remains open: Can large models genuinely identify, organize, and report on their own abilities—or does self-awareness stay shallow and externally imposed?

What a curated library found—and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints to re-test:
• Skill decomposition works externally (FLASK's 12-skill taxonomy shows metacognition saturating around 7B and logical efficiency plateauing around 30B, while reasoning and knowledge keep improving), but models don't internally organize or select among these skills; post-training and researchers do (~2023–2024).
• Models carry narrow, mechanistic self-knowledge about *facts* via sparse autoencoders (entity-recognition causally steers refusal), but no trustworthy self-knowledge about *skills*; self-report is unstable and biased toward trusting their own outputs (~2024–2025).
• A hard ceiling: self-improvement is bounded by the generation–verification gap; a model cannot reliably audit an ability unless it verifies better than it generates (~2025).
• Sharp 'emergent skills' are often metric artifacts, not fundamental; identical performance can mask radically different internal organization (~2023, still echoes in 2025–2026).

Anchor papers (verify; mind their dates):
• 2304.15004 (2023): Emergent abilities as mirage—metric dependence.
• 2307.10928 (2023): FLASK skill taxonomy—differential scaling.
• 2411.14257 (2024): Entity-recognition as self-knowledge mechanism.
• 2412.02674 (2025): Generation–verification gap bounds self-improvement.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, ask: Have newer models (o1, Gemini 2.0, Claude 3.7+), mechanistic interpretability advances (sparse autoencoders post-2024), or orchestration methods (multi-agent deliberation, external verification loops) *dissolved* the self-knowledge ceiling or the generation–verification bound? Separate the durable question (likely: Can an LLM autonomously *audit* its own reasoning?) from perishable limits (possibly relaxed by external scaffolding, chain-of-thought forcing, or ensemble verification).
(2) **Surface contradicting or superseding work** from the last ~6 months. Specifically: Has any paper shown models *composing* self-knowledge across modalities, or using mechanistic circuits to *select* skills dynamically, or escaping the verification gap via formal verification or symbolic grounding?
(3) **Propose 2 research questions** that assume the regime may have shifted: (a) If sparse autoencoders now reliably surface skill-specific circuits (not just fact circuits), can a model *steer* its own skill mixtures at inference? (b) If verification is the hard limit, do hybrid human–AI loops or external symbolic verifiers now let models exceed their solo audit capability?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
