Can extracted skills transfer effectively across different domains and model architectures?

This explores whether skills pulled out of one model's experience — workflows, rules, expert vectors — actually carry over to new task domains and different model backbones, or whether they're locked to where they were learned.

The corpus says the answer is a qualified yes, but the qualifier matters a lot. The most direct evidence comes from work where skills are stored as natural language rather than weights. When a frozen model extracts explicit rules from its context into a reusable "skill" library, those skills lift performance without any weight update and, crucially, transfer across model backbones (Can frozen models learn better by extracting context into skills?). Because the skill is just text describing a procedure, nothing ties it to one model's parameters. Similarly, agent workflow memory induces sub-task routines at a finer grain than whole tasks and abstracts away example-specific values. Its gains grow as the gap between training and test widens, which is the signature of something that generalizes rather than memorizes (Can agents learn reusable sub-task routines from past experience?).
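A minimal sketch of why a text-form skill is not tied to any one backbone, assuming a skill is stored as a plain string and a "model" is any function from prompt text to completion text. The names `SkillLibrary`, `Model`, `run`, `backbone_a`, and `backbone_b` are hypothetical and do not come from the cited work; the value-abstraction step is a simple string substitution standing in for whatever the actual methods do.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interface: any backend that maps prompt text to completion text.
Model = Callable[[str], str]


@dataclass
class SkillLibrary:
    skills: List[str] = field(default_factory=list)

    def add(self, procedure: str, example_values: Dict[str, str]) -> str:
        """Store a procedure with example-specific values replaced by named slots."""
        abstracted = procedure
        for slot, value in example_values.items():
            abstracted = abstracted.replace(value, "{" + slot + "}")
        self.skills.append(abstracted)
        return abstracted

    def build_prompt(self, task: str) -> str:
        """Prepend every stored skill to the task; no model weights are touched."""
        lines = ["Reusable skills:"]
        lines += [f"- {skill}" for skill in self.skills]
        lines += ["", f"Task: {task}"]
        return "\n".join(lines)


def run(model: Model, library: SkillLibrary, task: str) -> str:
    """The same library text is handed to whichever backend is supplied."""
    return model(library.build_prompt(task))


def count_skills(prompt: str) -> int:
    """Count the skill bullet lines in a built prompt."""
    return sum(line.startswith("- ") for line in prompt.splitlines())


# Two stand-in "backbones": toy functions, not real models.
def backbone_a(prompt: str) -> str:
    return f"A saw {count_skills(prompt)} skill(s)"


def backbone_b(prompt: str) -> str:
    return f"B saw {count_skills(prompt)} skill(s)"


library = SkillLibrary()
stored = library.add(
    "Type 'red shoes' into the search box, then click the first result.",
    {"query": "red shoes"},
)
print(stored)
print(run(backbone_a, library, "Find a blue jacket"))
print(run(backbone_b, library, "Find a blue jacket"))
```

The library never references a parameter, layer, or tokenizer, so swapping `backbone_a` for `backbone_b` requires no change to the stored skills. Whether the receiving model can actually use the skill is a separate question that this sketch does not address.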

The strongest cross-architecture claim comes from decoupling who learns the skills from who runs them. When a separately trained curator evolves a skill repository while the executor stays frozen, the repository drifts away from generic verbose notes toward actionable execution logic and cross-task meta-strategies, and the trained curator generalizes across different executor backbones and domains (Can a separate trained curator improve skill libraries better than frozen agents?). That separation is the design trick: keep the skill representation portable (text, routines, strategy) and you sidestep most of the architecture-binding problem.

The contrast worth noticing is what happens when skills live in the weights instead. Composable expert vectors work by tuning only the singular values of weight matrices, letting a model mix task-specific experts at inference without interference (Can models dynamically activate expert skills at inference time?). That's elegant composition, but it's composition within one model's own weight space, not transfer to a different architecture. The same tension shows up in how domains are taught: knowledge-graph curricula build deep, compositional domain expertise (Can knowledge graphs teach models deep domain expertise?), yet a survey of adaptation methods finds every technique has a domain-conditional sweet spot, and visible performance gains often come paired with hidden degradation in reasoning faithfulness and capability transfer (How do domain training techniques actually reshape model behavior?). So weight-baked skills can specialize beautifully and still fail to travel.
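The singular-value tuning described above can be written out as follows. This is a sketch in standard SVD notation; the symbols $U$, $\sigma$, $V$, $z_k$, and $\alpha_k$ are notational choices made here for illustration, and the linear mixing rule is one plausible way to combine experts rather than a statement of the exact procedure in the cited work.

```latex
% Decompose one weight matrix of the host model
W = U \,\mathrm{diag}(\sigma)\, V^{\top},
\qquad W \in \mathbb{R}^{m \times n},\quad
\sigma \in \mathbb{R}^{r},\quad r = \min(m, n)

% An expert vector z_k rescales the singular values; U and V stay fixed
W'_k = U \,\mathrm{diag}(\sigma \odot z_k)\, V^{\top},
\qquad z_k \in \mathbb{R}^{r}

% Mixing experts at inference (assumed linear combination)
z' = \sum_{k} \alpha_k z_k,
\qquad
W' = U \,\mathrm{diag}(\sigma \odot z')\, V^{\top}
```

Each entry of $z_k$ is indexed by a singular direction of this particular $W$, so the vector only has meaning relative to the host model's own $U$ and $V$. A model with different weight shapes or different singular directions has no slot for it, which is one way to see why this kind of skill composes within a model but does not port across architectures.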

The sharpest limit on text-form transfer is that prompting and instruction can only reorganize knowledge that is already present. Prompt optimization retrieves existing capability but cannot inject knowledge a model never had (Can prompt optimization teach models knowledge they lack?), and instruction tuning largely teaches the output format rather than the underlying task: semantically empty instructions perform about as well as correct ones (Does instruction tuning teach task understanding or output format?). Read together, these explain the boundary condition for transfer: an extracted skill ports cleanly when it activates capability the receiving model already latently has. Hand a skill to a model lacking the foundational knowledge it presumes, and you hit a hard ceiling no amount of clever transfer can cross.

What you might not have expected to learn: transferability isn't really a property of the skill — it's a property of the *representation* you store it in. Skills written as portable language (rules, routines, meta-strategies) cross both domains and architectures; skills fused into weights compose powerfully but stay home. And even the portable ones only fire when the destination model is already capable enough to use them — which reframes "can skills transfer" into the more useful question of "transfer to whom."

Sources 8 notes

Can frozen models learn better by extracting context into skills?

Extracting natural-language rules from context into reusable skills improves frozen model reasoning without weight updates. On CL-bench, this lifts GPT-4.1 from 11.1% to 16.5%, with skills transferable across model backbones.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can knowledge graphs teach models deep domain expertise?

Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions. In a low-resource setting, a random baseline that knows only the label space reaches 42.6% exact match against 43% for instruction tuning. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **Can extracted skills transfer effectively across different domains and model architectures?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat all as perishable constraints to be re-tested.

• Text-form skills (rules, routines, meta-strategies) extracted into natural language libraries port across model backbones without weight updates, whereas skills fused into weight matrices compose elegantly within one model but do not travel to different architectures (~2025).
• Agent workflow memory induces fine-grained sub-task routines whose gains grow as the gap between training and test conditions widens, a signature of transfer rather than memorization (~2024–2025).
• RL-trained skill curators, decoupled from frozen executors, evolve repositories that generalize across different executor architectures and domains by drifting toward actionable logic rather than verbose notes (~2026).
• Prompt optimization and instruction tuning **cannot inject new knowledge**; they only activate latent capability—so a skill ports cleanly only when the receiving model already possesses the foundational knowledge it presumes (~2023).
• Domain-specialized training achieves deep compositional expertise but exhibits domain-conditional sweet spots; visible gains often pair with hidden degradation in reasoning faithfulness and cross-domain capability (~2023–2025).

Anchor papers (verify; mind their dates):
- arXiv:2409.07429 (Agent Workflow Memory, 2024)
- arXiv:2501.06252 (Transformer²: Self-adaptive LLMs, 2025)
- arXiv:2605.06614 (SkillOS: Learning Skill Curation, 2026)
- arXiv:2305.11383 (Do Models Really Learn to Follow Instructions?, 2023)

Your task:

(1) **RE-TEST EACH CONSTRAINT.** For every claim above, judge whether newer models (o1, o3, Gemini 3), methods (in-context learning at scale, chain-of-thought variants), tooling (newer SDKs, evaluation harnesses), multi-agent orchestration, or evals since ~May 2026 have relaxed or overturned it. Separate the durable insight—*representation matters more than the skill itself*—from perishable limits (e.g., do frozen models now transfer better? do weight-based skills now port across architectures?). Cite what resolved each constraint; flag what still holds.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last 6 months. Look for: (a) evidence that weight-fused skills DO transfer across architectures; (b) proof that text-form skills fail on unfamiliar domains; (c) new skill-extraction methods that bypass the latent-capability bottleneck.

(3) **Propose 2 research questions that assume the regime may have shifted:**
   - If text-form skill transfer is now automatic, what breaks it? (Is there a new ceiling?)
   - Can a skill curator train itself to diagnose *whether* a destination model has the latent knowledge needed—before attempting transfer?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
