Why does AI-improved task performance fail to transfer to independent work?
This explores why workers who perform better *with* an AI assistant don't carry that improvement over to work they do *without* it — the gap between assisted performance and durable, independent skill.
The corpus points to a consistent answer: AI tends to lift the *output* of a task without depositing anything in the person doing it. The clearest version is that AI productivity gains show up when workers apply skills they already have and evaporate once the task involves *learning* something new: when people lean on AI to acquire a skill, the productivity gain disappears and the learning suffers (When does AI actually boost worker productivity?). So the improvement was never transferable to begin with; it lived in the tool, not the worker.
A few underlying mechanisms make the gap concrete. One is attention: AI suggestions, even correct ones, can sever the immersion needed to reason, forcing the user to rebuild focus rather than build fluency (Does AI assistance always help reasoning or does it carry hidden costs?). Another is where the time goes: AI doesn't reduce total task time so much as shift it away from active task work toward prompting and evaluating outputs, which quietly changes what you practice and therefore what you learn (Does AI really save time, or just change how we spend it?). The starkest evidence is neurological: a four-month EEG study found brain connectivity systematically scaling *down* with AI reliance, with LLM users showing the weakest neural engagement and an impaired ability to recall their own recent work (Does AI assistance weaken our brain's ability to think independently?). That's the literal substrate of non-transfer: the independent-work machinery isn't being exercised.
There's also a perceptual trap that makes the gap hard to notice. The "LLM Fallacy" is a misattribution error: people credit the AI's output to their own growing ability, independent of whether the output was even accurate (How does AI-assisted work reshape how people see their own abilities?). You feel more capable while the capability stays in the tool, so you don't discover the shortfall until the assistant is gone.
What's quietly fascinating is that the *same* failure mode appears one level down, inside the models themselves, which suggests it's a property of imitation-based improvement, not just human laziness. Instruction tuning largely teaches a model the *output format distribution*, not task understanding: models trained on semantically empty or even wrong instructions score about the same as those trained on correct ones (Does instruction tuning teach task understanding or output format?). And imitation models that copy ChatGPT's confident style fool human evaluators while closing little to none of the real capability gap; the ceiling stays fixed at the base model's actual competence (Can imitating ChatGPT fool evaluators into thinking models improved?). Surface performance improves; the underlying ability doesn't move. That's the machine mirror of the worker who looks better with AI and isn't.
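To make the "output format, not task understanding" point concrete, here is a minimal sketch of how far a guesser gets when it knows only the set of allowed answers. The tasks and label sets are hypothetical and chosen for illustration; they are not the benchmark or the baseline implementation from the cited work.

```python
from fractions import Fraction

def expected_exact_match(label_spaces):
    """Expected exact-match rate of a guesser that picks uniformly from each
    task's allowed labels and never reads the instruction."""
    total = sum(Fraction(1, len(labels)) for labels in label_spaces)
    return total / len(label_spaces)

# Hypothetical classification tasks, described only by their output labels.
label_spaces = [
    ["yes", "no"],
    ["positive", "negative"],
    ["entailment", "neutral", "contradiction"],
    ["A", "B", "C", "D"],
]

score = expected_exact_match(label_spaces)
print(score, float(score))
```

Knowing the output space alone already earns a nontrivial score on tasks with small label sets, which is why a benchmark number can rise without any gain in understanding what the task asks.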
The corpus also hints at what *would* transfer, by contrast. Gains stick when the improvement is extracted and internalized rather than borrowed: agents that mine reusable sub-task routines from past work compound real, growing advantages (Can agents learn reusable sub-task routines from past experience?), and models that internalize self-evaluation into their own weights carry the skill forward at zero added inference cost (Can models learn to evaluate their own work during training?). The throughline: assistance that produces an answer leaves you where you were, while assistance that produces an internalized *routine* is the only kind that travels home with you.
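The idea of mining a reusable routine, as opposed to keeping a one-off answer, can be sketched in a few lines. This is a toy illustration under stated assumptions, not Agent Workflow Memory's actual implementation: the trace format, the `abstract_step` and `induce_routines` helpers, and the choice of two-step routines are all hypothetical.

```python
from collections import Counter

def abstract_step(step, values):
    """Replace example-specific values in one step with named placeholders."""
    for name, value in values.items():
        step = step.replace(value, "{" + name + "}")
    return step

def induce_routines(traces, min_count=2):
    """Return abstracted pairs of consecutive steps that recur across traces."""
    counts = Counter()
    for steps, values in traces:
        abstracted = [abstract_step(s, values) for s in steps]
        # Count each pair once per trace so repeats inside one trace don't inflate it.
        counts.update(set(zip(abstracted, abstracted[1:])))
    return [pair for pair, n in counts.items() if n >= min_count]

# Hypothetical past traces: (steps taken, values specific to that example).
traces = [
    (["type 'tokyo' into search", "click search", "open first result"],
     {"query": "'tokyo'"}),
    (["type 'oslo' into search", "click search", "add to cart"],
     {"query": "'oslo'"}),
]

routines = induce_routines(traces)
print(routines)
```

Neither trace repeats the other verbatim, yet once the query is abstracted away the shared "type the query, then click search" routine surfaces. That extracted routine, rather than either finished task, is the part that carries over to new work.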
Sources (9 notes)
Studies showing AI productivity gains measured tasks within workers' existing domains. When workers used AI to learn new skills, productivity gains disappeared and learning suffered, suggesting prior findings do not generalize to skill acquisition.
Well-intentioned AI suggestions can damage reasoning performance by severing cognitive immersion, forcing users to rebuild focus before continuing. Evaluation must measure flow preservation across entire tasks, not just local suggestion accuracy.
Research shows AI doesn't reduce total task time; it reallocates it away from active work toward composing prompts and understanding outputs. This shift changes the cognitive demands and learning outcomes, making time-on-task a poor productivity metric.
A four-month EEG study of 54 participants found that brain connectivity systematically scaled down with AI reliance: LLM users showed the weakest neural engagement, the poorest memory retention, and an impaired ability to recall their own recent work.
Research shows the LLM Fallacy operates through misattribution of AI outputs to personal capability, independent of output accuracy or reliance behavior. It requires interventions that clarify human-machine contribution boundaries, not just better system accuracy or forced verification.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, and instruction tuning scores 43% against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Agent Workflow Memory induces sub-task routines at a finer granularity than full tasks, abstracts away example-specific values, and compounds the routines hierarchically. This produces a 24.6% relative gain on Mind2Web and a 51.1% relative gain on WebArena, with larger gains as train-test gaps widen.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.