Can LLMs design reward functions for reinforcement learning?
Can language models help automate the notoriously difficult task of designing reward shaping functions for sparse-reward RL, and if so, how might we structure that collaboration to work around LLMs' weaknesses in stochastic control?
Sparse-reward RL with stochastic transitions is sample-inefficient: the agent rarely sees a nonzero reward, so it gets little signal about which actions help. The standard remedy, reward shaping with intrinsic rewards, shifts the burden to the human designer. Producing useful shaping functions requires either task-specific domain knowledge or expert demonstrations for each new task, and neither scales.
MEDIC (arXiv:2405.15194) replaces the human designer with an LLM, but with an architectural twist that avoids the well-known failure of directly prompting LLMs for control policies. Direct prompting is unreliable because LLMs struggle with the stochasticity, partial observability, and reward sparsity that make RL hard in the first place. MEDIC's move is to strip away those difficulties before asking the LLM to plan.
The mechanism has three steps. First, construct a deterministic abstraction of the original RL problem: the same goal, but simplified to remove stochastic transitions and complex state. Second, prompt an LLM to solve this abstracted problem, producing a plan that may be suboptimal but is valid. This plan serves as the guide policy: it represents what the LLM thinks a good policy looks like in the simplified setting. Third, convert the guide policy into a reward shaping function for the downstream RL agent operating on the original stochastic problem. The shaping rewards encourage the RL agent to follow the LLM's guide policy when it aligns with task progress.
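The third step can be illustrated with a minimal sketch. It assumes the LLM's plan is a sequence of abstract states and uses the standard potential-based form F(s, s') = gamma * phi(s') - phi(s), which is one common way to build shaping rewards and not necessarily MEDIC's exact construction. The names `make_shaping_fn`, `abstract`, and the toy observations are hypothetical.

```python
from typing import Callable, Dict, Hashable, Sequence

State = Hashable


def make_shaping_fn(
    plan: Sequence[State],
    abstract: Callable[[object], State],
    gamma: float = 0.99,
) -> Callable[[object, object], float]:
    """Turn a plan over abstract states into a shaping-reward function.

    Hypothetical helper: the potential of a state is its progress index
    along the plan, and states off the plan have potential 0.
    """
    # If the plan revisits a state, the later (further along) index wins.
    potential: Dict[State, float] = {s: float(i) for i, s in enumerate(plan)}

    def phi(obs: object) -> float:
        # Map the raw observation to its abstract state, then look it up.
        return potential.get(abstract(obs), 0.0)

    def shaping(obs: object, next_obs: object) -> float:
        # Potential-based form: F(s, s') = gamma * phi(s') - phi(s).
        return gamma * phi(next_obs) - phi(obs)

    return shaping


# Toy usage: observations are (x, y, sensor_noise); the abstraction drops the noise.
plan = [(0, 0), (0, 1), (1, 1)]
shape = make_shaping_fn(plan, abstract=lambda obs: obs[:2], gamma=0.9)

forward = shape((0, 0, 0.3), (0, 1, -0.1))   # advances one step along the plan
off_plan = shape((0, 1, 0.0), (5, 5, 0.2))   # leaves the plan
```

In this sketch a step forward along the plan earns a positive bonus and a step off the plan earns a negative one. The RL agent would add the bonus to the environment reward at each transition, so the sparse task reward still defines what counts as success.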
A model-based feedback critic verifies LLM outputs against the abstract model — catching plans that violate problem constraints — before the plan is converted to shaping rewards. This prevents the LLM's plausible-but-wrong outputs from contaminating the RL training signal.
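A critic of this kind can be sketched as a simulator that replays the plan in the abstract model and reports the first violation. This is an illustration under assumptions, not the paper's implementation: the function `critique_plan`, the `step` and `is_goal` callbacks, and the locked-door corridor are all hypothetical, and feeding the message back to the LLM for another attempt is an assumed use of the output.

```python
from typing import Callable, Hashable, Optional, Sequence, Tuple

State = Hashable
Action = Hashable


def critique_plan(
    start: State,
    actions: Sequence[Action],
    step: Callable[[State, Action], Optional[State]],
    is_goal: Callable[[State], bool],
) -> Tuple[bool, str]:
    """Replay a plan in a deterministic abstract model.

    `step` returns the next state, or None if the action is not allowed.
    Returns (is_valid, feedback_message).
    """
    state = start
    for i, action in enumerate(actions):
        nxt = step(state, action)
        if nxt is None:
            return False, (
                f"step {i}: action {action!r} is not applicable in state {state!r}"
            )
        state = nxt
    if not is_goal(state):
        return False, f"plan ends in {state!r}, which is not a goal state"
    return True, "plan is valid"


# Toy abstract model: a corridor with a key at position 1 and a locked door
# between positions 2 and 3. State is (position, has_key).
def corridor_step(state, action):
    pos, has_key = state
    if action == "pickup" and pos == 1:
        return (pos, True)
    if action == "right":
        if pos == 2 and not has_key:
            return None  # the door is locked without the key
        return (pos + 1, has_key)
    return None  # unknown or inapplicable action


def at_exit(state):
    return state[0] == 3


ok_bad, msg_bad = critique_plan(
    (0, False), ["right", "right", "right"], corridor_step, at_exit
)
ok_good, msg_good = critique_plan(
    (0, False), ["right", "pickup", "right", "right"], corridor_step, at_exit
)
```

The first plan walks into the locked door and is rejected with a message naming the failing step. The second picks up the key first and passes, so only it would go on to be converted into shaping rewards.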
The conceptual move is decomposing what was previously a single hard task (design a reward shaping function for stochastic sparse-reward RL) into two easier tasks (design a deterministic abstraction; have the LLM solve it). Each easier task goes to whoever is equipped for it: humans can specify the deterministic abstraction, and LLMs can produce a plan over that abstraction.
The broader implication: LLMs do not need to be good control policies to contribute to RL. They can be good plan generators over simplified versions of the problem, while the rest of the RL machinery handles the actual stochastic dynamics. This is a different design pattern from the later approaches covered in "Can chain-of-thought reasoning be learned during pretraining itself?" (where LLM thinking IS the policy) and "Can agents learn continuously from experience without updating weights?" (where LLM reasoning operates over a case bank). MEDIC sits earlier in the pipeline: the LLM contributes to RL's reward shaping rather than to its policy or value estimation.
Inquiring lines that use this note as a source (14)
This note is a source for these synthesized inquiries.
- Can utility control modify LLM values more effectively than output filtering?
- How does fitness-proportional selection guide LLM recombination in unstructured solution spaces?
- How does modularity in reward and policy design enable goal generalization?
- What makes LLM-guided pruning necessary for MCTS in language rather than game domains?
- How do cascaded probabilistic models compare to reinforcement learning for per-query system design?
- Can structured natural language feedback outperform scalar rewards in RL?
- What mechanism causes LLMs to plateau on numerical optimization tasks?
- Why do LLMs fail at directly solving stochastic control problems?
- How does LLM simulation of APIs avoid instability without sacrificing training signal?
- Can LLM-synthesized behavioral heuristics compete with learned policy improvements?
- What other downstream metrics could serve as RL reward sources?
- Can language models function as implicit process reward models through retrospection?
- Why do LLMs fail at iterative numerical computation in latent space?
- Can compact reward function representations beat text based personalization approaches?
Related concepts in this collection (3)
- Can chain-of-thought reasoning be learned during pretraining itself?
  Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.
  RLP integrates the LLM directly into the RL signal; MEDIC keeps them separate (LLM produces shaping, RL trains policy)
- Can language modeling close the knowing-doing gap in AI?
  Current LLMs reason well but act poorly in interactive tasks, while RL agents act well but cannot explain themselves. Can reformulating decision-making as language modeling with environmental feedback bridge this fundamental split?
  TiG is the opposite design pattern: LLM IS the policy, refined by RL; MEDIC: LLM informs the reward, RL trains a separate policy
- Can reward models learn by comparing policies instead of judging them?
  What if reward models worked as policy discriminators, measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
  POLAR's "similarity to target policy" framing is a generalization: the MEDIC guide policy could serve as the target for POLAR-style discrimination
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Efficient Reinforcement Learning via Large Language Model-based Search
- Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
- A Survey on Post-training of Large Language Models
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models
- Reward Reasoning Model
- Look Before You Leap: Autonomous Exploration for LLM Agents
- Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling
- Reward-Robust RLHF in LLMs
Original note title
LLMs can construct reward shaping functions by solving a simpler deterministic abstraction of the original RL problem