What is Reinforcement Learning?
Understand reinforcement learning: agents, environments, rewards, policies, and the connection to RLHF in LLMs, with clear intuition for AI engineering interviews.
The Core Idea
In reinforcement learning (RL), an agent takes actions in an environment and learns to maximize cumulative reward through trial and error.
Supervised: Labels tell the model the right answer
Unsupervised: Model finds structure without any feedback
Reinforcement: Model receives reward/penalty AFTER taking actions
There is no labeled dataset of (input, correct_output). Instead, the agent learns by doing and receiving feedback from the environment.
The RL Framework
┌─────────────────────────────────────────────┐
│                                             │
│   Agent ───── action ─────► Environment     │
│     ▲                            │          │
│     │       state + reward       │          │
│     └────────────────────────────┘          │
│                                             │
└─────────────────────────────────────────────┘

| Component | Definition | Example |
|---|---|---|
| Agent | The decision-making system | AI dosing assistant |
| Environment | What the agent interacts with | Simulated patient physiology |
| State | The current situation | Patient vitals, current dose |
| Action | What the agent can do | Increase / decrease / hold dose |
| Reward | Signal indicating how good the action was | +1 if INR in range, -2 if bleeding |
| Policy | Strategy for choosing actions given state | "If INR above 3, decrease dose" |
| Episode | One complete interaction sequence | 30-day patient treatment period |
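The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not a clinical model: `DosingEnv`, its INR dynamics, its simplified reward, and `threshold_policy` are all hypothetical names and numbers chosen to mirror the dosing example in the table.

```python
import random


class DosingEnv:
    """Toy stand-in for a simulated patient (hypothetical, not a clinical model)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.inr = 1.5
        self.day = 0

    def reset(self):
        """Start a new episode and return the initial state (the INR value)."""
        self.inr = 1.5
        self.day = 0
        return self.inr

    def step(self, action):
        """Apply an action: -1 = decrease dose, 0 = hold, +1 = increase dose."""
        # ASSUMPTION: INR shifts by 0.3 per dose change, plus a little noise.
        self.inr = max(0.5, self.inr + 0.3 * action + self.rng.uniform(-0.1, 0.1))
        self.day += 1
        # ASSUMPTION: simplified reward, +1 in range (2.0 to 3.0), -0.5 otherwise.
        reward = 1.0 if 2.0 <= self.inr <= 3.0 else -0.5
        done = self.day >= 30  # one episode = 30-day treatment period
        return self.inr, reward, done


def threshold_policy(inr):
    """Fixed policy from the table: if INR is above 3, decrease the dose."""
    if inr > 3.0:
        return -1
    if inr < 2.0:
        return 1
    return 0


def run_episode(env, policy):
    """Run one episode: observe state, act, receive reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


print(run_episode(DosingEnv(seed=0), threshold_policy))
```

Here the policy is fixed by hand; a learning agent would instead adjust its policy based on the rewards it collects across many such episodes.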
Key Concepts
Reward Signal
The reward defines what the agent is optimizing for. Designing it correctly is critical and difficult.
Good reward design:
+1.0 → INR in therapeutic range (2.0–3.0)
-0.5 → INR subtherapeutic (below 2.0)
-2.0 → INR supratherapeutic (above 4.0, bleeding risk)
-5.0 → major bleeding event
Bad reward design:
+1.0 → patient survives today
→ Agent learns to do nothing (survivorship bias)
Reward hacking: the agent finds unintended ways to maximize reward that don't match the actual goal. This is why reward design is so important.
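The reward scheme above translates directly into a function. The sketch below is one reading of it: the names `inr_reward` and `survival_reward` are hypothetical, and because the text does not say what an INR between 3.0 and 4.0 should earn, that band is assumed to score 0.

```python
def inr_reward(inr, major_bleed=False):
    """Reward following the 'good' design above (hypothetical helper)."""
    if major_bleed:
        return -5.0
    if inr > 4.0:
        return -2.0  # supratherapeutic, bleeding risk
    if inr < 2.0:
        return -0.5  # subtherapeutic
    if inr <= 3.0:
        return 1.0   # therapeutic range, 2.0 to 3.0
    return 0.0       # ASSUMPTION: 3.0 < INR <= 4.0 is not specified in the text


def survival_reward(alive):
    """The 'bad' design: pays out whenever the patient survives the day."""
    return 1.0 if alive else 0.0


# A do-nothing policy that leaves INR stuck at 1.5 for 30 days:
days = 30
bad_total = sum(survival_reward(True) for _ in range(days))
good_total = sum(inr_reward(1.5) for _ in range(days))
print(bad_total, good_total)  # prints: 30.0 -15.0
```

Under the survival reward, doing nothing looks perfect; under the INR-based reward, the same behavior is penalized every day.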
Exploration vs Exploitation
The agent must balance:
- Exploration: try new actions to discover better strategies
- Exploitation: use the best-known action to collect reward
ε-greedy policy:
With probability ε: take a random action (explore)
With probability 1-ε: take the best-known action (exploit)
Early training: ε = 0.9 (explore a lot)
Later training: ε = 0.05 (mostly exploit, small exploration)
Cumulative Reward (Return)
The agent doesn't just care about immediate reward; it cares about the sum of future rewards, discounted by how far away they are.
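To make the discounting concrete, here is a small sketch that computes the discounted sum for a finite list of rewards. The function name `discounted_return` is hypothetical, and rewards are assumed to be indexed from the current step (discount exponent 0) onward.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th reward is weighted by gamma**k."""
    total = 0.0
    # Work backwards: each step's return is its reward plus gamma times the next return.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A reward of 10 arriving two steps from now is worth 0.5**2 * 10 today.
print(discounted_return([0.0, 0.0, 10.0], 0.5))  # prints: 2.5

# With gamma = 0 the agent sees only the immediate reward.
print(discounted_return([1.0, 5.0, 5.0], 0.0))   # prints: 1.0
```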
Return = r₀ + γ·r₁ + γ²·r₂ + γ³·r₃ + ...
γ (gamma) = discount factor, typically 0.9–0.99
γ close to 1: agent cares a lot about future rewards
γ close to 0: agent is short-sighted
RL Algorithms
| Algorithm | Category | Use Case |
|---|---|---|
| Q-Learning / DQN | Value-based | Discrete actions, Atari games |
| Policy Gradient / REINFORCE | Policy-based | Continuous actions |
| PPO (Proximal Policy Optimization) | Actor-Critic | Default for RLHF, robotics |
| SAC (Soft Actor-Critic) | Actor-Critic | Continuous control, sample-efficient |
| AlphaGo / AlphaZero | Model-based | Game-playing with lookahead |
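As a rough sketch of the value-based row, the snippet below runs tabular Q-learning on a made-up five-state corridor, using an ε-greedy rule with ε decayed from 0.9 to 0.05 and a discount factor γ = 0.9. The environment, the function names, and the hyperparameters (learning rate, episode count, linear decay schedule) are illustrative assumptions, not taken from any library.

```python
import random

N_STATES = 5        # states 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical corridor environment: reward 1.0 for reaching the last state."""
    if action == 1:
        next_state = min(N_STATES - 1, state + 1)
    else:
        next_state = max(0, state - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, eps_start=0.9, eps_end=0.05,
          max_steps=1000, seed=0):
    """Tabular Q-learning with a linearly decayed epsilon-greedy policy."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for ep in range(episodes):
        # Explore a lot early on, mostly exploit later.
        eps = eps_start + (eps_end - eps_start) * ep / max(1, episodes - 1)
        state = 0
        for _ in range(max_steps):
            if rng.random() < eps:
                action = rng.choice(ACTIONS)                       # explore
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])   # exploit
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best value of the next state.
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
greedy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

With these settings the greedy action in every non-terminal state comes out as 1 (move right), and the learned values shrink by roughly a factor of γ for each extra step away from the goal.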
RL in LLMs: RLHF
Reinforcement Learning from Human Feedback (RLHF) is a central technique for aligning modern LLMs such as GPT-4 and Claude to be helpful and harmless. It is arguably the most prominent RL application in AI engineering today.
Step 1 (Pre-training):
LLM trained on internet text via next-token prediction (self-supervised)
Step 2 (Supervised Fine-Tuning, SFT):
Human-written (prompt, ideal_response) pairs → LLM learns the style
Step 3 (Reward Model Training):
Humans rank two responses: "A is better than B"
A reward model learns to predict human preference scores
Step 4 (RL with PPO):
LLM generates responses
Reward model scores them
PPO updates LLM weights to maximize reward score
KL penalty: prevents LLM from drifting too far from SFT base

# Conceptual pseudocode for RLHF.
# llm, sft_baseline, reward_model, ppo_optimizer and compute_kl are
# placeholders, not a real library API.
def rlhf_step(llm, sft_baseline, reward_model, prompts, ppo_optimizer, beta=0.1):
    for prompt in prompts:
        # LLM generates a response
        response = llm.generate(prompt)
        # Reward model scores the response
        reward = reward_model.score(prompt, response)
        # KL penalty: keep the LLM close to the original SFT model
        kl_penalty = compute_kl(llm, sft_baseline, prompt, response)
        # Shaped reward: preference score minus the scaled KL penalty
        shaped_reward = reward - beta * kl_penalty
        # PPO update: increase the probability of high-reward responses
        ppo_optimizer.step(llm, prompt, response, shaped_reward)

RL vs Supervised Learning for LLMs
| Aspect | Supervised Fine-Tuning (SFT) | RLHF |
|---|---|---|
| Labels | Human-written ideal responses | Human preference rankings |
| Feedback signal | Token-level cross-entropy loss | Scalar reward per response |
| What it optimizes | Matching human-written text | Maximizing human preference |
| Cost | Expensive to create ideal responses | Cheaper: just rank pairs |
| Risk | Mode collapse to specific style | Reward hacking |
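The two feedback signals in the table can be written down in a few lines. This sketch assumes a Bradley-Terry style pairwise loss for the reward model, which is a common choice but not the only one, and a simple log-probability difference as the KL estimate. All function names are hypothetical, and the inputs are plain numbers rather than real model outputs.

```python
import math


def token_cross_entropy(token_probs):
    """SFT-style signal: mean negative log-probability of each reference token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def pairwise_preference_loss(score_chosen, score_rejected):
    """Reward-model signal: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))


def kl_shaped_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Scalar reward minus beta times a per-response KL estimate.

    ASSUMPTION: the KL term is approximated by the log-probability gap between
    the current policy and the SFT reference on the sampled response.
    """
    return reward - beta * (logp_policy - logp_reference)


# SFT: the model put probabilities 0.9, 0.5 and 0.8 on the three reference tokens.
print(token_cross_entropy([0.9, 0.5, 0.8]))

# Reward model: the loss is small when the preferred response scores higher.
print(pairwise_preference_loss(2.0, 0.0), pairwise_preference_loss(0.0, 2.0))

# RL step: a response the policy now favors more than the reference is penalized.
print(kl_shaped_reward(1.0, logp_policy=-5.0, logp_reference=-6.0, beta=0.1))
```

The first function scores every token against a written reference, while the other two only need a per-response score, which is why ranking pairs is enough to train with RLHF.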
When RL Applies in AI Systems
- LLM alignment: RLHF, and DPO (Direct Preference Optimization), which learns from the same kind of preference data without an explicit RL loop
- Recommendation systems: services such as Netflix and Spotify optimizing for long-term engagement
- Robotics: learning to walk, pick-and-place tasks
- Game AI: AlphaGo, OpenAI Five (Dota 2)
- Drug discovery: optimizing molecule properties through iterative generation
- AutoML: searching neural architecture space
Interview Answer Template
Q: What is reinforcement learning and how does it relate to LLMs?
Reinforcement learning is a learning paradigm where an agent takes actions in an environment and learns to maximize cumulative reward through trial and error, with no labeled dataset of correct answers. One of the most prominent RL applications in modern AI is RLHF (Reinforcement Learning from Human Feedback), used to align LLMs like GPT-4 and Claude. In RLHF, a reward model learns human preferences from ranked response pairs, then PPO (Proximal Policy Optimization) updates the LLM to maximize that reward while a KL divergence penalty keeps the model from drifting too far from the supervised baseline. The aim is a model that produces helpful, harmless responses aligned with human values, not just statistically likely text.