Machine Learning Foundations · Lesson 8 of 70
What is Reinforcement Learning?
The Core Idea
In reinforcement learning (RL), an agent takes actions in an environment and learns to maximize cumulative reward through trial and error.
Supervised: Labels tell the model the right answer
Unsupervised: Model finds structure without any feedback
Reinforcement: Model receives reward/penalty AFTER taking actions
No labeled dataset of (input, correct_output); instead, the agent learns by doing and receiving feedback from the environment.
The RL Framework
┌─────────────────────────────────────────────────────────┐
│                                                         │
│   Agent ─────action──────► Environment                  │
│     ▲                           │                       │
│     │      state + reward       │                       │
│     └───────────────────────────┘                       │
│                                                         │
└─────────────────────────────────────────────────────────┘

| Component | Definition | Example |
|---|---|---|
| Agent | The decision-making system | AI dosing assistant |
| Environment | What the agent interacts with | Simulated patient physiology |
| State | The current situation | Patient vitals, current dose |
| Action | What the agent can do | Increase / decrease / hold dose |
| Reward | Signal indicating how good the action was | +1 if INR in range, -2 if bleeding |
| Policy | Strategy for choosing actions given state | "If INR above 3, decrease dose" |
| Episode | One complete interaction sequence | 30-day patient treatment period |
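The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not a clinical model: `ToyDosingEnv`, its made-up INR dynamics, the simplified two-value reward, and `threshold_policy` are all hypothetical names and assumptions introduced here. The policy is fixed by hand, so nothing is learned yet; the point is only the state → action → reward cycle over one 30-step episode.

```python
import random


class ToyDosingEnv:
    """Hypothetical toy environment: the state is a single INR-like number."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.inr = 0.0
        self.day = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.inr = 1.5
        self.day = 0
        return self.inr

    def step(self, action):
        """Apply an action: -1 decrease, 0 hold, +1 increase the dose."""
        # Made-up dynamics: the dose change nudges INR, plus a little noise.
        self.inr += 0.3 * action + self.rng.uniform(-0.1, 0.1)
        self.day += 1
        # Simplified reward for illustration: in range is good, anything else is not.
        reward = 1.0 if 2.0 <= self.inr <= 3.0 else -0.5
        done = self.day >= 30  # one episode = 30 simulated days
        return self.inr, reward, done


def threshold_policy(state):
    """A fixed, hand-written policy: push INR toward the 2.0-3.0 band."""
    if state > 3.0:
        return -1
    if state < 2.0:
        return +1
    return 0


def run_episode(env, policy):
    """Run the agent-environment loop once and return the total reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment returns state + reward
        total_reward += reward
    return total_reward


total = run_episode(ToyDosingEnv(seed=0), threshold_policy)
print(total)
```

An RL algorithm would replace `threshold_policy` with a policy that is adjusted from the rewards it collects.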
Key Concepts
Reward Signal
The reward defines what the agent is optimizing for. Designing it correctly is critical and difficult.
Good reward design:
+1.0 — INR in therapeutic range (2.0–3.0)
-0.5 — INR subtherapeutic (below 2.0)
-2.0 — INR supratherapeutic (above 4.0, bleeding risk)
-5.0 — major bleeding event
Bad reward design:
+1.0 — patient survives today
→ Agent learns to do nothing: the patient survives most days whatever the dose, so the reward says almost nothing about dosing quality
Reward hacking: the agent finds unintended ways to maximize reward that don't match the actual goal. This is why reward design is so important.
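As a rough sketch, the two reward designs above can be written as plain functions. The function names are hypothetical, and two details are assumptions because the lesson does not specify them: a major bleeding event overrides the INR-based reward, and an INR between 3.0 and 4.0 earns 0.

```python
def inr_reward(inr, major_bleed=False):
    """Per-step reward using the values from the 'good reward design' list."""
    if major_bleed:
        return -5.0   # ASSUMPTION: a major bleeding event overrides everything else
    if inr > 4.0:
        return -2.0   # supratherapeutic, bleeding risk
    if inr < 2.0:
        return -0.5   # subtherapeutic
    if inr <= 3.0:
        return 1.0    # therapeutic range 2.0-3.0
    return 0.0        # ASSUMPTION: 3.0 < INR <= 4.0 is not covered by the list


def survival_reward(alive_today):
    """The 'bad' design: it ignores INR and dosing entirely."""
    return 1.0 if alive_today else 0.0


for inr in (1.5, 2.5, 3.5, 4.5):
    print(inr, inr_reward(inr), survival_reward(True))
```

The second column changes with the patient's INR, while the third is the same on every row, which is why it gives the agent nothing to learn from.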
Exploration vs Exploitation
The agent must balance:
- Exploration — try new actions to discover better strategies
- Exploitation — use the best-known action to collect reward
ε-greedy policy:
With probability ε: take a random action (explore)
With probability 1-ε: take the best-known action (exploit)
Early training: ε = 0.9 (explore a lot)
Later training: ε = 0.05 (mostly exploit, small exploration)
Cumulative Reward (Return)
The agent doesn't just care about immediate reward — it cares about the sum of future rewards, discounted by how far away they are.
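A small numeric sketch makes the discounting concrete: a reward received t steps in the future is multiplied by γ raised to the power t before being added to the sum. The helper name `discounted_return` and the reward sequence are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    # Work backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Made-up episode: three good days followed by one bad event.
rewards = [1.0, 1.0, 1.0, -5.0]

print(discounted_return(rewards, 0.99))  # far-sighted: the late penalty dominates
print(discounted_return(rewards, 0.5))   # → 1.125
print(discounted_return(rewards, 0.0))   # → 1.0 (only the immediate reward counts)
```

With γ = 0.99 the return is negative because the penalty at step 3 is barely discounted; with γ = 0.5 the same penalty shrinks to -0.625 and the return stays positive.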
Return = r₀ + γ·r₁ + γ²·r₂ + γ³·r₃ + ...
γ (gamma) = discount factor, typically 0.9–0.99
γ close to 1: agent cares a lot about future rewards
γ close to 0: agent is short-sighted
RL Algorithms
| Algorithm | Category | Use Case |
|---|---|---|
| Q-Learning / DQN | Value-based | Discrete actions, Atari games |
| Policy Gradient / REINFORCE | Policy-based | Continuous actions |
| PPO (Proximal Policy Optimization) | Actor-Critic | Default for RLHF, robotics |
| SAC (Soft Actor-Critic) | Actor-Critic | Continuous control, sample-efficient |
| AlphaGo / AlphaZero | Model-based | Game-playing with lookahead |
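To make the first row of the table concrete, here is a minimal sketch of tabular Q-learning that also ties together the ε-greedy policy and the discount factor γ from earlier in the lesson. The five-state chain environment, the learning rate `alpha`, the episode count, and the linear ε schedule are assumptions chosen for illustration; only the ε endpoints (0.9 and 0.05) come from the lesson.

```python
import random

N_STATES = 5        # chain of states 0..4; state 4 is the goal (terminal)
ACTIONS = [-1, +1]  # move left or move right


def step(state, action):
    """Hypothetical chain environment: reward +1 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties randomly so the agent does not always favor index 0.
    return rng.choice([i for i, q in enumerate(q_row) if q == best])


def train(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: q[state][action_index]
    for ep in range(episodes):
        # Linearly decay epsilon from 0.9 (explore a lot) to 0.05 (mostly exploit).
        epsilon = 0.9 + (0.05 - 0.9) * ep / (episodes - 1)
        state, done = 0, False
        while not done:
            a = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward plus discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train()
# Greedy action in each non-terminal state after training.
greedy = [ACTIONS[max(range(2), key=lambda i: q_table[s][i])] for s in range(N_STATES - 1)]
print(greedy)
```

After training, the greedy action in every non-terminal state should be +1 (move right), and the learned values shrink by a factor of γ for each extra step away from the goal.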
RL in LLMs: RLHF
Reinforcement Learning from Human Feedback (RLHF) is a widely used technique for aligning LLMs such as GPT-4 and Claude to be helpful and harmless. It is among the most prominent applications of RL in AI engineering today.
Step 1 — Pre-training:
LLM trained on internet text via next-token prediction (self-supervised)
Step 2 — Supervised Fine-Tuning (SFT):
Human-written (prompt, ideal_response) pairs → LLM learns the style
Step 3 — Reward Model Training:
Humans rank two responses: "A is better than B"
A reward model learns to predict human preference scores
Step 4 — RL with PPO:
LLM generates responses
Reward model scores them
PPO updates LLM weights to maximize reward score
  KL penalty: prevents LLM from drifting too far from SFT base

    # Conceptual pseudocode for RLHF.
    # llm, sft_baseline, reward_model, ppo_optimizer and compute_kl are placeholders,
    # not a real library API; beta=0.1 is an illustrative value only.
    def rlhf_step(llm, sft_baseline, reward_model, prompts, ppo_optimizer, beta=0.1):
        for prompt in prompts:
            # LLM generates response
            response = llm.generate(prompt)
            # Reward model scores the response
            reward = reward_model.score(prompt, response)
            # KL penalty: keep LLM close to original SFT model
            kl_penalty = compute_kl(llm, sft_baseline, prompt, response)
            # Subtract the penalty before the update so it actually shapes training
            shaped_reward = reward - beta * kl_penalty
            # PPO update: increase probability of high-reward responses
            ppo_optimizer.step(llm, prompt, response, shaped_reward)

RL vs Supervised Learning for LLMs
| Aspect | Supervised Fine-Tuning (SFT) | RLHF |
|---|---|---|
| Labels | Human-written ideal responses | Human preference rankings |
| Feedback signal | Token-level cross-entropy loss | Scalar reward per response |
| What it optimizes | Matching human-written text | Maximizing human preference |
| Cost | Expensive to create ideal responses | Cheaper: just rank pairs |
| Risk | Mode collapse to specific style | Reward hacking |
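The "human preference rankings" row can be made concrete with a short sketch of how a ranked pair turns into a training signal for the reward model. A common choice, assumed here rather than stated in the lesson, is a pairwise logistic (Bradley-Terry style) loss: the loss is small when the reward model scores the human-preferred response higher than the rejected one. The function name and the example scores are hypothetical.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Two algebraically equal forms, picked so exp() never receives a large positive input.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Humans said "A is better than B"; these are made-up reward model scores.
print(preference_loss(2.0, 0.0))  # model agrees with the ranking: small loss
print(preference_loss(0.0, 0.0))  # model is undecided: loss equals log(2)
print(preference_loss(0.0, 2.0))  # model disagrees with the ranking: large loss
```

Minimizing this loss over many ranked pairs pushes the reward model to assign higher scalar scores to the responses humans preferred, which is the score PPO then maximizes.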
When RL Applies in AI Systems
- LLM alignment — RLHF, DPO (Direct Preference Optimization)
- Recommendation systems — Netflix, Spotify optimize for long-term engagement
- Robotics — learning to walk, pick-and-place tasks
- Game AI — AlphaGo, OpenAI Five, Dota bots
- Drug discovery — optimizing molecule properties through iterative generation
- AutoML — searching neural architecture space
Interview Answer Template
Q: What is reinforcement learning and how does it relate to LLMs?
Reinforcement learning is a learning paradigm where an agent takes actions in an environment and learns to maximize cumulative reward through trial and error — with no labeled dataset of correct answers. The most important RL application in modern AI is RLHF (Reinforcement Learning from Human Feedback), used to align LLMs like GPT-4 and Claude. In RLHF, a reward model learns human preferences from ranked response pairs, then PPO (Proximal Policy Optimization) updates the LLM to maximize that reward while a KL divergence penalty keeps the model from drifting too far from the supervised baseline. The result is a model that produces helpful, harmless responses aligned with human values, not just statistically likely text.