Learnixo

Machine Learning Foundations · Lesson 8 of 70

What is Reinforcement Learning?

The Core Idea

In reinforcement learning (RL), an agent takes actions in an environment and learns to maximize cumulative reward through trial and error.

Supervised:     Labels tell the model the right answer
Unsupervised:   Model finds structure without any feedback
Reinforcement:  Model receives reward/penalty AFTER taking actions

There is no labeled dataset of (input, correct_output) pairs. Instead, the agent learns by acting and receiving feedback from the environment.


The RL Framework

┌─────────────────────────────────────────────────────────┐
│                                                         │
│   Agent  ─────action──────►  Environment                │
│     ▲                              │                    │
│     │                     state + reward                │
│     └─────────────────────────────┘                     │
│                                                         │
└─────────────────────────────────────────────────────────┘

| Component | Definition | Example |
|---|---|---|
| Agent | The decision-making system | AI dosing assistant |
| Environment | What the agent interacts with | Simulated patient physiology |
| State | The current situation | Patient vitals, current dose |
| Action | What the agent can do | Increase / decrease / hold dose |
| Reward | Signal indicating how good the action was | +1 if INR in range, -2 if bleeding |
| Policy | Strategy for choosing actions given state | "If INR above 3, decrease dose" |
| Episode | One complete interaction sequence | 30-day patient treatment period |
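
The loop in the diagram can be sketched in a few lines of Python. This is a toy illustration, not a clinical model: `ToyDosingEnv`, its INR dynamics, its simplified reward, and the `threshold_policy` rule are all hypothetical, chosen only to make the roles of state, action, reward, policy, and episode concrete.

```python
import random


class ToyDosingEnv:
    """Hypothetical environment: the state is a single INR value."""

    def __init__(self, episode_length=30, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)

    def reset(self):
        """Start a new episode and return the initial state."""
        self.day = 0
        self.inr = 1.5
        return self.inr

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        # Made-up dynamics: the dose change nudges INR, plus a little noise.
        shift = {"decrease": -0.3, "hold": 0.0, "increase": 0.3}[action]
        self.inr = max(0.5, self.inr + shift + self.rng.uniform(-0.1, 0.1))
        self.day += 1
        # Simplified reward (assumption): +1 in range, -0.5 otherwise.
        reward = 1.0 if 2.0 <= self.inr <= 3.0 else -0.5
        done = self.day >= self.episode_length
        return self.inr, reward, done


def threshold_policy(state):
    """A fixed hand-written policy: map the state to an action."""
    if state > 3.0:
        return "decrease"
    if state < 2.0:
        return "increase"
    return "hold"


def run_episode(env, policy):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


total = run_episode(ToyDosingEnv(), threshold_policy)
```

Here the policy is fixed by hand. An RL algorithm would instead adjust the policy between episodes, using the rewards it collected.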


Key Concepts

Reward Signal

The reward defines what the agent is optimizing for. Designing it correctly is critical and difficult.

Good reward design:
  +1.0  — INR in therapeutic range (2.0–3.0)
  -0.5  — INR subtherapeutic (below 2.0)
  -2.0  — INR supratherapeutic (above 4.0, bleeding risk)
  -5.0  — major bleeding event

Bad reward design:
  +1.0  — patient survives today
  → Reward arrives almost regardless of the action, so the agent gets no useful signal and can learn to do nothing

Reward hacking: the agent finds unintended ways to maximize reward that don't match the actual goal. This is why reward design is so important.
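
As a rough sketch, the "good reward design" list above could be written as a function. The function name and arguments are hypothetical. The list does not say what reward applies to an INR between 3.0 and 4.0, so this sketch assumes 0.0 there, and it also assumes that a major bleeding event overrides the INR-based terms.

```python
def inr_reward(inr, major_bleed=False):
    """Reward for one time step, following the 'good reward design' list."""
    if major_bleed:
        return -5.0  # assumption: a major bleed overrides the INR-based terms
    if inr < 2.0:
        return -0.5  # subtherapeutic
    if inr <= 3.0:
        return 1.0  # therapeutic range, 2.0 to 3.0
    if inr > 4.0:
        return -2.0  # supratherapeutic, bleeding risk
    return 0.0  # assumption: the list leaves 3.0 < INR <= 4.0 unspecified
```

Writing the reward as code tends to expose gaps such as the unspecified 3.0 to 4.0 band, which is one practical reason reward design is hard to get right.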


Exploration vs Exploitation

The agent must balance:

  • Exploration: try new actions to discover better strategies
  • Exploitation: use the best-known action to collect reward

ε-greedy policy:
  With probability ε: take a random action (explore)
  With probability 1-ε: take the best-known action (exploit)

Early training: ε = 0.9 (explore a lot)
Later training:  ε = 0.05 (mostly exploit, small exploration)
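
A minimal sketch of ε-greedy action selection follows. It assumes the agent keeps a list of estimated values, one per action. The linear decay schedule is an assumption: the text gives only the early and late values of ε, not how to move between them.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the highest estimated value (first one on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])


def linear_epsilon(step, total_steps, start=0.9, end=0.05):
    """Linearly anneal epsilon from `start` to `end` over training."""
    fraction = min(step / total_steps, 1.0)
    return start + fraction * (end - start)


# Early in training the agent mostly explores; later it mostly exploits.
action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=linear_epsilon(0, 1000))
```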

Cumulative Reward (Return)

The agent doesn't just care about immediate reward — it cares about the sum of future rewards, discounted by how far away they are.

Return = r₀ + γ·r₁ + γ²·r₂ + γ³·r₃ + ...

γ (gamma) = discount factor, typically 0.9–0.99
  γ close to 1: agent cares a lot about future rewards
  γ close to 0: agent is short-sighted
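
The return formula can be checked with a short function. This sketch assumes a finite list of rewards (one episode); the function name is illustrative.

```python
def discounted_return(rewards, gamma):
    """Compute r0 + gamma*r1 + gamma^2*r2 + ... for a finite reward list."""
    total = 0.0
    # Work backwards so each step is G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1.75
print(discounted_return(rewards, gamma=1.0))  # 3.0
print(discounted_return(rewards, gamma=0.0))  # 1.0
```

With γ = 0 only the immediate reward counts, and with γ = 1 every future reward counts in full, which matches the short-sighted versus far-sighted description above.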

RL Algorithms

| Algorithm | Category | Use Case |
|---|---|---|
| Q-Learning / DQN | Value-based | Discrete actions, Atari games |
| Policy Gradient / REINFORCE | Policy-based | Continuous actions |
| PPO (Proximal Policy Optimization) | Actor-Critic | Default for RLHF, robotics |
| SAC (Soft Actor-Critic) | Actor-Critic | Continuous control, sample-efficient |
| AlphaGo / AlphaZero | Model-based | Game-playing with lookahead |
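
To make the value-based row concrete, here is a sketch of one tabular Q-learning update, Q(s, a) ← Q(s, a) + α · [r + γ · max Q(s', a') − Q(s, a)]. The state names, the dictionary-based table, and the default α and γ are illustrative assumptions, not part of the lesson.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Value of the best next action; zero if the episode has ended.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
actions = ["decrease", "hold", "increase"]
q_learning_update(q, "low_inr", "increase", 1.0, "in_range", actions)
```

DQN follows the same update idea but replaces the table with a neural network that estimates Q-values.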


RL in LLMs: RLHF

Reinforcement Learning from Human Feedback (RLHF) is a core technique for aligning modern LLMs such as GPT-4 and Claude to be helpful and harmless. It is arguably the most prominent RL application in AI engineering today.

Step 1 — Pre-training:
  LLM trained on internet text via next-token prediction (self-supervised)

Step 2 — Supervised Fine-Tuning (SFT):
  Human-written (prompt, ideal_response) pairs → LLM learns the style

Step 3 — Reward Model Training:
  Humans rank two responses: "A is better than B"
  A reward model learns to predict human preference scores

Step 4 — RL with PPO:
  LLM generates responses
  Reward model scores them
  PPO updates LLM weights to maximize reward score
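  KL penalty: prevents LLM from drifting too far from SFT base

Python
# Conceptual sketch of one RLHF step. The llm, sft_baseline, reward_model and
# ppo_optimizer objects and their methods are hypothetical, not a real library API.
def compute_kl(llm, sft_baseline, prompt, response):
    # Per-sample KL estimate: log-probability gap between the current LLM
    # and the frozen SFT model on the sampled response
    return llm.log_prob(prompt, response) - sft_baseline.log_prob(prompt, response)


def rlhf_step(llm, sft_baseline, reward_model, prompts, ppo_optimizer, beta=0.1):
    # beta weights the KL penalty; 0.1 is only an illustrative default
    for prompt in prompts:
        # LLM (the policy) generates a response
        response = llm.generate(prompt)

        # Reward model scores the response
        reward = reward_model.score(prompt, response)

        # KL penalty: keep LLM close to original SFT model
        kl_penalty = compute_kl(llm, sft_baseline, prompt, response)

        # Subtract the penalty before the update so PPO actually sees it
        shaped_reward = reward - beta * kl_penalty

        # PPO update: increase probability of responses with high shaped reward
        ppo_optimizer.step(llm, prompt, response, shaped_reward)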
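
Step 3 can also be made a little more concrete. A common choice for training a reward model on pairwise rankings is a Bradley-Terry style loss, -log σ(score_chosen − score_rejected). The lesson does not say which loss is used, so treat this as an assumption. The scores below are made-up numbers standing in for reward-model outputs.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), written stably."""
    margin = score_chosen - score_rejected
    # softplus(-margin) == -log(sigmoid(margin)); branch to avoid overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the reward model agrees with the human ranking...
agree = pairwise_preference_loss(score_chosen=2.0, score_rejected=-1.0)
# ...and large when it scores the rejected response higher.
disagree = pairwise_preference_loss(score_chosen=-1.0, score_rejected=2.0)
```

Minimizing this loss over many ranked pairs pushes the reward model to score the preferred response higher than the rejected one.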

RL vs Supervised Learning for LLMs

| Aspect | Supervised Fine-Tuning (SFT) | RLHF |
|---|---|---|
| Labels | Human-written ideal responses | Human preference rankings |
| Feedback signal | Token-level cross-entropy loss | Scalar reward per response |
| What it optimizes | Matching human-written text | Maximizing human preference |
| Cost | Expensive to create ideal responses | Cheaper: annotators only rank pairs |
| Risk | Mode collapse to specific style | Reward hacking |


When RL Applies in AI Systems

  • LLM alignment: RLHF, and DPO (Direct Preference Optimization), which trains on the same preference data without a separate RL loop
  • Recommendation systems: services such as Netflix and Spotify, optimizing for long-term engagement
  • Robotics: learning to walk, pick-and-place tasks
  • Game AI: AlphaGo, OpenAI Five (Dota 2)
  • Drug discovery: optimizing molecule properties through iterative generation
  • AutoML: searching neural architecture space

Interview Answer Template

Q: What is reinforcement learning and how does it relate to LLMs?

Reinforcement learning is a learning paradigm in which an agent takes actions in an environment and learns to maximize cumulative reward through trial and error, with no labeled dataset of correct answers. Its most prominent application in modern AI is RLHF (Reinforcement Learning from Human Feedback), used to align LLMs such as GPT-4 and Claude. In RLHF, a reward model learns human preferences from ranked response pairs; PPO (Proximal Policy Optimization) then updates the LLM to maximize that reward, while a KL divergence penalty keeps the model from drifting too far from the supervised fine-tuned baseline. The result is a model tuned toward responses that people judge helpful and harmless, not just statistically likely text.