What is Reinforcement Learning?

Understand reinforcement learning: agents, environments, rewards, policies, and the connection to RLHF in LLMs, with clear intuition for AI engineering interviews.

Asma Hafeez Khan · May 16, 2026 · 5 min read
Tags: Machine Learning, Reinforcement Learning, RLHF, AI Alignment, Interview

The Core Idea

In reinforcement learning (RL), an agent takes actions in an environment and learns to maximize cumulative reward through trial and error.

Supervised:     Labels tell the model the right answer
Unsupervised:   Model finds structure without any feedback
Reinforcement:  Model receives reward/penalty AFTER taking actions

There is no labeled dataset of (input, correct_output) pairs. Instead, the agent learns by doing and receiving feedback from the environment.


The RL Framework

┌─────────────────────────────────────────────────────────┐
│                                                         │
│   Agent  ─────action──────►  Environment                │
│     ▲                              │                    │
│     │                     state + reward                │
│     └──────────────────────────────┘                    │
│                                                         │
└─────────────────────────────────────────────────────────┘

| Component | Definition | Example |
|---|---|---|
| Agent | The decision-making system | AI dosing assistant |
| Environment | What the agent interacts with | Simulated patient physiology |
| State | The current situation | Patient vitals, current dose |
| Action | What the agent can do | Increase / decrease / hold dose |
| Reward | Signal indicating how good the action was | +1 if INR in range, -2 if bleeding |
| Policy | Strategy for choosing actions given state | "If INR above 3, decrease dose" |
| Episode | One complete interaction sequence | 30-day patient treatment period |
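
The components in the table map directly onto a simple interaction loop. The sketch below is a toy illustration only: `ToyDosingEnv`, its dynamics, its reward values, and `threshold_policy` are hypothetical names and numbers invented for this example, not a clinical model.

```python
import random


class ToyDosingEnv:
    """Hypothetical toy environment: the state is a single INR-like number."""

    def __init__(self, days=30, seed=0):
        self.days = days
        self.rng = random.Random(seed)

    def reset(self):
        self.day = 0
        self.inr = 1.5  # assumed starting point, below the target range
        return self.inr

    def step(self, action):
        # action: -1 = decrease dose, 0 = hold, +1 = increase dose
        noise = self.rng.uniform(-0.1, 0.1)
        self.inr = max(0.5, self.inr + 0.3 * action + noise)
        self.day += 1
        reward = 1.0 if 2.0 <= self.inr <= 3.0 else -0.5
        done = self.day >= self.days  # one episode = one treatment period
        return self.inr, reward, done


def threshold_policy(inr):
    """A fixed, hand-written policy: maps the state (INR) to an action."""
    if inr > 3.0:
        return -1
    if inr < 2.0:
        return 1
    return 0


def run_episode(env, policy):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment returns state + reward
        total_reward += reward
    return total_reward


total = run_episode(ToyDosingEnv(), threshold_policy)
print(f"Total reward over one 30-day episode: {total:.1f}")
```

Here the policy is fixed. In actual RL, the policy would be updated between or during episodes based on the rewards it collects.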


Key Concepts

Reward Signal

The reward defines what the agent is optimizing for. Designing it correctly is critical and difficult.

Good reward design:
  +1.0: INR in therapeutic range (2.0–3.0)
  -0.5: INR subtherapeutic (below 2.0)
  -2.0: INR supratherapeutic (above 4.0, bleeding risk)
  -5.0: major bleeding event

Bad reward design:
  +1.0: patient survives today
  → Agent earns reward on almost every day regardless of dosing, so it can learn to do nothing (the reward does not measure dosing quality)

Reward hacking: the agent finds unintended ways to maximize reward that don't match the actual goal. This is why reward design is so important.
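
The "good reward design" values listed above can be written as a small function. This is a minimal sketch: the function name `inr_reward` and the `major_bleed` flag are hypothetical, and the 0.0 reward for an INR between 3.0 and 4.0 is an assumption, since the list does not specify that range.

```python
def inr_reward(inr, major_bleed=False):
    """Reward for one day of treatment, using the values from the list above."""
    if major_bleed:
        return -5.0  # major bleeding event
    if inr > 4.0:
        return -2.0  # supratherapeutic, bleeding risk
    if inr < 2.0:
        return -0.5  # subtherapeutic
    if inr <= 3.0:
        return 1.0   # therapeutic range 2.0-3.0
    return 0.0       # assumption: the 3.0-4.0 band is not specified in the list


for inr in (1.6, 2.5, 3.4, 4.5):
    print(inr, inr_reward(inr))
```

Writing the reward out like this makes gaps visible. The unspecified 3.0 to 4.0 band is exactly the kind of hole an agent could exploit.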


Exploration vs Exploitation

The agent must balance:

  • Exploration: try new actions to discover better strategies
  • Exploitation: use the best-known action to collect reward

ε-greedy policy:
  With probability ε: take a random action (explore)
  With probability 1-ε: take the best-known action (exploit)

Early training: ε = 0.9 (explore a lot)
Later training: ε = 0.05 (mostly exploit, small exploration)
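
A minimal sketch of ε-greedy action selection with a decaying ε. The linear decay schedule and the helper names `epsilon_greedy` and `linear_epsilon` are assumptions for illustration, and the action-value estimates are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def linear_epsilon(step, total_steps, start=0.9, end=0.05):
    """Decay epsilon linearly from `start` to `end` over training."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)


q_values = [0.1, 0.7, 0.3]  # hypothetical value estimates for three actions

# With epsilon = 0 the agent never explores, so it picks index 1 (value 0.7).
print(epsilon_greedy(q_values, epsilon=0.0))

# Epsilon at the start, middle, and end of a 1000-step training run.
for step in (0, 500, 1000):
    print(step, round(linear_epsilon(step, 1000), 3))
```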

Cumulative Reward (Return)

The agent doesn't just care about immediate reward. It cares about the sum of future rewards, discounted by how far away they are.

Return = r₀ + γ·r₁ + γ²·r₂ + γ³·r₃ + ...

γ (gamma) = discount factor, typically 0.9–0.99
  γ close to 1: agent cares a lot about future rewards
  γ close to 0: agent is short-sighted
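
The return formula can be computed for a finished episode by working backwards through the rewards. This is a small sketch; the function name `discounted_return` and the example reward lists are illustrative.

```python
def discounted_return(rewards, gamma):
    """Compute r0 + gamma*r1 + gamma^2*r2 + ... for one episode."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of reward 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75

# A single delayed reward is worth much less to a short-sighted agent.
delayed = [0.0, 0.0, 0.0, 10.0]
print(discounted_return(delayed, gamma=0.99))
print(discounted_return(delayed, gamma=0.1))
```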

RL Algorithms

| Algorithm | Category | Use Case |
|---|---|---|
| Q-Learning / DQN | Value-based | Discrete actions, Atari games |
| Policy Gradient / REINFORCE | Policy-based | Continuous actions |
| PPO (Proximal Policy Optimization) | Actor-Critic | Default for RLHF, robotics |
| SAC (Soft Actor-Critic) | Actor-Critic | Continuous control, sample-efficient |
| AlphaGo / AlphaZero | Model-based | Game-playing with lookahead |
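
To make "value-based" concrete, here is a sketch of a single tabular Q-learning update, the simplest member of the first row of the table. The state and action names and the values of `alpha` (learning rate) and `gamma` are assumptions chosen for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning update:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    # No future value is added when the episode has ended.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
actions = ["decrease", "hold", "increase"]

new_value = q_learning_update(q, "inr_low", "increase", reward=1.0,
                              next_state="inr_ok", actions=actions)
print(new_value)
```

The agent never stores a policy explicitly here. It acts by picking the action with the highest Q-value in the current state, which is why the approach suits small, discrete action sets.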


RL in LLMs: RLHF

Reinforcement Learning from Human Feedback (RLHF) is a core technique for aligning modern LLMs such as GPT-4 and Claude to be helpful and harmless. It is arguably the most important RL application in AI engineering today.

Step 1 - Pre-training:
  LLM trained on internet text via next-token prediction (self-supervised)

Step 2 - Supervised Fine-Tuning (SFT):
  Human-written (prompt, ideal_response) pairs → LLM learns the style

Step 3 - Reward Model Training:
  Humans rank two responses: "A is better than B"
  A reward model learns to predict human preference scores

Step 4 - RL with PPO:
  LLM generates responses
  Reward model scores them
  PPO updates LLM weights to maximize reward score
  KL penalty: prevents LLM from drifting too far from SFT base
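
```python
# Conceptual sketch of one RLHF step. llm, sft_baseline, reward_model, and
# ppo_optimizer are placeholder objects, not a real library API.
def rlhf_step(llm, sft_baseline, reward_model, prompts, ppo_optimizer, beta=0.1):
    for prompt in prompts:
        # LLM generates a response
        response = llm.generate(prompt)

        # Reward model scores the response
        reward = reward_model.score(prompt, response)

        # KL penalty: keep the LLM close to the original SFT model.
        # The log-probability gap on the sampled response estimates the KL term.
        kl_penalty = (llm.log_prob(prompt, response)
                      - sft_baseline.log_prob(prompt, response))

        # Penalized reward; beta (assumed 0.1 here) sets the penalty strength
        shaped_reward = reward - beta * kl_penalty

        # PPO update: increase the probability of high-reward responses
        ppo_optimizer.step(llm, prompt, response, shaped_reward)
```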
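
Step 3, training the reward model from "A is better than B" rankings, is commonly done with a pairwise Bradley-Terry style loss. The article does not say which loss is used, so treat this as one standard choice. The function name `preference_loss` and the example scores are illustrative.

```python
import math


def preference_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # Equivalent to log(1 + exp(-margin)), written to avoid overflow
    # when the margin is large and negative.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Reward model agrees with the human ranking: low loss.
print(preference_loss(2.0, 0.0))

# Reward model prefers the rejected response: high loss.
print(preference_loss(0.0, 2.0))

# Reward model is undecided: loss equals log(2).
print(preference_loss(1.0, 1.0))
```

Minimizing this loss over many ranked pairs pushes the reward model to score the human-preferred response higher. That score is the scalar reward PPO uses in Step 4.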

RL vs Supervised Learning for LLMs

| Aspect | Supervised Fine-Tuning (SFT) | RLHF |
|---|---|---|
| Labels | Human-written ideal responses | Human preference rankings |
| Feedback signal | Token-level cross-entropy loss | Scalar reward per response |
| What it optimizes | Matching human-written text | Maximizing human preference |
| Cost | Expensive to create ideal responses | Cheaper: just rank pairs |
| Risk | Mode collapse to specific style | Reward hacking |


When RL Applies in AI Systems

  • LLM alignment: RLHF, DPO (Direct Preference Optimization)
  • Recommendation systems: Netflix, Spotify optimize for long-term engagement
  • Robotics: learning to walk, pick-and-place tasks
  • Game AI: AlphaGo, OpenAI Five (Dota 2)
  • Drug discovery: optimizing molecule properties through iterative generation
  • AutoML: searching neural architecture space

Interview Answer Template

Q: What is reinforcement learning and how does it relate to LLMs?

Reinforcement learning is a learning paradigm where an agent takes actions in an environment and learns to maximize cumulative reward through trial and error, with no labeled dataset of correct answers. The most important RL application in modern AI is RLHF (Reinforcement Learning from Human Feedback), used to align LLMs like GPT-4 and Claude. In RLHF, a reward model learns human preferences from ranked response pairs, then PPO (Proximal Policy Optimization) updates the LLM to maximize that reward while a KL divergence penalty keeps the model from drifting too far from the supervised baseline. The result is a model that produces helpful, harmless responses aligned with human values, not just statistically likely text.
