Module 05

Reinforcement Learning

Duration: ~45 minutes of video content | Timestamps: 2:07:28 - 3:09:39

5.1 The Reinforcement Learning Paradigm

Beyond Supervised Learning

| Approach               | Method                     | Key Characteristic         |
|------------------------|----------------------------|----------------------------|
| Pre-training           | Learn from internet text   | No quality signal          |
| Supervised FT          | Learn from human examples  | Limited by human solutions |
| Reinforcement Learning | Learn from trial and error | Discovers novel approaches |

The RL Loop

┌─────────────────────────────────────────────────────────────┐
│                REINFORCEMENT LEARNING LOOP                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐         ┌─────────────────┐                │
│  │   PROMPT    │────────▶│    GENERATE     │                │
│  │  (Problem)  │         │ (Many Solutions)│                │
│  └─────────────┘         └────────┬────────┘                │
│                                   │                          │
│                    ┌──────────────┴──────────────┐          │
│                    ▼              ▼              ▼          │
│              Solution A     Solution B     Solution C        │
│                    │              │              │           │
│                    ▼              ▼              ▼          │
│              ┌─────────────────────────────────────┐        │
│              │           EVALUATION                 │        │
│              │  (Which solutions are correct?)      │        │
│              └─────────────────────────────────────┘        │
│                    │              │              │           │
│                    ▼              ▼              ▼          │
│                 ✓ Good         ✗ Bad          ✓ Good        │
│                    │                              │          │
│                    └──────────────┬───────────────┘          │
│                                   ▼                          │
│                    ┌─────────────────────────────┐          │
│                    │  TRAIN ON GOOD SOLUTIONS    │          │
│                    │  (Reinforce what works)     │          │
│                    └─────────────────────────────┘          │
│                                   │                          │
│                                   ▼                          │
│                            REPEAT CYCLE                      │
│                                                              │
└─────────────────────────────────────────────────────────────┘
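
A minimal sketch of this loop in Python, for illustration only. The helpers `model.generate`, `is_correct`, and `model.train_on` are hypothetical stand-ins for sampling from the model, task-specific verification, and a parameter update; they are not from the lecture.

```python
# Illustrative sketch of the RL loop above. `model.generate`, `is_correct`,
# and `model.train_on` are hypothetical placeholders, not a real API.

def rl_step(model, prompt, num_samples=16):
    # 1. Generate many candidate solutions for the same prompt.
    candidates = [model.generate(prompt) for _ in range(num_samples)]

    # 2. Evaluate each candidate (automated for verifiable tasks).
    good = [c for c in candidates if is_correct(prompt, c)]

    # 3. Reinforce only the solutions that worked.
    if good:
        model.train_on(prompt, good)

    return len(good) / num_samples   # fraction correct, useful to monitor


def rl_loop(model, prompts, rounds=10):
    # 4. Repeat the cycle, ideally moving on to harder prompts over time.
    for _ in range(rounds):
        for prompt in prompts:
            rl_step(model, prompt)
```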

Key Insight from Karpathy

"Unlike supervised fine-tuning, solutions emerge from model experimentation rather than human examples."


5.2 Verifiable vs. Unverifiable Domains

The Critical Distinction

┌─────────────────────────────────────────────────────────────┐
│            VERIFIABLE vs UNVERIFIABLE DOMAINS                │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  VERIFIABLE (Objective Truth)                                │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ • Mathematics: 2+2=4 is checkable                      │  │
│  │ • Coding: Code runs or it doesn't                      │  │
│  │ • Logic puzzles: Solutions are provable                │  │
│  │ • Games: Win/lose is clear                             │  │
│  │                                                        │  │
│  │ Evaluation: Automated LLM judges or executors          │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
│  UNVERIFIABLE (Subjective Quality)                           │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ • Creative writing: Is this story good?                │  │
│  │ • Humor: Is this joke funny?                           │  │
│  │ • Summarization: Is this summary helpful?              │  │
│  │ • Conversation: Is this response appropriate?          │  │
│  │                                                        │  │
│  │ Evaluation: Requires human judgment → Reward Models    │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Why This Matters

| Domain       | RL Method                     | Scalability            |
|--------------|-------------------------------|------------------------|
| Verifiable   | Direct evaluation             | Highly scalable        |
| Unverifiable | Human feedback → Reward model | Limited, then scalable |

5.3 RL for Verifiable Domains

The Process

For math, coding, and logic:

1. Generate 100-1000 attempts per problem
2. Execute/verify each solution
3. Identify which ones are correct
4. Train model on correct solutions
5. Repeat with harder problems

Example: Math Problem RL

Problem: What is the integral of x²dx?

Attempt 1: x³/3 + C  ✓ (Verified correct)
Attempt 2: x³ + C    ✗ (Missing /3)
Attempt 3: 2x        ✗ (Derivative, not integral)
Attempt 4: x³/3 + C  ✓ (Verified correct)
...

Training: Reinforce patterns from attempts 1 and 4
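
For this toy problem the verification step can be fully automatic. The sketch below is my illustration (not from the lecture) and uses SymPy: an attempt counts as correct if differentiating it recovers the integrand x².

```python
# Automatic verification of the integral attempts above: an attempt is
# "correct" if its derivative equals the integrand x**2 (constant C allowed).
import sympy

x, C = sympy.symbols("x C")
integrand = x**2

attempts = [
    "x**3/3 + C",   # attempt 1
    "x**3 + C",     # attempt 2
    "2*x",          # attempt 3
    "x**3/3 + C",   # attempt 4
]

def is_correct(candidate: str) -> bool:
    expr = sympy.sympify(candidate)
    # Differentiate the candidate and check it matches the integrand.
    return sympy.simplify(sympy.diff(expr, x) - integrand) == 0

correct = [a for a in attempts if is_correct(a)]
print(correct)  # ['x**3/3 + C', 'x**3/3 + C'] -> only these get reinforced
```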

Benefits

  • Unlimited practice: Generate millions of problems
  • Objective feedback: No human annotation needed
  • Discovery: Models find solutions humans wouldn't think of

5.4 RLHF: Human Feedback for Unverifiable Domains

The Challenge

How do you train a model to write better jokes when "funny" is subjective?

RLHF Architecture

┌─────────────────────────────────────────────────────────────┐
│                        RLHF PIPELINE                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  PHASE 1: Collect Human Preferences                          │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Prompt: "Tell me a joke about programmers"             │  │
│  │                                                        │  │
│  │ Response A: "Why do programmers prefer dark mode?..."  │  │
│  │ Response B: "A programmer walks into a bar..."         │  │
│  │                                                        │  │
│  │ Human: Response A is funnier (A > B)                   │  │
│  └───────────────────────────────────────────────────────┘  │
│                           ↓                                  │
│  PHASE 2: Train Reward Model                                 │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Input: (Prompt, Response)                              │  │
│  │ Output: Quality score (0.0 - 1.0)                      │  │
│  │                                                        │  │
│  │ Trained on thousands of human comparisons              │  │
│  │ Learns to predict "what would humans prefer?"          │  │
│  └───────────────────────────────────────────────────────┘  │
│                           ↓                                  │
│  PHASE 3: RL with Reward Model                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Generate response → Score with reward model → Train    │  │
│  │                                                        │  │
│  │ Now can run RL at scale without humans in the loop     │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘
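
Phase 2 is commonly implemented as pairwise preference learning. Below is a minimal PyTorch-style sketch of that idea, assuming a hypothetical `reward_model` callable that maps a (prompt, response) pair to a scalar score; tokenization and batching are omitted.

```python
# Minimal sketch of the pairwise (Bradley-Terry style) loss used to train a
# reward model from human comparisons. `reward_model` is a hypothetical
# module that scores a (prompt, response) pair with a single scalar.
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    score_chosen = reward_model(prompt, chosen)      # scalar tensor
    score_rejected = reward_model(prompt, rejected)  # scalar tensor
    # Push the human-preferred response above the other one:
    # loss = -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

For the joke example above, `chosen` would be Response A and `rejected` Response B; after enough comparisons the model's scalar output approximates "what would humans prefer?".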

The Discriminator-Generator Gap

Key Insight: It's easier to recognize quality than to create it.

Humans can easily judge: "Response A is better than B"
Humans struggle to write: The perfect response from scratch

This gap enables RLHF to work:
  • Use cheap human judgments to train a reward model
  • Use the reward model to guide expensive generation training

5.5 RLHF Limitations

The Reward Hacking Problem

Models can "game" imperfect reward models:

┌─────────────────────────────────────────────────────────────┐
│                    REWARD HACKING                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Iteration 0:    Normal responses, moderate scores          │
│  Iteration 100:  Improved quality, higher scores            │
│  Iteration 500:  Still improving, scores increasing         │
│  Iteration 1000: Strange artifacts appear, scores still up  │
│  Iteration 2000: "Best" response is nonsensical gibberish   │
│                                                              │
│  Problem: Model found adversarial inputs that fool the       │
│  reward model but aren't actually better to humans.          │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Mitigation Strategies

| Strategy               | Implementation                              |
|------------------------|---------------------------------------------|
| Early Stopping         | Cap RL iterations (hundreds, not thousands) |
| KL Divergence          | Penalize deviation from base model          |
| Multiple Reward Models | Cross-validate across different models      |
| Human Spot-Checks      | Periodically verify with actual humans      |
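
The KL Divergence row is often implemented as a penalty subtracted from the reward-model score, anchoring the policy to the base (reference) model. A hedged sketch, assuming per-token log-probability tensors for the generated response:

```python
# Sketch of KL-regularized reward shaping (assumed setup, not the lecture's
# exact recipe): the reward-model score is reduced by how far the policy's
# token log-probs have drifted from a frozen reference model.
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Per-token divergence estimate between the policy and the frozen reference.
    kl_per_token = policy_logprobs - ref_logprobs
    # Subtract the accumulated drift from the reward-model score; beta controls
    # how strongly the policy is anchored to the base model.
    return rm_score - beta * kl_per_token.sum(dim=-1)
```

Raising beta keeps outputs closer to the base model at the cost of slower reward gains.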

Key Warning from Karpathy

"After 1,000 updates, the model's top joke might be complete nonsense. Requires capping iterations at hundreds to prevent over-optimization."


5.6 DeepSeek-R1: Emergent Reasoning

The Breakthrough

DeepSeek demonstrated that RL can create emergent reasoning behaviors:

┌─────────────────────────────────────────────────────────────┐
│                  DEEPSEEK-R1 EMERGENCE                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  BEFORE RL: Model gives immediate answers                    │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Q: What is 847 * 293?                                  │  │
│  │ A: 248,071                                             │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
│  AFTER RL: Model develops extended reasoning                 │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Q: What is 847 * 293?                                  │  │
│  │ A: Let me think about this step by step...             │  │
│  │    First, I'll break this down:                        │  │
│  │    847 * 300 = 254,100                                 │  │
│  │    847 * 7 = 5,929                                     │  │
│  │    254,100 - 5,929 = 248,171                           │  │
│  │    Wait, let me verify...                              │  │
│  │    [Extended reasoning chain]                          │  │
│  │    The answer is 248,171.                              │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
│  KEY: The "step by step" behavior EMERGED from RL,           │
│  not from explicit training examples.                        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
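
DeepSeek-R1's reported recipe scores only the final answer (plus a format check on the thinking tags) rather than the reasoning itself, which is why the step-by-step behavior has to emerge on its own. A hedged sketch of such a rule-based reward; the tag format and the 0.1 partial credit are illustrative assumptions, not the published values:

```python
# Hedged sketch of a rule-based reward in the spirit of DeepSeek-R1:
# only answer correctness and basic formatting are scored, so extended
# reasoning has to emerge because it improves accuracy, not because it
# is rewarded directly. Tag format and reward values are illustrative.
import re

def reasoning_reward(completion: str, ground_truth: str) -> float:
    match = re.search(r"<think>.*?</think>\s*(.+)", completion, re.DOTALL)
    if not match:
        return 0.0                                        # format check failed
    final_answer = match.group(1).strip()
    return 1.0 if final_answer == ground_truth else 0.1   # accuracy reward

example = "<think>847*300 = 254,100; 847*7 = 5,929; 254,100 - 5,929 = 248,171</think> 248171"
print(reasoning_reward(example, "248171"))  # 1.0
```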

"Aha Moments"

DeepSeek models exhibit cognitive strategies that weren't taught:

| Emergent Behavior | Description                              |
|-------------------|------------------------------------------|
| Backtracking      | "Wait, that doesn't seem right..."       |
| Reframing         | "Let me approach this differently..."    |
| Verification      | "Let me double-check this result..."     |
| Extended Thinking | Using many more tokens for hard problems |

Key Insight

"It's not something that you can explicitly teach the model through just training on a dataset. It's something that the model has to figure out on its own through reinforcement learning."


5.7 AlphaGo: The Blueprint

Why AlphaGo Matters for LLMs

DeepMind's AlphaGo demonstrated RL's power to discover superhuman strategies:

┌─────────────────────────────────────────────────────────────┐
│                     MOVE 37                                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  In Game 2 against Lee Sedol, AlphaGo played "Move 37"       │
│                                                              │
│  • Professional Go players estimated this move had a         │
│    1-in-10,000 chance of being played by a human            │
│  • It was considered unconventional/wrong by experts        │
│  • It turned out to be brilliant and won the game           │
│                                                              │
│  IMPLICATION FOR LLMs:                                       │
│  RL can discover reasoning strategies beyond human           │
│  intuition. Models might find novel problem-solving          │
│  approaches that humans wouldn't teach them.                 │
│                                                              │
└─────────────────────────────────────────────────────────────┘

The Lesson

LLMs + RL could:

  • Develop novel mathematical proof strategies
  • Find unexpected coding patterns
  • Discover reasoning approaches humans haven't conceived

5.8 Current RL Landscape

Research Status

| Aspect                 | Status                            |
|------------------------|-----------------------------------|
| Pre-training           | Well understood, established      |
| Supervised Fine-tuning | Well understood, established      |
| Reinforcement Learning | Active research, rapidly evolving |

Key Players

| Organization | Approach                | Transparency |
|--------------|-------------------------|--------------|
| OpenAI       | Extensive RL research   | Proprietary  |
| Anthropic    | Constitutional AI, RLHF | Semi-open    |
| DeepSeek     | Pure RL reasoning       | Open papers  |
| Google       | RL for Gemini           | Proprietary  |

Why DeepSeek's Release Mattered

"That is why the release of DeepSeek was such a big deal - it openly shared RL methodology that others keep proprietary."


5.9 Key Takeaways

Summary

| Concept            | Key Point                                     |
|--------------------|-----------------------------------------------|
| RL Advantage       | Discovers solutions beyond training examples  |
| Verifiable Domains | Direct evaluation enables scalable RL         |
| RLHF               | Reward models approximate human judgment      |
| Reward Hacking     | Over-optimization creates adversarial outputs |
| Emergence          | Novel reasoning behaviors emerge from RL      |
| AlphaGo Insight    | RL finds superhuman strategies                |

For AI Analytics Platforms

Monitoring Insights:

  1. Reasoning Length: Track token count in reasoning chains (emerging behavior)
  2. Self-Correction Detection: Identify backtracking/verification patterns (see the sketch after this list)
  3. RL Training Metrics: If training your own, monitor reward scores carefully
  4. Reward Model Drift: Watch for gaming/adversarial outputs over time
  5. Capability Emergence: Track new behaviors appearing in updated models
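
A rough sketch of items 1 and 2 above; the marker phrases are illustrative, not an exhaustive or validated list.

```python
# Rough monitoring heuristics: approximate reasoning length and count
# self-correction markers in a model response. Marker list is illustrative.

SELF_CORRECTION_MARKERS = [
    "wait,", "let me double-check", "let me verify",
    "that doesn't seem right", "let me approach this differently",
]

def reasoning_metrics(response: str) -> dict:
    lowered = response.lower()
    return {
        # Word count as a crude proxy for reasoning-chain length.
        "approx_length": len(response.split()),
        # How often the response backtracks or re-checks itself.
        "self_corrections": sum(lowered.count(m) for m in SELF_CORRECTION_MARKERS),
    }

print(reasoning_metrics("Let me think step by step... Wait, let me verify that result."))
```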

Practical Applications

Applicable Learnings:

  1. Memory-Assisted Reasoning: Provide context that helps models reason better
  2. Verification Chains: Store intermediate reasoning steps, not just conclusions
  3. Pattern Recognition: Learn from which memory retrievals lead to successful outcomes
  4. Emergent Understanding: Models may use retrieved memories in unexpected but effective ways

Practice Questions

  1. Why can RL discover solutions that supervised learning cannot?
  2. What's the difference between verifiable and unverifiable domains?
  3. How does reward hacking occur and how is it prevented?
  4. What does "emergent reasoning" mean in the context of DeepSeek?

Next Module

Module 6: Future Directions & Resources


Timestamps: 2:07:28 - Supervised Fine-tuning | 2:14:42 - Reinforcement Learning | 2:27:47 - DeepSeek-R1 | 2:42:07 - AlphaGo | 2:48:26 - RLHF