Module 05

Reinforcement Learning

Duration: ~45 minutes of video content | Timestamps: 2:07:28 - 3:09:39

5.1 The Reinforcement Learning Paradigm

Beyond Supervised Learning

| Approach               | Method                     | Key Characteristic         |
|------------------------|----------------------------|----------------------------|
| Pre-training           | Learn from internet text   | No quality signal          |
| Supervised FT          | Learn from human examples  | Limited by human solutions |
| Reinforcement Learning | Learn from trial and error | Discovers novel approaches |

The RL Loop

┌─────────────────────────────────────────────────────────────┐
│                REINFORCEMENT LEARNING LOOP                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐         ┌─────────────────┐                │
│  │   PROMPT    │────────▶│    GENERATE     │                │
│  │  (Problem)  │         │ (Many Solutions)│                │
│  └─────────────┘         └────────┬────────┘                │
│                                   │                          │
│                    ┌──────────────┴──────────────┐          │
│                    ▼              ▼              ▼          │
│              Solution A     Solution B     Solution C        │
│                    │              │              │           │
│                    ▼              ▼              ▼          │
│              ┌─────────────────────────────────────┐        │
│              │           EVALUATION                 │        │
│              │  (Which solutions are correct?)      │        │
│              └─────────────────────────────────────┘        │
│                    │              │              │           │
│                    ▼              ▼              ▼          │
│                 ✓ Good         ✗ Bad          ✓ Good        │
│                    │                              │          │
│                    └──────────────┬───────────────┘          │
│                                   ▼                          │
│                    ┌─────────────────────────────┐          │
│                    │  TRAIN ON GOOD SOLUTIONS    │          │
│                    │  (Reinforce what works)     │          │
│                    └─────────────────────────────┘          │
│                                   │                          │
│                                   ▼                          │
│                            REPEAT CYCLE                      │
│                                                              │
└─────────────────────────────────────────────────────────────┘
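
A minimal sketch of this loop in Python, for illustration only. The helpers `model.generate`, `is_correct`, and `model.train_on` are hypothetical stand-ins for sampling from the model, task-specific verification, and a parameter update; they are not from the lecture.

```python
# Illustrative sketch of the RL loop above. `model.generate`, `is_correct`,
# and `model.train_on` are hypothetical placeholders, not a real API.

def rl_step(model, prompt, num_samples=16):
    # 1. Generate many candidate solutions for the same prompt.
    candidates = [model.generate(prompt) for _ in range(num_samples)]

    # 2. Evaluate each candidate (automated for verifiable tasks).
    good = [c for c in candidates if is_correct(prompt, c)]

    # 3. Reinforce only the solutions that worked.
    if good:
        model.train_on(prompt, good)

    return len(good) / num_samples   # fraction correct, useful to monitor


def rl_loop(model, prompts, rounds=10):
    # 4. Repeat the cycle, ideally moving on to harder prompts over time.
    for _ in range(rounds):
        for prompt in prompts:
            rl_step(model, prompt)
```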

Key Insight from Karpathy

"Unlike supervised fine-tuning, solutions emerge from model experimentation rather than human examples."


5.2 Verifiable vs. Unverifiable Domains

The Critical Distinction

┌─────────────────────────────────────────────────────────────┐
│            VERIFIABLE vs UNVERIFIABLE DOMAINS                │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  VERIFIABLE (Objective Truth)                                │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ • Mathematics: 2+2=4 is checkable                      │  │
│  │ • Coding: Code runs or it doesn't                      │  │
│  │ • Logic puzzles: Solutions are provable                │  │
│  │ • Games: Win/lose is clear                             │  │
│  │                                                        │  │
│  │ Evaluation: Automated LLM judges or executors          │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
│  UNVERIFIABLE (Subjective Quality)                           │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ • Creative writing: Is this story good?                │  │
│  │ • Humor: Is this joke funny?                           │  │
│  │ • Summarization: Is this summary helpful?              │  │
│  │ • Conversation: Is this response appropriate?          │  │
│  │                                                        │  │
│  │ Evaluation: Requires human judgment → Reward Models    │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Why This Matters

| Domain       | RL Method                     | Scalability            |
|--------------|-------------------------------|------------------------|
| Verifiable   | Direct evaluation             | Highly scalable        |
| Unverifiable | Human feedback → Reward model | Limited, then scalable |

5.3 RL for Verifiable Domains

The Process

For math, coding, and logic:

1. Generate 100-1000 attempts per problem
2. Execute/verify each solution
3. Identify which ones are correct
4. Train model on correct solutions
5. Repeat with harder problems

Example: Math Problem RL

Problem: What is the integral of x²dx?

Attempt 1: x³/3 + C  ✓ (Verified correct)
Attempt 2: x³ + C    ✗ (Missing /3)
Attempt 3: 2x        ✗ (Derivative, not integral)
Attempt 4: x³/3 + C  ✓ (Verified correct)
...

Training: Reinforce patterns from attempts 1 and 4
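
For this toy problem the verification step can be fully automatic. The sketch below is my illustration (not from the lecture) and uses SymPy: an attempt counts as correct if differentiating it recovers the integrand x².

```python
# Automatic verification of the integral attempts above: an attempt is
# "correct" if its derivative equals the integrand x**2 (constant C allowed).
import sympy

x, C = sympy.symbols("x C")
integrand = x**2

attempts = [
    "x**3/3 + C",   # attempt 1
    "x**3 + C",     # attempt 2
    "2*x",          # attempt 3
    "x**3/3 + C",   # attempt 4
]

def is_correct(candidate: str) -> bool:
    expr = sympy.sympify(candidate)
    # Differentiate the candidate and check it matches the integrand.
    return sympy.simplify(sympy.diff(expr, x) - integrand) == 0

correct = [a for a in attempts if is_correct(a)]
print(correct)  # ['x**3/3 + C', 'x**3/3 + C'] -> only these get reinforced
```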

Benefits

  • Unlimited practice: Generate millions of problems
  • Objective feedback: No human annotation needed
  • Discovery: Models find solutions humans wouldn't think of

5.4 RLHF: Human Feedback for Unverifiable Domains

The Challenge

How do you train a model to write better jokes when "funny" is subjective?

RLHF Architecture

┌─────────────────────────────────────────────────────────────┐
│                        RLHF PIPELINE                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  PHASE 1: Collect Human Preferences                          │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Prompt: "Tell me a joke about programmers"             │  │
│  │                                                        │  │
│  │ Response A: "Why do programmers prefer dark mode?..."  │  │
│  │ Response B: "A programmer walks into a bar..."         │  │
│  │                                                        │  │
│  │ Human: Response A is funnier (A > B)                   │  │
│  └───────────────────────────────────────────────────────┘  │
│                           ↓                                  │
│  PHASE 2: Train Reward Model                                 │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Input: (Prompt, Response)                              │  │
│  │ Output: Quality score (0.0 - 1.0)                      │  │
│  │                                                        │  │
│  │ Trained on thousands of human comparisons              │  │
│  │ Learns to predict "what would humans prefer?"          │  │
│  └───────────────────────────────────────────────────────┘  │
│                           ↓                                  │
│  PHASE 3: RL with Reward Model                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Generate response → Score with reward model → Train    │  │
│  │                                                        │  │
│  │ Now can run RL at scale without humans in the loop     │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘
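
Phase 2 is commonly implemented as pairwise preference learning. Below is a minimal PyTorch-style sketch of that idea, assuming a hypothetical `reward_model` callable that maps a (prompt, response) pair to a scalar score; tokenization and batching are omitted.

```python
# Minimal sketch of the pairwise (Bradley-Terry style) loss used to train a
# reward model from human comparisons. `reward_model` is a hypothetical
# module that scores a (prompt, response) pair with a single scalar.
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    score_chosen = reward_model(prompt, chosen)      # scalar tensor
    score_rejected = reward_model(prompt, rejected)  # scalar tensor
    # Push the human-preferred response above the other one:
    # loss = -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

For the joke example above, `chosen` would be Response A and `rejected` Response B; after enough comparisons the model's scalar output approximates "what would humans prefer?".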

The Discriminator-Generator Gap

Key Insight: It's easier to recognize quality than to create it.

Humans can easily judge: "Response A is better than B"
Humans struggle to write: The perfect response from scratch

This gap enables RLHF to work:
  • Use cheap human judgments to train a reward model
  • Use the reward model to guide expensive generation training

5.5 RLHF Limitations

The Reward Hacking Problem

Models can "game" imperfect reward models:

┌─────────────────────────────────────────────────────────────┐
│                    REWARD HACKING                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Iteration 0:    Normal responses, moderate scores          │
│  Iteration 100:  Improved quality, higher scores            │
│  Iteration 500:  Still improving, scores increasing         │
│  Iteration 1000: Strange artifacts appear, scores still up  │
│  Iteration 2000: "Best" response is nonsensical gibberish   │
│                                                              │
│  Problem: Model found adversarial inputs that fool the       │
│  reward model but aren't actually better to humans.          │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Mitigation Strategies

| Strategy               | Implementation                              |
|------------------------|---------------------------------------------|
| Early Stopping         | Cap RL iterations (hundreds, not thousands) |
| KL Divergence          | Penalize deviation from base model          |
| Multiple Reward Models | Cross-validate across different models      |
| Human Spot-Checks      | Periodically verify with actual humans      |
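
The KL Divergence row is often implemented as a penalty subtracted from the reward-model score, anchoring the policy to the base (reference) model. A hedged sketch, assuming per-token log-probability tensors for the generated response:

```python
# Sketch of KL-regularized reward shaping (assumed setup, not the lecture's
# exact recipe): the reward-model score is reduced by how far the policy's
# token log-probs have drifted from a frozen reference model.
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Per-token divergence estimate between the policy and the frozen reference.
    kl_per_token = policy_logprobs - ref_logprobs
    # Subtract the accumulated drift from the reward-model score; beta controls
    # how strongly the policy is anchored to the base model.
    return rm_score - beta * kl_per_token.sum(dim=-1)
```

Raising beta keeps outputs closer to the base model at the cost of slower reward gains.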

Key Warning from Karpathy

"After 1,000 updates, the model's top joke might be complete nonsense. Requires capping iterations at hundreds to prevent over-optimization."


5.6 DeepSeek-R1: Emergent Reasoning

The Breakthrough

DeepSeek demonstrated that RL can create emergent reasoning behaviors:

┌─────────────────────────────────────────────────────────────┐
│                  DEEPSEEK-R1 EMERGENCE                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  BEFORE RL: Model gives immediate answers                    │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Q: What is 847 * 293?                                  │  │
│  │ A: 248,071                                             │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
│  AFTER RL: Model develops extended reasoning                 │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Q: What is 847 * 293?                                  │  │
│  │ A: Let me think about this step by step...             │  │
│  │    First, I'll break this down:                        │  │
│  │    847 * 300 = 254,100                                 │  │
│  │    847 * 7 = 5,929                                     │  │
│  │    254,100 - 5,929 = 248,171                           │  │
│  │    Wait, let me verify...                              │  │
│  │    [Extended reasoning chain]                          │  │
│  │    The answer is 248,171.                              │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
│  KEY: The "step by step" behavior EMERGED from RL,           │
│  not from explicit training examples.                        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
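
DeepSeek-R1's reported recipe scores only the final answer (plus a format check on the thinking tags) rather than the reasoning itself, which is why the step-by-step behavior has to emerge on its own. A hedged sketch of such a rule-based reward; the tag format and the 0.1 partial credit are illustrative assumptions, not the published values:

```python
# Hedged sketch of a rule-based reward in the spirit of DeepSeek-R1:
# only answer correctness and basic formatting are scored, so extended
# reasoning has to emerge because it improves accuracy, not because it
# is rewarded directly. Tag format and reward values are illustrative.
import re

def reasoning_reward(completion: str, ground_truth: str) -> float:
    match = re.search(r"<think>.*?</think>\s*(.+)", completion, re.DOTALL)
    if not match:
        return 0.0                                        # format check failed
    final_answer = match.group(1).strip()
    return 1.0 if final_answer == ground_truth else 0.1   # accuracy reward

example = "<think>847*300 = 254,100; 847*7 = 5,929; 254,100 - 5,929 = 248,171</think> 248171"
print(reasoning_reward(example, "248171"))  # 1.0
```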

"Aha Moments"

DeepSeek models exhibit cognitive strategies that weren't taught:

| Emergent Behavior | Description                              |
|-------------------|------------------------------------------|
| Backtracking      | "Wait, that doesn't seem right..."       |
| Reframing         | "Let me approach this differently..."    |
| Verification      | "Let me double-check this result..."     |
| Extended Thinking | Using many more tokens for hard problems |

Key Insight

"It's not something that you can explicitly teach the model through just training on a dataset. It's something that the model has to figure out on its own through reinforcement learning."


5.7 AlphaGo: The Blueprint

Why AlphaGo Matters for LLMs

DeepMind's AlphaGo demonstrated RL's power to discover superhuman strategies:

┌─────────────────────────────────────────────────────────────┐
│                     MOVE 37                                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  In Game 2 against Lee Sedol, AlphaGo played "Move 37"       │
│                                                              │
│  • Professional Go players estimated this move had a         │
│    1-in-10,000 chance of being played by a human            │
│  • It was considered unconventional/wrong by experts        │
│  • It turned out to be brilliant and won the game           │
│                                                              │
│  IMPLICATION FOR LLMs:                                       │
│  RL can discover reasoning strategies beyond human           │
│  intuition. Models might find novel problem-solving          │
│  approaches that humans wouldn't teach them.                 │
│                                                              │
└─────────────────────────────────────────────────────────────┘

The Lesson

LLMs + RL could:

  • Develop novel mathematical proof strategies
  • Find unexpected coding patterns
  • Discover reasoning approaches humans haven't conceived

5.8 Current RL Landscape

Research Status

| Aspect                 | Status                            |
|------------------------|-----------------------------------|
| Pre-training           | Well understood, established      |
| Supervised Fine-tuning | Well understood, established      |
| Reinforcement Learning | Active research, rapidly evolving |

Key Players

| Organization | Approach                | Transparency |
|--------------|-------------------------|--------------|
| OpenAI       | Extensive RL research   | Proprietary  |
| Anthropic    | Constitutional AI, RLHF | Semi-open    |
| DeepSeek     | Pure RL reasoning       | Open papers  |
| Google       | RL for Gemini           | Proprietary  |

Why DeepSeek's Release Mattered

"That is why the release of DeepSeek was such a big deal - it openly shared RL methodology that others keep proprietary."


5.9 Key Takeaways

Summary

| Concept            | Key Point                                     |
|--------------------|-----------------------------------------------|
| RL Advantage       | Discovers solutions beyond training examples  |
| Verifiable Domains | Direct evaluation enables scalable RL         |
| RLHF               | Reward models approximate human judgment      |
| Reward Hacking     | Over-optimization creates adversarial outputs |
| Emergence          | Novel reasoning behaviors emerge from RL      |
| AlphaGo Insight    | RL finds superhuman strategies                |

For AI Analytics Platforms

Monitoring Insights:

  1. Reasoning Length: Track token count in reasoning chains (emerging behavior)
  2. Self-Correction Detection: Identify backtracking/verification patterns (see the sketch after this list)
  3. RL Training Metrics: If training your own, monitor reward scores carefully
  4. Reward Model Drift: Watch for gaming/adversarial outputs over time
  5. Capability Emergence: Track new behaviors appearing in updated models
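
A rough sketch of items 1 and 2 above; the marker phrases are illustrative, not an exhaustive or validated list.

```python
# Rough monitoring heuristics: approximate reasoning length and count
# self-correction markers in a model response. Marker list is illustrative.

SELF_CORRECTION_MARKERS = [
    "wait,", "let me double-check", "let me verify",
    "that doesn't seem right", "let me approach this differently",
]

def reasoning_metrics(response: str) -> dict:
    lowered = response.lower()
    return {
        # Word count as a crude proxy for reasoning-chain length.
        "approx_length": len(response.split()),
        # How often the response backtracks or re-checks itself.
        "self_corrections": sum(lowered.count(m) for m in SELF_CORRECTION_MARKERS),
    }

print(reasoning_metrics("Let me think step by step... Wait, let me verify that result."))
```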

Practical Applications

Applicable Learnings:

  1. Memory-Assisted Reasoning: Provide context that helps models reason better
  2. Verification Chains: Store intermediate reasoning steps, not just conclusions
  3. Pattern Recognition: Learn from which memory retrievals lead to successful outcomes
  4. Emergent Understanding: Models may use retrieved memories in unexpected but effective ways

Practice Questions

  1. Why can RL discover solutions that supervised learning cannot?
  2. What's the difference between verifiable and unverifiable domains?
  3. How does reward hacking occur and how is it prevented?
  4. What does "emergent reasoning" mean in the context of DeepSeek?

Next Module

Module 6: Future Directions & Resources


Timestamps: 2:07:28 - Supervised Fine-tuning | 2:14:42 - Reinforcement Learning | 2:27:47 - DeepSeek-R1 | 2:42:07 - AlphaGo | 2:48:26 - RLHF