Duration: ~1 hour of video content | Timestamps: 2:07:28 - 3:09:39
5.1 The Reinforcement Learning Paradigm
Beyond Supervised Learning
| Approach | Method | Key Characteristic |
|---|---|---|
| Pre-training | Learn from internet text | No quality signal |
| Supervised FT | Learn from human examples | Limited by human solutions |
| Reinforcement Learning | Learn from trial and error | Discovers novel approaches |
The RL Loop
┌─────────────────────────────────────────────────────────────┐
│ REINFORCEMENT LEARNING LOOP │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ PROMPT │────────▶│ GENERATE │ │
│ │ (Problem) │ │ (Many Solutions)│ │
│ └─────────────┘ └────────┬────────┘ │
│ │ │
│ ┌──────────────┴──────────────┐ │
│ ▼ ▼ ▼ │
│ Solution A Solution B Solution C │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ EVALUATION │ │
│ │ (Which solutions are correct?) │ │
│ └─────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ✓ Good ✗ Bad ✓ Good │
│ │ │ │
│ └──────────────┬───────────────┘ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ TRAIN ON GOOD SOLUTIONS │ │
│ │ (Reinforce what works) │ │
│ └─────────────────────────────┘ │
│ │ │
│ ▼ │
│ REPEAT CYCLE │
│ │
└─────────────────────────────────────────────────────────────┘
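As a mental model, the whole loop fits in a few lines. Below is a minimal sketch of the diagram above, assuming hypothetical `model.generate`, `model.fine_tune`, and `is_correct` helpers rather than any specific framework:

```python
# Minimal sketch of the RL loop above, not a production trainer.
# `model.generate`, `model.fine_tune`, and `is_correct` are hypothetical
# stand-ins for your actual sampling API, trainer, and verifier.

def rl_round(model, problems, samples_per_problem=16):
    good_solutions = []
    for problem in problems:
        # GENERATE: many candidate solutions for the same prompt
        candidates = [model.generate(problem) for _ in range(samples_per_problem)]
        # EVALUATION: keep only candidates the verifier marks correct
        good_solutions += [(problem, c) for c in candidates if is_correct(problem, c)]
    # TRAIN ON GOOD SOLUTIONS: reinforce what works
    model.fine_tune(good_solutions)
    return model  # REPEAT CYCLE: call again, ideally with new or harder problems
```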
Key Insight from Karpathy
"Unlike supervised fine-tuning, solutions emerge from model experimentation rather than human examples."
5.2 Verifiable vs. Unverifiable Domains
The Critical Distinction
┌─────────────────────────────────────────────────────────────┐
│ VERIFIABLE vs UNVERIFIABLE DOMAINS │
├─────────────────────────────────────────────────────────────┤
│ │
│ VERIFIABLE (Objective Truth) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • Mathematics: 2+2=4 is checkable │ │
│ │ • Coding: Code runs or it doesn't │ │
│ │ • Logic puzzles: Solutions are provable │ │
│ │ • Games: Win/lose is clear │ │
│ │ │ │
│ │ Evaluation: Automated LLM judges or executors │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ UNVERIFIABLE (Subjective Quality) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • Creative writing: Is this story good? │ │
│ │ • Humor: Is this joke funny? │ │
│ │ • Summarization: Is this summary helpful? │ │
│ │ • Conversation: Is this response appropriate? │ │
│ │ │ │
│ │ Evaluation: Requires human judgment → Reward Models │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Why This Matters
| Domain | RL Method | Scalability |
|---|---|---|
| Verifiable | Direct, automated evaluation | Highly scalable |
| Unverifiable | Human feedback → reward model | Bottlenecked by human labels at first; scalable once the reward model is trained |
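The distinction shows up directly in how the reward is computed. A minimal sketch, where `reward_model.score` is a hypothetical interface for the learned scorer discussed in the RLHF section below:

```python
# Verifiable: the reward is an objective check, no learned component needed.
def verifiable_reward(predicted_answer: str, reference_answer: str) -> float:
    """1.0 if the answer matches the known-correct reference, else 0.0."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0

# Unverifiable: the reward comes from a model trained to imitate human judgment.
def unverifiable_reward(prompt: str, response: str, reward_model) -> float:
    """Quality score from a learned reward model (see the RLHF section)."""
    return reward_model.score(prompt, response)  # hypothetical scorer, e.g. in [0, 1]
```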
5.3 RL for Verifiable Domains
The Process
For math, coding, and logic:
1. Generate 100-1000 attempts per problem
2. Execute/verify each solution
3. Identify which ones are correct
4. Train model on correct solutions
5. Repeat with harder problems
Example: Math Problem RL
Problem: What is the integral of x²dx?
Attempt 1: x³/3 + C ✓ (Verified correct)
Attempt 2: x³ + C ✗ (Missing /3)
Attempt 3: 2x ✗ (Derivative, not integral)
Attempt 4: x³/3 + C ✓ (Verified correct)
...
Training: Reinforce patterns from attempts 1 and 4
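For this particular problem, the "Execute/verify each solution" step can be fully automated: differentiate each candidate and compare with the integrand. A small sketch using sympy, where the candidate strings mirror attempts 1-4 above:

```python
# Automated verification for the integral example: a candidate antiderivative
# is correct if its derivative equals the original integrand x**2.
import sympy as sp

x, C = sp.symbols("x C")
integrand = x**2

candidates = ["x**3/3 + C", "x**3 + C", "2*x", "x**3/3 + C"]

for i, text in enumerate(candidates, start=1):
    expr = sp.sympify(text)
    correct = sp.simplify(sp.diff(expr, x) - integrand) == 0
    print(f"Attempt {i}: {text} -> {'correct' if correct else 'incorrect'}")
```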
Benefits
- Unlimited practice: Generate millions of problems
- Objective feedback: No human annotation needed
- Discovery: Models find solutions humans wouldn't think of
5.4 RLHF: Human Feedback for Unverifiable Domains
The Challenge
How do you train a model to write better jokes when "funny" is subjective?
RLHF Architecture
┌─────────────────────────────────────────────────────────────┐
│ RLHF PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ PHASE 1: Collect Human Preferences │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Prompt: "Tell me a joke about programmers" │ │
│ │ │ │
│ │ Response A: "Why do programmers prefer dark mode?..." │ │
│ │ Response B: "A programmer walks into a bar..." │ │
│ │ │ │
│ │ Human: Response A is funnier (A > B) │ │
│ └───────────────────────────────────────────────────────┘ │
│ ↓ │
│ PHASE 2: Train Reward Model │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Input: (Prompt, Response) │ │
│ │ Output: Quality score (0.0 - 1.0) │ │
│ │ │ │
│ │ Trained on thousands of human comparisons │ │
│ │ Learns to predict "what would humans prefer?" │ │
│ └───────────────────────────────────────────────────────┘ │
│ ↓ │
│ PHASE 3: RL with Reward Model │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Generate response → Score with reward model → Train │ │
│ │ │ │
│ │ Now can run RL at scale without humans in the loop │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
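Phase 2 typically comes down to a pairwise (Bradley-Terry style) objective: the reward model should score the human-preferred response higher than the rejected one. A minimal PyTorch sketch; how the reward model encodes (prompt, response) text into a scalar score is assumed, not shown:

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Maximize P(chosen > rejected) = sigmoid(score_chosen - score_rejected)."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Usage (hypothetical reward_model mapping text to a scalar score):
# score_chosen   = reward_model(prompt, response_a)  # the human-preferred response
# score_rejected = reward_model(prompt, response_b)
# loss = preference_loss(score_chosen, score_rejected)
# loss.backward()
```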
The Discriminator-Generator Gap
Key Insight: It's easier to recognize quality than to create it.
Humans can easily judge: "Response A is better than B"
Humans struggle to write: The perfect response from scratch
This gap enables RLHF to work:
- Use cheap human judgments to train reward model
- Use reward model to guide expensive generation training
5.5 RLHF Limitations
The Reward Hacking Problem
Models can "game" imperfect reward models:
┌─────────────────────────────────────────────────────────────┐
│ REWARD HACKING │
├─────────────────────────────────────────────────────────────┤
│ │
│ Iteration 0: Normal responses, moderate scores │
│ Iteration 100: Improved quality, higher scores │
│ Iteration 500: Still improving, scores increasing │
│ Iteration 1000: Strange artifacts appear, scores still up │
│ Iteration 2000: "Best" response is nonsensical gibberish │
│ │
│ Problem: Model found adversarial inputs that fool the │
│ reward model but aren't actually better to humans. │
│ │
└─────────────────────────────────────────────────────────────┘
Mitigation Strategies
| Strategy | Implementation |
|---|---|
| Early Stopping | Cap RL iterations (hundreds, not thousands) |
| KL Divergence | Penalize deviation from base model |
| Multiple Reward Models | Cross-validate across different models |
| Human Spot-Checks | Periodically verify with actual humans |
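The KL-divergence strategy is often implemented as simple reward shaping: subtract a penalty proportional to how far the policy's log-probabilities drift from a frozen reference (base) model. A sketch under those assumptions, with `rm_score` and the log-probabilities supplied by your own reward model and policy/reference models:

```python
# Sketch of the KL-divergence mitigation: penalize deviation from the base model.
def shaped_reward(rm_score: float,
                  policy_logprob: float,
                  reference_logprob: float,
                  beta: float = 0.1) -> float:
    """Reward-model score minus a penalty proportional to a per-sample KL estimate."""
    kl_estimate = policy_logprob - reference_logprob  # log p_policy - log p_reference
    return rm_score - beta * kl_estimate
```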
Key Warning from Karpathy
"After 1,000 updates, the model's top joke might be complete nonsense. Requires capping iterations at hundreds to prevent over-optimization."
5.6 DeepSeek-R1: Emergent Reasoning
The Breakthrough
DeepSeek demonstrated that RL can create emergent reasoning behaviors:
┌─────────────────────────────────────────────────────────────┐
│ DEEPSEEK-R1 EMERGENCE │
├─────────────────────────────────────────────────────────────┤
│ │
│ BEFORE RL: Model gives immediate answers │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Q: What is 847 * 293? │ │
│ │ A: 248,071 │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ AFTER RL: Model develops extended reasoning │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Q: What is 847 * 293? │ │
│ │ A: Let me think about this step by step... │ │
│ │ First, I'll break this down: │ │
│ │ 847 * 300 = 254,100 │ │
│ │ 847 * 7 = 5,929 │ │
│ │ 254,100 - 5,929 = 248,171 │ │
│ │ Wait, let me verify... │ │
│ │ [Extended reasoning chain] │ │
│ │ The answer is 248,171. │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ KEY: The "step by step" behavior EMERGED from RL, │
│ not from explicit training examples. │
│ │
└─────────────────────────────────────────────────────────────┘
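Note that the arithmetic in the box is exactly the kind of thing a verifiable-domain reward can confirm automatically, a quick check:

```python
# The decomposition in the box is correct, and the pre-RL "immediate" answer is not.
assert 847 * 300 - 847 * 7 == 847 * 293 == 248_171
print(847 * 293)              # 248171 -> matches the post-RL answer
print(248_071 == 847 * 293)   # False  -> the immediate answer was off by 100
```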
"Aha Moments"
DeepSeek models exhibit cognitive strategies that weren't taught:
| Emergent Behavior | Description |
|---|---|
| Backtracking | "Wait, that doesn't seem right..." |
| Reframing | "Let me approach this differently..." |
| Verification | "Let me double-check this result..." |
| Extended Thinking | Using many more tokens for hard problems |
Key Insight
"It's not something that you can explicitly teach the model through just training on a dataset. It's something that the model has to figure out on its own through reinforcement learning."
5.7 AlphaGo: The Blueprint
Why AlphaGo Matters for LLMs
DeepMind's AlphaGo demonstrated RL's power to discover superhuman strategies:
┌─────────────────────────────────────────────────────────────┐
│ MOVE 37 │
├─────────────────────────────────────────────────────────────┤
│ │
│ In Game 2 against Lee Sedol, AlphaGo played "Move 37" │
│ │
│ • Professional Go players estimated this move had a │
│ 1-in-10,000 chance of being played by a human │
│ • It was considered unconventional/wrong by experts │
│ • It turned out to be brilliant and won the game │
│ │
│ IMPLICATION FOR LLMs: │
│ RL can discover reasoning strategies beyond human │
│ intuition. Models might find novel problem-solving │
│ approaches that humans wouldn't teach them. │
│ │
└─────────────────────────────────────────────────────────────┘
The Lesson
LLMs + RL could:
- Develop novel mathematical proof strategies
- Find unexpected coding patterns
- Discover reasoning approaches humans haven't conceived
5.8 Current RL Landscape
Research Status
| Aspect | Status |
|---|---|
| Pre-training | Well understood, established |
| Supervised Fine-tuning | Well understood, established |
| Reinforcement Learning | Active research, rapidly evolving |
Key Players
| Organization | Approach | Transparency |
|---|---|---|
| OpenAI | Extensive RL research | Proprietary |
| Anthropic | Constitutional AI, RLHF | Semi-open |
| DeepSeek | Pure RL reasoning | Open papers |
| Google DeepMind | RL for Gemini | Proprietary |
Why DeepSeek's Release Mattered
"That is why the release of DeepSeek was such a big deal - it openly shared RL methodology that others keep proprietary."
5.9 Key Takeaways
Summary
| Concept | Key Point |
|---|---|
| RL Advantage | Discovers solutions beyond training examples |
| Verifiable Domains | Direct evaluation enables scalable RL |
| RLHF | Reward models approximate human judgment |
| Reward Hacking | Over-optimization creates adversarial outputs |
| Emergence | Novel reasoning behaviors emerge from RL |
| AlphaGo Insight | RL finds superhuman strategies |
For AI Analytics Platforms
Monitoring Insights:
- Reasoning Length: Track token count in reasoning chains (an emergent behavior)
- Self-Correction Detection: Identify backtracking/verification patterns (see the sketch after this list)
- RL Training Metrics: If training your own, monitor reward scores carefully
- Reward Model Drift: Watch for gaming/adversarial outputs over time
- Capability Emergence: Track new behaviors appearing in updated models
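A rough heuristic for the self-correction item above: count backtracking/verification markers in a reasoning trace. The phrase list is an illustrative assumption, not an exhaustive taxonomy:

```python
# Count backtracking/verification markers in a model's reasoning trace.
import re

SELF_CORRECTION_PATTERNS = [
    r"\bwait\b",
    r"let me (re-)?check",
    r"let me verify",
    r"that (doesn't|does not) seem right",
    r"let me approach this differently",
]

def count_self_corrections(reasoning_text: str) -> int:
    text = reasoning_text.lower()
    return sum(len(re.findall(p, text)) for p in SELF_CORRECTION_PATTERNS)

# Example:
# count_self_corrections("254,100 - 5,929 = 248,171. Wait, let me verify...")  # -> 2
```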
Practical Applications
Applicable Learnings:
- Memory-Assisted Reasoning: Provide context that helps models reason better
- Verification Chains: Store intermediate reasoning steps, not just conclusions
- Pattern Recognition: Learn from which memory retrievals lead to successful outcomes
- Emergent Understanding: Models may use retrieved memories in unexpected but effective ways
Practice Questions
- Why can RL discover solutions that supervised learning cannot?
- What's the difference between verifiable and unverifiable domains?
- How does reward hacking occur and how is it prevented?
- What does "emergent reasoning" mean in the context of DeepSeek?
Next Module
→ Module 6: Future Directions & Resources
Timestamps: 2:07:28 - Supervised Fine-tuning | 2:14:42 - Reinforcement Learning | 2:27:47 - DeepSeek-R1 | 2:42:07 - AlphaGo | 2:48:26 - RLHF