Module 03

Post-Training & Fine-Tuning

Duration: ~45 minutes of video content | Timestamps: 0:59:23 - 2:07:28

3.1 Supervised Fine-Tuning (SFT)

The Goal

Transform a base model (token simulator) into a helpful assistant through exposure to high-quality conversations.

Base Model                        Assistant Model
     │                                  │
     │  + Conversation Dataset          │
     │  + Chat Template                 │
     │  + SFT Training                  │
     ▼                                  ▼
"Internet text     ───────────▶    "Helpful, honest,
 autocomplete"                      and harmless"

Training Process

Same algorithm as pre-training, different data:

  1. Use curated conversation examples instead of raw internet text
  2. Train model to predict assistant responses given user queries
  3. Much smaller datasets (thousands vs. billions of examples)
  4. Much cheaper compute (hours vs. months)
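
The objective is still next-token prediction; the only twist is that the loss is typically computed on the assistant's tokens only (implementations vary). A minimal sketch of that masking step, assuming PyTorch - the function name and `assistant_mask` are illustrative, not a specific library API:

import torch.nn.functional as F

def sft_loss(logits, input_ids, assistant_mask):
    # logits:         (batch, seq_len, vocab) model outputs
    # input_ids:      (batch, seq_len) token ids of the full conversation
    # assistant_mask: (batch, seq_len) bool, True on assistant-turn tokens
    shift_logits = logits[:, :-1, :]           # position t predicts token t+1
    shift_labels = input_ids[:, 1:].clone()
    shift_mask = assistant_mask[:, 1:]

    # Ignore user/system tokens so only assistant responses are trained on.
    shift_labels[~shift_mask] = -100           # ignore_index for cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )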

3.2 Conversation Data & Chat Templates

Chat Template Structure

Special tokens organize conversations:

<|im_start|>system
You are a helpful AI assistant.
<|im_end|>
<|im_start|>user
What is the capital of France?
<|im_end|>
<|im_start|>assistant
The capital of France is Paris.
<|im_end|>

Special Tokens

Token            Purpose
---------------  -------------------------------
`<|im_start|>`   Marks turn beginning
`<|im_end|>`     Marks turn ending
system           Sets assistant behavior/persona
user             Human input
assistant        Model response

Important: These special tokens are newly introduced during post-training - the base model never saw them during pre-training.
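
A sketch of how such a template is applied in practice, assuming the Hugging Face transformers library and a model whose tokenizer ships a ChatML-style template (the model name here is illustrative):

from transformers import AutoTokenizer

# Any chat model whose tokenizer defines a ChatML-style template will do.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# tokenize=False returns the rendered string with <|im_start|>/<|im_end|>;
# add_generation_prompt=True appends the opening of the assistant turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)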

Dataset Sources

Type           Example          Method
-------------  ---------------  -------------------------------
Human-Curated  OASST1           Paid annotators on Upwork/Scale
Synthetic      UltraChat        LLMs generate conversations
Hybrid         Modern datasets  Human seed + LLM expansion

OpenAI's Labeling Instructions

Annotators follow guidelines for "helpful, truthful, and harmless" responses:

  • Provide accurate information
  • Acknowledge uncertainty when appropriate
  • Refuse harmful requests politely
  • Maintain consistent persona

Key Insight: Chat templates provide a framework for how conversation context should be structured when retrieved from memory and injected into prompts.
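
A minimal sketch of that pattern: retrieved memories are placed into the message list that the chat template will render (variable names and memory contents are illustrative):

retrieved_memories = [
    "User prefers concise answers.",
    "User's project uses PostgreSQL 16.",
]

messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant.\n"
                   "Relevant context from memory:\n"
                   + "\n".join(f"- {m}" for m in retrieved_memories),
    },
    {"role": "user", "content": "How should I index the orders table?"},
]
# `messages` is then rendered with the chat template shown above.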


3.3 LLM Psychology

The "Mind" of an LLM

Karpathy introduces the concept of LLM Psychology - mental models for understanding model behavior:

LLM Cognitive Model:

Parameters (Vague Recollection)  Context Window (Working Memory)
-------------------------------  -------------------------------
Training patterns                Current conversation
General knowledge                Explicit information
Compressed, lossy                Direct, verbatim
Always available                 Limited capacity

Metaphor: Parameters are like something you read months ago. Context window is like notes in front of you now.

Key Psychological Traits

Trait                 Description                                Implication
--------------------  -----------------------------------------  ----------------------------------------------
People Pleasing       Models tend to agree and provide answers   May hallucinate rather than say "I don't know"
Pattern Matching      Continues training patterns                Can be exploited with few-shot examples
No Persistent Memory  Each conversation is fresh                 Cannot learn from previous sessions
Jagged Intelligence   Brilliant in some areas, fails in others   Don't trust uniformly

3.4 Knowledge Architecture

Two Types of Knowledge

1. Parametric Knowledge (Parameters)

  • Encoded during training
  • Compressed and approximate
  • Like human long-term memory
  • Cannot be updated without retraining

2. Contextual Knowledge (Context Window)

  • Provided during inference
  • Exact and complete
  • Like human working memory
  • Limited by context length

Practical Implications

# Less Reliable: Relying on parametric knowledge
prompt = "What were Apple's Q3 2024 earnings?"
# Model may hallucinate outdated/wrong numbers

# More Reliable: Providing context
prompt = """
Based on this earnings report:
[Paste actual earnings report here]

What were Apple's Q3 2024 earnings?
"""
# Model can cite exact numbers from context

Key Insight from Karpathy

"Pasting information directly into context windows produces higher-quality outputs than relying on parametric knowledge."

Key Insight: This validates the core architecture - retrieving relevant memories and injecting them into context is MORE reliable than expecting the model to "remember" from training. Cloud brain enables this extended, accurate memory.
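
A toy sketch of the retrieve-then-inject flow (the scoring here is deliberately naive; a real system would use embeddings and a vector index, and all names are illustrative):

def retrieve(query: str, memory: list[str], k: int = 3) -> list[str]:
    # Toy relevance score: number of words shared with the query.
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(memory, key=score, reverse=True)[:k]

def build_prompt(query: str, memory: list[str]) -> str:
    context = "\n".join(f"- {doc}" for doc in retrieve(query, memory))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

The model then answers from contextual (precise) rather than parametric (vague) knowledge.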


3.5 Computational Limitations

Tokens for Thinking

Models need tokens to think - they cannot complete a complex computation within the prediction of a single token.

Bad Pattern (immediate answer):

User: What is 17 * 24?
Assistant: 408

[Problem: Answer committed before computation]

Good Pattern (step-by-step):

User: What is 17 * 24?
Assistant: Let me calculate step by step:
- 17 * 20 = 340
- 17 * 4 = 68
- 340 + 68 = 408

The answer is 408.

Why This Matters

Each token prediction happens with a fixed computational budget. Complex reasoning requires distributing computation across multiple tokens.

Practical Tip: Prompt for step-by-step reasoning, especially for:

  • Math problems
  • Logic puzzles
  • Multi-step planning
  • Code debugging

3.6 Tokenization Limitations

Spelling & Counting Failures

Models struggle with character-level tasks:

User: How many 'r's are in 'strawberry'?
Model: There are 2 r's in strawberry.  [WRONG - there are 3]

User: Spell 'banana' backwards.
Model: ananab  [May fail due to tokenization]

Why This Happens

Tokenization doesn't preserve character boundaries:

"strawberry" → ["straw", "berry"]  # Model doesn't see individual letters

Solutions

  1. Use code: Ask model to write Python to solve it
  2. External tools: Let model call character-counting functions
  3. Explicit breakdown: Have model list characters one by one
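
Solution 1 in code form, plus what the explicit breakdown of solution 3 looks like (code operates on characters rather than tokens, so these tasks become trivial):

word = "strawberry"
print(word.count("r"))      # 3 -- character-level counting is exact in code
print(word[::-1])           # "yrrebwarts" -- exact reversal

# Explicit breakdown: list the characters one by one.
for i, ch in enumerate(word, start=1):
    print(i, ch)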

3.7 Jagged Intelligence

The Phenomenon

LLMs exhibit jagged intelligence - brilliant at some tasks, surprisingly bad at others:

┌─────────────────────────────────────────────────────────────┐
│                 JAGGED INTELLIGENCE                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  CAPABILITY                                                  │
│       ▲                                                      │
│  High │    ████        ████████        ██                   │
│       │    ████        ████████        ██                   │
│       │    ████            ████        ██     ██            │
│       │    ████  ██        ████  ██    ██     ██            │
│  Low  │    ████  ██        ████  ██    ██  ██ ██            │
│       └─────────────────────────────────────────────────▶   │
│            A     B         C     D      E    F  G           │
│                        TASKS                                 │
│                                                              │
│  Example: Solve PhD-level physics (A) but fail 7*8 (B)      │
└─────────────────────────────────────────────────────────────┘

Examples

Success                     Failure
--------------------------  ---------------------------
Write complex code          Count letters
Explain quantum physics     Simple arithmetic errors
Synthesize research papers  Remember conversation start
Generate creative fiction   Consistent factual details

Key Insight

"Even high-performing models can exhibit inexplicable errors in simple tasks. Don't trust LLMs uniformly across all domains."

For AI Analytics: This creates a critical monitoring requirement - track task-specific performance, not just overall metrics. Different tasks have different reliability profiles.
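
One way to act on this, as a rough sketch (class and method names are hypothetical, not a specific analytics API):

from collections import defaultdict

class TaskMetrics:
    """Tracks error rates per task type instead of one global metric."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, task_type: str, is_error: bool) -> None:
        self.totals[task_type] += 1
        self.errors[task_type] += int(is_error)

    def error_rate(self, task_type: str) -> float:
        total = self.totals[task_type]
        return self.errors[task_type] / total if total else 0.0

metrics = TaskMetrics()
metrics.record("arithmetic", is_error=True)
metrics.record("code_generation", is_error=False)
print(metrics.error_rate("arithmetic"))   # 1.0 -- reliability is task-specific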


3.8 Key Takeaways

Summary

Concept              Key Point
-------------------  --------------------------------------------
SFT                  Train on conversations to create assistants
Chat Templates       Special tokens structure multi-turn dialogue
Knowledge Types      Parameters (vague) vs. Context (precise)
Thinking Tokens      Complex tasks need step-by-step reasoning
Tokenization         Character-level tasks are problematic
Jagged Intelligence  Capabilities are uneven and unpredictable

For AI Analytics Platforms

Monitoring Recommendations:

  1. Task Classification: Categorize prompts by type to track per-category performance
  2. Reasoning Detection: Identify whether model used step-by-step reasoning
  3. Context Utilization: Measure how much provided context was used in response
  4. Failure Pattern Analysis: Track which task types have highest error rates

Practical Applications

Applicable Learnings:

  1. Context > Parameters: Always inject relevant memories into context
  2. Structured Retrieval: Use chat template patterns for memory injection
  3. Task-Aware Memory: Different tasks may need different memory retrieval strategies
  4. Step-by-Step Logging: Capture reasoning chains for better learning

Practice Questions

  1. Why is post-training much faster than pre-training?
  2. What's the difference between parametric and contextual knowledge?
  3. Why do models need "tokens to think"?
  4. How does jagged intelligence affect production reliability?

Next Module

Module 4: Hallucinations & Mitigations


Timestamps: 0:59:23 - Pre to Post-Training | 1:01:06 - Post-Training Data | 1:41:46 - Knowledge of Self | 1:46:56 - Tokens for Thinking | 2:01:11 - Tokenization Limitations | 2:04:53 - Jagged Intelligence