Duration: ~30 minutes of video content
Timestamps: 0:26:01 - 0:59:23
2.1 The Inference Process
How Text Generation Works
During inference (generation), the model produces text one token at a time:
┌──────────────────────────────────────────────────────────────┐
│ AUTOREGRESSIVE GENERATION │
├──────────────────────────────────────────────────────────────┤
│ │
│ Step 1: "The quick brown" │
│ → Predict next: [fox: 0.42, dog: 0.15, cat: 0.08] │
│ → Sample: "fox" │
│ │
│ Step 2: "The quick brown fox" │
│ → Predict next: [jumps: 0.38, runs: 0.22, ...] │
│ → Sample: "jumps" │
│ │
│ Step 3: "The quick brown fox jumps" │
│ → Continue until stop condition... │
│ │
└──────────────────────────────────────────────────────────────┘
Key Properties
- One token at a time: Each new token depends on all previous tokens
- Probability distribution: Model outputs probabilities for ALL vocabulary tokens
- Sampling: The next token is sampled from this distribution, so the highest-probability token is not always chosen
- Autoregressive: Each generated token is appended to the input for the next prediction (see the sketch below)
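The loop below is a minimal sketch of that process; `model` (returning a probability for every vocabulary token) and `tokenizer` are hypothetical stand-ins for whatever backend is in use, not a specific library.

```python
import random

def generate(model, tokenizer, prompt, max_new_tokens=50, stop_token="<|endoftext|>"):
    """Sketch of autoregressive decoding: one token per step, fed back as input."""
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model(tokens)                 # probabilities for ALL vocabulary tokens
        next_token = random.choices(range(len(probs)), weights=probs)[0]  # sample one
        tokens.append(next_token)             # output becomes part of the next input
        if tokenizer.decode([next_token]) == stop_token:
            break                             # stop condition reached
    return tokenizer.decode(tokens)
```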
2.2 Stochastic Outputs
Why Outputs Vary
LLMs are stochastic (probabilistic), not deterministic:
# Same prompt, different outputs each time
prompt = "Write a poem about the ocean"
# Run 1: "The waves crash upon the shore..."
# Run 2: "Deep blue waters call to me..."
# Run 3: "Salty breeze and endless tides..."
Temperature & Sampling
Temperature controls randomness:
| Temperature | Behavior | Use Case |
|---|---|---|
| 0.0 | Deterministic (greedy) | Code, math, factual |
| 0.7 | Balanced creativity | General conversation |
| 1.0+ | High randomness | Creative writing, brainstorming |
Sampling Methods:
- Greedy: Always pick highest probability token
- Top-K: Sample from K highest probability tokens
- Top-P (Nucleus): Sample from the smallest set of tokens whose cumulative probability mass reaches P
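The sketch below shows, in plain NumPy, how temperature reshapes the distribution before sampling and how top-k and top-p then truncate it; `logits` is assumed to be the model's raw score vector over the whole vocabulary.

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=None, top_p=None):
    """Sketch of greedy, temperature, top-k, and top-p (nucleus) sampling."""
    if temperature == 0.0:
        return int(np.argmax(logits))               # greedy: always the top token

    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                            # softmax with temperature

    if top_k is not None:                           # keep only the K most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p is not None:                           # smallest set with cumulative mass >= top_p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        kept = np.zeros_like(probs)
        kept[keep] = probs[keep]
        probs = kept

    probs /= probs.sum()                            # renormalize after truncation
    return int(np.random.choice(len(probs), p=probs))
```

At temperature 0.0 the function collapses to greedy decoding, which is why the table above lists it as deterministic; higher temperatures flatten the distribution and make unlikely tokens more probable.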
Key Insight from Karpathy
"The process is stochastic, producing varied outputs through random sampling. This enables creativity but also potential inaccuracies or 'hallucinations.'"
For AI Analytics: Temperature and sampling parameters are critical metrics to log - they significantly affect output quality and consistency.
2.3 GPT-2: A Historical Baseline
GPT-2 Specifications (2019)
| Attribute | Value |
|---|---|
| Parameters | 1.6 billion |
| Context Length | 1,024 tokens |
| Training Tokens | ~100 billion |
| Training Cost | ~$40,000 |
| Vocabulary | 50,257 tokens |
Why GPT-2 Matters
- First "large" language model to gain public attention
- OpenAI initially withheld release citing misuse concerns
- Now considered small by modern standards
- Excellent learning baseline
Karpathy's Reproduction
Using his llm.c project:
- Reproduced GPT-2 for $672
- Optimized pipeline could reduce to ~$100
- Demonstrates democratization of AI training
2.4 Modern Base Models
Llama 3.1 (2024)
| Attribute | Value |
|---|---|
| Parameters | 405 billion |
| Context Length | 128,000 tokens |
| Training Data | ~15 trillion tokens |
| Status | Open weights |
Base Model vs. Assistant Model
┌─────────────────────────────────────────────────────────────┐
│ BASE MODEL │
├─────────────────────────────────────────────────────────────┤
│ Input: "The capital of France is" │
│ Output: "Paris. The capital of Germany is Berlin. The..." │
│ │
│ Behavior: Continues text in training data style │
│ Problem: Not helpful, no conversation, no safety │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ ASSISTANT MODEL │
├─────────────────────────────────────────────────────────────┤
│ Input: "What is the capital of France?" │
│ Output: "The capital of France is Paris." │
│ │
│ Behavior: Answers questions helpfully │
│ Training: Post-training on conversations │
└─────────────────────────────────────────────────────────────┘
Base Model Characteristics
Karpathy describes base models as "token simulators":
- Generate text matching internet patterns
- No inherent concept of "helpful" or "harmful"
- Require creative prompting for usefulness (few-shot examples)
- Hallucinate freely without constraints
Practical Demonstration
Base models can be made useful through prompting:
# Few-shot prompting to create Q&A behavior
User: What is 2+2?
Assistant: 4
User: What is the capital of Japan?
Assistant: Tokyo
User: How do I make coffee?
Assistant: [Model continues the pattern...]
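A sketch of how such a few-shot prompt might be assembled before being sent to a base (completion-style) model; the `complete()` call at the end is a hypothetical stand-in for whatever backend is in use.

```python
# Hypothetical few-shot prompt assembly for a base (completion) model.
EXAMPLES = [
    ("What is 2+2?", "4"),
    ("What is the capital of Japan?", "Tokyo"),
]

def build_few_shot_prompt(question: str) -> str:
    """Turn Q/A examples into a text pattern the base model will keep continuing."""
    lines = []
    for q, a in EXAMPLES:
        lines.append(f"User: {q}")
        lines.append(f"Assistant: {a}")
    lines.append(f"User: {question}")
    lines.append("Assistant:")              # the base model completes from here
    return "\n".join(lines)

prompt = build_few_shot_prompt("How do I make coffee?")
# response = complete(prompt, stop=["\nUser:"])  # hypothetical backend call
```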
2.5 From Pre-training to Post-training
The Transition
| Stage | Compute | Time | Purpose |
|---|---|---|---|
| Pre-training | Massive (millions of $) | Months | Learn language patterns |
| Post-training | Minimal (comparatively) | Hours-Days | Learn helpful behavior |
Key Insight
"Post-training is way cheaper than pre-training (e.g., months vs. hours). The algorithm remains unchanged; only parameters are fine-tuned."
This is crucial for understanding LLM economics:
- Pre-training: One-time massive investment
- Post-training: Relatively cheap customization
- Fine-tuning: Accessible to most organizations
2.6 Components of a Usable LLM
Two Essential Components
- Inference Code: The software that runs the model
- Model Weights: The learned parameters (billions of numbers)
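As a concrete example of the two components coming together, the sketch below loads the open GPT-2 weights into the Hugging Face `transformers` inference code (any other open-weight checkpoint would work the same way; "gpt2" is the smallest public variant).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Component 1: inference code (the transformers library)
# Component 2: model weights (downloaded checkpoint of learned parameters)
tokenizer = AutoTokenizer.from_pretrained("gpt2")       # 50,257-token vocabulary
model = AutoModelForCausalLM.from_pretrained("gpt2")    # fetches the weights

inputs = tokenizer("The quick brown", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0]))
```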
Accessing Models
| Type | Examples | Access |
|---|---|---|
| Proprietary | GPT-4, Claude, Gemini | API only |
| Open-Weight | Llama, Mistral, DeepSeek | Downloadable weights |
| Local | Via LM Studio, Ollama | Run on your hardware |
For KeenDreams: Understanding model access patterns helps design memory architectures that work across different LLM providers and local/cloud deployments.
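As one illustration of that provider-agnostic pattern, the sketch below talks to both a hosted API and a local OpenAI-compatible endpoint through the same `openai` Python SDK; the Ollama-style base URL and model name are assumptions for this example, not requirements.

```python
from openai import OpenAI

# One client interface, different backends: OpenAI-compatible endpoints are served
# both by proprietary providers and by local runners such as LM Studio or Ollama.
BACKENDS = {
    "proprietary": OpenAI(),  # reads OPENAI_API_KEY from the environment
    "local": OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),  # typical Ollama default
}

def ask(backend: str, model: str, question: str) -> str:
    client = BACKENDS[backend]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
    )
    return response.choices[0].message.content

# ask("local", "llama3.1", "What is the capital of France?")
```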
2.7 Key Takeaways
Summary
| Concept | Key Point |
|---|---|
| Inference | Token-by-token generation, autoregressive |
| Stochasticity | Outputs vary; controlled by temperature |
| Base Models | Token simulators, not assistants |
| Post-training | Cheap relative to pre-training |
| Access | Proprietary APIs vs. open weights |
For AI Analytics Platforms
Critical Metrics to Monitor:
- Token Generation Rate: Tokens per second for latency tracking
- Temperature Settings: Affects output consistency and quality
- Context Window Usage: How much of available context is consumed
- Output Length: Token count of responses
- Stop Reason: Why generation ended (length limit, stop token, etc.)
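A minimal sketch of capturing those fields per request; the `log_generation` helper and its field names are hypothetical, not a specific analytics API.

```python
import json
import time

def log_generation(model: str, temperature: float, context_limit: int,
                   prompt_tokens: int, completion_tokens: int,
                   stop_reason: str, started_at: float) -> dict:
    """Assemble one analytics record for a single generation request."""
    elapsed = time.time() - started_at
    record = {
        "model": model,
        "temperature": temperature,                       # sampling settings in use
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,           # output length
        "tokens_per_second": completion_tokens / elapsed if elapsed > 0 else None,
        "context_window_used": (prompt_tokens + completion_tokens) / context_limit,
        "stop_reason": stop_reason,                       # e.g. "stop", "length"
    }
    print(json.dumps(record))                             # or ship to your metrics store
    return record
```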
For KeenDreams
Applicable Learnings:
- Probabilistic Retrieval: Like model sampling, memory retrieval can use relevance scores
- Context Priming: Few-shot patterns work - KeenDreams can provide "example" memories
- Model Agnosticism: Design memory to work with any LLM backend
Practice Questions
- Why do LLMs produce different outputs for the same prompt?
- What makes a base model different from ChatGPT?
- How does temperature affect model outputs?
- Why is post-training much cheaper than pre-training?
Next Module
→ Module 3: Post-Training & Fine-Tuning
Timestamps: 0:26:01 - Inference | 0:31:09 - GPT-2 | 0:42:52 - Llama 3.1 Base Model | 0:59:23 - Pre to Post-Training