Duration: ~15 minutes of video content
Timestamps: 3:09:39 - 3:31:00
6.1 Multimodal Models
Beyond Text
The next generation of models handles multiple modalities natively:
┌─────────────────────────────────────────────────────────────┐
│ MULTIMODAL LLMs │
├─────────────────────────────────────────────────────────────┤
│ │
│ INPUT MODALITIES UNIFIED MODEL OUTPUT │
│ ┌─────────────┐ │
│ │ TEXT │─────┐ │
│ └─────────────┘ │ ┌─────────────────┐ │
│ ┌─────────────┐ │ │ │ Text │
│ │ IMAGES │─────┼───▶│ TRANSFORMER │───▶ Images │
│ └─────────────┘ │ │ (Unified) │ Audio │
│ ┌─────────────┐ │ │ │ Code │
│ │ AUDIO │─────┘ └─────────────────┘ │
│ └─────────────┘ │
│ ┌─────────────┐ │
│ │ VIDEO │─────┘ │
│ └─────────────┘ │
│ │
│ KEY: Same transformer architecture, extended tokenization │
│ │
└─────────────────────────────────────────────────────────────┘
How It Works
| Modality | Tokenization Approach |
|---|---|
| Text | Byte Pair Encoding → tokens |
| Images | Patch embeddings → tokens |
| Audio | Spectrogram frames → tokens |
| Video | Frame patches + temporal → tokens |
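The image row of the table above can be made concrete. The sketch below shows ViT-style patching: an image is cut into non-overlapping patches, and each flattened patch becomes one "token" position (in a real model, a learned linear projection then maps each patch vector into the transformer's embedding space). The function name is illustrative, not any specific model's API.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an image into non-overlapping patches and flatten each into
    a vector -- the input a learned embedding would map into the same
    token space the transformer uses for text."""
    h, w, c = image.shape
    patches = []
    for y in range(0, h - h % patch_size, patch_size):
        for x in range(0, w - w % patch_size, patch_size):
            patch = image[y:y + patch_size, x:x + patch_size, :]
            patches.append(patch.reshape(-1))  # flatten patch to 1-D
    return np.stack(patches)  # shape: (num_patches, patch_size**2 * c)

# A 224x224 RGB image yields a 14x14 grid of 16x16 patches = 196 "tokens"
image = np.zeros((224, 224, 3))
tokens = image_to_patch_tokens(image)
print(tokens.shape)  # (196, 768)
```

This is why multimodality is "not a fundamental change": once every modality is a token sequence, the same transformer consumes all of them.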
Key Insight from Karpathy
"At the baseline it's possible to tokenize audio and images and apply the same approaches as for LLMs. Not a fundamental change, just adding more tokens is required."
Current State (2025)
| Model | Modalities | Notes |
|---|---|---|
| GPT-4V | Text, Images | Image understanding |
| Gemini | Text, Images, Video, Audio | Native multimodal |
| Claude 3 | Text, Images | Vision analysis |
| Sora (OpenAI) | Text → Video | Video generation |
6.2 AI Agents
The Evolution
┌─────────────────────────────────────────────────────────────┐
│ FROM CHATBOTS TO AGENTS │
├─────────────────────────────────────────────────────────────┤
│ │
│ CHATBOT (Today) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • Single turn or short conversations │ │
│ │ • Immediate response required │ │
│ │ • No persistent state between sessions │ │
│ │ • Limited tool use │ │
│ └───────────────────────────────────────────────────────┘ │
│ ↓ │
│ AGENT (Emerging) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • Long-running tasks (hours, days) │ │
│ │ • Autonomous execution with error recovery │ │
│ │ • Persistent memory across sessions │ │
│ │ • Complex tool orchestration │ │
│ │ • Self-correction and planning │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Agent Capabilities
| Capability | Description | Status |
|---|---|---|
| Computer Use | Navigate desktop, use software | Emerging |
| Browsing | Research, fill forms, make purchases | Active |
| Code Execution | Write, run, debug code | Mature |
| Long-term Memory | Remember across sessions | Developing |
| Planning | Break down complex tasks | Improving |
The Vision
Natural language task completion:
User: "Book me a flight to Tokyo next month, find a hotel
near Shibuya under $200/night, and add it to my calendar."
Agent: Searches flights, compares options, books ticket,
researches hotels, makes reservation, creates calendar
events, sends confirmation email.
Key Insight: Agents will need persistent memory systems; the cloud brain becomes the infrastructure for agent long-term memory.
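The agent capabilities above reduce to a plan-act-observe loop. This is a minimal sketch with stub components: the tool, the scripted "LLM," and the action format are all hypothetical stand-ins for a real planner and real tool integrations.

```python
def run_agent(goal, tools, llm, max_steps=10):
    """Minimal agent loop: the model plans a step, a tool executes it,
    the observation is stored, and the loop repeats until done."""
    memory = []  # persistent scratchpad; real agents keep this across sessions
    for _ in range(max_steps):
        action = llm(goal, memory)                  # model decides next step
        if action["tool"] == "finish":
            return action["result"], memory
        observation = tools[action["tool"]](**action["args"])
        memory.append((action, observation))        # basis for self-correction
    return None, memory

# --- stub components for illustration only ---
def search_flights(destination):
    return f"cheapest flight to {destination}: $620"

def scripted_llm(goal, memory):
    # Stand-in for a real model: plan one tool call, then finish.
    if not memory:
        return {"tool": "search_flights", "args": {"destination": "Tokyo"}}
    return {"tool": "finish", "result": memory[-1][1]}

result, trace = run_agent("Book a flight to Tokyo",
                          {"search_flights": search_flights}, scripted_llm)
print(result)  # cheapest flight to Tokyo: $620
```

The `memory` list is the part that matters for this course: long-running tasks require that trace to survive between steps and between sessions.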
6.3 Test-Time Training
A New Paradigm
Current models: Parameters are frozen after training.
Future models: Can learn during inference.
┌─────────────────────────────────────────────────────────────┐
│ TEST-TIME TRAINING │
├─────────────────────────────────────────────────────────────┤
│ │
│ CURRENT: Static Parameters │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Training Data ──▶ Fixed Model ──▶ Inference │ │
│ │ (unchanging) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ FUTURE: Dynamic Parameters │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Training Data ──▶ Model ──┬──▶ Inference │ │
│ │ │ │ │
│ │ └──▶ Continue Learning ──┐ │ │
│ │ │ │ │
│ │ ◄──────────────────◄─┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Analogy: Like humans learning while sleeping │
│ │
└─────────────────────────────────────────────────────────────┘
Implications
- Models improve with use
- Personalization becomes deeper
- Edge cases get resolved over time
- Privacy concerns multiply
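The idea can be sketched at toy scale: a model whose parameter keeps updating from signals it sees during inference. This is a scalar stand-in, not a real proposal; actual test-time training applies gradient steps to (a subset of) an LLM's parameters.

```python
class TestTimeLearner:
    """Toy sketch of test-time training: a one-weight linear model
    that takes a gradient step on every inference it serves."""
    def __init__(self, w=0.0, lr=0.1):
        self.w, self.lr = w, lr

    def infer_and_learn(self, x, target):
        pred = self.w * x
        grad = 2 * (pred - target) * x   # d/dw of squared error
        self.w -= self.lr * grad         # parameters change at inference time
        return pred

model = TestTimeLearner()
for _ in range(50):                      # repeated use of the deployed model
    model.infer_and_learn(1.0, 2.0)     # it drifts toward the true mapping
print(round(model.w, 2))  # 2.0 -- the model improved through use
```

The implications list above falls out directly: weights now depend on usage, so personalization deepens, and what the model has absorbed from users becomes a privacy question.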
6.4 Staying Current
Karpathy's Recommended Resources
| Resource | Type | Purpose |
|---|---|---|
| LM Arena | Leaderboard | Model rankings via human comparison |
| AI News Newsletter | Newsletter | Comprehensive AI coverage |
| X/Twitter | Social | Real-time researcher updates |
Key Researchers to Follow
Building a quality follow list provides:
- Early access to breakthroughs
- Nuanced technical discussions
- Pre-print paper analysis
- Industry trend signals
6.5 Accessing LLMs
Proprietary Models (API)
| Provider | Models | Access |
|---|---|---|
| OpenAI | GPT-4, GPT-4o | api.openai.com |
| Anthropic | Claude 3/3.5 | anthropic.com |
| Google | Gemini | ai.google.dev |
Open-Weight Models (Download/API)
| Model | Provider | Access |
|---|---|---|
| Llama 3 | Meta | Together.ai, HuggingFace |
| DeepSeek | DeepSeek | deepseek.com |
| Mistral | Mistral AI | mistral.ai |
Local Deployment
| Tool | Description | Best For |
|---|---|---|
| LM Studio | GUI for local models | Beginners |
| Ollama | CLI for local models | Developers |
| vLLM | High-performance serving | Production |
Base Model Access
For research into pure pre-trained models:
- Hyperbolic platform provides base model access
- Useful for studying model behavior without post-training influence
6.6 Course Conclusion
The Three-Stage Pipeline
┌─────────────────────────────────────────────────────────────┐
│ COMPLETE LLM DEVELOPMENT PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: PRE-TRAINING │
│ ├─ Internet-scale data collection │
│ ├─ Tokenization (BPE, ~100k vocabulary) │
│ ├─ Transformer training (billions of parameters) │
│ └─ Result: Base model (token simulator) │
│ │
│ STAGE 2: POST-TRAINING (SFT) │
│ ├─ Conversation datasets (human or synthetic) │
│ ├─ Chat template formatting │
│ ├─ Fine-tuning on helpful behavior │
│ └─ Result: Assistant model │
│ │
│ STAGE 3: REINFORCEMENT LEARNING │
│ ├─ Verifiable domains: Direct evaluation │
│ ├─ Unverifiable domains: RLHF with reward models │
│ ├─ Emergent reasoning capabilities │
│ └─ Result: Reasoning model with novel capabilities │
│ │
└─────────────────────────────────────────────────────────────┘
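Stage 2's "chat template formatting" can be shown concretely: a conversation is flattened into the single token stream the model is fine-tuned on. The markers below are ChatML-style and purely illustrative; each model family (Llama, ChatML, etc.) defines its own special tokens.

```python
def apply_chat_template(messages):
    """Flatten a role-tagged conversation into one training/prompt string.
    Marker strings are illustrative; real templates use model-specific
    special tokens."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    out.append("<|im_start|>assistant\n")  # the model continues from here
    return "\n".join(out)

prompt = apply_chat_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
])
print(prompt)
```

During SFT the assistant's reply follows the final marker, and the loss teaches the base model to produce it; at inference the same template turns a "token simulator" into an assistant.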
Core Mental Models
- Parameters vs. Context: Vague recollection vs. working memory
- Tokens for Thinking: Complex tasks need computation space
- Jagged Intelligence: Trust varies by domain
- Tool Augmentation: Extend capabilities with external systems
- RL Discovery: Models can find superhuman strategies
The Bottom Line
"Models generate responses as token sequences, requiring pre-training and supervised fine-tuning, with reinforcement learning showing promise for reasoning."
6.7 Key Takeaways
Summary
| Topic | Direction |
|---|---|
| Multimodal | Unified text/image/audio/video models |
| Agents | Long-running autonomous task completion |
| Test-Time Training | Models that learn during use |
| Access | Proprietary APIs + open weights + local |
For AI Analytics Platforms
Forward-Looking Monitoring:
- Multimodal Metrics: Token types (text/image/audio) per request
- Agent Sessions: Track long-running task progress and success
- Model Drift: Monitor capability changes across versions
- Cost Optimization: Different modalities have different costs
Practical Applications
Strategic Implications:
- Multimodal Memory: Prepare to store image/audio context, not just text
- Agent Infrastructure: Cloud brain becomes critical for agent persistence
- Continuous Learning: Memory systems may need to integrate with test-time training
- Cross-Model Portability: Design memories that work across providers
Next Module
→ Module 7: AI Analytics Platform Considerations
Timestamps: 3:09:39 - Future Developments | 3:15:15 - Resources | 3:18:34 - Finding LLMs | 3:21:46 - Grand Summary