Duration: ~15 minutes of video content
Timestamps: 3:09:39 - 3:31:00
6.1 Multimodal Models
Beyond Text
The next generation of models handles multiple modalities natively:
┌─────────────────────────────────────────────────────────────┐
│ MULTIMODAL LLMs │
├─────────────────────────────────────────────────────────────┤
│ │
│ INPUT MODALITIES UNIFIED MODEL OUTPUT │
│ ┌─────────────┐ │
│ │ TEXT │─────┐ │
│ └─────────────┘ │ ┌─────────────────┐ │
│ ┌─────────────┐ │ │ │ Text │
│ │ IMAGES │─────┼───▶│ TRANSFORMER │───▶ Images │
│ └─────────────┘ │ │ (Unified) │ Audio │
│ ┌─────────────┐ │ │ │ Code │
│ │ AUDIO │─────┘ └─────────────────┘ │
│ └─────────────┘ │
│ ┌─────────────┐ │
│ │ VIDEO │─────┘ │
│ └─────────────┘ │
│ │
│ KEY: Same transformer architecture, extended tokenization │
│ │
└─────────────────────────────────────────────────────────────┘
How It Works
| Modality | Tokenization Approach |
|---|---|
| Text | Byte Pair Encoding → tokens |
| Images | Patch embeddings → tokens |
| Audio | Spectrogram frames → tokens |
| Video | Frame patches + temporal → tokens |
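The image row of the table above can be made concrete. The sketch below shows ViT-style patching: an image is cut into non-overlapping patches, and each flattened patch becomes one "token" position (in a real model, a learned linear projection then maps each patch vector into the transformer's embedding space). The function name is illustrative, not any specific model's API.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an image into non-overlapping patches and flatten each into
    a vector -- the input a learned embedding would map into the same
    token space the transformer uses for text."""
    h, w, c = image.shape
    patches = []
    for y in range(0, h - h % patch_size, patch_size):
        for x in range(0, w - w % patch_size, patch_size):
            patch = image[y:y + patch_size, x:x + patch_size, :]
            patches.append(patch.reshape(-1))  # flatten patch to 1-D
    return np.stack(patches)  # shape: (num_patches, patch_size**2 * c)

# A 224x224 RGB image yields a 14x14 grid of 16x16 patches = 196 "tokens"
image = np.zeros((224, 224, 3))
tokens = image_to_patch_tokens(image)
print(tokens.shape)  # (196, 768)
```

This is why multimodality is "not a fundamental change": once every modality is a token sequence, the same transformer consumes all of them.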
Key Insight from Karpathy
"At the baseline it's possible to tokenize audio and images and apply the same approaches as for LLMs. Not a fundamental change, just adding more tokens is required."
Current State (2025)
| Model | Modalities | Notes |
|---|---|---|
| GPT-4V | Text, Images | Image understanding |
| Gemini | Text, Images, Video, Audio | Native multimodal |
| Claude 3 | Text, Images | Vision analysis |
| Sora (OpenAI) | Text → Video | Video generation |
6.2 AI Agents
The Evolution
┌─────────────────────────────────────────────────────────────┐
│ FROM CHATBOTS TO AGENTS │
├─────────────────────────────────────────────────────────────┤
│ │
│ CHATBOT (Today) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • Single turn or short conversations │ │
│ │ • Immediate response required │ │
│ │ • No persistent state between sessions │ │
│ │ • Limited tool use │ │
│ └───────────────────────────────────────────────────────┘ │
│ ↓ │
│ AGENT (Emerging) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • Long-running tasks (hours, days) │ │
│ │ • Autonomous execution with error recovery │ │
│ │ • Persistent memory across sessions │ │
│ │ • Complex tool orchestration │ │
│ │ • Self-correction and planning │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Agent Capabilities
| Capability | Description | Status |
|---|---|---|
| Computer Use | Navigate desktop, use software | Emerging |
| Browsing | Research, fill forms, make purchases | Active |
| Code Execution | Write, run, debug code | Mature |
| Long-term Memory | Remember across sessions | Developing |
| Planning | Break down complex tasks | Improving |
The Vision
Natural language task completion:
User: "Book me a flight to Tokyo next month, find a hotel
near Shibuya under $200/night, and add it to my calendar."
Agent: Searches flights, compares options, books ticket,
researches hotels, makes reservation, creates calendar
events, sends confirmation email.
Key Insight: Agents will need persistent memory systems; the cloud brain becomes the infrastructure for agent long-term memory.
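The agent capabilities above reduce to a plan-act-observe loop. This is a minimal sketch with stub components: the tool, the scripted "LLM," and the action format are all hypothetical stand-ins for a real planner and real tool integrations.

```python
def run_agent(goal, tools, llm, max_steps=10):
    """Minimal agent loop: the model plans a step, a tool executes it,
    the observation is stored, and the loop repeats until done."""
    memory = []  # persistent scratchpad; real agents keep this across sessions
    for _ in range(max_steps):
        action = llm(goal, memory)                  # model decides next step
        if action["tool"] == "finish":
            return action["result"], memory
        observation = tools[action["tool"]](**action["args"])
        memory.append((action, observation))        # basis for self-correction
    return None, memory

# --- stub components for illustration only ---
def search_flights(destination):
    return f"cheapest flight to {destination}: $620"

def scripted_llm(goal, memory):
    # Stand-in for a real model: plan one tool call, then finish.
    if not memory:
        return {"tool": "search_flights", "args": {"destination": "Tokyo"}}
    return {"tool": "finish", "result": memory[-1][1]}

result, trace = run_agent("Book a flight to Tokyo",
                          {"search_flights": search_flights}, scripted_llm)
print(result)  # cheapest flight to Tokyo: $620
```

The `memory` list is the part that matters for this course: long-running tasks require that trace to survive between steps and between sessions.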
6.3 Test-Time Training
A New Paradigm
Current models: Parameters are frozen after training.
Future models: Can learn during inference.
┌─────────────────────────────────────────────────────────────┐
│ TEST-TIME TRAINING │
├─────────────────────────────────────────────────────────────┤
│ │
│ CURRENT: Static Parameters │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Training Data ──▶ Fixed Model ──▶ Inference │ │
│ │ (unchanging) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ FUTURE: Dynamic Parameters │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Training Data ──▶ Model ──┬──▶ Inference │ │
│ │ │ │ │
│ │ └──▶ Continue Learning ──┐ │ │
│ │ │ │ │
│ │ ◄──────────────────◄─┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Analogy: Like humans learning while sleeping │
│ │
└─────────────────────────────────────────────────────────────┘
Implications
- Models improve with use
- Personalization becomes deeper
- Edge cases get resolved over time
- Privacy concerns multiply
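The idea can be sketched at toy scale: a model whose parameter keeps updating from signals it sees during inference. This is a scalar stand-in, not a real proposal; actual test-time training applies gradient steps to (a subset of) an LLM's parameters.

```python
class TestTimeLearner:
    """Toy sketch of test-time training: a one-weight linear model
    that takes a gradient step on every inference it serves."""
    def __init__(self, w=0.0, lr=0.1):
        self.w, self.lr = w, lr

    def infer_and_learn(self, x, target):
        pred = self.w * x
        grad = 2 * (pred - target) * x   # d/dw of squared error
        self.w -= self.lr * grad         # parameters change at inference time
        return pred

model = TestTimeLearner()
for _ in range(50):                      # repeated use of the deployed model
    model.infer_and_learn(1.0, 2.0)     # it drifts toward the true mapping
print(round(model.w, 2))  # 2.0 -- the model improved through use
```

The implications list above falls out directly: weights now depend on usage, so personalization deepens, and what the model has absorbed from users becomes a privacy question.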
6.4 Staying Current
Karpathy's Recommended Resources
| Resource | Type | Purpose |
|---|---|---|
| LM Arena | Leaderboard | Model rankings via human comparison |
| AI News Newsletter | Newsletter | Comprehensive AI coverage |
| X/Twitter | Social | Real-time researcher updates |
Key Researchers to Follow
Building a quality follow list provides:
- Early access to breakthroughs
- Nuanced technical discussions
- Pre-print paper analysis
- Industry trend signals
6.5 Accessing LLMs
Proprietary Models (API)
| Provider | Models | Access |
|---|---|---|
| OpenAI | GPT-4, GPT-4o | api.openai.com |
| Anthropic | Claude 3/3.5 | anthropic.com |
| Google | Gemini | ai.google.dev |
Open-Weight Models (Download/API)
| Model | Provider | Access |
|---|---|---|
| Llama 3 | Meta | Together.ai, HuggingFace |
| DeepSeek | DeepSeek | deepseek.com |
| Mistral | Mistral AI | mistral.ai |
Local Deployment
| Tool | Description | Best For |
|---|---|---|
| LM Studio | GUI for local models | Beginners |
| Ollama | CLI for local models | Developers |
| vLLM | High-performance serving | Production |
Base Model Access
For research into pure pre-trained models:
- Hyperbolic platform provides base model access
- Useful for studying model behavior without post-training influence
6.6 Course Conclusion
The Three-Stage Pipeline
┌─────────────────────────────────────────────────────────────┐
│ COMPLETE LLM DEVELOPMENT PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: PRE-TRAINING │
│ ├─ Internet-scale data collection │
│ ├─ Tokenization (BPE, ~100k vocabulary) │
│ ├─ Transformer training (billions of parameters) │
│ └─ Result: Base model (token simulator) │
│ │
│ STAGE 2: POST-TRAINING (SFT) │
│ ├─ Conversation datasets (human or synthetic) │
│ ├─ Chat template formatting │
│ ├─ Fine-tuning on helpful behavior │
│ └─ Result: Assistant model │
│ │
│ STAGE 3: REINFORCEMENT LEARNING │
│ ├─ Verifiable domains: Direct evaluation │
│ ├─ Unverifiable domains: RLHF with reward models │
│ ├─ Emergent reasoning capabilities │
│ └─ Result: Reasoning model with novel capabilities │
│ │
└─────────────────────────────────────────────────────────────┘
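Stage 2's "chat template formatting" can be shown concretely: a conversation is flattened into the single token stream the model is fine-tuned on. The markers below are ChatML-style and purely illustrative; each model family (Llama, ChatML, etc.) defines its own special tokens.

```python
def apply_chat_template(messages):
    """Flatten a role-tagged conversation into one training/prompt string.
    Marker strings are illustrative; real templates use model-specific
    special tokens."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    out.append("<|im_start|>assistant\n")  # the model continues from here
    return "\n".join(out)

prompt = apply_chat_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
])
print(prompt)
```

During SFT the assistant's reply follows the final marker, and the loss teaches the base model to produce it; at inference the same template turns a "token simulator" into an assistant.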
Core Mental Models
- Parameters vs. Context: Vague recollection vs. working memory
- Tokens for Thinking: Complex tasks need computation space
- Jagged Intelligence: Trust varies by domain
- Tool Augmentation: Extend capabilities with external systems
- RL Discovery: Models can find superhuman strategies
The Bottom Line
"Models generate responses as token sequences, requiring pre-training and supervised fine-tuning, with reinforcement learning showing promise for reasoning."
6.7 Key Takeaways
Summary
| Topic | Direction |
|---|---|
| Multimodal | Unified text/image/audio/video models |
| Agents | Long-running autonomous task completion |
| Test-Time Training | Models that learn during use |
| Access | Proprietary APIs + open weights + local |
For AI Analytics Platforms
Forward-Looking Monitoring:
- Multimodal Metrics: Token types (text/image/audio) per request
- Agent Sessions: Track long-running task progress and success
- Model Drift: Monitor capability changes across versions
- Cost Optimization: Different modalities have different costs
Practical Applications
Strategic Implications:
- Multimodal Memory: Prepare to store image/audio context, not just text
- Agent Infrastructure: Cloud brain becomes critical for agent persistence
- Continuous Learning: Memory systems may need to integrate with test-time training
- Cross-Model Portability: Design memories that work across providers
Next Module
→ Module 7: AI Analytics Platform Considerations
Timestamps: 3:09:39 - Future Developments | 3:15:15 - Resources | 3:18:34 - Finding LLMs | 3:21:46 - Grand Summary