Module 06

Future Directions & Resources

Duration: ~15 minutes of video content
Timestamps: 3:09:39 - 3:31:00

6.1 Multimodal Models

Beyond Text

The next generation of models handles multiple modalities natively:

┌─────────────────────────────────────────────────────────────┐
│                   MULTIMODAL LLMs                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  INPUT MODALITIES          UNIFIED MODEL        OUTPUT       │
│  ┌─────────────┐                                            │
│  │    TEXT     │─────┐                                      │
│  └─────────────┘     │    ┌─────────────────┐               │
│  ┌─────────────┐     │    │                 │    Text       │
│  │   IMAGES    │─────┼───▶│   TRANSFORMER   │───▶ Images    │
│  └─────────────┘     │    │     (Unified)   │    Audio      │
│  ┌─────────────┐     │    │                 │    Code       │
│  │   AUDIO     │─────┤    └─────────────────┘               │
│  └─────────────┘     │                                      │
│  ┌─────────────┐     │                                      │
│  │   VIDEO     │─────┘                                      │
│  └─────────────┘                                            │
│                                                              │
│  KEY: Same transformer architecture, extended tokenization   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

How It Works

Modality   Tokenization Approach
--------   ---------------------
Text       Byte Pair Encoding → tokens
Images     Patch embeddings → tokens
Audio      Spectrogram frames → tokens
Video      Frame patches + temporal → tokens
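The "images → tokens" row can be made concrete: a ViT-style tokenizer cuts an image into fixed-size patches and flattens each one into a vector, yielding a sequence of "tokens" analogous to BPE tokens for text. A minimal sketch with NumPy (the 224×224 image size and 16-pixel patches are illustrative):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an H x W x C image into non-overlapping patches and
    flatten each patch into a vector -- one 'token' per patch."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patch = image[i:i + patch_size, j:j + patch_size, :]
            patches.append(patch.reshape(-1))  # flatten to a 1-D vector
    return np.stack(patches)  # shape: (num_patches, patch_size**2 * c)

# A 224x224 RGB image becomes 196 patch tokens of dimension 768 (16*16*3),
# ready to be interleaved with text tokens in the same sequence.
img = np.zeros((224, 224, 3))
tokens = image_to_patch_tokens(img)
```

In a real multimodal model each flattened patch is then projected into the transformer's embedding space, exactly as the "same architecture, extended tokenization" note in the diagram suggests.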

Key Insight from Karpathy

"At the baseline it's possible to tokenize audio and images and apply the same approaches as for LLMs. Not a fundamental change, just adding more tokens is required."

Current State (2025)

Model           Modalities                    Notes
-----           ----------                    -----
GPT-4V          Text, Images                  Image understanding
Gemini          Text, Images, Video, Audio    Native multimodal
Claude 3        Text, Images                  Vision analysis
Sora (OpenAI)   Text → Video                  Video generation

6.2 AI Agents

The Evolution

┌─────────────────────────────────────────────────────────────┐
│                FROM CHATBOTS TO AGENTS                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  CHATBOT (Today)                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ • Single turn or short conversations                   │  │
│  │ • Immediate response required                          │  │
│  │ • No persistent state between sessions                 │  │
│  │ • Limited tool use                                     │  │
│  └───────────────────────────────────────────────────────┘  │
│                           ↓                                  │
│  AGENT (Emerging)                                            │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ • Long-running tasks (hours, days)                     │  │
│  │ • Autonomous execution with error recovery             │  │
│  │ • Persistent memory across sessions                    │  │
│  │ • Complex tool orchestration                           │  │
│  │ • Self-correction and planning                         │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Agent Capabilities

Capability         Description                            Status
----------         -----------                            ------
Computer Use       Navigate desktop, use software         Emerging
Browsing           Research, fill forms, make purchases   Active
Code Execution     Write, run, debug code                 Mature
Long-term Memory   Remember across sessions               Developing
Planning           Break down complex tasks               Improving

The Vision

Natural language task completion:

User: "Book me a flight to Tokyo next month, find a hotel
near Shibuya under $200/night, and add it to my calendar."

Agent: Searches flights, compares options, books ticket,
researches hotels, makes reservation, creates calendar
events, sends confirmation email.
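Behind a request like this sits a tool-calling loop: the model proposes an action, the runtime executes it, and the result is appended to the conversation until the model produces a final answer. A minimal sketch (the tool names and the `call_model` stub are hypothetical stand-ins for a real LLM API):

```python
def call_model(messages):
    """Stand-in for an LLM call that returns either a tool request or a
    final answer. A real agent would query a model API here."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search_flights", "args": {"dest": "Tokyo"}}
    return {"answer": "Booked flight; itinerary added to calendar."}

# Hypothetical tool registry: name -> callable executed by the runtime.
TOOLS = {
    "search_flights": lambda args: f"3 flights found to {args['dest']}",
}

def run_agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(messages)
        if "answer" in action:                  # model says it is done
            return action["answer"]
        result = TOOLS[action["tool"]](action["args"])  # execute the tool
        messages.append({"role": "tool", "content": result})
    return "Step limit reached"

result = run_agent("Book me a flight to Tokyo next month")
```

The `max_steps` cap and the tool-result feedback are the essential pieces: they are what turn a single-shot chatbot into a bounded autonomous loop.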

Key Insight: Agents will need persistent memory systems. Cloud brain becomes infrastructure for agent long-term memory.


6.3 Test-Time Training

A New Paradigm

Current models: parameters are frozen after training.
Future models: can learn during inference.

┌─────────────────────────────────────────────────────────────┐
│                   TEST-TIME TRAINING                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  CURRENT: Static Parameters                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Training Data ──▶ Fixed Model ──▶ Inference            │  │
│  │                   (unchanging)                         │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
│  FUTURE: Dynamic Parameters                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Training Data ──▶ Model ──┬──▶ Inference              │  │
│  │                           │                            │  │
│  │                           └──▶ Continue Learning ──┐   │  │
│  │                                                    │   │  │
│  │                               ◄────────────────────┘   │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
│  Analogy: Like humans learning while sleeping               │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Implications

  • Models improve with use
  • Personalization becomes deeper
  • Edge cases get resolved over time
  • Privacy concerns multiply

6.4 Staying Current

Resource    Type          Purpose
--------    ----          -------
LM Arena    Leaderboard   Model rankings via human comparison
AI News     Newsletter    Comprehensive AI coverage
X/Twitter   Social        Real-time researcher updates

Key Researchers to Follow

Building a quality follow list provides:

  • Early access to breakthroughs
  • Nuanced technical discussions
  • Pre-print paper analysis
  • Industry trend signals

6.5 Accessing LLMs

Proprietary Models (API)

Provider    Models          Access
--------    ------          ------
OpenAI      GPT-4, GPT-4o   api.openai.com
Anthropic   Claude 3/3.5    anthropic.com
Google      Gemini          ai.google.dev

Open-Weight Models (Download/API)

Model      Provider     Access
-----      --------     ------
Llama 3    Meta         Together.ai, HuggingFace
DeepSeek   DeepSeek     deepseek.com
Mistral    Mistral AI   mistral.ai

Local Deployment

Tool        Description                Best For
----        -----------                --------
LM Studio   GUI for local models       Beginners
Ollama      CLI for local models       Developers
vLLM        High-performance serving   Production

Base Model Access

For research into pure pre-trained models:

  • Hyperbolic platform provides base model access
  • Useful for studying model behavior without post-training influence

6.6 Course Conclusion

The Three-Stage Pipeline

┌─────────────────────────────────────────────────────────────┐
│            COMPLETE LLM DEVELOPMENT PIPELINE                 │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  STAGE 1: PRE-TRAINING                                       │
│  ├─ Internet-scale data collection                           │
│  ├─ Tokenization (BPE, ~100k vocabulary)                    │
│  ├─ Transformer training (billions of parameters)           │
│  └─ Result: Base model (token simulator)                    │
│                                                              │
│  STAGE 2: POST-TRAINING (SFT)                               │
│  ├─ Conversation datasets (human or synthetic)              │
│  ├─ Chat template formatting                                 │
│  ├─ Fine-tuning on helpful behavior                         │
│  └─ Result: Assistant model                                  │
│                                                              │
│  STAGE 3: REINFORCEMENT LEARNING                            │
│  ├─ Verifiable domains: Direct evaluation                   │
│  ├─ Unverifiable domains: RLHF with reward models           │
│  ├─ Emergent reasoning capabilities                          │
│  └─ Result: Reasoning model with novel capabilities         │
│                                                              │
└─────────────────────────────────────────────────────────────┘
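Stage 2's "chat template formatting" step can be shown concretely: conversations are serialized into a single token stream with special markers around each turn before fine-tuning. A sketch (the `<|im_start|>` / `<|im_end|>` strings mimic common templates but are illustrative, not any specific model's):

```python
# Illustrative special tokens delimiting conversation turns.
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def apply_chat_template(messages):
    """Serialize a list of {role, content} dicts into the single string
    the assistant model is actually fine-tuned on."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}"
             for m in messages]
    # A trailing assistant header cues the model to generate its turn next.
    return "\n".join(parts) + f"\n{IM_START}assistant\n"

convo = [{"role": "user", "content": "What is 2+2?"}]
formatted = apply_chat_template(convo)
```

Seen this way, Stage 2 is still next-token prediction from Stage 1; only the data (formatted conversations) changes.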

Core Mental Models

  1. Parameters vs. Context: Vague recollection vs. working memory
  2. Tokens for Thinking: Complex tasks need computation space
  3. Jagged Intelligence: Trust varies by domain
  4. Tool Augmentation: Extend capabilities with external systems
  5. RL Discovery: Models can find superhuman strategies

The Bottom Line

"Models generate responses as token sequences, requiring pre-training and supervised fine-tuning, with reinforcement learning showing promise for reasoning."


6.7 Key Takeaways

Summary

Topic                Direction
-----                ---------
Multimodal           Unified text/image/audio/video models
Agents               Long-running autonomous task completion
Test-Time Training   Models that learn during use
Access               Proprietary APIs + open weights + local

For AI Analytics Platforms

Forward-Looking Monitoring:

  1. Multimodal Metrics: Token types (text/image/audio) per request
  2. Agent Sessions: Track long-running task progress and success
  3. Model Drift: Monitor capability changes across versions
  4. Cost Optimization: Different modalities have different costs
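Points 1 and 4 combine naturally: per-request accounting can bucket token counts by modality and apply a per-modality rate. A sketch of such a metric (the rates and field names below are invented for illustration):

```python
# Hypothetical per-1k-token rates by modality, for illustration only.
RATES_PER_1K = {"text": 0.01, "image": 0.02, "audio": 0.03}

def request_cost(token_counts):
    """token_counts: dict mapping modality -> token count for one request.
    Returns the blended cost using per-modality rates."""
    return sum(RATES_PER_1K[m] * n / 1000 for m, n in token_counts.items())

# A request mixing 500 text tokens with 1700 image patch tokens.
cost = request_cost({"text": 500, "image": 1700})
```

Logging this breakdown per request is what makes modality-level cost optimization possible later.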

Practical Applications

Strategic Implications:

  1. Multimodal Memory: Prepare to store image/audio context, not just text
  2. Agent Infrastructure: Cloud brain becomes critical for agent persistence
  3. Continuous Learning: Memory systems may need to integrate with test-time training
  4. Cross-Model Portability: Design memories that work across providers

Next Module

Module 7: AI Analytics Platform Considerations


Timestamps: 3:09:39 - Future Developments | 3:15:15 - Resources | 3:18:34 - Finding LLMs | 3:21:46 - Grand Summary