Module 02

Inference & Base Models

Duration: ~33 minutes of video content | Timestamps: 0:26:01 - 0:59:23

2.1 The Inference Process

How Text Generation Works

During inference (generation), the model produces text one token at a time:

┌──────────────────────────────────────────────────────────────┐
│                    AUTOREGRESSIVE GENERATION                  │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  Step 1: "The quick brown"                                    │
│          → Predict next: [fox: 0.42, dog: 0.15, cat: 0.08]   │
│          → Sample: "fox"                                      │
│                                                               │
│  Step 2: "The quick brown fox"                                │
│          → Predict next: [jumps: 0.38, runs: 0.22, ...]      │
│          → Sample: "jumps"                                    │
│                                                               │
│  Step 3: "The quick brown fox jumps"                          │
│          → Continue until stop condition...                   │
│                                                               │
└──────────────────────────────────────────────────────────────┘

Key Properties

  1. One token at a time: Each new token depends on all previous tokens
  2. Probability distribution: Model outputs probabilities for ALL vocabulary tokens
  3. Sampling: The next token is sampled from this distribution (not always the highest-probability token)
  4. Autoregressive: Output feeds back as input for the next prediction (see the sketch below)
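
A minimal sketch of this loop, assuming a Hugging Face-style GPT-2 checkpoint; the model name, token count, and plain multinomial sampling here are illustrative choices, not something prescribed by the lecture:

# Minimal autoregressive generation loop (illustrative sketch)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The quick brown"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(10):                                  # generate 10 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits             # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]                # scores for ALL vocabulary tokens
    probs = torch.softmax(next_token_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)       # sample, not argmax
    input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)  # feed back in

print(tokenizer.decode(input_ids[0]))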

2.2 Stochastic Outputs

Why Outputs Vary

LLMs are stochastic (probabilistic), not deterministic:

# Same prompt, different outputs each time
prompt = "Write a poem about the ocean"

# Run 1: "The waves crash upon the shore..."
# Run 2: "Deep blue waters call to me..."
# Run 3: "Salty breeze and endless tides..."

Temperature & Sampling

Temperature controls randomness:

Temperature   Behavior                 Use Case
-----------   ----------------------   --------------------------------
0.0           Deterministic (greedy)   Code, math, factual
0.7           Balanced creativity      General conversation
1.0+          High randomness          Creative writing, brainstorming

Sampling Methods:

  • Greedy: Always pick highest probability token
  • Top-K: Sample from K highest probability tokens
  • Top-P (Nucleus): Sample from the smallest set of tokens whose cumulative probability exceeds P (see the sketch below)
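
A minimal sketch of how these settings reshape the next-token distribution (pure NumPy; the function name and default values are illustrative, not a real library API):

# Illustrative sketch: temperature scaling plus top-k / top-p filtering
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9):
    if temperature == 0.0:                      # greedy: always pick the argmax token
        return int(np.argmax(logits))
    logits = logits / temperature               # <1 sharpens, >1 flattens the distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]             # tokens from most to least likely
    order = order[:top_k]                       # Top-K: keep only the K most likely
    cumulative = np.cumsum(probs[order])
    order = order[: np.searchsorted(cumulative, top_p) + 1]  # Top-P: smallest set covering P mass

    kept = probs[order] / probs[order].sum()    # renormalize over the surviving tokens
    return int(np.random.choice(order, p=kept))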

Key Insight from Karpathy

"The process is stochastic, producing varied outputs through random sampling. This enables creativity but also potential inaccuracies or 'hallucinations.'"

For AI Analytics: Temperature and sampling settings are critical parameters to log, since they significantly affect output quality and consistency.


2.3 GPT-2: A Historical Baseline

GPT-2 Specifications (2019)

Attribute         Value
---------------   -------------
Parameters        1.6 billion
Context Length    1,024 tokens
Training Tokens   ~100 billion
Training Cost     ~$40,000
Vocabulary        50,257 tokens

Why GPT-2 Matters

  • First "large" language model to gain public attention
  • OpenAI initially withheld release citing misuse concerns
  • Now considered small by modern standards
  • Excellent learning baseline

Karpathy's Reproduction

Using his llm.c project:

  • Reproduced GPT-2 for $672
  • Optimized pipeline could reduce to ~$100
  • Demonstrates democratization of AI training

2.4 Modern Base Models

Llama 3.1 (2024)

Attribute        Value
--------------   -------------------
Parameters       405 billion
Context Length   128,000 tokens
Training         Trillions of tokens
Status           Open weights

Base Model vs. Assistant Model

┌─────────────────────────────────────────────────────────────┐
│                 BASE MODEL                                   │
├─────────────────────────────────────────────────────────────┤
│  Input:  "The capital of France is"                         │
│  Output: "Paris. The capital of Germany is Berlin. The..."  │
│                                                              │
│  Behavior: Continues text in training data style             │
│  Problem:  Not helpful, no conversation, no safety          │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                 ASSISTANT MODEL                              │
├─────────────────────────────────────────────────────────────┤
│  Input:  "What is the capital of France?"                    │
│  Output: "The capital of France is Paris."                   │
│                                                              │
│  Behavior: Answers questions helpfully                       │
│  Training: Post-training on conversations                    │
└─────────────────────────────────────────────────────────────┘

Base Model Characteristics

Karpathy describes base models as "token simulators":

  • Generate text matching internet patterns
  • No inherent concept of "helpful" or "harmful"
  • Require creative prompting for usefulness (few-shot examples)
  • Hallucinate freely without constraints

Practical Demonstration

Base models can be made useful through prompting:

# Few-shot prompting to create Q&A behavior
User: What is 2+2?
Assistant: 4

User: What is the capital of Japan?
Assistant: Tokyo

User: How do I make coffee?
Assistant: [Model continues the pattern...]
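
One way to assemble such a few-shot prompt programmatically; the helper function and example pairs below are hypothetical, simply mirroring the pattern above:

# Illustrative sketch: building a few-shot prompt for a raw base model
few_shot_examples = [
    ("What is 2+2?", "4"),
    ("What is the capital of Japan?", "Tokyo"),
]

def build_prompt(question):
    lines = []
    for q, a in few_shot_examples:
        lines.append(f"User: {q}")
        lines.append(f"Assistant: {a}")
    lines.append(f"User: {question}")
    lines.append("Assistant:")          # the base model continues the pattern from here
    return "\n".join(lines)

print(build_prompt("How do I make coffee?"))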

2.5 From Pre-training to Post-training

The Transition

Stage           Compute                   Time            Purpose
-------------   -----------------------   -------------   -----------------------
Pre-training    Massive (millions of $)   Months          Learn language patterns
Post-training   Minimal (comparatively)   Hours to days   Learn helpful behavior

Key Insight

"Post-training is way cheaper than pre-training (e.g., months vs. hours). The algorithm remains unchanged; only parameters are fine-tuned."

This is crucial for understanding LLM economics:

  • Pre-training: One-time massive investment
  • Post-training: Relatively cheap customization
  • Fine-tuning: Accessible to most organizations

2.6 Components of a Usable LLM

Two Essential Components

  1. Inference Code: The software that runs the model
  2. Model Weights: The learned parameters (billions of numbers)

Accessing Models

Type          Examples                   Access
-----------   ------------------------   --------------------
Proprietary   GPT-4, Claude, Gemini      API only
Open-Weight   Llama, Mistral, DeepSeek   Downloadable weights
Local         Via LM Studio, Ollama      Run on your hardware

For KeenDreams: Understanding model access patterns helps design memory architectures that work across different LLM providers and local/cloud deployments.
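
As an illustration of such model-agnostic access, many local runners (including Ollama) expose an OpenAI-compatible HTTP endpoint, so the same client code can target a cloud API or local hardware. The base URL, port, and model tag below are assumptions about a typical local setup, not something from the lecture:

# Illustrative sketch: one OpenAI-style client, pointed at a local runner
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1",   # assumed local Ollama endpoint
                api_key="not-needed-locally")

response = client.chat.completions.create(
    model="llama3.1",                       # whatever model is pulled locally
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.0,                        # near-deterministic for factual queries
)
print(response.choices[0].message.content)

Swapping base_url and model is the only change needed to move between a hosted provider and local hardware, which is what makes a provider-agnostic memory layer practical.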


2.7 Key Takeaways

Summary

Concept         Key Point
-------------   ------------------------------------------
Inference       Token-by-token generation, autoregressive
Stochasticity   Outputs vary; controlled by temperature
Base Models     Token simulators, not assistants
Post-training   Cheap relative to pre-training
Access          Proprietary APIs vs. open weights

For AI Analytics Platforms

Critical Metrics to Monitor:

  1. Token Generation Rate: Tokens per second for latency tracking
  2. Temperature Settings: Affects output consistency and quality
  3. Context Window Usage: How much of available context is consumed
  4. Output Length: Token count of responses
  5. Stop Reason: Why generation ended (length limit, stop token, etc.)
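
A hedged sketch of how these five metrics might be pulled from an OpenAI-style chat completion response; the field names follow the OpenAI Python client (v1+), and the model name and request are illustrative only:

# Illustrative sketch: logging inference metrics from an OpenAI-style response
import time
from openai import OpenAI

client = OpenAI()
start = time.time()
response = client.chat.completions.create(
    model="gpt-4o-mini",                    # illustrative model name
    messages=[{"role": "user", "content": "Write a haiku about the ocean."}],
    temperature=0.7,
)
elapsed = time.time() - start

usage = response.usage
metrics = {
    "temperature": 0.7,                                       # 2. sampling setting used
    "prompt_tokens": usage.prompt_tokens,                     # 3. context window usage
    "completion_tokens": usage.completion_tokens,             # 4. output length
    "tokens_per_second": usage.completion_tokens / elapsed,   # 1. generation rate
    "finish_reason": response.choices[0].finish_reason,       # 5. stop reason
}
print(metrics)

Logging these per request makes it possible to correlate latency and output quality with the sampling settings actually in use.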

For KeenDreams

Applicable Learnings:

  1. Probabilistic Retrieval: Like model sampling, memory retrieval can use relevance scores
  2. Context Priming: Few-shot patterns work - KeenDreams can provide "example" memories
  3. Model Agnosticism: Design memory to work with any LLM backend

Practice Questions

  1. Why do LLMs produce different outputs for the same prompt?
  2. What makes a base model different from ChatGPT?
  3. How does temperature affect model outputs?
  4. Why is post-training much cheaper than pre-training?

Next Module

Module 3: Post-Training & Fine-Tuning


Timestamps: 0:26:01 - Inference | 0:31:09 - GPT-2 | 0:42:52 - Llama 3.1 Base Model | 0:59:23 - Pre to Post-Training