Module 02

Inference & Base Models

Duration: ~33 minutes of video content | Timestamps: 0:26:01 - 0:59:23

2.1 The Inference Process

How Text Generation Works

During inference (generation), the model produces text one token at a time:

┌──────────────────────────────────────────────────────────────┐
│                    AUTOREGRESSIVE GENERATION                  │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  Step 1: "The quick brown"                                    │
│          → Predict next: [fox: 0.42, dog: 0.15, cat: 0.08]   │
│          → Sample: "fox"                                      │
│                                                               │
│  Step 2: "The quick brown fox"                                │
│          → Predict next: [jumps: 0.38, runs: 0.22, ...]      │
│          → Sample: "jumps"                                    │
│                                                               │
│  Step 3: "The quick brown fox jumps"                          │
│          → Continue until stop condition...                   │
│                                                               │
└──────────────────────────────────────────────────────────────┘

Key Properties

  1. One token at a time: Each new token depends on all previous tokens
  2. Probability distribution: Model outputs probabilities for ALL vocabulary tokens
  3. Sampling: The next token is sampled from this distribution (not always the highest-probability token)
  4. Autoregressive: Output feeds back as input for the next prediction (see the sketch below)
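
A minimal sketch of this loop, assuming a Hugging Face-style GPT-2 checkpoint; the model name, token count, and plain multinomial sampling here are illustrative choices, not something prescribed by the lecture:

# Minimal autoregressive generation loop (illustrative sketch)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The quick brown"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(10):                                  # generate 10 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits             # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]                # scores for ALL vocabulary tokens
    probs = torch.softmax(next_token_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)       # sample, not argmax
    input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)  # feed back in

print(tokenizer.decode(input_ids[0]))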

2.2 Stochastic Outputs

Why Outputs Vary

LLMs are stochastic (probabilistic), not deterministic:

# Same prompt, different outputs each time
prompt = "Write a poem about the ocean"

# Run 1: "The waves crash upon the shore..."
# Run 2: "Deep blue waters call to me..."
# Run 3: "Salty breeze and endless tides..."

Temperature & Sampling

Temperature controls randomness:

Temperature   Behavior                 Use Case
-----------   ----------------------   --------------------------------
0.0           Deterministic (greedy)   Code, math, factual
0.7           Balanced creativity      General conversation
1.0+          High randomness          Creative writing, brainstorming

Sampling Methods:

  • Greedy: Always pick highest probability token
  • Top-K: Sample from K highest probability tokens
  • Top-P (Nucleus): Sample from the smallest set of tokens whose cumulative probability exceeds P (see the sketch below)
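
A minimal sketch of how these settings reshape the next-token distribution (pure NumPy; the function name and default values are illustrative, not a real library API):

# Illustrative sketch: temperature scaling plus top-k / top-p filtering
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9):
    if temperature == 0.0:                      # greedy: always pick the argmax token
        return int(np.argmax(logits))
    logits = logits / temperature               # <1 sharpens, >1 flattens the distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]             # tokens from most to least likely
    order = order[:top_k]                       # Top-K: keep only the K most likely
    cumulative = np.cumsum(probs[order])
    order = order[: np.searchsorted(cumulative, top_p) + 1]  # Top-P: smallest set covering P mass

    kept = probs[order] / probs[order].sum()    # renormalize over the surviving tokens
    return int(np.random.choice(order, p=kept))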

Key Insight from Karpathy

"The process is stochastic, producing varied outputs through random sampling. This enables creativity but also potential inaccuracies or 'hallucinations.'"

For AI Analytics: Temperature and sampling settings are critical parameters to log, since they significantly affect output quality and consistency.


2.3 GPT-2: A Historical Baseline

GPT-2 Specifications (2019)

Attribute         Value
---------------   -------------
Parameters        1.6 billion
Context Length    1,024 tokens
Training Tokens   ~100 billion
Training Cost     ~$40,000
Vocabulary        50,257 tokens

Why GPT-2 Matters

  • First "large" language model to gain public attention
  • OpenAI initially withheld release citing misuse concerns
  • Now considered small by modern standards
  • Excellent learning baseline

Karpathy's Reproduction

Using his llm.c project:

  • Reproduced GPT-2 for $672
  • Optimized pipeline could reduce to ~$100
  • Demonstrates democratization of AI training

2.4 Modern Base Models

Llama 3.1 (2024)

Attribute        Value
--------------   -------------------
Parameters       405 billion
Context Length   128,000 tokens
Training         Trillions of tokens
Status           Open weights

Base Model vs. Assistant Model

┌─────────────────────────────────────────────────────────────┐
│                 BASE MODEL                                   │
├─────────────────────────────────────────────────────────────┤
│  Input:  "The capital of France is"                         │
│  Output: "Paris. The capital of Germany is Berlin. The..."  │
│                                                              │
│  Behavior: Continues text in training data style             │
│  Problem:  Not helpful, no conversation, no safety          │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                 ASSISTANT MODEL                              │
├─────────────────────────────────────────────────────────────┤
│  Input:  "What is the capital of France?"                    │
│  Output: "The capital of France is Paris."                   │
│                                                              │
│  Behavior: Answers questions helpfully                       │
│  Training: Post-training on conversations                    │
└─────────────────────────────────────────────────────────────┘

Base Model Characteristics

Karpathy describes base models as "token simulators":

  • Generate text matching internet patterns
  • No inherent concept of "helpful" or "harmful"
  • Require creative prompting for usefulness (few-shot examples)
  • Hallucinate freely without constraints

Practical Demonstration

Base models can be made useful through prompting:

# Few-shot prompting to create Q&A behavior
User: What is 2+2?
Assistant: 4

User: What is the capital of Japan?
Assistant: Tokyo

User: How do I make coffee?
Assistant: [Model continues the pattern...]
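
One way to assemble such a few-shot prompt programmatically; the helper function and example pairs below are hypothetical, simply mirroring the pattern above:

# Illustrative sketch: building a few-shot prompt for a raw base model
few_shot_examples = [
    ("What is 2+2?", "4"),
    ("What is the capital of Japan?", "Tokyo"),
]

def build_prompt(question):
    lines = []
    for q, a in few_shot_examples:
        lines.append(f"User: {q}")
        lines.append(f"Assistant: {a}")
    lines.append(f"User: {question}")
    lines.append("Assistant:")          # the base model continues the pattern from here
    return "\n".join(lines)

print(build_prompt("How do I make coffee?"))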

2.5 From Pre-training to Post-training

The Transition

Stage           Compute                   Time            Purpose
-------------   -----------------------   -------------   -----------------------
Pre-training    Massive (millions of $)   Months          Learn language patterns
Post-training   Minimal (comparatively)   Hours to days   Learn helpful behavior

Key Insight

"Post-training is way cheaper than pre-training (e.g., months vs. hours). The algorithm remains unchanged; only parameters are fine-tuned."

This is crucial for understanding LLM economics:

  • Pre-training: One-time massive investment
  • Post-training: Relatively cheap customization
  • Fine-tuning: Accessible to most organizations

2.6 Components of a Usable LLM

Two Essential Components

  1. Inference Code: The software that runs the model
  2. Model Weights: The learned parameters (billions of numbers)

Accessing Models

Type          Examples                   Access
-----------   ------------------------   --------------------
Proprietary   GPT-4, Claude, Gemini      API only
Open-Weight   Llama, Mistral, DeepSeek   Downloadable weights
Local         Via LM Studio, Ollama      Run on your hardware

For KeenDreams: Understanding model access patterns helps design memory architectures that work across different LLM providers and local/cloud deployments.
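
As an illustration of such model-agnostic access, many local runners (including Ollama) expose an OpenAI-compatible HTTP endpoint, so the same client code can target a cloud API or local hardware. The base URL, port, and model tag below are assumptions about a typical local setup, not something from the lecture:

# Illustrative sketch: one OpenAI-style client, pointed at a local runner
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1",   # assumed local Ollama endpoint
                api_key="not-needed-locally")

response = client.chat.completions.create(
    model="llama3.1",                       # whatever model is pulled locally
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.0,                        # near-deterministic for factual queries
)
print(response.choices[0].message.content)

Swapping base_url and model is the only change needed to move between a hosted provider and local hardware, which is what makes a provider-agnostic memory layer practical.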


2.7 Key Takeaways

Summary

Concept         Key Point
-------------   ------------------------------------------
Inference       Token-by-token generation, autoregressive
Stochasticity   Outputs vary; controlled by temperature
Base Models     Token simulators, not assistants
Post-training   Cheap relative to pre-training
Access          Proprietary APIs vs. open weights

For AI Analytics Platforms

Critical Metrics to Monitor:

  1. Token Generation Rate: Tokens per second for latency tracking
  2. Temperature Settings: Affects output consistency and quality
  3. Context Window Usage: How much of available context is consumed
  4. Output Length: Token count of responses
  5. Stop Reason: Why generation ended (length limit, stop token, etc.)
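
A hedged sketch of how these five metrics might be pulled from an OpenAI-style chat completion response; the field names follow the OpenAI Python client (v1+), and the model name and request are illustrative only:

# Illustrative sketch: logging inference metrics from an OpenAI-style response
import time
from openai import OpenAI

client = OpenAI()
start = time.time()
response = client.chat.completions.create(
    model="gpt-4o-mini",                    # illustrative model name
    messages=[{"role": "user", "content": "Write a haiku about the ocean."}],
    temperature=0.7,
)
elapsed = time.time() - start

usage = response.usage
metrics = {
    "temperature": 0.7,                                       # 2. sampling setting used
    "prompt_tokens": usage.prompt_tokens,                     # 3. context window usage
    "completion_tokens": usage.completion_tokens,             # 4. output length
    "tokens_per_second": usage.completion_tokens / elapsed,   # 1. generation rate
    "finish_reason": response.choices[0].finish_reason,       # 5. stop reason
}
print(metrics)

Logging these per request makes it possible to correlate latency and output quality with the sampling settings actually in use.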

For KeenDreams

Applicable Learnings:

  1. Probabilistic Retrieval: Like model sampling, memory retrieval can use relevance scores
  2. Context Priming: Few-shot patterns work - KeenDreams can provide "example" memories
  3. Model Agnosticism: Design memory to work with any LLM backend

Practice Questions

  1. Why do LLMs produce different outputs for the same prompt?
  2. What makes a base model different from ChatGPT?
  3. How does temperature affect model outputs?
  4. Why is post-training much cheaper than pre-training?

Next Module

Module 3: Post-Training & Fine-Tuning


Timestamps: 0:26:01 - Inference | 0:31:09 - GPT-2 | 0:42:52 - Llama 3.1 Base Model | 0:59:23 - Pre to Post-Training