Module 01

Pre-Training Fundamentals

Duration: ~43 minutes of video content
Timestamps: 0:00:00 - 0:42:52

1.1 Data Collection & Preprocessing

The Internet as Training Data

LLMs begin by crawling the internet to build massive text datasets. The scale is staggering:

Dataset                      Size               Description
---------------------------  -----------------  --------------------------
Common Crawl                 Petabytes          Raw internet snapshots
FineWeb                      1.2B+ web pages    Filtered, cleaned dataset
Training Data (compressed)   ~44 terabytes      After processing

The Data Pipeline

Raw Internet → URL Filtering → Text Extraction → Language Filtering → PII Removal → Quality Filtering → Training Data

Filtering Stages (sketched in code after this list):

  1. URL Filtering: Excludes domains with marketing, spam, or malware
  2. Text Extraction: Removes HTML markup, retains text only
  3. Language Filtering: Keeps pages with >65% target language content
  4. PII Removal: Eliminates addresses, SSNs, personal identifiers
  5. Quality Filtering: Removes duplicates, low-quality content
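A minimal Python sketch of these five stages, using a toy blocklist, a crude HTML-stripping regex, and an SSN pattern as stand-ins for the real classifiers and heuristics a pipeline like FineWeb uses (all thresholds and patterns here are illustrative assumptions, not the actual implementation):

import re

BLOCKED_DOMAINS = {"spam.example", "malware.example"}  # hypothetical blocklist
TAG_RE = re.compile(r"<[^>]+>")                        # crude HTML tag stripper
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")          # crude PII pattern (US SSN)
LETTERS = set("abcdefghijklmnopqrstuvwxyz")

def filter_pipeline(pages):
    """Yield cleaned text from raw {'url': ..., 'html': ...} pages, mirroring the five stages."""
    seen = set()
    for page in pages:
        # 1. URL filtering: skip blocked domains
        if any(domain in page["url"] for domain in BLOCKED_DOMAINS):
            continue
        # 2. Text extraction: strip HTML markup, keep text only
        text = TAG_RE.sub(" ", page["html"]).strip()
        # 3. Language filtering: keep pages that are mostly target-language letters (toy proxy for >65%)
        chars = [c for c in text.lower() if not c.isspace()]
        if not chars or sum(c in LETTERS for c in chars) / len(chars) < 0.65:
            continue
        # 4. PII removal: redact obvious identifiers
        text = SSN_RE.sub("[REDACTED]", text)
        # 5. Quality filtering: drop exact duplicates and very short pages (illustrative cutoff)
        if text in seen or len(text) < 200:
            continue
        seen.add(text)
        yield text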

Key Insight

"Raw data is noisy and full of duplicate content, low-quality text, and irrelevant information. Before training, it needs heavy filtering."

Relevance to KeenDreams: This mirrors how cloud brain memory should work - not storing everything, but filtering for meaningful, high-quality context that improves recall and decision-making.


1.2 Tokenization

What is Tokenization?

Text is converted into tokens - small chunks of text (words, sub-words, or punctuation) that are mapped to integer IDs. This is how LLMs "see" language.

Example:

Input:  "Hello, world!"
Tokens: [15496, 11, 995, 0]  (4 tokens)
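You can reproduce this with OpenAI's tiktoken library; the IDs shown match the GPT-2 encoding, while GPT-4's cl100k_base encoding splits text differently and has the ~100k vocabulary mentioned below:

import tiktoken  # pip install tiktoken

gpt2 = tiktoken.get_encoding("gpt2")
print(gpt2.encode("Hello, world!"))       # [15496, 11, 995, 0]
print(gpt2.decode([15496, 11, 995, 0]))   # Hello, world!

gpt4 = tiktoken.get_encoding("cl100k_base")
print(gpt4.encode("Hello, world!"))       # a different (still short) ID sequence
print(gpt4.n_vocab)                       # 100277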

Byte Pair Encoding (BPE)

The dominant tokenization algorithm (a toy version is sketched after these steps):

  1. Start with individual bytes/characters
  2. Find most common adjacent pairs
  3. Merge into new tokens
  4. Repeat until vocabulary size reached
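A toy, character-level version of this loop (real tokenizers such as GPT's operate on raw bytes and add regex pre-splitting and special tokens, but the merge rule is the same idea):

from collections import Counter

def bpe_train(text, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    tokens = list(text)                                # 1. start with individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))       # 2. count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]            #    pick the most common pair
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):                         # 3. merge that pair into one new token
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged                                # 4. repeat until the merge budget is used
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", num_merges=5)
print(merges)   # e.g. [('l', 'o'), ('lo', 'w'), ...]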

GPT-4 Vocabulary: 100,277 unique tokens (the cl100k_base encoding)

Why Tokenization Matters

Aspect           Impact
---------------  -------------------------------------------
Efficiency       Compress text into fewer units
Context Window   More content fits in the limited window
Cost             Fewer tokens = lower API costs
Limitations      Character-level tasks become difficult

Tokenization Pitfalls

Models struggle with:

  • Spelling tasks: "How many 'r's in 'strawberry'?" (tokens don't preserve characters)
  • Counting: Ellipses "..." may be single or multiple tokens
  • Non-English text: Often requires more tokens per word

Practical Tip: Use code-based solutions for character/counting tasks.

# Bad: Ask LLM to count letters
"How many r's in strawberry?"  # LLM may fail

# Good: Have LLM generate code
len([c for c in "strawberry" if c == 'r'])  # Returns 3

Tools

  • Tiktokenizer: Visualize how text becomes tokens
  • tiktoken (Python library): OpenAI's tokenizer

1.3 Neural Network Fundamentals

The Transformer Architecture

Modern LLMs use the Transformer architecture (Vaswani et al., 2017).

Key Properties:

  • Processes sequences of tokens in parallel
  • Uses "attention" to relate tokens to each other (sketched in code below)
  • Billions of parameters (weights) store learned patterns
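A minimal single-head self-attention sketch in NumPy, showing how each token's output becomes a weighted blend of every token's vector; a real Transformer adds learned query/key/value projections, multiple heads, causal masking, and feed-forward layers on top of this:

import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: weight each value vector by query-key similarity."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # how strongly each token attends to each other token
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row is a probability distribution
    return weights @ V                               # blend the value vectors

seq_len, d_model = 4, 8                              # toy sizes: 4 tokens, 8-dim vectors
x = np.random.randn(seq_len, d_model)                # stand-in for token embeddings
print(self_attention(x, x, x).shape)                 # (4, 8): one updated vector per token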

Input/Output Flow

┌─────────────────────────────────────────────────────────┐
│                    TRANSFORMER I/O                       │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  INPUT: Token sequence [91, 860, 287, 11579, ...]       │
│                         ↓                                │
│  ┌─────────────────────────────────────────────┐        │
│  │           TRANSFORMER NETWORK                │        │
│  │  (Billions of parameters doing matrix ops)   │        │
│  └─────────────────────────────────────────────┘        │
│                         ↓                                │
│  OUTPUT: Probability distribution over 100k tokens      │
│          [0.001, 0.023, ..., 0.412, ...]                │
│                         ↓                                │
│  PREDICTION: Most likely next token (or sampled)        │
│                                                          │
└─────────────────────────────────────────────────────────┘
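The bottom of the diagram in code: a sketch that converts a logits vector into a probability distribution over the vocabulary and then picks the next token, either greedily or by sampling (the logits here are random stand-ins for a real forward pass):

import numpy as np

def next_token(logits, temperature=1.0, greedy=False):
    """Softmax the logits, then take the argmax or sample from the distribution."""
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()                               # probability distribution over all tokens
    if greedy:
        return int(np.argmax(probs))                   # most likely next token
    return int(np.random.choice(len(probs), p=probs))  # sampled next token

vocab_size = 100_277                                   # GPT-4-scale vocabulary
logits = np.random.randn(vocab_size)                   # pretend output of the Transformer
print(next_token(logits, greedy=True))
print(next_token(logits, temperature=0.8))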

Context Windows

The context window defines how much "memory" the model has during generation:

Model      Context Length       ~Words
---------  -------------------  ------------------
GPT-2      1,024 tokens         ~750
GPT-3.5    4,096 tokens         ~3,000
GPT-4      8,192-128k tokens    ~6,000-96,000
Claude 3   200k tokens          ~150,000
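The ~Words column reflects a common rule of thumb of roughly 0.75 English words per token; a tiny helper assuming that heuristic:

def approx_words(context_tokens, words_per_token=0.75):
    """Rough rule of thumb: ~0.75 English words per token (varies by language and content)."""
    return int(context_tokens * words_per_token)

for name, tokens in [("GPT-2", 1_024), ("GPT-4", 128_000), ("Claude 3", 200_000)]:
    print(f"{name}: {tokens:,} tokens ≈ {approx_words(tokens):,} words")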

Relevance to KeenDreams: Context window is like "working memory." KeenDreams acts as extended memory beyond this window - storing project context that can be retrieved and loaded into the model's active context.


1.4 The Training Process

How Neural Networks Learn

Training adjusts billions of parameters to minimize prediction error (a one-step sketch follows this list):

  1. Forward Pass: Feed tokens, get predictions
  2. Loss Calculation: Compare predictions to actual next tokens
  3. Backpropagation: Calculate how to adjust each parameter
  4. Update: Apply small adjustments to all parameters
  5. Repeat: Millions of times across training data
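A minimal PyTorch sketch of one such step, assuming `model` is any torch.nn.Module that maps a (batch, time) tensor of token IDs to (batch, time, vocab) logits and `optimizer` is a standard torch optimizer:

import torch.nn.functional as F

def train_step(model, optimizer, batch):
    """One gradient step: predict each next token, measure the error, nudge all parameters."""
    inputs, targets = batch[:, :-1], batch[:, 1:]       # shift by one: predict the next token
    logits = model(inputs)                              # 1. forward pass -> (B, T, vocab) logits
    loss = F.cross_entropy(                             # 2. loss vs. the actual next tokens
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()                                     # 3. backpropagation: gradient per parameter
    optimizer.step()                                    # 4. small update to every parameter
    return loss.item()                                  # 5. the caller repeats this across the dataset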

Key Insight from Karpathy

"Every single line [of training] is improving the prediction of 1B tokens in the training set simultaneously."

Training at Scale

GPT-2 Training (2019):

  • 1.6 billion parameters
  • ~100 billion tokens
  • ~$40,000 cost
  • Weeks of compute

Karpathy's Reproduction (2024):

  • Same model using llm.c
  • Cost: $672 (optimized could be ~$100)
  • Demonstrates massive efficiency gains

What Training Produces

The result is a Base Model - essentially an "expensive autocomplete" that simulates internet text patterns.

Base Model Characteristics:

  • Completes text in style of training data
  • No inherent helpfulness or safety
  • Requires post-training to become useful assistant

1.5 Key Takeaways

Summary

Concept          Key Point
---------------  ------------------------------------------------------
Data             Internet-scale, heavily filtered, ~44TB
Tokenization     100k vocabulary, BPE algorithm, efficiency trade-offs
Neural Network   Billions of parameters, predicts next token
Context Window   Limited "working memory" (1k-200k tokens)
Training         Weeks/months, expensive but improving rapidly
Output           Base model = token simulator, not assistant

For AI Analytics Platforms

Monitoring Insights from This Module (sketched in code after this list):

  1. Token Usage Tracking: Every API call consumes tokens - track and optimize
  2. Context Window Utilization: Monitor how much context is used vs. available
  3. Tokenization Costs: Different text has different token densities
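A sketch of what such tracking could look like with tiktoken; the per-1k-token prices below are hypothetical placeholders, not real provider rates:

import tiktoken

PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}     # hypothetical prices per 1k tokens

def usage_report(prompt, completion, context_limit, model="gpt-4"):
    """Count tokens on both sides of a call and derive utilization and cost estimates."""
    enc = tiktoken.encoding_for_model(model)
    p, c = len(enc.encode(prompt)), len(enc.encode(completion))
    return {
        "prompt_tokens": p,                                              # insight 1: usage tracking
        "completion_tokens": c,
        "context_utilization": round((p + c) / context_limit, 4),        # insight 2: window utilization
        "estimated_cost_usd": round(p / 1000 * PRICE_PER_1K["prompt"]    # insight 3: token density -> cost
                                    + c / 1000 * PRICE_PER_1K["completion"], 6),
    }

print(usage_report("Summarize this report...", "The report covers...", context_limit=128_000))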

For KeenDreams

Applicable Learnings:

  1. Semantic Memory: Like training data filtering, store meaningful patterns not raw data
  2. Context Loading: Strategic retrieval to maximize context window utility
  3. Working Memory Metaphor: Context window = active session; KeenDreams = long-term storage

Practice Questions

  1. Why can't LLMs easily count letters in words?
  2. What's the relationship between vocabulary size and tokenization efficiency?
  3. Why does a base model require post-training to be useful?
  4. How does context window size affect model capabilities?

Next Module

Module 2: Inference & Base Models


Timestamps: 0:00:00 - Introduction | 0:01:00 - Pretraining Data | 0:07:47 - Tokenization | 0:14:27 - Neural Network I/O | 0:20:11 - Neural Network Internals | 0:31:09 - GPT-2