Module 01

Pre-Training Fundamentals

Duration: ~43 minutes of video content
Timestamps: 0:00:00 - 0:42:52

1.1 Data Collection & Preprocessing

The Internet as Training Data

LLMs begin by crawling the internet to build massive text datasets. The scale is staggering:

Dataset                      Size               Description
---------------------------  -----------------  --------------------------
Common Crawl                 Petabytes          Raw internet snapshots
FineWeb                      1.2B+ web pages    Filtered, cleaned dataset
Training Data (compressed)   ~44 terabytes      After processing

The Data Pipeline

Raw Internet → URL Filtering → Text Extraction → Language Filtering → PII Removal → Quality Filtering → Training Data

Filtering Stages (sketched in code after this list):

  1. URL Filtering: Excludes domains with marketing, spam, or malware
  2. Text Extraction: Removes HTML markup, retains text only
  3. Language Filtering: Keeps pages with >65% target language content
  4. PII Removal: Eliminates addresses, SSNs, personal identifiers
  5. Quality Filtering: Removes duplicates, low-quality content
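A minimal Python sketch of these five stages, using a toy blocklist, a crude HTML-stripping regex, and an SSN pattern as stand-ins for the real classifiers and heuristics a pipeline like FineWeb uses (all thresholds and patterns here are illustrative assumptions, not the actual implementation):

import re

BLOCKED_DOMAINS = {"spam.example", "malware.example"}  # hypothetical blocklist
TAG_RE = re.compile(r"<[^>]+>")                        # crude HTML tag stripper
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")          # crude PII pattern (US SSN)
LETTERS = set("abcdefghijklmnopqrstuvwxyz")

def filter_pipeline(pages):
    """Yield cleaned text from raw {'url': ..., 'html': ...} pages, mirroring the five stages."""
    seen = set()
    for page in pages:
        # 1. URL filtering: skip blocked domains
        if any(domain in page["url"] for domain in BLOCKED_DOMAINS):
            continue
        # 2. Text extraction: strip HTML markup, keep text only
        text = TAG_RE.sub(" ", page["html"]).strip()
        # 3. Language filtering: keep pages that are mostly target-language letters (toy proxy for >65%)
        chars = [c for c in text.lower() if not c.isspace()]
        if not chars or sum(c in LETTERS for c in chars) / len(chars) < 0.65:
            continue
        # 4. PII removal: redact obvious identifiers
        text = SSN_RE.sub("[REDACTED]", text)
        # 5. Quality filtering: drop exact duplicates and very short pages (illustrative cutoff)
        if text in seen or len(text) < 200:
            continue
        seen.add(text)
        yield text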

Key Insight

"Raw data is noisy and full of duplicate content, low-quality text, and irrelevant information. Before training, it needs heavy filtering."

Relevance to KeenDreams: This mirrors how cloud brain memory should work - not storing everything, but filtering for meaningful, high-quality context that improves recall and decision-making.


1.2 Tokenization

What is Tokenization?

Text is converted into tokens - small chunks of text (words, sub-words, or punctuation) that are mapped to integer IDs. This is how LLMs "see" language.

Example:

Input:  "Hello, world!"
Tokens: [15496, 11, 995, 0]  (4 tokens)
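You can reproduce this with OpenAI's tiktoken library; the IDs shown match the GPT-2 encoding, while GPT-4's cl100k_base encoding splits text differently and has the ~100k vocabulary mentioned below:

import tiktoken  # pip install tiktoken

gpt2 = tiktoken.get_encoding("gpt2")
print(gpt2.encode("Hello, world!"))       # [15496, 11, 995, 0]
print(gpt2.decode([15496, 11, 995, 0]))   # Hello, world!

gpt4 = tiktoken.get_encoding("cl100k_base")
print(gpt4.encode("Hello, world!"))       # a different (still short) ID sequence
print(gpt4.n_vocab)                       # 100277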

Byte Pair Encoding (BPE)

The dominant tokenization algorithm (a toy version is sketched after these steps):

  1. Start with individual bytes/characters
  2. Find most common adjacent pairs
  3. Merge into new tokens
  4. Repeat until vocabulary size reached
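A toy, character-level version of this loop (real tokenizers such as GPT's operate on raw bytes and add regex pre-splitting and special tokens, but the merge rule is the same idea):

from collections import Counter

def bpe_train(text, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    tokens = list(text)                                # 1. start with individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))       # 2. count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]            #    pick the most common pair
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):                         # 3. merge that pair into one new token
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged                                # 4. repeat until the merge budget is used
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", num_merges=5)
print(merges)   # e.g. [('l', 'o'), ('lo', 'w'), ...]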

GPT-4 Vocabulary: 100,277 unique tokens (the cl100k_base encoding)

Why Tokenization Matters

Aspect           Impact
---------------  -------------------------------------------
Efficiency       Compress text into fewer units
Context Window   More content fits in the limited window
Cost             Fewer tokens = lower API costs
Limitations      Character-level tasks become difficult

Tokenization Pitfalls

Models struggle with:

  • Spelling tasks: "How many 'r's in 'strawberry'?" (tokens don't preserve characters)
  • Counting: Ellipses "..." may be single or multiple tokens
  • Non-English text: Often requires more tokens per word

Practical Tip: Use code-based solutions for character/counting tasks.

# Bad: Ask LLM to count letters
"How many r's in strawberry?"  # LLM may fail

# Good: Have LLM generate code
len([c for c in "strawberry" if c == 'r'])  # Returns 3

Tools

  • Tiktokenizer: Visualize how text becomes tokens
  • tiktoken (Python library): OpenAI's tokenizer

1.3 Neural Network Fundamentals

The Transformer Architecture

Modern LLMs use the Transformer architecture (Vaswani et al., 2017).

Key Properties:

  • Processes sequences of tokens in parallel
  • Uses "attention" to relate tokens to each other (sketched in code below)
  • Billions of parameters (weights) store learned patterns
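A minimal single-head self-attention sketch in NumPy, showing how each token's output becomes a weighted blend of every token's vector; a real Transformer adds learned query/key/value projections, multiple heads, causal masking, and feed-forward layers on top of this:

import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: weight each value vector by query-key similarity."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # how strongly each token attends to each other token
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row is a probability distribution
    return weights @ V                               # blend the value vectors

seq_len, d_model = 4, 8                              # toy sizes: 4 tokens, 8-dim vectors
x = np.random.randn(seq_len, d_model)                # stand-in for token embeddings
print(self_attention(x, x, x).shape)                 # (4, 8): one updated vector per token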

Input/Output Flow

┌─────────────────────────────────────────────────────────┐
│                    TRANSFORMER I/O                       │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  INPUT: Token sequence [91, 860, 287, 11579, ...]       │
│                         ↓                                │
│  ┌─────────────────────────────────────────────┐        │
│  │           TRANSFORMER NETWORK                │        │
│  │  (Billions of parameters doing matrix ops)   │        │
│  └─────────────────────────────────────────────┘        │
│                         ↓                                │
│  OUTPUT: Probability distribution over 100k tokens      │
│          [0.001, 0.023, ..., 0.412, ...]                │
│                         ↓                                │
│  PREDICTION: Most likely next token (or sampled)        │
│                                                          │
└─────────────────────────────────────────────────────────┘
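The bottom of the diagram in code: a sketch that converts a logits vector into a probability distribution over the vocabulary and then picks the next token, either greedily or by sampling (the logits here are random stand-ins for a real forward pass):

import numpy as np

def next_token(logits, temperature=1.0, greedy=False):
    """Softmax the logits, then take the argmax or sample from the distribution."""
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()                               # probability distribution over all tokens
    if greedy:
        return int(np.argmax(probs))                   # most likely next token
    return int(np.random.choice(len(probs), p=probs))  # sampled next token

vocab_size = 100_277                                   # GPT-4-scale vocabulary
logits = np.random.randn(vocab_size)                   # pretend output of the Transformer
print(next_token(logits, greedy=True))
print(next_token(logits, temperature=0.8))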

Context Windows

The context window defines how much "memory" the model has during generation:

Model      Context Length       ~Words
---------  -------------------  ------------------
GPT-2      1,024 tokens         ~750
GPT-3.5    4,096 tokens         ~3,000
GPT-4      8,192-128k tokens    ~6,000-96,000
Claude 3   200k tokens          ~150,000
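The ~Words column reflects a common rule of thumb of roughly 0.75 English words per token; a tiny helper assuming that heuristic:

def approx_words(context_tokens, words_per_token=0.75):
    """Rough rule of thumb: ~0.75 English words per token (varies by language and content)."""
    return int(context_tokens * words_per_token)

for name, tokens in [("GPT-2", 1_024), ("GPT-4", 128_000), ("Claude 3", 200_000)]:
    print(f"{name}: {tokens:,} tokens ≈ {approx_words(tokens):,} words")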

Relevance to KeenDreams: Context window is like "working memory." KeenDreams acts as extended memory beyond this window - storing project context that can be retrieved and loaded into the model's active context.


1.4 The Training Process

How Neural Networks Learn

Training adjusts billions of parameters to minimize prediction error (a one-step sketch follows this list):

  1. Forward Pass: Feed tokens, get predictions
  2. Loss Calculation: Compare predictions to actual next tokens
  3. Backpropagation: Calculate how to adjust each parameter
  4. Update: Apply small adjustments to all parameters
  5. Repeat: Millions of times across training data
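A minimal PyTorch sketch of one such step, assuming `model` is any torch.nn.Module that maps a (batch, time) tensor of token IDs to (batch, time, vocab) logits and `optimizer` is a standard torch optimizer:

import torch.nn.functional as F

def train_step(model, optimizer, batch):
    """One gradient step: predict each next token, measure the error, nudge all parameters."""
    inputs, targets = batch[:, :-1], batch[:, 1:]       # shift by one: predict the next token
    logits = model(inputs)                              # 1. forward pass -> (B, T, vocab) logits
    loss = F.cross_entropy(                             # 2. loss vs. the actual next tokens
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()                                     # 3. backpropagation: gradient per parameter
    optimizer.step()                                    # 4. small update to every parameter
    return loss.item()                                  # 5. the caller repeats this across the dataset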

Key Insight from Karpathy

"Every single line [of training] is improving the prediction of 1B tokens in the training set simultaneously."

Training at Scale

GPT-2 Training (2019):

  • 1.6 billion parameters
  • ~100 billion tokens
  • ~$40,000 cost
  • Weeks of compute

Karpathy's Reproduction (2024):

  • Same model using llm.c
  • Cost: $672 (optimized could be ~$100)
  • Demonstrates massive efficiency gains

What Training Produces

The result is a Base Model - essentially an "expensive autocomplete" that simulates internet text patterns.

Base Model Characteristics:

  • Completes text in style of training data
  • No inherent helpfulness or safety
  • Requires post-training to become useful assistant

1.5 Key Takeaways

Summary

Concept          Key Point
---------------  ------------------------------------------------------
Data             Internet-scale, heavily filtered, ~44TB
Tokenization     100k vocabulary, BPE algorithm, efficiency trade-offs
Neural Network   Billions of parameters, predicts next token
Context Window   Limited "working memory" (1k-200k tokens)
Training         Weeks/months, expensive but improving rapidly
Output           Base model = token simulator, not assistant

For AI Analytics Platforms

Monitoring Insights from This Module (sketched in code after this list):

  1. Token Usage Tracking: Every API call consumes tokens - track and optimize
  2. Context Window Utilization: Monitor how much context is used vs. available
  3. Tokenization Costs: Different text has different token densities
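A sketch of what such tracking could look like with tiktoken; the per-1k-token prices below are hypothetical placeholders, not real provider rates:

import tiktoken

PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}     # hypothetical prices per 1k tokens

def usage_report(prompt, completion, context_limit, model="gpt-4"):
    """Count tokens on both sides of a call and derive utilization and cost estimates."""
    enc = tiktoken.encoding_for_model(model)
    p, c = len(enc.encode(prompt)), len(enc.encode(completion))
    return {
        "prompt_tokens": p,                                              # insight 1: usage tracking
        "completion_tokens": c,
        "context_utilization": round((p + c) / context_limit, 4),        # insight 2: window utilization
        "estimated_cost_usd": round(p / 1000 * PRICE_PER_1K["prompt"]    # insight 3: token density -> cost
                                    + c / 1000 * PRICE_PER_1K["completion"], 6),
    }

print(usage_report("Summarize this report...", "The report covers...", context_limit=128_000))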

For KeenDreams

Applicable Learnings:

  1. Semantic Memory: Like training data filtering, store meaningful patterns not raw data
  2. Context Loading: Strategic retrieval to maximize context window utility
  3. Working Memory Metaphor: Context window = active session; KeenDreams = long-term storage

Practice Questions

  1. Why can't LLMs easily count letters in words?
  2. What's the relationship between vocabulary size and tokenization efficiency?
  3. Why does a base model require post-training to be useful?
  4. How does context window size affect model capabilities?

Next Module

Module 2: Inference & Base Models


Timestamps: 0:00:00 - Introduction | 0:01:00 - Pretraining Data | 0:07:47 - Tokenization | 0:14:27 - Neural Network I/O | 0:20:11 - Neural Network Internals | 0:31:09 - GPT-2