Duration: ~45 minutes of video content | Timestamps: 0:00:00 - 0:42:52
1.1 Data Collection & Preprocessing
The Internet as Training Data
Training an LLM begins with crawling the internet to build a massive text dataset. The scale is staggering:
| Dataset | Size | Description |
|---|---|---|
| Common Crawl | Petabytes | Raw internet snapshots |
| FineWeb | 1.2B+ web pages | Filtered, cleaned dataset |
| Training Data (compressed) | ~44 terabytes | After processing |
The Data Pipeline
Raw Internet → URL Filtering → Text Extraction → Language Filtering → PII Removal → Quality Filtering → Training Data
Filtering Stages (a minimal code sketch follows this list):
- URL Filtering: Excludes domains with marketing, spam, or malware
- Text Extraction: Removes HTML markup, retains text only
- Language Filtering: Keeps pages with >65% target language content
- PII Removal: Eliminates addresses, SSNs, personal identifiers
- Quality Filtering: Removes duplicates, low-quality content
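These stages compose into a simple sequential pipeline. Below is a minimal sketch of that idea in Python; the blocklist, regexes, and threshold are illustrative stand-ins, not FineWeb's actual implementation.

```python
import re

BLOCKLIST = {"spam.example", "ads.example"}        # illustrative, not a real domain list

def extract_text(html: str) -> str:
    """Strip HTML markup, keep visible text only (crude regex for illustration)."""
    return re.sub(r"<[^>]+>", " ", html)

def remove_pii(text: str) -> str:
    """Redact simple identifiers, e.g. SSN-like patterns."""
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED]", text)

def filter_pages(pages, target_lang_score, threshold=0.65):
    """pages: iterable of (domain, html); target_lang_score: text -> probability of target language."""
    seen = set()
    for domain, html in pages:
        if domain in BLOCKLIST:                    # URL filtering
            continue
        text = remove_pii(extract_text(html))      # text extraction + PII removal
        if target_lang_score(text) < threshold:    # language filtering (>65% target language)
            continue
        if (digest := hash(text)) in seen:         # quality filtering: drop exact duplicates
            continue
        seen.add(digest)
        yield text
```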
Key Insight
"Raw data is noisy and full of duplicate content, low-quality text, and irrelevant information. Before training, it needs heavy filtering."
Relevance to KeenDreams: This mirrors how cloud brain memory should work - not storing everything, but filtering for meaningful, high-quality context that improves recall and decision-making.
1.2 Tokenization
What is Tokenization?
Text is converted into tokens: short chunks of text (bytes, subwords, or whole words) that map to integer IDs. This is how LLMs "see" language.
Example:
Input: "Hello, world!"
Tokens: [15496, 11, 995, 0] (4 tokens)
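The exact IDs depend on the tokenizer; the ones shown here come from the GPT-2 vocabulary. A quick way to inspect this yourself is the tiktoken library listed under Tools below (a minimal sketch, assuming tiktoken is installed):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")            # GPT-2's BPE vocabulary (~50k tokens)
ids = enc.encode("Hello, world!")
print(ids)                                     # [15496, 11, 995, 0]
print([enc.decode([i]) for i in ids])          # ['Hello', ',', ' world', '!']

# GPT-4's encoding (cl100k_base) has a larger vocabulary and segments text differently
enc4 = tiktoken.get_encoding("cl100k_base")
print(enc4.n_vocab, len(enc4.encode("Hello, world!")))
```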
Byte Pair Encoding (BPE)
The dominant tokenization algorithm:
- Start with individual bytes/characters
- Find most common adjacent pairs
- Merge into new tokens
- Repeat until vocabulary size reached
GPT-4 Vocabulary: ~100,277 unique tokens
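A toy version of the merge loop makes the algorithm concrete. This is a didactic sketch operating on raw bytes; production tokenizers such as GPT-4's add regex pre-splitting, special tokens, and other refinements.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE: start from bytes, repeatedly merge the most frequent adjacent pair."""
    ids = list(text.encode("utf-8"))              # start with individual bytes (ids 0-255)
    merges = {}                                   # (id, id) pair -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))        # count adjacent pairs
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]         # most frequent adjacent pair
        merges[pair] = next_id
        out, i = [], 0                            # replace every occurrence with the new token
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids, next_id = out, next_id + 1
    return ids, merges

ids, merges = train_bpe("low lower lowest low low", num_merges=10)
print(len(ids), 256 + len(merges))                # compressed length, resulting vocabulary size
```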
Why Tokenization Matters
| Aspect | Impact |
|---|---|
| Efficiency | Compress text into fewer units |
| Context Window | More content fits in limited window |
| Cost | Fewer tokens = lower API costs |
| Limitations | Character-level tasks become difficult |
Tokenization Pitfalls
Models struggle with:
- Spelling tasks: "How many 'r's in 'strawberry'?" (tokens don't preserve characters)
- Counting: Ellipses "..." may be single or multiple tokens
- Non-English text: Often requires more tokens per word
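The last point is easy to check directly: encoding the same short greeting in several languages usually produces very different token counts. A small sketch with tiktoken (actual counts depend on the encoding used):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["Hello, world!", "Grüß Gott, Welt!", "こんにちは世界", "Здравствуй, мир!"]:
    ids = enc.encode(text)
    print(f"{text!r}: {len(ids)} tokens for {len(text)} characters")
```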
Practical Tip: Use code-based solutions for character/counting tasks.
# Bad: ask the LLM to count letters directly
prompt = "How many r's in strawberry?"  # the model sees tokens, not characters, and may miscount

# Good: have the LLM generate code instead
count = len([c for c in "strawberry" if c == 'r'])  # returns 3
Tools
- Tiktokenizer: Visualize how text becomes tokens
- tiktoken (Python library): OpenAI's tokenizer
1.3 Neural Network Fundamentals
The Transformer Architecture
Modern LLMs use the Transformer architecture (Vaswani et al., 2017).
Key Properties:
- Processes sequences of tokens in parallel
- Uses "attention" to relate tokens to each other
- Billions of parameters (weights) store learned patterns
Input/Output Flow
TRANSFORMER I/O

INPUT:      Token sequence [91, 860, 287, 11579, ...]
                ↓
            TRANSFORMER NETWORK
            (billions of parameters doing matrix ops)
                ↓
OUTPUT:     Probability distribution over 100k tokens
            [0.001, 0.023, ..., 0.412, ...]
                ↓
PREDICTION: Most likely next token (or sampled)
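The last two steps of the flow above (probability distribution, then prediction) amount to a softmax over the vocabulary followed by greedy or sampled selection. A minimal NumPy sketch, with random logits standing in for the network's actual output:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 100_277                       # e.g. GPT-4's vocabulary size
logits = rng.normal(size=vocab_size)       # stand-in for the network's raw scores at one position

# Softmax turns raw scores into a probability distribution over the whole vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()

greedy_token = int(np.argmax(probs))                   # "most likely next token"
sampled_token = int(rng.choice(vocab_size, p=probs))   # "... or sampled"
print(greedy_token, sampled_token, round(probs.sum(), 6))
```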
Context Windows
The context window defines how much "memory" the model has during generation:
| Model | Context Length | ~Words |
|---|---|---|
| GPT-2 | 1,024 tokens | ~750 |
| GPT-3.5 | 4,096 tokens | ~3,000 |
| GPT-4 | 8,192-128k tokens | ~6,000-96,000 |
| Claude 3 | 200k tokens | ~150,000 |
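Because these limits are counted in tokens rather than words, it helps to measure a prompt programmatically before sending it. A small sketch with tiktoken; the limits dictionary and the output reserve are illustrative values, not authoritative ones.

```python
import tiktoken

CONTEXT_LIMITS = {"gpt-3.5": 4_096, "gpt-4": 8_192}   # illustrative, taken from the table above

def fits_in_context(prompt: str, limit: int, reserve_for_output: int = 512) -> bool:
    """Rough check: does the prompt leave room for a reply within the context window?"""
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(prompt)) + reserve_for_output <= limit

print(fits_in_context("Summarize the project history so far.", CONTEXT_LIMITS["gpt-4"]))
```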
Relevance to KeenDreams: Context window is like "working memory." KeenDreams acts as extended memory beyond this window - storing project context that can be retrieved and loaded into the model's active context.
1.4 The Training Process
How Neural Networks Learn
Training adjusts billions of parameters to minimize prediction error (a skeletal training step in code follows this list):
- Forward Pass: Feed tokens, get predictions
- Loss Calculation: Compare predictions to actual next tokens
- Backpropagation: Calculate how to adjust each parameter
- Update: Apply small adjustments to all parameters
- Repeat: Millions of times across training data
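A skeletal version of one such step in PyTorch, assuming some model that maps token IDs to next-token logits; real pretraining adds learning-rate schedules, mixed precision, and distributed data loading.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, tokens):
    """One step of next-token training. tokens: LongTensor of shape (batch, seq_len + 1)."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # each position predicts the following token
    logits = model(inputs)                             # forward pass -> (batch, seq_len, vocab_size)
    loss = F.cross_entropy(                            # loss: predictions vs. actual next tokens
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()                                    # backpropagation: gradient for every parameter
    optimizer.step()                                   # update: small adjustment to all parameters
    return loss.item()

# Repeated millions of times over batches sampled from the filtered training data.
```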
Key Insight from Karpathy
"Every single line [of training] is improving the prediction of 1B tokens in the training set simultaneously."
Training at Scale
GPT-2 Training (2019):
- 1.6 billion parameters
- ~100 billion tokens
- ~$40,000 cost
- Weeks of compute
Karpathy's Reproduction (2024):
- Same model using llm.c
- Cost: $672 (optimized could be ~$100)
- Demonstrates massive efficiency gains
What Training Produces
The result is a Base Model - essentially an "expensive autocomplete" that simulates internet text patterns.
Base Model Characteristics:
- Completes text in style of training data
- No inherent helpfulness or safety
- Requires post-training to become useful assistant
1.5 Key Takeaways
Summary
| Concept | Key Point |
|---|---|
| Data | Internet-scale, heavily filtered, ~44TB |
| Tokenization | 100k vocabulary, BPE algorithm, efficiency trade-offs |
| Neural Network | Billions of parameters, predicts next token |
| Context Window | Limited "working memory" (1k-200k tokens) |
| Training | Weeks/months, expensive but improving rapidly |
| Output | Base model = token simulator, not assistant |
For AI Analytics Platforms
Monitoring Insights from This Module (a minimal tracking sketch follows this list):
- Token Usage Tracking: Every API call consumes tokens - track and optimize
- Context Window Utilization: Monitor how much context is used vs. available
- Tokenization Costs: Different text has different token densities
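A minimal sketch of the first two points: counting tokens locally with tiktoken and reporting utilization against a context limit. The 128k default and the field names are illustrative, not any specific provider's API.

```python
import tiktoken

def log_token_usage(prompt: str, completion: str, context_limit: int = 128_000) -> dict:
    """Count tokens on both sides of a call and report context-window utilization."""
    enc = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = len(enc.encode(prompt))
    completion_tokens = len(enc.encode(completion))
    total = prompt_tokens + completion_tokens
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": total,
        "context_utilization": total / context_limit,
    }
```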
For KeenDreams
Applicable Learnings:
- Semantic Memory: Like training data filtering, store meaningful patterns not raw data
- Context Loading: Strategic retrieval to maximize context window utility
- Working Memory Metaphor: Context window = active session; KeenDreams = long-term storage
Practice Questions
- Why can't LLMs easily count letters in words?
- What's the relationship between vocabulary size and tokenization efficiency?
- Why does a base model require post-training to be useful?
- How does context window size affect model capabilities?
Next Module
→ Module 2: Inference & Base Models
Timestamps: 0:00:00 - Introduction | 0:01:00 - Pretraining Data | 0:07:47 - Tokenization | 0:14:27 - Neural Network I/O | 0:20:11 - Neural Network Internals | 0:31:09 - GPT-2