Module 08

Building Memory Systems for LLMs

Supplementary Material Focus: Designing persistent memory to enhance LLM applications

8.1 The Memory Problem

Why LLMs Need External Memory

From the course, we learned that LLMs have two types of "knowledge":

Type         Location         Characteristics
-----------------------------------------------------------------
Parametric   Model weights    Compressed, approximate, can't update
Contextual   Context window   Exact, limited size, per-request only

The fundamental limitation: Context windows reset every conversation.

┌─────────────────────────────────────────────────────────────┐
│                 THE MEMORY GAP                               │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Session 1: User discusses Project Alpha                     │
│  ─────────────────────────────────────────                  │
│  "We're building a React app with Supabase..."              │
│  "The auth system uses JWT tokens..."                       │
│  "We fixed the CORS bug yesterday..."                       │
│                                                              │
│                    ⬇️ SESSION ENDS ⬇️                        │
│                    🧠 CONTEXT CLEARED                        │
│                                                              │
│  Session 2: Same user returns                                │
│  ─────────────────────────────────────────                  │
│  User: "What was that CORS fix we did?"                     │
│  LLM: "I don't have any information about previous          │
│        conversations. Could you provide more context?"       │
│                                                              │
│  😞 USER FRUSTRATED - HAS TO RE-EXPLAIN EVERYTHING          │
│                                                              │
└─────────────────────────────────────────────────────────────┘

The Solution: External Memory

Build a system that:

  1. Captures important context during conversations
  2. Stores it persistently (database, vector store)
  3. Retrieves relevant context for new sessions
  4. Injects it into the context window
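The four steps above can be sketched end to end. This is a minimal, illustrative sketch: the names (`MemoryStore`, `build_system_prompt`) and the in-memory dict standing in for a real database are assumptions, not part of any specific library.

```python
# Minimal sketch of the capture -> store -> retrieve -> inject loop.
# An in-memory dict stands in for a persistent database or vector store.

class MemoryStore:
    def __init__(self):
        self._items = {}  # user_id -> list of captured facts

    def capture(self, user_id: str, fact: str) -> None:
        # Steps 1-2: capture important context and store it
        self._items.setdefault(user_id, []).append(fact)

    def retrieve(self, user_id: str) -> list:
        # Step 3: retrieve stored context for a new session
        return self._items.get(user_id, [])

def build_system_prompt(store: MemoryStore, user_id: str) -> str:
    # Step 4: inject retrieved facts into the context window
    facts = store.retrieve(user_id)
    return "Known context:\n" + "\n".join(f"- {f}" for f in facts)

store = MemoryStore()
store.capture("user_123", "Building a React app with Supabase")
store.capture("user_123", "Fixed the CORS bug yesterday")
print(build_system_prompt(store, "user_123"))
```

In production, `MemoryStore` would be backed by one of the storage patterns described in the next section.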

8.2 Memory Architecture Patterns

Pattern 1: Simple Key-Value Memory

Best for: User preferences, settings, simple facts

# Store
memory.set("user_123:preferred_language", "Python")
memory.set("user_123:timezone", "America/New_York")

# Retrieve and inject
preferences = memory.get_all("user_123:*")
system_prompt = f"""
User preferences:
- Preferred language: {preferences['preferred_language']}
- Timezone: {preferences['timezone']}
"""

Pattern 2: Conversation Summarization

Best for: Long-running conversations, session continuity

# At end of session
summary = llm.summarize(conversation_history)
memory.store({
    "user_id": "user_123",
    "session_id": "session_456",
    "summary": summary,
    "key_topics": extract_topics(conversation_history),
    "timestamp": now()
})

# At start of new session
recent_summaries = memory.get_recent("user_123", limit=3)
system_prompt = f"""
Previous conversation summaries:
{format_summaries(recent_summaries)}
"""

Pattern 3: Semantic/Vector Memory

Best for: Finding relevant past context based on meaning

# Store with embeddings
embedding = embed(conversation_chunk)
vector_store.upsert({
    "id": generate_id(),
    "embedding": embedding,
    "text": conversation_chunk,
    "metadata": {"user_id": "user_123", "timestamp": now()}
})

# Retrieve semantically similar memories
query_embedding = embed(user_query)
relevant_memories = vector_store.search(
    query_embedding,
    filter={"user_id": "user_123"},
    top_k=5
)

Pattern 4: Structured Knowledge Graph

Best for: Complex relationships, entity tracking

┌─────────────────────────────────────────────────────────────┐
│                  KNOWLEDGE GRAPH                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  [Project Alpha] ──uses──▶ [React]                          │
│        │                      │                              │
│        ├──uses──▶ [Supabase] ◀──is_a── [Database]          │
│        │                                                     │
│        └──has_bug──▶ [CORS Issue] ──fixed_by──▶ [JWT Fix]  │
│                           │                                  │
│                           └──occurred── [2024-01-15]        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
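The graph above can be represented as subject-relation-object triples. A stdlib-only sketch, assuming a simple in-memory store rather than a real graph database (the class and method names are illustrative):

```python
# Hedged sketch: the knowledge graph as a list of (subject, relation, object)
# triples, indexed by subject for entity lookups.

from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        self.triples = []                    # all (subject, relation, object)
        self.by_subject = defaultdict(list)  # subject -> [(relation, object)]

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.triples.append((subject, relation, obj))
        self.by_subject[subject].append((relation, obj))

    def about(self, entity: str) -> list:
        """All facts where the entity is the subject."""
        return self.by_subject[entity]

kg = KnowledgeGraph()
kg.add("Project Alpha", "uses", "React")
kg.add("Project Alpha", "uses", "Supabase")
kg.add("Project Alpha", "has_bug", "CORS Issue")
kg.add("CORS Issue", "fixed_by", "JWT Fix")

print(kg.about("Project Alpha"))
# [('uses', 'React'), ('uses', 'Supabase'), ('has_bug', 'CORS Issue')]
```

Retrieved triples can be serialized into the system prompt the same way as any other memory type.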

8.3 The Grounding Principle

Context > Parameters

From Module 3, Karpathy's key insight:

"Pasting information directly into context windows produces higher-quality outputs than relying on parametric knowledge."

This means memory-injected context beats model "knowledge":

┌─────────────────────────────────────────────────────────────┐
│              GROUNDING COMPARISON                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  WITHOUT MEMORY (Parametric Only)                            │
│  ────────────────────────────────                           │
│  User: "What's the status of Project Alpha?"                │
│  LLM: "I don't have specific information about              │
│        Project Alpha. Could you tell me more?"              │
│                                                              │
│        ⚠️ No grounding = no useful answer                   │
│                                                              │
│  WITH MEMORY (Context Injection)                            │
│  ────────────────────────────────                           │
│  System: [Memory retrieved and injected]                    │
│  "Project Alpha context:                                     │
│   - React + Supabase stack                                   │
│   - Last commit: Fixed CORS bug (Jan 15)                    │
│   - Current sprint: User authentication                      │
│   - 3 open issues, 2 in progress"                           │
│                                                              │
│  User: "What's the status of Project Alpha?"                │
│  LLM: "Project Alpha is currently focused on user           │
│        authentication. You recently fixed the CORS bug.     │
│        There are 3 open issues with 2 in progress."         │
│                                                              │
│        ✅ Grounded = accurate, helpful answer               │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Why Grounding Reduces Hallucinations

Without context, a model has two choices:

  1. Admit ignorance (rare unless specifically trained to)
  2. Generate plausible-sounding content (hallucinate)

With context:

  1. Extract facts from provided information
  2. Cite sources
  3. Acknowledge gaps in provided context
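One way to encourage these three behaviors is to bake them into the prompt that wraps the injected memories. A sketch of such a template (the wording and numbering scheme are illustrative, not a fixed format):

```python
# Illustrative grounding template: instructs the model to extract facts,
# cite them by item number, and admit gaps in the provided context.

def grounded_prompt(context_items: list, question: str) -> str:
    context = "\n".join(f"[{i + 1}] {item}" for i, item in enumerate(context_items))
    return (
        "Answer using ONLY the context below. Cite items as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(grounded_prompt(
    ["Project Alpha uses React + Supabase", "CORS bug fixed Jan 15"],
    "What stack is Project Alpha on?",
))
```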

8.4 What to Remember

High-Value Memory Types

Memory Type              Example                                    Retrieval Trigger
--------------------------------------------------------------------------------------
User Preferences         "Prefers TypeScript"                       Every session
Project Context          Tech stack, file structure                 Project-related queries
Problem-Solution Pairs   "Fixed X by doing Y"                       Similar problems
Decisions Made           "Chose Postgres over MySQL because..."     Architecture questions
Terminology              "Widget = the sidebar component"           Uses of the term

Memory Schema Example

{
  "id": "mem_abc123",
  "type": "problem_solution",
  "created_at": "2024-01-15T10:30:00Z",
  "user_id": "user_123",
  "project": "project_alpha",

  "content": {
    "problem": "CORS errors when calling API from frontend",
    "context": "React app calling Supabase Edge Functions",
    "solution": "Added Access-Control-Allow-Origin header to function",
    "file_changed": "supabase/functions/api/index.ts",
    "verified": true
  },

  "embedding": [0.123, -0.456, ...],

  "metadata": {
    "importance": "high",
    "access_count": 3,
    "last_accessed": "2024-01-20T15:00:00Z"
  }
}

8.5 Retrieval Strategies

When to Retrieve

Trigger             Action
------------------------------------------------------
Session start       Load user preferences, recent context
Project mentioned   Load project-specific memories
Problem described   Search for similar problems
Question asked      Semantic search for relevant info
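These triggers can be wired up as a small router. The event names and returned action strings here are assumptions for the sketch, not a fixed API:

```python
# Hypothetical router mapping conversation events to retrieval actions.

def route_retrieval(event: str, payload: dict) -> str:
    if event == "session_start":
        return f"load_preferences:{payload['user_id']}"
    if event == "project_mentioned":
        return f"load_project:{payload['project']}"
    if event == "problem_described":
        return f"search_similar_problems:{payload['text']}"
    # Default: treat anything else as a question -> semantic search
    return f"semantic_search:{payload['text']}"

print(route_retrieval("session_start", {"user_id": "user_123"}))
# load_preferences:user_123
```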

How Much to Retrieve

Balance between context quality and token usage:

def get_optimal_context(query, user_id, max_tokens=2000):
    memories = []
    token_count = 0

    # Priority 1: User preferences (always include)
    prefs = get_user_preferences(user_id)
    memories.append(prefs)
    token_count += count_tokens(prefs)

    # Priority 2: Semantically relevant memories
    relevant = semantic_search(query, user_id, limit=10)
    for mem in relevant:
        if token_count + count_tokens(mem) < max_tokens:
            memories.append(mem)
            token_count += count_tokens(mem)
        else:
            break

    return format_as_context(memories)

Relevance Scoring

Not all memories are equally useful:

def score_memory_relevance(memory, query):
    """Weighted score in [0, 1]; the four weights sum to 1.0."""
    score = 0.0

    # Semantic similarity (0-1)
    score += cosine_similarity(memory.embedding, query.embedding) * 0.4

    # Recency boost (decays to zero over 30 days)
    days_old = (now() - memory.created_at).days
    score += max(0, 1 - days_old / 30) * 0.2

    # Usage frequency (capped at 10 accesses)
    score += min(memory.access_count / 10, 1) * 0.2

    # Explicit importance (default to medium if unset)
    importance_map = {"low": 0.1, "medium": 0.5, "high": 1.0}
    score += importance_map.get(memory.importance, 0.5) * 0.2

    return score

8.6 Memory and Reasoning

Tokens for Thinking

From Module 3: LLMs need tokens to reason. Memory systems can help:

Store Reasoning Patterns

{
  "type": "reasoning_pattern",
  "trigger": "debugging React app",
  "pattern": [
    "1. Check browser console for errors",
    "2. Verify component props",
    "3. Check state management",
    "4. Review recent git changes",
    "5. Test in isolation"
  ]
}

Inject When Relevant

System: When debugging React issues, follow this proven approach:
1. Check browser console for errors
2. Verify component props
...

Learn from Success and Failure

def log_outcome(memory_id, outcome, feedback=None):
    """Track whether retrieved memories helped"""
    memory = get_memory(memory_id)

    if outcome == "helpful":
        memory.success_count += 1
        memory.importance = recalculate_importance(memory)
    elif outcome == "not_helpful":
        memory.failure_count += 1
        if memory.failure_count > 3:
            memory.importance = "low"

    if feedback:
        memory.add_feedback(feedback)

    save_memory(memory)

8.7 Implementation Approaches

Approach 1: Simple Database

Good for: Getting started, small scale

# SQLite / PostgreSQL via SQLAlchemy, with pgvector for embeddings
from sqlalchemy import JSON, Column, DateTime, String
from sqlalchemy.orm import declarative_base
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class Memory(Base):
    __tablename__ = "memories"
    id = Column(String, primary_key=True)
    user_id = Column(String, index=True)
    content = Column(JSON)
    created_at = Column(DateTime)
    embedding = Column(Vector(1536))  # pgvector column

Approach 2: Vector Database

Good for: Semantic search, scaling

Database   Strengths
------------------------------------------
Pinecone   Managed, easy to start
Weaviate   Open source, hybrid search
Qdrant     Fast, Rust-based
Chroma     Embedded, great for prototyping
pgvector   If already using Postgres
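All of these databases expose the same core operation: nearest-neighbour search over embeddings. A stdlib-only sketch of that operation, with toy 3-dimensional vectors standing in for real embeddings (a production system would delegate this to one of the databases above):

```python
# Brute-force cosine-similarity search -- the operation every vector
# database optimizes. Toy 3-d vectors stand in for real embeddings.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(store, query_vec, top_k=2):
    # Score every stored item, then keep the top_k most similar
    scored = [(cosine_similarity(query_vec, item["embedding"]), item)
              for item in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]

store = [
    {"text": "Fixed CORS by adding headers", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Chose Postgres over MySQL", "embedding": [0.0, 0.8, 0.2]},
]
results = search(store, query_vec=[1.0, 0.0, 0.0], top_k=1)
print(results[0]["text"])  # Fixed CORS by adding headers
```

Real systems replace the linear scan with an approximate nearest-neighbour index, which is what makes these databases scale.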

Combine structured + vector:

┌─────────────────────────────────────────────────────────────┐
│                  HYBRID MEMORY SYSTEM                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  PostgreSQL (Structured)         Pinecone (Semantic)        │
│  ┌─────────────────────┐        ┌────────────────────┐     │
│  │ Users, Projects     │        │ Memory embeddings  │     │
│  │ Preferences         │        │ for similarity     │     │
│  │ Relationships       │◀──────▶│ search             │     │
│  │ Metadata            │        │                    │     │
│  └─────────────────────┘        └────────────────────┘     │
│           │                              │                   │
│           └──────────┬───────────────────┘                  │
│                      ▼                                       │
│           ┌─────────────────────┐                           │
│           │  Application Layer  │                           │
│           │  - Query routing    │                           │
│           │  - Context assembly │                           │
│           │  - Token management │                           │
│           └─────────────────────┘                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘

8.8 Privacy and Security

What to Consider

Concern           Mitigation
----------------------------------------------------
PII in memories   Automatic redaction before storage
Data retention    User-controlled deletion
Access control    Per-user isolation
Encryption        At-rest and in-transit
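Pre-storage redaction can be sketched with a couple of regexes. The patterns below are deliberately simplistic and illustrative; production systems use dedicated PII-detection tooling rather than hand-rolled regexes:

```python
# Hedged sketch of pre-storage PII redaction. These patterns only catch
# obvious emails and US-style phone numbers -- an illustration, not a
# complete solution.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Contact alice@example.com or 555-123-4567"))
# Contact [EMAIL] or [PHONE]
```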

User Control

# Allow users to manage their memories
def delete_user_memories(user_id):
    """GDPR-style complete deletion"""
    vector_store.delete(filter={"user_id": user_id})
    database.delete(Memory.user_id == user_id)

def export_user_memories(user_id):
    """Data portability"""
    memories = get_all_memories(user_id)
    return json.dumps(memories, indent=2)

8.9 Measuring Memory Effectiveness

Key Metrics

Metric                    Description                         Target
----------------------------------------------------------------------------
Retrieval relevance       % of retrieved memories used        > 70%
Grounding rate            % of responses citing context       > 80%
Hallucination reduction   Compared to no-memory baseline      > 50% reduction
User satisfaction         Repeat context requests             Decreasing
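The first two metrics are straightforward ratios over logged interaction data. An illustrative sketch (the field name `cites_context` is an assumption about the logging schema):

```python
# Illustrative helpers for computing retrieval relevance and grounding
# rate from logged data.

def retrieval_relevance(num_retrieved: int, num_used: int) -> float:
    """Fraction of retrieved memories the model actually used."""
    return num_used / num_retrieved if num_retrieved else 0.0

def grounding_rate(responses: list) -> float:
    """Fraction of responses that cited the injected context."""
    cited = sum(1 for r in responses if r["cites_context"])
    return cited / len(responses) if responses else 0.0

print(retrieval_relevance(10, 7))  # 0.7
print(grounding_rate([{"cites_context": True}, {"cites_context": False}]))  # 0.5
```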

A/B Testing

# Compare with/without memory
def run_experiment(user_id, query):
    has_memory = user_id in treatment_group
    if has_memory:
        context = retrieve_memories(user_id, query)
        response = llm.generate(query, context=context)
    else:
        response = llm.generate(query)

    log_experiment(user_id, response, has_memory=has_memory)
    return response

8.10 Key Takeaways

The Memory Hierarchy

┌─────────────────────────────────────────────────────────────┐
│                  MEMORY PRIORITY                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. User Preferences (always load)                          │
│     └─ Language, timezone, communication style              │
│                                                              │
│  2. Active Project Context (load when relevant)             │
│     └─ Tech stack, recent changes, current goals            │
│                                                              │
│  3. Problem-Solution History (semantic search)              │
│     └─ Past bugs, fixes, decisions                          │
│                                                              │
│  4. General Knowledge (lowest priority)                      │
│     └─ Documentation, tutorials, patterns                   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Design Principles

  1. Context beats parameters - Always prefer injected context
  2. Less is more - Retrieve only what's relevant
  3. Recency matters - Recent memories often most useful
  4. Learn from usage - Track what helps, demote what doesn't
  5. User control - Let users manage their memories

Remember

"External memory transforms an LLM's 'vague recollection' into 'exact working memory.'"

Well-designed memory systems:

  • Reduce hallucinations through grounding
  • Improve user experience with continuity
  • Enable learning across sessions
  • Build trust through consistency

Practice Exercises

  1. Design a memory schema for a coding assistant
  2. Implement relevance scoring for your use case
  3. Build a simple vector memory with Chroma
  4. Create a retrieval strategy that respects token limits

Synthesized from production memory system patterns