Module 08

Building Memory Systems for LLMs

Supplementary Material Focus: Designing persistent memory to enhance LLM applications

8.1 The Memory Problem

Why LLMs Need External Memory

From the course, we learned that LLMs have two types of "knowledge":

Type         Location         Characteristics
-----------------------------------------------------------------
Parametric   Model weights    Compressed, approximate, can't update
Contextual   Context window   Exact, limited size, per-request only

The fundamental limitation: Context windows reset every conversation.

┌─────────────────────────────────────────────────────────────┐
│                 THE MEMORY GAP                               │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Session 1: User discusses Project Alpha                     │
│  ─────────────────────────────────────────                  │
│  "We're building a React app with Supabase..."              │
│  "The auth system uses JWT tokens..."                       │
│  "We fixed the CORS bug yesterday..."                       │
│                                                              │
│                    ⬇️ SESSION ENDS ⬇️                        │
│                    🧠 CONTEXT CLEARED                        │
│                                                              │
│  Session 2: Same user returns                                │
│  ─────────────────────────────────────────                  │
│  User: "What was that CORS fix we did?"                     │
│  LLM: "I don't have any information about previous          │
│        conversations. Could you provide more context?"       │
│                                                              │
│  😞 USER FRUSTRATED - HAS TO RE-EXPLAIN EVERYTHING          │
│                                                              │
└─────────────────────────────────────────────────────────────┘

The Solution: External Memory

Build a system that:

  1. Captures important context during conversations
  2. Stores it persistently (database, vector store)
  3. Retrieves relevant context for new sessions
  4. Injects it into the context window
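The four steps above can be sketched end to end. This is a minimal, illustrative sketch: the names (`MemoryStore`, `build_system_prompt`) and the in-memory dict standing in for a real database are assumptions, not part of any specific library.

```python
# Minimal sketch of the capture -> store -> retrieve -> inject loop.
# An in-memory dict stands in for a persistent database or vector store.

class MemoryStore:
    def __init__(self):
        self._items = {}  # user_id -> list of captured facts

    def capture(self, user_id: str, fact: str) -> None:
        # Steps 1-2: capture important context and store it
        self._items.setdefault(user_id, []).append(fact)

    def retrieve(self, user_id: str) -> list:
        # Step 3: retrieve stored context for a new session
        return self._items.get(user_id, [])

def build_system_prompt(store: MemoryStore, user_id: str) -> str:
    # Step 4: inject retrieved facts into the context window
    facts = store.retrieve(user_id)
    return "Known context:\n" + "\n".join(f"- {f}" for f in facts)

store = MemoryStore()
store.capture("user_123", "Building a React app with Supabase")
store.capture("user_123", "Fixed the CORS bug yesterday")
print(build_system_prompt(store, "user_123"))
```

In production, `MemoryStore` would be backed by one of the storage patterns described in the next section.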

8.2 Memory Architecture Patterns

Pattern 1: Simple Key-Value Memory

Best for: User preferences, settings, simple facts

# Store
memory.set("user_123:preferred_language", "Python")
memory.set("user_123:timezone", "America/New_York")

# Retrieve and inject
preferences = memory.get_all("user_123:*")
system_prompt = f"""
User preferences:
- Preferred language: {preferences['preferred_language']}
- Timezone: {preferences['timezone']}
"""

Pattern 2: Conversation Summarization

Best for: Long-running conversations, session continuity

# At end of session
summary = llm.summarize(conversation_history)
memory.store({
    "user_id": "user_123",
    "session_id": "session_456",
    "summary": summary,
    "key_topics": extract_topics(conversation_history),
    "timestamp": now()
})

# At start of new session
recent_summaries = memory.get_recent("user_123", limit=3)
system_prompt = f"""
Previous conversation summaries:
{format_summaries(recent_summaries)}
"""

Pattern 3: Semantic/Vector Memory

Best for: Finding relevant past context based on meaning

# Store with embeddings
embedding = embed(conversation_chunk)
vector_store.upsert({
    "id": generate_id(),
    "embedding": embedding,
    "text": conversation_chunk,
    "metadata": {"user_id": "user_123", "timestamp": now()}
})

# Retrieve semantically similar memories
query_embedding = embed(user_query)
relevant_memories = vector_store.search(
    query_embedding,
    filter={"user_id": "user_123"},
    top_k=5
)

Pattern 4: Structured Knowledge Graph

Best for: Complex relationships, entity tracking

┌─────────────────────────────────────────────────────────────┐
│                  KNOWLEDGE GRAPH                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  [Project Alpha] ──uses──▶ [React]                          │
│        │                      │                              │
│        ├──uses──▶ [Supabase] ◀──is_a── [Database]          │
│        │                                                     │
│        └──has_bug──▶ [CORS Issue] ──fixed_by──▶ [JWT Fix]  │
│                           │                                  │
│                           └──occurred── [2024-01-15]        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
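The graph above can be represented as subject-relation-object triples. A stdlib-only sketch, assuming a simple in-memory store rather than a real graph database (the class and method names are illustrative):

```python
# Hedged sketch: the knowledge graph as a list of (subject, relation, object)
# triples, indexed by subject for entity lookups.

from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        self.triples = []                    # all (subject, relation, object)
        self.by_subject = defaultdict(list)  # subject -> [(relation, object)]

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.triples.append((subject, relation, obj))
        self.by_subject[subject].append((relation, obj))

    def about(self, entity: str) -> list:
        """All facts where the entity is the subject."""
        return self.by_subject[entity]

kg = KnowledgeGraph()
kg.add("Project Alpha", "uses", "React")
kg.add("Project Alpha", "uses", "Supabase")
kg.add("Project Alpha", "has_bug", "CORS Issue")
kg.add("CORS Issue", "fixed_by", "JWT Fix")

print(kg.about("Project Alpha"))
# [('uses', 'React'), ('uses', 'Supabase'), ('has_bug', 'CORS Issue')]
```

Retrieved triples can be serialized into the system prompt the same way as any other memory type.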

8.3 The Grounding Principle

Context > Parameters

From Module 3, Karpathy's key insight:

"Pasting information directly into context windows produces higher-quality outputs than relying on parametric knowledge."

This means memory-injected context beats model "knowledge":

┌─────────────────────────────────────────────────────────────┐
│              GROUNDING COMPARISON                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  WITHOUT MEMORY (Parametric Only)                            │
│  ────────────────────────────────                           │
│  User: "What's the status of Project Alpha?"                │
│  LLM: "I don't have specific information about              │
│        Project Alpha. Could you tell me more?"              │
│                                                              │
│        ⚠️ No grounding = no useful answer                   │
│                                                              │
│  WITH MEMORY (Context Injection)                            │
│  ────────────────────────────────                           │
│  System: [Memory retrieved and injected]                    │
│  "Project Alpha context:                                     │
│   - React + Supabase stack                                   │
│   - Last commit: Fixed CORS bug (Jan 15)                    │
│   - Current sprint: User authentication                      │
│   - 3 open issues, 2 in progress"                           │
│                                                              │
│  User: "What's the status of Project Alpha?"                │
│  LLM: "Project Alpha is currently focused on user           │
│        authentication. You recently fixed the CORS bug.     │
│        There are 3 open issues with 2 in progress."         │
│                                                              │
│        ✅ Grounded = accurate, helpful answer               │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Why Grounding Reduces Hallucinations

Without context, a model has two choices:

  1. Admit ignorance (rare unless specifically trained to)
  2. Generate plausible-sounding content (hallucinate)

With context:

  1. Extract facts from provided information
  2. Cite sources
  3. Acknowledge gaps in provided context
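One way to encourage these three behaviors is to bake them into the prompt that wraps the injected memories. A sketch of such a template (the wording and numbering scheme are illustrative, not a fixed format):

```python
# Illustrative grounding template: instructs the model to extract facts,
# cite them by item number, and admit gaps in the provided context.

def grounded_prompt(context_items: list, question: str) -> str:
    context = "\n".join(f"[{i + 1}] {item}" for i, item in enumerate(context_items))
    return (
        "Answer using ONLY the context below. Cite items as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(grounded_prompt(
    ["Project Alpha uses React + Supabase", "CORS bug fixed Jan 15"],
    "What stack is Project Alpha on?",
))
```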

8.4 What to Remember

High-Value Memory Types

Memory Type              Example                                    Retrieval Trigger
--------------------------------------------------------------------------------------
User Preferences         "Prefers TypeScript"                       Every session
Project Context          Tech stack, file structure                 Project-related queries
Problem-Solution Pairs   "Fixed X by doing Y"                       Similar problems
Decisions Made           "Chose Postgres over MySQL because..."     Architecture questions
Terminology              "Widget = the sidebar component"           Uses of the term

Memory Schema Example

{
  "id": "mem_abc123",
  "type": "problem_solution",
  "created_at": "2024-01-15T10:30:00Z",
  "user_id": "user_123",
  "project": "project_alpha",

  "content": {
    "problem": "CORS errors when calling API from frontend",
    "context": "React app calling Supabase Edge Functions",
    "solution": "Added Access-Control-Allow-Origin header to function",
    "file_changed": "supabase/functions/api/index.ts",
    "verified": true
  },

  "embedding": [0.123, -0.456, ...],

  "metadata": {
    "importance": "high",
    "access_count": 3,
    "last_accessed": "2024-01-20T15:00:00Z"
  }
}

8.5 Retrieval Strategies

When to Retrieve

Trigger             Action
------------------------------------------------------
Session start       Load user preferences, recent context
Project mentioned   Load project-specific memories
Problem described   Search for similar problems
Question asked      Semantic search for relevant info
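These triggers can be wired up as a small router. The event names and returned action strings here are assumptions for the sketch, not a fixed API:

```python
# Hypothetical router mapping conversation events to retrieval actions.

def route_retrieval(event: str, payload: dict) -> str:
    if event == "session_start":
        return f"load_preferences:{payload['user_id']}"
    if event == "project_mentioned":
        return f"load_project:{payload['project']}"
    if event == "problem_described":
        return f"search_similar_problems:{payload['text']}"
    # Default: treat anything else as a question -> semantic search
    return f"semantic_search:{payload['text']}"

print(route_retrieval("session_start", {"user_id": "user_123"}))
# load_preferences:user_123
```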

How Much to Retrieve

Balance between context quality and token usage:

def get_optimal_context(query, user_id, max_tokens=2000):
    memories = []
    token_count = 0

    # Priority 1: User preferences (always include)
    prefs = get_user_preferences(user_id)
    memories.append(prefs)
    token_count += count_tokens(prefs)

    # Priority 2: Semantically relevant memories
    relevant = semantic_search(query, user_id, limit=10)
    for mem in relevant:
        if token_count + count_tokens(mem) < max_tokens:
            memories.append(mem)
            token_count += count_tokens(mem)
        else:
            break

    return format_as_context(memories)

Relevance Scoring

Not all memories are equally useful:

def score_memory_relevance(memory, query):
    """Weighted score in [0, 1]; the four weights sum to 1.0."""
    score = 0.0

    # Semantic similarity (0-1)
    score += cosine_similarity(memory.embedding, query.embedding) * 0.4

    # Recency boost (decays to zero over 30 days)
    days_old = (now() - memory.created_at).days
    score += max(0, 1 - days_old / 30) * 0.2

    # Usage frequency (capped at 10 accesses)
    score += min(memory.access_count / 10, 1) * 0.2

    # Explicit importance (default to medium if unset)
    importance_map = {"low": 0.1, "medium": 0.5, "high": 1.0}
    score += importance_map.get(memory.importance, 0.5) * 0.2

    return score

8.6 Memory and Reasoning

Tokens for Thinking

From Module 3: LLMs need tokens to reason. Memory systems can help:

Store Reasoning Patterns

{
  "type": "reasoning_pattern",
  "trigger": "debugging React app",
  "pattern": [
    "1. Check browser console for errors",
    "2. Verify component props",
    "3. Check state management",
    "4. Review recent git changes",
    "5. Test in isolation"
  ]
}

Inject When Relevant

System: When debugging React issues, follow this proven approach:
1. Check browser console for errors
2. Verify component props
...

Learn from Success and Failure

def log_outcome(memory_id, outcome, feedback=None):
    """Track whether retrieved memories helped"""
    memory = get_memory(memory_id)

    if outcome == "helpful":
        memory.success_count += 1
        memory.importance = recalculate_importance(memory)
    elif outcome == "not_helpful":
        memory.failure_count += 1
        if memory.failure_count > 3:
            memory.importance = "low"

    if feedback:
        memory.add_feedback(feedback)

    save_memory(memory)

8.7 Implementation Approaches

Approach 1: Simple Database

Good for: Getting started, small scale

# SQLite / PostgreSQL via SQLAlchemy, with pgvector for embeddings
from sqlalchemy import JSON, Column, DateTime, String
from sqlalchemy.orm import declarative_base
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class Memory(Base):
    __tablename__ = "memories"
    id = Column(String, primary_key=True)
    user_id = Column(String, index=True)
    content = Column(JSON)
    created_at = Column(DateTime)
    embedding = Column(Vector(1536))  # pgvector column

Approach 2: Vector Database

Good for: Semantic search, scaling

Database   Strengths
------------------------------------------
Pinecone   Managed, easy to start
Weaviate   Open source, hybrid search
Qdrant     Fast, Rust-based
Chroma     Embedded, great for prototyping
pgvector   If already using Postgres
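All of these databases expose the same core operation: nearest-neighbour search over embeddings. A stdlib-only sketch of that operation, with toy 3-dimensional vectors standing in for real embeddings (a production system would delegate this to one of the databases above):

```python
# Brute-force cosine-similarity search -- the operation every vector
# database optimizes. Toy 3-d vectors stand in for real embeddings.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(store, query_vec, top_k=2):
    # Score every stored item, then keep the top_k most similar
    scored = [(cosine_similarity(query_vec, item["embedding"]), item)
              for item in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]

store = [
    {"text": "Fixed CORS by adding headers", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Chose Postgres over MySQL", "embedding": [0.0, 0.8, 0.2]},
]
results = search(store, query_vec=[1.0, 0.0, 0.0], top_k=1)
print(results[0]["text"])  # Fixed CORS by adding headers
```

Real systems replace the linear scan with an approximate nearest-neighbour index, which is what makes these databases scale.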

Combine structured + vector:

┌─────────────────────────────────────────────────────────────┐
│                  HYBRID MEMORY SYSTEM                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  PostgreSQL (Structured)         Pinecone (Semantic)        │
│  ┌─────────────────────┐        ┌────────────────────┐     │
│  │ Users, Projects     │        │ Memory embeddings  │     │
│  │ Preferences         │        │ for similarity     │     │
│  │ Relationships       │◀──────▶│ search             │     │
│  │ Metadata            │        │                    │     │
│  └─────────────────────┘        └────────────────────┘     │
│           │                              │                   │
│           └──────────┬───────────────────┘                  │
│                      ▼                                       │
│           ┌─────────────────────┐                           │
│           │  Application Layer  │                           │
│           │  - Query routing    │                           │
│           │  - Context assembly │                           │
│           │  - Token management │                           │
│           └─────────────────────┘                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘

8.8 Privacy and Security

What to Consider

Concern           Mitigation
----------------------------------------------------
PII in memories   Automatic redaction before storage
Data retention    User-controlled deletion
Access control    Per-user isolation
Encryption        At-rest and in-transit
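Pre-storage redaction can be sketched with a couple of regexes. The patterns below are deliberately simplistic and illustrative; production systems use dedicated PII-detection tooling rather than hand-rolled regexes:

```python
# Hedged sketch of pre-storage PII redaction. These patterns only catch
# obvious emails and US-style phone numbers -- an illustration, not a
# complete solution.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Contact alice@example.com or 555-123-4567"))
# Contact [EMAIL] or [PHONE]
```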

User Control

# Allow users to manage their memories
def delete_user_memories(user_id):
    """GDPR-style complete deletion"""
    vector_store.delete(filter={"user_id": user_id})
    database.delete(Memory.user_id == user_id)

def export_user_memories(user_id):
    """Data portability"""
    memories = get_all_memories(user_id)
    return json.dumps(memories, indent=2)

8.9 Measuring Memory Effectiveness

Key Metrics

Metric                    Description                         Target
----------------------------------------------------------------------------
Retrieval relevance       % of retrieved memories used        > 70%
Grounding rate            % of responses citing context       > 80%
Hallucination reduction   Compared to no-memory baseline      > 50% reduction
User satisfaction         Repeat context requests             Decreasing
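The first two metrics are straightforward ratios over logged interaction data. An illustrative sketch (the field name `cites_context` is an assumption about the logging schema):

```python
# Illustrative helpers for computing retrieval relevance and grounding
# rate from logged data.

def retrieval_relevance(num_retrieved: int, num_used: int) -> float:
    """Fraction of retrieved memories the model actually used."""
    return num_used / num_retrieved if num_retrieved else 0.0

def grounding_rate(responses: list) -> float:
    """Fraction of responses that cited the injected context."""
    cited = sum(1 for r in responses if r["cites_context"])
    return cited / len(responses) if responses else 0.0

print(retrieval_relevance(10, 7))  # 0.7
print(grounding_rate([{"cites_context": True}, {"cites_context": False}]))  # 0.5
```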

A/B Testing

# Compare with/without memory
def run_experiment(user_id, query):
    has_memory = user_id in treatment_group
    if has_memory:
        context = retrieve_memories(user_id, query)
        response = llm.generate(query, context=context)
    else:
        response = llm.generate(query)

    log_experiment(user_id, response, has_memory=has_memory)
    return response

8.10 Key Takeaways

The Memory Hierarchy

┌─────────────────────────────────────────────────────────────┐
│                  MEMORY PRIORITY                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. User Preferences (always load)                          │
│     └─ Language, timezone, communication style              │
│                                                              │
│  2. Active Project Context (load when relevant)             │
│     └─ Tech stack, recent changes, current goals            │
│                                                              │
│  3. Problem-Solution History (semantic search)              │
│     └─ Past bugs, fixes, decisions                          │
│                                                              │
│  4. General Knowledge (lowest priority)                      │
│     └─ Documentation, tutorials, patterns                   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Design Principles

  1. Context beats parameters - Always prefer injected context
  2. Less is more - Retrieve only what's relevant
  3. Recency matters - Recent memories often most useful
  4. Learn from usage - Track what helps, demote what doesn't
  5. User control - Let users manage their memories

Remember

"External memory transforms an LLM's 'vague recollection' into 'exact working memory.'"

Well-designed memory systems:

  • Reduce hallucinations through grounding
  • Improve user experience with continuity
  • Enable learning across sessions
  • Build trust through consistency

Practice Exercises

  1. Design a memory schema for a coding assistant
  2. Implement relevance scoring for your use case
  3. Build a simple vector memory with Chroma
  4. Create a retrieval strategy that respects token limits

Synthesized from production memory system patterns