Supplementary Material Focus: Designing persistent memory to enhance LLM applications
8.1 The Memory Problem
Why LLMs Need External Memory
From the course, we learned that LLMs have two types of "knowledge":
| Type | Location | Characteristics |
| --- | --- | --- |
| Parametric | Model weights | Compressed, approximate, can't update |
| Contextual | Context window | Exact, limited size, per-request only |
The fundamental limitation: Context windows reset every conversation.
┌─────────────────────────────────────────────────────────────┐
│ THE MEMORY GAP │
├─────────────────────────────────────────────────────────────┤
│ │
│ Session 1: User discusses Project Alpha │
│ ───────────────────────────────────────── │
│ "We're building a React app with Supabase..." │
│ "The auth system uses JWT tokens..." │
│ "We fixed the CORS bug yesterday..." │
│ │
│ ⬇️ SESSION ENDS ⬇️ │
│ 🧠 CONTEXT CLEARED │
│ │
│ Session 2: Same user returns │
│ ───────────────────────────────────────── │
│ User: "What was that CORS fix we did?" │
│ LLM: "I don't have any information about previous │
│ conversations. Could you provide more context?" │
│ │
│ 😞 USER FRUSTRATED - HAS TO RE-EXPLAIN EVERYTHING │
│ │
└─────────────────────────────────────────────────────────────┘
The Solution: External Memory
Build a system that:
- Captures important context during conversations
- Stores it persistently (database, vector store)
- Retrieves relevant context for new sessions
- Injects it into the context window
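The capture → store → retrieve → inject loop above can be sketched end to end. This is a minimal illustration only: a plain list stands in for the persistent store, and keyword overlap stands in for real semantic retrieval (all names are illustrative):

```python
# Minimal end-to-end sketch of the capture -> store -> retrieve -> inject
# loop. A plain list stands in for the persistent store; a real system
# would use a database or vector store with semantic retrieval.

class ExternalMemory:
    def __init__(self):
        self._store = []

    def capture(self, user_id, text):
        """Capture important context during a conversation."""
        self._store.append({"user_id": user_id, "text": text})

    def retrieve(self, user_id, query):
        """Naive keyword overlap; stands in for semantic search."""
        words = set(query.lower().split())
        return [m["text"] for m in self._store
                if m["user_id"] == user_id
                and words & set(m["text"].lower().split())]

    def inject(self, user_id, query):
        """Build the context-window payload for a new session."""
        memories = self.retrieve(user_id, query)
        return "Relevant context:\n" + "\n".join(f"- {m}" for m in memories)

memory = ExternalMemory()
memory.capture("user_123", "we fixed the cors bug by adding a header")
prompt = memory.inject("user_123", "what was that cors fix?")
```

With this in place, Session 2 in the diagram above starts with the CORS fix already in the context window instead of a blank slate.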
8.2 Memory Architecture Patterns
Pattern 1: Simple Key-Value Memory
Best for: User preferences, settings, simple facts
```python
# Store
memory.set("user_123:preferred_language", "Python")
memory.set("user_123:timezone", "America/New_York")

# Retrieve and inject
preferences = memory.get_all("user_123:*")
system_prompt = f"""
User preferences:
- Preferred language: {preferences['preferred_language']}
- Timezone: {preferences['timezone']}
"""
```
Pattern 2: Conversation Summarization
Best for: Long-running conversations, session continuity
```python
# At end of session
summary = llm.summarize(conversation_history)
memory.store({
    "user_id": "user_123",
    "session_id": "session_456",
    "summary": summary,
    "key_topics": extract_topics(conversation_history),
    "timestamp": now(),
})

# At start of new session
recent_summaries = memory.get_recent("user_123", limit=3)
system_prompt = f"""
Previous conversation summaries:
{format_summaries(recent_summaries)}
"""
```
Pattern 3: Semantic/Vector Memory
Best for: Finding relevant past context based on meaning
```python
# Store with embeddings
embedding = embed(conversation_chunk)
vector_store.upsert({
    "id": generate_id(),
    "embedding": embedding,
    "text": conversation_chunk,
    "metadata": {"user_id": "user_123", "timestamp": now()},
})

# Retrieve semantically similar memories
query_embedding = embed(user_query)
relevant_memories = vector_store.search(
    query_embedding,
    filter={"user_id": "user_123"},
    top_k=5,
)
```
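The `vector_store` above is pseudocode; at its core, `search` is just cosine similarity over stored embeddings. A tiny in-memory sketch of the same interface (illustrative only, with 2-dimensional toy embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0 when either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """In-memory stand-in for the `vector_store` used above."""

    def __init__(self):
        self._items = []

    def upsert(self, item):
        self._items.append(item)

    def search(self, query_embedding, filter=None, top_k=5):
        # Apply the metadata filter, then rank by similarity
        candidates = [it for it in self._items
                      if not filter or all(it["metadata"].get(k) == v
                                           for k, v in filter.items())]
        candidates.sort(
            key=lambda it: cosine_similarity(query_embedding, it["embedding"]),
            reverse=True)
        return candidates[:top_k]

store = TinyVectorStore()
store.upsert({"id": "m1", "embedding": [1.0, 0.0], "text": "CORS fix",
              "metadata": {"user_id": "user_123"}})
store.upsert({"id": "m2", "embedding": [0.0, 1.0], "text": "Auth design",
              "metadata": {"user_id": "user_123"}})
results = store.search([0.9, 0.1], filter={"user_id": "user_123"}, top_k=1)
```

Real vector databases add approximate-nearest-neighbor indexes so this ranking stays fast at millions of memories, but the retrieval semantics are the same.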
Pattern 4: Structured Knowledge Graph
Best for: Complex relationships, entity tracking
┌─────────────────────────────────────────────────────────────┐
│ KNOWLEDGE GRAPH │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Project Alpha] ──uses──▶ [React] │
│ │ │ │
│ ├──uses──▶ [Supabase] ◀──is_a── [Database] │
│ │ │
│ └──has_bug──▶ [CORS Issue] ──fixed_by──▶ [JWT Fix] │
│ │ │
│ └──occurred── [2024-01-15] │
│ │
└─────────────────────────────────────────────────────────────┘
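The graph above can be represented as plain (subject, relation, object) triples and queried by pattern matching. A minimal sketch using the edges from the diagram (a dedicated graph database would add indexing and multi-hop traversal):

```python
# Triples taken from the diagram above
triples = [
    ("Project Alpha", "uses", "React"),
    ("Project Alpha", "uses", "Supabase"),
    ("Supabase", "is_a", "Database"),
    ("Project Alpha", "has_bug", "CORS Issue"),
    ("CORS Issue", "fixed_by", "JWT Fix"),
    ("JWT Fix", "occurred", "2024-01-15"),
]

def query(subject=None, relation=None, obj=None):
    """Return all triples matching the fields that are not None."""
    return [(s, r, o) for s, r, o in triples
            if (subject is None or s == subject)
            and (relation is None or r == relation)
            and (obj is None or o == obj)]

# "What does Project Alpha use?"
stack = [o for _, _, o in query("Project Alpha", "uses")]
```

Entity tracking then becomes a matter of adding triples as conversations mention new relationships, rather than re-embedding whole transcripts.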
8.3 The Grounding Principle
Context > Parameters
From Module 3, Karpathy's key insight:
"Pasting information directly into context windows produces higher-quality outputs than relying on parametric knowledge."
This means memory-injected context beats model "knowledge":
┌─────────────────────────────────────────────────────────────┐
│ GROUNDING COMPARISON │
├─────────────────────────────────────────────────────────────┤
│ │
│ WITHOUT MEMORY (Parametric Only) │
│ ──────────────────────────────── │
│ User: "What's the status of Project Alpha?" │
│ LLM: "I don't have specific information about │
│ Project Alpha. Could you tell me more?" │
│ │
│ ⚠️ No grounding = no useful answer │
│ │
│ WITH MEMORY (Context Injection) │
│ ──────────────────────────────── │
│ System: [Memory retrieved and injected] │
│ "Project Alpha context: │
│ - React + Supabase stack │
│ - Last commit: Fixed CORS bug (Jan 15) │
│ - Current sprint: User authentication │
│ - 3 open issues, 2 in progress" │
│ │
│ User: "What's the status of Project Alpha?" │
│ LLM: "Project Alpha is currently focused on user │
│ authentication. You recently fixed the CORS bug. │
│ There are 3 open issues with 2 in progress." │
│ │
│ ✅ Grounded = accurate, helpful answer │
│ │
└─────────────────────────────────────────────────────────────┘
Why Grounding Reduces Hallucinations
Without context, models have two choices:
- Admit ignorance (rare unless the model was explicitly trained to)
- Generate plausible-sounding content (hallucinate)
With context:
- Extract facts from provided information
- Cite sources
- Acknowledge gaps in provided context
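Those behaviors can be encouraged directly in the prompt that wraps the injected memories. A hypothetical sketch of assembling such a grounded prompt (the wording and function name are illustrative, not a fixed recipe):

```python
def build_grounded_prompt(memories, question):
    """Assemble a prompt that asks the model to stay within provided context."""
    context = "\n".join(f"- {m}" for m in memories)
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say so "
        "rather than guessing.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    ["Project Alpha uses React + Supabase", "CORS bug fixed Jan 15"],
    "What's the status of Project Alpha?",
)
```

The explicit "say so rather than guessing" instruction gives the model a sanctioned path to acknowledge gaps instead of hallucinating.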
8.4 What to Remember
High-Value Memory Types
| Memory Type | Example | Retrieval Trigger |
| --- | --- | --- |
| User Preferences | "Prefers TypeScript" | Every session |
| Project Context | Tech stack, file structure | Project-related queries |
| Problem-Solution Pairs | "Fixed X by doing Y" | Similar problems |
| Decisions Made | "Chose Postgres over MySQL because..." | Architecture questions |
| Terminology | "Widget = the sidebar component" | When the term appears |
Memory Schema Example
```json
{
  "id": "mem_abc123",
  "type": "problem_solution",
  "created_at": "2024-01-15T10:30:00Z",
  "user_id": "user_123",
  "project": "project_alpha",
  "content": {
    "problem": "CORS errors when calling API from frontend",
    "context": "React app calling Supabase Edge Functions",
    "solution": "Added Access-Control-Allow-Origin header to function",
    "file_changed": "supabase/functions/api/index.ts",
    "verified": true
  },
  "embedding": [0.123, -0.456, ...],
  "metadata": {
    "importance": "high",
    "access_count": 3,
    "last_accessed": "2024-01-20T15:00:00Z"
  }
}
```
8.5 Retrieval Strategies
When to Retrieve
| Trigger | Action |
| --- | --- |
| Session start | Load user preferences, recent context |
| Project mentioned | Load project-specific memories |
| Problem described | Search for similar problems |
| Question asked | Semantic search for relevant info |
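The trigger table above maps naturally onto a small dispatcher. A sketch with deliberately naive detectors (substring checks standing in for what would realistically be regexes or an LLM classifier; all names are hypothetical):

```python
def detect_triggers(event_type, text):
    """Map an incoming event to the retrieval actions it should trigger."""
    lowered = text.lower()
    triggers = []
    if event_type == "session_start":
        triggers.append("load_preferences")        # session start
    if "project" in lowered:
        triggers.append("load_project_memories")   # project mentioned
    if any(w in lowered for w in ("error", "bug", "broken", "fails")):
        triggers.append("search_similar_problems") # problem described
    if text.strip().endswith("?"):
        triggers.append("semantic_search")         # question asked
    return triggers

triggers = detect_triggers(
    "message", "Why does the project build fail with a CORS error?")
```

Each trigger name would dispatch to a different retrieval path, so one message can fire several retrievals at once.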
How Much to Retrieve
Balance between context quality and token usage:
```python
def get_optimal_context(query, user_id, max_tokens=2000):
    memories = []
    token_count = 0

    # Priority 1: User preferences (always include)
    prefs = get_user_preferences(user_id)
    memories.append(prefs)
    token_count += count_tokens(prefs)

    # Priority 2: Semantically relevant memories, until the budget runs out
    relevant = semantic_search(query, user_id, limit=10)
    for mem in relevant:
        if token_count + count_tokens(mem) < max_tokens:
            memories.append(mem)
            token_count += count_tokens(mem)
        else:
            break

    return format_as_context(memories)
```
Relevance Scoring
Not all memories are equally useful:
```python
def score_memory_relevance(memory, query, current_context):
    score = 0.0

    # Semantic similarity (0-1)
    score += cosine_similarity(memory.embedding, query.embedding) * 0.4

    # Recency boost (decays to zero over ~30 days)
    days_old = (now() - memory.created_at).days
    score += max(0, 1 - days_old / 30) * 0.2

    # Usage frequency
    score += min(memory.access_count / 10, 1) * 0.2

    # Explicit importance
    importance_map = {"low": 0.1, "medium": 0.5, "high": 1.0}
    score += importance_map[memory.importance] * 0.2

    return score
```
8.6 Memory and Reasoning
Tokens for Thinking
From Module 3: LLMs need tokens to reason. Memory systems can help:
Store Reasoning Patterns
```json
{
  "type": "reasoning_pattern",
  "trigger": "debugging React app",
  "pattern": [
    "1. Check browser console for errors",
    "2. Verify component props",
    "3. Check state management",
    "4. Review recent git changes",
    "5. Test in isolation"
  ]
}
```
Inject When Relevant
```
System: When debugging React issues, follow this proven approach:
1. Check browser console for errors
2. Verify component props
...
```
Learn from Success and Failure
```python
def log_outcome(memory_id, outcome, feedback=None):
    """Track whether retrieved memories helped."""
    memory = get_memory(memory_id)
    if outcome == "helpful":
        memory.success_count += 1
        memory.importance = recalculate_importance(memory)
    elif outcome == "not_helpful":
        memory.failure_count += 1
        if memory.failure_count > 3:
            memory.importance = "low"
    if feedback:
        memory.add_feedback(feedback)
    save_memory(memory)
```
8.7 Implementation Approaches
Approach 1: Simple Database
Good for: Getting started, small scale
```python
# SQLite / PostgreSQL (SQLAlchemy model)
class Memory(Base):
    id = Column(String, primary_key=True)
    user_id = Column(String, index=True)
    content = Column(JSON)
    created_at = Column(DateTime)
    embedding = Column(Vector(1536))  # pgvector
```
Approach 2: Vector Database
Good for: Semantic search, scaling
| Database | Strengths |
| --- | --- |
| Pinecone | Managed, easy to start |
| Weaviate | Open source, hybrid search |
| Qdrant | Fast, Rust-based |
| Chroma | Embedded, great for prototyping |
| pgvector | If already using Postgres |
Approach 3: Hybrid (Recommended)
Combine structured + vector:
┌─────────────────────────────────────────────────────────────┐
│ HYBRID MEMORY SYSTEM │
├─────────────────────────────────────────────────────────────┤
│ │
│ PostgreSQL (Structured) Pinecone (Semantic) │
│ ┌─────────────────────┐ ┌────────────────────┐ │
│ │ Users, Projects │ │ Memory embeddings │ │
│ │ Preferences │ │ for similarity │ │
│ │ Relationships │◀──────▶│ search │ │
│ │ Metadata │ │ │ │
│ └─────────────────────┘ └────────────────────┘ │
│ │ │ │
│ └──────────┬───────────────────┘ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Application Layer │ │
│ │ - Query routing │ │
│ │ - Context assembly │ │
│ │ - Token management │ │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
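The application layer in the diagram can be sketched as one function that consults both stores. Everything here is a stand-in: `sql_store` is a plain dict playing the structured side, and `FakeVectorStore` plays the semantic side (real code would call PostgreSQL and a vector database):

```python
def assemble_context(user_id, query, sql_store, vector_store, max_items=5):
    """Application layer: structured facts first, then semantic matches."""
    context = []
    # Structured side: exact lookups (preferences, project metadata)
    context.extend(sql_store.get(user_id, []))
    # Semantic side: similarity search over memory embeddings
    context.extend(vector_store.search(query, user_id)[:max_items])
    return context

# Stand-in stores for illustration only
sql_store = {"user_123": ["Prefers TypeScript", "Timezone: America/New_York"]}

class FakeVectorStore:
    def search(self, query, user_id):
        return ["Fixed CORS bug on Jan 15"]

context = assemble_context("user_123", "status of Project Alpha",
                           sql_store, FakeVectorStore())
```

Putting structured facts first reflects the memory hierarchy later in this chapter: preferences always load, semantic matches fill whatever budget remains.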
8.8 Privacy and Security
What to Consider
| Concern | Mitigation |
| --- | --- |
| PII in memories | Automatic redaction before storage |
| Data retention | User-controlled deletion |
| Access control | Per-user isolation |
| Encryption | At-rest and in-transit |
User Control
```python
# Allow users to manage their memories

def delete_user_memories(user_id):
    """GDPR-style complete deletion."""
    vector_store.delete(filter={"user_id": user_id})
    database.delete(Memory.user_id == user_id)

def export_user_memories(user_id):
    """Data portability."""
    memories = get_all_memories(user_id)
    return json.dumps(memories, indent=2)
```
8.9 Measuring Memory Effectiveness
Key Metrics
| Metric | Description | Target |
| --- | --- | --- |
| Retrieval relevance | % of retrieved memories actually used in responses | > 70% |
| Grounding rate | % of responses citing injected context | > 80% |
| Hallucination reduction | Drop versus a no-memory baseline | > 50% reduction |
| User satisfaction | How often users must re-explain context (proxy) | Decreasing |
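Retrieval relevance, the first metric above, is straightforward to compute from interaction logs. A sketch assuming each log entry records how many memories were retrieved and how many were actually used (a hypothetical log shape):

```python
def retrieval_relevance(logs):
    """Fraction of retrieved memories that were actually used."""
    retrieved = sum(entry["retrieved"] for entry in logs)
    used = sum(entry["used"] for entry in logs)
    return used / retrieved if retrieved else 0.0

logs = [
    {"retrieved": 5, "used": 4},
    {"retrieved": 5, "used": 3},
]
rate = retrieval_relevance(logs)
```

A rate persistently below the ~70% target suggests the retriever is pulling in noise, wasting context-window tokens on memories the model never grounds on.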
A/B Testing
```python
# Compare with/without memory
def run_experiment(user_id, query):
    has_memory = user_id in treatment_group
    if has_memory:
        context = retrieve_memories(user_id, query)
        response = llm.generate(query, context=context)
    else:
        response = llm.generate(query)
    log_experiment(user_id, response, has_memory=has_memory)
    return response
```
8.10 Key Takeaways
The Memory Hierarchy
┌─────────────────────────────────────────────────────────────┐
│ MEMORY PRIORITY │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. User Preferences (always load) │
│ └─ Language, timezone, communication style │
│ │
│ 2. Active Project Context (load when relevant) │
│ └─ Tech stack, recent changes, current goals │
│ │
│ 3. Problem-Solution History (semantic search) │
│ └─ Past bugs, fixes, decisions │
│ │
│ 4. General Knowledge (lowest priority) │
│ └─ Documentation, tutorials, patterns │
│ │
└─────────────────────────────────────────────────────────────┘
Design Principles
- Context beats parameters - Always prefer injected context
- Less is more - Retrieve only what's relevant
- Recency matters - Recent memories often most useful
- Learn from usage - Track what helps, demote what doesn't
- User control - Let users manage their memories
Remember
"External memory transforms an LLM's 'vague recollection' into 'exact working memory.'"
Well-designed memory systems:
- Reduce hallucinations through grounding
- Improve user experience with continuity
- Enable learning across sessions
- Build trust through consistency
Practice Exercises
- Design a memory schema for a coding assistant
- Implement relevance scoring for your use case
- Build a simple vector memory with Chroma
- Create a retrieval strategy that respects token limits
Further Reading
- Building LLM Applications with Long-Term Memory
- RAG vs. Fine-tuning: A Practical Guide
- Vector Databases Compared
Synthesized from production memory system patterns