Supplementary Material Focus: Building production-ready AI systems with comprehensive monitoring
7.1 Why LLM Observability Matters
The Unique Challenge
Unlike traditional software, LLMs present unique monitoring challenges:
| Traditional Software | LLM Applications |
|---|---|
| Deterministic outputs | Stochastic outputs |
| Clear success/failure | Subjective quality |
| Predictable latency | Variable generation time |
| Fixed resource usage | Dynamic token consumption |
| Easily testable | Probabilistic behavior |
The Black Box Problem
Running LLM applications in production is harder than operating traditional ML systems. The difficulty comes from:
- Massive model sizes
- Intricate architecture
- Non-deterministic outputs
- Subjective quality assessment
Without proper observability, you're flying blind.
7.2 The Three Pillars of LLM Observability
Metrics, Logs, and Traces
┌─────────────────────────────────────────────────────────────┐
│ LLM OBSERVABILITY PILLARS │
├─────────────────────────────────────────────────────────────┤
│ │
│ METRICS │
│ ├─ Token usage (input/output) │
│ ├─ Latency (TTFT, generation, total) │
│ ├─ Cost per request │
│ ├─ Error rates │
│ └─ Quality scores │
│ │
│ LOGS │
│ ├─ Full request/response pairs │
│ ├─ Model parameters used │
│ ├─ User context and session info │
│ └─ Error details and stack traces │
│ │
│ TRACES │
│ ├─ End-to-end request flow │
│ ├─ RAG retrieval steps │
│ ├─ Tool/function calls │
│ └─ Multi-model orchestration │
│ │
└─────────────────────────────────────────────────────────────┘
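The traces pillar maps naturally onto standard distributed-tracing tooling such as OpenTelemetry (covered again in section 7.9). Below is a minimal sketch using the OpenTelemetry Python API to wrap one LLM request in a span carrying token attributes; the attribute names and the `call_llm` helper are illustrative assumptions, not a standard schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm.observability")  # any tracer name works

def traced_llm_call(prompt: str) -> str:
    # One span per LLM request; nested spans can cover RAG retrieval and tool calls
    with tracer.start_as_current_span("llm.request") as span:
        result = call_llm(prompt)  # hypothetical client call, stands in for your SDK
        span.set_attribute("llm.prompt_tokens", result.prompt_tokens)
        span.set_attribute("llm.completion_tokens", result.completion_tokens)
        span.set_attribute("llm.model_id", result.model_id)
        return result.text
```

With an exporter configured, each request then shows up as an end-to-end trace that can be broken into retrieval, generation, and tool-call child spans.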
7.3 Essential Metrics to Track
Token-Level Metrics
Recall from Module 1 (Tokenization): tokens are the currency of LLMs.
| Metric | Description | Why It Matters |
|---|---|---|
| prompt_tokens | Input token count | Cost, context usage |
| completion_tokens | Output token count | Cost, response length |
| context_utilization | % of context window used | Efficiency |
| tokens_per_second | Generation speed | User experience |
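The last two rows are derived values. A minimal sketch of how they might be computed for a single response, assuming the token counts and timings are already available from your provider's usage fields:

```python
def context_utilization(prompt_tokens: int, context_window: int) -> float:
    # Fraction of the model's context window consumed by the input
    return prompt_tokens / context_window

def tokens_per_second(completion_tokens: int, generation_ms: float) -> float:
    # Generation throughput as experienced by the user
    return completion_tokens / (generation_ms / 1000.0)

# Example: 1,500-token prompt in a 128k window, 350 tokens generated in 2.1s
print(f"{context_utilization(1500, 128_000):.1%}")   # ~1.2%
print(f"{tokens_per_second(350, 2100):.0f} tok/s")   # ~167 tok/s
```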
Latency Metrics
Critical for user experience:
| Metric | Description | Target |
|---|---|---|
| TTFT | Time to First Token | < 500ms |
| Generation | Time for full response | Varies by length |
| Total | End-to-end latency | < 3s for most apps |
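TTFT can only be measured against a streaming response. A sketch of the timing logic, assuming a hypothetical `stream_completion` generator that yields text chunks:

```python
import time

def timed_stream(prompt: str) -> dict:
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream_completion(prompt):  # hypothetical streaming generator
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        chunks.append(chunk)
    total = time.perf_counter() - start
    return {
        "ttft_ms": ttft * 1000 if ttft is not None else None,
        "generation_ms": (total - (ttft or 0)) * 1000,
        "total_ms": total * 1000,
        "text": "".join(chunks),
    }
```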
Quality Metrics
From Module 4 (Hallucinations) - quality is subjective but measurable:
| Metric | Detection Method |
|---|---|
| Hallucination rate | Fact-checking against sources |
| Source attribution | Does response cite context? |
| Coherence score | Automated evaluation |
| User feedback | Thumbs up/down, ratings |
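User feedback is the cheapest of these signals to aggregate. A small sketch that rolls thumbs up/down events into the 0-100 quality score used on the dashboard later in this module; the event format is an assumption, not a prescribed schema:

```python
def quality_score(feedback_events: list[dict]) -> float | None:
    """Map thumbs up/down events to a 0-100 score; None if no feedback yet."""
    rated = [e for e in feedback_events if e.get("rating") in ("up", "down")]
    if not rated:
        return None
    positive = sum(1 for e in rated if e["rating"] == "up")
    return 100.0 * positive / len(rated)

print(round(quality_score([{"rating": "up"}, {"rating": "up"}, {"rating": "down"}]), 1))  # 66.7
```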
Cost Metrics
```python
# Cost calculation (example prices are illustrative, not any provider's actual rates)
price_per_input_token = 0.00001    # $ per input token
price_per_output_token = 0.00003   # $ per output token

input_cost = prompt_tokens * price_per_input_token
output_cost = completion_tokens * price_per_output_token
total_cost = input_cost + output_cost

# Track aggregate cost per user, session, and feature
cost_per_user = sum(request_costs) / active_users
```
7.4 Logging Best Practices
What to Log
```json
{
  "request_id": "uuid-here",
  "timestamp": "2024-01-15T10:30:00Z",
  "user_id": "user_123",
  "session_id": "session_456",
  "model": {
    "provider": "openai",
    "model_id": "gpt-4-turbo",
    "version": "2024-01-15"
  },
  "request": {
    "prompt_tokens": 1500,
    "system_prompt_hash": "abc123",
    "has_context": true,
    "context_source": "rag"
  },
  "response": {
    "completion_tokens": 350,
    "stop_reason": "end_turn",
    "has_tool_calls": false
  },
  "timing": {
    "ttft_ms": 450,
    "generation_ms": 2100,
    "total_ms": 2550
  },
  "parameters": {
    "temperature": 0.7,
    "max_tokens": 1000,
    "top_p": 0.9
  },
  "cost_usd": 0.05
}
```
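One lightweight way to produce records in this shape is a helper that writes one JSON line per request, keyed by a correlation ID so that logs, metrics, and traces for the same request can be joined later. A sketch only; the logger name and field handling are assumptions:

```python
import json
import logging
import uuid

logger = logging.getLogger("llm.requests")

def log_llm_request(record: dict) -> str:
    # Emit one JSON line per request, keyed by a correlation ID
    record.setdefault("request_id", str(uuid.uuid4()))
    logger.info(json.dumps(record))
    return record["request_id"]
```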
What NOT to Log (Privacy)
- Full user prompts (unless consented)
- PII in responses
- API keys or secrets
- Internal system prompts (competitive advantage)
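If prompts must be stored for debugging, redact obvious PII first. A minimal regex-based sketch; real deployments typically rely on a dedicated PII-detection service, and these two patterns are illustrative only:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    # Replace emails and phone-like numbers before the text reaches logs
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Contact me at jane@example.com or +1 (555) 123-4567"))
# Contact me at [EMAIL] or [PHONE]
```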
7.5 Detecting Hallucinations in Production
Automated Detection Pipeline
┌─────────────────────────────────────────────────────────────┐
│ HALLUCINATION DETECTION │
├─────────────────────────────────────────────────────────────┤
│ │
│ INPUT │
│ ├─ User query │
│ ├─ Retrieved context (RAG) │
│ └─ Model response │
│ │ │
│ ▼ │
│ DETECTION METHODS │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ 1. SOURCE GROUNDING │ │
│ │ Extract claims → Match to context → Score │ │
│ │ │ │
│ │ 2. SELF-CONSISTENCY │ │
│ │ Generate 3x → Compare facts → Flag conflicts │ │
│ │ │ │
│ │ 3. PATTERN DETECTION │ │
│ │ Specific numbers? Named citations? URLs? │ │
│ │ → Higher hallucination risk │ │
│ │ │ │
│ │ 4. UNCERTAINTY SIGNALS │ │
│ │ "I believe..." "probably..." → Lower confidence │ │
│ │ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ OUTPUT: Risk Score (0.0 - 1.0) │
│ │
└─────────────────────────────────────────────────────────────┘
Implementation Example
```python
import re

def detect_hallucination_risk(response: str, context: str) -> float:
    """Heuristic risk score: flags specifics in the response that are absent from the retrieved context."""
    risk_score = 0.0

    # Unsourced specific numbers (e.g. "42%", "3.14") raise risk slightly
    numbers = re.findall(r'\d+\.?\d*%?', response)
    for num in numbers:
        if num not in context:
            risk_score += 0.1

    # Citation-style phrases with no support in the context raise risk
    citations = re.findall(r'according to|study shows|research indicates', response.lower())
    if citations and not any(c in context.lower() for c in citations):
        risk_score += 0.2

    # URLs are frequently fabricated, so each one adds risk
    urls = re.findall(r'https?://\S+', response)
    risk_score += len(urls) * 0.15

    return min(risk_score, 1.0)
```
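The function above implements the pattern-detection heuristics (method 3) plus a touch of source grounding. Self-consistency (method 2) can be sketched separately; `generate` is a hypothetical wrapper around your model call, and the overlap measure is deliberately crude:

```python
import re

def self_consistency_risk(prompt: str, n: int = 3) -> float:
    """Generate n answers and treat low factual overlap between them as higher risk."""
    answers = [generate(prompt, temperature=0.7) for _ in range(n)]  # hypothetical model call
    # Crude proxy for "facts": numbers and capitalized names mentioned in each answer
    fact_sets = [set(re.findall(r"\d+\.?\d*%?|[A-Z][a-z]+", a)) for a in answers]
    union = set.union(*fact_sets)
    if not union:
        return 0.0
    agreement = len(set.intersection(*fact_sets)) / len(union)
    return 1.0 - agreement  # low agreement across samples => high risk
```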
7.6 Building a Monitoring Dashboard
Key Visualizations
| Panel | Metric | Purpose |
|---|---|---|
| Request Volume | Requests/min | Load monitoring |
| Latency P50/P95/P99 | Milliseconds | Performance |
| Token Usage | Tokens/hour | Cost projection |
| Error Rate | % failed | Reliability |
| Quality Score | 0-100 | User satisfaction |
| Cost Tracker | $/day | Budget management |
Alert Thresholds
| Alert Condition | Evaluation Window | Severity |
|---|---|---|
| Latency P95 > 5s | 5 consecutive checks | Warning |
| Error rate > 1% | 5-minute window | Critical |
| Hallucination rate > 15% | Hourly average | Warning |
| Cost > 120% of daily budget | Real-time | Critical |
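In practice these rules live in whatever alerting system you already run (Prometheus, Datadog, and similar tools appear in section 7.9). A toy evaluation loop in Python, only to make the threshold semantics concrete; window aggregation and notification are assumed to exist elsewhere:

```python
def check_alerts(window: dict) -> list[str]:
    """window holds pre-aggregated values for the current evaluation period."""
    alerts = []
    if window["p95_latency_s"] > 5:
        alerts.append("WARNING: P95 latency above 5s")
    if window["error_rate"] > 0.01:
        alerts.append("CRITICAL: error rate above 1%")
    if window["hallucination_rate"] > 0.15:
        alerts.append("WARNING: hallucination rate above 15%")
    if window["spend_today"] > 1.2 * window["daily_budget"]:
        alerts.append("CRITICAL: spend above 120% of daily budget")
    return alerts
```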
7.7 Debugging LLM Issues
Common Problems and Solutions
| Problem | Symptoms | Debug Approach |
|---|---|---|
| High latency | Slow responses | Check token counts, model load |
| Inconsistent quality | Variable outputs | Review temperature, check prompts |
| High costs | Budget overruns | Analyze token usage patterns |
| Hallucinations | Incorrect facts | Check context injection, prompts |
| Rate limiting | 429 errors | Implement backoff, caching |
The Debug Checklist
- Check the prompt - Is context being injected correctly?
- Check the parameters - Temperature, max_tokens appropriate?
- Check the model - Right model for the task?
- Check the context - RAG returning relevant results?
- Check for patterns - Does error correlate with input type?
7.8 Multi-Model Observability
When Using Multiple Models
Track per-model performance to optimize routing:
┌─────────────────────────────────────────────────────────────┐
│ MODEL COMPARISON │
├─────────────────────────────────────────────────────────────┤
│ │
│ Task Type │ GPT-4 │ Claude │ Gemini │ Llama-3 │
│ ─────────────┼────────┼────────┼────────┼───────── │
│ Reasoning │ 0.92 │ 0.90 │ 0.88 │ 0.82 │
│ Coding │ 0.89 │ 0.91 │ 0.85 │ 0.80 │
│ Creative │ 0.85 │ 0.88 │ 0.82 │ 0.75 │
│ Cost/1M tok │ $30 │ $15 │ $7 │ ~$0 │
│ │
│ INSIGHT: Route simple queries to cheaper models │
│ │
└─────────────────────────────────────────────────────────────┘
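Once per-model quality and cost are tracked, routing becomes a lookup. A sketch of cost-aware routing over the numbers in the table above; the score table is hard-coded here for illustration, whereas in production it would be refreshed continuously from your observability data:

```python
# Quality scores per task type and cost per 1M tokens, mirroring the comparison above
MODEL_STATS = {
    "gpt-4":   {"reasoning": 0.92, "coding": 0.89, "creative": 0.85, "cost": 30.0},
    "claude":  {"reasoning": 0.90, "coding": 0.91, "creative": 0.88, "cost": 15.0},
    "gemini":  {"reasoning": 0.88, "coding": 0.85, "creative": 0.82, "cost": 7.0},
    "llama-3": {"reasoning": 0.82, "coding": 0.80, "creative": 0.75, "cost": 0.1},  # "~$0" above; self-hosting still costs something
}

def route(task_type: str, min_quality: float = 0.85) -> str:
    # Cheapest model that clears the quality bar; fall back to the best overall
    eligible = [(s["cost"], name) for name, s in MODEL_STATS.items() if s[task_type] >= min_quality]
    if eligible:
        return min(eligible)[1]
    return max(MODEL_STATS, key=lambda name: MODEL_STATS[name][task_type])

print(route("coding"))     # gemini: 0.85 meets the bar at $7/1M
print(route("reasoning"))  # gemini: 0.88 >= 0.85 and cheapest
```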
7.9 Tools and Platforms
Open Source
| Tool | Purpose |
|---|---|
| Langfuse | LLM tracing and analytics |
| Phoenix | ML observability |
| OpenTelemetry | Distributed tracing |
| Prometheus | Metrics collection |
Commercial
| Tool | Strengths |
|---|---|
| Datadog LLM | Full-stack integration |
| LangSmith | LangChain ecosystem |
| Weights & Biases | Experiment tracking |
| Helicone | Cost optimization |
Build vs. Buy
| Build When | Buy When |
|---|---|
| Custom requirements | Standard needs |
| Data sensitivity | Speed to market |
| Cost optimization | Limited team capacity |
| Competitive advantage | Focus on core product |
7.10 Key Takeaways
The Observability Checklist
- [ ] Track token usage (input/output separately)
- [ ] Measure latency (TTFT, generation, total)
- [ ] Log requests with correlation IDs
- [ ] Monitor costs in real-time
- [ ] Detect hallucinations automatically
- [ ] Set up alerts for anomalies
- [ ] Build dashboards for visibility
- [ ] Implement multi-model tracking if applicable
Remember
"You can't improve what you can't measure."
LLM observability isn't optional in production - it's essential for:
- Cost control - LLMs are expensive
- Quality assurance - Hallucinations damage trust
- User experience - Latency matters
- Debugging - Black boxes need visibility
Practice Questions
- What metrics would you prioritize for a customer-facing chatbot?
- How would you detect if your RAG system is returning irrelevant context?
- What's the difference between TTFT and total latency, and why track both?
- How would you set up alerts for detecting prompt injection attacks?
Next Module
→ Module 8: Building Memory Systems for LLMs
Based on production patterns from leading AI teams