Supplementary Material Focus: Building production-ready AI systems with comprehensive monitoring
7.1 Why LLM Observability Matters
The Unique Challenge
Unlike traditional software, LLMs present unique monitoring challenges:
| Traditional Software | LLM Applications |
|---|---|
| Deterministic outputs | Stochastic outputs |
| Clear success/failure | Subjective quality |
| Predictable latency | Variable generation time |
| Fixed resource usage | Dynamic token consumption |
| Easily testable | Probabilistic behavior |
The Black Box Problem
Running LLM applications in production is harder than operating traditional ML systems. The difficulty comes from:
- Massive model sizes
- Intricate architecture
- Non-deterministic outputs
- Subjective quality assessment
Without proper observability, you're flying blind.
7.2 The Three Pillars of LLM Observability
Metrics, Logs, and Traces
┌─────────────────────────────────────────────────────────────┐
│ LLM OBSERVABILITY PILLARS │
├─────────────────────────────────────────────────────────────┤
│ │
│ METRICS │
│ ├─ Token usage (input/output) │
│ ├─ Latency (TTFT, generation, total) │
│ ├─ Cost per request │
│ ├─ Error rates │
│ └─ Quality scores │
│ │
│ LOGS │
│ ├─ Full request/response pairs │
│ ├─ Model parameters used │
│ ├─ User context and session info │
│ └─ Error details and stack traces │
│ │
│ TRACES │
│ ├─ End-to-end request flow │
│ ├─ RAG retrieval steps │
│ ├─ Tool/function calls │
│ └─ Multi-model orchestration │
│ │
└─────────────────────────────────────────────────────────────┘
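The traces pillar maps naturally onto standard distributed-tracing tooling such as OpenTelemetry (covered again in section 7.9). Below is a minimal sketch using the OpenTelemetry Python API to wrap one LLM request in a span carrying token attributes; the attribute names and the `call_llm` helper are illustrative assumptions, not a standard schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm.observability")  # any tracer name works

def traced_llm_call(prompt: str) -> str:
    # One span per LLM request; nested spans can cover RAG retrieval and tool calls
    with tracer.start_as_current_span("llm.request") as span:
        result = call_llm(prompt)  # hypothetical client call, stands in for your SDK
        span.set_attribute("llm.prompt_tokens", result.prompt_tokens)
        span.set_attribute("llm.completion_tokens", result.completion_tokens)
        span.set_attribute("llm.model_id", result.model_id)
        return result.text
```

With an exporter configured, each request then shows up as an end-to-end trace that can be broken into retrieval, generation, and tool-call child spans.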
7.3 Essential Metrics to Track
Token-Level Metrics
Recall from Module 1 (Tokenization): tokens are the currency of LLMs.
| Metric | Description | Why It Matters |
|---|---|---|
| prompt_tokens | Input token count | Cost, context usage |
| completion_tokens | Output token count | Cost, response length |
| context_utilization | % of context window used | Efficiency |
| tokens_per_second | Generation speed | User experience |
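The last two rows are derived values. A minimal sketch of how they might be computed for a single response, assuming the token counts and timings are already available from your provider's usage fields:

```python
def context_utilization(prompt_tokens: int, context_window: int) -> float:
    # Fraction of the model's context window consumed by the input
    return prompt_tokens / context_window

def tokens_per_second(completion_tokens: int, generation_ms: float) -> float:
    # Generation throughput as experienced by the user
    return completion_tokens / (generation_ms / 1000.0)

# Example: 1,500-token prompt in a 128k window, 350 tokens generated in 2.1s
print(f"{context_utilization(1500, 128_000):.1%}")   # ~1.2%
print(f"{tokens_per_second(350, 2100):.0f} tok/s")   # ~167 tok/s
```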
Latency Metrics
Critical for user experience:
| Metric | Description | Target |
|---|---|---|
| TTFT | Time to First Token | < 500ms |
| Generation | Time for full response | Varies by length |
| Total | End-to-end latency | < 3s for most apps |
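TTFT can only be measured against a streaming response. A sketch of the timing logic, assuming a hypothetical `stream_completion` generator that yields text chunks:

```python
import time

def timed_stream(prompt: str) -> dict:
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream_completion(prompt):  # hypothetical streaming generator
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        chunks.append(chunk)
    total = time.perf_counter() - start
    return {
        "ttft_ms": ttft * 1000 if ttft is not None else None,
        "generation_ms": (total - (ttft or 0)) * 1000,
        "total_ms": total * 1000,
        "text": "".join(chunks),
    }
```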
Quality Metrics
From Module 4 (Hallucinations) - quality is subjective but measurable:
| Metric | Detection Method |
|---|---|
| Hallucination rate | Fact-checking against sources |
| Source attribution | Does response cite context? |
| Coherence score | Automated evaluation |
| User feedback | Thumbs up/down, ratings |
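User feedback is the cheapest of these signals to aggregate. A small sketch that rolls thumbs up/down events into the 0-100 quality score used on the dashboard later in this module; the event format is an assumption, not a prescribed schema:

```python
def quality_score(feedback_events: list[dict]) -> float | None:
    """Map thumbs up/down events to a 0-100 score; None if no feedback yet."""
    rated = [e for e in feedback_events if e.get("rating") in ("up", "down")]
    if not rated:
        return None
    positive = sum(1 for e in rated if e["rating"] == "up")
    return 100.0 * positive / len(rated)

print(round(quality_score([{"rating": "up"}, {"rating": "up"}, {"rating": "down"}]), 1))  # 66.7
```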
Cost Metrics
```python
# Cost calculation (example prices are illustrative, not any provider's actual rates)
price_per_input_token = 0.00001    # $ per input token
price_per_output_token = 0.00003   # $ per output token

input_cost = prompt_tokens * price_per_input_token
output_cost = completion_tokens * price_per_output_token
total_cost = input_cost + output_cost

# Track aggregate cost per user, session, and feature
cost_per_user = sum(request_costs) / active_users
```
7.4 Logging Best Practices
What to Log
```json
{
  "request_id": "uuid-here",
  "timestamp": "2024-01-15T10:30:00Z",
  "user_id": "user_123",
  "session_id": "session_456",
  "model": {
    "provider": "openai",
    "model_id": "gpt-4-turbo",
    "version": "2024-01-15"
  },
  "request": {
    "prompt_tokens": 1500,
    "system_prompt_hash": "abc123",
    "has_context": true,
    "context_source": "rag"
  },
  "response": {
    "completion_tokens": 350,
    "stop_reason": "end_turn",
    "has_tool_calls": false
  },
  "timing": {
    "ttft_ms": 450,
    "generation_ms": 2100,
    "total_ms": 2550
  },
  "parameters": {
    "temperature": 0.7,
    "max_tokens": 1000,
    "top_p": 0.9
  },
  "cost_usd": 0.05
}
```
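One lightweight way to produce records in this shape is a helper that writes one JSON line per request, keyed by a correlation ID so that logs, metrics, and traces for the same request can be joined later. A sketch only; the logger name and field handling are assumptions:

```python
import json
import logging
import uuid

logger = logging.getLogger("llm.requests")

def log_llm_request(record: dict) -> str:
    # Emit one JSON line per request, keyed by a correlation ID
    record.setdefault("request_id", str(uuid.uuid4()))
    logger.info(json.dumps(record))
    return record["request_id"]
```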
What NOT to Log (Privacy)
- Full user prompts (unless consented)
- PII in responses
- API keys or secrets
- Internal system prompts (competitive advantage)
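If prompts must be stored for debugging, redact obvious PII first. A minimal regex-based sketch; real deployments typically rely on a dedicated PII-detection service, and these two patterns are illustrative only:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    # Replace emails and phone-like numbers before the text reaches logs
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Contact me at jane@example.com or +1 (555) 123-4567"))
# Contact me at [EMAIL] or [PHONE]
```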
7.5 Detecting Hallucinations in Production
Automated Detection Pipeline
┌─────────────────────────────────────────────────────────────┐
│ HALLUCINATION DETECTION │
├─────────────────────────────────────────────────────────────┤
│ │
│ INPUT │
│ ├─ User query │
│ ├─ Retrieved context (RAG) │
│ └─ Model response │
│ │ │
│ ▼ │
│ DETECTION METHODS │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ 1. SOURCE GROUNDING │ │
│ │ Extract claims → Match to context → Score │ │
│ │ │ │
│ │ 2. SELF-CONSISTENCY │ │
│ │ Generate 3x → Compare facts → Flag conflicts │ │
│ │ │ │
│ │ 3. PATTERN DETECTION │ │
│ │ Specific numbers? Named citations? URLs? │ │
│ │ → Higher hallucination risk │ │
│ │ │ │
│ │ 4. UNCERTAINTY SIGNALS │ │
│ │ "I believe..." "probably..." → Lower confidence │ │
│ │ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ OUTPUT: Risk Score (0.0 - 1.0) │
│ │
└─────────────────────────────────────────────────────────────┘
Implementation Example
```python
import re

def detect_hallucination_risk(response: str, context: str) -> float:
    """Heuristic risk score: flags specifics in the response that are absent from the retrieved context."""
    risk_score = 0.0

    # Unsourced specific numbers (e.g. "42%", "3.14") raise risk slightly
    numbers = re.findall(r'\d+\.?\d*%?', response)
    for num in numbers:
        if num not in context:
            risk_score += 0.1

    # Citation-style phrases with no support in the context raise risk
    citations = re.findall(r'according to|study shows|research indicates', response.lower())
    if citations and not any(c in context.lower() for c in citations):
        risk_score += 0.2

    # URLs are frequently fabricated, so each one adds risk
    urls = re.findall(r'https?://\S+', response)
    risk_score += len(urls) * 0.15

    return min(risk_score, 1.0)
```
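The function above implements the pattern-detection heuristics (method 3) plus a touch of source grounding. Self-consistency (method 2) can be sketched separately; `generate` is a hypothetical wrapper around your model call, and the overlap measure is deliberately crude:

```python
import re

def self_consistency_risk(prompt: str, n: int = 3) -> float:
    """Generate n answers and treat low factual overlap between them as higher risk."""
    answers = [generate(prompt, temperature=0.7) for _ in range(n)]  # hypothetical model call
    # Crude proxy for "facts": numbers and capitalized names mentioned in each answer
    fact_sets = [set(re.findall(r"\d+\.?\d*%?|[A-Z][a-z]+", a)) for a in answers]
    union = set.union(*fact_sets)
    if not union:
        return 0.0
    agreement = len(set.intersection(*fact_sets)) / len(union)
    return 1.0 - agreement  # low agreement across samples => high risk
```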
7.6 Building a Monitoring Dashboard
Key Visualizations
| Panel | Metric | Purpose |
|---|---|---|
| Request Volume | Requests/min | Load monitoring |
| Latency P50/P95/P99 | Milliseconds | Performance |
| Token Usage | Tokens/hour | Cost projection |
| Error Rate | % failed | Reliability |
| Quality Score | 0-100 | User satisfaction |
| Cost Tracker | $/day | Budget management |
Alert Thresholds
| Alert Condition | Evaluation Window | Severity |
|---|---|---|
| Latency P95 > 5s | 5 consecutive checks | Warning |
| Error rate > 1% | 5-minute window | Critical |
| Hallucination rate > 15% | Hourly average | Warning |
| Cost > 120% of daily budget | Real-time | Critical |
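In practice these rules live in whatever alerting system you already run (Prometheus, Datadog, and similar tools appear in section 7.9). A toy evaluation loop in Python, only to make the threshold semantics concrete; window aggregation and notification are assumed to exist elsewhere:

```python
def check_alerts(window: dict) -> list[str]:
    """window holds pre-aggregated values for the current evaluation period."""
    alerts = []
    if window["p95_latency_s"] > 5:
        alerts.append("WARNING: P95 latency above 5s")
    if window["error_rate"] > 0.01:
        alerts.append("CRITICAL: error rate above 1%")
    if window["hallucination_rate"] > 0.15:
        alerts.append("WARNING: hallucination rate above 15%")
    if window["spend_today"] > 1.2 * window["daily_budget"]:
        alerts.append("CRITICAL: spend above 120% of daily budget")
    return alerts
```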
7.7 Debugging LLM Issues
Common Problems and Solutions
| Problem | Symptoms | Debug Approach |
|---|---|---|
| High latency | Slow responses | Check token counts, model load |
| Inconsistent quality | Variable outputs | Review temperature, check prompts |
| High costs | Budget overruns | Analyze token usage patterns |
| Hallucinations | Incorrect facts | Check context injection, prompts |
| Rate limiting | 429 errors | Implement backoff, caching |
The Debug Checklist
- Check the prompt - Is context being injected correctly?
- Check the parameters - Temperature, max_tokens appropriate?
- Check the model - Right model for the task?
- Check the context - RAG returning relevant results?
- Check for patterns - Does error correlate with input type?
7.8 Multi-Model Observability
When Using Multiple Models
Track per-model performance to optimize routing:
┌─────────────────────────────────────────────────────────────┐
│ MODEL COMPARISON │
├─────────────────────────────────────────────────────────────┤
│ │
│ Task Type │ GPT-4 │ Claude │ Gemini │ Llama-3 │
│ ─────────────┼────────┼────────┼────────┼───────── │
│ Reasoning │ 0.92 │ 0.90 │ 0.88 │ 0.82 │
│ Coding │ 0.89 │ 0.91 │ 0.85 │ 0.80 │
│ Creative │ 0.85 │ 0.88 │ 0.82 │ 0.75 │
│ Cost/1M tok │ $30 │ $15 │ $7 │ ~$0 │
│ │
│ INSIGHT: Route simple queries to cheaper models │
│ │
└─────────────────────────────────────────────────────────────┘
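Once per-model quality and cost are tracked, routing becomes a lookup. A sketch of cost-aware routing over the numbers in the table above; the score table is hard-coded here for illustration, whereas in production it would be refreshed continuously from your observability data:

```python
# Quality scores per task type and cost per 1M tokens, mirroring the comparison above
MODEL_STATS = {
    "gpt-4":   {"reasoning": 0.92, "coding": 0.89, "creative": 0.85, "cost": 30.0},
    "claude":  {"reasoning": 0.90, "coding": 0.91, "creative": 0.88, "cost": 15.0},
    "gemini":  {"reasoning": 0.88, "coding": 0.85, "creative": 0.82, "cost": 7.0},
    "llama-3": {"reasoning": 0.82, "coding": 0.80, "creative": 0.75, "cost": 0.1},  # "~$0" above; self-hosting still costs something
}

def route(task_type: str, min_quality: float = 0.85) -> str:
    # Cheapest model that clears the quality bar; fall back to the best overall
    eligible = [(s["cost"], name) for name, s in MODEL_STATS.items() if s[task_type] >= min_quality]
    if eligible:
        return min(eligible)[1]
    return max(MODEL_STATS, key=lambda name: MODEL_STATS[name][task_type])

print(route("coding"))     # gemini: 0.85 meets the bar at $7/1M
print(route("reasoning"))  # gemini: 0.88 >= 0.85 and cheapest
```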
7.9 Tools and Platforms
Open Source
| Tool | Purpose |
|---|---|
| Langfuse | LLM tracing and analytics |
| Phoenix | ML observability |
| OpenTelemetry | Distributed tracing |
| Prometheus | Metrics collection |
Commercial
| Tool | Strengths |
|---|---|
| Datadog LLM | Full-stack integration |
| LangSmith | LangChain ecosystem |
| Weights & Biases | Experiment tracking |
| Helicone | Cost optimization |
Build vs. Buy
| Build When | Buy When |
|---|---|
| Custom requirements | Standard needs |
| Data sensitivity | Speed to market |
| Cost optimization | Limited team capacity |
| Competitive advantage | Focus on core product |
7.10 Key Takeaways
The Observability Checklist
- [ ] Track token usage (input/output separately)
- [ ] Measure latency (TTFT, generation, total)
- [ ] Log requests with correlation IDs
- [ ] Monitor costs in real-time
- [ ] Detect hallucinations automatically
- [ ] Set up alerts for anomalies
- [ ] Build dashboards for visibility
- [ ] Implement multi-model tracking if applicable
Remember
"You can't improve what you can't measure."
LLM observability isn't optional in production - it's essential for:
- Cost control - LLMs are expensive
- Quality assurance - Hallucinations damage trust
- User experience - Latency matters
- Debugging - Black boxes need visibility
Practice Questions
- What metrics would you prioritize for a customer-facing chatbot?
- How would you detect if your RAG system is returning irrelevant context?
- What's the difference between TTFT and total latency, and why track both?
- How would you set up alerts for detecting prompt injection attacks?
Next Module
→ Module 8: Building Memory Systems for LLMs
Based on production patterns from leading AI teams