Module 07

LLM Observability & Monitoring

Supplementary Material Focus: Building production-ready AI systems with comprehensive monitoring

7.1 Why LLM Observability Matters

The Unique Challenge

Unlike traditional software, LLMs present unique monitoring challenges:

Traditional Software    | LLM Applications
------------------------|----------------------------
Deterministic outputs   | Stochastic outputs
Clear success/failure   | Subjective quality
Predictable latency     | Variable generation time
Fixed resource usage    | Dynamic token consumption
Easily testable         | Probabilistic behavior

The Black Box Problem

Running LLM applications in production is more challenging than operating traditional ML systems. The difficulty stems from:

  • Massive model sizes
  • Intricate architecture
  • Non-deterministic outputs
  • Subjective quality assessment

Without proper observability, you're flying blind.


7.2 The Three Pillars of LLM Observability

Metrics, Logs, and Traces

┌─────────────────────────────────────────────────────────────┐
│              LLM OBSERVABILITY PILLARS                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  METRICS                                                     │
│  ├─ Token usage (input/output)                              │
│  ├─ Latency (TTFT, generation, total)                       │
│  ├─ Cost per request                                         │
│  ├─ Error rates                                              │
│  └─ Quality scores                                           │
│                                                              │
│  LOGS                                                        │
│  ├─ Full request/response pairs                             │
│  ├─ Model parameters used                                    │
│  ├─ User context and session info                           │
│  └─ Error details and stack traces                          │
│                                                              │
│  TRACES                                                      │
│  ├─ End-to-end request flow                                 │
│  ├─ RAG retrieval steps                                      │
│  ├─ Tool/function calls                                      │
│  └─ Multi-model orchestration                               │
│                                                              │
└─────────────────────────────────────────────────────────────┘
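
Traces are often the least familiar pillar. Below is a rough sketch of how the trace structure above could be captured with OpenTelemetry (listed under open-source tools in 7.9); the span names, attributes, and placeholder retrieval/generation steps are illustrative assumptions, not a standard.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for illustration; a real setup would use an OTLP exporter
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("llm-app")

def answer(query: str) -> str:
    with tracer.start_as_current_span("llm.request") as span:
        span.set_attribute("llm.model", "gpt-4-turbo")
        with tracer.start_as_current_span("rag.retrieval"):
            context = "...retrieved documents..."    # placeholder retrieval step
        with tracer.start_as_current_span("llm.generation") as gen:
            response = "...model output..."          # placeholder model call
            gen.set_attribute("llm.completion_tokens", 350)
        return response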

7.3 Essential Metrics to Track

Token-Level Metrics

As covered in Module 1 (Tokenization), tokens are the currency of LLMs:

Metric               | Description               | Why It Matters
---------------------|---------------------------|-----------------------
prompt_tokens        | Input token count         | Cost, context usage
completion_tokens    | Output token count        | Cost, response length
context_utilization  | % of context window used  | Efficiency
tokens_per_second    | Generation speed          | User experience
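
A minimal sketch of deriving these values from a provider-style usage payload; the field names follow the common prompt_tokens / completion_tokens convention, and the context window size and timing value are assumptions for illustration.

def token_metrics(usage: dict, context_window: int, generation_seconds: float) -> dict:
    # Derive the table's metrics from raw token counts and generation time
    prompt = usage["prompt_tokens"]
    completion = usage["completion_tokens"]
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "context_utilization": (prompt + completion) / context_window,
        "tokens_per_second": completion / generation_seconds if generation_seconds else 0.0,
    }

# Example: 1,500 tokens in / 350 out against a 128k window, generated in 2.1 s
print(token_metrics({"prompt_tokens": 1500, "completion_tokens": 350}, 128_000, 2.1))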

Latency Metrics

Critical for user experience:

Metric      | Description             | Target
------------|-------------------------|---------------------
TTFT        | Time to First Token     | < 500 ms
Generation  | Time for full response  | Varies by length
Total       | End-to-end latency      | < 3 s for most apps
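
TTFT and generation time fall out naturally when you time a streaming response. A rough sketch, assuming `stream` is any iterator that yields text chunks as they arrive from your provider's streaming API:

import time

def measure_latency(stream):
    # Time to first token, remaining generation time, and end-to-end total
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start   # time to first token
        chunks.append(chunk)
    total = time.perf_counter() - start
    generation = total - (ttft or 0.0)           # time spent streaming the rest
    return {"ttft_ms": (ttft or 0.0) * 1000,
            "generation_ms": generation * 1000,
            "total_ms": total * 1000,
            "text": "".join(chunks)}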

Quality Metrics

As discussed in Module 4 (Hallucinations), quality is subjective but still measurable:

Metric              | Detection Method
--------------------|----------------------------------------------
Hallucination rate  | Fact-checking against sources
Source attribution  | Does the response cite the provided context?
Coherence score     | Automated evaluation
User feedback       | Thumbs up/down, ratings
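
User feedback is the cheapest of these signals to wire up. A tiny sketch of turning thumbs up/down events into a rolling quality score; the event names are an assumed convention, not a standard.

from collections import Counter

feedback = Counter()

def record_feedback(signal: str) -> None:
    # signal is "thumbs_up" or "thumbs_down"
    feedback[signal] += 1

def quality_score() -> float:
    # Share of positive feedback, 0.0-1.0 (0.0 if no feedback yet)
    total = feedback["thumbs_up"] + feedback["thumbs_down"]
    return feedback["thumbs_up"] / total if total else 0.0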

Cost Metrics

# Cost calculation (per-token prices vary by model and provider)
def request_cost(prompt_tokens: int, completion_tokens: int,
                 price_per_input_token: float, price_per_output_token: float) -> float:
    input_cost = prompt_tokens * price_per_input_token
    output_cost = completion_tokens * price_per_output_token
    return input_cost + output_cost

# Track per user, session, and feature, e.g.:
# cost_per_user = sum(request_costs) / active_users

7.4 Logging Best Practices

What to Log

{
  "request_id": "uuid-here",
  "timestamp": "2024-01-15T10:30:00Z",
  "user_id": "user_123",
  "session_id": "session_456",

  "model": {
    "provider": "openai",
    "model_id": "gpt-4-turbo",
    "version": "2024-01-15"
  },

  "request": {
    "prompt_tokens": 1500,
    "system_prompt_hash": "abc123",
    "has_context": true,
    "context_source": "rag"
  },

  "response": {
    "completion_tokens": 350,
    "stop_reason": "end_turn",
    "has_tool_calls": false
  },

  "timing": {
    "ttft_ms": 450,
    "generation_ms": 2100,
    "total_ms": 2550
  },

  "parameters": {
    "temperature": 0.7,
    "max_tokens": 1000,
    "top_p": 0.9
  },

  "cost_usd": 0.05
}

What NOT to Log (Privacy)

  • Full user prompts (unless consented)
  • PII in responses
  • API keys or secrets
  • Internal system prompts (competitive advantage)
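
A minimal sketch of combining the two lists above: emit the structured record as a JSON log line and keep the raw prompt only with consent, redacted. The regex patterns and the prompt_preview field are illustrative assumptions, not a complete PII filter.

import json
import logging
import re
import uuid

logger = logging.getLogger("llm.requests")

def redact(text: str) -> str:
    # Strip obvious PII before anything reaches the log pipeline
    text = re.sub(r'[\w.+-]+@[\w-]+\.[\w.]+', '[EMAIL]', text)
    text = re.sub(r'\b\d{3}[- ]?\d{2}[- ]?\d{4}\b', '[SSN]', text)
    return text

def log_llm_request(record: dict, raw_prompt: str, user_consented: bool) -> None:
    record.setdefault("request_id", str(uuid.uuid4()))
    # Only keep prompt text if the user opted in, and redact it even then
    if user_consented:
        record.setdefault("request", {})["prompt_preview"] = redact(raw_prompt)[:500]
    logger.info(json.dumps(record))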

7.5 Detecting Hallucinations in Production

Automated Detection Pipeline

┌─────────────────────────────────────────────────────────────┐
│               HALLUCINATION DETECTION                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  INPUT                                                       │
│  ├─ User query                                               │
│  ├─ Retrieved context (RAG)                                  │
│  └─ Model response                                           │
│              │                                               │
│              ▼                                               │
│  DETECTION METHODS                                           │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                                                        │  │
│  │  1. SOURCE GROUNDING                                   │  │
│  │     Extract claims → Match to context → Score          │  │
│  │                                                        │  │
│  │  2. SELF-CONSISTENCY                                   │  │
│  │     Generate 3x → Compare facts → Flag conflicts       │  │
│  │                                                        │  │
│  │  3. PATTERN DETECTION                                  │  │
│  │     Specific numbers? Named citations? URLs?           │  │
│  │     → Higher hallucination risk                        │  │
│  │                                                        │  │
│  │  4. UNCERTAINTY SIGNALS                                │  │
│  │     "I believe..." "probably..." → Lower confidence    │  │
│  │                                                        │  │
│  └───────────────────────────────────────────────────────┘  │
│              │                                               │
│              ▼                                               │
│  OUTPUT: Risk Score (0.0 - 1.0)                             │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Implementation Example

import re

def detect_hallucination_risk(response: str, context: str) -> float:
    """Heuristic risk score in [0.0, 1.0]; higher means more likely ungrounded."""
    risk_score = 0.0

    # Specific numbers that don't appear in the retrieved context
    numbers = re.findall(r'\d+\.?\d*%?', response)
    for num in numbers:
        if num not in context:
            risk_score += 0.1

    # Citation-style phrases ("according to", "study shows") with no support in the context
    citations = re.findall(r'according to|study shows|research indicates', response.lower())
    if citations and not any(c in context.lower() for c in citations):
        risk_score += 0.2

    # URLs are a common hallucination pattern
    urls = re.findall(r'https?://\S+', response)
    risk_score += len(urls) * 0.15

    return min(risk_score, 1.0)
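
The self-consistency method (2 in the diagram above) needs model calls, so this sketch takes a `generate` callable as a placeholder for whatever client you use; the word-overlap comparison is a crude stand-in for real fact extraction.

def self_consistency_flag(prompt: str, generate, n: int = 3, threshold: float = 0.5) -> bool:
    # Sample the model n times and flag the response set if the samples disagree too much
    responses = [generate(prompt) for _ in range(n)]

    def overlap(a: str, b: str) -> float:
        # Jaccard word overlap as a rough proxy for shared facts
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)

    scores = [overlap(responses[i], responses[j])
              for i in range(n) for j in range(i + 1, n)]
    return sum(scores) / len(scores) < threshold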

7.6 Building a Monitoring Dashboard

Key Visualizations

Panel                | Metric        | Purpose
---------------------|---------------|--------------------
Request Volume       | Requests/min  | Load monitoring
Latency P50/P95/P99  | Milliseconds  | Performance
Token Usage          | Tokens/hour   | Cost projection
Error Rate           | % failed      | Reliability
Quality Score        | 0-100         | User satisfaction
Cost Tracker         | $/day         | Budget management
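
If you already run Prometheus (see 7.9), these panels map directly onto a few counters and histograms. A sketch using the prometheus_client library; the metric names, labels, and buckets are my own choices, not a convention.

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "LLM requests", ["model", "status"])
TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["model", "direction"])
LATENCY = Histogram("llm_request_seconds", "End-to-end latency", ["model"],
                    buckets=(0.5, 1, 2, 3, 5, 10, 30))
COST = Counter("llm_cost_usd_total", "Cumulative spend in USD", ["model"])

def record_request(model, prompt_tokens, completion_tokens, seconds, cost_usd, ok=True):
    # Update all dashboard-facing metrics for a single completed request
    REQUESTS.labels(model=model, status="ok" if ok else "error").inc()
    TOKENS.labels(model=model, direction="input").inc(prompt_tokens)
    TOKENS.labels(model=model, direction="output").inc(completion_tokens)
    LATENCY.labels(model=model).observe(seconds)
    COST.labels(model=model).inc(cost_usd)

start_http_server(9100)  # expose /metrics for the dashboard to scrape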

Alert Thresholds

Alert Condition              | Evaluation Window     | Severity
-----------------------------|-----------------------|----------
Latency P95 > 5 s            | 5 consecutive checks  | Warning
Error rate > 1%              | 5 min window          | Critical
Hallucination rate > 15%     | Hourly average        | Warning
Cost > 120% of daily budget  | Real-time             | Critical

7.7 Debugging LLM Issues

Common Problems and Solutions

Problem               | Symptoms          | Debug Approach
----------------------|-------------------|----------------------------------------------
High latency          | Slow responses    | Check token counts, model load
Inconsistent quality  | Variable outputs  | Review temperature, check prompts
High costs            | Budget overruns   | Analyze token usage patterns
Hallucinations        | Incorrect facts   | Check context injection, prompts
Rate limiting         | 429 errors        | Implement backoff and caching (sketch below)
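
For the rate-limiting row, a minimal retry-with-exponential-backoff sketch; the string check for "429" is a placeholder for your client library's specific rate-limit exception class.

import random
import time

def with_backoff(call, max_retries: int = 5):
    # Retry `call` (any zero-argument function that raises on HTTP 429)
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:  # narrow this to your client's rate-limit error
            if "429" not in str(exc) or attempt == max_retries - 1:
                raise
            delay = (2 ** attempt) + random.random()  # exponential delay plus jitter
            time.sleep(delay)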

The Debug Checklist

  1. Check the prompt - Is context being injected correctly?
  2. Check the parameters - Temperature, max_tokens appropriate?
  3. Check the model - Right model for the task?
  4. Check the context - RAG returning relevant results?
  5. Check for patterns - Does error correlate with input type?

7.8 Multi-Model Observability

When Using Multiple Models

Track per-model performance to optimize routing:

┌─────────────────────────────────────────────────────────────┐
│                 MODEL COMPARISON                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Task Type    │ GPT-4  │ Claude │ Gemini │ Llama-3         │
│  ─────────────┼────────┼────────┼────────┼─────────         │
│  Reasoning    │  0.92  │  0.90  │  0.88  │  0.82           │
│  Coding       │  0.89  │  0.91  │  0.85  │  0.80           │
│  Creative     │  0.85  │  0.88  │  0.82  │  0.75           │
│  Cost/1M tok  │  $30   │  $15   │  $7    │  ~$0            │
│                                                              │
│  INSIGHT: Route simple queries to cheaper models            │
│                                                              │
└─────────────────────────────────────────────────────────────┘
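
A sketch of what such quality-vs-cost routing could look like, using the illustrative scores from the table above; the task-type labels and the quality bar are assumptions.

# Per-model quality scores and cost, mirroring the comparison table above
MODELS = {
    "gpt-4":   {"reasoning": 0.92, "coding": 0.89, "creative": 0.85, "cost_per_1m": 30.0},
    "claude":  {"reasoning": 0.90, "coding": 0.91, "creative": 0.88, "cost_per_1m": 15.0},
    "gemini":  {"reasoning": 0.88, "coding": 0.85, "creative": 0.82, "cost_per_1m": 7.0},
    "llama-3": {"reasoning": 0.82, "coding": 0.80, "creative": 0.75, "cost_per_1m": 0.0},
}

def route(task_type: str, min_quality: float = 0.85) -> str:
    # Pick the cheapest model that clears the quality bar for this task type
    candidates = [(name, spec) for name, spec in MODELS.items()
                  if spec[task_type] >= min_quality]
    if not candidates:
        # Nothing clears the bar: fall back to the highest-quality model
        return max(MODELS.items(), key=lambda item: item[1][task_type])[0]
    return min(candidates, key=lambda item: item[1]["cost_per_1m"])[0]

print(route("coding"))                      # cheapest model with coding score >= 0.85
print(route("reasoning", min_quality=0.8))  # all models qualify, so the cheapest wins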

7.9 Tools and Platforms

Open Source

Tool           | Purpose
---------------|---------------------------
Langfuse       | LLM tracing and analytics
Phoenix        | ML observability
OpenTelemetry  | Distributed tracing
Prometheus     | Metrics collection

Commercial

Tool              | Strengths
------------------|------------------------
Datadog LLM       | Full-stack integration
LangSmith         | LangChain ecosystem
Weights & Biases  | Experiment tracking
Helicone          | Cost optimization

Build vs. Buy

Build When             | Buy When
-----------------------|------------------------
Custom requirements    | Standard needs
Data sensitivity       | Speed to market
Cost optimization      | Limited team capacity
Competitive advantage  | Focus on core product

7.10 Key Takeaways

The Observability Checklist

  • [ ] Track token usage (input/output separately)
  • [ ] Measure latency (TTFT, generation, total)
  • [ ] Log requests with correlation IDs
  • [ ] Monitor costs in real-time
  • [ ] Detect hallucinations automatically
  • [ ] Set up alerts for anomalies
  • [ ] Build dashboards for visibility
  • [ ] Implement multi-model tracking if applicable

Remember

"You can't improve what you can't measure."

LLM observability isn't optional in production - it's essential for:

  • Cost control - LLMs are expensive
  • Quality assurance - Hallucinations damage trust
  • User experience - Latency matters
  • Debugging - Black boxes need visibility

Practice Questions

  1. What metrics would you prioritize for a customer-facing chatbot?
  2. How would you detect if your RAG system is returning irrelevant context?
  3. What's the difference between TTFT and total latency, and why track both?
  4. How would you set up alerts for detecting prompt injection attacks?

Next Module

Module 8: Building Memory Systems for LLMs


Based on production patterns from leading AI teams