Why Traditional APM Fails for LLM Applications
Your existing APM tools (Datadog, New Relic, Grafana) tell you that an HTTP request took 2.3 seconds, returned 200 OK, and consumed 45MB of memory. For an LLM-powered endpoint, this tells you almost nothing useful.
The request succeeded — but did the LLM hallucinate an answer? Did it follow the system prompt? Did it cost $0.002 or $0.20? Did it use the right model? Was the response actually helpful to the user?
LLM observability requires a fundamentally different approach. You need to track not just infrastructure metrics, but content quality, cost efficiency, and model behavior. Here's how.
The 12 Metrics Every LLM System Needs
Reliability Metrics
1. Latency (p50, p95, p99) per endpoint
Track time to first token (TTFT) and total response time separately. TTFT matters for streaming UX. Total time matters for billing and throughput.
Healthy targets: TTFT < 500ms (p95), total < 3s (p95) for chat features. Alert if p95 exceeds 5s.
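Measuring TTFT and total time separately is a small amount of code if your client streams tokens. A minimal sketch, assuming any iterable of tokens stands in for your streaming LLM client (the client itself is hypothetical here):

```python
import time
from typing import Iterable, List, Tuple

def measure_stream(tokens: Iterable[str]) -> Tuple[List[str], float, float]:
    """Consume a token stream, recording time-to-first-token (TTFT)
    and total response time.

    Returns (collected_tokens, ttft_seconds, total_seconds).
    """
    start = time.monotonic()
    ttft = None
    collected = []
    for tok in tokens:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        collected.append(tok)
    total = time.monotonic() - start
    # If the stream was empty, report total time as TTFT.
    return collected, ttft if ttft is not None else total, total
```

Feed the two numbers into separate histograms so p95 TTFT and p95 total time can be alerted on independently.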
2. Error rate by error type
Segment errors into: rate limits (429), timeouts, context length exceeded, content filtered, server errors (500), and parsing failures. Each type has different root causes and remediation.
Healthy target: < 1% total error rate. Alert if any single error type exceeds 0.5%.
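Segmentation only works if every failure lands in exactly one bucket. A heuristic classifier sketch; the status codes and body substrings are assumptions, since real providers report context-length and content-filter failures differently:

```python
def classify_error(status_code=None, exception=None, body=""):
    """Map a failed LLM call to one of the error buckets above.

    Heuristic sketch: adjust the matching rules to your provider's
    actual error responses.
    """
    if exception is not None and "timeout" in type(exception).__name__.lower():
        return "timeout"
    if exception is not None and "json" in type(exception).__name__.lower():
        return "parsing_failure"
    if status_code == 429:
        return "rate_limit"
    if status_code == 400 and "context" in body.lower():
        return "context_length"
    if status_code in (400, 403) and "filter" in body.lower():
        return "content_filtered"
    if status_code is not None and status_code >= 500:
        return "server_error"
    return "other"
```

Emit the returned bucket as a metric tag so each error type gets its own rate and its own alert threshold.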
3. Timeout rate
Track separately from errors because timeouts are a reliability signal, not a correctness signal. A spike in timeouts usually means the provider is degraded.
Healthy target: < 0.5%. Alert if it exceeds 2% in a 5-minute window.
4. Fallback activation rate
How often are you hitting your secondary model or cached responses? Some fallback usage is expected (0.5-2%). A spike indicates primary provider issues.
Cost Metrics
5. Cost per request
Calculate in real-time: (input_tokens × input_price) + (output_tokens × output_price). Track per endpoint, per model, per user.
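The formula is simple enough to compute inline on every request. A sketch with a local price table; the model names and per-million-token prices below are illustrative placeholders, not real price sheets:

```python
# Illustrative per-1M-token prices in USD -- substitute your provider's
# actual price sheet, and keep it versioned: prices change.
PRICES = {
    "small-model": {"input": 0.25, "output": 1.25},
    "large-model": {"input": 3.00, "output": 15.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request: tokens scaled to per-million prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Tag the resulting value with endpoint, model, and user ID when you emit it, so all three per-dimension views fall out of the same metric.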
6. Cost per user per day
Aggregate cost at the user level. This feeds into per-user budget enforcement and pricing decisions. Track the distribution — most users will be cheap, but your top 5% will drive 50%+ of cost.
7. Token usage (input vs output)
Track input and output tokens separately. Input tokens are your biggest optimization lever (context management). Output tokens indicate response verbosity (which you can control via prompts and max_tokens).
8. Cache hit rate
Target 20-40% for conversational features, 50%+ for search/FAQ. Track per cache layer (exact match, semantic, prefix). If overall hit rate is below 10%, your caching strategy needs work.
Quality Metrics
9. User feedback score
Thumbs up/down on AI responses. Track the ratio over time. A sustained drop (e.g., from 85% positive to 70% positive) indicates quality degradation — possibly from model updates, prompt drift, or data issues.
10. Hallucination rate (sampled)
Run automated hallucination detection on a sample (5-10%) of responses. Use an LLM-as-judge approach: pass the response and the source context to a judge model, ask if the response is faithful to the sources. Track weekly trends.
11. Format compliance rate
If your prompts request structured output (JSON, specific format), track how often the response actually matches. Non-compliant responses cause downstream parsing errors. Target 99%+.
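A compliance check can be as small as a parse-and-validate step on every structured response. One concrete sketch for the JSON case; the required-keys check is a stand-in for whatever schema your prompt demands (in production you might validate with jsonschema or pydantic instead):

```python
import json

def is_compliant(response: str, required_keys=()) -> bool:
    """True if `response` parses as a JSON object containing all
    `required_keys` -- one concrete notion of format compliance."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(k in data for k in required_keys)
```

Record the boolean per response and report the ratio; the same check can gate a retry before the bad output reaches your parser.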
12. Regeneration rate
How often do users click "regenerate" or "try again"? This is a direct signal that the first response was unsatisfactory. Track per feature. Rising regeneration rate = falling quality.
Distributed Tracing for AI
Standard distributed tracing tracks request → service → database. AI tracing needs additional spans:
```
[User Request]
└── [AI Gateway]
    ├── [Cache Check] (hit/miss, latency)
    ├── [Model Router] (selected model, reason)
    ├── [LLM Call] (model, tokens_in, tokens_out, latency, cost)
    │   └── [Tool Calls] (tool_name, parameters, result, latency)
    ├── [Response Processing] (parsing, validation, filtering)
    └── [Cache Write] (key, TTL)
```

Each span should include AI-specific attributes:
- llm.model: which model was used
- llm.tokens.input: input token count
- llm.tokens.output: output token count
- llm.cost: calculated cost in USD
- llm.cache_hit: boolean
- llm.fallback: boolean (was a fallback model used?)
This trace structure lets you answer questions like: "Why did this request cost $0.15?" (Answer: cache miss, routed to Opus, 3 tool calls, 8K output tokens).
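A sketch of the attribute payload as a plain dict, independent of any particular tracing SDK. With OpenTelemetry you would attach these via span.set_attribute(); note the key names follow this article's convention, not an official semantic convention:

```python
def llm_span_attributes(model, tokens_in, tokens_out, cost_usd,
                        cache_hit=False, fallback=False):
    """Build the AI-specific span attributes listed above as a flat dict,
    ready to attach to whatever span object your tracing SDK provides."""
    return {
        "llm.model": model,
        "llm.tokens.input": tokens_in,
        "llm.tokens.output": tokens_out,
        "llm.cost": round(cost_usd, 6),
        "llm.cache_hit": cache_hit,
        "llm.fallback": fallback,
    }
```

Keeping the attribute construction in one function means every LLM call site emits the same keys, which is what makes trace-level cost queries possible later.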
Cost Dashboards
Build three cost views:
Real-time view: Cost per minute/hour for the last 24 hours. This is your smoke alarm. A spike means a runaway prompt, a cache failure, or a traffic burst.
Daily rollup view: Cost per day, broken down by model, feature, and top users. This feeds into budgeting and optimization planning.
Monthly projection view: Current month spend with linear projection to month-end. Compare against budget. Alert at 80% of monthly budget.
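The linear projection and the 80% alert are a few lines. A sketch, assuming spend-to-date is already aggregated from your cost metric:

```python
import calendar
import datetime

def project_month_spend(spend_to_date: float, today: datetime.date) -> float:
    """Linear month-end projection: average daily spend so far,
    extrapolated across the full month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month

def budget_alert(spend_to_date: float, today: datetime.date,
                 monthly_budget: float, threshold: float = 0.8) -> bool:
    """True when projected spend crosses `threshold` of the budget."""
    return project_month_spend(spend_to_date, today) >= threshold * monthly_budget
```

Linear projection over-reacts early in the month (day 1 or 2 of data extrapolates wildly), so consider suppressing the alert for the first few days.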
Quality Monitoring: Detecting Output Degradation
Model providers silently update models. What worked last Tuesday might not work this Tuesday. You need automated quality monitoring to catch degradation.
The Quality Pipeline
- Sample 5-10% of production requests
- Evaluate using LLM-as-judge (a separate model scores the response on a 1-5 scale for relevance, accuracy, and format compliance)
- Aggregate scores into a daily quality score per feature
- Alert if the 7-day rolling average drops by more than 10%
- Investigate by comparing recent responses to historical baselines
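The alerting step of the pipeline above can be sketched as a comparison of two rolling windows; the window and threshold defaults mirror the 7-day / 10% figures above:

```python
def quality_alert(daily_scores, window=7, drop_threshold=0.10):
    """Alert when the latest `window`-day average falls more than
    `drop_threshold` below the previous window's average.

    `daily_scores` is a chronological list of daily quality scores
    (e.g. mean 1-5 judge scores for one feature).
    """
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(daily_scores[-window:]) / window
    baseline = sum(daily_scores[-2 * window:-window]) / window
    return baseline > 0 and (baseline - recent) / baseline > drop_threshold
```

Comparing against the immediately preceding window (rather than a fixed baseline) makes the alert track gradual drift as well as sudden drops.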
Cost of this pipeline: approximately 5-10% of your primary LLM spend. Worth it.
Anomaly Detection
Beyond quality scores, watch for statistical anomalies:
- Response length anomalies: If average response length changes by more than 20%, the model behavior has shifted.
- Token usage anomalies: Sudden increase in input or output tokens per request.
- Refusal rate anomalies: The model refusing to answer queries it previously handled.
- Latency anomalies: p95 latency increasing without traffic increase.
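The response-length check (and, with different inputs, the token-usage check) reduces to comparing a recent mean against a baseline mean. A minimal sketch using the 20% threshold from the list above:

```python
def length_anomaly(baseline_lengths, recent_lengths, threshold=0.20):
    """Flag a behavior shift when mean response length moves more than
    `threshold` (20% by default) relative to the baseline mean."""
    if not baseline_lengths or not recent_lengths:
        return False
    base = sum(baseline_lengths) / len(baseline_lengths)
    recent = sum(recent_lengths) / len(recent_lengths)
    return base > 0 and abs(recent - base) / base > threshold
```

The same shape works for tokens-per-request; refusal and latency anomalies need rate and percentile inputs respectively, but the compare-against-baseline structure is identical.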
Tool Comparison
LangSmith (LangChain)
Best for teams already using LangChain. Strong tracing and evaluation features. Less useful if you're not in the LangChain ecosystem.
Helicone
Best for cost monitoring and gateway-level observability. Easy to integrate (proxy-based). Good dashboards. Limited custom evaluation.
Custom (recommended for production)
Build on your existing observability stack (Datadog, Grafana, Prometheus). Add AI-specific metrics as custom metrics. Most flexible, most work. Recommended for teams with strong observability culture.
Our Recommendation
Start with Helicone for immediate visibility (15-minute setup). Build custom metrics in your existing observability stack for the long term. Use LangSmith only if you're using LangChain.
Building Your LLM Observability Stack
Week 1: Instrument the 4 reliability metrics (latency, errors, timeouts, fallback). Ship a basic cost dashboard.
Week 2: Add quality metrics (user feedback, format compliance). Set up sampling for input/output logging.
Week 3: Build the quality evaluation pipeline. Implement anomaly detection on key metrics.
Week 4: Create alerting rules. Document runbooks for common alert scenarios.
Total effort: 2-3 engineering weeks. This is not optional for production AI — it's the difference between knowing your AI is working and hoping it is.