
LLM Architecture for Production: A Systems Engineer's Guide

The complete guide to building production LLM systems. Covers API gateway design, model routing, fallback chains, token management, caching, observability, and the architecture decisions that separate hobby projects from production systems.

16 min read · Updated Mar 11, 2026

Why LLM Architecture Is Distributed Systems Engineering

Building LLM-powered features isn't machine learning engineering — it's distributed systems engineering. You're dealing with unreliable external services (LLM APIs go down), variable latency (50ms to 30 seconds for the same call), non-deterministic outputs (same input, different output), and cost that scales with usage (not infrastructure).

If you've built systems on top of unreliable external APIs before, you already have the mental models you need. If you haven't, this guide gives you the architecture patterns that production LLM systems require.

The LLM Gateway: Your Single Point of Control

Every production LLM system needs a gateway — a service that sits between your application and LLM providers. The gateway is where you implement all cross-cutting concerns: routing, rate limiting, caching, fallbacks, observability, and cost control.

Gateway Responsibilities

Request validation: Check token budget, rate limits, and permissions before making the LLM call. Reject requests that would exceed limits.

Model routing: Route requests to the appropriate model based on task type, cost constraints, or latency requirements. A classification task goes to Haiku. A complex analysis goes to Opus.

Caching: Check for cached responses before calling the LLM. This is your single biggest cost optimization lever.

Provider abstraction: Your application code calls the gateway with a task-level API. The gateway translates to provider-specific APIs. When you switch from GPT-4o to Claude, only the gateway changes.

Observability: Log every request with latency, token count, cost, model, and quality signals. This is the data you need for optimization.
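
The responsibilities above can be sketched as a thin gateway class. This is a minimal illustration, not a production implementation: the provider call is abstracted behind a callable, the route table and limits are placeholder values, and all names are invented for the example.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LLMGateway:
    """Minimal sketch of the gateway responsibilities described above."""
    provider_call: Callable[[str, str], str]   # (model, prompt) -> response
    routes: dict = field(default_factory=lambda: {"classify": "haiku", "analyze": "opus"})
    default_model: str = "sonnet"
    max_prompt_chars: int = 8000
    cache: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def handle(self, task: str, prompt: str) -> str:
        # Request validation: reject before spending any tokens
        if len(prompt) > self.max_prompt_chars:
            raise ValueError("prompt exceeds per-request budget")
        # Cache check: exact match on (task, prompt)
        key = hashlib.sha256(json.dumps([task, prompt]).encode()).hexdigest()
        if key in self.cache:
            self.log.append({"task": task, "cache_hit": True})
            return self.cache[key]
        # Model routing: task type -> model
        model = self.routes.get(task, self.default_model)
        # Provider abstraction: the gateway owns the provider-specific call
        response = self.provider_call(model, prompt)
        # Cache write and observability log
        self.cache[key] = response
        self.log.append({"task": task, "model": model, "cache_hit": False})
        return response
```

A second identical request is served from the cache and logged as a hit, so the application code never needs to know which provider, model, or cache layer was involved.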

Gateway Implementation

You have three options:

  • Build it: A thin Node.js or Python service that wraps LLM API calls. 500-1000 lines of code. Full control, no vendor lock-in.
  • Open source: LiteLLM is the best option — it supports 100+ models with a unified API, handles retries and fallbacks, and provides basic observability.
  • Managed: Helicone, Portkey, or similar services. Fastest to set up, limited customization, ongoing cost.

For most SaaS products, start with LiteLLM and add custom logic as needed.

Model Routing: Choosing the Right Model Per Request

Not every request needs your most expensive model. A well-designed routing strategy saves 20-40% on LLM costs without noticeable quality degradation.

Routing Strategies

Task-based routing: Define model assignments per task type.

  • Classification/extraction: Claude Haiku or GPT-4o mini — fast, cheap, accurate for structured tasks
  • Summarization: Claude Sonnet — good balance of quality and cost
  • Complex reasoning: Claude Opus or GPT-4o — when quality matters more than cost
  • Code generation: Claude Sonnet or GPT-4o — both strong for code

Cost-based routing: Set a per-request cost budget. Route to the cheapest model that meets quality requirements for the task.

Latency-based routing: For real-time features (autocomplete, search), route to the fastest model. For async processing, optimize for quality and cost.

Cascade routing: Start with a cheap model. If the response quality is below threshold (measured by a classifier), retry with a more expensive model. This gives you cheap-model pricing for easy requests and expensive-model quality for hard ones.
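
Cascade routing reduces to a loop over models ordered cheap to expensive. In this sketch, `call_model` and `quality_score` are stand-ins for a provider client and a quality classifier; the threshold and tier names are illustrative:

```python
def cascade_route(prompt, tiers, call_model, quality_score, threshold=0.7):
    """Try models cheapest-first; escalate while quality is below threshold.

    tiers: model names ordered cheap -> expensive (illustrative names).
    call_model(model, prompt) -> response text.
    quality_score(response) -> score in [0, 1] from a quality classifier.
    """
    response = None
    for model in tiers:
        response = call_model(model, prompt)
        if quality_score(response) >= threshold:
            return model, response    # cheap-model pricing for easy requests
    return tiers[-1], response        # best effort from the most capable tier
```

Easy requests stop at the first tier; hard ones pay for escalation, which is exactly the cost profile the strategy promises.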

Fallback Chains: Claude → GPT-4o → Cached Response

Every production system needs a fallback chain. When your primary model provider goes down — and it will — your fallback chain determines whether your feature keeps working or your users see errors.

The Three-Level Fallback

Level 1: Primary model (e.g., Claude Sonnet) — Normal operation. Used for all requests when available.

Level 2: Secondary model (e.g., GPT-4o) — Activated when primary is unavailable or exceeds timeout. Different provider eliminates single-provider risk. Accept slightly different output quality.

Level 3: Cached response — Activated when both providers are down. Return the best cached response for the query. For many use cases (FAQ, documentation search), cached responses are perfectly acceptable.

Implementation

Use a circuit breaker pattern. Track the last N requests to each provider. If the failure rate exceeds 50% in a 30-second window, trip the circuit breaker and skip directly to the fallback. Probe every 30 seconds to check whether the primary has recovered.

The timeout for the primary should be aggressive — 5 seconds for real-time, 15 seconds for background tasks. Don't let slow responses from a degraded provider cascade through your system.
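
The sliding-window breaker described above can be sketched as a small class. The window, 50% threshold, and 30-second probe interval match the text; the injectable clock and minimum-call count are implementation choices for the example:

```python
import time
from collections import deque


class CircuitBreaker:
    """Trip when more than 50% of recent calls failed within the window."""

    def __init__(self, window_seconds=30.0, failure_threshold=0.5,
                 min_calls=4, clock=time.monotonic):
        self.window = window_seconds
        self.threshold = failure_threshold
        self.min_calls = min_calls        # don't trip on tiny samples
        self.clock = clock                # injectable for testing
        self.events = deque()             # (timestamp, succeeded) pairs
        self.opened_at = None             # set while the breaker is open

    def record(self, succeeded: bool):
        self.events.append((self.clock(), succeeded))

    def allow_request(self) -> bool:
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.window:
                return False              # still open: go straight to fallback
            self.opened_at = None         # half-open: let one probe through
            self.events.clear()
        # Drop events that fell out of the sliding window
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        failures = sum(1 for _, ok in self.events if not ok)
        if len(self.events) >= self.min_calls and failures / len(self.events) > self.threshold:
            self.opened_at = now          # trip the breaker
            return False
        return True
```

Wrap each provider call with `allow_request()` before and `record()` after; when it returns `False`, move down the fallback chain without waiting on the degraded provider.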

Token Management: Budgets, Counting, and Optimization

Tokens are the unit of cost in LLM systems. Managing them is managing your margins.

Token Budgets

Set budgets at three levels:

  • Per-request: Maximum 4,096 input tokens + 1,024 output tokens for a chat response. Truncate context to fit.
  • Per-user: Maximum 50,000 tokens per day for free tier, 500,000 for paid. Enforce at the gateway.
  • Per-tenant: Maximum monthly token budget based on their plan. Alert at 80%, hard stop at 100%.
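
The three-level check can be sketched as a single gateway function. The dict keys and limits here are illustrative; in a real deployment the usage counters would live in a shared store such as Redis:

```python
def check_budget(request_tokens, usage, limits):
    """Enforce per-request, per-user-daily, and per-tenant-monthly budgets.

    usage/limits: dicts keyed 'request', 'user_day', 'tenant_month'
    (illustrative names). Returns (allowed, warnings).
    """
    if request_tokens > limits["request"]:
        return False, ["per-request budget exceeded: truncate context"]
    if usage["user_day"] + request_tokens > limits["user_day"]:
        return False, ["per-user daily budget exceeded"]
    projected = usage["tenant_month"] + request_tokens
    if projected > limits["tenant_month"]:
        return False, ["tenant monthly budget exceeded: hard stop"]
    warnings = []
    if projected > 0.8 * limits["tenant_month"]:
        warnings.append("tenant at 80% of monthly budget: alert")
    return True, warnings
```

Running this check at the gateway, before the provider call, is what turns a budget from a dashboard number into an enforced limit.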

Context Window Optimization

The biggest cost driver is input tokens — the context you send with each request. Optimize aggressively:

  • Truncate conversation history to the last 5-10 messages, not the full history
  • Summarize old context instead of including verbatim messages
  • Select relevant context using retrieval instead of including everything
  • Remove boilerplate from system prompts — every token counts

A typical optimization pass reduces input tokens by 30-50%.
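
The truncate-and-summarize tactics combine into a small helper. Here `summarize` is a stub standing in for an LLM summarization call, and the default of 8 kept messages is an arbitrary value in the 5-10 range suggested above:

```python
def trim_context(messages, keep_last=8, summarize=None):
    """Keep the last `keep_last` messages verbatim; fold older ones
    into a single summary message instead of sending the full history."""
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    # Stand-in summary when no summarizer (e.g. an LLM call) is supplied
    summary = summarize(older) if summarize else f"[Summary of {len(older)} earlier messages]"
    return [{"role": "system", "content": summary}] + recent
```

A 50-message history becomes one summary message plus the last 8 turns, which is where most of the 30-50% input-token reduction comes from.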

Caching: Semantic Cache, Prefix Cache, Response Cache

Caching is the highest-ROI optimization in any LLM system. A well-implemented cache saves 15-30% on costs and cuts latency by 10x on cache hits.

Three Caching Layers

Exact match cache: Hash the full input (system prompt + user message + parameters). If the hash matches, return the cached response. Simple, reliable, no quality risk.

Prefix cache: Provider-level optimization (both Anthropic and OpenAI support this). If your system prompt is the same across requests, the provider caches the processed prefix. Reduces input cost by up to 90% for the cached prefix portion. Requires consistent system prompts.

Semantic cache: Embed the query and find similar previous queries using vector similarity. If similarity exceeds a threshold (typically 0.95), return the cached response. Higher hit rate than exact match, but risk of returning incorrect cached responses for similar-but-different queries.

Start with exact match + prefix caching. Add semantic caching only if you need higher hit rates and can tolerate occasional cache mismatches.
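
The exact-match key is just a hash over the full input, as described above. The field names in this sketch are illustrative; the important detail is that sampling parameters are part of the key, so calls at different temperatures never share a cached response:

```python
import hashlib
import json


def cache_key(system_prompt, user_message, params):
    """Exact-match cache key: hash system prompt + user message + parameters.

    sort_keys makes the JSON canonical, so logically identical inputs
    always produce the same key regardless of dict insertion order.
    """
    payload = json.dumps(
        {"system": system_prompt, "user": user_message, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Store the response under this key in Redis with a TTL; any change to the prompt, message, or parameters produces a different key and a clean miss.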

The 5 Metrics You Must Track

Without these five metrics, you're flying blind:

  • Latency p95 — Not average, p95. If your p95 exceeds 5 seconds for real-time features, you have a problem. Track per endpoint.
  • Cost per request — (input_tokens × input_price) + (output_tokens × output_price). Track per request, aggregate per user, per feature, and per day.
  • Error rate by type — Rate limits, timeouts, content filters, and model errors. Each requires different remediation.
  • Cache hit rate — Your target is 20-40% for conversational features, 50%+ for search/FAQ. If hit rate is below 10%, your caching strategy needs work.
  • Quality score — User feedback (thumbs up/down), or automated LLM-as-judge scoring on a sample. Track weekly trends. Any sustained drop indicates prompt degradation or model updates.
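
The cost-per-request formula is a one-liner once you have a price table. The per-million-token prices below are illustrative placeholders; real prices vary by model and change over time, so the table should be configuration, not code:

```python
# Illustrative per-million-token prices; check your provider's current pricing.
PRICES = {"sonnet": {"input": 3.00, "output": 15.00}}


def request_cost(model, input_tokens, output_tokens, prices=PRICES):
    """Dollar cost of one request: tokens times per-million-token price."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Logging this value per request is what makes per-user, per-feature, and per-day aggregation a simple query instead of a reconstruction exercise.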

Reference Architecture

Here's the complete reference architecture for a production LLM system:

Client → Application API → LLM Gateway
                              ├── Request Validator
                              ├── Rate Limiter (per-user, per-tenant)
                              ├── Cache Check (exact → semantic)
                              ├── Model Router (task → model mapping)
                              ├── Provider Client (with circuit breaker)
                              │    ├── Primary: Claude API
                              │    ├── Fallback: GPT-4o API
                              │    └── Emergency: Cached responses
                              ├── Response Processor
                              ├── Cache Writer
                              └── Observability Logger
                                   ├── Metrics (latency, tokens, cost)
                                   ├── Traces (request → response)
                                   └── Quality (feedback, sampling)

Every component is independently deployable and testable. The gateway is stateless — cache and metrics live in external stores (Redis, TimescaleDB). The application never talks to LLM providers directly.

Conclusion

Production LLM architecture is about control. Control over cost (budgets, caching, routing). Control over reliability (fallbacks, circuit breakers, timeouts). Control over quality (monitoring, evaluation, alerting). The gateway pattern gives you that control.

Build the gateway first. Everything else — the features, the prompts, the fine-tuning — depends on having solid infrastructure underneath.

About the author

Written by Rafael Danieli, founder of StoAI. Systems engineer specializing in production AI for SaaS companies. Background in distributed systems, reliability engineering, and integration architecture.