The Cost Problem: Why LLM Bills Spiral
When we audited a fintech platform's AI system, they were spending $8,700/month on LLM API calls. Eight months earlier, they'd launched at $900/month. The feature set hadn't changed much — but usage had grown 4x, they'd switched to a more expensive model "for quality," and nobody had optimized the prompts since launch.
This is the default trajectory for every LLM-powered feature: costs spiral unless you actively optimize. Here are the 8 techniques we used to cut that $8,700/month to $3,200/month — a 63% reduction — without measurable quality degradation.
Technique 1: Prompt Compression (Saved 18%)
The system prompt was 2,100 tokens. It had been written during prototyping and never trimmed. It included examples that were no longer relevant, redundant instructions, and verbose formatting rules.
What we did:
- Removed redundant instructions (the model already follows these by default)
- Compressed examples from 3 to 1 (the most representative one)
- Replaced verbose formatting rules with a JSON schema reference
- Removed "please" and other conversational filler from system prompts
After: 890 tokens. Same output quality on our evaluation suite (94.2% before, 93.8% after — within noise margin).
Why it matters: The system prompt is sent with every single request. At 40,000 requests/day, saving 1,210 tokens per request saved approximately $1,570/month.
Rule of thumb: Your production system prompt should be under 1,000 tokens. If it's longer, you're likely paying for verbosity.
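One way to enforce that rule of thumb is a quick prompt-length audit in CI. This is a minimal sketch using a rough ~4 characters-per-token heuristic for English text; the function names are illustrative, and you should use your provider's tokenizer for exact counts.

```python
# Rough audit of system-prompt length. The ~4 chars/token ratio is a
# heuristic for English text, not an exact tokenizer.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def audit_prompt(system_prompt: str, budget: int = 1000) -> dict:
    """Flag system prompts that exceed the token budget."""
    tokens = approx_tokens(system_prompt)
    return {
        "approx_tokens": tokens,
        "over_budget": tokens > budget,
        "excess": max(0, tokens - budget),
    }
```

Running this against every prompt file on each commit catches the slow creep back toward 2,000-token prompts.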
Technique 2: Response Caching (Saved 15%)
Many queries are repeated or near-identical. FAQ-style questions, common error lookups, and standard workflow questions hit the same underlying queries repeatedly.
What we did:
- Implemented exact-match caching with a 24-hour TTL
- Cache key = hash(system_prompt + user_message + model + temperature)
- Used Redis with a 10GB allocation
- Added cache-busting for time-sensitive queries (anything referencing "today," "current," etc.)
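The scheme above can be sketched as follows. This uses an in-memory dict for clarity; the production version used Redis with the same key derivation and TTL. Names and the cache-busting term list are illustrative.

```python
import hashlib
import json
import time

TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL
TIME_SENSITIVE = ("today", "current", "now", "latest")  # cache-busting terms

_cache: dict = {}  # key -> (stored_at, response); Redis in production

def cache_key(system_prompt: str, user_message: str,
              model: str, temperature: float) -> str:
    # Deterministic key over everything that affects the response
    payload = json.dumps([system_prompt, user_message, model, temperature])
    return hashlib.sha256(payload.encode()).hexdigest()

def get_cached(system_prompt, user_message, model, temperature):
    # Never serve time-sensitive queries from cache
    if any(term in user_message.lower() for term in TIME_SENSITIVE):
        return None
    entry = _cache.get(cache_key(system_prompt, user_message, model, temperature))
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]
    return None

def store(system_prompt, user_message, model, temperature, response):
    key = cache_key(system_prompt, user_message, model, temperature)
    _cache[key] = (time.time(), response)
```

Including the model and temperature in the key matters: it prevents a cached Haiku response from being served after a request is rerouted to Sonnet, and vice versa.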
Results: 32% cache hit rate. Each cache hit saves the full LLM API call (both input and output tokens) and returns in <5ms instead of 800ms.
Monthly savings: ~$1,305/month.
Note: We evaluated semantic caching (finding similar queries via embeddings) but decided the complexity and occasional mismatches weren't worth the incremental hit rate improvement for this use case.
Technique 3: Model Routing (Saved 12%)
Everything was running on Claude Sonnet. But not everything needed Sonnet's capabilities.
What we changed:
- Classification tasks (ticket categorization, sentiment analysis) → Claude Haiku. 90% cheaper, same accuracy for classification.
- Simple extraction (pulling structured data from templates) → Claude Haiku. No quality difference for well-structured inputs.
- Complex analysis (root cause analysis, detailed recommendations) → stayed on Claude Sonnet.
- Summarization of long documents → Claude Sonnet with reduced max_tokens.
Distribution after routing: 45% Haiku, 50% Sonnet, 5% cached. Previously: 95% Sonnet, 5% cached.
Monthly savings: ~$1,044/month.
Key insight: The hardest part isn't implementing routing — it's classifying which requests need which model. We used a simple rules-based classifier (endpoint + input length + feature flag). No ML needed for routing.
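A rules-based router along those lines might look like this. Endpoint names, the token threshold, and the model identifiers are illustrative, not the actual production values.

```python
# Placeholder model identifiers; substitute your provider's model names.
HAIKU = "claude-haiku"
SONNET = "claude-sonnet"

# Endpoints whose tasks Haiku handles with no measured quality loss
HAIKU_ENDPOINTS = {"/classify", "/sentiment", "/extract"}

def route(endpoint: str, input_tokens: int, force_sonnet: bool = False) -> str:
    """Pick a model from endpoint, input length, and a feature flag."""
    if force_sonnet:  # feature-flag escape hatch for A/B tests or rollback
        return SONNET
    if endpoint in HAIKU_ENDPOINTS and input_tokens < 2000:
        return HAIKU  # cheap model for classification/extraction
    return SONNET     # default to the more capable model
```

The feature flag is worth keeping: it lets you flip any endpoint back to the stronger model instantly if quality regressions show up in production.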
Technique 4: Context Window Optimization (Saved 8%)
The conversational AI feature was sending the entire conversation history with every request. A 20-message conversation meant 15K+ input tokens per request.
What we changed:
- Limited conversation history to the last 8 messages
- For conversations longer than 8 messages, added a 200-token summary of earlier context
- Truncated individual messages to 500 tokens max
- Removed system-generated metadata from conversation context (timestamps, status updates)
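The trimming rules above can be sketched like this. The `summarize` callable stands in for an LLM summarization call, the message shape is assumed to be `{"role", "content"}`, and truncation uses a crude ~4 chars/token heuristic in place of a real tokenizer.

```python
MAX_MESSAGES = 8
MAX_MESSAGE_TOKENS = 500

def truncate(text: str, max_tokens: int) -> str:
    # Crude ~4 chars/token heuristic; use a real tokenizer in production
    return text[: max_tokens * 4]

def trim_history(messages: list,
                 summarize=lambda msgs: "[summary of earlier context]") -> list:
    """Keep the last MAX_MESSAGES turns, truncated, plus a summary of the rest."""
    recent = [
        {"role": m["role"], "content": truncate(m["content"], MAX_MESSAGE_TOKENS)}
        for m in messages[-MAX_MESSAGES:]
    ]
    if len(messages) > MAX_MESSAGES:
        # Prepend a short summary of the turns that were dropped
        summary = summarize(messages[:-MAX_MESSAGES])
        recent.insert(0, {"role": "user", "content": summary})
    return recent
```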
After: Average input tokens dropped from 6,200 to 3,100 per conversational request.
Monthly savings: ~$696/month.
Quality impact: We ran our evaluation suite on 500 conversations. Quality score dropped from 4.21/5 to 4.15/5 — not statistically significant. Users didn't notice.
Technique 5: Batch Processing for Non-Real-Time Tasks (Saved 5%)
Document analysis, nightly report generation, and bulk classification were running as individual API calls. Batch processing APIs (available from both Anthropic and OpenAI) offer 50% discounts for async processing with 24-hour SLAs.
What we changed:
- Moved document analysis to batch processing (results within 4 hours instead of real-time)
- Moved nightly report generation to batch
- Kept all user-facing features on real-time APIs
Monthly savings: ~$435/month.
Trade-off: Document analysis results now take up to 4 hours instead of 30 seconds. For this use case, that was acceptable — users upload documents and check back later.
Technique 6: Output Token Limits and Structured Responses (Saved 3%)
The model was generating verbose responses. A yes/no classification would return a paragraph of explanation. A JSON extraction would include helpful-but-unnecessary commentary.
What we changed:
- Set `max_tokens` appropriate to each task (50 for classification, 200 for extraction, 500 for analysis)
- Used structured output mode to force JSON responses where applicable
- Added "Be concise" to system prompts for appropriate endpoints
Monthly savings: ~$261/month. Small but free — these changes took 30 minutes to implement.
Technique 7: Embedding Model Optimization (Saved 1.5%)
The RAG pipeline was using OpenAI's text-embedding-ada-002 for all embeddings. We switched to text-embedding-3-small for non-critical embeddings (logging, analytics) and to text-embedding-3-large for retrieval-critical paths.
Monthly savings: ~$130/month.
Technique 8: Provider Negotiation and Commitment Discounts (Saved 0.5%)
At their spend level ($8,700/month), they qualified for a usage commitment discount from Anthropic. A 12-month commitment at their projected volume secured a small but meaningful discount.
Monthly savings: ~$44/month.
Note: This only makes sense at significant scale ($5K+/month). Below that, the negotiation overhead isn't worth it.
Implementation Priority: Highest ROI First
Here's the priority order we recommend:
| Priority | Technique | Effort | Savings |
|----------|-----------|--------|---------|
| 1 | Prompt compression | 2 hours | 18% |
| 2 | Response caching | 1-2 days | 15% |
| 3 | Model routing | 2-3 days | 12% |
| 4 | Context optimization | 1 day | 8% |
| 5 | Batch processing | 1 day | 5% |
| 6 | Output limits | 30 min | 3% |
| 7 | Embedding optimization | 2 hours | 1.5% |
| 8 | Provider negotiation | 1 week | 0.5% |
Start with #1 and #6 — they take under 3 hours combined and save 21%. Then implement #2 and #3 in the first week. The rest can be done in week 2.
Conclusion
LLM cost optimization isn't a one-time effort. We recommend a monthly cost review where you check:
- Cost per request trending (up or down?)
- Cache hit rate (stable or declining?)
- Model routing distribution (any drift?)
- Top 10 most expensive users (abuse or legitimate?)
The $8,700 → $3,200 reduction we achieved was not the floor. With more aggressive caching and potential fine-tuning of a smaller model for classification, this system could likely reach $2,000/month. But the 63% reduction was achieved in 2 weeks of engineering time — and the ROI on that investment is clear.