
AI Integration Architecture: 4 Patterns That Scale in Production

Stop building AI features that break at scale. Learn the four architecture patterns used by production SaaS companies to integrate AI reliably — with code examples, trade-offs, and real failure modes.

14 min read · Updated Mar 11, 2026

Why Architecture Matters More Than Model Selection

Every week, a CTO asks me: "Should we use Claude or GPT-4o?" Every week, I give the same answer: it doesn't matter nearly as much as your integration architecture.

The model is a commodity. Providers release new models every quarter. Prices drop 50% annually. But your architecture? That's the foundation everything else builds on. Get it wrong, and switching models won't save you.

This guide covers the four architecture patterns that work at scale for AI-powered SaaS products, with the specific trade-offs that matter in production.

Pattern 1: The API Gateway Pattern

The API Gateway pattern centralizes all AI calls through a single gateway service that handles routing, rate limiting, caching, fallbacks, and observability.

Architecture

Application Services → AI Gateway → [Claude API, GPT-4o API, Local Model]
                         ↓
                    [Cache Layer]
                         ↓
                    [Observability]

How It Works

Every AI request in your system flows through the gateway. The gateway:

  • Validates the request (token budget, rate limits, permissions)
  • Checks the cache (exact match or semantic similarity)
  • Routes to the appropriate model (based on task type, cost, or latency requirements)
  • Handles errors (retry, fallback to secondary model, return cached response)
  • Logs everything (latency, tokens, cost, model, quality signals)
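The flow above can be sketched as a single gateway class. This is a minimal illustration, not a production implementation: `call_primary` and `call_fallback` are hypothetical stand-ins for real provider SDK calls, the character limit stands in for a real token budget, and the in-memory dict stands in for a shared cache.

```python
import hashlib
import time

# Hypothetical provider calls; in production these wrap the actual SDKs.
def call_primary(prompt: str) -> str:
    return f"primary:{prompt}"

def call_fallback(prompt: str) -> str:
    return f"fallback:{prompt}"

class AIGateway:
    """Minimal sketch of the gateway flow: validate -> cache -> route -> log."""

    def __init__(self, max_prompt_chars: int = 10_000):
        self.cache: dict[str, str] = {}
        self.max_prompt_chars = max_prompt_chars
        self.log: list[dict] = []

    def _cache_key(self, model: str, prompt: str) -> str:
        # Exact-match caching: same model + same input -> same key
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def complete(self, prompt: str, model: str = "primary") -> str:
        # 1. Validate (character limit as a stand-in for a token budget)
        if len(prompt) > self.max_prompt_chars:
            raise ValueError("prompt exceeds budget")

        # 2. Check the exact-match cache
        key = self._cache_key(model, prompt)
        if key in self.cache:
            return self.cache[key]

        # 3. Route, falling back to the secondary model on provider error
        start = time.monotonic()
        try:
            result = call_primary(prompt)
        except Exception:
            result = call_fallback(prompt)

        # 4. Log metadata only (never full content without redaction)
        self.log.append({"model": model,
                         "latency_s": time.monotonic() - start,
                         "prompt_chars": len(prompt)})
        self.cache[key] = result
        return result
```

Because every service calls `complete()` instead of a provider SDK directly, swapping models is a gateway-level change with no application code touched.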

When to Use

  • You have multiple services making AI calls
  • You need centralized cost control and monitoring
  • You want to switch models without changing application code
  • You need multi-model routing (different models for different tasks)

Implementation Decisions

Sync vs Async: The gateway should support both. Synchronous for real-time features (chat, search). Asynchronous for batch processing (document analysis, content generation).

Cache Strategy: Start with exact-match caching (same input → same output). Add semantic caching later if you need it — embedding-based similarity matching adds complexity.

Timeout Strategy: Set aggressive timeouts (5 seconds for real-time, 30 seconds for batch). The gateway handles timeouts by falling back to secondary models or cached responses.
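The timeout-plus-fallback behavior can be sketched with `asyncio.wait_for`. The numbers here are scaled down so the example runs instantly; in production you would use the budgets above (5 s real-time, 30 s batch), and `cached_response` would hit your cache layer rather than fabricate a string.

```python
import asyncio

# Scaled-down budget for the demo; use 5.0 for real-time traffic.
TIMEOUT_S = 0.05

async def slow_model_call(prompt: str) -> str:
    # Stands in for a provider call that exceeds the latency budget
    await asyncio.sleep(0.2)
    return f"primary:{prompt}"

async def cached_response(prompt: str) -> str:
    # Stands in for a cache-layer lookup
    return f"cached:{prompt}"

async def complete_with_timeout(prompt: str) -> str:
    """Enforce the gateway timeout, degrading to a cached response."""
    try:
        return await asyncio.wait_for(slow_model_call(prompt), timeout=TIMEOUT_S)
    except asyncio.TimeoutError:
        return await cached_response(prompt)
```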

Pattern 2: The Event-Driven Pipeline

The event-driven pipeline processes AI work asynchronously using a message queue. Events trigger AI processing, and results are delivered via webhooks, WebSocket, or polling.

Architecture

Application → Event Bus → AI Worker Pool → Result Store → Notification
               (SQS/Kafka)  (Auto-scaling)    (Redis/DB)    (WebSocket)

When to Use

  • AI processing isn't in the critical path (users don't wait for results)
  • You need to process large volumes (document analysis, batch classification)
  • AI workloads are bursty and need auto-scaling
  • You want to decouple AI processing from application logic

Key Design Decisions

Queue Selection: SQS for simplicity. Kafka for ordering guarantees and replay. Redis Streams for low-latency internal queues.

Worker Scaling: Scale workers based on queue depth, not CPU. AI workers are I/O-bound (waiting for LLM API responses), so they can handle many concurrent requests per instance.

Dead Letter Queue: Essential. When an AI task fails after retries, move it to a DLQ for investigation. Common failures: context too long, content filtered, provider outage.

Idempotency: Design workers to be idempotent. The same message processed twice should produce the same result. Use request IDs and result deduplication.
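The idempotency guarantee can be sketched with request-ID deduplication. The in-memory dict is a stand-in for Redis or a database table with a unique constraint on the request ID; `fake_ai_call` is a placeholder for the real LLM invocation.

```python
class IdempotentWorker:
    """Sketch: dedupe by request ID so a redelivered message is a no-op."""

    def __init__(self):
        self.results: dict[str, str] = {}  # stand-in for Redis/DB
        self.calls = 0  # counts actual AI invocations, for illustration

    def fake_ai_call(self, payload: str) -> str:
        self.calls += 1
        return payload.upper()

    def handle(self, request_id: str, payload: str) -> str:
        # Already processed? Return the stored result without reprocessing.
        if request_id in self.results:
            return self.results[request_id]
        result = self.fake_ai_call(payload)
        self.results[request_id] = result
        return result
```

This matters because queues like SQS deliver at-least-once: the same message can and will arrive twice.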

Pattern 3: The Agent Orchestrator

The agent orchestrator pattern uses an AI model as the coordination layer, deciding which tools to call, in what order, and how to combine results.

Architecture

User Input → Orchestrator Agent → [Tool Registry]
                    ↓                    ↓
              Planning Step     [DB Query, API Call, Search,
                    ↓            Calculation, File Read]
              Execution Step
                    ↓
              Response Generation

When to Use

  • Complex, multi-step tasks that require reasoning
  • The sequence of operations isn't known in advance
  • Building copilots, assistants, or workflow automation
  • The AI needs to interact with multiple data sources

Critical Design Decisions

Tool Design: Each tool should have a clear, single responsibility. The tool description is what the model reads to decide when to use it — make descriptions precise.

Planning vs ReAct: For simple 2-3 step tasks, use ReAct (reason-act-observe loops). For complex tasks requiring 5+ steps, consider upfront planning where the model creates a plan before executing.

Max Iterations: Always set a hard limit on iterations (typically 5-10). Without this, agents can loop indefinitely. Budget 15-30 seconds for complex agent tasks.

Guardrails: Implement permission checks at the tool level, not the orchestrator level. Each tool validates that the requested action is allowed for the current user.
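A skeletal execution loop illustrating the tool registry and the hard iteration cap might look like the following. The tools and the precomputed plan are hypothetical; a real orchestrator would ask the model for the next (tool, argument) step on each pass instead of following a fixed list.

```python
from typing import Callable

# Hypothetical tool registry; in a real system each entry also carries the
# description the model reads when deciding which tool to call.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for {q}",
    "calculate": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}

def run_agent(plan: list[tuple[str, str]], max_iterations: int = 5) -> list[str]:
    """Execute (tool, argument) steps, hard-capped at max_iterations."""
    observations: list[str] = []
    for step, (tool_name, arg) in enumerate(plan):
        if step >= max_iterations:  # hard limit prevents runaway loops
            break
        tool = TOOLS.get(tool_name)
        if tool is None:
            observations.append(f"unknown tool: {tool_name}")
            continue
        observations.append(tool(arg))
    return observations
```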

Pattern 4: The Hybrid Retrieval Pattern

The hybrid retrieval pattern combines vector search, keyword search, and AI-powered generation for knowledge-intensive applications (RAG systems, search, Q&A).

Architecture

Query → Query Processing → [Vector Search + BM25 Search]
                                    ↓
                              Reciprocal Rank Fusion
                                    ↓
                              Cross-Encoder Re-ranking
                                    ↓
                              LLM Generation (with context)
                                    ↓
                              Streaming Response

When to Use

  • Knowledge base or documentation search
  • Customer support automation with domain knowledge
  • Any feature that needs to ground LLM responses in your data
  • Product search with natural language queries

Design Decisions

Vector vs Hybrid: Always go hybrid. Pure vector search misses exact matches (product names, error codes, IDs). BM25 catches what vectors miss. Reciprocal rank fusion combines scores.
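Reciprocal rank fusion itself is only a few lines: each document scores the sum of 1/(k + rank) across the result lists it appears in, so documents ranked well by both vector search and BM25 rise to the top. The constant k=60 comes from the original RRF paper.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Note that RRF uses only ranks, not raw scores, which is exactly why it works across retrievers whose scores live on incompatible scales.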

Chunk Size: Start with 512 tokens with 50-token overlap. Too small = lost context. Too large = diluted relevance. Tune based on your evaluation metrics.

Re-ranking: A cross-encoder re-ranker (like Cohere Rerank or a local model) typically improves precision by 15-25%. It's worth the extra 100-200ms latency.

Top-K: Retrieve 20-50 chunks, re-rank to top 5-10, then pass to the LLM. More context isn't always better — it increases cost and can confuse the model.

Anti-Patterns That Will Cost You in Production

The Monolith AI Call: Stuffing everything into one giant prompt. Break complex tasks into multiple focused calls. Each call should have a single responsibility.

No Fallback: Relying on a single model provider with no fallback. When OpenAI goes down (it will), your feature goes down. Always have a fallback chain.
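A fallback chain is simple to sketch: try providers in preference order, and only fail when the entire chain is exhausted. The three functions below are stand-ins for real provider clients and a cache lookup.

```python
def call_openai(prompt: str) -> str:
    raise ConnectionError("simulated provider outage")

def call_anthropic(prompt: str) -> str:
    return f"anthropic:{prompt}"

def call_cached(prompt: str) -> str:
    return f"cached:{prompt}"

# Order encodes preference: primary, secondary, then degraded cached mode.
PROVIDERS = [call_openai, call_anthropic, call_cached]

def complete_with_fallback(prompt: str) -> str:
    """Try each provider in order; raise only if the whole chain fails."""
    last_error: Exception | None = None
    for provider in PROVIDERS:
        try:
            return provider(prompt)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```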

Optimistic Timeouts: Setting 30-second timeouts on user-facing AI calls. Users leave after 5 seconds. Set aggressive timeouts and fall back to cached responses.

Logging Everything: Logging full prompts and responses in production without PII redaction. You'll either run out of storage or violate privacy regulations. Log metadata, sample full content.

No Cost Tracking: Launching without per-request cost tracking. By the time you notice the bill, you've already overspent. Track cost per request from day one.
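Per-request cost tracking reduces to arithmetic on the token counts providers return with each response. The prices below are illustrative placeholders, not real rates: actual per-million-token pricing varies by model and changes frequently.

```python
# Illustrative per-million-token USD prices; real rates differ and change often.
PRICES = {"model-a": {"input": 3.00, "output": 15.00}}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request from the token usage the provider reports."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Emitting this number as a metric tag on every request is what makes "which feature is burning the budget" answerable on day one.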

Conclusion

Choose your architecture pattern based on your primary constraint:

  • Latency-sensitive, multi-service: API Gateway Pattern
  • High-volume, async processing: Event-Driven Pipeline
  • Complex multi-step reasoning: Agent Orchestrator
  • Knowledge-grounded responses: Hybrid Retrieval Pattern

Most production systems combine 2-3 patterns. Start with the one that solves your most critical use case, then layer in additional patterns as your AI capabilities expand.

About the Author

Written by Rafael Danieli, founder of StoAI. Systems engineer specializing in production AI for SaaS companies. Background in distributed systems, reliability engineering, and integration architecture.