Architecture Overview: The Production RAG Pipeline
This guide walks through building a complete RAG pipeline that's production-ready — not a tutorial that works in a notebook but breaks under real traffic. We'll cover each component with the specific decisions that matter for SaaS applications.
The pipeline has four major components:
```
Ingestion Pipeline → Vector Store ← Retrieval API → Generation Layer → Response
  (async, batch)      (pgvector)   (sync, real-time)    (Claude/GPT)   (streaming)
```

Ingestion runs asynchronously — documents are processed and indexed in the background. Retrieval and generation run synchronously in the request path, with strict latency budgets.
Document Processing: Handling Real-World Formats
Production documents aren't clean Markdown files. They're PDFs with headers and footers, HTML with navigation chrome, Word documents with formatting artifacts, and API responses with boilerplate.
PDF Processing
Use pdf-parse for text-based PDFs and Tesseract or Amazon Textract for scanned documents. Key decisions:
- Header/footer removal: PDFs repeat headers and footers on every page. Detect and remove them by comparing content across page boundaries.
- Table extraction: Standard text extraction linearizes tables into nonsense. Use dedicated table extraction (Textract, or Claude's vision capabilities for complex tables).
- Layout analysis: Multi-column PDFs need column detection before text extraction. Process left-to-right, top-to-bottom within each column.
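The header/footer heuristic can be sketched as a frequency count over page edges. A minimal sketch, with an illustrative function name and starting-point thresholds; digits are normalized so "Page 1" and "Page 2" count as the same line:

```typescript
// Strip lines that repeat at the top or bottom of most pages (headers,
// footers, page numbers). Thresholds are illustrative starting points.
function stripRepeatedEdges(
  pages: string[][], // each page as an array of text lines
  edge = 3,          // how many lines at each page edge to consider
  threshold = 0.6    // fraction of pages a line must appear on
): string[][] {
  const norm = (line: string) => line.trim().replace(/\d+/g, "#");
  const counts = new Map<string, number>();
  for (const page of pages) {
    const edges = new Set([...page.slice(0, edge), ...page.slice(-edge)].map(norm));
    for (const key of edges) counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const boilerplate = new Set(
    [...counts].filter(([, n]) => n >= pages.length * threshold).map(([k]) => k)
  );
  return pages.map(page =>
    page.filter((line, i) =>
      !(boilerplate.has(norm(line)) && (i < edge || i >= page.length - edge))
    )
  );
}
```

Tune `edge` and `threshold` against a sample of your real PDFs; watermarks and legal boilerplate often need document-specific rules on top.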
HTML Processing
Strip navigation, sidebars, footers, and scripts. Keep the main content area. Mozilla's Readability library is the best tool for this — it's what Firefox Reader Mode uses.
Preserve heading hierarchy (h1 → h6) as metadata. This helps chunking respect document structure.
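Tracking the hierarchy is a small stack exercise. A minimal sketch, assuming headings have already been extracted as (level, text) pairs; the names here are illustrative:

```typescript
interface HeadingEvent { level: number; text: string } // 1 = h1 … 6 = h6

// Walk headings in document order, maintaining the path from the h1 down
// to the current heading. Each heading is paired with its full ancestry,
// which later becomes the chunk-level headingHierarchy metadata.
function buildHeadingPaths(headings: HeadingEvent[]): string[][] {
  const stack: HeadingEvent[] = [];
  const paths: string[][] = [];
  for (const h of headings) {
    // Pop anything at the same or deeper level, then push the new heading.
    while (stack.length > 0 && stack[stack.length - 1].level >= h.level) {
      stack.pop();
    }
    stack.push(h);
    paths.push(stack.map(s => s.text));
  }
  return paths;
}
```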
Metadata Extraction
For every document, extract and store:
```typescript
interface DocumentMetadata {
  source: string;       // URL or file path
  title: string;        // Document title
  section: string;      // Section/chapter name
  contentType: string;  // "api-docs", "faq", "tutorial", etc.
  lastModified: string; // For freshness filtering
  hash: string;         // For change detection
}
```

This metadata enables filtered retrieval ("search only API docs") and freshness management ("prefer recent content").
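The hash field is what makes re-ingestion cheap: skip any document whose content hash hasn't changed since the last indexing run. A minimal sketch using Node's built-in crypto module (the helper names are illustrative):

```typescript
import { createHash } from "node:crypto";

// Content hash for change detection: re-ingest a document only when its
// hash differs from the one stored at last indexing time.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

function needsReindex(text: string, storedHash: string | undefined): boolean {
  return contentHash(text) !== storedHash;
}
```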
Chunking: Recursive Strategy With Metadata Inheritance
After document processing, split content into chunks that are small enough for precise retrieval but large enough to contain meaningful context.
The Recursive Strategy
```typescript
// Recursive splitting: try coarse boundaries first, fall back to finer ones
// only when a piece is still too large. Token counts are estimated at
// ~4 characters per token; swap in a real tokenizer for production.
const MAX_CHARS = 600 * 4;

function splitRecursive(text: string, separators: RegExp[]): string[] {
  if (text.length <= MAX_CHARS || separators.length === 0) return [text];
  const [sep, ...rest] = separators;
  return text.split(sep).flatMap(part => splitRecursive(part, rest));
}

function chunkDocument(text: string, metadata: DocumentMetadata) {
  // Split on section boundaries first (## headings), then on paragraph
  // boundaries (double newline), then on sentence boundaries (period + space).
  // Target: 400-600 tokens per chunk, 50 token overlap (overlap elided here).
  const parts = splitRecursive(text.trim(), [/\n(?=## )/, /\n{2,}/, /(?<=\.) /]);
  return parts.map((content, chunkIndex) => ({ content, chunkIndex, metadata }));
}
```

Metadata Inheritance
Each chunk inherits metadata from its parent document plus chunk-specific metadata:
```typescript
interface Chunk {
  id: string;
  content: string;
  embedding: number[];
  metadata: DocumentMetadata & {
    chunkIndex: number;
    headingHierarchy: string[]; // ["Authentication", "OAuth 2.0", "Token Refresh"]
    previousChunkId: string | null;
    nextChunkId: string | null;
  };
}
```

The heading hierarchy is critical — it provides context that the chunk text alone doesn't have. A chunk about "token refresh" makes much more sense when you know it's under "Authentication > OAuth 2.0."
Parent-Child Chunking
For complex documents, index small chunks (256 tokens) for precise retrieval but store a reference to the parent chunk (1024 tokens). When a small chunk is retrieved, return the parent chunk to the LLM for more context.
```
Document Section (2048 tokens)
├── Parent Chunk 1 (1024 tokens)
│   ├── Child Chunk 1a (256 tokens) ← indexed for retrieval
│   ├── Child Chunk 1b (256 tokens) ← indexed for retrieval
│   ├── Child Chunk 1c (256 tokens) ← indexed for retrieval
│   └── Child Chunk 1d (256 tokens) ← indexed for retrieval
└── Parent Chunk 2 (1024 tokens)
    ├── Child Chunk 2a (256 tokens)
    └── ...
```

When child chunk 1b is retrieved, the LLM receives parent chunk 1 (all 1024 tokens). This gives precise retrieval with rich context.
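Mechanically, this is an id mapping at index time and a dedup at query time: several children of the same parent may be retrieved together, and the parent should be sent to the LLM only once. A minimal sketch (the `parentId` field is an assumption about how you store the link):

```typescript
interface ChildHit { id: string; parentId: string; score: number }

// Map retrieved child chunks to their parents, deduplicating when several
// children share a parent; keep the best child score for ordering.
function expandToParents(hits: ChildHit[]): Array<{ parentId: string; score: number }> {
  const best = new Map<string, number>();
  for (const hit of hits) {
    best.set(hit.parentId, Math.max(best.get(hit.parentId) ?? -Infinity, hit.score));
  }
  return [...best.entries()]
    .map(([parentId, score]) => ({ parentId, score }))
    .sort((a, b) => b.score - a.score);
}
```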
Embedding: Model Selection and Batching
Model Choice
For most SaaS RAG systems, OpenAI text-embedding-3-small offers the best cost/quality ratio. It's $0.02 per 1M tokens and produces 1536-dimensional vectors.
If retrieval quality is your top priority (e.g., medical or legal applications), use text-embedding-3-large (3072 dimensions, $0.13 per 1M tokens).
Batching Strategy
Embed chunks in batches of 100-500 for optimal throughput:
```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function embedBatch(chunks: string[]): Promise<number[][]> {
  const BATCH_SIZE = 200;
  const results: number[][] = [];
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch,
    });
    results.push(...response.data.map(d => d.embedding));
  }
  return results;
}
```

Query Embedding
At query time, embed the user's question with the same model. This is a single API call with sub-100ms latency.
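Repeated queries are common in support-style products, so caching query embeddings is an easy win. A minimal in-memory sketch; `embedFn` stands in for the OpenAI call above, and the eviction policy is deliberately simplistic:

```typescript
type EmbedFn = (text: string) => Promise<number[]>;

// Wrap the embedding call with a bounded in-memory cache keyed on the
// normalized query text. Repeated queries skip the API round trip.
function cachedEmbedder(embedFn: EmbedFn, maxEntries = 10_000): EmbedFn {
  const cache = new Map<string, number[]>();
  return async (text: string) => {
    const key = text.trim().toLowerCase();
    const hit = cache.get(key);
    if (hit) return hit;
    const embedding = await embedFn(text);
    if (cache.size >= maxEntries) {
      cache.delete(cache.keys().next().value!); // evict the oldest entry
    }
    cache.set(key, embedding);
    return embedding;
  };
}
```

In production you would typically back this with Redis so the cache survives restarts and is shared across instances.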
Vector Database: pgvector Setup and Optimization
For SaaS products already using PostgreSQL, pgvector is the pragmatic choice. No additional infrastructure to manage — it's a Postgres extension.
Schema
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  embedding vector(1536) NOT NULL,
  metadata JSONB NOT NULL,
  document_id UUID REFERENCES documents(id),
  created_at TIMESTAMPTZ DEFAULT now()
);

-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- GIN index for metadata filtering
CREATE INDEX ON chunks USING gin (metadata);

-- Full-text search index for BM25
ALTER TABLE chunks ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX ON chunks USING gin (tsv);
```

Performance Tuning
- HNSW parameters: `m = 16` and `ef_construction = 64` give good recall (95%+) with fast queries. Increase `ef_construction` to 128 for higher recall at the cost of slower index builds.
- Query parameters: Set `hnsw.ef_search = 40` for a good recall/speed trade-off at query time.
- Memory: the pgvector index needs to fit in RAM for best performance. 1M vectors × 1536 dimensions × 4 bytes ≈ 6 GB. Ensure your Postgres instance has sufficient `shared_buffers`.
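The same arithmetic generalizes to other corpus sizes; a tiny helper (illustrative only) keeps the capacity estimate explicit. The HNSW graph adds overhead on top of the raw vectors, roughly proportional to `m`, so treat this as a lower bound:

```typescript
// Lower bound on RAM for raw float32 vectors: count × dimensions × 4 bytes.
function vectorMemoryGiB(count: number, dims: number): number {
  return (count * dims * 4) / 1024 ** 3;
}
```

For 1M vectors at 1536 dimensions this gives roughly 5.7 GiB, in line with the ~6 GB figure above.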
Retrieval: Hybrid Search Implementation
The Hybrid Pipeline
```typescript
async function retrieve(query: string, filters?: Record<string, string>, topK = 5) {
  // 1. Embed the query
  const queryEmbedding = await embed(query);

  // 2. Vector search (top 20)
  const vectorResults = await db.query(`
    SELECT id, content, metadata,
           1 - (embedding <=> $1) AS score
    FROM chunks
    WHERE ($2::jsonb IS NULL OR metadata @> $2::jsonb)
    ORDER BY embedding <=> $1
    LIMIT 20
  `, [queryEmbedding, filters ? JSON.stringify(filters) : null]);

  // 3. BM25 search (top 20)
  const bm25Results = await db.query(`
    SELECT id, content, metadata,
           ts_rank(tsv, plainto_tsquery('english', $1)) AS score
    FROM chunks
    WHERE tsv @@ plainto_tsquery('english', $1)
      AND ($2::jsonb IS NULL OR metadata @> $2::jsonb)
    ORDER BY score DESC
    LIMIT 20
  `, [query, filters ? JSON.stringify(filters) : null]);

  // 4. Reciprocal Rank Fusion
  const fused = reciprocalRankFusion(vectorResults, bm25Results);

  // 5. Return top K
  return fused.slice(0, topK);
}
```

Reciprocal Rank Fusion
RRF combines rankings from multiple sources without needing to normalize scores:
```typescript
function reciprocalRankFusion<T extends { id: string }>(
  ...resultSets: T[][]
): Array<T & { score: number }> {
  const K = 60; // constant, standard value
  const scores = new Map<string, number>();
  const items = new Map<string, T>();
  for (const results of resultSets) {
    results.forEach((result, rank) => {
      scores.set(result.id, (scores.get(result.id) ?? 0) + 1 / (K + rank + 1));
      if (!items.has(result.id)) items.set(result.id, result);
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ ...items.get(id)!, score }))
    .sort((a, b) => b.score - a.score);
}
```

Keeping the function generic lets the content and metadata from the original result rows survive fusion, so the final top-K still carries everything the generation step needs.

Generation: Prompt Construction and Streaming
Context Assembly
```typescript
function buildPrompt(query: string, chunks: Chunk[]): string {
  const context = chunks
    .map((chunk, i) =>
      `[Source ${i + 1}: ${chunk.metadata.title} > ${chunk.metadata.headingHierarchy.join(" > ")}]\n${chunk.content}`
    )
    .join("\n\n---\n\n");

  return `Answer the following question using only the provided context. Include source references [1], [2], etc. If the context doesn't contain enough information, say so.

Context:
${context}

Question: ${query}`;
}
```

Streaming Response
Always stream RAG responses. Users expect immediate feedback, and a complete RAG answer often takes 2-5 seconds to generate:
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function* generateResponse(query: string, chunks: Chunk[]) {
  const prompt = buildPrompt(query, chunks);
  const stream = anthropic.messages.stream({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: "You are a helpful assistant. Answer questions accurately based on the provided context.",
    messages: [{ role: "user", content: prompt }],
  });

  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      yield event.delta.text;
    }
  }
}
```

Production Hardening
Error Handling
- Embedding API failure: Queue for retry, return "processing" status to user.
- Vector search timeout: Fall back to BM25-only search.
- LLM timeout (>5s): Return retrieved chunks with a "generating..." message, then update via WebSocket.
- Empty retrieval: Return "I don't have information about that" instead of hallucinating.
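The vector-timeout fallback can be sketched as a race between the primary search and a timer, with BM25 as the backup path. A minimal sketch; the two search functions stand in for the queries shown earlier, and the timeout value is illustrative:

```typescript
type Search = (query: string) => Promise<string[]>;

// Run the primary (vector) search with a deadline; if it doesn't finish
// in time, fall back to the BM25-only path instead of failing the request.
async function searchWithFallback(
  vector: Search,
  bm25: Search,
  query: string,
  timeoutMs = 200
): Promise<string[]> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("vector search timed out")), timeoutMs)
  );
  try {
    return await Promise.race([vector(query), timeout]);
  } catch {
    return bm25(query);
  }
}
```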
Monitoring
Track these RAG-specific metrics:
- Retrieval latency (p95 < 100ms)
- Generation latency (p95 < 3s)
- Empty retrieval rate (target < 5%)
- User feedback on RAG responses
- Cache hit rate for repeated queries
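Latency percentiles are straightforward to compute from a rolling sample. A minimal sketch using nearest-rank percentiles; a real deployment would use your metrics library's histogram instead:

```typescript
// Nearest-rank percentile over recorded samples, e.g. p95 retrieval latency.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```

Note that averages hide tail latency: one slow vector search per hundred requests barely moves the mean but shows up immediately in p95.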
Scaling
For most SaaS products (< 500K chunks), a single Postgres instance with pgvector handles production traffic. Beyond that, consider:
- Read replicas for search queries
- Pinecone or Weaviate for managed scaling
- Separate embedding computation from serving
Conclusion
A production RAG pipeline comes down to five stages that each need to work well: document processing, chunking, embedding, retrieval, and generation. Start with the simple version of each — recursive chunking, pgvector, hybrid search, Claude Sonnet — and optimize based on evaluation metrics. The infrastructure described here handles 90% of SaaS RAG use cases.