RAG in production: a pragmatic architecture with pgvector, Claude and Symfony
RAG: promise vs reality
Retrieval-Augmented Generation: we retrieve the relevant passages from a corpus, hand them to an LLM, get a natural-language answer with sources. On paper, magical. In production, a serious engineering project.
The first RAG demos built in 2023 with LangChain and Pinecone stayed as POCs in 80% of the teams we audited in 2025. The reasons repeat: bad chunking, generic embeddings on a business corpus, no reranking, shallow prompts, spiraling costs, zero quantitative evaluation.
This article presents the architecture we have been shipping to production for two years: Symfony up front, PostgreSQL + pgvector for storage, Claude 4.5 Sonnet for generation, OpenAI embeddings, Cohere reranking. Code, SQL schema, real costs.
Target architecture
The pragmatic RAG pipeline breaks into seven steps.
┌─────────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐
│ Documents │────▶│ Chunking │────▶│ Embedding │────▶│ Storage │
│ (PDF, MD, │ │ smart │ │ OpenAI │ │ pgvector │
│ HTML) │ │ │ │ │ │ + meta │
└─────────────┘ └──────────┘ └───────────┘ └──────────┘
│
▼
┌─────────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐
│ Answer │◀────│ Claude │◀────│ Reranking │◀────│ Retrieval│
│ + sources │ │ 4.5 │ │ Cohere │ │ HNSW │
└─────────────┘ └──────────┘ └───────────┘ └──────────┘
- Ingestion: extract text from source formats.
- Chunking: split into contextually coherent passages.
- Embedding: transform into dense vectors.
- Storage: persist into PostgreSQL + pgvector.
- Retrieval: pull top-k candidates by cosine similarity.
- Reranking: re-sort candidates with a more precise model.
- Generation: send final context to Claude, compose the prompt.
pgvector rather than Pinecone, Qdrant, Weaviate
The vector DB choice is often oversized. pgvector 0.8 (released late 2024) covers most needs with unbeatable operational simplicity.
| Criterion | pgvector 0.8 | Pinecone | Qdrant | Weaviate |
|---|---|---|---|---|
| Volume up to | 5M 1536-dim vectors on 1 instance 16 GB | Unlimited | Unlimited | Unlimited |
| p95 HNSW latency | 20 to 80 ms | 30 to 100 ms | 10 to 50 ms | 40 to 150 ms |
| ACID transactions with business data | Yes, native | No | No | No |
| Unified backup with rest of DB | Yes | No | Separate | Separate |
| Standard SQL | Yes | Proprietary API | Proprietary API | GraphQL / REST |
| Monthly cost (5M vectors) | Existing PG cost | 70 to 200 USD | 50 to 150 USD | 80 USD |
| Learning curve | None if team knows PG | Moderate | Moderate | Steep |
Our rule: pgvector by default, switch to Qdrant if we exceed 50M vectors or if pre-query filtering becomes complex (hybrid dense + structured queries). Pinecone is still relevant for a team that wants zero ops.
PostgreSQL schema
The schema we deploy on most Symfony RAGs. PostgreSQL 16 or 17, pgvector 0.8 extension enabled.
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Source document table
CREATE TABLE document (
id BIGSERIAL PRIMARY KEY,
tenant_id BIGINT NOT NULL,
external_id TEXT NOT NULL,
title TEXT NOT NULL,
source_type TEXT NOT NULL, -- 'pdf', 'md', 'confluence', etc.
source_url TEXT,
checksum TEXT NOT NULL, -- sha256 of source file
raw_bytes INTEGER NOT NULL,
language TEXT NOT NULL DEFAULT 'en',
ingested_at TIMESTAMPTZ NOT NULL DEFAULT now(),
deleted_at TIMESTAMPTZ,
UNIQUE (tenant_id, external_id)
);
CREATE INDEX idx_document_tenant ON document (tenant_id) WHERE deleted_at IS NULL;
-- Chunk table with embedding
CREATE TABLE document_chunk (
id BIGSERIAL PRIMARY KEY,
document_id BIGINT NOT NULL REFERENCES document(id) ON DELETE CASCADE,
tenant_id BIGINT NOT NULL,
chunk_index INTEGER NOT NULL,
content TEXT NOT NULL,
content_tokens INTEGER NOT NULL,
heading_path TEXT, -- 'Section 2 > Subsection A'
embedding vector(3072) NOT NULL, -- text-embedding-3-large = 3072
embedding_model TEXT NOT NULL, -- for future swap
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (document_id, chunk_index)
);
-- HNSW index: balance speed / accuracy / memory
CREATE INDEX idx_chunk_embedding_hnsw
ON document_chunk
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Index for tenant filter (critical in multi-tenant)
CREATE INDEX idx_chunk_tenant ON document_chunk (tenant_id);
-- Trace table for observability and evaluation
CREATE TABLE rag_query (
id BIGSERIAL PRIMARY KEY,
tenant_id BIGINT NOT NULL,
user_id BIGINT,
query_text TEXT NOT NULL,
retrieved_ids BIGINT[] NOT NULL,
reranked_ids BIGINT[] NOT NULL,
final_prompt TEXT NOT NULL,
answer TEXT NOT NULL,
model TEXT NOT NULL,
prompt_tokens INTEGER NOT NULL,
completion_tokens INTEGER NOT NULL,
cost_cents INTEGER NOT NULL,
latency_ms INTEGER NOT NULL,
rating SMALLINT, -- user feedback 1..5
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_rag_query_tenant_date ON rag_query (tenant_id, created_at DESC);
The m = 16 HNSW parameter gives a good trade-off on most corpora. Raise to m = 32 if precision is insufficient (cost: +50% memory). ef_construction = 64 suffices for corpora under 5 million vectors.
Important: the HNSW index is built in memory. For a table with 5 million 3072-dim vectors, budget about 60 GB of RAM. Size the PostgreSQL instance accordingly, or use a smaller dimension (text-embedding-3-small = 1536, -80% RAM and -60% cost).
Symfony embedding service
The embedding client is an injectable dependency that isolates the provider. The same project can swap OpenAI, Voyage AI, Cohere or a self-hosted model without touching business logic.
<?php
declare(strict_types=1);
namespace App\Rag\Embedding;
use Symfony\Contracts\HttpClient\HttpClientInterface;
interface EmbeddingClient
{
/**
* @param list<string> $texts
* @return list<list<float>>
*/
public function embedBatch(array $texts): array;
public function dimensions(): int;
public function modelId(): string;
}
final class OpenAIEmbeddingClient implements EmbeddingClient
{
public function __construct(
private readonly HttpClientInterface $client,
#[\SensitiveParameter] private readonly string $apiKey,
private readonly string $model = 'text-embedding-3-large',
private readonly int $dimensions = 3072,
) {}
public function embedBatch(array $texts): array
{
if ($texts === []) {
return [];
}
if (count($texts) > 2048) {
throw new \InvalidArgumentException('OpenAI embeds max 2048 inputs per call.');
}
$response = $this->client->request('POST', 'https://api.openai.com/v1/embeddings', [
'headers' => [
'Authorization' => 'Bearer ' . $this->apiKey,
'Content-Type' => 'application/json',
],
'json' => [
'model' => $this->model,
'input' => $texts,
'encoding_format' => 'float',
],
'timeout' => 60,
]);
$payload = $response->toArray();
/** @var list<list<float>> $vectors */
$vectors = array_map(
static fn (array $d) => $d['embedding'],
$payload['data'],
);
return $vectors;
}
public function dimensions(): int
{
return $this->dimensions;
}
public function modelId(): string
{
return $this->model;
}
}
Chunking is the most underestimated step. We use a hierarchical approach: split by section heading first, then by paragraph, then by max size of 800 tokens with 100 tokens overlap. The heading_path is preserved in metadata, critical to final prompt composition.
Retrieval service with Doctrine + pgvector
Doctrine has no native support for the vector type. We work in raw DBAL at this layer, which keeps performance optimal.
<?php
declare(strict_types=1);
namespace App\Rag\Retrieval;
use Doctrine\DBAL\Connection;
use Doctrine\DBAL\ParameterType;
final class PgVectorRetriever
{
public function __construct(
private readonly Connection $conn,
) {}
/**
* @param list<float> $queryVector
* @return list<RetrievedChunk>
*/
public function topK(array $queryVector, int $tenantId, int $k = 40): array
{
$vectorLiteral = $this->toPgvectorLiteral($queryVector);
$sql = <<<SQL
SELECT
dc.id,
dc.document_id,
dc.content,
dc.heading_path,
d.title AS document_title,
d.source_url AS document_url,
1 - (dc.embedding <=> :q) AS similarity
FROM document_chunk dc
INNER JOIN document d ON d.id = dc.document_id
WHERE dc.tenant_id = :tenant
AND d.deleted_at IS NULL
ORDER BY dc.embedding <=> :q
LIMIT :k
SQL;
$rows = $this->conn->executeQuery($sql, [
'q' => $vectorLiteral,
'tenant' => $tenantId,
'k' => $k,
], [
'q' => ParameterType::STRING,
'tenant' => ParameterType::INTEGER,
'k' => ParameterType::INTEGER,
])->fetchAllAssociative();
return array_map(
static fn (array $r) => new RetrievedChunk(
id: (int) $r['id'],
documentId: (int) $r['document_id'],
content: $r['content'],
headingPath: $r['heading_path'] ?? null,
documentTitle: $r['document_title'],
documentUrl: $r['document_url'] ?? null,
similarity: (float) $r['similarity'],
),
$rows,
);
}
/**
* @param list<float> $vector
*/
private function toPgvectorLiteral(array $vector): string
{
return '[' . implode(',', array_map(fn (float $f) => sprintf('%.8f', $f), $vector)) . ']';
}
}
final readonly class RetrievedChunk
{
public function __construct(
public int $id,
public int $documentId,
public string $content,
public ?string $headingPath,
public string $documentTitle,
public ?string $documentUrl,
public float $similarity,
) {}
}
pgvector's <=> operator computes cosine distance. To boost HNSW precision on a specific query, run SET LOCAL hnsw.ef_search = 80; before the query (default 40).
Filtering by tenant_id in WHERE is paramount in multi-tenant. The HNSW index does pre-selection without the filter, then PostgreSQL applies WHERE. For large volumes, consider partitioning per tenant.
Reranking with Cohere
Vector retrieval alone pulls 30 to 50 candidates. Reranking sorts them more finely with a cross-encoder that sees the query and passage together. Typical gain: +15 to +25% on precision@5.
<?php
declare(strict_types=1);
namespace App\Rag\Reranking;
use App\Rag\Retrieval\RetrievedChunk;
use Symfony\Contracts\HttpClient\HttpClientInterface;
final class CohereReranker
{
public function __construct(
private readonly HttpClientInterface $client,
#[\SensitiveParameter] private readonly string $apiKey,
private readonly string $model = 'rerank-v3.5',
) {}
/**
* @param list<RetrievedChunk> $candidates
* @return list<RetrievedChunk>
*/
public function rerank(string $query, array $candidates, int $topN = 5): array
{
if ($candidates === []) {
return [];
}
$response = $this->client->request('POST', 'https://api.cohere.com/v2/rerank', [
'headers' => [
'Authorization' => 'Bearer ' . $this->apiKey,
'Content-Type' => 'application/json',
],
'json' => [
'model' => $this->model,
'query' => $query,
'documents' => array_map(
static fn (RetrievedChunk $c) => $c->content,
$candidates,
),
'top_n' => $topN,
],
'timeout' => 30,
]);
$payload = $response->toArray();
$ordered = [];
foreach ($payload['results'] as $result) {
$idx = (int) $result['index'];
$ordered[] = $candidates[$idx];
}
return $ordered;
}
}
Self-hosted alternative: BAAI/bge-reranker-large through a vLLM or TEI (Text Embeddings Inference) endpoint. Lower marginal cost above 200,000 queries per month.
Prompt composition and Claude call
The prompt is the critical piece. A RAG with great retrieval and a lazy prompt produces hallucinations.
<?php
declare(strict_types=1);
namespace App\Rag\Generation;
use App\Rag\Retrieval\RetrievedChunk;
use Symfony\Contracts\HttpClient\HttpClientInterface;
final class ClaudeGenerator
{
private const SYSTEM_PROMPT = <<<'PROMPT'
You are the company documentation assistant. Absolute rules:
1. Answer only from the passages provided in <context>.
2. If the information is not in <context>, say so explicitly ("I cannot find this information in the documentation.").
3. Cite sources by number in brackets, e.g. [1], [2].
4. Answer in the user's language.
5. Be concise and factual. No speculation, no filler.
PROMPT;
public function __construct(
private readonly HttpClientInterface $client,
#[\SensitiveParameter] private readonly string $apiKey,
private readonly string $model = 'claude-sonnet-4-5',
) {}
/**
* @param list<RetrievedChunk> $chunks
*/
public function generate(string $query, array $chunks): GeneratedAnswer
{
$context = $this->buildContext($chunks);
$userContent = sprintf("<context>\n%s\n</context>\n\nQuestion: %s", $context, $query);
$response = $this->client->request('POST', 'https://api.anthropic.com/v1/messages', [
'headers' => [
'x-api-key' => $this->apiKey,
'anthropic-version' => '2023-06-01',
'Content-Type' => 'application/json',
],
'json' => [
'model' => $this->model,
'max_tokens' => 1024,
'temperature' => 0.2,
'system' => [[
'type' => 'text',
'text' => self::SYSTEM_PROMPT,
'cache_control' => ['type' => 'ephemeral'],
]],
'messages' => [[
'role' => 'user',
'content' => $userContent,
]],
],
'timeout' => 60,
]);
$payload = $response->toArray();
return new GeneratedAnswer(
text: $payload['content'][0]['text'],
promptTokens: (int) ($payload['usage']['input_tokens'] ?? 0),
completionTokens: (int) ($payload['usage']['output_tokens'] ?? 0),
cachedPromptTokens: (int) ($payload['usage']['cache_read_input_tokens'] ?? 0),
model: $payload['model'],
);
}
/**
* @param list<RetrievedChunk> $chunks
*/
private function buildContext(array $chunks): string
{
$parts = [];
foreach ($chunks as $i => $chunk) {
$heading = $chunk->headingPath ? ' > ' . $chunk->headingPath : '';
$parts[] = sprintf(
"[%d] Source: %s%s\n%s",
$i + 1,
$chunk->documentTitle,
$heading,
$chunk->content,
);
}
return implode("\n\n---\n\n", $parts);
}
}
final readonly class GeneratedAnswer
{
public function __construct(
public string $text,
public int $promptTokens,
public int $completionTokens,
public int $cachedPromptTokens,
public string $model,
) {}
}
Important notes:
cache_control: ephemeralon the system prompt activates Anthropic's native prompt caching. On follow-up calls within 5 minutes, system tokens cost 90% less. Massive gain on high-traffic RAGs.temperature: 0.2for deterministic answers. Raise it if you want more narrative creativity, but at the price of more hallucinations.- The instruction to cite
[1], [2]lets you materialize sources in the UI and lets users verify.
Observability: Langfuse and dashboards
Without observability, a production RAG becomes an expensive black box. We instrument each query with Langfuse (open-source, self-hostable).
<?php
declare(strict_types=1);
namespace App\Rag;
use Symfony\Contracts\HttpClient\HttpClientInterface;
final class LangfuseTracer
{
public function __construct(
private readonly HttpClientInterface $client,
#[\SensitiveParameter] private readonly string $publicKey,
#[\SensitiveParameter] private readonly string $secretKey,
private readonly string $host = 'https://cloud.langfuse.com',
) {}
/**
* @param array<string, mixed> $metadata
*/
public function trace(string $name, array $metadata): void
{
$this->client->request('POST', $this->host . '/api/public/ingestion', [
'auth_basic' => [$this->publicKey, $this->secretKey],
'json' => [
'batch' => [[
'id' => bin2hex(random_bytes(8)),
'type' => 'trace-create',
'timestamp' => gmdate('Y-m-d\TH:i:s.v\Z'),
'body' => ['name' => $name, 'metadata' => $metadata],
]],
],
'timeout' => 5,
]);
}
}
Dashboards to wire systematically:
- Cost per tenant per day (alert at 120% of the median of the last 7 days).
- Full pipeline p50, p95, p99 latency.
- Prompt cache hit rate.
- User satisfaction score (1 to 5) per feature.
- Distribution of top-1 similarity scores (corpus drift indicator).
For qualitative evaluation, we run a CI eval suite with Ragas on an annotated dataset of 100 to 200 question-answer pairs. Ragas metrics we track: context_precision, context_recall, faithfulness, answer_relevancy. Build fails if any drops below 0.75.
Real-world costs
Metrics compiled across 4 RAGs in production in our portfolio, 30,000 to 80,000 documents, 10,000 to 60,000 queries per month.
| Line item | 2026 unit cost | Typical monthly volume | Monthly cost |
|---|---|---|---|
| Initial embeddings (3072-dim) | 0.13 USD / 1M tokens | 50M tokens (initial ingest) | ~6 USD (one-shot) |
| Incremental embeddings | 0.13 USD / 1M tokens | 5M tokens | ~0.65 USD |
| pgvector storage (managed PG) | Existing PG cost | +15 GB | 0 to 50 USD |
| Cohere reranking | 2 USD / 1000 searches | 30,000 queries | 60 USD |
| Claude 4.5 Sonnet generation | 3 USD / 1M input, 15 USD / 1M output | 2k-token prompts, 400-token answers, 30k queries | 270 USD |
| Prompt caching (-90% on cache hit) | - | ~85% cache hit | Savings ~170 USD |
| Langfuse observability | Self-host or 29 USD / month cloud | - | 0 to 29 USD |
| Operational total | ~150 to 230 USD / month |
The amount stays contained if you apply three optimizations systematically:
- Anthropic prompt caching. System prompt and long instructions are cached, real gain 70 to 90%.
- Model selection per query. Simple classifications run on Claude Haiku (5 to 10 times cheaper). Only complex Q&A hits Sonnet.
- Batch API for non-real-time workflows (reports, enrichment): -50% on price.
The six classic pitfalls
Across twenty RAGs audited in 2025, the errors that kill POCs keep recurring.
- Chunking too coarse. 2000-token chunks dilute relevant information. Target 500 to 1000 tokens, with 10 to 15% overlap.
- Missing metadata. No
heading_path, no date, no document type. The LLM cannot cite or weight. - No reranking. Retrieval alone returns 30% noise in the top-10. Without reranking, the LLM hallucinates.
- Shallow system prompt. "You are a helpful assistant" is not enough. Explicitly forbid off-context answers, require source citation, fix the tone.
- No quantitative evaluation. Without Ragas or an annotated dataset, you cannot detect regressions on model or prompt change.
- Unmonitored costs. An automated user can 100x usual traffic overnight. Per-tenant rate limiting is mandatory on day 1.
Conclusion
A production RAG is not a LangChain demo. It is an engineering pipeline with its quality, observability, security and cost layers. The good news: with pgvector, Symfony and Claude, the technical foundation is accessible, maintainable over ten years, and economically viable for a mid-sized company.
For a RAG scoping on your business data, a production rollout or a cost optimization on an existing POC, write to contact@your-digital-hub.com or discover our Artificial Intelligence expertise.