RAG in production: a pragmatic architecture with pgvector, Claude and Symfony

RAG: promise vs reality

Retrieval-Augmented Generation: we retrieve the relevant passages from a corpus, hand them to an LLM, get a natural-language answer with sources. On paper, magical. In production, a serious engineering project.

The first RAG demos built in 2023 with LangChain and Pinecone stayed as POCs in 80% of the teams we audited in 2025. The reasons repeat: bad chunking, generic embeddings on a business corpus, no reranking, shallow prompts, spiraling costs, zero quantitative evaluation.

This article presents the architecture we have been shipping to production for two years: Symfony up front, PostgreSQL + pgvector for storage, Claude 4.5 Sonnet for generation, OpenAI embeddings, Cohere reranking. Code, SQL schema, real costs.

Target architecture

The pragmatic RAG pipeline breaks into seven steps.

 ┌─────────────┐     ┌──────────┐     ┌───────────┐     ┌──────────┐
 │  Documents  │────▶│ Chunking │────▶│ Embedding │────▶│ Storage  │
 │  (PDF, MD,  │     │  smart   │     │ OpenAI    │     │ pgvector │
 │   HTML)     │     │          │     │           │     │  + meta  │
 └─────────────┘     └──────────┘     └───────────┘     └──────────┘
                                                              │
                                                              ▼
 ┌─────────────┐     ┌──────────┐     ┌───────────┐     ┌──────────┐
 │  Answer     │◀────│  Claude  │◀────│ Reranking │◀────│ Retrieval│
 │  + sources  │     │   4.5    │     │  Cohere   │     │  HNSW    │
 └─────────────┘     └──────────┘     └───────────┘     └──────────┘

Ingestion: extract text from source formats.
Chunking: split into contextually coherent passages.
Embedding: transform into dense vectors.
Storage: persist into PostgreSQL + pgvector.
Retrieval: pull top-k candidates by cosine similarity.
Reranking: re-sort candidates with a more precise model.
Generation: send final context to Claude, compose the prompt.

pgvector rather than Pinecone, Qdrant, Weaviate

The vector DB choice is often oversized. pgvector 0.8 (released late 2024) covers most needs with unbeatable operational simplicity.

Criterion	pgvector 0.8	Pinecone	Qdrant	Weaviate
Volume up to	5M 1536-dim vectors on 1 instance 16 GB	Unlimited	Unlimited	Unlimited
p95 HNSW latency	20 to 80 ms	30 to 100 ms	10 to 50 ms	40 to 150 ms
ACID transactions with business data	Yes, native	No	No	No
Unified backup with rest of DB	Yes	No	Separate	Separate
Standard SQL	Yes	Proprietary API	Proprietary API	GraphQL / REST
Monthly cost (5M vectors)	Existing PG cost	70 to 200 USD	50 to 150 USD	80 USD
Learning curve	None if team knows PG	Moderate	Moderate	Steep

Our rule: pgvector by default, switch to Qdrant if we exceed 50M vectors or if pre-query filtering becomes complex (hybrid dense + structured queries). Pinecone is still relevant for a team that wants zero ops.

PostgreSQL schema

The schema we deploy on most Symfony RAGs. PostgreSQL 16 or 17, pgvector 0.8 extension enabled.

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Source document table
CREATE TABLE document (
    id           BIGSERIAL PRIMARY KEY,
    tenant_id    BIGINT NOT NULL,
    external_id  TEXT NOT NULL,
    title        TEXT NOT NULL,
    source_type  TEXT NOT NULL,               -- 'pdf', 'md', 'confluence', etc.
    source_url   TEXT,
    checksum     TEXT NOT NULL,               -- sha256 of source file
    raw_bytes    INTEGER NOT NULL,
    language     TEXT NOT NULL DEFAULT 'en',
    ingested_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    deleted_at   TIMESTAMPTZ,
    UNIQUE (tenant_id, external_id)
);
CREATE INDEX idx_document_tenant ON document (tenant_id) WHERE deleted_at IS NULL;

-- Chunk table with embedding
CREATE TABLE document_chunk (
    id                BIGSERIAL PRIMARY KEY,
    document_id       BIGINT NOT NULL REFERENCES document(id) ON DELETE CASCADE,
    tenant_id         BIGINT NOT NULL,
    chunk_index       INTEGER NOT NULL,
    content           TEXT NOT NULL,
    content_tokens    INTEGER NOT NULL,
    heading_path      TEXT,                   -- 'Section 2 > Subsection A'
    embedding         vector(3072) NOT NULL,  -- text-embedding-3-large = 3072
    embedding_model   TEXT NOT NULL,          -- for future swap
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (document_id, chunk_index)
);

-- HNSW index: balance speed / accuracy / memory
CREATE INDEX idx_chunk_embedding_hnsw
    ON document_chunk
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Index for tenant filter (critical in multi-tenant)
CREATE INDEX idx_chunk_tenant ON document_chunk (tenant_id);

-- Trace table for observability and evaluation
CREATE TABLE rag_query (
    id                BIGSERIAL PRIMARY KEY,
    tenant_id         BIGINT NOT NULL,
    user_id           BIGINT,
    query_text        TEXT NOT NULL,
    retrieved_ids     BIGINT[] NOT NULL,
    reranked_ids      BIGINT[] NOT NULL,
    final_prompt      TEXT NOT NULL,
    answer            TEXT NOT NULL,
    model             TEXT NOT NULL,
    prompt_tokens     INTEGER NOT NULL,
    completion_tokens INTEGER NOT NULL,
    cost_cents        INTEGER NOT NULL,
    latency_ms        INTEGER NOT NULL,
    rating            SMALLINT,               -- user feedback 1..5
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_rag_query_tenant_date ON rag_query (tenant_id, created_at DESC);

The m = 16 HNSW parameter gives a good trade-off on most corpora. Raise to m = 32 if precision is insufficient (cost: +50% memory). ef_construction = 64 suffices for corpora under 5 million vectors.

Important: the HNSW index is built in memory. For a table with 5 million 3072-dim vectors, budget about 60 GB of RAM. Size the PostgreSQL instance accordingly, or use a smaller dimension (text-embedding-3-small = 1536, -80% RAM and -60% cost).

Symfony embedding service

The embedding client is an injectable dependency that isolates the provider. The same project can swap OpenAI, Voyage AI, Cohere or a self-hosted model without touching business logic.

&#x3C;?php
declare(strict_types=1);

namespace App\Rag\Embedding;

use Symfony\Contracts\HttpClient\HttpClientInterface;

interface EmbeddingClient
{
    /**
     * @param list&#x3C;string> $texts
     * @return list&#x3C;list&#x3C;float>>
     */
    public function embedBatch(array $texts): array;

    public function dimensions(): int;

    public function modelId(): string;
}

final class OpenAIEmbeddingClient implements EmbeddingClient
{
    public function __construct(
        private readonly HttpClientInterface $client,
        #[\SensitiveParameter] private readonly string $apiKey,
        private readonly string $model = 'text-embedding-3-large',
        private readonly int $dimensions = 3072,
    ) {}

    public function embedBatch(array $texts): array
    {
        if ($texts === []) {
            return [];
        }
        if (count($texts) > 2048) {
            throw new \InvalidArgumentException('OpenAI embeds max 2048 inputs per call.');
        }
        $response = $this->client->request('POST', 'https://api.openai.com/v1/embeddings', [
            'headers' => [
                'Authorization' => 'Bearer ' . $this->apiKey,
                'Content-Type' => 'application/json',
            ],
            'json' => [
                'model' => $this->model,
                'input' => $texts,
                'encoding_format' => 'float',
            ],
            'timeout' => 60,
        ]);
        $payload = $response->toArray();
        /** @var list&#x3C;list&#x3C;float>> $vectors */
        $vectors = array_map(
            static fn (array $d) => $d['embedding'],
            $payload['data'],
        );
        return $vectors;
    }

    public function dimensions(): int
    {
        return $this->dimensions;
    }

    public function modelId(): string
    {
        return $this->model;
    }
}

Chunking is the most underestimated step. We use a hierarchical approach: split by section heading first, then by paragraph, then by max size of 800 tokens with 100 tokens overlap. The heading_path is preserved in metadata, critical to final prompt composition.

Retrieval service with Doctrine + pgvector

Doctrine has no native support for the vector type. We work in raw DBAL at this layer, which keeps performance optimal.

&#x3C;?php
declare(strict_types=1);

namespace App\Rag\Retrieval;

use Doctrine\DBAL\Connection;
use Doctrine\DBAL\ParameterType;

final class PgVectorRetriever
{
    public function __construct(
        private readonly Connection $conn,
    ) {}

    /**
     * @param list&#x3C;float> $queryVector
     * @return list&#x3C;RetrievedChunk>
     */
    public function topK(array $queryVector, int $tenantId, int $k = 40): array
    {
        $vectorLiteral = $this->toPgvectorLiteral($queryVector);
        $sql = &#x3C;&#x3C;&#x3C;SQL
            SELECT
                dc.id,
                dc.document_id,
                dc.content,
                dc.heading_path,
                d.title           AS document_title,
                d.source_url      AS document_url,
                1 - (dc.embedding &#x3C;=> :q) AS similarity
            FROM document_chunk dc
            INNER JOIN document d ON d.id = dc.document_id
            WHERE dc.tenant_id = :tenant
              AND d.deleted_at IS NULL
            ORDER BY dc.embedding &#x3C;=> :q
            LIMIT :k
        SQL;
        $rows = $this->conn->executeQuery($sql, [
            'q' => $vectorLiteral,
            'tenant' => $tenantId,
            'k' => $k,
        ], [
            'q' => ParameterType::STRING,
            'tenant' => ParameterType::INTEGER,
            'k' => ParameterType::INTEGER,
        ])->fetchAllAssociative();

        return array_map(
            static fn (array $r) => new RetrievedChunk(
                id: (int) $r['id'],
                documentId: (int) $r['document_id'],
                content: $r['content'],
                headingPath: $r['heading_path'] ?? null,
                documentTitle: $r['document_title'],
                documentUrl: $r['document_url'] ?? null,
                similarity: (float) $r['similarity'],
            ),
            $rows,
        );
    }

    /**
     * @param list&#x3C;float> $vector
     */
    private function toPgvectorLiteral(array $vector): string
    {
        return '[' . implode(',', array_map(fn (float $f) => sprintf('%.8f', $f), $vector)) . ']';
    }
}

final readonly class RetrievedChunk
{
    public function __construct(
        public int $id,
        public int $documentId,
        public string $content,
        public ?string $headingPath,
        public string $documentTitle,
        public ?string $documentUrl,
        public float $similarity,
    ) {}
}

pgvector's <=> operator computes cosine distance. To boost HNSW precision on a specific query, run SET LOCAL hnsw.ef_search = 80; before the query (default 40).

Filtering by tenant_id in WHERE is paramount in multi-tenant. The HNSW index does pre-selection without the filter, then PostgreSQL applies WHERE. For large volumes, consider partitioning per tenant.

Reranking with Cohere

Vector retrieval alone pulls 30 to 50 candidates. Reranking sorts them more finely with a cross-encoder that sees the query and passage together. Typical gain: +15 to +25% on precision@5.

&#x3C;?php
declare(strict_types=1);

namespace App\Rag\Reranking;

use App\Rag\Retrieval\RetrievedChunk;
use Symfony\Contracts\HttpClient\HttpClientInterface;

final class CohereReranker
{
    public function __construct(
        private readonly HttpClientInterface $client,
        #[\SensitiveParameter] private readonly string $apiKey,
        private readonly string $model = 'rerank-v3.5',
    ) {}

    /**
     * @param list&#x3C;RetrievedChunk> $candidates
     * @return list&#x3C;RetrievedChunk>
     */
    public function rerank(string $query, array $candidates, int $topN = 5): array
    {
        if ($candidates === []) {
            return [];
        }
        $response = $this->client->request('POST', 'https://api.cohere.com/v2/rerank', [
            'headers' => [
                'Authorization' => 'Bearer ' . $this->apiKey,
                'Content-Type' => 'application/json',
            ],
            'json' => [
                'model' => $this->model,
                'query' => $query,
                'documents' => array_map(
                    static fn (RetrievedChunk $c) => $c->content,
                    $candidates,
                ),
                'top_n' => $topN,
            ],
            'timeout' => 30,
        ]);
        $payload = $response->toArray();
        $ordered = [];
        foreach ($payload['results'] as $result) {
            $idx = (int) $result['index'];
            $ordered[] = $candidates[$idx];
        }
        return $ordered;
    }
}

Self-hosted alternative: BAAI/bge-reranker-large through a vLLM or TEI (Text Embeddings Inference) endpoint. Lower marginal cost above 200,000 queries per month.

Prompt composition and Claude call

The prompt is the critical piece. A RAG with great retrieval and a lazy prompt produces hallucinations.

&#x3C;?php
declare(strict_types=1);

namespace App\Rag\Generation;

use App\Rag\Retrieval\RetrievedChunk;
use Symfony\Contracts\HttpClient\HttpClientInterface;

final class ClaudeGenerator
{
    private const SYSTEM_PROMPT = &#x3C;&#x3C;&#x3C;'PROMPT'
    You are the company documentation assistant. Absolute rules:
    1. Answer only from the passages provided in &#x3C;context>.
    2. If the information is not in &#x3C;context>, say so explicitly ("I cannot find this information in the documentation.").
    3. Cite sources by number in brackets, e.g. [1], [2].
    4. Answer in the user's language.
    5. Be concise and factual. No speculation, no filler.
    PROMPT;

    public function __construct(
        private readonly HttpClientInterface $client,
        #[\SensitiveParameter] private readonly string $apiKey,
        private readonly string $model = 'claude-sonnet-4-5',
    ) {}

    /**
     * @param list&#x3C;RetrievedChunk> $chunks
     */
    public function generate(string $query, array $chunks): GeneratedAnswer
    {
        $context = $this->buildContext($chunks);
        $userContent = sprintf("&#x3C;context>\n%s\n&#x3C;/context>\n\nQuestion: %s", $context, $query);

        $response = $this->client->request('POST', 'https://api.anthropic.com/v1/messages', [
            'headers' => [
                'x-api-key' => $this->apiKey,
                'anthropic-version' => '2023-06-01',
                'Content-Type' => 'application/json',
            ],
            'json' => [
                'model' => $this->model,
                'max_tokens' => 1024,
                'temperature' => 0.2,
                'system' => [[
                    'type' => 'text',
                    'text' => self::SYSTEM_PROMPT,
                    'cache_control' => ['type' => 'ephemeral'],
                ]],
                'messages' => [[
                    'role' => 'user',
                    'content' => $userContent,
                ]],
            ],
            'timeout' => 60,
        ]);
        $payload = $response->toArray();

        return new GeneratedAnswer(
            text: $payload['content'][0]['text'],
            promptTokens: (int) ($payload['usage']['input_tokens'] ?? 0),
            completionTokens: (int) ($payload['usage']['output_tokens'] ?? 0),
            cachedPromptTokens: (int) ($payload['usage']['cache_read_input_tokens'] ?? 0),
            model: $payload['model'],
        );
    }

    /**
     * @param list&#x3C;RetrievedChunk> $chunks
     */
    private function buildContext(array $chunks): string
    {
        $parts = [];
        foreach ($chunks as $i => $chunk) {
            $heading = $chunk->headingPath ? ' > ' . $chunk->headingPath : '';
            $parts[] = sprintf(
                "[%d] Source: %s%s\n%s",
                $i + 1,
                $chunk->documentTitle,
                $heading,
                $chunk->content,
            );
        }
        return implode("\n\n---\n\n", $parts);
    }
}

final readonly class GeneratedAnswer
{
    public function __construct(
        public string $text,
        public int $promptTokens,
        public int $completionTokens,
        public int $cachedPromptTokens,
        public string $model,
    ) {}
}

Important notes:

cache_control: ephemeral on the system prompt activates Anthropic's native prompt caching. On follow-up calls within 5 minutes, system tokens cost 90% less. Massive gain on high-traffic RAGs.
temperature: 0.2 for deterministic answers. Raise it if you want more narrative creativity, but at the price of more hallucinations.
The instruction to cite [1], [2] lets you materialize sources in the UI and lets users verify.

Observability: Langfuse and dashboards

Without observability, a production RAG becomes an expensive black box. We instrument each query with Langfuse (open-source, self-hostable).

&#x3C;?php
declare(strict_types=1);

namespace App\Rag;

use Symfony\Contracts\HttpClient\HttpClientInterface;

final class LangfuseTracer
{
    public function __construct(
        private readonly HttpClientInterface $client,
        #[\SensitiveParameter] private readonly string $publicKey,
        #[\SensitiveParameter] private readonly string $secretKey,
        private readonly string $host = 'https://cloud.langfuse.com',
    ) {}

    /**
     * @param array&#x3C;string, mixed> $metadata
     */
    public function trace(string $name, array $metadata): void
    {
        $this->client->request('POST', $this->host . '/api/public/ingestion', [
            'auth_basic' => [$this->publicKey, $this->secretKey],
            'json' => [
                'batch' => [[
                    'id' => bin2hex(random_bytes(8)),
                    'type' => 'trace-create',
                    'timestamp' => gmdate('Y-m-d\TH:i:s.v\Z'),
                    'body' => ['name' => $name, 'metadata' => $metadata],
                ]],
            ],
            'timeout' => 5,
        ]);
    }
}

Dashboards to wire systematically:

Cost per tenant per day (alert at 120% of the median of the last 7 days).
Full pipeline p50, p95, p99 latency.
Prompt cache hit rate.
User satisfaction score (1 to 5) per feature.
Distribution of top-1 similarity scores (corpus drift indicator).

For qualitative evaluation, we run a CI eval suite with Ragas on an annotated dataset of 100 to 200 question-answer pairs. Ragas metrics we track: context_precision, context_recall, faithfulness, answer_relevancy. Build fails if any drops below 0.75.

Real-world costs

Metrics compiled across 4 RAGs in production in our portfolio, 30,000 to 80,000 documents, 10,000 to 60,000 queries per month.

Line item	2026 unit cost	Typical monthly volume	Monthly cost
Initial embeddings (3072-dim)	0.13 USD / 1M tokens	50M tokens (initial ingest)	~6 USD (one-shot)
Incremental embeddings	0.13 USD / 1M tokens	5M tokens	~0.65 USD
pgvector storage (managed PG)	Existing PG cost	+15 GB	0 to 50 USD
Cohere reranking	2 USD / 1000 searches	30,000 queries	60 USD
Claude 4.5 Sonnet generation	3 USD / 1M input, 15 USD / 1M output	2k-token prompts, 400-token answers, 30k queries	270 USD
Prompt caching (-90% on cache hit)	-	~85% cache hit	Savings ~170 USD
Langfuse observability	Self-host or 29 USD / month cloud	-	0 to 29 USD
Operational total			~150 to 230 USD / month

The amount stays contained if you apply three optimizations systematically:

Anthropic prompt caching. System prompt and long instructions are cached, real gain 70 to 90%.
Model selection per query. Simple classifications run on Claude Haiku (5 to 10 times cheaper). Only complex Q&A hits Sonnet.
Batch API for non-real-time workflows (reports, enrichment): -50% on price.

The six classic pitfalls

Across twenty RAGs audited in 2025, the errors that kill POCs keep recurring.

Chunking too coarse. 2000-token chunks dilute relevant information. Target 500 to 1000 tokens, with 10 to 15% overlap.
Missing metadata. No heading_path, no date, no document type. The LLM cannot cite or weight.
No reranking. Retrieval alone returns 30% noise in the top-10. Without reranking, the LLM hallucinates.
Shallow system prompt. "You are a helpful assistant" is not enough. Explicitly forbid off-context answers, require source citation, fix the tone.
No quantitative evaluation. Without Ragas or an annotated dataset, you cannot detect regressions on model or prompt change.
Unmonitored costs. An automated user can 100x usual traffic overnight. Per-tenant rate limiting is mandatory on day 1.

Conclusion

A production RAG is not a LangChain demo. It is an engineering pipeline with its quality, observability, security and cost layers. The good news: with pgvector, Symfony and Claude, the technical foundation is accessible, maintainable over ten years, and economically viable for a mid-sized company.

For a RAG scoping on your business data, a production rollout or a cost optimization on an existing POC, write to contact@your-digital-hub.com or discover our Artificial Intelligence expertise.