Why your vector DB choice matters more in 2026

For the first eighteen months of the retrieval-augmented generation era, the quality bottleneck sat firmly inside the language model. If your answers were wrong, the cure was a better model, a longer context window or a smarter prompt. That diagnosis no longer holds. By Q1 2026, with Claude 4.6, GPT-5.2 and Gemini 2.5 all clearing the same evaluation thresholds within a few percentage points of each other, the bottleneck has moved decisively upstream. Retrieval quality — what your vector database actually returns when a user asks a question — is now what separates a useful RAG application from an embarrassing one.

That shift changes the way you should choose a vector database. The decision used to be a checklist exercise: pick the engine with the lowest latency in a benchmark blog post, add it to the architecture diagram, move on. In 2026 the decision behaves more like choosing an application database. You are picking the recall behaviour your product will live with, the filter ergonomics your engineers will fight with, the cost curve your finance team will analyse, and the operational surface area your platform team will support for the next three years. Get it wrong and you do not pay for it in a benchmark number; you pay for it in retrieval misses that show up as hallucinations in production.

The four databases below cover the realistic shortlist. pgvector is the "use what you have" answer. Qdrant is the open-source performance answer. Pinecone is the managed-service answer. Turbopuffer is the cost-optimised answer for large corpora with moderate query rates. Weaviate, Milvus and Vespa also have legitimate niches, and we will mention where they belong, but the large majority of teams shipping RAG today will pick from these four.

Pro tip

Before you compare engines, write down three numbers: your expected vector count at launch, your expected vector count in eighteen months, and your queries per second at the peak hour. If you cannot estimate these to within an order of magnitude, you are not ready to pick a vector database — you are ready to build a tiny prototype on pgvector and measure your own behaviour for a month.
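
To turn those numbers into something you can compare against instance sizes, a back-of-envelope calculation helps. The sketch below is illustrative only: it assumes 4-byte float32 components and counts the raw vectors alone, with no index overhead, payloads or replicas. The function name and parameters are our own and not part of any engine's API.

```python
def raw_vector_gib(num_vectors: int, dims: int, bytes_per_component: int = 4) -> float:
    """Raw storage for the vectors alone, in GiB (no index, payload or replica overhead)."""
    return num_vectors * dims * bytes_per_component / 2**30


# Example: a launch-size corpus versus an eighteen-month projection at 1,536 dimensions.
for label, count in [("launch", 1_000_000), ("18 months", 10_000_000)]:
    print(f"{label}: {raw_vector_gib(count, 1536):.1f} GiB of raw float32 vectors")
```

At 1,536 dimensions, ten million vectors come to roughly 57 GiB before any index is built, which is usually enough to tell you whether the corpus fits comfortably on the hardware you already run.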

The four at a glance

The table below summarises the four candidates at the ten-million-vector mark — the scale at which most production RAG workloads either succeed or quietly start to fall over. Numbers reflect May 2026 published pricing and the latency ranges observed by builders in our network.

| Database | Engine | p95 latency @ 10M vectors | Monthly cost @ 10M vectors | Hybrid search | Self-host | Best for |
|---|---|---|---|---|---|---|
| pgvector | Postgres extension (C) | 8–25 ms | ~$30 (on existing PG) | Manual (tsvector + vector) | Yes (any Postgres) | Vectors next to app data |
| Qdrant | Rust, HNSW + payloads | 15–40 ms (~22 ms typical) | ~$40–80 | Native, sparse + dense | Yes (Apache 2.0) | Filtered search at scale |
| Pinecone | Managed (proprietary) | ~45 ms | ~$180 | Sparse-dense hybrid | No | Zero-ops, fastest to ship |
| Turbopuffer | Object-store + serverless | 30 ms warm; 300–800 ms cold | ~$64 + ~$9/M ops | Native | No (managed) | Huge corpus, low QPS |

pgvector: ship RAG without a new database

pgvector is the answer for the largest single category of RAG workloads: teams who already run Postgres, whose corpus is below the ten-million-vector ceiling, and who would prefer not to introduce a new piece of infrastructure if they do not have to. Adopting it is a single CREATE EXTENSION away. Embeddings live in the same database as users, billing rows and product data, which means you can join them, transact across them and back them up with the tools you already trust.

The HNSW index, available since pgvector 0.5, is what makes this story work at production scale. It optimises the trade-off between recall and latency well enough that a properly tuned pgvector instance will land in the 8 to 25 millisecond p95 band for most realistic queries up to roughly ten million vectors. The pgvectorscale extension from Timescale extends the comfortable range to fifty or even one hundred million vectors before you need to seriously think about migrating.

-- Enable the extension and create a table with a 1536-dim embedding
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE docs (
  id          bigserial PRIMARY KEY,
  source      text NOT NULL,
  body        text,
  embedding   vector(1536)
);

-- HNSW index, cosine distance, sensible defaults for ~10M rows
CREATE INDEX docs_embedding_hnsw
  ON docs
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query
SELECT id, source, body
FROM docs
ORDER BY embedding <=> $1::vector
LIMIT 10;
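
The `<=>` operator in the query above is pgvector's cosine distance, that is, one minus cosine similarity, so smaller values mean closer matches. A pure-Python sketch of the same quantity can be handy for spot-checking a few rows by hand. The function name is ours, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumes non-zero vectors; a zero vector would divide by zero here.
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```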

The honest limitation is hybrid search. If your queries depend on a mix of semantic similarity and exact-term matching (product codes, SKU numbers, error strings), you will end up combining a tsvector full-text column with the vector column and tuning the score blend yourself. It works, and Indian e-commerce teams in Bengaluru run it at considerable scale, but it is not as ergonomic as Qdrant's native hybrid.

Qdrant: filtered search, Rust-fast

Qdrant is the open-source vector database that beats Pinecone on raw latency and beats most rivals on filtered search. Written in Rust, it lands at roughly 22 milliseconds p95 at ten million vectors in typical configurations, and the gap widens as your filters get more complex. The reason is architectural: Qdrant keeps payload filters close to the HNSW traversal rather than applying them as a post-filter, which means a query like "find similar documents but only from this customer in the last 90 days" stays fast even when the filter is highly selective.
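
The difference between post-filtering and filter-aware search is easiest to see on a toy example. The sketch below uses exact brute-force scoring over four hypothetical documents; it does not reproduce Qdrant's index or its traversal logic, and every name in it is illustrative. It only shows why a selective filter applied after the top-k cut can leave you with nothing, which in practice forces expensive over-fetching.

```python
def top_k(query, docs, k, keep=lambda d: True):
    """Exact top-k by dot product over the docs that pass `keep`."""
    scored = [
        (sum(q * v for q, v in zip(query, d["vec"])), d["id"])
        for d in docs
        if keep(d)
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


docs = [
    {"id": "a", "tenant": "acme", "vec": [0.9, 0.1]},
    {"id": "b", "tenant": "acme", "vec": [0.8, 0.2]},
    {"id": "c", "tenant": "globex", "vec": [0.2, 0.8]},
    {"id": "d", "tenant": "globex", "vec": [0.1, 0.9]},
]
query = [1.0, 0.0]


def is_globex(d):
    return d["tenant"] == "globex"


# Post-filter: take the global top 2, then drop documents from other tenants.
by_id = {d["id"]: d for d in docs}
post = [i for i in top_k(query, docs, k=2) if is_globex(by_id[i])]

# Filter-aware: only matching documents compete for the top 2.
pre = top_k(query, docs, k=2, keep=is_globex)

print(post)  # []
print(pre)   # ['c', 'd']
```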

For teams that have outgrown pgvector, Qdrant is the most common next step. Self-hosting is straightforward: the official Docker image is small, the Helm chart is reasonable, and the operational behaviour is predictable. Qdrant Cloud exists if you prefer managed; expect roughly one-third the cost of Pinecone for similar throughput, with significantly better filter performance.

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://qdrant:6333")

# Filtered semantic query: vendors in London, indexed after 1 Apr 2026
results = client.search(
    collection_name="invoices",
    query_vector=embedding,            # list[float], length 1536
    limit=10,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="city",
                match=models.MatchValue(value="London"),
            ),
            models.FieldCondition(
                key="indexed_at",
                range=models.Range(gte=1775001600),  # 2026-04-01T00:00:00Z as a unix timestamp
            ),
        ]
    ),
    with_payload=True,
)

A Verified Builder · Hyderabad

"We moved off Pinecone after our corpus hit twelve million vectors. Bill went from roughly two hundred dollars a month to fifty for a single-node Qdrant on an existing Kubernetes cluster, and our filtered queries — every query has a tenant filter — got noticeably snappier. The only real cost was three engineering days writing a backfill script. We have not looked back."

Pinecone: zero-ops managed, premium price

Pinecone is the database you pick when you do not want to think about the database. It is fully managed, scales horizontally without you doing anything, has a clean SDK in every major language, and has by far the smoothest path from prototype to production. The p95 latency at ten million vectors sits around 45 milliseconds, and the cost will land near $180 a month for a comparable workload — both of which are higher than the open-source alternatives. What you are paying for is the absence of a backlog ticket called "rebalance the vector cluster".

For most startups in their first eighteen months, this is the right trade. Engineering time is the scarce resource; one founder-engineer running a Pinecone-backed RAG product will out-ship a comparable team that is also operating Qdrant on Kubernetes. Pinecone is also the safest pick for teams that have not yet built up internal expertise in vector indexing — there are fewer knobs to misconfigure.

import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create once (idempotent)
if "docs" not in [i.name for i in pc.list_indexes()]:
    pc.create_index(
        name="docs",
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index("docs")

index.upsert(vectors=[
    {"id": "doc-1", "values": embedding, "metadata": {"city": "Edinburgh"}},
])

resp = index.query(
    vector=embedding,
    top_k=10,
    filter={"city": {"$eq": "Edinburgh"}},
    include_metadata=True,
)

Turbopuffer: cheapest at scale, with a cold-query catch

Turbopuffer is the most architecturally interesting of the four. It separates storage from compute, stores vector partitions in object storage rather than on the query node's disk, and pages partitions in on demand. The result is dramatic: pricing starts at roughly sixty-four dollars a month plus about nine dollars per million read-and-write operations, which works out cheaper than every other option in this article when your corpus is large but your query rate is moderate.
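
To see how the per-operation charge interacts with query rate, here is a rough sketch. It treats pricing as a flat base plus a per-operation fee using the approximate figures quoted above, counts queries only, and ignores anything else a real bill may depend on, such as storage volume or write traffic. The function and its parameters are hypothetical, not a vendor calculator.

```python
def monthly_cost_usd(
    queries_per_minute: float,
    base_usd: float = 64.0,            # approximate base price quoted above
    usd_per_million_ops: float = 9.0,  # approximate per-million-operations price
    days: int = 30,
) -> float:
    """Flat base plus per-operation charge, counting queries only."""
    ops = queries_per_minute * 60 * 24 * days
    return base_usd + usd_per_million_ops * ops / 1_000_000


# 100 queries per minute is about 4.3 million operations in a 30-day month.
print(f"${monthly_cost_usd(100):.2f} per month")
```

Because the per-operation term scales with traffic rather than corpus size, the bill under this model grows with how often you query, not with how much you store, which is the property that makes the architecture attractive for large, quiet corpora.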

The catch is cold-query latency. When a partition is not already in memory on a query node, Turbopuffer has to fetch it from object storage before it can search. That fetch typically adds 300 to 800 milliseconds to the first query. For workloads with a long tail of partitions — internal log search, periodic semantic queries over an archive, infrequent analyst questions over a knowledge base — this is a non-issue, because the cost saving is enormous and a once-a-minute query can tolerate a sub-second cold start. For interactive chat-style RAG where every query needs sub-100-millisecond p95, Turbopuffer is the wrong tool.
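
One way to reason about that threshold is to treat latency as a two-point distribution: a query is either warm or cold. Under that simplification, which is ours and ignores everything between the two extremes, p95 stays at the warm figure only while cold queries make up no more than 5 percent of traffic. The numbers in the example are placeholders taken from the ranges above.

```python
def p95_ms(warm_ms: float, cold_penalty_ms: float, cold_fraction: float) -> float:
    """p95 of a two-point latency model: warm with probability 1 - f, cold with probability f."""
    # P(latency <= warm_ms) = 1 - cold_fraction, so p95 is warm only if that is >= 0.95.
    if cold_fraction <= 0.05:
        return warm_ms
    return warm_ms + cold_penalty_ms


print(p95_ms(30, 500, 0.02))  # 30
print(p95_ms(30, 500, 0.10))  # 530
```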

// Turbopuffer Node SDK — upsert and query
import { Turbopuffer } from "@turbopuffer/turbopuffer";

const tpuf = new Turbopuffer({ apiKey: process.env.TPUF_API_KEY });
const ns = tpuf.namespace("logs-2026-05");

await ns.upsert({
  vectors: [
    { id: "ev-1", vector: embedding, attributes: { city: "Bengaluru" } },
  ],
  distance_metric: "cosine_distance",
});

const hits = await ns.query({
  vector: embedding,
  top_k: 10,
  filters: ["And", [["city", "Eq", "Bengaluru"]]],
});

Watch out

Do not benchmark Turbopuffer on a fresh namespace and conclude it is slow. The first few queries against any partition will be cold; what you want to measure is steady-state latency once your working set is paged in. Equally, do not benchmark it on a tiny corpus and conclude it is cheap — the cost story only dominates above several million vectors. Match the test to the workload, or the numbers will mislead.

Hybrid search — who does it natively, who doesn't

Hybrid search — combining dense vector similarity with sparse keyword matching such as BM25 — has gone from a nice-to-have to a near-requirement for any RAG application whose corpus contains proper nouns, error messages, product codes or any kind of rare token an embedding model dilutes. The behaviour shows up immediately: a user types a precise SKU, expects an exact match, gets a fuzzy paraphrase, and stops trusting the system.

  • Qdrant — native, with first-class sparse-vector support and a clean fusion API.
  • Weaviate — native, GraphQL-friendly, and historically the strongest hybrid story.
  • Turbopuffer — native, with built-in full-text search alongside vector search.
  • Elasticsearch — native; the OG sparse-search engine plus dense vectors.
  • Pinecone — sparse-dense vectors are supported and improving, with some workload-shaping required.
  • pgvector — possible by combining tsvector full-text columns with the vector column; you write the score fusion yourself.
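
Writing the fusion yourself, as in the pgvector case, does not have to be elaborate. One common technique is reciprocal rank fusion (RRF), which merges ranked lists using only each document's position in them. The sketch below is a minimal version under stated assumptions: k = 60 is a conventional default rather than a tuned value, the document ids are made up, and this is not necessarily what any of the engines above does internally.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["doc-7", "doc-2", "doc-9"]   # hypothetical ids from the vector index
sparse = ["doc-2", "doc-4", "doc-7"]  # hypothetical ids from full-text search
print(reciprocal_rank_fusion([dense, sparse]))
# ['doc-2', 'doc-7', 'doc-4', 'doc-9']
```

Documents that rank well in both lists rise to the top, and because only ranks are used there is no need to normalise cosine distances against full-text scores.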

For deeper guidance on how to design and tune a hybrid pipeline in production — including BM25 weighting, reranking and evaluation — see our dedicated piece on hybrid search in production RAG.

Cost worked-example at three scales

The chart that actually decides most procurement conversations is the cost curve. The table below estimates monthly cost at three corpus sizes — 1 million, 10 million and 100 million vectors — assuming a moderate query rate of roughly 100 queries per minute and 1,536-dimension embeddings. These are May 2026 published prices; your mileage will vary with discounts, region and reservation.

| Database | 1M vectors | 10M vectors | 100M vectors |
|---|---|---|---|
| pgvector (on existing Postgres) | ~$15 | ~$30 | ~$220 (with pgvectorscale) |
| Qdrant (single-node OSS) | ~$20 | ~$60 | ~$420 (clustered) |
| Pinecone (serverless tier) | ~$70 | ~$180 | ~$1,650 |
| Turbopuffer (moderate QPS) | ~$64 | ~$95 | ~$310 |

Two things to note. First, pgvector is the cheapest option at every scale in this table, but its lead narrows at 100M as you start needing pgvectorscale, more cores and more RAM. Second, among the dedicated engines Turbopuffer becomes the obvious answer above roughly fifty million vectors if your QPS is moderate: at 100M it is roughly five times cheaper than Pinecone (~$310 versus ~$1,650) and noticeably cheaper than clustered Qdrant (~$420). Inference and storage are no longer the only cost levers in a modern AI stack; see our broader analysis of AI inference cost engineering in 2026 for the bigger picture.

Implementation decision tree

Five questions, in order. Answer each one before moving on, and the right database falls out of the answers.

  1. Are you already running Postgres in production? If yes and your vector count is below ten million, start with pgvector. There is no cheaper or simpler path. If no, skip to question two.
  2. Will your corpus exceed 100 million vectors within twelve months? If yes, Turbopuffer or Milvus belong on the shortlist regardless of other answers. If no, continue.
  3. How complex are your filters? If most queries combine semantic similarity with multi-attribute filters (tenant, time range, language, region), Qdrant is the strongest fit. If filters are rare or simple, the field stays open.
  4. How big is your platform team? Zero or one engineer who can operate infrastructure — pick a managed service (Pinecone, Qdrant Cloud or Turbopuffer). A team of three or more with Kubernetes experience can comfortably run Qdrant or pgvector themselves.
  5. How sensitive is your workload to cold-query latency? If a 500-millisecond first hit is unacceptable, rule out Turbopuffer; if you can tolerate it for the cost saving, Turbopuffer dominates at scale.
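
The five questions can be written down as a small function, which is a useful way to force your team to commit to answers. The sketch below is one possible reading of the list, not a definitive encoding: the `Workload` fields, the way later answers add to rather than replace earlier ones, and the treatment of an empty result as "the field stays open" are all our assumptions.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    runs_postgres: bool
    vectors_now: int
    vectors_in_12_months: int
    complex_filters: bool
    infra_engineers: int
    tolerates_cold_queries: bool


def shortlist(w: Workload) -> list[str]:
    """Walk the five questions in order and collect the engines they point to."""
    # Q1: existing Postgres and a corpus under ten million vectors.
    if w.runs_postgres and w.vectors_now < 10_000_000:
        return ["pgvector"]
    picks: list[str] = []
    # Q2: very large corpus within twelve months.
    if w.vectors_in_12_months > 100_000_000:
        picks += ["Turbopuffer", "Milvus"]
    # Q3: multi-attribute filters on most queries.
    if w.complex_filters:
        picks.append("Qdrant")
    # Q4: platform team size.
    if w.infra_engineers <= 1:
        picks += ["Pinecone", "Qdrant Cloud", "Turbopuffer"]
    elif w.infra_engineers >= 3:
        picks += ["Qdrant", "pgvector"]
    # Q5: cold-query sensitivity rules Turbopuffer out.
    if not w.tolerates_cold_queries:
        picks = [p for p in picks if p != "Turbopuffer"]
    # De-duplicate while preserving the order in which engines were suggested.
    return list(dict.fromkeys(picks))


print(shortlist(Workload(False, 1_000_000, 5_000_000, True, 1, False)))
# ['Qdrant', 'Pinecone', 'Qdrant Cloud']
```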

Migration patterns

Most teams pick the wrong vector database the first time, and that is fine — the migrations are well-trodden. The three you will most likely face:

pgvector → Qdrant. The classic outgrowth pattern. Triggers when corpus crosses roughly ten million vectors, when filtered-query latency starts climbing past your budget, or when concurrent write load begins to interfere with normal Postgres traffic. The migration is mechanical: stream rows out of Postgres in batches, embed (or re-use existing embeddings), upsert into Qdrant with the same primary keys and payload, run both databases in shadow mode for a week to compare recall, then cut over. Plan three to five engineering days plus a week of observation.
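
The batching step of that backfill is simple enough to sketch without committing to a particular client. In the sketch below, `upsert_batch` is a hypothetical callback that would wrap whatever upsert call your target database exposes, and the batch size of 500 is a placeholder rather than a recommendation.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` rows without loading everything at once."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def migrate(
    rows: Iterable[dict],
    upsert_batch: Callable[[list[dict]], None],
    size: int = 500,
) -> int:
    """Stream rows into the target in batches and return how many were moved."""
    moved = 0
    for batch in batched(rows, size):
        upsert_batch(batch)  # hypothetical: wraps the target client's upsert call
        moved += len(batch)
    return moved


# Dry run against an in-memory sink instead of a real database.
sink: list[list[dict]] = []
rows = ({"id": i, "embedding": [0.0], "payload": {}} for i in range(1050))
print(migrate(rows, sink.append), [len(b) for b in sink])
# 1050 [500, 500, 50]
```

Keeping the same primary keys on both sides, as described above, is what makes the later comparison between the two systems straightforward.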

Pinecone → Qdrant. The cost-optimisation pattern. Triggers when the monthly Pinecone bill becomes line-item-visible to your CFO. The migration is similar to the pgvector path, but easier in one way (you already have vectors, no re-embedding needed) and harder in another (you have to stand up a self-hosted or Qdrant Cloud target). Teams in London and Manchester have reported running both in parallel for two weeks to validate filter behaviour before turning Pinecone off. Typical saving is 50 to 70 percent of monthly spend.
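
During a parallel run like this, a simple number to track is how much of the old system's top-k the new system also returns for the same query. The sketch below computes that overlap from two lists of ids. Note the assumption: it measures agreement with the incumbent, not recall against ground truth, so a low score flags a difference worth investigating rather than proving a regression.

```python
def overlap_at_k(reference_ids: list[str], candidate_ids: list[str], k: int) -> float:
    """Fraction of the reference top-k that also appears in the candidate top-k."""
    ref = set(reference_ids[:k])
    if not ref:
        return 1.0  # nothing to find, so nothing was missed
    return len(ref & set(candidate_ids[:k])) / len(ref)


old_system = ["a", "b", "c", "d"]  # hypothetical ids from the incumbent
new_system = ["a", "c", "x", "d"]  # hypothetical ids from the migration target
print(overlap_at_k(old_system, new_system, k=4))  # 0.75
```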

Pinecone → Turbopuffer. The "we discovered we are not actually interactive" pattern. Common for internal RAG tools, knowledge-base search and analyst-facing systems where the actual query rate, once measured, turns out to be one query every few seconds. Migration is similar, with one extra design step: profile your access pattern to confirm cold-query tolerance, ideally by sampling production traces for a week. The saving is the largest of the three patterns and often the most surprising.
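
Profiling the access pattern can start from nothing more than query timestamps in your production traces. The sketch below estimates the share of queries that would arrive after a long idle gap and so would likely hit a cold partition. Both the idle threshold and the decision to treat the whole corpus as a single partition are assumptions on our part; how long data actually stays warm is something to confirm with the vendor or measure directly.

```python
def cold_fraction(timestamps: list[float], idle_threshold_s: float) -> float:
    """Share of queries arriving after an idle gap longer than the threshold."""
    if not timestamps:
        return 0.0
    ts = sorted(timestamps)
    cold = 1  # the first query in the trace is treated as cold
    cold += sum(1 for prev, cur in zip(ts, ts[1:]) if cur - prev > idle_threshold_s)
    return cold / len(ts)


# Five queries (seconds since trace start) with one long idle gap in the middle.
print(cold_fraction([0, 1, 2, 400, 401], idle_threshold_s=300))  # 0.4
```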

For a related decision that often comes up in the same architectural review, see our piece on context-window engineering versus RAG — the question of whether you even need a vector database at all when million-token context windows have become routine. And for the deeper retrieval research that is shaping vector-DB choices this quarter, our April 2026 agentic RAG research roundup catalogues the hierarchical and agentic retrieval patterns now landing in production. If GPU cost is part of the same conversation, the H100 price decline guide is the companion read.

Primary sources for this piece: github.com/pgvector/pgvector, qdrant.tech, pinecone.io/pricing and turbopuffer.com.