The quiet rise of hybrid retrieval — why pure vector search is losing in production
Two years into the RAG era, the production winners are not the vendors with the fanciest embedding model. They are the teams that quietly stopped betting everything on vector search.
by Skygena Editorial
Two years of RAG deployments have taught us one unfashionable lesson: pure vector search is rarely the right answer for enterprise retrieval. The teams that are shipping production agents in 2026 have quietly moved on to hybrid retrieval — and they are not advertising it, because it sounds boring.
Here is what the industry is converging on, and why it matters.
The pure-vector pitch never matched the reality
The original promise of vector retrieval was seductive: embed the query, embed the documents, nearest-neighbour search, done. No need for keyword engineering, no inverted index, no filters. One elegant pipeline.
In demos this works. In production it frequently fails in ways that make engineers tear their hair out:
- A customer asks about “U-value for building 3-A in the 2024 audit”. The closest embedding in the index is a similarly-phrased section about building 3-B in the 2019 audit. Close in semantic space, catastrophically wrong in practice.
- A query for “Q3 revenue in Germany, risk-adjusted” returns a beautifully-written paragraph about Q1 revenue in Germany that happens to use the word “risk-adjusted” once. The agent summarises it confidently. It is wrong by 40%.
- A search across a regulatory document corpus for “fine for violation of Article 17” returns paragraphs about Article 19 because the embedding model thinks 17 and 19 are semantically interchangeable.
Vector search cares about semantic similarity. It does not care about the exact entity you asked about. Enterprise queries almost always care about exact entities.
What hybrid retrieval actually means
The modern production stack combines three retrieval signals and ranks the fusion:
- Lexical (BM25 or equivalent) — catches exact entity mentions, identifiers, numbers, specific phrases. Boring. Fast. Cheap. Essential.
- Semantic (vector / ANN) — catches paraphrases, synonyms, conceptual similarity. Modern embeddings are excellent at this, so use them for this.
- Structured filters — applied BEFORE retrieval, not after. Date ranges, document type, region, authorisation level. Cuts the candidate set from millions to hundreds before the expensive steps run.
The output is a fused ranked list. The LLM then reads the top-K with full context and answers.
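To make the "filters before retrieval" point concrete, here is a minimal Python sketch. The `Doc` fields, the `pre_filter` helper and the three-document corpus are hypothetical, invented purely for illustration. In a real stack this step would normally be a metadata filter inside your search engine rather than a list comprehension.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    # Hypothetical metadata schema; adapt the fields to your own corpus.
    doc_id: str
    text: str
    doc_type: str
    region: str
    published: date


def pre_filter(docs, *, doc_type=None, region=None, since=None):
    """Apply hard constraints BEFORE any lexical or vector scoring runs."""
    return [
        d for d in docs
        if (doc_type is None or d.doc_type == doc_type)
        and (region is None or d.region == region)
        and (since is None or d.published >= since)
    ]


# Toy corpus, made up for this example.
CORPUS = [
    Doc("a1", "U-value audit for building 3-A", "audit", "EU", date(2024, 5, 2)),
    Doc("a2", "U-value audit for building 3-B", "audit", "EU", date(2019, 3, 11)),
    Doc("m1", "Marketing note on building 3-A", "memo", "US", date(2024, 6, 1)),
]

# Only documents that satisfy every constraint reach the scoring stage.
candidates = pre_filter(CORPUS, doc_type="audit", region="EU", since=date(2024, 1, 1))
candidate_ids = [d.doc_id for d in candidates]
```

Here the 2019 audit never reaches the ranking stage, so it cannot outrank the 2024 one however similar its wording is.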
This is not rocket science. It is what every mature search infrastructure has looked like for 20 years, and every team that reinvents retrieval in 2026 rediscovers the same shape.
When each signal wins
Based on what we see in our own engagements and what clients report:
- Queries with entity identifiers (product codes, legal article numbers, store IDs, dates) → BM25 wins. Vector embeddings frequently blur the identifier into its near-neighbours.
- Queries that paraphrase or summarise (“what’s the general sentiment about…”, “explain the difference between…”) → vector wins. BM25 only matches terms that literally appear in a document, so it misses relevant passages that use different wording from the query.
- Queries with hard constraints (“in the last quarter”, “for the EU market only”, “in approved documents”) → structured filters dominate. Without them, the top-K is polluted with irrelevant-but-similar content.
The fusion is what makes the system robust. Single-signal retrieval is always fragile.
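A small sketch shows why lexical scoring is so dependable on identifiers. The code below is a bare-bones BM25 scorer in plain Python, using the Lucene-style IDF variant. The tokenizer, the parameter defaults (`k1=1.5`, `b=0.75`) and the two toy documents are our assumptions for illustration, not a production configuration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no literal match, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents, made up for this example.
docs = [
    "Fine for violation of Article 17: see schedule A",
    "Fine for violation of Article 19: see schedule B",
]
scores = bm25_scores("fine for violation of Article 17", docs)
```

The two documents share every query term except the identifier. Only the first contains the token `17`, so it scores strictly higher, which is exactly the distinction an embedding tends to blur.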
The operational implication
If you are running a RAG project in 2026 and your retrieval layer is “cosine similarity over all-MiniLM embeddings”, you are leaving accuracy on the floor.
Three practical moves:
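- Add a BM25 (or equivalent lexical) index next to your vector index. Elasticsearch, OpenSearch and Typesense all support keyword and vector search side by side, and on PostgreSQL you can pair pgvector with the built-in tsvector full-text search (its ts_rank scoring is not BM25, but it fills the same exact-match role). The incremental infrastructure is cheap.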
- Extract structured metadata at ingestion time — dates, entities, document type, access controls. Use those as pre-filters, not post-filters.
- Use Reciprocal Rank Fusion or a small learned reranker to combine the signals. Do not hand-tune a single blending weight by intuition: measure retrieval quality on a labelled set of real queries and let that decide.
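Reciprocal Rank Fusion is simple enough to sketch in a few lines. Each document earns 1 / (k + rank) from every ranked list it appears in, and the sums are sorted. The function name, the document IDs and the two input rankings below are hypothetical; `k = 60` is a commonly used default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings, made up for this example.
lexical = ["art17", "art5", "art19"]    # e.g. BM25 output
semantic = ["art19", "art17", "art21"]  # e.g. vector search output

fused = reciprocal_rank_fusion([lexical, semantic])
```

Because RRF works on ranks rather than raw scores, there is no need to normalise BM25 scores against cosine similarities before combining them.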
Where the benchmarks hide the problem
Public RAG benchmarks (including the ones LLM vendors promote) tend to be adversarial for lexical search — queries phrased in loose natural language, corpora without entity IDs, no structured metadata. They flatter vector search because they are designed to.
Enterprise retrieval looks nothing like these benchmarks. Your queries contain store codes. Your corpus has regulatory article numbers. Your users want documents filtered by their access level.
If your model selection is driven by MTEB leaderboard position, you are optimising for the wrong thing.
The boring conclusion
The teams winning in production are using hybrid retrieval with a healthy dose of BM25 and structured filtering. They are not bragging about it because it is unsexy. They are shipping working agents while louder competitors debug their pure-vector pipelines.
If your AI project is stuck because “the model hallucinates”, check the retrieval layer before the model. In our experience, seven out of ten times, the retrieval was the problem all along.
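One way to check the retrieval layer in isolation is to measure recall@k on a handful of queries where you already know which documents should come back. The sketch below assumes you have such a labelled set. The function names and the three toy runs are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc IDs that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_list, relevant_set) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Toy evaluation runs, made up for this example:
# (what the retriever returned, what a human marked as relevant).
runs = [
    (["d3", "d1", "d9"], {"d1"}),        # the relevant doc was found
    (["d4", "d7", "d2"], {"d2", "d8"}),  # one of two relevant docs found
    (["d5", "d6", "d0"], {"d11"}),       # retrieval missed entirely
]
score = mean_recall_at_k(runs, k=3)
```

If the right document is not in the top-K to begin with, no amount of prompt or model tuning will produce a correct answer from it.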
If you want a second opinion on your RAG stack — write to [email protected]. We do half-day retrieval audits.