Building RAG retrieval that doesn't quietly fail on keywords
The views and opinions expressed here are my own and do not reflect those of my employer.
A few months ago I built a quick prototype. Semantic search over a stack of internal documentation and ticket history. The usual recipe: chunk, embed, throw it in pgvector, expose a small API, post a link in a team channel.
Within two days, a colleague tried to find a specific ticket by its identifier (something like JOB-1245-RB) and got back a confidently ranked list of completely unrelated results. The actual ticket was at rank 47, well past anywhere a real retrieval pipeline would ever look.
That was the moment I stopped treating vector search as a search engine.
What dense embeddings actually do
When the model embeds JOB-1245-RB, it tokenizes the string into subwords, looks up each token's dense representation, and pools them into a single 1500-ish-dimensional vector. That vector lives in a semantic space carved out by training on natural-language pairs. The model has seen "job," some digits, and the letter pair "RB" in millions of contexts that have nothing to do with our ticket system.
The result is a vector that points roughly toward "things involving work and identifiers." The exact string, the only thing my colleague actually cared about, gets averaged across the embedding. Any nearest-neighbor search over that space returns semantic neighbors, not exact-string matches.
BM25 doesn't have this problem. An inverted index sees JOB-1245-RB as a literal token, finds the four documents that contain it, ranks them with its TF-IDF-style weighting, and returns them in single-digit milliseconds. The algorithm dates from the mid-90s and it does not care about embeddings.
This is the gap where pure-RAG architectures quietly fail. Semantic search is genuinely good at "explain how our auth flow works" or "what did we decide about retries last quarter." It is bad at "find the document with this string in it." Your users do both kinds of search and they do not tell you which is which.
Hybrid retrieval, the boring answer
The standard fix is hybrid retrieval. Run BM25 and vector search in parallel, fuse the ranked lists. Reciprocal Rank Fusion (RRF) is the merging algorithm of choice because it doesn't require normalizing scores between two completely different scoring systems.
```csharp
using System.Collections.Generic;
using System.Linq;

// Each document earns 1 / (k + rank) per list it appears in, summed
// across lists. k = 60 is the constant from the original RRF paper.
public static IEnumerable<string> ReciprocalRankFusion(
    IEnumerable<List<string>> rankedLists, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var list in rankedLists)
    {
        for (int i = 0; i < list.Count; i++)
        {
            // i is zero-based, so rank = i + 1.
            scores[list[i]] = scores.GetValueOrDefault(list[i])
                + 1.0 / (k + i + 1);
        }
    }
    return scores
        .OrderByDescending(kv => kv.Value)
        .Select(kv => kv.Key);
}
```
That's the whole algorithm. Documents that show up in both lanes at decent ranks rocket to the top. Documents that only appear in one lane still get represented if they ranked well.
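A quick sanity check with synthetic lists (the document IDs here are made up, just to show the math):

```csharp
// "doc-b" sits near the top of both lanes, so it overtakes "doc-a",
// which tops only the sparse lane.
var fused = ReciprocalRankFusion(new List<List<string>>
{
    new() { "doc-a", "doc-b", "doc-c" },   // sparse (BM25) lane
    new() { "doc-b", "doc-d", "doc-a" },   // dense (vector) lane
}).ToList();
// fused: ["doc-b", "doc-a", "doc-d", "doc-c"]
```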
Failure mode I learned the hard way. RRF is only as good as both of its lanes. If your BM25 index has no stemming, no stopword filtering, and no synonym map, the sparse lane returns noise and drags the merged ranking down. A day spent properly configuring Postgres tsvector paid off more than a week of vector tuning did. Hybrid is not a free upgrade. It is two systems that both need to actually work.
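For concreteness, here is roughly what the sparse lane looks like once a proper configuration is in place. This is a sketch with Npgsql; the `docs_english` configuration name is an assumption, standing in for whatever you build with `CREATE TEXT SEARCH CONFIGURATION` (stemming plus a synonym dictionary), and the `tsv` column must be built with that same configuration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Npgsql;

public static async Task<List<Guid>> SparseSearchAsync(
    NpgsqlDataSource db, string query, CancellationToken ct)
{
    // 'docs_english' is the custom text search configuration created
    // ahead of time (illustrative name, not a Postgres built-in).
    const string sql = """
        SELECT id FROM documents
        WHERE tsv @@ websearch_to_tsquery('docs_english', @query)
        ORDER BY ts_rank(tsv, websearch_to_tsquery('docs_english', @query)) DESC
        LIMIT 50;
        """;

    await using var cmd = db.CreateCommand(sql);
    cmd.Parameters.AddWithValue("query", query);

    var ids = new List<Guid>();
    await using var reader = await cmd.ExecuteReaderAsync(ct);
    while (await reader.ReadAsync(ct))
        ids.Add(reader.GetGuid(0));
    return ids;
}
```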
Rerank what you retrieved
Hybrid gets the right document into the top 50. A cross-encoder gets it into the top 3.
A bi-encoder (your regular embedding model) encodes the query and the document separately and compares vectors. A cross-encoder takes them together as input and produces a single relevance score. It can attend across both texts simultaneously, which is more expressive. It cannot be precomputed, which is more expensive.
The typical pattern: pull top-50 from hybrid, run a small cross-encoder over those 50 query-document pairs, re-rank to top-5 or top-10. A MiniLM-class model runs locally in tens of milliseconds. Hosted rerankers from Cohere or Voyage add a network hop and a bill but spare you the GPU.
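A sketch of that rerank step, assuming a hypothetical `ICrossEncoder` you would back with a local ONNX MiniLM model or a hosted reranker API; `RetrievalResult` here just mirrors the id/content/score shape the rest of the pipeline uses:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public sealed record RetrievalResult(Guid Id, string Content, double Score);

// Hypothetical interface; implement against whatever model or API you use.
public interface ICrossEncoder
{
    Task<float[]> ScoreAsync(
        string query, IReadOnlyList<string> documents, CancellationToken ct);
}

public static async Task<IReadOnlyList<RetrievalResult>> RerankAsync(
    ICrossEncoder encoder,
    string query,
    IReadOnlyList<RetrievalResult> candidates, // top-50 from hybrid
    int topK,
    CancellationToken ct)
{
    // The cross-encoder sees query and document together, so these
    // scores cannot be precomputed the way embeddings can.
    var scores = await encoder.ScoreAsync(
        query, candidates.Select(c => c.Content).ToList(), ct);

    return candidates
        .Zip(scores, (candidate, score) => (candidate, score))
        .OrderByDescending(pair => pair.score)
        .Take(topK)
        .Select(pair => pair.candidate)
        .ToList();
}
```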
Failure mode. The throughput cliff is real. At top-50 latency stays inside a normal interactive budget. At top-200 the cross-encoder becomes the bottleneck. Don't reach for a reranker before squeezing the hybrid retrieval below it. Rerankers amplify good candidates. They cannot fix bad ones.
Query expansion for the queries users write badly
The third approach is to rewrite the query before retrieving. Send the user's query to a small fast model, ask for three or four alternate phrasings, retrieve for all of them, dedupe and re-rank.
This is cheap (one LLM call, a few cents per thousand queries) and effective on ambiguous queries. Users write search queries badly. They under-specify, over-specify, use the wrong jargon, and ask questions when keywords would work better.
Failure mode. Lazy prompting drifts the intent. "How do I reset my password" gets expanded into "account security best practices" by a model that's trying too hard, and now you're retrieving the wrong topic. Constrain the prompt to alternate phrasings only, same intent, no broadening. Evaluate the expansion against a small set of known-good queries before turning it on for everyone.
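A sketch of a constrained expansion step, with a hypothetical `ILlmClient` standing in for whatever small, fast model you call:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical client interface for the expansion model.
public interface ILlmClient
{
    Task<string> CompleteAsync(string prompt, CancellationToken ct);
}

public static async Task<IReadOnlyList<string>> ExpandQueryAsync(
    ILlmClient llm, string query, CancellationToken ct)
{
    // Constrained on purpose: alternate phrasings only, same intent,
    // no broadening into neighboring topics.
    var prompt =
        "Rewrite the search query below as exactly 3 alternate phrasings. " +
        "Preserve the original intent. Do not broaden or generalize the topic. " +
        "Return one phrasing per line and nothing else.\n\n" +
        $"Query: {query}";

    var response = await llm.CompleteAsync(prompt, ct);

    // Retrieve for the original plus each rewrite, then dedupe and
    // fuse the result lists (RRF works here too).
    return response
        .Split('\n', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
        .Take(3)
        .Prepend(query)
        .Distinct(StringComparer.OrdinalIgnoreCase)
        .ToList();
}
```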
Buy versus build
Every major cloud now ships managed hybrid retrieval with reranking baked in. If retrieval is not your differentiator, the case for buying has gotten strong.
| Service | Hybrid built-in | Reranker | Self-host | Lock-in |
|---|---|---|---|---|
| Amazon Kendra GenAI Index | Yes | Yes | No | High (AWS) |
| Azure AI Search | Yes | Optional | No | High (Azure) |
| OpenSearch + Neural Search | Yes | Optional | Yes | Low |
| Elastic + ELSER | Three-way | Optional | Yes | Medium |
Kendra is the most "just works" option, pre-tuned, around a few hundred dollars a month at the entry tier. Azure AI Search gives you the most knobs and is the obvious choice if you are already in that ecosystem. OpenSearch with Neural Search is the cheaper self-hostable AWS option. Elastic plus ELSER does three-way hybrid (BM25 plus dense plus learned sparse) and is the move if you already have an Elastic license.
For a small team where retrieval is not a moat, buy one of these. The engineering hours you save go into the application layer where users actually feel the difference.
What I would build tomorrow morning
For a team shipping RAG this quarter, with Postgres already in the stack and no managed-service budget:
- Use Postgres `tsvector` for sparse retrieval and pgvector for dense. One database, two indexes.
- Implement RRF in application code. The whole function is fifteen lines.
- Add a regex router that catches identifier-shaped queries (error codes, ticket numbers, SKUs, version strings) and routes them straight to a direct lookup instead of either retrieval lane. This is the single highest-leverage hour of work in the entire pipeline; a sketch follows this list.
- Skip the cross-encoder until you've measured precision and have a reason to add latency. Do not pre-optimize.
- Add query expansion last, when you have enough query log data to know which queries are actually failing.
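Here is a sketch of that router. The patterns are illustrative only; the real ones have to come from your own identifier formats.

```csharp
using System.Text.RegularExpressions;

public static class IdentifierRouter
{
    // Example patterns; replace with your domain's ID formats.
    private static readonly Regex[] Patterns =
    {
        new(@"\b[A-Z]{2,5}-\d+(-[A-Z]+)?\b"), // ticket IDs like JOB-1245-RB
        new(@"\b0x[0-9A-Fa-f]{4,}\b"),        // hex error codes
        new(@"\bv\d+\.\d+\.\d+\b"),           // version strings like v2.14.1
    };

    // Returns the matched identifier for a direct lookup, or null to
    // fall through to the hybrid retrieval lanes.
    public static string? Match(string query)
    {
        foreach (var pattern in Patterns)
        {
            var match = pattern.Match(query);
            if (match.Success) return match.Value;
        }
        return null;
    }
}
```

The direct lookup behind a match is then a single parameterized query, something like `WHERE content ILIKE '%' || @id || '%'` ordered by recency, rather than either retrieval lane.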
If you want a head start, drop the prompt below into Claude, Cursor, or your LLM of choice. It scaffolds the whole pipeline as a single C# service with raw SQL against Postgres. You'll need to adjust the schema and regex patterns to match your domain, but it gets you most of the way there.
```text
You are building a RAG retrieval service in C# / .NET 10 with Postgres as the only datastore.
Schema:
- A `documents` table with columns: id (uuid), content (text),
embedding (vector(1536)), tsv (tsvector), updated_at (timestamptz).
- `tsv` has a GIN index. `embedding` has an HNSW index via pgvector.
Build a `HybridRetrievalService` class with one public method:
Task<IReadOnlyList<RetrievalResult>> RetrieveAsync(string query, int topK = 10)
Behavior:
1. Run the query through a regex router that detects identifier-shaped patterns
(ticket IDs like JOB-\d+-[A-Z]+, error codes like 0x[0-9A-F]+, version
strings like v\d+\.\d+\.\d+, etc). Patterns must be configurable via an
injected IRegexRouterConfig. If any pattern matches, perform a direct
ILIKE lookup against `content`, return up to topK results ordered by
updated_at DESC. Skip steps 2-3.
2. Otherwise, run two queries in parallel using Task.WhenAll:
a. Sparse (BM25-style) via Postgres FTS:
SELECT id FROM documents
WHERE tsv @@ plainto_tsquery('english', @query)
ORDER BY ts_rank(tsv, plainto_tsquery('english', @query)) DESC
LIMIT 50;
b. Dense via pgvector. Use an injected IEmbeddingClient.EmbedAsync(string)
to get the query vector, then:
SELECT id FROM documents
ORDER BY embedding <=> @query_embedding
LIMIT 50;
3. Fuse the two ranked lists with Reciprocal Rank Fusion (k=60),
return top K with id, content, and the fused RRF score.
Constraints:
- Use Npgsql or Dapper. Raw SQL is fine; skip EF Core for query execution.
- Parameterize every query.
- All public methods accept a CancellationToken.
- Include xUnit tests covering: regex router triggers correctly on known
patterns, RRF math is correct on synthetic ranked lists, and the hybrid
path runs when no pattern matches.
- Do NOT add reranking, query expansion, or fine-tuned embeddings.
Save those for a later iteration.
```
That stack handles the queries pure vector search quietly fails on, costs nothing extra to run, and is debuggable when something breaks. The exotic options (custom-fine-tuned embeddings, LLM-as-retriever, inverted HyDE) are tools to reach for when this baseline is measurably falling short, not before.
Vector search is a useful primitive. It is not a search engine. Treat the retrieval layer like the engineering problem it is, with multiple tools sized to multiple shapes of query, and your RAG system will get embarrassingly better than the one you shipped by importing pgvector and hoping.