Brennan Hitchcock

If your RAG candidate says "cosine similarity" first, end the interview

Brennan Hitchcock — Mon, 18 May 2026 20:41:19 GMT

If your RAG candidate says "cosine similarity" before asking what the queries look like, end the interview.

I do not mean that as a punchline. The order is telling you something. A candidate who reaches for the embedding model before they reach for the query log is solving a problem they read about, not the one in front of them.

Last week I wrote about how retrieval quietly fails. Vector search is a primitive, not a search engine, and the systems that hold up under real traffic hybridize, route, and re-rank around that fact. A few people asked the obvious follow-up. How do you hire someone who already knows this?

I do not have a clean answer. I have six questions. They aren't a checklist so much as a set of tripwires, places where the candidate's instincts either match how the system fails in production or they do not. Most of them have nothing to do with embeddings. The last one is the question every candidate should want you to ask.

Here is what I would run, in order.

1. What does your query log look like?

Ask this near the start of the conversation, after they have described a RAG system they have built. The answer is binary. Either they have a specific picture in their head, or they do not.

A good answer sounds something like this. Most queries are short. About a third have an identifier in them. There is a long tail of weird stuff people type when they are frustrated, and last quarter we started bucketing the tail because most of the support escalations were coming from one shape of query. The candidate is describing a thing they have looked at, recently, with their own eyes.

A weak answer pivots. They start talking about chunking strategy, or which embedding model they chose, or how they tuned chunk overlap. The query log is offstage somewhere, a thing other people deal with.

Production retrieval is mostly a literacy problem. You cannot fix what you have not bothered to read. Candidates who skip this step build systems that work fine on the examples in the original paper and then fall over on whatever the support team is actually fielding.

If they have never opened the log, that is the answer. Move on.
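For anyone who has not yet done this with their own log, a first pass can be as small as the sketch below. It is only an illustration: the bucket names, the identifier regex, and the sample queries are all assumptions, and a real log will need its own shapes.

```python
import re
from collections import Counter

# Hypothetical identifier shapes; replace with whatever your own log contains.
IDENTIFIER = re.compile(r"\b[A-Z]{2,}-\d+(?:-[A-Z]+)?\b|\b0x[0-9A-Fa-f]+\b")
QUESTION_WORDS = {"how", "what", "why", "when", "where", "who"}


def bucket(query):
    """Assign a query to one coarse shape bucket."""
    text = query.strip()
    if IDENTIFIER.search(text):
        return "identifier"
    words = text.split()
    if len(words) <= 3:
        return "short"
    if text.endswith("?") or words[0].lower() in QUESTION_WORDS:
        return "question"
    return "other"


def summarize(queries):
    """Return (bucket, share of log) pairs, largest bucket first."""
    counts = Counter(bucket(q) for q in queries)
    total = sum(counts.values())
    return [(name, n / total) for name, n in counts.most_common()]


# Made-up sample standing in for a real query log.
sample_log = [
    "JOB-1245-RB",
    "reset password",
    "how do I rotate the API key for staging",
    "error 0x80070005 on deploy",
    "vpn",
    "invoice export keeps timing out after the upgrade",
]
for name, share in summarize(sample_log):
    print(f"{name:<10} {share:.0%}")
```

The point is not the buckets themselves. It is that a table like this takes minutes to produce, and a candidate who has produced one can tell you which row surprised them.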

2. Show me your eval set.

Not a benchmark. The actual artifact they opened on Monday morning to see whether Friday's change made the product better or worse.

The strongest answer is a candidate who pulls up a CSV. Two or three hundred real queries, labels for what good retrieval looks like on each, a column flagging the hard cases. They can walk you through the queries they argued over, the time they realized their eval set had a bias baked in because it was assembled by one person on a Tuesday, the change that took them by surprise. There is a history with this file.

The middle answer is a candidate who has an eval but it is the publisher's example queries, or whatever shipped with the framework. They have measured something other than their product.

The weak answer is some variation of "we look at the responses and they seem fine." Or precision and recall numbers with no context for what was being measured against. Or the confident claim that the LLM judges its own answers, which usually means nobody has audited what the judge is doing.

This is a proxy question for whether retrieval is being treated as a science or as a config file. The candidates who treat it as a science have already lost arguments to their own data. That is what you want.
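To make "the CSV" concrete, here is a rough sketch of what scoring against such a file might look like. The column names (`query`, `relevant_ids`, `hard`), the document ids, and `fake_retrieve` are all hypothetical stand-ins; recall@k is one reasonable metric, not the only one.

```python
import csv
import io

# Hypothetical eval file: one row per real query, labelled by hand.
EVAL_CSV = """query,relevant_ids,hard
reset password,doc-7,0
JOB-1245-RB,doc-42,1
retry policy decision,doc-3;doc-9,0
"""


def load_eval(text):
    """Parse the CSV into dicts with a list of relevant ids and a hard flag."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "query": row["query"],
            "relevant": row["relevant_ids"].split(";"),
            "hard": row["hard"] == "1",
        })
    return rows


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def evaluate(rows, retrieve, k=10):
    """Mean recall@k over all rows, and separately over rows flagged hard."""
    scores = [(r["hard"], recall_at_k(retrieve(r["query"]), r["relevant"], k))
              for r in rows]
    overall = sum(s for _, s in scores) / len(scores)
    hard = [s for is_hard, s in scores if is_hard]
    return {"overall": overall, "hard": sum(hard) / len(hard) if hard else None}


def fake_retrieve(query):
    """Canned results standing in for the real retriever."""
    canned = {
        "reset password": ["doc-7", "doc-1"],
        "JOB-1245-RB": ["doc-5", "doc-8"],
        "retry policy decision": ["doc-3", "doc-2"],
    }
    return canned.get(query, [])


report = evaluate(load_eval(EVAL_CSV), fake_retrieve, k=2)
print(report)
```

Reporting the hard rows separately is the design choice worth copying: an overall average can look healthy while the identifier-shaped queries score zero.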

3. What is the worst question your system answers confidently?

The word doing the work here is "confidently." Anyone can list the queries that returned nothing. You want the cases where the system returned an answer with full conviction and the answer was wrong.

A good candidate has at least one of these in their pocket, told with the slight wince of someone who has not fully forgiven themselves for letting it ship. The user asked about policy and got an answer from a deprecated doc. A finance question came back with numbers from the wrong fiscal year. A name collision sent someone to the wrong customer's record. The example is specific, the log line is specific, sometimes there is a specific complaint attached.

A weak candidate gives you a category. We had some hallucinations. The model sometimes gets things wrong. We use temperature zero. General statements about LLM behaviour, because they have not lived in any of the specific failures. If they had, they would still be a little upset about one of them.

This is also a humility test. Candidates who can talk about a confident wrong answer have made peace with the fact that the system is going to do this no matter how well it is tuned. The ones who cannot tend to believe the next prompt tweak will fix it, and they argue against guardrails on the grounds that the model should know better. Those candidates are expensive to manage.

4. When would you not reach for vector search?

This one is short and it discriminates fast. A strong candidate has a list. Exact match against identifiers. Anything regulatory where the answer has to come from a specific named document. Structured filtering on date ranges, status flags, and low-cardinality fields where the user has already given you the answer in the query. They will sometimes pre-empt you and say something like "for half the traffic the better tool is a SQL query, and we route there before anything else fires."

A weak candidate pauses. They might mumble about hybrid search, or note that BM25 is still useful, or wave at reranking. The shape of the pause is the signal: they are searching for the right answer instead of recognising it.

What the question is testing is not the list itself. It is whether they understand vector search as one tool among several, with specific failure modes that other tools handle better. Candidates who already see it that way have probably wired a regex router in front of their retriever. For everyone else, every problem they encounter is going to look like an embedding tuning problem.

5. What does the failure look like to the user?

This pulls the candidate out of the engineer frame, and it is the question most likely to surprise them.

The answers I am listening for sit at the seam between system design and product. What does the user see when the retrieved documents do not contain the answer at all. What does the user see when the documents do contain it but the LLM rewords the answer into something they misread. What happens when somebody asks a question the system was never going to be able to handle and gets back a long, plausible, completely hallucinated paragraph. Whose desk does the support ticket land on, and how does the user even know something has gone wrong.

A strong candidate has thought about this and probably argued with someone about it. They might tell you they pushed for an "I do not know" response path and got overruled by product. Or describe a UI tweak they made to surface which document an answer came from, because users were quoting the bot's response in meetings and getting embarrassed. They have opinions about what should happen at the moment the system fails, because they have watched it fail.

A weak candidate stays at the model. The LLM hallucinated. The retrieval missed. They will not have a clear picture of what the user did next, because they have been thinking of the system as a model with some plumbing instead of a product with a person at the other end. That is the candidate you can hire as a research engineer. They are not yet the person you want owning a production retrieval system.

6. Walk me through a recent retrieval bug. Symptom to fix.

This is the universal senior engineer question. It also discriminates most sharply for retrieval work, because so much of the job is debugging things that did not throw an exception.

What I want is specificity. The symptom should be concrete, ideally something a user reported or something they spotted in a dashboard. The investigation should have a shape: a hypothesis they formed, an instrumentation step that confirmed or killed it, an obvious culprit they ruled out before they found the less obvious one. They can usually tell you the wrong answer they had in their head before they got to the right one, which is the tell that they are remembering the bug rather than constructing it.

The fix should be small and explained in plain language. The regression test should exist. If they mention that the bug came back six weeks later in a slightly different form, even better. That is how you know it was real, and not a war story that got polished for interviews.

Vague answers here are disqualifying in a way the other questions are not. Everyone has at least one retrieval bug if they have shipped one of these systems. If a candidate cannot produce one, the most generous reading is that they have not shipped. Less generously, they shipped without noticing what was wrong.

The question every candidate should want you to ask

If you are reading this from the other side of the table, the question to hope for is the last one.

The question is not easy, but it lets you do the thing you have actually spent the most time on. A retrieval bug story is the closest thing this work has to a portfolio piece. It contains evidence of the production access you have had, the metrics you watched, the colleagues you argued with, the decision you made under time pressure. It is what an interview is trying to extract anyway, told in a form that does not feel like an interrogation.

If you have one, prepare it the way you would prepare a code sample. Start at the user complaint, walk through what you checked first and why you were wrong, show the moment the real cause clicked, end with the smallest fix that solved it and the regression test you wrote to keep it from coming back.

If you do not have one yet, that is also useful information about where you are. Build something. Ship it to ten people. Watch a retrieval bug happen on a query you did not anticipate. The interview will get easier, and so will the work.

The retrieval part of RAG is becoming its own discipline, and it does not look like the role most teams are hiring for. The best people I have worked with on these systems are search engineers with some ML literacy, not the other way around. They know classical IR, they instrument production by reflex, and they are comfortable being wrong in measurable ways, which is the rarest qualification of any of them.

Hire for that and the rest is teachable. Hire for embedding intuition first and you will end up with a beautifully tuned system that quietly fails on the queries your users actually care about.

Building RAG retrieval that doesn't quietly fail on keywords

Brennan Hitchcock — Mon, 11 May 2026 14:49:56 GMT

The views and opinions expressed here are my own and do not reflect those of my employer.

A few months ago I built a quick prototype. Semantic search over a stack of internal documentation and ticket history. The usual recipe: chunk, embed, throw it in pgvector, expose a small API, post a link in a team channel.

Within two days, a colleague tried to find a specific ticket by its identifier (something like JOB-1245-RB) and got back a confidently ranked list of completely unrelated results. The actual ticket was at rank 47, well past anywhere a real retrieval pipeline would ever look.

That was the moment I stopped treating vector search as a search engine.

What dense embeddings actually do

When the model embeds JOB-1245-RB, it tokenizes the string into subwords, computes a contextual dense representation for each token, and pools them into a single 1500-ish-dimensional vector. That vector lives in a semantic space carved out by training on natural-language pairs. The model has seen "job," some digits, and the letter pair "RB" in millions of contexts that have nothing to do with our ticket system.

The result is a vector that points roughly toward "things involving work and identifiers." The exact string, the only thing my colleague actually cared about, gets averaged away inside the embedding. Any nearest-neighbor search over that space returns semantic neighbors, not exact-string matches.

BM25 doesn't have this problem. As long as the analyzer keeps the identifier whole, an inverted index sees JOB-1245-RB as a literal token, finds the four documents that contain it, scores them with BM25's term-frequency and inverse-document-frequency weighting, and returns them in single-digit milliseconds. The algorithm dates from the early 1990s and it does not care about embeddings.
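A toy inverted index makes the contrast visible. This is a minimal sketch, not how any particular engine does it: the tokenizer is an assumption (whitespace split, punctuation trimmed, hyphenated identifiers kept whole), the documents are invented, and BM25 scoring is left out so that only the lookup step is on display.

```python
from collections import defaultdict


def tokenize(text):
    """Split on whitespace and trim punctuation, keeping hyphenated ids whole."""
    tokens = []
    for raw in text.split():
        tok = raw.strip(".,;:!?()\"'").lower()
        if tok:
            tokens.append(tok)
    return tokens


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index


def lookup(index, term):
    """Exact-token lookup: a document either contains the term or it does not."""
    return sorted(index.get(term.lower(), set()))


# Invented tickets; note the near-miss identifier in t3.
docs = {
    "t1": "JOB-1245-RB: nightly export fails on retry.",
    "t2": "Follow-up to JOB-1245-RB, export still failing.",
    "t3": "JOB-1254-RB batch job stuck in queue.",
    "t4": "How the export job retries after a timeout.",
}
index = build_index(docs)
print(lookup(index, "JOB-1245-RB"))  # ['t1', 't2']
```

The near-miss ticket and the semantically related one never enter the result set, which is exactly the behaviour an embedding space cannot promise.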

This is the gap where pure-RAG architectures quietly fail. Semantic search is genuinely good at "explain how our auth flow works" or "what did we decide about retries last quarter." It is bad at "find the document with this string in it." Your users do both kinds of search and they do not tell you which is which.

Hybrid retrieval, the boring answer

The standard fix is hybrid retrieval. Run BM25 and vector search in parallel, fuse the ranked lists. Reciprocal Rank Fusion (RRF) is the merging algorithm of choice because it doesn't require normalizing scores between two completely different scoring systems.

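using System.Collections.Generic;
using System.Linq;

// Document ids are assumed to be strings here; swap in Guid if that is your key type.
public static IEnumerable<string> ReciprocalRankFusion(
    IEnumerable<IReadOnlyList<string>> rankedLists, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var list in rankedLists)
    {
        for (int i = 0; i < list.Count; i++)
        {
            // i is zero-based, so the top-ranked document contributes 1 / (k + 1).
            scores[list[i]] = scores.GetValueOrDefault(list[i])
                            + 1.0 / (k + i + 1);
        }
    }
    // Highest fused score first.
    return scores
        .OrderByDescending(kv => kv.Value)
        .Select(kv => kv.Key);
}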

That's the whole algorithm. Documents that show up in both lanes at decent ranks rocket to the top. Documents that only appear in one lane still get represented if they ranked well.
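To see the arithmetic on a small case, here is the same fusion sketched in Python with two invented three-item lanes. The document ids are placeholders; k = 60 and the 1-based rank match the function above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Sum 1 / (k + rank) per document across lanes; rank starts at 1."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


sparse = ["a", "b", "c"]   # invented BM25 lane
dense = ["c", "a", "d"]    # invented vector lane

fused = reciprocal_rank_fusion([sparse, dense])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Document a scores 1/61 + 1/62 and c scores 1/63 + 1/61, so both land above b and d, which each appear in only one lane. The fused order is a, c, b, d.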

Failure mode I learned the hard way. RRF is only as good as both of its lanes. If your BM25 index has no stemming, no stopword filtering, and no synonym map, the sparse lane returns noise and drags the merged ranking down. A day spent properly configuring Postgres tsvector paid off more than a week of vector tuning did. Hybrid is not a free upgrade. It is two systems that both need to actually work.

Rerank what you retrieved

Hybrid gets the right document into the top 50. A cross-encoder gets it into the top 3.

A bi-encoder (your regular embedding model) encodes the query and the document separately and compares vectors. A cross-encoder takes them together as input and produces a single relevance score. It can attend across both texts simultaneously, which is more expressive. It cannot be precomputed, which is more expensive.

The typical pattern: pull top-50 from hybrid, run a small cross-encoder over those 50 query-document pairs, re-rank to top-5 or top-10. A MiniLM-class model runs locally in tens of milliseconds. Hosted rerankers from Cohere or Voyage add a network hop and a bill but spare you the GPU.

Failure mode. The throughput cliff is real. At top-50 latency stays inside a normal interactive budget. At top-200 the cross-encoder becomes the bottleneck. Don't reach for a reranker before squeezing the hybrid retrieval below it. Rerankers amplify good candidates. They cannot fix bad ones.
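The shape of that pattern, sketched with the scoring function injected so the cap on candidates is explicit. `overlap_score` is a deliberately crude stand-in for a cross-encoder, not a real model, and the candidate texts are invented; in practice `score_pair` would wrap whichever local or hosted reranker you use.

```python
def rerank(query, candidates, score_pair, keep=5, max_candidates=50):
    """Score at most max_candidates (query, doc) pairs and keep the best few."""
    pool = list(candidates)[:max_candidates]  # cap the expensive step
    scored = [(score_pair(query, doc), doc) for doc in pool]
    # Stable sort on score only, so ties keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:keep]]


def overlap_score(query, doc):
    """Toy relevance score: fraction of query words present in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


candidates = [
    "how retries work in the export job",
    "reset your password",
    "export job retry policy and backoff",
]
top = rerank("export job retry policy", candidates, overlap_score, keep=2)
print(top)
```

Keeping `max_candidates` as a parameter makes the latency budget a visible decision. Raising it from 50 to 200 is a one-character change with a fourfold scoring cost.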

Query expansion for the queries users write badly

The third approach is to rewrite the query before retrieving. Send the user's query to a small fast model, ask for three or four alternate phrasings, retrieve for all of them, dedupe and re-rank.

This is cheap (one LLM call, a few cents per thousand queries) and effective on ambiguous queries. Users write search queries badly. They under-specify, over-specify, use the wrong jargon, and ask questions when keywords would work better.

Failure mode. Lazy prompting drifts the intent. "How do I reset my password" gets expanded into "account security best practices" by a model that's trying too hard, and now you're retrieving the wrong topic. Constrain the prompt to alternate phrasings only, same intent, no broadening. Evaluate the expansion against a small set of known-good queries before turning it on for everyone.
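A minimal sketch of the retrieve-for-all-phrasings step, with the model call and the retriever both injected. `fake_expand` and `fake_retrieve` are hypothetical stand-ins with canned data, and fusing the per-phrasing lists with RRF is my choice for the "dedupe and re-rank" step rather than the only option.

```python
def retrieve_with_expansion(query, expand, retrieve, k=60, top_k=10):
    """Retrieve for the query plus its alternate phrasings, then fuse with RRF."""
    phrasings = [query]
    for alt in expand(query):
        if alt not in phrasings:  # drop duplicate phrasings
            phrasings.append(alt)

    scores = {}
    for phrasing in phrasings:
        for rank, doc_id in enumerate(retrieve(phrasing), start=1):
            # Keying by doc id dedupes documents found by several phrasings.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def fake_expand(query):
    """Stand-in for the small-model call: same intent, different wording."""
    return ["reset my password", "password reset steps"]


def fake_retrieve(phrasing):
    """Stand-in retriever returning canned ranked doc ids."""
    canned = {
        "how do I reset my password": ["kb-12", "kb-40"],
        "reset my password": ["kb-12", "kb-7"],
        "password reset steps": ["kb-7", "kb-12"],
    }
    return canned.get(phrasing, [])


results = retrieve_with_expansion(
    "how do I reset my password", fake_expand, fake_retrieve)
print(results)
```

Always including the original query as the first phrasing means a drifting expansion can add noise but cannot remove what the user actually typed.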

Buy versus build

Every major cloud now ships managed hybrid retrieval with reranking baked in. If retrieval is not your differentiator, the case for buying has gotten strong.

Service	Hybrid built-in	Reranker	Self-host	Lock-in
Amazon Kendra GenAI Index	Yes	Yes	No	High (AWS)
Azure AI Search	Yes	Optional	No	High (Azure)
OpenSearch + Neural Search	Yes	Optional	Yes	Low
Elastic + ELSER	Three-way	Optional	Yes	Medium

Kendra is the most "just works" option, pre-tuned, around a few hundred dollars a month at the entry tier. Azure AI Search gives you the most knobs and is the obvious choice if you are already in that ecosystem. OpenSearch with Neural Search is the cheaper self-hostable AWS option. Elastic plus ELSER does three-way hybrid (BM25 plus dense plus learned sparse) and is the move if you already have an Elastic license.

For a small team where retrieval is not a moat, buy one of these. The engineering hours you save go into the application layer where users actually feel the difference.

What I would build tomorrow morning

For a team shipping RAG this quarter, with Postgres already in the stack and no managed-service budget:

1. Use Postgres tsvector for sparse retrieval and pgvector for dense. One database, two indexes.
2. Implement RRF in application code. The whole function is fifteen lines.
3. Add a regex router that catches identifier-shaped queries (error codes, ticket numbers, SKUs, version strings) and routes them straight to a direct lookup instead of either retrieval lane. This is the single highest-leverage hour of work in the entire pipeline.
4. Skip the cross-encoder until you've measured precision and have a reason to add latency. Do not pre-optimize.
5. Add query expansion last, when you have enough query log data to know which queries are actually failing.
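The regex router is small enough to sketch in full. This is an illustration under assumptions: the three patterns mirror the examples used in this post (JOB-style ticket ids, hex error codes, semantic version strings), the names `IDENTIFIER_PATTERNS` and `route` are made up, and your own identifier shapes will differ.

```python
import re

# Example patterns only; make these configurable for your own domain.
IDENTIFIER_PATTERNS = {
    "ticket": re.compile(r"\bJOB-\d+-[A-Z]+\b"),
    "error_code": re.compile(r"\b0x[0-9A-F]+\b"),
    "version": re.compile(r"\bv\d+\.\d+\.\d+\b"),
}


def route(query):
    """Return ("lookup", kind, matched_text) or ("hybrid", None, None)."""
    for kind, pattern in IDENTIFIER_PATTERNS.items():
        match = pattern.search(query)
        if match:
            # Identifier found: skip both retrieval lanes and look it up directly.
            return ("lookup", kind, match.group(0))
    return ("hybrid", None, None)


for q in ["where is JOB-1245-RB", "crash with 0x1F4A",
          "upgrade to v2.10.3 broke login", "explain how our auth flow works"]:
    print(q, "->", route(q))
```

Returning the matched text, not just a flag, lets the lookup search for the identifier alone instead of the whole sentence the user wrapped around it.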

If you want a head start, drop the prompt below into Claude, Cursor, or your LLM of choice. It scaffolds the whole pipeline as a single C# service with raw SQL against Postgres. You'll need to adjust the schema and regex patterns to match your domain, but it gets you most of the way there.

You are building a RAG retrieval service in C# / .NET 10 with Postgres as the only datastore.

Schema:
- A `documents` table with columns: id (uuid), content (text), 
  embedding (vector(1536)), tsv (tsvector), updated_at (timestamptz).
- `tsv` has a GIN index. `embedding` has an HNSW index via pgvector.

Build a `HybridRetrievalService` class with one public method:

  Task<IReadOnlyList<RetrievalResult>> RetrieveAsync(string query, int topK = 10)

Behavior:

1. Run the query through a regex router that detects identifier-shaped patterns
   (ticket IDs like JOB-\d+-[A-Z]+, error codes like 0x[0-9A-F]+, version 
   strings like v\d+\.\d+\.\d+, etc). Patterns must be configurable via an 
   injected IRegexRouterConfig. If any pattern matches, perform a direct 
   ILIKE lookup against `content`, return up to topK results ordered by 
   updated_at DESC. Skip steps 2-3.

2. Otherwise, run two queries in parallel using Task.WhenAll:
   a. Sparse (BM25-style) via Postgres FTS:
      SELECT id FROM documents
      WHERE tsv @@ plainto_tsquery('english', @query)
      ORDER BY ts_rank(tsv, plainto_tsquery('english', @query)) DESC
      LIMIT 50;
   b. Dense via pgvector. Use an injected IEmbeddingClient.EmbedAsync(string) 
      to get the query vector, then:
      SELECT id FROM documents
      ORDER BY embedding <=> @query_embedding
      LIMIT 50;

3. Fuse the two ranked lists with Reciprocal Rank Fusion (k=60), 
   return top K with id, content, and the fused RRF score.

Constraints:
- Use Npgsql or Dapper. Raw SQL is fine; skip EF Core for query execution.
- Parameterize every query.
- All public methods accept a CancellationToken.
- Include xUnit tests covering: regex router triggers correctly on known 
  patterns, RRF math is correct on synthetic ranked lists, and the hybrid 
  path runs when no pattern matches.
- Do NOT add reranking, query expansion, or fine-tuned embeddings. 
  Save those for a later iteration.

That stack handles the queries pure vector search quietly fails on, costs nothing extra to run, and is debuggable when something breaks. The exotic options (custom-fine-tuned embeddings, LLM-as-retriever, inverted HyDE) are tools to reach for when this baseline is measurably falling short, not before.

Vector search is a useful primitive. It is not a search engine. Treat the retrieval layer like the engineering problem it is, with multiple tools sized to multiple shapes of query, and your RAG system will get embarrassingly better than the one you shipped by importing pgvector and hoping.

AI Dev Tools Keep Shipping Like Every Engineer Has One Project

Brennan Hitchcock — Thu, 08 Jan 2026 14:17:08 GMT

I tried setting up claude-cognitive last week — an attention-based working-memory layer for Claude Code that promises persistent context across sessions on large codebases. The setup guide said fifteen minutes. It took three hours.

The bug count was small. The pattern behind the bugs is what I want to write about, because this is now the dominant failure mode I see across this generation of AI dev infrastructure: defaults written for the author's one project, on the author's one machine, with no plan for anyone else.

Three specific examples, each one a problem the Unix world solved decades ago.

1. Hardcoded paths

The router looks for documentation in ~/.claude/systems/, ~/.claude/modules/, and so on — the global Claude directory. But the setup guide has you create that documentation in your project's .claude/. These are different directories. The script fails silently when it finds nothing.

# Line 452 in the original context-router-v2.py
docs_root = Path(os.environ.get("CONTEXT_DOCS_ROOT", str(Path.home() / ".claude")))

Project-local-first, global-fallback is the convention every dev tool from git to direnv to mise has shipped for years. The fix is six lines:

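import os
from pathlib import Path

# Resolution order: explicit env var, then project-local, then global.
if os.environ.get("CONTEXT_DOCS_ROOT"):
    docs_root = Path(os.environ["CONTEXT_DOCS_ROOT"])
elif Path(".claude").exists():  # relative to the current working directory
    docs_root = Path(".claude")
else:
    docs_root = Path.home() / ".claude"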

That's not a feature request. That's the baseline. Any tool that doesn't do this assumes you only ever work on one codebase, which has never been true for any working engineer.

2. Hardcoded keywords

The router decides which docs are relevant by matching the prompt against a KEYWORDS dictionary that maps trigger words to documentation files. The default dictionary contained over a hundred entries, all tied to the author's specific project — file names like server-one.md, terms like vram, cuda, trajectory, state machine. On my codebase, every one of them was dead weight.

# Original keywords (lines 79–197)
KEYWORDS: Dict[str, List[str]] = {
    "systems/server-one.md": [
        "server-one", "gpu", "local model", "inference",
        "vram", "cuda", "nvidia-smi"...
    ],
    # ...100+ more project-specific keywords
}

The fix is to externalize the config to where the project lives:

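import json
from pathlib import Path

def load_project_config():
    # Project-local config wins; the global file is only a fallback.
    config_paths = [
        Path(".claude/keywords.json"),
        Path.home() / ".claude/keywords.json",
    ]
    for config_path in config_paths:
        if config_path.exists():
            try:
                config = json.loads(config_path.read_text())
                return (
                    config.get("keywords", {}),
                    config.get("co_activation", {}),
                    config.get("pinned", []),
                )
            except (json.JSONDecodeError, OSError):
                # Malformed or unreadable file: try the next candidate.
                continue
    # No usable config anywhere: empty keywords, co-activation map, and pins.
    return ({}, {}, [])

KEYWORDS, CO_ACTIVATION, PINNED_FILES = load_project_config()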

Each project now gets its own .claude/keywords.json with the trigger words, co-activation map, and pinned files that match its own architecture. This is .eslintrc.json and pyproject.toml and .editorconfig. Per-project configuration in the project, with a sane fallback. It's the contract that lets a tool work for ten thousand different codebases instead of one.
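For a sense of what such a file might hold, here is a hypothetical example together with a simplified matcher. Only the three top-level keys come from the loader above. The file names and trigger words are invented, `co_activation` is left empty because its inner shape is not shown here, and the case-insensitive substring match is my assumption about the matching step, not the router's actual logic.

```python
import json

# Hypothetical contents for a project's .claude/keywords.json.
sample_config = {
    "keywords": {
        "systems/billing.md": ["invoice", "billing", "stripe webhook"],
        "modules/auth.md": ["login", "oauth", "session token"],
    },
    "co_activation": {},  # inner shape not shown in this post; left empty
    "pinned": ["systems/overview.md"],
}


def match_docs(prompt, keywords):
    """Return docs whose trigger words appear in the prompt (case-insensitive)."""
    text = prompt.lower()
    return [doc for doc, triggers in keywords.items()
            if any(trigger.lower() in text for trigger in triggers)]


print(json.dumps(sample_config, indent=2))  # what would be written to disk
print(match_docs("Why does the OAuth login redirect loop?",
                 sample_config["keywords"]))
```

With triggers that belong to the project, a prompt about OAuth activates modules/auth.md. With the original defaults, the same prompt would have matched nothing.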

3. Silent failures

When no files reached HOT or WARM status — which, given the previous two bugs, was every time — the router printed nothing.

# Line 492 in original
if stats["hot"] > 0 or stats["warm"] > 0:
    print(output)

This makes sense in steady-state production. You don't want injection noise on every prompt. But during setup, you have no signal that the hook is even running. The log file existed at ~/.claude/context_injection.log and contained the answer. I found it by running find for anything claude-cognitive had touched in the last hour. That's not a debugging strategy; that's archaeology.

If a tool is going to inject anything into my prompt, or fail to inject anything, it has to tell me. A single line on stderr ("activated 3 docs from .claude/" or "no docs matched, check .claude/keywords.json") would have saved me the three hours. The fix is one log line, or a --verbose flag, or simply printing the stats banner to stderr for the first ten invocations and never again. Any of these costs less to implement than this paragraph cost to write.
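One way the loud-by-default version could look, as a sketch rather than the tool's actual code. The function name `report_injection`, the `[context-router]` prefix, and the `verbose` parameter are all hypothetical; the `stats` dict with `hot` and `warm` counts follows the snippet above.

```python
import io
import sys


def report_injection(stats, docs_root, verbose=False, stream=sys.stderr):
    """Write one status line; stay quiet only on success with verbose off."""
    activated = stats.get("hot", 0) + stats.get("warm", 0)
    if activated == 0:
        # The case that cost three hours: always say so.
        message = f"no docs matched, check {docs_root}/keywords.json"
    elif verbose:
        message = f"activated {activated} docs from {docs_root}/"
    else:
        return None
    print(f"[context-router] {message}", file=stream)
    return message


# Capture the output in a buffer to show what a user would see on stderr.
buf = io.StringIO()
report_injection({"hot": 0, "warm": 0}, ".claude", stream=buf)
report_injection({"hot": 2, "warm": 1}, ".claude", verbose=True, stream=buf)
print(buf.getvalue(), end="")
```

The asymmetry is deliberate: success is quiet unless asked, while an empty match always prints, because the empty case is the one a new user cannot otherwise tell apart from a hook that never ran.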

The pattern

None of these are claude-cognitive bugs in isolation. They're the same three defaults — global-only paths, hardcoded project assumptions, silent execution — showing up across this generation of AI dev tooling.

There's a reason. Most of these tools start as one person's local script that worked well enough for them, then get open-sourced under the assumption that "open source it" and "make it usable by others" are the same step. They are not. The work of going from "works on the author's machine" to "works on a stranger's machine" is exactly the unglamorous engineering that gets skipped when a tool is gaining momentum on dev Twitter.

For the people building these tools: the bar is git. It's direnv. It's whatever shell-completion framework you ship. We figured this out a long time ago. Project-local config, sane env-var overrides, loud failures by default. The cost of not having these is measured in hours per user per setup, and the user count is going up fast.

For the rest of us evaluating: ask the questions before you install. Does it respect a project-local config directory? Does it tell me what it did? Does it have an env var I can use to override defaults without forking? If the answer to any of those is no, you're going to pay the same three hours I did.

claude-cognitive itself is worth the setup once you've patched the defaults. The attention routing — files heating up as you mention them, cooling off as the conversation drifts, co-activating related modules — is genuinely useful on a codebase large enough that you can't fit the whole repo in context. For anything smaller, you don't need it.

I've opened a PR against the upstream repo to add project-local keyword loading. Whether or not it lands, the broader claim stands: AI-native developer infrastructure is still in its hand-rolled-bash-script phase. The teams that win the next wave aren't the ones with the cleverest models — they're the ones that figure out the boring part: defaults that don't assume the user is the author.