If your RAG candidate says "cosine similarity" first, end the interview
If your RAG candidate says "cosine similarity" before asking what the queries look like, end the interview.
I do not mean that as a punchline. The order is telling you something. A candidate who reaches for the embedding model before they reach for the query log is solving a problem they read about, not the one in front of them.
Last week I wrote about how retrieval quietly fails. Vector search is a primitive, not a search engine, and the systems that hold up under real traffic hybridize, route, and re-rank around that fact. A few people asked the obvious follow-up. How do you hire someone who already knows this?
I do not have a clean answer. I have six questions. They aren't a checklist so much as a set of tripwires, places where the candidate's instincts either match how the system fails in production or do not. Most of them have nothing to do with embeddings. The last one is the question every candidate should want you to ask.
Here is what I would run, in order.
1. What does your query log look like?
Ask this near the start of the conversation, after they have described a RAG system they have built. The answer is binary. Either they have a specific picture in their head, or they do not.
A good answer sounds something like this. Most queries are short. About a third have an identifier in them. There is a long tail of weird stuff people type when they are frustrated, and last quarter we started bucketing the tail because most of the support escalations were coming from one shape of query. The candidate is describing a thing they have looked at, recently, with their own eyes.
A weak answer pivots. They start talking about chunking strategy, or which embedding model they chose, or how they tuned chunk overlap. The query log is offstage somewhere, a thing other people deal with.
Production retrieval is mostly a literacy problem. You cannot fix what you have not bothered to read. Candidates who skip this step build systems that work fine on the examples in the original paper and then fall over on whatever the support team is actually fielding.
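For a sense of how little code "reading the log" takes, here is a minimal sketch of the bucketing the good answer describes. Everything specific in it is an assumption made for illustration: the identifier pattern, the bucket names, and the word-count thresholds are hypothetical, and real ones should come out of your own log.

```python
import re
from collections import Counter

# Hypothetical identifier shape, e.g. "INV-20391" or "ORD_1182".
# Replace with whatever identifiers your own product uses.
ID_PATTERN = re.compile(r"\b[A-Z]{2,5}[-_]\d{3,}\b")


def bucket(query: str) -> str:
    """Assign a query to one coarse shape bucket."""
    text = query.strip()
    if not text:
        return "empty"
    if ID_PATTERN.search(text):
        return "has_identifier"
    n_words = len(text.split())
    if n_words <= 3:
        return "short"
    if n_words >= 15:
        return "long_tail"
    return "medium"


def summarize(queries: list[str]) -> dict[str, float]:
    """Return the share of traffic that falls in each bucket."""
    counts = Counter(bucket(q) for q in queries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


sample = [
    "reset password",
    "status of INV-20391",
    "refund",
    "why does the export keep failing when I pick a date range",
]
for name, share in sorted(summarize(sample).items()):
    print(f"{name}: {share:.0%}")
```

The output matters less than the habit. A candidate who has run something like this can tell you the shares from memory.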
If they have never opened the log, that is the answer. Move on.
2. Show me your eval set.
Not a benchmark. The actual artifact they opened on Monday morning to see whether Friday's change made the product better or worse.
The strongest answer is a candidate who pulls up a CSV. Two or three hundred real queries, labels for what good retrieval looks like on each, a column flagging the hard cases. They can walk you through the queries they argued over, the time they realized their eval set had a bias baked in because it was assembled by one person on a Tuesday, the change that took them by surprise. There is a history with this file.
The middle answer is a candidate who has an eval but it is the publisher's example queries, or whatever shipped with the framework. They have measured something other than their product.
The weak answer is some variation of "we look at the responses and they seem fine." Or precision and recall numbers with no context for what was being measured against. Or the confident claim that the LLM judges its own answers, which usually means nobody has audited what the judge is doing.
This is a proxy question for whether retrieval is being treated as a science or as a config file. The candidates who treat it as a science have already lost arguments to their own data. That is what you want.
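To make the artifact concrete, here is a rough sketch of an eval file and the smallest script that scores a retriever against it. The CSV layout (query, relevant_ids, hard), the document IDs, and the fake_retrieve stand-in are all assumptions for illustration, not a standard format. The metric is plain recall at k, one reasonable choice among several.

```python
import csv
import io

# Hypothetical eval file layout: one row per real query, the document IDs a
# human judged relevant (pipe-separated), and a flag for known hard cases.
EXAMPLE_CSV = """query,relevant_ids,hard
reset password,doc-12|doc-40,0
refund policy 2023,doc-7,1
"""


def load_eval(handle) -> list[dict]:
    """Parse the eval CSV into rows with a set of relevant doc IDs."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "query": row["query"],
            "relevant": set(row["relevant_ids"].split("|")),
            "hard": row["hard"] == "1",
        })
    return rows


def recall_at_k(rows, retrieve, k=5) -> float:
    """Mean fraction of labelled relevant docs found in the top k."""
    if not rows:
        return 0.0
    scores = []
    for row in rows:
        found = set(retrieve(row["query"], k)) & row["relevant"]
        scores.append(len(found) / len(row["relevant"]))
    return sum(scores) / len(scores)


def fake_retrieve(query: str, k: int) -> list[str]:
    # Stand-in for a real retriever, for illustration only.
    canned = {
        "reset password": ["doc-12", "doc-3"],
        "refund policy 2023": ["doc-9"],
    }
    return canned.get(query, [])[:k]


rows = load_eval(io.StringIO(EXAMPLE_CSV))
overall = recall_at_k(rows, fake_retrieve)
hard_only = recall_at_k([r for r in rows if r["hard"]], fake_retrieve)
print(f"recall@5 overall={overall:.2f} hard={hard_only:.2f}")
```

Reporting the hard cases separately is the point of the extra column. An overall number can improve while the queries people argued over get worse.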
3. What is the worst question your system answers confidently?
The word doing the work here is "confidently." Anyone can list the queries that returned nothing. You want the cases where the system returned an answer with full conviction and the answer was wrong.
A good candidate has at least one of these in their pocket, told with the slight wince of someone who has not fully forgiven themselves for letting it ship. The user asked about policy and got an answer from a deprecated doc. A finance question came back with numbers from the wrong fiscal year. A name collision sent someone to the wrong customer's record. The example is specific, the log line is specific, sometimes there is a specific complaint attached.
A weak candidate gives you a category. We had some hallucinations. The model sometimes gets things wrong. We use temperature zero. General statements about LLM behaviour, because they have not lived in any of the specific failures. If they had, they would still be a little upset about one of them.
This is also a humility test. Candidates who can talk about a confident wrong answer have made peace with the fact that the system is going to do this no matter how well it is tuned. The ones who cannot often believe the next prompt tweak will fix it, and they argue against guardrails on the grounds that the model should know better. Those candidates are expensive to manage.
4. When would you not reach for vector search?
This one is short and it discriminates fast. A strong candidate has a list. Exact match against identifiers. Anything regulatory where the answer has to come from a specific named document. Structured filtering on date ranges, status flags, and low-cardinality fields where the user has already given you the answer in the query. They will sometimes pre-empt you and say something like, for half the traffic the better tool is a SQL query and we route there before anything else fires.
A weak candidate pauses. They might mumble about hybrid search, or note that BM25 is still useful, or wave at reranking. The shape of the pause is the signal: they are searching for the right answer instead of recognising it.
What the question is testing is not the list itself. It is whether they understand vector search as one tool among several, with specific failure modes that other tools handle better. Candidates who already see it that way have probably wired a regex router in front of their retriever. For everyone else, every problem they encounter is going to look like an embedding tuning problem.
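A regex router of that kind can be very small. The sketch below is one possible shape, not a recommended design: the patterns, the backend names, and the Route type are all hypothetical, and the real patterns would come from the query log.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; real ones come from reading the query log.
TICKET_ID = re.compile(r"\b[A-Z]{2,5}-\d{3,}\b")
DATE_RANGE = re.compile(
    r"\b(?:between|from)\s+\d{4}-\d{2}-\d{2}\s+(?:and|to)\s+\d{4}-\d{2}-\d{2}\b",
    re.IGNORECASE,
)
STATUS = re.compile(r"\b(?:open|closed|pending)\b", re.IGNORECASE)


@dataclass
class Route:
    backend: str  # "exact_lookup", "sql_filter", or "vector_search"
    reason: str   # kept so the decision can be logged and audited


def route(query: str) -> Route:
    """Pick a backend before any embedding is computed."""
    match = TICKET_ID.search(query)
    if match:
        return Route("exact_lookup", f"identifier {match.group(0)}")
    if DATE_RANGE.search(query) or STATUS.search(query):
        return Route("sql_filter", "structured filter in query")
    return Route("vector_search", "no structured signal found")


for q in [
    "where is TKT-4821",
    "open tickets from 2024-01-01 to 2024-01-31",
    "how do I change my billing address",
]:
    print(q, "->", route(q).backend)
```

A router this naive will misfire. The status pattern would catch "how do I open the app", for instance. That is an argument for logging the reason field and reading it, not for skipping the router.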
5. What does the failure look like to the user?
This pulls the candidate out of the engineer frame, and it is the question most likely to surprise them.
The answers I am listening for sit at the seam between system design and product. What does the user see when the retrieved documents do not contain the answer at all. What does the user see when the documents do contain it but the LLM rewords the answer into something they misread. What happens when somebody asks a question the system was never going to be able to handle and gets back a long, plausible, completely hallucinated paragraph. Whose desk does the support ticket land on, and how does the user even know something has gone wrong.
A strong candidate has thought about this and probably argued with someone about it. They might tell you they pushed for an "I do not know" response path and got overruled by product. Or describe a UI tweak they made to surface which document an answer came from, because users were quoting the bot's response in meetings and getting embarrassed. They have opinions about what should happen at the moment the system fails, because they have watched it fail.
A weak candidate stays at the model. The LLM hallucinated. The retrieval missed. They will not have a clear picture of what the user did next, because they have been thinking of the system as a model with some plumbing instead of a product with a person at the other end. That is the candidate you can hire as a research engineer. They are not yet the person you want owning a production retrieval system.
6. Walk me through a recent retrieval bug. Symptom to fix.
This is the universal senior engineer question. It also discriminates most sharply for retrieval work, because so much of the job is debugging things that did not throw an exception.
What I want is specificity. The symptom should be concrete, ideally something a user reported or something they spotted in a dashboard. The investigation should have a shape: a hypothesis they formed, an instrumentation step that confirmed or killed it, an obvious culprit they ruled out before they found the less obvious one. They can usually tell you the wrong answer they had in their head before they got to the right one, which is the tell that they are remembering the bug rather than constructing it.
The fix should be small and explained in plain language. The regression test should exist. If they mention that the bug came back six weeks later in a slightly different form, even better. That is how you know it was real, and not a war story that got polished for interviews.
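As an illustration of what "small fix plus regression test" can look like, here is a toy sketch built around a deprecated document surfacing in results. The corpus, the Doc type, and the keyword-overlap retriever are hypothetical stand-ins, not anyone's production code. The only parts that matter are the one-line filter and the test that pins it.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str
    deprecated: bool = False


# Hypothetical corpus reproducing the reported symptom.
CORPUS = [
    Doc("policy-v1", "refund window is 14 days", deprecated=True),
    Doc("policy-v2", "refund window is 30 days"),
    Doc("shipping", "orders ship within 2 days"),
]


def retrieve(query: str, corpus: list[Doc], k: int = 3) -> list[Doc]:
    """Toy keyword-overlap retriever, standing in for the real one."""
    terms = set(query.lower().split())
    live = [d for d in corpus if not d.deprecated]  # the small fix
    scored = [(len(terms & set(d.text.lower().split())), d) for d in live]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for score, d in ranked if score > 0][:k]


def test_refund_query_never_returns_deprecated_policy():
    """Pins the user-reported query so the bug cannot return silently."""
    ids = [d.doc_id for d in retrieve("what is the refund window", CORPUS)]
    assert "policy-v2" in ids
    assert "policy-v1" not in ids


test_refund_query_never_returns_deprecated_policy()
```

The test uses the query from the complaint, not a cleaned-up version of it. When the bug comes back in a slightly different form, the new query becomes a second test next to this one.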
Vague answers here are disqualifying in a way the other questions are not. Everyone has at least one retrieval bug if they have shipped one of these systems. If a candidate cannot produce one, the most generous reading is that they have not shipped. Less generously, they shipped without noticing what was wrong.
The question every candidate should want you to ask
If you are reading this from the other side of the table, the question to hope for is the last one.
The question is not easy, but it lets you do the thing you have actually spent the most time on. A retrieval bug story is the closest thing this work has to a portfolio piece. It contains evidence of the production access you have had, the metrics you watched, the colleagues you argued with, the decision you made under time pressure. It is what an interview is trying to extract anyway, told in a form that does not feel like an interrogation.
If you have one, prepare it the way you would prepare a code sample. Start at the user complaint, walk through what you checked first and why you were wrong, show the moment the real cause clicked, end with the smallest fix that solved it and the regression test you wrote to keep it from coming back.
If you do not have one yet, that is also useful information about where you are. Build something. Ship it to ten people. Watch a retrieval bug happen on a query you did not anticipate. The interview will get easier, and so will the work.
The retrieval part of RAG is becoming its own discipline, and it does not look like the role most teams are hiring for. The best people I have worked with on these systems are search engineers with some ML literacy, not the other way around. They know classical IR, they instrument production by reflex, and they are comfortable being wrong in measurable ways, which is the rarest of those qualifications.
Hire for that and the rest is teachable. Hire for embedding intuition first and you will end up with a beautifully tuned system that quietly fails on the queries your users actually care about.