Domain 1: Retrieval Quality
What it measures
Retrieval Quality is the foundation layer of every RAG assessment. It measures whether the system surfaces the correct documents for a given query, before any answer is generated, before any grounding is checked, before any model behaviour matters at all.
If retrieval fails, every downstream metric becomes unreliable. A system can have a state-of-the-art generator and still produce bad answers if the wrong chunks come back. Conversely, a mediocre generator with strong retrieval often outperforms a strong generator with weak retrieval. The retrieval layer does most of the work, yet the generator tends to take the blame for failures that originate in retrieval.
This domain is also where the most expensive optimisation mistakes happen. Teams swap embedding models to chase a benchmark gain, lift Recall@5 by two points on a public test set, and silently regress on the specific query patterns their users actually issue. Sau5's Domain 1 testing is built to catch that pattern.
Why it matters
Three production failures cluster in this domain.
The first is silent drift after re-indexing. A re-ingestion of the corpus changes chunk boundaries, embedding distribution, or both. Aggregate recall looks unchanged. Users report that the system "feels worse than last week", a subjective complaint that turns out to be a 12-point drop on long-tail queries that the aggregate metric was hiding.
The second is embedding-model substitution regret. A team upgrades from one embedding model to a newer one with a stronger MTEB score. Six weeks later, support tickets reveal the new model handles short queries better but degrades sharply on multi-sentence queries, the exact distribution the team's users issue. Domain 1 testing forces the comparison on the client's query distribution, not a public benchmark's.
The third is reranker overconfidence. Adding a cross-encoder reranker improves NDCG by a clear margin on standard test sets, but reranker training data rarely matches enterprise corpora. Without per-query-class testing, teams discover too late that the reranker helps common questions and actively hurts the long tail.
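The first failure above is easy to reproduce with arithmetic. The sketch below assumes a hypothetical 100-record dataset in which 10 records are long-tail queries; the slice sizes and the recall values are illustrative assumptions, not measurements from any engagement. It shows how a 12-point drop on one slice shrinks to roughly a 1.2-point move in the aggregate.

```python
def aggregate(slice_scores, slice_sizes):
    """Size-weighted mean of per-slice scores (what an aggregate metric reports)."""
    total = sum(slice_sizes.values())
    return sum(slice_scores[t] * slice_sizes[t] for t in slice_sizes) / total

# Hypothetical split: 10 long-tail records, 90 records of every other type combined.
sizes = {"long-tail": 10, "other": 90}

before = aggregate({"long-tail": 0.74, "other": 0.90}, sizes)
after = aggregate({"long-tail": 0.62, "other": 0.90}, sizes)  # 12-point slice drop

aggregate_drop_pts = (before - after) * 100
print(f"aggregate moved by {aggregate_drop_pts:.1f} points")
```

Under these assumptions, a check that only looks at the aggregate with a 2-point tolerance would not fire, even though one slice lost 12 points.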
How Sau5 tests it
The testing harness for Domain 1 has four parts.
Golden dataset construction
Sau5 builds a versioned dataset of 100+ records during Week 1 of every engagement. The dataset is structured across ten query types: definitional, comparative, procedural, multi-hop, temporal, numerical, negation, long-tail, ambiguous, and out-of-scope. Each record contains the query, the expected source chunks (labelled by subject-matter experts), and metadata flagging the query type and difficulty. Without a domain-tuned dataset, retrieval testing produces noise.
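One way to picture a golden record is as a small typed structure. This is a minimal sketch, not Sau5's actual schema: the class name `GoldenRecord`, the field names, the difficulty labels, and the convention that out-of-scope queries carry no expected chunks are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# The ten query types named in the text.
QUERY_TYPES = (
    "definitional", "comparative", "procedural", "multi-hop", "temporal",
    "numerical", "negation", "long-tail", "ambiguous", "out-of-scope",
)

@dataclass(frozen=True)
class GoldenRecord:
    query: str
    expected_chunk_ids: tuple  # SME-labelled chunks; assumed empty for out-of-scope
    query_type: str
    difficulty: str            # hypothetical scale: "easy", "medium", "hard"

    def __post_init__(self):
        # Reject typos in the query type early, so slices stay comparable across versions.
        if self.query_type not in QUERY_TYPES:
            raise ValueError(f"unknown query type: {self.query_type!r}")

def coverage(records):
    """Count records per query type so under-represented slices are visible."""
    counts = Counter(r.query_type for r in records)
    return {t: counts.get(t, 0) for t in QUERY_TYPES}

records = [
    GoldenRecord("What is a retention policy?", ("doc12#3",), "definitional", "easy"),
    GoldenRecord("Which plan excludes SSO but includes audit logs?", ("doc4#1", "doc9#2"), "negation", "hard"),
    GoldenRecord("What is the weather in Paris?", (), "out-of-scope", "easy"),
]
print(coverage(records))
```

A coverage check like this makes it obvious when a dataset nominally has 100+ records but only a handful in a slice such as long-tail.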
Multi-metric measurement
Sau5 measures four metrics per query and aggregates by query type:
| Metric | What it measures | Why it's tracked |
| --- | --- | --- |
| Recall@k | Fraction of all relevant documents that appear in the top-k retrieved set | The headline metric, but cannot tell you whether ranking is good |
| Precision@k | Fraction of top-k results that are relevant | Catches retrieval that surfaces relevant content alongside high noise |
| MRR | Average inverse rank of the first relevant result | Sensitive to whether the top result is the right one |
| NDCG@k | Normalised discounted cumulative gain at k | Penalises late-rank correct results, rewards correct ordering |
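For readers who want the definitions in executable form, here is a minimal pure-Python sketch of the four metrics under binary relevance. It is illustrative only and is not the harness itself; the function names and the example chunk IDs are assumptions.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Share of all relevant IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Share of the top-k positions occupied by relevant IDs."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR over an iterable of (retrieved, relevant) pairs."""
    runs = list(runs)
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved, relevant, k):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical single query: three relevant chunks, two of them retrieved at ranks 2 and 4.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}

print(recall_at_k(retrieved, relevant, 5))
print(precision_at_k(retrieved, relevant, 5))   # 0.4
print(reciprocal_rank(retrieved, relevant))     # 0.5
print(ndcg_at_k(retrieved, relevant, 5))
```

In this example recall is 2 of 3, but the reciprocal rank and NDCG both register that the first relevant chunk only appears in second place, which is the ranking signal recall alone cannot provide.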
Slice-level analysis
The aggregate score is almost always misleading. Sau5 reports per-query-type breakdowns and flags any slice that falls more than 5 points below the engagement-baseline target. This is where the silent regressions live.
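The slice check can be sketched in a few lines. The helper names and the per-query scores below are hypothetical; the two targets used (0.95 for definitional, 0.70 for long-tail) are the default per-type Recall@5 bounds quoted in this section, and the 5-point margin is the one stated above.

```python
from collections import defaultdict

def per_type_means(results):
    """results: iterable of (query_type, score) pairs with scores in [0, 1]."""
    buckets = defaultdict(list)
    for query_type, score in results:
        buckets[query_type].append(score)
    return {t: sum(scores) / len(scores) for t, scores in buckets.items()}

def flag_slices(type_means, targets, margin_pts=5.0):
    """Return {query_type: gap in points} for slices more than margin_pts below target."""
    flagged = {}
    for query_type, mean in type_means.items():
        gap_pts = (targets[query_type] - mean) * 100
        if gap_pts > margin_pts:
            flagged[query_type] = round(gap_pts, 1)
    return flagged

# Hypothetical per-query Recall@5 scores for two of the ten slices.
results = [
    ("definitional", 1.0), ("definitional", 1.0), ("definitional", 1.0), ("definitional", 1.0),
    ("long-tail", 1.0), ("long-tail", 0.0), ("long-tail", 1.0), ("long-tail", 0.0), ("long-tail", 1.0),
]
targets = {"definitional": 0.95, "long-tail": 0.70}

means = per_type_means(results)
flagged = flag_slices(means, targets)
print(flagged)
```

Here the long-tail slice averages 0.60 against a 0.70 target, so it is flagged with a 10-point gap, while the combined mean across all nine queries (about 0.78) would not by itself suggest a problem.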
Drift detection between runs
Once the baseline is established, every subsequent run is compared against it. Sau5's harness flags any per-query-type drop greater than 2 percentage points as a warn, and any drop greater than 5 percentage points as a fail that blocks deploy in CI/CD.
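The warn and fail rules translate directly into a comparison function. This is a sketch under assumptions: the function names, the dictionary layout, and the example scores are hypothetical, and only the 2-point and 5-point thresholds come from the text.

```python
def classify_drift(baseline, current, warn_pts=2.0, fail_pts=5.0):
    """Compare per-query-type scores and label each slice pass, warn, or fail.

    Both arguments map query type -> score in [0, 1]. A drop is measured in
    percentage points relative to the baseline run.
    """
    report = {}
    for query_type, base in baseline.items():
        drop_pts = (base - current[query_type]) * 100
        if drop_pts > fail_pts:
            status = "fail"
        elif drop_pts > warn_pts:
            status = "warn"
        else:
            status = "pass"
        report[query_type] = (status, round(drop_pts, 1))
    return report

def should_block_deploy(report):
    """True if any slice failed; a CI job could exit non-zero on this."""
    return any(status == "fail" for status, _ in report.values())

# Hypothetical Recall@5 per slice, before and after a re-index.
baseline = {"definitional": 0.96, "procedural": 0.88, "long-tail": 0.74}
current = {"definitional": 0.96, "procedural": 0.85, "long-tail": 0.62}

report = classify_drift(baseline, current)
for query_type, (status, drop) in report.items():
    print(f"{query_type}: {status} (drop {drop} pts)")
print("block deploy:", should_block_deploy(report))
```

In a pipeline, the final step would typically turn `should_block_deploy` into a non-zero exit code; how that is wired up depends on the CI system and is left out here.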
Metrics and pass thresholds
| Metric | Definition | Default pass threshold |
| --- | --- | --- |
| Recall@5 (aggregate) | Recall across whole dataset, top-5 results | ≥ 0.85 |
| Recall@5 (per query type) | Recall on each of 10 query types | ≥ 0.70 (long-tail) to ≥ 0.95 (definitional) |
| Precision@5 | Precision in top-5 retrieved | ≥ 0.60 |
| MRR | Mean reciprocal rank | ≥ 0.70 |
| NDCG@10 | NDCG at rank 10 | ≥ 0.75 |
| Drift vs baseline | Per-query-type drop relative to the baseline run | > 2 pts warn, > 5 pts fail |
These are defaults. They are negotiated per engagement against the client's risk tolerance and use case. A medical-information RAG system uses tighter thresholds than an internal HR knowledge base.
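Because the thresholds are defaults that get renegotiated, it helps to keep them as data with per-engagement overrides. The sketch below assumes hypothetical names (`DEFAULT_THRESHOLDS`, `evaluate_gate`) and hypothetical scores; the four aggregate floors are the defaults from the table above, and the tightened 0.92 recall floor is an invented example of a stricter engagement, not a Sau5 figure.

```python
# Aggregate default floors from the table above.
DEFAULT_THRESHOLDS = {
    "recall@5": 0.85,
    "precision@5": 0.60,
    "mrr": 0.70,
    "ndcg@10": 0.75,
}

def evaluate_gate(scores, overrides=None):
    """Return {metric: (score, floor)} for every metric below its floor.

    overrides lets an engagement tighten or relax individual defaults.
    """
    effective = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    return {
        metric: (scores[metric], floor)
        for metric, floor in effective.items()
        if scores[metric] < floor
    }

# Hypothetical aggregate scores from one run.
scores = {"recall@5": 0.88, "precision@5": 0.58, "mrr": 0.72, "ndcg@10": 0.75}

default_failures = evaluate_gate(scores)
strict_failures = evaluate_gate(scores, overrides={"recall@5": 0.92})  # invented stricter floor

print(default_failures)
print(strict_failures)
```

With the defaults only Precision@5 misses its floor; with the stricter recall override the same run also fails on Recall@5, which is the practical effect of negotiating thresholds against risk tolerance.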
What failure looks like in production
- "It used to know about X." Users notice that a previously answered question now returns a different, usually worse, chunk. A sign of post-reindex drift or embedding-model substitution.
- "The right answer is in there somewhere, just not first." Users see the right document on page 2 of an internal search experience, or as the third citation in a generated answer. A sign of reranker degradation or NDCG drop without recall change.
- "It thinks this is about something else." Users issue a query and the system retrieves chunks from a related-but-wrong topic. A sign of embedding-model drift on the client's domain vocabulary.
Sau5's Domain 1 testing catches all three before they reach users, but only if it runs continuously. A one-off Recall@5 measurement at deploy time has no baseline to detect drift against, and it cannot see a ranking regression that leaves recall unchanged. Domain 1 testing isn't a project. It's a practice.
Tools and references
Sau5's harness uses RAGAS and BEIR for the metric implementations, pytrec_eval for the underlying scoring, and sentence-transformers for embedding-level diagnostics. The methodology follows the principles in NIST AI RMF 1.0 (the MEASURE function) and is consistent with the retrieval-evaluation patterns in Microsoft's RAG application evaluation guidance.