What is Retrieval Quality testing in a RAG system?

Retrieval Quality is the foundation layer of RAG testing. It measures whether the system surfaces the correct documents for a given query, using metrics like Recall@5, Precision@5, MRR and NDCG against a versioned golden dataset of 100+ records. If retrieval fails, no downstream metric can compensate.
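As a rough illustration of the four metrics named above, the sketch below scores one golden record using binary relevance labels. The function names, the document IDs and the binary-relevance form of NDCG are assumptions made for this example, not the Sau5 harness itself.

```python
import math

def recall_at_k(retrieved, relevant, k=5):
    """Share of the relevant document IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k=5):
    """Share of the top-k result slots filled by a relevant document."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(records):
    """MRR: average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in records) / len(records)

def ndcg_at_k(retrieved, relevant, k=5):
    """NDCG with binary gains (assumption: no graded relevance labels)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical golden record: ranked retriever output and SME-labelled relevant IDs.
retrieved = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d8"}

print(recall_at_k(retrieved, relevant))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant))  # 2 of 5 slots relevant
print(reciprocal_rank(retrieved, relevant)) # first hit at rank 2
print(ndcg_at_k(retrieved, relevant))
```

In a full run these per-record scores would be averaged across the golden dataset and compared against the agreed thresholds.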

What is Answer Grounding in RAG evaluation?

Answer Grounding tests whether every factual claim in a generated response can be traced back to a retrieved source passage. Faithfulness is scored using DeBERTa-v3 NLI plus an LLM-as-judge with a two-run agreement rule. The pass threshold is Faithfulness ≥ 0.90 and Answer Relevance ≥ 0.85.
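One way the thresholds and the two-run agreement rule might be combined is sketched below. The source does not spell out the agreement rule, so this example assumes that two independent judge runs must land on the same side of the Faithfulness threshold, and that a split result is routed to human review. The function name and the "review" outcome are hypothetical.

```python
FAITHFULNESS_MIN = 0.90      # pass threshold stated above
ANSWER_RELEVANCE_MIN = 0.85  # pass threshold stated above

def grounding_verdict(faithfulness_run_a, faithfulness_run_b, answer_relevance):
    """Return "pass", "fail" or "review" for one evaluated response.

    Assumption: the two judge runs agree when both fall on the same side
    of FAITHFULNESS_MIN; a disagreement is not scored automatically.
    """
    a_ok = faithfulness_run_a >= FAITHFULNESS_MIN
    b_ok = faithfulness_run_b >= FAITHFULNESS_MIN
    if a_ok != b_ok:
        return "review"  # runs disagree, so neither score is trusted
    if a_ok and answer_relevance >= ANSWER_RELEVANCE_MIN:
        return "pass"
    return "fail"

print(grounding_verdict(0.93, 0.91, 0.88))
print(grounding_verdict(0.93, 0.85, 0.90))
print(grounding_verdict(0.92, 0.95, 0.80))
```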

How is hallucination detected in a RAG system?

Sau5 uses a three-stage hallucination detection pipeline: stage one is NLI claim entailment against retrieved chunks; stage two is LLM-as-judge with two-run agreement; stage three is atomic claim decomposition and individual verification. Five hallucination types are tracked — intrinsic, extrinsic fabricated, extrinsic plausible, over-specification, and entity substitution. The pass bar is a hallucination rate ≤ 5%.
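To make the pass bar concrete, here is a minimal sketch that tallies claim-level labels into a hallucination rate and a per-type breakdown. It assumes the rate is computed per atomic claim (the source does not say whether the unit is a claim or a response), and the label strings and function name are illustrative.

```python
from collections import Counter

# The five tracked types listed above, written as hypothetical label strings.
HALLUCINATION_TYPES = {
    "intrinsic",
    "extrinsic_fabricated",
    "extrinsic_plausible",
    "over_specification",
    "entity_substitution",
}
MAX_HALLUCINATION_RATE = 0.05  # pass bar stated above

def summarise_claims(labels):
    """labels holds one entry per atomic claim: None if verified, else a type name."""
    flagged = [label for label in labels if label is not None]
    unknown = set(flagged) - HALLUCINATION_TYPES
    if unknown:
        raise ValueError(f"unknown hallucination types: {sorted(unknown)}")
    rate = len(flagged) / len(labels) if labels else 0.0
    return {
        "rate": rate,
        "by_type": Counter(flagged),
        "passed": rate <= MAX_HALLUCINATION_RATE,
    }

# 40 claims, two flagged: exactly on the 5% bar, so the run passes.
labels = [None] * 38 + ["intrinsic", "entity_substitution"]
print(summarise_claims(labels))
```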

What is adversarial testing for AI systems?

Adversarial testing probes whether an AI system can be manipulated via prompt injection, jailbreak, boundary violations or PII extraction. Sau5 runs the battery in an isolated environment only, with written client authorisation. The minimum test suite includes 8+ direct injection payloads, 4 jailbreak variants, 4 encoding obfuscations, a 5-turn escalation protocol, 5+ boundary probes, and 36+ PII canary retrieval probes. Pass bar: zero successful injections, zero PII surfaced.
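The PII canary check can be pictured roughly as follows: synthetic marker values are planted in the test corpus, and every probe response is scanned for them. This is a simplified sketch under that assumption; the canary strings, probe IDs and function name are invented for illustration, and a real battery would likely also check for partial or encoded leaks.

```python
def find_canary_leaks(responses, canaries):
    """Return (probe_id, canary) pairs where a planted canary appears in a response.

    responses: dict mapping probe ID to the system's response text.
    canaries: iterable of synthetic marker strings seeded into the corpus.
    """
    leaks = []
    for probe_id, text in responses.items():
        lowered = text.lower()
        for canary in canaries:
            if canary.lower() in lowered:
                leaks.append((probe_id, canary))
    return leaks

# Hypothetical canaries and probe outputs.
canaries = ["CANARY-SSN-000-11-2222", "canary.user@example.invalid"]
responses = {
    "probe-01": "I can't share personal records.",
    "probe-02": "The contact on file is canary.user@example.invalid.",
}

leaks = find_canary_leaks(responses, canaries)
print(leaks)             # one leak, from probe-02
print(len(leaks) == 0)   # pass bar is zero PII surfaced
```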

What is an eval harness for RAG systems?

An eval harness packages all RAG quality tests into an automated pipeline the client owns and runs themselves. The Sau5 handover includes a Python runner that executes all five domains in sequence, a regression detector flagging drops beyond defined thresholds, a versioned golden dataset, and CI/CD configurations for GitHub Actions, GitLab CI or Azure DevOps. It turns a one-off assessment into an ongoing quality practice.

How does Sau5 measure retrieval latency, and what thresholds does it apply?

Retrieval latency is measured inside Domain 1 alongside Recall, Precision, MRR and NDCG. The retriever is instrumented to capture wall-clock time from query submission to ranked chunks returned, across every record in the 100+ golden dataset. Sau5 reports p50, p95 and p99 across the run plus per-query-type breakdowns. Default thresholds: p95 under 500ms and p99 under 1.2s for conversational deployments, p95 under 2s and p99 under 4s for batch use cases, negotiated per engagement. Sau5 measures retrieval-layer latency at steady state, not end-to-end response time or sustained load behaviour.

Sau5 — AI Testing Consultancy. RAG Assessment in 4 Weeks.

01 · THE PROBLEM

Most enterprises are deploying AI with no way to test it.

1 in 3

Answers from leading retrieval-grounded legal AI tools were factually wrong on real-world queries (Stanford RegLab, 2024). Models that look fine in a demo break in production.

83%

of AI deployments lack any structured testing process.

4×

faster growth than traditional QA across enterprise tooling spend.

02 · THE SOLUTION

A consultancy that tests AI, and only AI.

01 · OFFERING

RAG Accuracy & Grounding Assessment

A 5-domain, 4-week structured assessment of any RAG system. Findings report, remediation roadmap, and a repeatable eval harness the client keeps and runs themselves.

4-Week Engagement

02 · OFFERING

Training & Enablement

A practitioner curriculum to take QA engineers from zero AI testing experience to productive on RAG assessment engagements — covering all five domains of the Sau5 methodology. In development; not yet available.

Coming Soon

RETRIEVAL · GROUNDING · HALLUCINATION · ADVERSARIAL · EVAL OPS · 5 DOMAINS · 4 WEEKS · ONE METHODOLOGY

04 · WHY NOT YOUR EXISTING QA TEAM

Traditional QA is trained for a different problem.

Dimension | Traditional QA | Sau5
AI Testing Expertise | Learning on the job | 5-domain methodology: defined tooling, standard thresholds
Hallucination Detection | No structured approach | NLI + LLM-Judge + atomic decomposition: three-stage pipeline
Adversarial Testing | Not in scope | Full Domain 4 battery: injection, jailbreak, PII canary, boundary
Eval Ops & CI/CD | Manual and ad hoc | Automated harness, regression thresholds, CI gates as code
Delivery Model | Single-location, fixed model | Nearshore and local resourcing: engagement structured to suit your environment
Ramp Time | 3–6 months to productive | 2–4 weeks: engineers trained on the methodology before they touch your stack

AI TESTING · ONLY AI TESTING · GLOBAL DELIVERY · FIXED SCOPE · CLIENT-OWNED HARNESS · BILINGUAL READOUT

05 · WHAT YOU WALK AWAY WITH

Three deliverables that outlast the engagement.

Golden Dataset

100+

SME-reviewed Q-A-context records, versioned and refreshed every 60 days or whenever your KB changes. Yours to extend and re-run forever.

Eval Harness Ownership

100%

At handover the Python runner, regression detector and CI/CD configs are yours. Zero ongoing dependency on Sau5 to keep testing.

Continuous re-runs

∞

CI/CD gates fire on every commit. Regressions are caught before deploy, not after. Testing becomes part of how you ship.

For what comes next

The harness is yours. Add a quarterly or half-yearly subscription and Sau5 keeps catching regressions, refreshing the dataset, and testing each release before it ships.

06 · FAQ

The questions buyers ask most.

How long is a Sau5 engagement?

The RAG Accuracy & Grounding Assessment is a fixed four-week engagement, kickoff to handover. Week 1 is scope and golden dataset construction. Weeks 2 and 3 run the four upstream domains in parallel. Week 4 packages the harness, wires the CI/CD, and delivers the findings report and bilingual readout. Fixed scope, fixed fee.

Can Sau5 run alongside our UAT — or does it replace it?

Neither. UAT for AI systems does not work the way UAT for traditional software works. Two testers running the same query get different model answers; three testers reach three different conclusions about whether the answer was right. UAT signal collapses.

The right shape: Sau5 runs first, producing quantitative scores against a defined golden dataset. UAT then runs on top — humans test usability, tone, edge-case judgment. UAT becomes the human layer on top of a system already proven to be factually correct. Without the Sau5 layer first, you are asking humans to certify quality on a system whose quality varies between runs. That is not a test. That is a hope.

How is AI testing different from AI observability, eval platforms, or guardrails?

Four separate vendor categories, often confused. AI Testing (Sau5) is a methodology applied at a defined point. AI Observability (Arize, WhyLabs, Helicone, Datadog) is runtime monitoring of production AI traffic. AI Eval Platforms (Braintrust, Galileo, LangSmith) are SaaS dashboards for managing test runs over time. AI Guardrails (Lakera, NeMo, Patronus) are runtime filters that block bad outputs before they reach users.

The full breakdown is in the Buyer's Guide PDF above.

What does Sau5 not do — and who should we call instead?

Sau5 does AI testing only. We do not sell observability dashboards, run runtime guardrails, or operate as a SaaS eval platform. If you need runtime monitoring, talk to Arize, WhyLabs, Helicone, Langfuse, or Datadog LLM Observability. For eval platforms, Braintrust, Galileo, LangSmith, Vellum, or Humanloop — our harness integrates with any of them. For guardrails, Lakera, NeMo, Guardrails AI, Patronus, or Aporia.

Where is our data stored during a Sau5 engagement?

Sau5 minimises data handling by design. Test execution runs on client infrastructure by default. The harness is deployed into the client's environment, and client data never leaves the client perimeter for normal testing.

Where Sau5 does hold client artefacts (off-site golden dataset construction, findings report preparation), they are encrypted at rest, accessible only to the engagement team (typically one to three engineers), isolated from any other Sau5 work, and access-logged.

The harness uses the client's existing LLM endpoints — OpenAI, Anthropic, Azure OpenAI, or self-hosted — under the client's existing vendor agreements. No new third-party LLM data flows are introduced.

What happens to our data after the engagement ends?

At handover the client takes ownership of the golden dataset, the runnable harness, the findings report, and any associated artefacts.

Sau5 retains only the methodology (not client-specific) and aggregated anonymised metrics. Sau5 deletes client knowledge base content, customer queries, test outputs containing client-identifiable information, attack-surface details from Domain 4, and credentials. Deletion is confirmed in writing within 14 days.

For Path B and Path C clients (managed service or hybrid retainer), Sau5 retains only the minimum needed to run the ongoing work between cadence runs.

What happens after the four weeks?

Three options. Path A: full handover, the client runs the methodology themselves, no ongoing relationship with Sau5. Path B: Sau5 managed service, where Sau5 runs the harness on the client's behalf on a quarterly or half-yearly cadence and surfaces regressions. Path C: hybrid, where the client owns and runs the harness and Sau5 stays on retainer for quarterly re-runs, methodology updates, and advisory. Most clients choose B or C because AI systems drift continuously and a one-off assessment ages out within three to six months.

How does Sau5 measure retrieval latency?

Latency is measured inside Domain 1, on the same calls that produce Recall, Precision, MRR and NDCG — at zero added client cost. Sau5 instruments the retriever to capture wall-clock time from query submission to ranked chunks returned, then reports p50, p95 and p99 across the full golden dataset plus per-query-type.

Default thresholds: p95 < 500ms / p99 < 1.2s for conversational deployments; p95 < 2s for batch use cases. Each is negotiated per engagement against the client's SLA. Drift across runs is gated: more than 20% raises a warning, more than 50% raises a fail. Sau5 measures retrieval-layer latency at steady state, not end-to-end response time or load behaviour.
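The percentile reporting and drift gates described here could be sketched along these lines. The nearest-rank percentile method, the reading of drift as a relative increase over the previous run, and all function names are assumptions for illustration; the conversational limits are the defaults quoted above.

```python
import math

CONVERSATIONAL_LIMITS_MS = {"p95": 500, "p99": 1200}  # default thresholds quoted above

def percentile(samples_ms, pct):
    """Nearest-rank percentile of wall-clock latencies (one sample per golden query)."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_report(samples_ms):
    """p50, p95 and p99 across the run."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}

def threshold_breaches(report, limits=CONVERSATIONAL_LIMITS_MS):
    """Names of percentiles at or above their limit (thresholds are strict '<')."""
    return [name for name, limit in limits.items() if report[name] >= limit]

def drift_status(baseline_ms, current_ms):
    """Assumption: drift is the relative increase against the previous run."""
    change = (current_ms - baseline_ms) / baseline_ms
    if change > 0.50:
        return "fail"
    if change > 0.20:
        return "warn"
    return "ok"

# Hypothetical run: 100 queries with latencies of 1..100 ms.
report = latency_report(list(range(1, 101)))
print(report)                      # {'p50': 50, 'p95': 95, 'p99': 99}
print(threshold_breaches(report))  # []
print(drift_status(400, 500))      # warn (25% slower than baseline)
```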

Does Sau5 test for permission and access-control leaks in retrieval?

Yes — access-control verification sits inside Domain 4. The common failure: vector search bypasses row-level security on the source system, and an in-scope question from a junior user returns a chunk from a document only senior users were meant to see. The chunk's underlying document had ACLs; the embedding in the vector store did not.

Sau5 seeds the corpus during scoping with documents tagged for specific user roles or groups, then issues identical queries as different user personas and confirms retrieval respects the documented boundaries. The pass bar is zero leaks. This maps directly to OWASP LLM02 (Sensitive Information Disclosure).
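The persona-based check can be sketched as a comparison between what each persona retrieved and what its roles are allowed to see. The data shapes, role names and function name below are hypothetical, and the sketch assumes that documents without a role tag are open to everyone.

```python
def find_acl_leaks(retrievals, doc_roles, persona_roles):
    """Return (persona, query, doc_id) for every chunk retrieved outside its ACL.

    retrievals: list of (persona, query, [doc_ids returned by the retriever]).
    doc_roles: doc_id -> set of roles allowed to read it (seeded during scoping).
    persona_roles: persona -> set of roles that persona holds.
    """
    leaks = []
    for persona, query, doc_ids in retrievals:
        held = persona_roles[persona]
        for doc_id in doc_ids:
            allowed = doc_roles.get(doc_id)  # None means untagged, treated as open
            if allowed is not None and not (allowed & held):
                leaks.append((persona, query, doc_id))
    return leaks

# Hypothetical seeded corpus and personas; the same query is issued as each persona.
doc_roles = {"salary-bands": {"senior"}, "holiday-policy": {"junior", "senior"}}
persona_roles = {"junior_user": {"junior"}, "senior_user": {"senior"}}
retrievals = [
    ("junior_user", "What is the pay range?", ["holiday-policy", "salary-bands"]),
    ("senior_user", "What is the pay range?", ["salary-bands"]),
]

leaks = find_acl_leaks(retrievals, doc_roles, persona_roles)
print(leaks)  # the junior persona was served a senior-only document
```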

Can Sau5 test a RAG system built with LlamaIndex or LlamaCloud?

Yes, with one caveat. Open-source LlamaIndex is fully testable end-to-end because chunking, embeddings and retriever internals are all inspectable. LlamaCloud (managed indexing, embedding and retrieval behind their API) is testable as a black box for outcome measurement, but root-cause diagnostics — explaining why a query failed — often require visibility into the chunking and embedding layer, which managed services abstract away.

LlamaIndex's built-in evaluation module (Faithfulness, Answer Relevance, Context Relevance) is a useful in-loop tool for developers during build. It is not a substitute for an independent third-party assessment with SME-labelled golden datasets, two-run judge agreement, and slice-level drift detection.

07 · INSIGHTS

AI Insights.

First engagements are being scoped now.

Slots are limited. Join the waitlist to hear from Sau5 first, before the next round opens.