Methodology — How Sau5 tests AI

Q: How long is a Sau5 engagement?

The RAG Accuracy & Grounding Assessment is a fixed four-week engagement, kickoff to handover. Week 1 is scope and golden dataset construction. Weeks 2 and 3 run the four upstream domains in parallel. Week 4 packages the harness, wires the CI/CD, and delivers the findings report and bilingual readout. Fixed scope, fixed fee.

Q: What does Sau5 not test?

Sau5 does not do general software QA, browser or UI testing, performance and load testing, security audits outside the LLM attack surface, or project management. The methodology covers RAG systems and LLM-mediated applications. If the question is about non-AI systems, Sau5 is not the right vendor.

Q: Why are adversarial test cases restricted?

Publishing specific prompt injection payloads, jailbreak patterns, and PII canary techniques on the open web arms attackers without commensurate benefit to defenders. Domain 4 test cases are restricted to engaged clients with signed adversarial-test authorisation. Sau5 retains nothing related to a client's specific attack surface after handover.

Q: Is the eval harness open source?

The harness is licensed to the client at handover, not open source. The underlying libraries it sits on (RAGAS, BEIR, DeBERTa, Garak, PyRIT, sentence-transformers) are open source. The orchestration, the test cases, and the dataset construction methodology are Sau5's IP. The client owns their copy of the delivered harness and can extend it indefinitely without ongoing Sau5 involvement.

Q: Can Sau5 work with our existing eval platform (Braintrust, Galileo, LangSmith)?

Yes. Sau5's harness writes results to any platform that accepts structured input. If you already have Braintrust, Galileo, LangSmith or similar in place, Sau5 integrates rather than replaces. The methodology and the test cases are the deliverable; the storage and dashboard layer is whatever you prefer.

Q: What languages does Sau5 deliver in?

English and Spanish, end to end. Findings reports, walkthrough sessions, harness documentation, and dataset construction can all run in either language. This is one of Sau5's stated competitive advantages for clients with bilingual development teams or LatAm operations.

Q: Who owns the dataset and the harness at handover?

The client. Both the golden dataset and the runnable harness are delivered as the client's property. Sau5 retains the methodology that produced them but does not keep copies of the client's dataset, the harness configuration, or any attack-surface information after handover.

Q: What happens after the four weeks?

Three options. Path A: full handover, the client runs the methodology themselves, no ongoing relationship with Sau5. Path B: Sau5 managed service, where Sau5 runs the harness on the client's behalf on a quarterly or half-yearly cadence and surfaces regressions. Path C: hybrid, where the client owns and runs the harness and Sau5 stays on retainer for quarterly re-runs, methodology updates, and advisory. Most clients choose B or C because AI systems drift continuously and a one-off assessment ages out within three to six months.

Why Sau5 tests AI differently.

The AI testing market splits roughly into three camps. Platform vendors sell infrastructure: dashboards, eval runners, traces, logs. Big-Four consultancies sell governance: frameworks, risk registers, audit reports. Software QA shops sell automation: Selenium scripts, regression suites, browser testing.

None of those three is testing AI.

A dashboard cannot tell you whether a faithfulness score of 0.87 is acceptable for your use case. A governance framework cannot decompose a multi-hop claim into its atomic propositions. A Selenium script cannot detect prompt injection at conversation turn five.

Sau5 sits in the gap. The methodology is engineering, not paperwork. The deliverable is a runnable harness, not a report. The skill set is built from running the same five-domain assessment across every engagement and refining what works.

That refinement is the product. The first engagement Sau5 ran produced 14 changes to the methodology. The fourth produced two. The methodology converges because the failure modes converge: every RAG system fails in recognisably similar ways, and once you've tested enough of them you stop being surprised. The skeleton the manual sits on doesn't change. The depth on each chapter keeps growing.

This page is the public version of that methodology. The full Test Cases Manual ships with every engagement. A sample chapter is available below.

The five domains.

Every Sau5 assessment runs against five domains, in order. The order matters: each domain depends on the one above it being measured first. If retrieval is broken, no amount of grounding work compensates. If grounding is broken, hallucination metrics become noise.

#	Domain	The question it answers
1	Retrieval Quality	Is the right content being surfaced?
2	Answer Grounding	Are responses traceable to retrieved source?
3	Hallucination Detection	Is the model inventing things?
4	Adversarial Robustness	Can the system be manipulated?
5	Eval Operations	Is testing repeatable and continuous?

The rest of this page is one section per domain. Each section follows the same shape: what the domain measures, why it matters, how Sau5 tests it, what the metrics and thresholds are, and what failure looks like when it reaches production.

Domain 1 — Retrieval Quality.

What it measures

Retrieval Quality is the foundation layer of every RAG assessment. It measures whether the system surfaces the correct documents for a given query, before any answer is generated, before any grounding is checked, before any model behaviour matters at all.

If retrieval fails, every downstream metric becomes unreliable. A system can have a state-of-the-art generator and still produce bad answers if the wrong chunks come back. Conversely, a mediocre generator with strong retrieval often outperforms a strong generator with weak retrieval. The retrieval layer is doing most of the work and getting most of the blame for problems that originate elsewhere.

This domain is also where the most expensive optimisation mistakes happen. Teams swap embedding models to chase a benchmark gain, lift Recall@5 by two points on a public test set, and silently regress on the specific query patterns their users actually issue. Sau5's Domain 1 testing is built to catch that pattern.

Why it matters

Three production failures cluster in this domain.

The first is silent drift after re-indexing. A re-ingestion of the corpus changes chunk boundaries, embedding distribution, or both. Aggregate recall looks unchanged. Users report that the system "feels worse than last week", a subjective complaint that turns out to be a 12-point drop on long-tail queries that the aggregate metric was hiding.

The second is embedding-model substitution regret. A team upgrades from one embedding model to a newer one with a stronger MTEB score. Six weeks later, support tickets reveal the new model handles short queries better but degrades sharply on multi-sentence queries, the exact distribution the team's users issue. Domain 1 testing forces the comparison on the client's query distribution, not a public benchmark's.

The third is reranker overconfidence. Adding a cross-encoder reranker improves NDCG by a clear margin on standard test sets, but reranker training data rarely matches enterprise corpora. Without per-query-class testing, teams discover too late that the reranker helps common questions and actively hurts the long tail.

How Sau5 tests it

The testing harness for Domain 1 has four parts.

Golden dataset construction

Sau5 builds a versioned dataset of 100+ records during Week 1 of every engagement. The dataset is structured across ten query types: definitional, comparative, procedural, multi-hop, temporal, numerical, negation, long-tail, ambiguous, and out-of-scope. Each record contains the query, the expected source chunks (labelled by subject-matter experts), and metadata flagging the query type and difficulty. Without a domain-tuned dataset, retrieval testing produces noise.
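A record of the shape described above could be represented as follows. This is a minimal sketch, not Sau5's actual schema: the class name `GoldenRecord`, the field names, the difficulty scale, and the rule that out-of-scope queries may expect zero chunks are all assumptions made for illustration. Only the ten query types, the expert-labelled expected chunks, and the type and difficulty metadata come from the description.

```python
from dataclasses import dataclass

# The ten query types named in the text.
QUERY_TYPES = frozenset({
    "definitional", "comparative", "procedural", "multi-hop", "temporal",
    "numerical", "negation", "long-tail", "ambiguous", "out-of-scope",
})


@dataclass(frozen=True)
class GoldenRecord:
    """One golden-dataset record (hypothetical schema)."""
    record_id: str
    query: str
    expected_chunk_ids: tuple[str, ...]  # labelled by subject-matter experts
    query_type: str
    difficulty: str = "medium"           # assumed scale: easy / medium / hard

    def __post_init__(self) -> None:
        if self.query_type not in QUERY_TYPES:
            raise ValueError(f"unknown query type: {self.query_type!r}")
        # Assumption: only out-of-scope queries may expect no chunks at all.
        if self.query_type != "out-of-scope" and not self.expected_chunk_ids:
            raise ValueError("in-scope records need at least one expected chunk")


example = GoldenRecord(
    record_id="r-001",
    query="What is the notice period for contract termination?",
    expected_chunk_ids=("policy-12#3",),
    query_type="definitional",
)
```

Validating the query type at construction time keeps mislabelled records out of the per-type breakdowns that the later analysis relies on.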

Multi-metric measurement

Sau5 measures four metrics per query and aggregates by query type:

Metric	What it measures	Why it's tracked
Recall@k	Fraction of relevant documents in the top-k retrieved set	The headline metric, but cannot tell you whether ranking is good
Precision@k	Fraction of top-k results that are relevant	Catches retrieval that surfaces relevant content alongside high noise
MRR	Average inverse rank of the first relevant result	Sensitive to whether the top result is the right one
NDCG@k	Normalised discounted cumulative gain at k	Penalises late-rank correct results, rewards correct ordering
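For readers who want the four metrics spelled out, the sketch below computes them for a single query using binary relevance. It is plain Python written for illustration, not the RAGAS, BEIR, or pytrec_eval implementations the harness uses; the function names and the example chunk ids are hypothetical, and it assumes each retrieved id appears at most once.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots occupied by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result; MRR is the mean of this over queries."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


retrieved = ["c7", "c2", "c9", "c4", "c1"]  # ranked system output
relevant = {"c2", "c4", "c8"}               # expert-labelled chunks

recall = recall_at_k(retrieved, relevant, 5)
precision = precision_at_k(retrieved, relevant, 5)
rr = reciprocal_rank(retrieved, relevant)
ndcg = ndcg_at_k(retrieved, relevant, 5)
```

In this example two of three relevant chunks are retrieved, but neither sits at rank 1, which is the situation Recall@k alone cannot distinguish from a well-ranked result.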

Slice-level analysis

The aggregate score is almost always misleading. Sau5 reports per-query-type breakdowns and flags any slice that falls more than 5 points below the engagement-baseline target. This is where the silent regressions live.
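The per-slice breakdown can be sketched as below. The function name `slice_report`, the single shared target, and the reading of "5 points" as 0.05 on a 0 to 1 scale are assumptions for illustration; the harness's real interface is not described on this page.

```python
from collections import defaultdict
from statistics import mean


def slice_report(scores, target: float, margin: float = 0.05) -> dict:
    """Group per-query scores by query type and flag weak slices.

    scores: iterable of (query_type, score) pairs with scores in [0, 1].
    A slice is flagged when its mean falls more than `margin` below `target`.
    """
    by_type = defaultdict(list)
    for query_type, score in scores:
        by_type[query_type].append(score)

    report = {}
    for query_type, values in sorted(by_type.items()):
        avg = mean(values)
        report[query_type] = {
            "mean": avg,
            "n": len(values),
            "flagged": avg < target - margin,
        }
    return report


# Eight easy queries and two long-tail queries: the aggregate looks healthy.
scores = [("definitional", 1.0)] * 8 + [("long-tail", 0.5)] * 2
aggregate = mean(score for _, score in scores)
report = slice_report(scores, target=0.85)
```

Here the aggregate is 0.90, above a 0.85 target, while the long-tail slice averages 0.50 and is flagged. That is the kind of regression an aggregate-only report hides.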

Drift detection between runs

Once the baseline is established, every subsequent run is compared against it. Sau5's harness flags any per-query-type drop greater than 2 percentage points as a warn, and any drop greater than 5 percentage points as a fail that blocks deploy in CI/CD.
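The warn and fail rule can be expressed in a few lines. This is a sketch under stated assumptions: metrics are on a 0 to 1 scale, `classify_drift` and `drift_report` are hypothetical names, and every query type in the baseline is assumed to be present in the current run.

```python
def classify_drift(baseline: float, current: float,
                   warn_pts: float = 2.0, fail_pts: float = 5.0) -> str:
    """Classify a per-query-type change as pass, warn, or fail.

    The drop is expressed in percentage points. Rounding guards against
    floating-point noise pushing an exact 5-point drop over the fail line.
    """
    drop = round((baseline - current) * 100, 6)
    if drop > fail_pts:
        return "fail"
    if drop > warn_pts:
        return "warn"
    return "pass"


def drift_report(baseline: dict[str, float], current: dict[str, float]) -> dict[str, str]:
    """Apply the rule to every query type recorded in the baseline."""
    return {qt: classify_drift(score, current[qt]) for qt, score in baseline.items()}


baseline = {"definitional": 0.96, "multi-hop": 0.82, "long-tail": 0.74}
current = {"definitional": 0.95, "multi-hop": 0.79, "long-tail": 0.66}
statuses = drift_report(baseline, current)
```

With these numbers the definitional slice passes (1-point drop), multi-hop warns (3 points), and long-tail fails (8 points). Improvements produce a negative drop and always pass.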

Metrics and pass thresholds

Metric	Definition	Default pass threshold
Recall@5 (aggregate)	Recall across whole dataset, top-5 results	≥ 0.85
Recall@5 (per query type)	Recall on each of 10 query types	≥ 0.70 (long-tail) to ≥ 0.95 (definitional)
Precision@5	Precision in top-5 retrieved	≥ 0.60
MRR	Mean reciprocal rank	≥ 0.70
NDCG@10	NDCG at rank 10	≥ 0.75
Drift vs baseline	Per-query-type drop vs baseline	≤ 2 pts pass; > 2 pts warn; > 5 pts fail

These are defaults. They are negotiated per engagement against the client's risk tolerance and use case. A medical-information RAG system uses tighter thresholds than an internal HR knowledge base.

What failure looks like in production

"It used to know about X." Users notice that a previously-answered question now returns a different, usually worse chunk. A sign of post-reindex drift or embedding-model substitution.
"The right answer is in there somewhere, just not first." Users see the right document on page 2 of an internal search experience, or as the third citation in a generated answer. A sign of reranker degradation or NDCG drop without recall change.
"It thinks this is about something else." Users issue a query and the system retrieves chunks from a related-but-wrong topic. A sign of embedding-model drift on the client's domain vocabulary.

Sau5's Domain 1 testing catches all three before they reach users, if it runs continuously. A one-off Recall@5 measurement at deploy-time will catch the first two and miss the third entirely. Domain 1 testing isn't a project. It's a practice.

Tools and references

Sau5's harness uses RAGAS and BEIR for the metric implementations, pytrec_eval for the underlying scoring, and sentence-transformers for embedding-level diagnostics. The methodology follows the principles in NIST AI RMF 1.0 (measure function) and is consistent with the retrieval-evaluation patterns in Microsoft's RAG application evaluation guidance.

Domain 2 — Answer Grounding.

What it measures

Answer Grounding tests whether every factual claim in a generated response can be traced back to a retrieved source passage. A response that is fluent, on-topic, and unsourced fails this domain. A response that is awkwardly written but cites every claim to retrieved evidence passes.

The domain exists because the failure mode it catches (confidently stated, plausible, but unsupported claims) is the one that does the most reputational damage to deployed RAG systems. A grounded wrong answer is a model problem. An ungrounded wrong answer is a trust problem.

Why it matters

Across the production RAG deployments Sau5 has audited, ungrounded but plausible responses outnumber factually incorrect responses by roughly three to one in user-reported quality issues. Users tolerate "I don't know." They do not tolerate confident answers that turn out to have been invented.

Grounding is also the domain regulators and compliance teams ask about first. EU AI Act Article 13 (transparency and provision of information to deployers) requires high-risk AI systems to be transparent enough that deployers can interpret their output. For a RAG system, the most direct way to demonstrate that is to make responses attributable to identifiable source material, which is a grounding test by another name. Domain 2 stands between a deployed RAG system and an auditor asking "where did the model get this answer from?"

How Sau5 tests it

Sau5 measures grounding using a three-stage pipeline.

Stage 1 — NLI Claim Entailment

Each sentence of the generated response is treated as a claim and passed, together with the retrieved chunks, to a DeBERTa-v3-large MNLI model. The model returns one of three labels: entailment (claim is supported by the chunk), neutral (chunk neither supports nor contradicts), contradiction (chunk contradicts the claim). A fast first-pass filter.
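The routing logic of this first pass might look like the sketch below. The NLI model is abstracted behind a callable so the example runs without downloading weights; `toy_nli` is a deliberately crude stand-in, not a real entailment model. The rule that one entailing chunk is enough to accept a claim is an assumption, since the text does not say how labels are combined across chunks.

```python
from typing import Callable

# (premise, hypothesis) -> "entailment" | "neutral" | "contradiction"
NliFn = Callable[[str, str], str]


def stage1_filter(claims: list[str], chunks: list[str], nli: NliFn):
    """Split claims into those accepted at Stage 1 and those escalated.

    Assumption: a claim is accepted if at least one retrieved chunk
    entails it; anything else goes on to the LLM judge.
    """
    accepted, escalated = [], []
    for claim in claims:
        labels = [nli(chunk, claim) for chunk in chunks]
        if "entailment" in labels:
            accepted.append(claim)
        else:
            escalated.append(claim)
    return accepted, escalated


def toy_nli(premise: str, hypothesis: str) -> str:
    """Stand-in for a real NLI model: substring match only."""
    return "entailment" if hypothesis.lower() in premise.lower() else "neutral"


chunks = ["The warranty period is 24 months. Returns are accepted within 30 days."]
claims = ["The warranty period is 24 months", "Shipping is free"]
accepted, escalated = stage1_filter(claims, chunks, toy_nli)
```

In a real run the callable would wrap the DeBERTa-v3-large MNLI model named above; only the routing around it is shown here.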

Stage 2 — LLM Judge × 2

Claims that come back neutral or contradiction from Stage 1 are escalated to an LLM-as-judge. The judge is run twice, with different seeds, against the same retrieved context. Only claims where both runs agree on the verdict are accepted. Disagreement is flagged for human review. The two-run rule was added after Sau5 observed an 18% single-run judge volatility on edge-case claims during pilot engagements.
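The two-run agreement rule can be sketched as follows. The judge is passed in as a callable taking a claim, the retrieved context, and a seed; that signature, the verdict strings, and the two stub judges are hypothetical and stand in for whichever LLM judge is configured.

```python
from typing import Callable

# (claim, context, seed) -> verdict string such as "supported" / "unsupported"
JudgeFn = Callable[[str, str, int], str]


def judge_twice(claim: str, context: str, judge: JudgeFn,
                seeds: tuple[int, int] = (0, 1)) -> tuple[str, bool]:
    """Run the judge with two seeds; accept the verdict only if both agree."""
    first = judge(claim, context, seeds[0])
    second = judge(claim, context, seeds[1])
    if first == second:
        return first, True
    return "needs_human_review", False


def agreement_rate(results: list[tuple[str, bool]]) -> float:
    """Fraction of judged claims where the two runs agreed."""
    return sum(agreed for _, agreed in results) / len(results) if results else 0.0


def steady_judge(claim: str, context: str, seed: int) -> str:
    """Stub judge that ignores the seed (always self-consistent)."""
    return "supported" if claim.lower() in context.lower() else "unsupported"


def flaky_judge(claim: str, context: str, seed: int) -> str:
    """Stub judge whose verdict depends on the seed (never self-consistent)."""
    return "supported" if seed % 2 == 0 else "unsupported"


context = "Returns are accepted within 30 days."
results = [
    judge_twice("returns are accepted within 30 days", context, steady_judge),
    judge_twice("shipping is free", context, steady_judge),
    judge_twice("shipping is free", context, flaky_judge),
]
rate = agreement_rate(results)
```

Two of the three claims reach a stable verdict and the third is routed to human review, so the agreement rate for this toy batch is two thirds.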

Stage 3 — Atomic Decomposition

Long compound claims are decomposed into atomic propositions. "The Tesla Model 3 was released in 2017 and costs from $39,990" becomes two propositions: (Tesla Model 3, released, 2017) and (Tesla Model 3, base price, USD 39,990). Each is verified independently. A response is grounded only if every atomic proposition is supported.
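The "every proposition must be supported" rule is easy to state in code. The sketch below takes already-decomposed propositions as input; the decomposition step itself is out of scope here. The `Proposition` type, the `claim_is_grounded` name, and the choice to treat a claim with no propositions as not grounded are assumptions for illustration.

```python
from typing import Callable, Iterable, NamedTuple


class Proposition(NamedTuple):
    subject: str
    relation: str
    value: str


def claim_is_grounded(propositions: Iterable[Proposition],
                      is_supported: Callable[[Proposition], bool]) -> bool:
    """True only if every atomic proposition is independently supported.

    Assumption: a claim that yields no propositions is not counted as grounded.
    """
    props = list(propositions)
    return bool(props) and all(is_supported(p) for p in props)


# The compound claim from the text, split into its two propositions.
tesla_claim = [
    Proposition("Tesla Model 3", "released", "2017"),
    Proposition("Tesla Model 3", "base price", "USD 39,990"),
]

# Toy evidence store: only the release year is backed by retrieved chunks.
evidence = {Proposition("Tesla Model 3", "released", "2017")}

grounded = claim_is_grounded(tesla_claim, lambda p: p in evidence)
```

Because the price has no supporting evidence in this toy store, the compound claim is not grounded even though half of it is correct.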

Metrics and pass thresholds

Metric	Definition	Default pass threshold
Faithfulness	Fraction of generated claims supported by retrieved context	≥ 0.90
Answer Relevance	Cosine similarity between query and response embeddings	≥ 0.85
Citation Coverage	Fraction of factual claims with an explicit source citation	≥ 0.95
Judge Agreement Rate	Fraction of LLM-judge claims where two runs agree	≥ 0.92

A system fails Domain 2 if any of the four metrics falls below threshold, or if any atomic proposition in any tested claim is unsupported under atomic_strict_mode.
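The pass rule above combines four thresholds with the strict atomic check. A minimal sketch follows, assuming `atomic_strict_mode` is a boolean switch and that unsupported propositions are counted upstream; the metric keys and the `domain2_verdict` name are hypothetical.

```python
# Default thresholds from the table above.
DOMAIN2_THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevance": 0.85,
    "citation_coverage": 0.95,
    "judge_agreement_rate": 0.92,
}


def domain2_verdict(metrics: dict[str, float],
                    unsupported_propositions: int = 0,
                    atomic_strict_mode: bool = True) -> tuple[str, list[str]]:
    """Return ("PASS" | "FAIL", reasons) for Domain 2."""
    reasons = [
        name for name, floor in DOMAIN2_THRESHOLDS.items()
        if metrics[name] < floor
    ]
    # In strict mode a single unsupported proposition fails the domain,
    # even when all four aggregate metrics clear their thresholds.
    if atomic_strict_mode and unsupported_propositions > 0:
        reasons.append("atomic_strict_mode")
    return ("FAIL" if reasons else "PASS"), reasons


metrics = {
    "faithfulness": 0.93,
    "answer_relevance": 0.88,
    "citation_coverage": 0.96,
    "judge_agreement_rate": 0.94,
}
clean = domain2_verdict(metrics)
strict = domain2_verdict(metrics, unsupported_propositions=1)
```

The second call shows why the strict check is separate from the averages: the scores are all above threshold, yet one unsupported proposition is enough to fail.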

What failure looks like in production

The "confident summary" pattern. Model produces a polished one-paragraph answer that reads as if lifted from a single authoritative source. In reality it stitches together three retrieved chunks and adds a "framing sentence" that no chunk supports.
The "specific number" pattern. Model produces a numerical claim ("approximately 23% of cases") where retrieval returned no quantitative data. The number was generated.
The "implied authority" pattern. Model uses phrasings like "studies have shown" or "according to industry reports" without any retrieved chunk containing such a study or report.

The Sau5 grounding pipeline catches all three with high reliability, but only if the test dataset is built to elicit them. Domain 2 testing depends on a golden dataset specifically constructed with grounding-trap queries.

Tools and references

Sau5's harness uses DeBERTa-v3-large MNLI for NLI, RAGAS for faithfulness scoring, sentence-transformers for relevance, and a configurable LLM judge (GPT-4o or Claude 3.5 Sonnet by default). The methodology aligns with patterns in RAGTruth and is consistent with the faithfulness-evaluation guidance in NIST AI RMF 1.0.

Domain 3 — Hallucination Detection.

What it measures

Hallucination Detection tests the rate at which the model invents content, and, crucially, what kind of content it invents. Sau5 tracks five distinct hallucination types because the failure modes are not interchangeable. An intrinsic contradiction is a different bug from a fabricated citation, and the fix for each is different.

The five types Sau5 measures:

Intrinsic. Claim contradicts a retrieved chunk.
Extrinsic, fabricated. Claim has no support in retrieved chunks and no truth elsewhere.
Extrinsic, plausible. Claim has no support in retrieved chunks but happens to be true (or partially true) in the wider world.
Over-specification. Claim adds invented specifics (numbers, dates, named entities) to an otherwise grounded statement.
Entity substitution. Claim swaps named entities (people, organisations, products) for similar but incorrect ones.

A system fails Domain 3 if the total hallucination rate exceeds 5%, or if any individual hallucination type exceeds a domain-specific threshold negotiated per engagement.

Why it matters

Hallucination is the failure mode that breaks user trust fastest, because users cannot tell the difference between a hallucinated answer and a correct one without independently verifying every claim. Hallucinated answers also tend to be the ones that travel: they get screenshotted, shared, and quoted. By the time the model is corrected, the wrong answer has propagated.

In regulated industries the asymmetry sharpens. A single hallucinated numerical claim in a clinical-decision support tool, or a single fabricated citation in a legal-research assistant, is a containment event. Domain 3 prevents containment events from being the way you discover the problem.

How Sau5 tests it

Domain 3 reuses the three-stage detection pipeline from Domain 2 (NLI → LLM Judge × 2 → Atomic Decomposition) but feeds the verdicts into a five-type classifier rather than a single faithfulness score. The classifier assigns each unsupported claim to one of the five hallucination types, then aggregates per-type rates.

The dataset for Domain 3 is constructed differently from Domain 2. Where Domain 2 uses general-purpose grounding queries, Domain 3 uses trap queries deliberately designed to elicit each hallucination type:

Numerical traps: quantitative questions where the corpus contains only qualitative content
Citation traps: questions inviting the model to cite a "study" or "report" that doesn't exist
Entity traps: questions about real entities with intentionally similar-sounding distractors in the corpus
Temporal traps: questions about dates or sequences the corpus is silent on

Without trap queries, aggregate hallucination rates look reassuringly low. Hallucinations rarely surface on easy queries.

Metrics and pass thresholds

Metric	Definition	Default pass threshold
Total Hallucination Rate	All hallucination types ÷ all claims	≤ 5%
Intrinsic Rate	Claims contradicting retrieved chunks	≤ 2%
Extrinsic Rate (combined)	Fabricated + plausible	≤ 3%
Over-specification Rate	Invented specifics added to grounded claims	≤ 1%
Entity Substitution Rate	Named-entity swaps	≤ 0.5%

For regulated-industry clients (finance, healthcare, legal), Sau5 tightens thresholds, typically by half, with the over-specification and entity-substitution rates often set to zero.
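The per-type rates and the default limits can be checked as in the sketch below. It assumes the classifier has already produced one label per claim (`None` for a supported claim); the label strings and function names are hypothetical, and the proprietary classifier itself is not shown.

```python
from collections import Counter

HALLUCINATION_TYPES = (
    "intrinsic", "extrinsic_fabricated", "extrinsic_plausible",
    "over_specification", "entity_substitution",
)

# Default limits from the table above, as fractions of all claims.
DEFAULT_LIMITS = {
    "total": 0.05,
    "intrinsic": 0.02,
    "extrinsic": 0.03,  # fabricated + plausible combined
    "over_specification": 0.01,
    "entity_substitution": 0.005,
}


def hallucination_rates(labels: list) -> dict[str, float]:
    """labels holds one entry per claim: a type name, or None if supported."""
    total_claims = len(labels)
    if total_claims == 0:
        raise ValueError("no claims to score")
    counts = Counter(label for label in labels if label is not None)
    unknown = set(counts) - set(HALLUCINATION_TYPES)
    if unknown:
        raise ValueError(f"unknown hallucination types: {sorted(unknown)}")
    return {
        "total": sum(counts.values()) / total_claims,
        "intrinsic": counts["intrinsic"] / total_claims,
        "extrinsic": (counts["extrinsic_fabricated"]
                      + counts["extrinsic_plausible"]) / total_claims,
        "over_specification": counts["over_specification"] / total_claims,
        "entity_substitution": counts["entity_substitution"] / total_claims,
    }


def domain3_failures(rates: dict[str, float], limits: dict[str, float] = DEFAULT_LIMITS) -> list[str]:
    """Names of every rate that exceeds its limit; empty means pass."""
    return [name for name, limit in limits.items() if rates[name] > limit]


labels = ([None] * 193 + ["intrinsic"] * 3 + ["extrinsic_fabricated"] * 2
          + ["extrinsic_plausible"] + ["over_specification"])
rates = hallucination_rates(labels)
failures = domain3_failures(rates)
```

Tighter limits for a regulated client would be passed in through the `limits` argument rather than by editing the defaults.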

What failure looks like in production

The "confident statistic" pattern. Customer-facing assistant produces a percentage, dollar amount, or count where the underlying data contains none. Over-specification hallucination type.
The "almost right entity" pattern. Assistant attributes a quote, policy, or product feature to a similar-named but wrong entity. Entity substitution hallucination type.
The "common knowledge" pattern. Assistant answers from its parametric knowledge when the corpus is silent, producing a plausible answer that happens to be wrong in the client's specific domain. Extrinsic, plausible hallucination type, the most subtle and the hardest to catch without trap queries.

Tools and references

The detection pipeline reuses the Domain 2 stack. The five-type classifier and trap-query dataset construction are proprietary to Sau5 and ship with the harness at handover. The taxonomy itself draws on academic work in HaluEval and FActScore, refined against the failure-mode distribution Sau5 has observed in production engagements.

Domain 4 — Adversarial Robustness.

What it measures

Adversarial Robustness tests whether a deployed AI system can be manipulated into behaving outside its intended scope: emitting prohibited content, leaking sensitive data, bypassing safety controls, or being coerced into actions its designers explicitly prevented.

This is the only Sau5 domain where the test cases themselves are not published in any public document. Specific injection payloads, jailbreak patterns, and PII-canary techniques are restricted to engaged clients with signed adversarial-test authorisation. Publishing them on the open web arms attackers without commensurate benefit to defenders.

Why it matters

Three things have changed in the last 18 months. First, prompt injection moved from a research curiosity to a documented attack pattern with public proofs-of-concept against major commercial AI systems. Second, regulatory frameworks (EU AI Act, NIST AI RMF, ISO/IEC 42001) now treat adversarial robustness as a baseline expectation for high-risk AI deployment. Third, the cost of a single successful attack, measured in remediation, customer trust, and regulatory response, has climbed sharply.

Domain 4 is the test that converts "we believe the system is robust" into "we have tested the system against the current threat surface, here is the evidence, here is the date."

How Sau5 tests it

Sau5 runs Domain 4 in an isolated test environment only, never against production. The environment is provisioned by Sau5, scoped to the system under test, and torn down at the end of the engagement. No client production data enters the environment.

The test battery is structured across six categories:

Category	What it probes
Direct prompt injection	Whether attacker-controlled input can override system instructions
Jailbreak variants	Whether the system can be coerced past safety controls via roleplay, hypotheticals, or refusal-class subversion
Encoding obfuscation	Whether instruction-following persists across base64, Unicode tricks, language switching, and similar
Multi-turn escalation	Whether the system can be gradually steered off-scope over a 5-turn conversation
Boundary probes	Whether the system enforces its documented scope under deliberate edge-case input
PII canary retrieval	Whether sensitive data planted in the corpus can be elicited via crafted queries

Specific payloads and techniques in each category remain restricted.

Metrics and pass thresholds

Metric	Definition	Pass threshold
Direct injection success rate	Successful overrides ÷ injection attempts	0
Jailbreak success rate	Successful policy violations ÷ jailbreak attempts	0
PII canary retrieval rate	Canaries surfaced ÷ canaries planted	0
Multi-turn escalation success	Steered off-scope within 5 turns	0
Boundary violations	Documented scope exceeded under edge input	0

Domain 4 uses a binary pass bar. Any single successful attack is a failure regardless of overall rate. There is no acceptable rate. The only acceptable count is zero.
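The difference between a rate threshold and a zero-count bar can be made concrete with a short sketch. It contains no attack content; the category keys and the `domain4_verdict` name are hypothetical, and the counts are invented for illustration.

```python
def domain4_verdict(successful_attacks: dict[str, int]) -> tuple[str, list[str]]:
    """Binary bar: any category with one or more successful attacks fails."""
    breached = sorted(name for name, count in successful_attacks.items() if count > 0)
    return ("FAIL" if breached else "PASS"), breached


# Invented counts: one canary surfaced out of however many probes were run.
counts = {
    "direct_prompt_injection": 0,
    "jailbreak": 0,
    "multi_turn_escalation": 0,
    "boundary": 0,
    "pii_canary": 1,
}
verdict, breached = domain4_verdict(counts)
```

The verdict depends only on the counts, not on how many attempts were made, so a single success among thousands of probes fails the domain just as one in ten would.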

What failure looks like in production

Failures in this domain don't appear gradually. They appear all at once, usually in a public forum, usually with a screenshot. Sau5 doesn't publish failure-mode case studies for Domain 4, but every senior engineering buyer has seen at least one in the news in the last 12 months.

Tools and references

The test battery is built on top of Garak and PyRIT for the public-domain attack scaffolds, extended with Sau5's proprietary payload library and the PII canary methodology. Test cases align with the threat taxonomy in OWASP Top 10 for LLM Applications and the adversarial-testing guidance in NIST AI RMF 1.0.

Engagement teams receive the full battery, the harness, the dataset, and the methodology runbook. Sau5 retains nothing related to the client's specific attack surface after handover.

Domain 5 — Eval Operations.

What it measures

Eval Operations tests whether the testing itself is repeatable, continuous, and integrated into the development pipeline. A perfect set of findings from Domains 1–4 is worth very little if those tests don't run again automatically every time the model is updated, the corpus is re-ingested, or the prompt is changed.

This is the domain that turns the four-week engagement from a snapshot into a practice. It's also the most overlooked. Most AI quality work focuses on the first four domains and treats operations as plumbing. Sau5 treats it as the domain that determines whether the other four hold their value over time.

Why it matters

AI systems drift in ways traditional software doesn't:

Model updates: provider releases a new version, sometimes silently, and behaviour changes
Corpus drift: documents are added, removed, re-indexed; chunk boundaries shift
Embedding model updates: vector representations of the same content change
Prompt edits: small changes to system prompts produce non-obvious downstream effects
Adversarial-pattern evolution: new injection techniques emerge faster than annual review cycles can catch

A static eval suite ages out in three to six months. Domain 5 prevents the eval suite from becoming evidence of testing the client used to do.

How Sau5 tests it

Domain 5 is delivered as a packaged, runnable harness that the client owns at handover. The harness has five components:

Domain orchestrator: a Python runner that executes Domains 1–4 in sequence against the system under test
Versioned golden dataset: the client's dataset under semantic versioning, with diff tooling for tracking changes
Regression baseline store: every run is recorded; the current run is compared against the baseline
CI/CD integration: pre-built configurations for GitHub Actions, GitLab CI, and Azure DevOps
Alert routing: failures route to Slack / Microsoft Teams / PagerDuty; warnings route to the client's preferred channel

The deploy gate is the operational centrepiece. The harness exits non-zero on any FAIL across the four upstream domains, which causes the CI pipeline to block the deploy. This converts AI quality from a reporting function ("here are the scores") into a control function ("this deploy will not ship").
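The gate behaviour described above reduces to mapping domain results onto a process exit code. The sketch below assumes results arrive as a mapping from domain name to "PASS", "WARN", or "FAIL"; the function names and the result format are hypothetical, not the delivered harness's interface.

```python
def gate_exit_code(domain_results: dict[str, str]) -> int:
    """Return 1 if any upstream domain failed, otherwise 0.

    WARN is reported but does not block: only FAIL yields a non-zero code.
    """
    return 1 if any(status == "FAIL" for status in domain_results.values()) else 0


def summarise(domain_results: dict[str, str]) -> str:
    """One line per domain, suitable for a CI log."""
    return "\n".join(f"{name}: {status}" for name, status in domain_results.items())


results = {
    "retrieval_quality": "PASS",
    "answer_grounding": "WARN",
    "hallucination_detection": "FAIL",
    "adversarial_robustness": "PASS",
}
code = gate_exit_code(results)
log = summarise(results)

# A CI entry point would end with: sys.exit(gate_exit_code(results))
```

CI systems such as GitHub Actions, GitLab CI, and Azure DevOps treat a non-zero exit code from a step as a failed step, which is what lets a plain exit code act as the deploy block.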

Metrics and pass thresholds

Metric	Definition	Default pass threshold
Harness runtime (p95)	95th-percentile end-to-end run time	< 10 minutes
Baseline freshness	Days since baseline last refreshed	< 30
Regression catch rate	Induced regressions caught ÷ injected	100%
CI integration coverage	Pipelines wired to the harness ÷ pipelines that should be	100%
Alert delivery latency (p95)	Time from failure to alert delivered	< 60s

What failure looks like in production

The "broken gate" pattern. Harness runs, results are recorded, but the CI gate is misconfigured and doesn't actually block the deploy. The client believes they have a safety net they don't have.
The "stale baseline" pattern. Baseline is from six months ago. Regressions get measured against a comparison point that no longer reflects reality. False positives proliferate, the team stops trusting the alerts, alerting gets silenced.
The "noisy dashboard" pattern. Results are posted to a Grafana / Datadog / internal dashboard with no clear owner. Nobody reads them. Failures pile up. The first signal that quality has degraded is a user complaint, not a harness alert.

Tools and references

The harness packaging is built on Python 3.11+, uses standard CI/CD primitives (no vendor lock-in), and is licensed to the client at handover with no ongoing dependency on Sau5 infrastructure. Sau5 does not host any part of the client's eval operation by default. Clients who want Sau5 to run it for them (Path B) receive the same harness, deployed to a Sau5-managed environment on their behalf.

The 4-week engagement.

The Sau5 RAG Accuracy & Grounding Assessment is a fixed-scope, fixed-fee, four-week engagement. The shape is the same on every engagement; the inputs and outputs are client-specific.

Week 1 — Scope & Dataset

System access, knowledge-base inventory, environment provisioning. Golden dataset design across the ten canonical query types. 100+ records constructed, labelled by subject-matter experts, and committed under version control. Domain 1 retrieval testing begins on Day 5.

Weeks 2–3 — Execution

All four upstream domains run in parallel. Retrieval is measured against Recall@k / Precision@k / MRR / NDCG. Grounding is measured through the NLI / Judge × 2 / Atomic pipeline. Hallucination is classified across the five types. Adversarial probes run against the isolated test environment under signed authorisation. Daily findings are surfaced to the client; root causes are investigated as they emerge rather than batched.

Week 4 — Ops & Handover

Eval harness packaged and delivered to the client. CI/CD wired into the client's pipeline of choice. The gate-validation meta-test executed against the live integration. Findings report delivered: per-domain scores, ranked root causes, severity-weighted remediation roadmap. Bilingual EN/ES readout, 90 minutes, recorded for the client's records.

The engagement closes when the client can re-run the assessment without Sau5 present. The relationship continues only if the client chooses Path B (managed service) or Path C (retainer + quarterly re-runs).

What you walk away with.

Every Sau5 engagement ships the same six deliverables on Day 20:

Eval harness. A runnable Python package, client-owned, wired into the client's CI/CD.
Golden dataset. Versioned, SME-reviewed, tagged by query type.
Test cases manual. The full methodology document (~200 pages).
Findings report. Per-domain scores, root causes, remediation roadmap.
Live walkthrough. In English or Spanish (if applicable).
CI/CD configurations. Validated against the live integration before sign-off.

Sample chapter · Free PDF

See what a real Sau5 test case looks like.

A 15-page extract from the Sau5 Test Cases Manual. One worked test case from each of four domains, the chapter cover for the restricted adversarial domain, and what the client receives at handover. No email required.

Download the sample · PDF, ~600 KB

