AI ASSURANCE

The AI testing market has a quality problem.

Every enterprise is shipping AI. Almost none are testing it.
Sau5 closes that gap in four weeks, across five domains.

01 · THE PROBLEM

Most enterprises are deploying AI with no way to test it.

1 in 3
Answers from leading retrieval-grounded legal AI tools were factually wrong on real-world queries (Stanford RegLab, 2024). Models that look fine in a demo break in production.
83%
of AI deployments lack any structured testing process.
4×
faster growth than traditional QA across enterprise tooling spend.
02 · THE SOLUTION

A consultancy that tests AI —
only AI.

01 · OFFERING

RAG Accuracy & Grounding Assessment

A 5-domain, 4-week structured assessment of any RAG system. Findings report, remediation roadmap, and a repeatable eval harness the client keeps and runs themselves.

4-Week Engagement
02 · OFFERING

Training & Enablement

A practitioner curriculum to take QA engineers from zero AI testing experience to productive on RAG assessment engagements — covering all five domains of the Sau5 methodology. In development; not yet available.

Coming Soon
03 · METHODOLOGY

The 5-domain framework,
run end-to-end in 4 weeks.

DOMAIN 01

Retrieval Quality

Is the correct content being surfaced — fast enough? The foundation. Every metric downstream depends on getting this right.

Recall@5 ≥ 0.85
Precision@5 ≥ 0.70
Latency p95 < 500ms
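As a rough illustration of the two retrieval metrics named above, the sketch below computes Recall@k and Precision@k for a single query. The function names and the chunk-ID representation are assumptions made for this example, not the Sau5 harness API, and retrieved IDs are assumed to be unique.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Share of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0  # no labelled relevant chunks: nothing to recall
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Share of the top-k slots filled by a relevant chunk (divides by k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


# Hypothetical query: three chunks are labelled relevant, two of them are retrieved.
ranked = ["c1", "c7", "c3", "c9", "c4"]
labelled = {"c1", "c3", "c8"}
print(recall_at_k(ranked, labelled), precision_at_k(ranked, labelled))
```

Averaging these per-query values across the golden dataset gives the figures that would be compared against the 0.85 and 0.70 thresholds.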
DOMAIN 02

Answer Grounding

Faithfulness scored via NLI + LLM-as-judge. Every claim must trace to retrieved source.

Faithfulness ≥ 0.90
Answer Rel. ≥ 0.85
Two-run rule: REQ.
DOMAIN 03

Hallucination Detection

Three-stage pipeline: NLI → LLM-judge → atomic claim decomposition. Five hallucination types tracked.

Hall. rate ≤ 5%
Intrinsic: CRITICAL
Pipeline: 3-STAGE
DOMAIN 04

Adversarial Robustness

Injection, jailbreak, boundary and PII canary tests. Run in an isolated environment with written client authorisation.

Injections: 8+ payloads
PII canaries: 36+ probes
Pass bar: ZERO
DOMAIN 05

Eval Operations

Python harness, regression detector, golden dataset versioning, and CI/CD configs the client keeps and runs.

CI gate: PR-BLOCKING
Dataset: 100+ Q-A
Handover: CLIENT-OWNED
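A PR-blocking gate can be pictured as a small script that compares a run's metrics against fixed thresholds and returns a non-zero exit code on any miss. The sketch below is illustrative only: the thresholds are the ones quoted in the domain summaries on this page, while the metric keys and function names are assumptions, not the actual harness.

```python
from typing import Dict, List, Tuple

# Thresholds quoted on this page; the metric key names are hypothetical.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "recall_at_5": (">=", 0.85),
    "precision_at_5": (">=", 0.70),
    "faithfulness": (">=", 0.90),
    "answer_relevance": (">=", 0.85),
    "hallucination_rate": ("<=", 0.05),
}


def failed_gates(metrics: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or outside its threshold."""
    failures = []
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from run")
        elif op == ">=" and value < limit:
            failures.append(f"{name}: {value:.3f} is below {limit}")
        elif op == "<=" and value > limit:
            failures.append(f"{name}: {value:.3f} is above {limit}")
    return failures


def gate_exit_code(metrics: Dict[str, float]) -> int:
    """Print failures and return 1 if any gate failed, else 0."""
    failures = failed_gates(metrics)
    for message in failures:
        print("FAIL", message)
    return 1 if failures else 0
```

In a CI job the return value of `gate_exit_code` would be passed to `sys.exit`, so that a failing metric blocks the pull request.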
04 · WHY NOT YOUR EXISTING QA TEAM

Traditional QA is trained for a different problem.

Dimension | Traditional QA | Sau5
AI Testing Expertise | Learning on the job | 5-domain methodology: defined tooling, standard thresholds
Hallucination Detection | No structured approach | NLI + LLM-Judge + atomic decomposition: three-stage pipeline
Adversarial Testing | Not in scope | Full Domain 4 battery: injection, jailbreak, PII canary, boundary
Eval Ops & CI/CD | Manual and ad hoc | Automated harness, regression thresholds, CI gates as code
Delivery Model | Single-location, fixed model | Nearshore and local resourcing; engagement structured to suit your environment
Ramp Time | 3–6 months to productive | 2–4 weeks; engineers trained on the methodology before they touch your stack
05 · WHAT YOU WALK AWAY WITH

Three deliverables that
outlast the engagement.

Golden Dataset
100+

SME-reviewed Q-A-context records, versioned and refreshed every 60 days or whenever your KB changes. Yours to extend and re-run forever.
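To make the record shape and the refresh rule concrete, here is a minimal sketch of one golden dataset entry and a staleness check. The field names and the `needs_refresh` helper are assumptions for illustration; only the Q-A-context structure, the versioning, and the 60-day or KB-change refresh rule come from the description above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple

REFRESH_INTERVAL = timedelta(days=60)  # refresh cadence stated above


@dataclass(frozen=True)
class GoldenRecord:
    question: str
    expected_answer: str
    context_ids: Tuple[str, ...]  # chunks an SME marked as supporting the answer
    dataset_version: str          # hypothetical version label, e.g. "v3"
    reviewed_by: str
    reviewed_on: date


def needs_refresh(record: GoldenRecord, today: date,
                  kb_changed_on: Optional[date] = None) -> bool:
    """True if the KB changed after the last review or 60 days have passed."""
    if kb_changed_on is not None and kb_changed_on > record.reviewed_on:
        return True
    return today - record.reviewed_on >= REFRESH_INTERVAL
```

Freezing the dataclass keeps reviewed records immutable, so a change to a question or answer has to arrive as a new record under a new dataset version.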

Eval Harness Ownership
100%

At handover the Python runner, regression detector and CI/CD configs are yours. Zero ongoing dependency on Sau5 to keep testing.

Continuous re-runs

CI/CD gates fire on every commit. Regressions are caught before deploy, not after. Testing becomes part of how you ship.

For what comes next

The harness is yours. Add a quarterly or half-yearly subscription and Sau5 keeps catching regressions, refreshing the dataset, and testing each release before it ships.

06 · FAQ

The questions buyers ask most.

How long is a Sau5 engagement?

The RAG Accuracy & Grounding Assessment is a fixed four-week engagement, kickoff to handover. Week 1 is scope and golden dataset construction. Weeks 2 and 3 run the four upstream domains in parallel. Week 4 packages the harness, wires the CI/CD, and delivers the findings report and bilingual readout. Fixed scope, fixed fee.

Can Sau5 run alongside our UAT — or does it replace it?

Neither. UAT for AI systems does not work the way UAT for traditional software works. Two testers running the same query get different model answers; three testers reach three different conclusions about whether the answer was right. UAT signal collapses.

The right shape: Sau5 runs first, producing quantitative scores against a defined golden dataset. UAT then runs on top — humans test usability, tone, edge-case judgment. UAT becomes the human layer on top of a system already proven to be factually correct. Without the Sau5 layer first, you are asking humans to certify quality on a system whose quality varies between runs. That is not a test. That is a hope.

How is AI testing different from AI observability, eval platforms, or guardrails?

Four separate vendor categories, often confused. AI Testing (Sau5) is a methodology applied at a defined point. AI Observability (Arize, WhyLabs, Helicone, Datadog) is runtime monitoring of production AI traffic. AI Eval Platforms (Braintrust, Galileo, LangSmith) are SaaS dashboards for managing test runs over time. AI Guardrails (Lakera, NeMo, Patronus) are runtime filters that block bad outputs before they reach users.

The full breakdown is in the Buyer's Guide PDF above.

What does Sau5 not do — and who should we call instead?

Sau5 does AI testing only. We do not sell observability dashboards, run runtime guardrails, or operate as a SaaS eval platform. If you need runtime monitoring, talk to Arize, WhyLabs, Helicone, Langfuse, or Datadog LLM Observability. For eval platforms, Braintrust, Galileo, LangSmith, Vellum, or Humanloop — our harness integrates with any of them. For guardrails, Lakera, NeMo, Guardrails AI, Patronus, or Aporia.

Where is our data stored during a Sau5 engagement?

Sau5 minimises data handling by design. Test execution runs on client infrastructure by default. The harness is deployed into the client's environment, and client data never leaves the client perimeter for normal testing.

Where Sau5 does hold client artefacts (off-site golden dataset construction, findings report preparation), they are encrypted at rest, accessible only to the engagement team (typically one to three engineers), isolated from any other Sau5 work, and access-logged.

The harness uses the client's existing LLM endpoints — OpenAI, Anthropic, Azure OpenAI, or self-hosted — under the client's existing vendor agreements. No new third-party LLM data flows are introduced.

What happens to our data after the engagement ends?

At handover the client takes ownership of the golden dataset, the runnable harness, the findings report, and any associated artefacts.

Sau5 retains only the methodology (not client-specific) and aggregated anonymised metrics. Sau5 deletes client knowledge base content, customer queries, test outputs containing client-identifiable information, attack-surface details from Domain 4, and credentials. Deletion is confirmed in writing within 14 days.

For Path B and Path C clients (managed service or hybrid retainer), Sau5 retains only the minimum needed to run the ongoing work between cadence runs.

What happens after the four weeks?

Three options. Path A: full handover, the client runs the methodology themselves, no ongoing relationship with Sau5. Path B: Sau5 managed service, where Sau5 runs the harness on the client's behalf on a quarterly or half-yearly cadence and surfaces regressions. Path C: hybrid, where the client owns and runs the harness and Sau5 stays on retainer for quarterly re-runs, methodology updates, and advisory. Most clients choose B or C because AI systems drift continuously and a one-off assessment ages out within three to six months.

How does Sau5 measure retrieval latency?

Latency is measured inside Domain 1, on the same calls that produce Recall, Precision, MRR and NDCG, at zero added client cost. Sau5 instruments the retriever to capture wall-clock time from query submission to ranked chunks returned, then reports p50, p95 and p99 across the full golden dataset and broken down by query type.

Default thresholds: p95 < 500ms and p99 < 1.2s for conversational deployments; p95 < 2s for batch use cases. Thresholds are negotiated per engagement against the client's SLA. Drift across runs is gated: more than 20% raises a warning, more than 50% raises a fail. Sau5 measures retrieval-layer latency at steady state, not end-to-end response time or load behaviour.
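The measurement described above can be sketched in a few lines: time each retrieval call, take nearest-rank percentiles over the samples, and compare a percentile against the previous run. Everything here is an assumption for illustration, including the function names, the nearest-rank percentile method, and the reading of drift as a relative increase over the prior run's value.

```python
import math
import time
from typing import Callable, List, Sequence


def time_retrieval_ms(retrieve: Callable[[str], object], query: str) -> float:
    """Wall-clock milliseconds from query submission to results returned."""
    start = time.perf_counter()
    retrieve(query)
    return (time.perf_counter() - start) * 1000.0


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (pct between 1 and 100) of a non-empty sample."""
    ordered: List[float] = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def drift_status(baseline_ms: float, current_ms: float) -> str:
    """Gate on relative increase: above 20% warns, above 50% fails."""
    change = (current_ms - baseline_ms) / baseline_ms
    if change > 0.50:
        return "fail"
    if change > 0.20:
        return "warn"
    return "pass"
```

For example, a p95 that moves from 400ms in the previous run to 500ms in the current one is a 25% increase, which `drift_status` reports as a warning.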

Does Sau5 test for permission and access-control leaks in retrieval?

Yes — access-control verification sits inside Domain 4. The common failure: vector search bypasses row-level security on the source system, and an in-scope question from a junior user returns a chunk from a document only senior users were meant to see. The chunk's underlying document had ACLs; the embedding in the vector store did not.

Sau5 seeds the corpus during scoping with documents tagged for specific user roles or groups, then issues identical queries as different user personas and confirms retrieval respects the documented boundaries. The pass bar is zero leaks. This maps directly to OWASP LLM02 (Sensitive Information Disclosure) in the 2025 OWASP Top 10 for LLM Applications.
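The persona check can be sketched as a loop over queries and personas that flags any seeded document returned to a persona without a matching role. The `retrieve(query, persona)` signature, the ACL mapping, and the function name are hypothetical; a real engagement would wire this to the client's own retriever and identity model.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

# Hypothetical retriever shape: (query, persona name) -> ranked document IDs.
Retriever = Callable[[str, str], List[str]]


def find_leaks(
    retrieve: Retriever,
    queries: Sequence[str],
    personas: Dict[str, Set[str]],   # persona name -> roles it holds
    seeded_acl: Dict[str, Set[str]], # seeded doc ID -> roles allowed to see it
) -> List[Tuple[str, str, str]]:
    """Return (persona, query, doc_id) for every seeded doc shown without a matching role."""
    leaks = []
    for query in queries:
        for persona, roles in personas.items():
            for doc_id in retrieve(query, persona):
                allowed = seeded_acl.get(doc_id)
                # Only seeded documents are checked; untagged docs are out of scope here.
                if allowed is not None and not (allowed & roles):
                    leaks.append((persona, query, doc_id))
    return leaks
```

A zero-leak pass bar then reduces to asserting that `find_leaks(...)` returns an empty list for the full query and persona matrix.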

Can Sau5 test a RAG system built with LlamaIndex or LlamaCloud?

Yes, with one caveat. Open-source LlamaIndex is fully testable end-to-end because chunking, embeddings and retriever internals are all inspectable. LlamaCloud (managed indexing, embedding and retrieval behind their API) is testable as a black box for outcome measurement, but root-cause diagnostics — explaining why a query failed — often require visibility into the chunking and embedding layer, which managed services abstract away.

LlamaIndex's built-in evaluation module (Faithfulness, Answer Relevance, Context Relevance) is a useful in-loop tool for developers during build. It is not a substitute for an independent third-party assessment with SME-labelled golden datasets, two-run judge agreement, and slice-level drift detection.

07 · INSIGHTS

AI Insights.

First engagements are being scoped now.

Slots are limited. Join the waitlist to hear from Sau5 first, before the next round opens.