AI ASSURANCE

The AI testing market has a quality problem.

Every enterprise is shipping AI. Almost none are testing it.
Sau5 closes that gap in four weeks, across five domains.

01 · THE PROBLEM

Most enterprises are deploying AI with no way to test it.

1in 3
Answers from leading retrieval-grounded legal AI tools were factually wrong on real-world queries (Stanford RegLab, 2024). Models that look fine in a demo break in production.
83%
of AI deployments lack any structured testing process today.
4×
faster growth than traditional QA across enterprise tooling spend.
02 · THE SOLUTION

A consultancy that tests AI —
only AI.

01 · OFFERING

RAG Accuracy & Grounding Assessment

A 5-domain, 4-week structured assessment of any RAG system. Findings report, remediation roadmap, and a repeatable eval harness the client keeps and runs themselves.

4-Week Engagement
02 · OFFERING

AI Testing Staffing

Trained QA engineers embedded directly into client teams. Onshore and offshore talent available — every engineer operates to the same Sau5 quality standard, with no ramp on your stack.

Embedded Delivery
03 · OFFERING

Training & Enablement

A 5-domain curriculum for QA engineers with no prior AI testing experience. Used internally to train our own people, and sold as a client deliverable.

Curriculum
03 · METHODOLOGY

The 5-domain framework,
run end-to-end in 4 weeks.

DOMAIN 01

Retrieval Quality

Is the correct content being surfaced? Foundation layer — everything downstream depends on it.

Recall@5≥ 0.85
Precision@5≥ 0.70
MRR≥ 0.75
Read more →
DOMAIN 02

Answer Grounding

Faithfulness scored via NLI + LLM-as-judge. Every claim must trace to retrieved source.

Faithfulness≥ 0.90
Answer Rel.≥ 0.85
Two-run ruleREQ.
Read more →
DOMAIN 03

Hallucination Detection

Three-stage pipeline: NLI → LLM-judge → atomic claim decomposition. Five hallucination types tracked.

Hall. rate≤ 5%
IntrinsicCRITICAL
Pipeline3-STAGE
Read more →
DOMAIN 04

Adversarial Robustness

Injection, jailbreak, boundary and PII canary tests. Run in an isolated environment with written client authorisation.

Injections8+ payloads
PII canaries36+ probes
Pass barZERO
Read more →
DOMAIN 05

Eval Operations

Python harness, regression detector, golden dataset versioning, and CI/CD configs the client keeps and runs.

CI gatePR-BLOCKING
Dataset100+ Q-A
HandoverCLIENT-OWNED
Read more →
04 · THE ENGAGEMENT

From kickoff to harness handover — no improvisation.

Week 1 — Scope & Dataset

Onboarding and golden dataset construction.

  • D 1–2System access, KB inventory, environment access
  • D 2–3Golden dataset design — 10 query types, coverage matrix
  • D 3–5100+ records constructed, labelled, SME-reviewed
  • D 5Domain 1 retrieval testing begins
Weeks 2–3 — Execution

Four domains, run in parallel.

  • D 01Retrieval — Recall, Precision, MRR, NDCG
  • D 02Grounding — Faithfulness via NLI + LLM-Judge
  • D 03Hallucination — NLI → Judge → Atomic decomposition
  • D 04Adversarial — injection, jailbreak, boundary, PII canary
Week 4 — Ops & Handover

Reporting, CI integration, client-owned harness.

  • D 1–2Eval harness packaged — Python runner + versioned dataset
  • D 2–3CI/CD configured — GitHub Actions / GitLab / Azure
  • D 3–4Findings report — scores, root causes, remediation roadmap
  • D 5Bilingual readout (EN / ES) and handover
05 · WHY NOT YOUR EXISTING QA TEAM

Traditional QA is trained for a different problem.

Dimension
Traditional QA
Sau5
AI Testing Expertise
Learning on the job
5-domain methodology — defined tooling, standard thresholds
Hallucination Detection
No structured approach
NLI + LLM-Judge + atomic decomposition — three-stage pipeline
Adversarial Testing
Not in scope
Full Domain 4 battery — injection, jailbreak, PII canary, boundary
Eval Ops & CI/CD
Manual and ad hoc
Automated harness, regression thresholds, CI gates as code
Delivery Model
Single-location, fixed model
Nearshore and local resourcing — engagement structured to suit your environment
Ramp Time
3–6 months to productive
2–4 weeks — engineers trained on the methodology before they touch your stack
06 · WHY SAU5

Five structural advantages,
not five marketing claims.

01

Proprietary Methodology

The 5-domain RAG Assessment framework is built in code and documented in full. Nothing about the engagement is improvised, and the same framework runs on every client.

02

Fixed Scope, Fixed Fee

Every engagement is scoped the same way — 5 domains, 4 weeks, a defined deliverable. No time-and-materials, no scope creep, no surprise invoices at the end.

03

Global by Design

Sau5 operates as a global brand from day one. Engagements run in English or Spanish, against any RAG stack, in any regulatory environment, without a regional setup phase.

04

Engineers Ready Before You Are

Every Sau5 engineer completes the 5-domain curriculum before working on a client engagement. Productive from week one — no 3–6 month ramp, no learning on your stack.

05

AI Testing Is All We Do

No general software QA. No project management. No side practices. Every engagement compounds our depth in one discipline — and that depth shows up in the findings.

07 · WHAT YOU WALK AWAY WITH

Three deliverables that
outlast the engagement.

Golden Dataset
100+

SME-reviewed Q-A-context records, versioned and refreshed every 60 days or whenever your KB changes. Yours to extend and re-run forever.

Eval Harness Ownership
100%

At handover the Python runner, regression detector and CI/CD configs are yours. Zero ongoing dependency on Sau5 to keep testing.

Continuous re-runs

CI/CD gates fire on every commit. Regressions are caught automatically. Testing becomes part of how you ship, not a one-time event.

For what comes next

The harness is yours. Add a quarterly or half-yearly subscription and Sau5 keeps catching regressions, retraining your team, and testing each release before it ships.

08 · ABOUT

Testing, rebuilt for AI.

There's a widening gap between how fast enterprises ship AI and how rigorously they test it. Untested systems put wrong answers in front of customers — and the cost shows up as refunds, regulatory exposure, lost trust, and brand damage that should have been caught before launch.

Sau5 exists to close that gap. The same engineering discipline that ships safe software every day, applied to the new failure modes AI brings with it.

Duncan Smith

Duncan Smith

Founder · Sau5

25+ years in software testing across large organisations in retail, transport and insurance. Founded Sau5 to bring the same engineering discipline to AI quality.

01

Measure, don't claim.

Every Sau5 finding ties to a defined metric, a defined threshold, and a defined test method. No subjective pass/fail. No vibes-based audits.

02

Hand the keys over.

The eval harness, the dataset, and the methodology runbook all leave the engagement with the client. Sau5 succeeds when the client can re-run the tests without us.

03

Only AI.

No general software QA. No side practices. The depth that produces good findings is the depth that comes from running the same kind of assessment, over and over again.

09 · INSIGHTS

Field notes from the eval bench.

First engagements are being scoped now.

Slots are limited. Join the waitlist to hear directly from Sau5 ahead of public availability.