Most enterprises are deploying AI with no way to test it.
A consultancy that tests AI —
only AI.
RAG Accuracy & Grounding Assessment
A 5-domain, 4-week structured assessment of any RAG system. Findings report, remediation roadmap, and a repeatable eval harness the client keeps and runs themselves.
4-Week Engagement
Training & Enablement
A practitioner curriculum to take QA engineers from zero AI testing experience to productive on RAG assessment engagements — covering all five domains of the Sau5 methodology. In development; not yet available.
Coming Soon
The 5-domain framework,
run end-to-end in 4 weeks.
Retrieval Quality
Is the correct content being surfaced — fast enough? The foundation. Every metric downstream depends on getting this right.
Answer Grounding
Faithfulness is scored via NLI + LLM-as-judge. Every claim must trace to a retrieved source.
Hallucination Detection
Three-stage pipeline: NLI → LLM-judge → atomic claim decomposition. Five hallucination types tracked.
Adversarial Robustness
Injection, jailbreak, boundary and PII canary tests. Run in an isolated environment with written client authorisation.
Eval Operations
Python harness, regression detector, golden dataset versioning, and CI/CD configs the client keeps and runs.
Traditional QA is trained for a different problem.
Three deliverables that
outlast the engagement.
SME-reviewed Q-A-context records, versioned and refreshed every 60 days or whenever your KB changes. Yours to extend and re-run forever.
At handover the Python runner, regression detector and CI/CD configs are yours. Zero ongoing dependency on Sau5 to keep testing.
CI/CD gates fire on every commit. Regressions are caught before deploy, not after. Testing becomes part of how you ship.
The harness is yours. Add a quarterly or half-yearly subscription and Sau5 keeps catching regressions, refreshing the dataset, and testing each release before it ships.
The questions buyers ask most.
How long is a Sau5 engagement?
The RAG Accuracy & Grounding Assessment is a fixed four-week engagement, kickoff to handover. Week 1 is scope and golden dataset construction. Weeks 2 and 3 run the four upstream domains in parallel. Week 4 packages the harness, wires the CI/CD, and delivers the findings report and bilingual readout. Fixed scope, fixed fee.
Can Sau5 run alongside our UAT — or does it replace it?
Neither. UAT for AI systems does not work the way UAT for traditional software works. Two testers running the same query get different model answers; three testers reach three different conclusions about whether the answer was right. UAT signal collapses.
The right shape: Sau5 runs first, producing quantitative scores against a defined golden dataset. UAT then runs on top — humans test usability, tone, edge-case judgment. UAT becomes the human layer on top of a system already proven to be factually correct. Without the Sau5 layer first, you are asking humans to certify quality on a system whose quality varies between runs. That is not a test. That is a hope.
How is AI testing different from AI observability, eval platforms, or guardrails?
Four separate vendor categories, often confused. AI Testing (Sau5) is a methodology applied at a defined point. AI Observability (Arize, WhyLabs, Helicone, Datadog) is runtime monitoring of production AI traffic. AI Eval Platforms (Braintrust, Galileo, LangSmith) are SaaS dashboards for managing test runs over time. AI Guardrails (Lakera, NeMo Guardrails, Patronus) are runtime filters that block bad outputs before they reach users.
The full breakdown is in the Buyer's Guide PDF above.
What does Sau5 not do — and who should we call instead?
Sau5 does AI testing only. We do not sell observability dashboards, run runtime guardrails, or operate as a SaaS eval platform. If you need runtime monitoring, talk to Arize, WhyLabs, Helicone, Langfuse, or Datadog LLM Observability. For eval platforms, talk to Braintrust, Galileo, LangSmith, Vellum, or Humanloop; our harness integrates with any of them. For guardrails, talk to Lakera, NeMo Guardrails, Guardrails AI, Patronus, or Aporia.
Where is our data stored during a Sau5 engagement?
Sau5 minimises data handling by design. Test execution runs on client infrastructure by default. The harness is deployed into the client's environment, and client data never leaves the client perimeter for normal testing.
Where Sau5 does hold client artefacts (off-site golden dataset construction, findings report preparation), they are encrypted at rest, accessible only to the engagement team (typically one to three engineers), isolated from any other Sau5 work, and access-logged.
The harness uses the client's existing LLM endpoints — OpenAI, Anthropic, Azure OpenAI, or self-hosted — under the client's existing vendor agreements. No new third-party LLM data flows are introduced.
What happens to our data after the engagement ends?
At handover the client takes ownership of the golden dataset, the runnable harness, the findings report, and any associated artefacts.
Sau5 retains only the methodology (not client-specific) and aggregated anonymised metrics. Sau5 deletes client knowledge base content, customer queries, test outputs containing client-identifiable information, attack-surface details from Domain 4, and credentials. Deletion is confirmed in writing within 14 days.
For Path B and Path C clients (managed service or hybrid retainer), Sau5 retains only the minimum needed to run the ongoing work between cadence runs.
What happens after the four weeks?
Three options. Path A: full handover, the client runs the methodology themselves, no ongoing relationship with Sau5. Path B: Sau5 managed service, where Sau5 runs the harness on the client's behalf on a quarterly or half-yearly cadence and surfaces regressions. Path C: hybrid, where the client owns and runs the harness and Sau5 stays on retainer for quarterly re-runs, methodology updates, and advisory. Most clients choose B or C because AI systems drift continuously and a one-off assessment ages out within three to six months.
How does Sau5 measure retrieval latency?
Latency is measured inside Domain 1, on the same calls that produce Recall, Precision, MRR and NDCG — at zero added client cost. Sau5 instruments the retriever to capture wall-clock time from query submission to ranked chunks returned, then reports p50, p95 and p99 across the full golden dataset plus per-query-type.
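As a rough illustration of measuring latency on the same calls that produce the ranking metrics, here is a minimal Python sketch. It is not the Sau5 harness: `timed_retrieve`, `reciprocal_rank`, `evaluate`, and the shape of the golden records (a query paired with a set of relevant chunk IDs) are all hypothetical names and assumptions made for the example, and only MRR is shown.

```python
import time
from typing import Callable, Iterable, Sequence, Set, Tuple


def timed_retrieve(retriever: Callable[[str], Sequence[str]], query: str):
    """Run one retrieval call and return (ranked_chunk_ids, elapsed_ms).

    Wall-clock time covers query submission to ranked chunks returned.
    """
    start = time.perf_counter()
    ranked = list(retriever(query))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return ranked, elapsed_ms


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant chunk, or 0.0 if none was returned."""
    for position, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


def evaluate(retriever, golden: Iterable[Tuple[str, Set[str]]]):
    """Score every golden query once, collecting latency from the same call.

    `golden` is an assumed format: (query, set of relevant chunk IDs).
    """
    records = list(golden)
    if not records:
        raise ValueError("golden dataset is empty")
    latencies_ms, reciprocal_ranks = [], []
    for query, relevant in records:
        ranked, elapsed_ms = timed_retrieve(retriever, query)
        latencies_ms.append(elapsed_ms)
        reciprocal_ranks.append(reciprocal_rank(ranked, relevant))
    return {
        "mrr": sum(reciprocal_ranks) / len(reciprocal_ranks),
        "latencies_ms": latencies_ms,
    }
```

Because the timer wraps the same call that feeds the metric, no extra retrieval requests are issued to collect latency.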
Default thresholds: p95 < 500ms and p99 < 1.2s for conversational deployments; p95 < 2s for batch use cases. These are negotiated per engagement against the client's SLA. Drift across runs is gated: more than 20% raises a warn, more than 50% raises a fail. Sau5 measures retrieval-layer latency at steady state, not end-to-end response time or load behaviour.
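The default thresholds and drift gates could be checked along these lines. This is a sketch under stated assumptions, not the Sau5 implementation: the nearest-rank percentile method, the function names, and the reading of "drift" as the relative increase of a percentile over the previous run's value are all assumptions, since the text does not specify them.

```python
import math

# Default limits in milliseconds, taken from the thresholds quoted above.
CONVERSATIONAL_LIMITS = {"p95": 500.0, "p99": 1200.0}
BATCH_LIMITS = {"p95": 2000.0}


def percentile(samples, pct):
    """Nearest-rank percentile (an assumed method) of a list of latencies."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def threshold_check(latencies_ms, limits):
    """Return 'pass' or 'fail' per percentile; a value must be below its limit."""
    results = {}
    for name, limit in limits.items():
        value = percentile(latencies_ms, float(name[1:]))  # "p95" -> 95.0
        results[name] = "pass" if value < limit else "fail"
    return results


def drift_gate(baseline_ms, current_ms):
    """Gate on relative increase versus the previous run (assumed baseline)."""
    change = (current_ms - baseline_ms) / baseline_ms
    if change > 0.50:
        return "fail"
    if change > 0.20:
        return "warn"
    return "pass"
```

For example, a p95 that moves from 400ms to 500ms between runs is a 25% increase and would raise a warn under this reading, even though 500ms is still at the edge of the conversational default.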
Does Sau5 test for permission and access-control leaks in retrieval?
Yes — access-control verification sits inside Domain 4. The common failure: vector search bypasses row-level security on the source system, and an in-scope question from a junior user returns a chunk from a document only senior users were meant to see. The chunk's underlying document had ACLs; the embedding in the vector store did not.
Sau5 seeds the corpus during scoping with documents tagged for specific user roles or groups, then issues identical queries as different user personas and confirms retrieval respects the documented boundaries. The pass bar is zero leaks. This maps directly to OWASP LLM02 (Sensitive Information Disclosure).
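The persona check described above can be sketched as a small Python routine. Everything here is illustrative: the `Chunk` type, its `allowed_roles` tag, and the `retrieve(query, role)` callable are hypothetical stand-ins for whatever the client's retriever and seeded corpus actually expose.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Chunk:
    """A retrieved chunk, tagged with the roles allowed to see its source."""
    chunk_id: str
    allowed_roles: FrozenSet[str]


def find_leaks(
    retrieve: Callable[[str, str], Iterable[Chunk]],
    queries: Iterable[str],
    personas: Iterable[str],
) -> List[Tuple[str, str, str]]:
    """Issue each query as each persona and list every out-of-bounds chunk.

    Returns (query, role, chunk_id) tuples; an empty list meets the
    zero-leak pass bar.
    """
    personas = list(personas)
    leaks = []
    for query in queries:
        for role in personas:
            for chunk in retrieve(query, role):
                if role not in chunk.allowed_roles:
                    leaks.append((query, role, chunk.chunk_id))
    return leaks
```

A run would fail the gate if `find_leaks(...)` returns anything at all, and the returned tuples point at the exact query, persona, and chunk to investigate.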
Can Sau5 test a RAG system built with LlamaIndex or LlamaCloud?
Yes, with one caveat. Open-source LlamaIndex is fully testable end-to-end because chunking, embeddings and retriever internals are all inspectable. LlamaCloud (managed indexing, embedding and retrieval behind their API) is testable as a black box for outcome measurement, but root-cause diagnostics — explaining why a query failed — often require visibility into the chunking and embedding layer, which managed services abstract away.
LlamaIndex's built-in evaluation module (Faithfulness, Answer Relevance, Context Relevance) is a useful in-loop tool for developers during build. It is not a substitute for an independent third-party assessment with SME-labelled golden datasets, two-run judge agreement, and slice-level drift detection.
AI Insights.
First engagements are being scoped now.
Slots are limited. Join the waitlist to hear from Sau5 first, before the next round opens.