Seven things that hold up on the eval bench.
1. Build the golden set before you build the model
A golden set is fifty to two hundred real questions with the right answers written down by a human who knows the domain. Boring. Effective. Most teams skip it because it isn't fun. Don't skip it. The team that writes the golden set first ships twice as fast as the team that writes it after the first user complaint.
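For a sense of how little machinery this takes, here is a minimal sketch of a golden set and a scorer. Everything in it is an assumption for illustration: the `GoldenExample` and `score_golden_set` names, the two sample questions, the stub system, and the substring match, which is about the crudest scoring rule that works.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenExample:
    question: str
    expected: str  # answer written down by a human who knows the domain


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise doesn't count as a miss."""
    return " ".join(text.lower().split())


def score_golden_set(
    examples: list[GoldenExample], answer_fn: Callable[[str], str]
) -> float:
    """Return the fraction of examples whose expected answer appears in the output."""
    if not examples:
        raise ValueError("golden set is empty")
    hits = sum(
        normalize(ex.expected) in normalize(answer_fn(ex.question))
        for ex in examples
    )
    return hits / len(examples)


# Hypothetical two-item golden set; a real one has fifty to two hundred items.
GOLDEN = [
    GoldenExample("How many days do customers have to return an item?", "30 days"),
    GoldenExample("Which currency are invoices issued in?", "EUR"),
]


def stub_system(question: str) -> str:
    """Stand-in for the real system under test (an assumption, not a real API)."""
    return "Customers have 30 days to return an item."


accuracy = score_golden_set(GOLDEN, stub_system)
print(f"golden-set accuracy: {accuracy:.2f}")
```

Swap `stub_system` for a call into your own pipeline and the same loop becomes a regression check you can run on every change.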
2. Test retrieval separately from generation
If the model gives a bad answer, you have two suspects: the retriever pulled the wrong documents, or the generator ignored the right ones. Different failure modes, different fixes. Test them apart. Recall@5 tells you whether the retriever surfaced the right documents. Groundedness tells you whether the generator stuck to them.
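The retrieval half is simple enough to sketch. The function below computes recall@k as the share of known-relevant documents that appear in the top k results. The document IDs and the `recall_at_k` name are made up for illustration; groundedness needs a separate check and is not covered here.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that show up in the top-k retrieved list."""
    if not relevant_ids:
        raise ValueError("need at least one relevant document to compute recall")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical retriever output, best match first, and hand-labelled relevant docs.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d3"]
relevant = {"d2", "d3"}

score = recall_at_k(retrieved, relevant, k=5)
print(f"recall@5 = {score:.2f}")  # d2 is in the top five, d3 is sixth
```

A low score here points at the retriever before you spend any time on prompts.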
3. Run it twice
Same prompt. Same model. Two runs. If the answers disagree on the facts, you have a nondeterminism problem worth knowing about before your customer finds it. Two API calls. One postmortem avoided.
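A rough sketch of the check, under loud assumptions: `generate` stands in for whatever function calls your model, and "the facts" are reduced to the set of numbers in each answer, which is a crude proxy that you would replace with something suited to your domain.

```python
import re
from itertools import cycle
from typing import Callable


def extract_facts(answer: str) -> set[str]:
    """Crude proxy for 'the facts': every number mentioned in the answer."""
    return set(re.findall(r"\d+(?:\.\d+)?", answer))


def runs_agree(prompt: str, generate: Callable[[str], str]) -> bool:
    """Call the model twice with the same prompt and compare the extracted facts."""
    first = generate(prompt)
    second = generate(prompt)
    return extract_facts(first) == extract_facts(second)


# Hypothetical stand-ins for a real model call, for illustration only.
_flaky_answers = cycle([
    "Refunds are accepted within 30 days.",
    "Refunds are accepted within 14 days.",
])


def flaky_generate(prompt: str) -> str:
    return next(_flaky_answers)


def stable_generate(prompt: str) -> str:
    return "Refunds are accepted within 30 days."


print(runs_agree("What is the refund window?", stable_generate))
print(runs_agree("What is the refund window?", flaky_generate))
```

Run it over the whole golden set and the prompts that fail are the ones to look at first.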
4. Use a model to judge a model (carefully)
LLM-as-judge gets a lot of stick, and most of the stick is deserved when people use it lazily. Used carefully — with a rubric, calibrated against human-graded examples, with a confidence threshold — it scales evaluation in a way humans can't. Treat the judge as a noisy sensor, not an oracle.
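One way to treat the judge as a noisy sensor is to measure it before trusting it. The sketch below assumes you already have judge verdicts with a self-reported confidence and matching human grades. It keeps only verdicts at or above a threshold, then reports how often they agree with the humans and how much of the data they cover. The `Judgement` type, the 0.8 threshold, and the sample values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    passed: bool       # the judge model's verdict
    confidence: float  # judge's self-reported confidence in [0, 1]


def calibrate(
    judgements: list[Judgement], human_labels: list[bool], threshold: float = 0.8
) -> tuple[float, float]:
    """Return (agreement with humans on confident verdicts, share of verdicts kept)."""
    if len(judgements) != len(human_labels):
        raise ValueError("need exactly one human label per judgement")
    if not judgements:
        raise ValueError("no judgements to calibrate against")
    kept = [
        (j, h) for j, h in zip(judgements, human_labels) if j.confidence >= threshold
    ]
    coverage = len(kept) / len(judgements)
    if not kept:
        return 0.0, coverage
    agreement = sum(j.passed == h for j, h in kept) / len(kept)
    return agreement, coverage


# Hypothetical verdicts and human grades, for illustration only.
judge_out = [
    Judgement(True, 0.90),
    Judgement(False, 0.95),
    Judgement(True, 0.60),   # below threshold: send to a human instead
    Judgement(True, 0.85),
]
humans = [True, False, False, False]

agreement, coverage = calibrate(judge_out, humans)
print(f"agreement={agreement:.2f} coverage={coverage:.2f}")
```

If agreement on the confident verdicts is poor, fix the rubric before scaling up. If coverage is low, the judge is mostly telling you which items still need a human.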
5. Watch for the boring failures first
Everyone wants to talk about jailbreaks and adversarial attacks. Glamorous. Worth your attention eventually. Not the failure mode costing you money this quarter. The failure mode costing you money is the model politely making up a refund policy. Catch the boring stuff first.
6. Version your prompts like you version your code
Your prompt is code. Your prompt-template-with-six-edits-from-three-people-in-a-Slack-thread is not code. It is a liability. Put it in git. Tag it. Diff it. When behaviour changes you'll be glad you can answer "what changed?" in thirty seconds instead of two days.
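Git covers the storing and diffing. A small companion trick, sketched here as an assumption rather than a prescribed method, is to stamp every eval result with a short content hash of the prompt that produced it, so a change in behaviour can be traced back to an exact prompt text. The function names and templates are hypothetical.

```python
import hashlib


def prompt_fingerprint(template: str) -> str:
    """Short, stable identifier derived from the exact prompt text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def record_result(template: str, question: str, answer: str) -> dict[str, str]:
    """Attach the prompt fingerprint to an eval result so it can be traced later."""
    return {
        "prompt_version": prompt_fingerprint(template),
        "question": question,
        "answer": answer,
    }


# Hypothetical prompt templates: the second has one extra instruction.
PROMPT_V1 = "Answer using only the provided context.\n\nQuestion: {question}"
PROMPT_V2 = (
    "Answer using only the provided context. "
    "If unsure, say so.\n\nQuestion: {question}"
)

row_a = record_result(PROMPT_V1, "What is the refund window?", "30 days")
row_b = record_result(PROMPT_V2, "What is the refund window?", "30 days")
print(row_a["prompt_version"], row_b["prompt_version"])
```

Two results with different fingerprints came from different prompts, which answers "what changed?" before anyone opens the Slack thread.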
7. Have someone you trust try to break it before you ship
I'll keep this short on detail for reasons that will become obvious if you think about why. Find a person who is good at finding edges. Let them poke. Fix what they find. Repeat. The cost of doing this internally is small. The cost of skipping it is large and public.