Every project gets the same three evaluation harnesses before we call it production-ready:
- Retrieval recall checks against a fixed golden set.
- Output contract tests for shape, citations, and confidence signals.
- Latency and cost regression checks on representative traffic.
This set is intentionally small. Bigger eval suites look impressive and then rot. Small suites run every time and catch what matters.