LN/000 · Lab note - Process

Three eval harnesses we ship with everything

The smallest set of automated checks that catches the most regressions.

Process · Eval · Quality

Andrea Phillips · AI Engineer & Founder, Controlled Mayhem

Published: 03/14/2026

Read time: 5 min

Version: v1.0

Last revised: 03/14/2026

Every project gets the same three evaluation harnesses before we call it production-ready:

Retrieval recall checks against a fixed golden set.
Output contract tests for shape, citations, and confidence signals.
Latency and cost regression checks on representative traffic.
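
A minimal sketch of the first check, recall against a fixed golden set, might look like the following. The golden set contents, the document IDs, the `fake_retrieve` stand-in, and the 0.7 threshold are all hypothetical, chosen only for illustration; a real harness would call the project's actual retriever and use a threshold agreed for that project.

```python
from typing import Callable, Dict, List, Set

# Hypothetical golden set: query -> IDs of documents that should be retrieved.
GOLDEN_SET: Dict[str, Set[str]] = {
    "how do I reset my password": {"doc-12"},
    "refund policy for annual plans": {"doc-07", "doc-31"},
}

MIN_RECALL = 0.7  # assumed threshold, not taken from the note


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 1.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall(
    retrieve: Callable[[str], List[str]],
    golden: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Average recall@k over every query in the golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in golden.items()]
    return sum(scores) / len(scores)


def fake_retrieve(query: str) -> List[str]:
    """Stand-in retriever with canned results, for illustration only."""
    canned = {
        "how do I reset my password": ["doc-12", "doc-03"],
        "refund policy for annual plans": ["doc-07", "doc-02"],
    }
    return canned.get(query, [])


score = mean_recall(fake_retrieve, GOLDEN_SET, k=5)
passed = score >= MIN_RECALL
print(f"mean recall@5 = {score:.2f}, passed = {passed}")
```

Because the golden set is fixed, a drop in this number between two runs points at the retriever or the index rather than at the test data.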

This set is intentionally small. Bigger eval suites look impressive and then rot. Small suites run every time and catch what matters.
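
As a rough illustration of how small the output contract and latency/cost checks can be, here is a sketch under stated assumptions. The field names `answer`, `citations`, and `confidence`, the metric names, and the 10% tolerance are hypothetical; the note does not specify the actual contract or budgets.

```python
import math
from typing import Any, Dict, List


def contract_violations(output: Dict[str, Any]) -> List[str]:
    """Return a list of contract problems; an empty list means the output passes."""
    problems: List[str] = []
    answer = output.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        problems.append("answer must be a non-empty string")
    citations = output.get("citations")
    if not isinstance(citations, list) or not citations:
        problems.append("citations must be a non-empty list")
    elif not all(isinstance(c, str) and c for c in citations):
        problems.append("every citation must be a non-empty string")
    confidence = output.get("confidence")
    if (
        isinstance(confidence, bool)
        or not isinstance(confidence, (int, float))
        or not 0.0 <= confidence <= 1.0
    ):
        problems.append("confidence must be a number in [0, 1]")
    return problems


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.10,  # assumed: allow 10% drift before failing
) -> List[str]:
    """Names of metrics that exceed their baseline by more than the tolerance."""
    return [
        name
        for name, base in baseline.items()
        if current[name] > base * (1 + tolerance)
    ]


# Hypothetical numbers from a replay of representative traffic.
latencies_ms = [420, 510, 530, 610, 640, 700, 720, 760, 810, 1200]
baseline = {"p95_latency_ms": 800.0, "cost_per_request_usd": 0.0040}
current = {"p95_latency_ms": p95(latencies_ms), "cost_per_request_usd": 0.0041}

good_output = {"answer": "Use the reset link.", "citations": ["doc-12"], "confidence": 0.82}
print(contract_violations(good_output))
print(regressions(baseline, current))
```

Both checks return a list of failures rather than raising on the first one, so a single run reports everything that drifted.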

- Suggested citation

Andrea Phillips. (March 14, 2026). Three eval harnesses we ship with everything. Controlled Mayhem - Lab Notes.

- About the author

Andrea Phillips

Senior engineer with deep experience building AI agent infrastructure — persistent memory, multi-agent orchestration, and MCP tooling. Designs and ships production-grade systems that make AI agents reliable, persistent, and genuinely useful. Fifteen years of full-stack and real-time engineering underpinning a focused practice in applied AI.
