Controlled Mayhem
Now shipping - Kodus Legal v0.4 RAG layer · Lab note - Memory routing in TaskHive · Open source - Hecate primitives v0.2 · San Jose, CR - UTC-6
LN/000 · Lab note - Process

Three eval harnesses we ship with everything

The smallest set of automated checks that catches the most regressions.

Process · Eval · Quality

Every project gets the same three evaluation harnesses before we call it production-ready:

  1. Retrieval recall checks against a fixed golden set.
  2. Output contract tests for shape, citations, and confidence signals.
  3. Latency and cost regression checks on representative traffic.
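A minimal sketch of the first two harnesses, assuming a retriever exposed as a function from a query to a ranked list of document ids, and a model output that is a plain dict. The names `GoldenCase`, `recall_at_k`, `mean_recall`, `REQUIRED_KEYS`, and `contract_violations` are hypothetical, as is the exact contract (an `answer` string, a non-empty `citations` list, a `confidence` in [0, 1]); swap in whatever shape your project actually promises.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    query: str
    relevant_ids: frozenset[str]  # documents a correct retriever must surface


def recall_at_k(retrieved_ids: list[str], relevant_ids: frozenset[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("golden case has no relevant documents")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall(retrieve, golden_set: list[GoldenCase], k: int) -> float:
    """Average recall@k of `retrieve` (query -> ranked doc ids) over the golden set."""
    scores = [recall_at_k(retrieve(c.query), c.relevant_ids, k) for c in golden_set]
    return sum(scores) / len(scores)


# Hypothetical output contract: adjust the keys and rules to your own schema.
REQUIRED_KEYS = {"answer", "citations", "confidence"}


def contract_violations(output: dict) -> list[str]:
    """Return readable contract violations; an empty list means the output passes."""
    missing = REQUIRED_KEYS - output.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    if not isinstance(output["answer"], str) or not output["answer"].strip():
        problems.append("answer must be a non-empty string")
    citations = output["citations"]
    if not isinstance(citations, list) or not citations:
        problems.append("citations must be a non-empty list")
    confidence = output["confidence"]
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number in [0, 1]")
    return problems
```

In CI, the recall check would typically assert that `mean_recall(...)` stays at or above a threshold recorded for the golden set, and the contract check would assert that `contract_violations(...)` is empty for every sampled output.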

This set is intentionally small. Bigger eval suites look impressive and then rot. Small suites run every time and catch what matters.
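The latency and cost check can stay just as small. The sketch below assumes each request in a representative traffic sample has been recorded as a latency in milliseconds and a cost in US dollars, once on the current baseline and once on the candidate build. `Sample`, `percentile`, `regressions`, the choice of p95 latency and mean cost, and the 10% tolerance are all illustrative assumptions, not values from this note.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    latency_ms: float
    cost_usd: float


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def regressions(
    baseline: list[Sample],
    candidate: list[Sample],
    tolerance: float = 0.10,  # assumed budget: allow up to 10% growth
) -> list[str]:
    """Report each metric that grew by more than `tolerance` over the baseline."""
    checks = {
        "p95 latency (ms)": (
            percentile([s.latency_ms for s in baseline], 95),
            percentile([s.latency_ms for s in candidate], 95),
        ),
        "mean cost (USD)": (
            sum(s.cost_usd for s in baseline) / len(baseline),
            sum(s.cost_usd for s in candidate) / len(candidate),
        ),
    }
    return [
        f"{name}: {old:.4g} -> {new:.4g}"
        for name, (old, new) in checks.items()
        if new > old * (1 + tolerance)
    ]
```

A build fails this check when `regressions(baseline, candidate)` returns anything; the returned strings name the metric and show the old and new values.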

- Suggested citation

Andrea Phillips. (March 14, 2026). Three eval harnesses we ship with everything. Controlled Mayhem - Lab Notes.

- About the author

Andrea Phillips

Senior engineer with deep experience building AI agent infrastructure — persistent memory, multi-agent orchestration, and MCP tooling. Designs and ships production-grade systems that make AI agents reliable, persistent, and genuinely useful. Fifteen years of full-stack and real-time engineering underpinning a focused practice in applied AI.
