Controlled Mayhem
Now shipping - Kodus Legal v0.4 RAG layer Lab note - Memory routing in TaskHive Open source - Hecate primitives v0.2 San Jose, CR - UTC-6
LN/013 - Lab note - Research

Evermind: what happened when we tested associative memory at scale

We published a paper testing whether spreading activation over a memory graph improves retrieval for LLMs. At production scale, the honest answer is: barely, and only if you tune it carefully.

Memory · Retrieval · Research · Evermind

TL;DR: We published a paper testing a cognitively-inspired memory architecture — Evermind — at production scale. The honest result: spreading activation over an embedding-derived memory graph has a small, possibly-positive ceiling that we have now characterized empirically. It is not the retrieval improvement we hoped for. We publish the null result anyway, because a clean answer at scale is worth more than a hopeful one.


Human memory does not wait for a query. Hearing "doctor" surfaces "nurse" and "hospital" without any conscious recall effort. Every memory system we ship for language models works the other way: it waits for an explicit trigger — a user query, an LLM decision, a task prediction — before it retrieves anything.

Our paper, "Evermind: Context-Triggered Spreading Activation Memory for Large Language Models" (Daniel Phillips, Controlled Mayhem, May 2026), asks whether closing that gap actually helps.

The question the paper asks

Evermind is a memory architecture that combines two ideas: context-triggered retrieval, which surfaces relevant memories before the model generates rather than after it asks, and spreading activation over a weighted memory graph, so that a strongly-matching memory makes its neighbours slightly more likely to surface too.
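To make the second idea concrete, here is a minimal one-hop sketch in Python. It is not the paper's implementation. In particular, the update rule is an assumption: we treat γ as the share of a memory's final score that comes from its own direct match, with the remaining (1 − γ) coming from its neighbours, so that γ = 1.0 switches spreading off. The names `spread`, `top_k`, and the toy graph are hypothetical.

```python
from typing import Dict, List, Tuple

# node -> [(neighbour, edge weight)]; edges are listed in both directions
Graph = Dict[str, List[Tuple[str, float]]]


def spread(direct: Dict[str, float], graph: Graph, gamma: float) -> Dict[str, float]:
    """One-hop spreading activation over a weighted memory graph.

    ASSUMPTION: final = gamma * own score + (1 - gamma) * weighted average of
    the neighbours' direct scores. The paper's exact rule may differ.
    """
    final: Dict[str, float] = {}
    for node, score in direct.items():
        edges = graph.get(node, [])
        total_weight = sum(w for _, w in edges)
        if total_weight == 0:
            final[node] = score  # isolated node: nothing to receive
            continue
        neighbour_avg = sum(w * direct.get(n, 0.0) for n, w in edges) / total_weight
        final[node] = gamma * score + (1.0 - gamma) * neighbour_avg
    return final


def top_k(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Highest-scoring k memories; ties broken alphabetically."""
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


# Toy data: "doctor" matches the context strongly, "invoice" is unrelated.
direct = {"doctor": 0.9, "nurse": 0.2, "hospital": 0.3, "invoice": 0.35}
graph: Graph = {
    "doctor": [("nurse", 0.8), ("hospital", 0.7)],
    "nurse": [("doctor", 0.8)],
    "hospital": [("doctor", 0.7)],
}

baseline = top_k(spread(direct, graph, gamma=1.0))    # spreading off
aggressive = top_k(spread(direct, graph, gamma=0.70))  # "invoice" drops out of the top 3
minimal = top_k(spread(direct, graph, gamma=0.95))     # same top 3 as the baseline
```

In this toy case, aggressive spreading pulls "nurse" and "hospital" above the unrelated "invoice", while minimal spreading leaves the top 3 untouched. Whether that reshuffling helps or hurts on a real corpus is exactly what the paper measures.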

The question is empirical, not architectural. Spreading activation is a natural candidate for improving conceptual retrieval — but how much does it actually help, on a real corpus, at real scale? If the answer is "meaningfully," production retrieval systems should adopt it. If the answer is "negligibly," they should not.

This work was motivated by a concrete operational need. The author operates Kodus, a Spanish-language legal-intelligence platform indexing over four million chunks of case law across five Costa Rican and Guatemalan legal corpora. Conceptual semantic queries there work only passably: users report missed relevant documents when their phrasing diverges from the source text. Spreading activation looked like a fix worth testing properly.

What we found

We tested the spreading hypothesis at scale: 1,000 retrieval scenarios over a 100,000-chunk subset of the Kodus corpus, with bootstrap 95% confidence intervals over per-scenario F1. To our knowledge this is the largest single-corpus evaluation of spreading activation on contemporary LLM-augmented retrieval.
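For readers who have not run this kind of analysis, the sketch below shows one common way to get such an interval: compute F1 on the top-k set for each scenario, take the per-scenario difference against the baseline, and resample those differences with replacement. It is a percentile bootstrap written with the standard library only. The function names, the resample count, and the toy deltas are assumptions for illustration, not the paper's code or data.

```python
import random
from typing import List, Sequence, Set, Tuple


def f1_at_k(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """F1 of a retrieved top-k list against the set of relevant documents."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def bootstrap_ci(
    deltas: List[float],
    n_resamples: int = 2000,  # hypothetical setting
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean per-scenario delta with a percentile-bootstrap (1 - alpha) interval."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample scenarios with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, low, high


# Toy per-scenario deltas (F1 with spreading minus F1 without); NOT paper data.
toy_deltas = [0.0] * 90 + [0.4] * 6 + [-0.4] * 4
mean_delta, ci_low, ci_high = bootstrap_ci(toy_deltas)
```

If the resulting interval excludes zero, the change is distinguishable from noise at that level; if it straddles zero, the direction is not established.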

The results are clean, and they are mostly negative:

  1. Aggressive spreading significantly degrades retrieval. At a decay setting of γ = 0.70, ΔF1 = −0.017 (95% CI [−0.027, −0.007]). Systems that switch spreading on naïvely should expect their retrieval to get worse.
  2. Minimal spreading is borderline beneficial. At γ = 0.95, ΔF1 = +0.006 (95% CI [−0.0004, +0.0132]). The point estimate is positive, but the interval narrowly includes zero, so the effect is suggestive rather than established, and it is small enough that the practical case is weak.
  3. Entity-grounded queries are immune. For all 300 queries referencing specific laws, articles, or IDs, no value of γ changes the top-3 retrieval set at all. Spreading activation is categorically inert for "find-all" lookups.
  4. The ceiling is robust. A graph-topology ablation — mutual k-NN versus open k-NN, eliminating a 30% isolated-node rate — does not break the +0.006 ceiling. The bottleneck is similarity geometry, not graph density.
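The topology ablation in item 4 is easier to picture with a small sketch. We assume here that "mutual k-NN" keeps an edge only when each node lists the other among its k nearest neighbours, and that "open k-NN" keeps it when either does. That reading is our assumption for illustration, and the helper names and the toy similarity matrix are hypothetical.

```python
from typing import FrozenSet, List, Set

Edge = FrozenSet[int]


def knn(sim: List[List[float]], k: int) -> List[Set[int]]:
    """For each node, the indices of its k most similar other nodes."""
    neighbours = []
    for i, row in enumerate(sim):
        ranked = sorted((j for j in range(len(row)) if j != i), key=lambda j: (-row[j], j))
        neighbours.append(set(ranked[:k]))
    return neighbours


def open_knn_edges(nn: List[Set[int]]) -> Set[Edge]:
    """Edge if EITHER endpoint lists the other (assumed meaning of 'open')."""
    return {frozenset((i, j)) for i, ns in enumerate(nn) for j in ns}


def mutual_knn_edges(nn: List[Set[int]]) -> Set[Edge]:
    """Edge only if BOTH endpoints list each other."""
    return {frozenset((i, j)) for i, ns in enumerate(nn) for j in ns if i in nn[j]}


def isolated_nodes(n: int, edges: Set[Edge]) -> List[int]:
    """Nodes with no edges, which can neither send nor receive activation."""
    connected = {v for edge in edges for v in edge}
    return [i for i in range(n) if i not in connected]


# Toy cosine-similarity matrix: nodes 2 and 3 both point at node 0,
# but node 0's single nearest neighbour is node 1.
sim = [
    [1.0, 0.9, 0.6, 0.5],
    [0.9, 1.0, 0.2, 0.1],
    [0.6, 0.2, 1.0, 0.3],
    [0.5, 0.1, 0.3, 1.0],
]
nn = knn(sim, k=1)
mutual_isolated = isolated_nodes(len(sim), mutual_knn_edges(nn))  # nodes 2 and 3
open_isolated = isolated_nodes(len(sim), open_knn_edges(nn))      # none
```

Under this reading, the open variant removes isolated nodes by construction, since every node keeps the edges to its own nearest neighbours. The paper's finding is that doing so does not raise the ceiling.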

Per-scenario, 5.1% of semantic queries improve and 3.5% degrade. The mechanism works exactly as designed. It just does not move the needle far enough to justify building, maintaining, and serving a k-NN memory graph in production.

Why we published a null result

The honest scientific contribution of this paper is the negative result with strong bounds. Spreading activation on embedding-derived memory graphs has a small, possibly-positive ceiling that nobody had characterized at this scale before.

That is worth publishing. A vague "it might help" sends production teams down a months-long engineering path. A clean, statistically powered "+0.006, here are the confidence intervals" tells them not to. The paper is deliberately readable as either a cautiously positive result (minimal spreading is safe and marginally beneficial on semantic queries) or a negative one (the benefit is too small to deploy) — both readings are defensible from the data, and the paper presents them honestly rather than picking the flattering one.

What we recommend instead

For Kodus and similar production legal-intelligence systems, the paper makes four recommendations:

  • Do not adopt spreading activation as the next improvement. The change is below the noise floor of user-perceived quality, and exactly zero on the entity-grounded queries that dominate production traffic.
  • Invest in hybrid graph signals first. Entity co-occurrence and citation links are signals an embedding-derived k-NN graph cannot see. The corpus already supports them.
  • Invest in query understanding and reranking. The dominant failure mode is missed relevant documents, not mis-ordered ones — query rewriting and cross-encoder reranking address recall more directly.
  • If you do adopt spreading, set γ = 0.95 and gate it. Use it only when baseline confidence is low. Never run aggressive decay; the degradation at γ = 0.70 is real.
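The last recommendation can be sketched as a small gate in Python. The confidence proxy (the best baseline similarity score) and the threshold value are assumptions chosen for illustration; `spread_fn` stands in for whatever spreading routine you already have and is supplied by the caller.

```python
from typing import Callable, Dict

Scores = Dict[str, float]

GAMMA = 0.95                 # the minimal-spreading setting from the results
CONFIDENCE_THRESHOLD = 0.60  # hypothetical; tune on your own traffic


def gated_scores(
    direct: Scores,
    spread_fn: Callable[[Scores, float], Scores],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Scores:
    """Apply spreading only when the baseline looks unsure of itself.

    ASSUMPTION: the highest direct-match score is used as the confidence
    signal. Any other signal (score margin, reranker output) would fit here.
    """
    if not direct:
        return direct
    if max(direct.values()) >= threshold:
        return direct  # confident baseline: leave the ranking alone
    return spread_fn(direct, GAMMA)


# Stand-in spreading routine that records whether it was called.
calls = []


def fake_spread(scores: Scores, gamma: float) -> Scores:
    calls.append(gamma)
    return dict(scores)


confident = gated_scores({"art_12": 0.91, "art_40": 0.30}, fake_spread)  # gate stays closed
unsure = gated_scores({"case_a": 0.41, "case_b": 0.39}, fake_spread)     # gate opens
```

The gate keeps spreading away from queries the baseline already answers confidently and confines the small possible benefit to low-confidence queries.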

How to read it

The paper runs 15 pages. If you have five minutes, read the abstract and Section 7 (Conclusion). If you have twenty, Section 4 covers the architecture and Section 6 reports the at-scale benchmark, the per-scenario breakdown, and the recommendations. Section 6.7 lists the limitations plainly — one corpus, one language, one embedding model, top-3 metrics only.

Read the full paper →

The takeaway

Cognitively-inspired memory mechanisms are appealing because the cognitive science is appealing. But faithful replication of a brain mechanism is not the same as an engineering win. Spreading activation works as designed on a 100,000-chunk legal corpus — and still does not earn its keep against a strong baseline.

The useful direction is not more iteration on graph construction within the k-NN family. It is hybrid signals the embedding graph cannot derive on its own. We would rather publish that clearly than ship a memory feature that looks principled and changes nothing.

If you are building retrieval or memory infrastructure and weighing associative recall against simpler wins, we would be interested in comparing notes.

- Suggested citation

Phillips, D. (2026, May 21). Evermind: what happened when we tested associative memory at scale. Controlled Mayhem - Lab Notes, LN/013.

- About the author

Daniel Phillips

Product architect and software design strategist with 10+ years leading AI-driven digital products from concept to delivery. Bridges user needs, technical architecture, and business goals to build scalable, high-impact systems.
