TL;DR: We published a paper testing a cognitively-inspired memory architecture — Evermind — at production scale. The honest result: spreading activation over an embedding-derived memory graph has a small, possibly-positive ceiling that we have now characterized empirically. It is not the retrieval improvement we hoped for. We publish the null result anyway, because a clean answer at scale is worth more than a hopeful one.
Human memory does not wait for a query. Hearing "doctor" surfaces "nurse" and "hospital" without any conscious recall effort. Every memory system we ship for language models works the other way: it waits for an explicit trigger — a user query, an LLM decision, a task prediction — before it retrieves anything.
"Evermind: Context-Triggered Spreading Activation Memory for Large Language Models" (Daniel Phillips, Controlled Mayhem, May 2026) asks whether closing that gap actually helps.
The question the paper asks
Evermind is a memory architecture that combines two ideas: context-triggered retrieval, which surfaces relevant memories before the model generates rather than after it asks, and spreading activation over a weighted memory graph, so that a strongly-matching memory makes its neighbours slightly more likely to surface too.
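To make the second idea concrete, here is a minimal sketch of a single spreading step. It is an illustration under stated assumptions, not the paper's implementation: the names `spread_once` and `top_k`, the dict-based graph, the toy scores, and the blending rule are all hypothetical. The rule assumed here lets a node keep a share γ of its own direct score and take the remaining 1 − γ from the weighted mean of its neighbours' scores. It was chosen only because it is consistent with how this post uses γ, where 0.95 is minimal spreading and 0.70 is aggressive.

```python
def spread_once(direct, graph, gamma):
    """One hypothetical spreading step over a weighted memory graph.

    direct: dict node -> direct match score against the current context
    graph:  dict node -> dict neighbour -> edge weight (> 0)
    gamma:  share of its own score a node keeps (assumed semantics)
    """
    activated = {}
    for node, own in direct.items():
        neighbours = graph.get(node, {})
        total_weight = sum(neighbours.values())
        if total_weight == 0:
            # Isolated node: nothing flows in, so the direct score is kept.
            activated[node] = own
            continue
        # Weighted mean of the neighbours' direct scores.
        incoming = sum(
            w * direct.get(n, 0.0) for n, w in neighbours.items()
        ) / total_weight
        activated[node] = gamma * own + (1.0 - gamma) * incoming
    return activated


def top_k(scores, k=3):
    """Return the k highest-scoring node ids, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (invented for illustration): "a" matches strongly, "b" is its
# weakly matching neighbour, "c" and "d" are isolated.
direct = {"a": 0.9, "b": 0.2, "c": 0.4, "d": 0.3}
graph = {"a": {"b": 1.0}, "b": {"a": 1.0}, "c": {}, "d": {}}

print(top_k(direct, 2))                            # ['a', 'c']
print(top_k(spread_once(direct, graph, 0.95), 2))  # ['a', 'c']
print(top_k(spread_once(direct, graph, 0.70), 2))  # ['a', 'b']
```

In this toy graph, γ = 0.95 leaves the top-2 set unchanged, while γ = 0.70 lets `b`, the neighbour of the strong match, displace `c`. Whether such a displacement helps or hurts depends on whether `b` is actually relevant, which is the empirical question.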
The question is empirical, not architectural. Spreading activation is a natural candidate for improving conceptual retrieval — but how much does it actually help, on a real corpus, at real scale? If the answer is "meaningfully," production retrieval systems should adopt it. If the answer is "negligibly," they should not.
This work was motivated by a concrete operational need. The author operates Kodus, a Spanish-language legal-intelligence platform indexing over four million chunks of case law across five Costa Rican and Guatemalan legal corpora. Conceptual semantic queries there work only passably: users report missed relevant documents when their phrasing diverges from the source text. Spreading activation looked like a fix worth testing properly.
What we found
We tested the spreading hypothesis at scale: 1,000 retrieval scenarios over a 100,000-chunk subset of the Kodus corpus, with bootstrap 95% confidence intervals over per-scenario F1. To our knowledge this is the largest single-corpus evaluation of spreading activation on contemporary LLM-augmented retrieval.
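For readers unfamiliar with the interval estimate, a percentile bootstrap over per-scenario F1 differences can be sketched as follows. This is a generic illustration, not the paper's evaluation code: the function name `bootstrap_ci`, the number of resamples, the choice of the percentile method, and the synthetic deltas are all assumptions.

```python
import random
import statistics


def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-scenario F1 deltas.

    deltas: one value per scenario, F1(with spreading) - F1(baseline)
    Returns (mean, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(deltas)
    # Resample scenarios with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n))
        for _ in range(n_resamples)
    )
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(deltas), lower, upper


# Synthetic deltas, invented purely to exercise the function.
data_rng = random.Random(1)
example_deltas = [data_rng.gauss(0.0, 0.1) for _ in range(1000)]
mean, lower, upper = bootstrap_ci(example_deltas)
print(f"mean dF1 = {mean:+.4f}, 95% CI [{lower:+.4f}, {upper:+.4f}]")
```

An interval that excludes zero supports a real effect in that direction; an interval that straddles zero does not, however small the overlap.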
The results are clean, and they are mostly negative:
- Aggressive spreading significantly degrades retrieval. At a decay setting of γ = 0.70, ΔF1 = −0.017 (95% CI [−0.027, −0.007]). Systems that switch spreading on naïvely should expect their retrieval to get worse.
- Minimal spreading is borderline beneficial. At γ = 0.95, ΔF1 = +0.006 (95% CI [−0.0004, +0.0132]). The point estimate is positive, but the interval narrowly includes zero, and the effect is small enough that the practical case is weak.
- Entity-grounded queries are immune. For all 300 queries referencing specific laws, articles, or IDs, no value of γ changes the top-3 retrieval set at all. Spreading activation is categorically inert for "find-all" lookups.
- The ceiling is robust. A graph-topology ablation (mutual k-NN versus open k-NN, the latter eliminating a 30% isolated-node rate) does not break the +0.006 ceiling. The bottleneck is similarity geometry, not graph density.
Per-scenario, 5.1% of semantic queries improve and 3.5% degrade. The mechanism works exactly as designed. It just does not move the needle far enough to justify building, maintaining, and serving a k-NN memory graph in production.
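The graph-topology ablation in the list above contrasts two ways of building a k-NN graph. The sketch below is a rough illustration, not the paper's construction. It assumes that "open" means an edge is kept when either endpoint lists the other among its k nearest neighbours, while "mutual" requires both to do so. The function names, the use of cosine similarity, and the toy vectors are likewise assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_lists(vectors, k):
    """For each node, the set of its k most similar other nodes."""
    nearest = {}
    for i, u in vectors.items():
        ranked = sorted(
            (j for j in vectors if j != i),
            key=lambda j: cosine(u, vectors[j]),
            reverse=True,
        )
        nearest[i] = set(ranked[:k])
    return nearest


def build_graph(vectors, k, mutual):
    """Undirected edge set; mutual=True keeps only reciprocated links."""
    nearest = knn_lists(vectors, k)
    edges = set()
    for i, neighbours in nearest.items():
        for j in neighbours:
            if not mutual or i in nearest[j]:
                edges.add(frozenset((i, j)))
    return edges


def isolated_nodes(vectors, edges):
    """Nodes that appear in no edge."""
    connected = set().union(*edges) if edges else set()
    return set(vectors) - connected


# Toy vectors (invented): "a" and "b" are near-duplicates, "c" is an outlier
# whose nearest neighbour is "b", but "b" prefers "a".
vectors = {"a": (1.0, 0.0), "b": (0.99, 0.1), "c": (0.7, 0.7)}

mutual_edges = build_graph(vectors, k=1, mutual=True)
open_edges = build_graph(vectors, k=1, mutual=False)
print(isolated_nodes(vectors, mutual_edges))  # {'c'}
print(isolated_nodes(vectors, open_edges))    # set()
```

The mutual graph leaves the outlier `c` with no edges, so no activation can reach it. The open graph connects it, but only to a node it already resembles, which illustrates why removing isolated nodes does not by itself add new information.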
Why we published a null result
The honest scientific contribution of this paper is the negative result with strong bounds. Spreading activation on embedding-derived memory graphs has a small, possibly-positive ceiling that nobody had characterized at this scale before.
That is worth publishing. A vague "it might help" sends production teams down a months-long engineering path. A clean, statistically powered "+0.006, here are the confidence intervals" tells them not to. The paper is deliberately readable as either a cautiously positive result (minimal spreading is safe and marginally beneficial on semantic queries) or a negative one (the benefit is too small to deploy) — both readings are defensible from the data, and the paper presents them honestly rather than picking the flattering one.
What we recommend instead
For Kodus and similar production legal-intelligence systems, the paper makes four recommendations:
- Do not adopt spreading activation as the next improvement. The change is below the noise floor of user-perceived quality, and exactly zero on the entity-grounded queries that dominate production traffic.
- Invest in hybrid graph signals first. Entity co-occurrence and citation links are signals an embedding-derived k-NN graph cannot see. The corpus already supports them.
- Invest in query understanding and reranking. The dominant failure mode is missed relevant documents, not mis-ordered ones — query rewriting and cross-encoder reranking address recall more directly.
- If you do adopt spreading, set γ = 0.95 and gate it. Use it only when baseline confidence is low. Never run aggressive decay; the degradation at γ = 0.70 is real.
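The last recommendation can be sketched as a simple gate. Everything here is hypothetical. The post does not say how baseline confidence is measured, so this sketch assumes the top baseline score stands in for confidence, picks an arbitrary threshold of 0.5, and takes the spreading step as a caller-supplied function `spread_fn(scores, gamma)`.

```python
def gated_scores(baseline, spread_fn, threshold=0.5, gamma=0.95):
    """Apply spreading only when the baseline looks unconfident.

    baseline:  dict node -> baseline retrieval score
    spread_fn: callable (scores, gamma) -> rescored dict (hypothetical)
    threshold: assumed cut-off on the top baseline score
    """
    if not baseline:
        return {}
    confidence = max(baseline.values())
    if confidence >= threshold:
        # Confident baseline: leave the ranking untouched.
        return dict(baseline)
    return spread_fn(baseline, gamma)


def placeholder_spread(scores, gamma):
    """Stand-in for a real spreading step; it only rescales the scores."""
    return {node: gamma * score for node, score in scores.items()}


confident = {"doc1": 0.8, "doc2": 0.3}
unconfident = {"doc1": 0.4, "doc2": 0.3}
print(gated_scores(confident, placeholder_spread) == confident)      # True
print(gated_scores(unconfident, placeholder_spread) == unconfident)  # False
```

A gate like this limits any downside of spreading to the queries where the baseline is already weak, at the cost of one extra tunable threshold.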
How to read it
The paper runs 15 pages. If you have five minutes, read the abstract and Section 7 (Conclusion). If you have twenty, Section 4 covers the architecture and Section 6 reports the at-scale benchmark, the per-scenario breakdown, and the recommendations. Section 6.7 lists the limitations plainly — one corpus, one language, one embedding model, top-3 metrics only.
The takeaway
Cognitively-inspired memory mechanisms are appealing because the cognitive science is appealing. But faithful replication of a brain mechanism is not the same as an engineering win. Spreading activation works as designed on a 100,000-chunk legal corpus — and still does not earn its keep against a strong baseline.
The useful direction is not more iteration on graph construction within the k-NN family. It is hybrid signals the embedding graph cannot derive on its own. We would rather publish that clearly than ship a memory feature that looks principled and changes nothing.
If you are building retrieval or memory infrastructure and weighing associative recall against simpler wins, we would be interested in comparing notes.