reverie
Coding agents forget. Or worse, they remember the wrong thing — load the stale rule, search the store that doesn’t hold the answer, save the same fact for the fourth time. Reverie is the system that decides what an agent should remember, where to put it, and when to let it go.
The bet
The dominant story about why AI coding agents fail is that the models aren’t smart enough yet. We think the dominant story is wrong. In session after session, watching real harnesses do real work, the failures cluster somewhere else: the wrong context is in the window. The directive that would have prevented the bug was filed in a store nobody searched. The reference the agent needed sat three layers down behind a query nobody knew to ask. The “memory” was technically present and operationally invisible.
Reverie is a research bet on a different framing. Memory for agents is not a database problem and not a model problem — it’s a context management and injection problem. What gets loaded before the prompt, what gets evicted between turns, what gets consolidated overnight, what gets quietly thrown away. Operating systems solved the analogous problem in the 1960s. Brains solved it on a longer timescale. Neither of them used a single flat store.
Why it exists
Every modern coding harness — Claude Code, Cursor, Windsurf — now persists knowledge across sessions. None of them ship with a theory of where a given piece of knowledge should live. The default behaviour is “dump everything into one store,” and that default produces five recurring failure modes:
- Missed retrievals. Behavioural directives saved to a search-only store never load unless something actively searches for them. The harness silently degrades — it still works, it just stops following the rule.
- Wasted searches. Curated heuristic collections meant to be loaded in bulk get stored as flat rows and pulled back one at a time, paying a round-trip per row.
- Polluted results. A single store doing four jobs — rule store, preference store, reference library, decision log — returns philosophy mixed with architecture mixed with trivia on every search.
- Duplicate staleness. The same fact saved in three places. Updates touch one copy. The other two drift, and the agent eventually quotes the wrong one with confidence.
- Write-then-delete churn. Without placement discipline, agents save aggressively and prune afterward. A pre-reverie audit of one real harness found 62% of writes were eventually tombstoned.
None of this is a model problem. It’s a placement problem — the same one operating systems solved with a cache hierarchy and the same one brains solved with hippocampal consolidation. Reverie applies both lessons to the surface area of an AI coding harness.
What reverie does
Placement-aware memory
Knowledge gets routed across a five-tier hierarchy by the kind of thing it is, not the verb the agent happened to use. Always-loaded directives sit at the top; preferences and project decisions sit in fast indexed stores; browsable reference belongs in a vault; the canonical source of truth stays in the repo. A daemon decides placement at write time, not after the fact, and an offline consolidation cycle — modeled on hippocampal sleep — periodically replays new observations, deduplicates them, synthesizes higher-order summaries, and promotes what proved useful to a more durable layer. The same cycle quietly downscales what didn’t.
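As a rough illustration of routing by kind rather than by verb, the idea might be sketched as below. Everything named here (`Kind`, `Tier`, `PLACEMENT`, `route`) is hypothetical and not taken from reverie's code, and the sketch covers only the placements the paragraph above names, not the full hierarchy.

```python
from enum import Enum


class Kind(Enum):
    # Hypothetical labels for the kinds of knowledge the text mentions.
    DIRECTIVE = "directive"
    PREFERENCE = "preference"
    PROJECT_DECISION = "project_decision"
    REFERENCE = "reference"
    CANONICAL_SOURCE = "canonical_source"


class Tier(Enum):
    # Hypothetical labels for the storage locations the text mentions.
    ALWAYS_LOADED = "always_loaded"
    INDEXED_STORE = "indexed_store"
    VAULT = "vault"
    REPO = "repo"


# Placement is a function of what the item is.
PLACEMENT = {
    Kind.DIRECTIVE: Tier.ALWAYS_LOADED,
    Kind.PREFERENCE: Tier.INDEXED_STORE,
    Kind.PROJECT_DECISION: Tier.INDEXED_STORE,
    Kind.REFERENCE: Tier.VAULT,
    Kind.CANONICAL_SOURCE: Tier.REPO,
}


def route(kind: Kind, verb: str = "save") -> Tier:
    """Pick a tier at write time. The verb is accepted but deliberately ignored."""
    return PLACEMENT[kind]
```

In this sketch, `route(Kind.DIRECTIVE, verb="note")` and `route(Kind.DIRECTIVE, verb="save")` land in the same tier, which is the point: the agent's choice of verb does not decide where the knowledge lives.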
A derivability gate before every write
Before anything is stored, the gate asks four questions: will I need this in a future session, is it already stored elsewhere, which layer matches its access pattern, and is it a fact or a directive. If the answer to the second question is yes, the write is rejected — there’s no point duplicating what’s already in the repo or the vault. This single pre-write filter is the largest behavioural difference between reverie and every other LLM memory system we surveyed; the rest operate post-hoc on whatever the agent decided to save.
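A minimal sketch of such a pre-write gate follows, assuming the four answers are already available as fields on a candidate. The `Candidate` and `Decision` types, the layer names, and the access-pattern labels are hypothetical. Only the rejection on the second question comes from the text; rejecting on the first question and sending directives to an always-loaded layer are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    text: str
    needed_in_future_session: bool  # question 1
    already_stored_elsewhere: bool  # question 2
    access_pattern: str             # question 3 (hypothetical labels: "lookup", "browse")
    is_directive: bool              # question 4


@dataclass(frozen=True)
class Decision:
    accepted: bool
    layer: Optional[str]
    reason: str


# Hypothetical mapping from access pattern to layer name.
LAYER_BY_ACCESS = {"lookup": "indexed_store", "browse": "vault"}


def gate(c: Candidate) -> Decision:
    if not c.needed_in_future_session:
        # Assumption: session-local facts are not worth persisting.
        return Decision(False, None, "not needed in a future session")
    if c.already_stored_elsewhere:
        # Stated in the text: derivable knowledge is rejected.
        return Decision(False, None, "already stored elsewhere")
    if c.is_directive:
        # Assumption: directives must load without a search.
        return Decision(True, "always_loaded", "directive")
    layer = LAYER_BY_ACCESS.get(c.access_pattern)
    if layer is None:
        return Decision(False, None, "no layer matches the access pattern")
    return Decision(True, layer, "fact")
```

The gate runs before the store is touched, so a rejected candidate never produces a write that later needs pruning.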
Hybrid retrieval that handles vocabulary mismatch
Lexical search handles exact-phrase recall. Dense vectors handle semantic gaps. A sparse Hamming-code pathway catches typos and morphological drift. A cross-encoder reranks the top candidates, and a query-expansion pass kicks in when the obvious search returns thin results. Four pathways, fused with reciprocal rank fusion, because no single retrieval method handles every failure mode and the failures are diverse.
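Reciprocal rank fusion itself is small enough to show in full. The sketch below uses the standard formulation, in which each ranked list contributes 1 / (k + rank) to a document's score. The constant k = 60 is a common default in the literature and is an assumption here, not a value taken from reverie; the example rankings are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up results from three pathways for one query.
lexical = ["a", "b", "c"]
dense = ["b", "c", "a"]
sparse = ["b", "a", "d"]

fused = reciprocal_rank_fusion([lexical, dense, sparse])
print(fused)  # ['b', 'a', 'c', 'd']
```

Because only ranks enter the score, the pathways do not need comparable score scales, which is what makes this kind of fusion convenient for mixing lexical and vector results.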
Contradiction handling that doesn’t silently overwrite
On save, every observation is checked against existing entries. Conflicts get tagged on both sides and feed a supersession graph that drives the next consolidation pass. Old beliefs aren’t silently replaced — they’re kept until something newer earns precedence, which matters when the agent later asks why the system used to think otherwise.
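The save-time check might look roughly like the following. This is a toy sketch under stated assumptions: the `Store` and `Observation` types are hypothetical, the conflict detector is passed in as a plain function, and `toy_conflict` (same key, different value in `key=value` strings) stands in for whatever detection reverie actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Observation:
    id: int
    text: str
    conflicts_with: Set[int] = field(default_factory=set)


class Store:
    def __init__(self, conflicts: Callable[[str, str], bool]):
        self._conflicts = conflicts
        self.entries: Dict[int, Observation] = {}
        # (newer_id, older_id) pairs for a later consolidation pass to resolve.
        self.supersession_edges: List[Tuple[int, int]] = []

    def save(self, text: str) -> Observation:
        new = Observation(id=len(self.entries) + 1, text=text)
        for old in self.entries.values():
            if self._conflicts(old.text, text):
                # Tag both sides; the older entry is kept, not overwritten.
                old.conflicts_with.add(new.id)
                new.conflicts_with.add(old.id)
                self.supersession_edges.append((new.id, old.id))
        self.entries[new.id] = new
        return new


def toy_conflict(a: str, b: str) -> bool:
    """Hypothetical detector: same key, different value in 'key=value' text."""
    key_a, _, val_a = a.partition("=")
    key_b, _, val_b = b.partition("=")
    return key_a == key_b and val_a != val_b
```

Saving `indent=tabs` and then `indent=spaces` leaves both entries in the store, each tagged with the other's id, plus one edge recording which of the two is newer.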
The research grounding
Reverie isn’t “brain-inspired” in the marketing sense. It implements a small number of mechanisms from cognitive neuroscience and treats their predictions as load-bearing constraints on the architecture. A handful of them, in particular, determine why the system has the shape it does.
Complementary learning systems theory is why the fast write path and the slow consolidation path are different processes — biology pays a heavy price to keep them apart, and the price is paid for a reason. Sharp-wave-ripple replay is why the consolidation phase is priority-weighted (recency × access × importance × novelty) rather than FIFO; the brain doesn’t replay everything, and neither do we. Synaptic homeostasis is why downscaling is proportional to disuse rather than a hard threshold — a clean threshold throws away too much, a graceful decay keeps the long tail available. Reconsolidation is why retrieval is treated as a write opportunity, not a duplicate event; remembering changes what is remembered. And the spacing effect is why stability gain is largest when retrievability was lowest before the access — a result borrowed wholesale from FSRS and the Ebbinghaus literature.
Each of these ruled out a shape the system might otherwise have taken. They are not decorations.
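Three of these constraints can be written down as small functions. The sketch below is illustrative only: the multiplicative priority follows the factors listed above, while the exponential decay, the linear stability bonus, the default rates, and all names are assumptions, not reverie's formulas and not FSRS's.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trace:
    name: str
    recency: float     # each factor assumed to be normalised to [0, 1]
    access: float
    importance: float
    novelty: float

    @property
    def priority(self) -> float:
        # Priority-weighted replay: recency x access x importance x novelty.
        return self.recency * self.access * self.importance * self.novelty


def select_for_replay(traces: List[Trace], n: int) -> List[Trace]:
    """Replay the n highest-priority traces instead of the n oldest (FIFO)."""
    return sorted(traces, key=lambda t: t.priority, reverse=True)[:n]


def downscale(strength: float, idle_days: float, rate: float = 0.05) -> float:
    """Graceful decay: longer disuse removes more, but nothing is cut to zero."""
    return strength * math.exp(-rate * idle_days)


def stability_gain(retrievability: float, max_bonus: float = 1.0) -> float:
    """Spacing effect: the harder the recall was, the larger the multiplier."""
    return 1.0 + max_bonus * (1.0 - retrievability)
```

With these definitions a trace recalled at retrievability 0.2 gains more stability than one recalled at 0.9, and a trace idle for 30 days keeps a smaller but nonzero share of its strength.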
What we’ve measured
Reverie’s retrieval is validated on the LoCoMo benchmark — 812 questions across five long-form conversations, scored against a Python BM25 reference implementation on the same questions:
| Metric | Reverie | BM25 baseline | Δ |
|---|---|---|---|
| R@1 | 52.8% | 46.1% | +6.7pp |
| R@5 | 77.0% | 66.5% | +10.5pp |
| R@10 | 84.0% | 71.9% | +12.1pp |
| MRR | 0.628 | 0.551 | +0.077 |
The largest gains show up where vocabulary mismatch dominates — multi-hop questions (+14.9pp R@5) and adversarial ones (+11.7pp) — exactly where dense retrieval should pull ahead of pure lexical search. Single-hop questions, where BM25 already does well, still show a consistent +7.1pp lift.
A note on the numbers: 77% R@5 is retrieval recall, not end-to-end answer F1. The LoCoMo paper reports its best non-human baseline at F1 ≈ 41%, which sits downstream of retrieval and is expected to be lower. End-to-end F1 numbers, with a judge agent, ship with the v1.0 milestone.
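For readers unfamiliar with the metrics in the table, a small sketch of how R@k and MRR are typically computed follows. It assumes R@k counts a question as a hit when any gold evidence item appears in the top k results, which may differ from the exact scoring used in the benchmark run; the function names and the sample data are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked result ids, gold evidence ids)


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant id is in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Run], ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    n = len(runs)
    metrics = {f"R@{k}": sum(hit_at_k(r, rel, k) for r, rel in runs) / n for k in ks}
    metrics["MRR"] = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return metrics


# Four made-up questions: first-rank hit, third-rank hit, miss, second-rank hit.
runs: List[Run] = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"z"}),
    (["p", "q"], {"nope"}),
    (["m", "n"], {"n"}),
]
metrics = evaluate(runs)
```

On this toy data R@1 is 0.25, R@5 and R@10 are both 0.75, and MRR is (1 + 1/3 + 0 + 1/2) / 4, about 0.458. The Δ column in the table is the difference between two such percentages, in percentage points.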
What it isn’t
- Not a chat-memory product. Tools like MemGPT and Letta optimize for stretching the current session past the model’s context window. Reverie optimizes for what survives after the session ends. The two compose; they don’t compete.
- Not a CRUD pipeline. Reverie is write-once with offline consolidation; updates happen via reconsolidation, not delete verbs. Add/Update/Delete is the wrong abstraction for memory that is meant to age.
- Not a knowledge graph. No typed relations, no entity linking, no traversal API. Observations are flat text plus metadata; graph-shaped knowledge belongs in the vault layer where humans actually look at it.
- Not a one-store-fits-all replacement. Reverie’s whole thesis is that one store does not fit all. It cooperates with the layers it doesn’t replace — instruction files, vaults, the repo itself.
Status
Proprietary. Copyright © 2026 Christian M. Todie / cerebral.work — all rights reserved.
Currently v0.9.13. v1.0 targets 2026-09-15 and gates on four blockers: multi-user v1 (tenant-scoped observations, per-user auth on all routes), Postgres integration suite green, docs end-to-end review, and the M1 findings paper published. Quality bars: core crate test coverage ≥ 70%, LoCoMo MRR ≥ 0.657, /health green for 14+ consecutive days on the reference deployment, all CI gates green. For commercial licensing, embedded use, or evaluation access, contact [email protected].