Claude Session Coordination Protocol — v0 #
Status: draft, ship-today scope (local filesystem backend) Target: cooperating Claude Code sessions on one host, with a clear path to multi-host (Redis / NATS / Postgres) in v1+.
1. Problem #
Two or more Claude Code sessions running against the same workspace (either the
same user in multiple VSCode windows, or via claude --resume) collide in
subtle, silent ways. Examples observed 2026-04-07 during the MVP-B push:
- Git state contention — peer session created a
reverie-wt-tod-412worktree and modifiedmain.rs; the primary session didn’t discover the drift untilcargosurfaced compile errors against the composed tree. gh pr mergerace — post-push mergeability is async on GitHub’s side; back-to-back merges from different sessions hitUNKNOWN/UNSTABLEand error non-deterministically.~/.claude/file races — both sessions tried to write hooks and settings simultaneously; last writer silently won.- Branch-base drift — peer branched from an older main, primary branched from a newer one; neither session could see that a shared helper struct had gained a field, and the merge train hit type errors at compose time.
- Daemon ownership ambiguity — the cutover
engram serveprocess is shared state, but neither session owned it; SIGTERM during in-place upgrade could kill the other session’s in-flight requests. - Pre-commit hook stash dance — cargo-fmt’s stash-then-restore silently dropped merge commits when the primary session was actively editing, because the peer session had unstaged changes the hook didn’t know about.
Missing layer: peer-discoverable structured state that any cooperating Claude session can read/write without stepping on each other.
2. Design principles #
- Filesystem-backed for v0. Zero infrastructure. Survives session crashes
(the file is still there). Inspectable by
cat+jq. - Operation set isomorphic to Redis primitives. Every v0 filesystem operation maps cleanly to a single Redis command (or Lua script), so v1 is a backend swap, not a protocol rewrite.
- Opt-in, non-blocking. A session that doesn’t implement coordination isn’t broken — it just gets no peer awareness. Coord-aware sessions handle coord-unaware peers by treating the whole tree as “potentially contended”.
- Arbitrary opaque blobs. Every message carries a
blobfield that neither sender nor receiver parses — use for experimental / future-proof payloads without schema bumps. - Short schemas, loud version bumps. Schema version is prominent in every file; mismatch → other sessions ignore you gracefully.
- Liveness is heartbeat-based, not TCP-based. File mtime + heartbeat timestamp + process check. Cheap and distributed-safe.
3. Wire format #
3.1 Session record #
{
"schema": 1,
"session_id": "c87d1c5c-ee1c-480d-9384-ca2481aa143b",
"claude_pid": 5865,
"claude_version": "2.1.92",
"cwd": "/home/ctodie/projects/reverie",
"bin": "/home/ctodie/.vscode-server/.../native-binary/claude",
"started_at": "2026-04-07T09:47:00-04:00",
"last_heartbeat": "2026-04-07T15:38:00-04:00",
"owned_resources": {
"worktrees": ["/home/ctodie/projects/reverie-tod-406"],
"branches": ["chris/tod-406-dream-classify"],
"prs": [34, 35],
"processes": {
"engram_serve_pid": 46183
},
"files": []
},
"current_task": {
"ticket": "TOD-406",
"phase": "merging PR #34",
"status": "in_progress"
},
"blob": {
"schema_hint": "claude-coord-v0",
"notes": "free-form text",
"handoffs": [],
"announcements": [],
"custom": {}
}
}
3.2 Lock record #
{
"schema": 1,
"resource": "main-branch",
"owner_session_id": "c87d1c5c-...",
"owner_pid": 5865,
"acquired_at": "2026-04-07T15:40:00-04:00",
"expires_at": "2026-04-07T16:40:00-04:00",
"reason": "merging PR #34 to main"
}
3.3 Message record #
{
"schema": 1,
"from_session_id": "c87d1c5c-...",
"to_session_id": "3443ca93-...",
"sent_at": "2026-04-07T15:42:00-04:00",
"kind": "handoff",
"subject": "TOD-407 is unblocked",
"body": "finished TOD-406 merge, TOD-407 place phase is all yours — scan/classify types are stable in phases/mod.rs now",
"blob": {}
}
Message kinds (open set, recognized by convention):
handoff— task handoff between sessionswarn— non-fatal advisory (“don’t touch X, I’m mid-merge”)emergency— stop-the-line (“main is broken on HEAD~2, roll back”)status— informational broadcast (“dream benchmark running on :17437”)request— ask a peer to do somethingreply— answer to arequest
4. Filesystem layout (v0 backend) #
/tmp/claude-coord/
├── schema # file containing "1" — version marker
├── sessions/
│ ├── c87d1c5c-....json # per-session state
│ └── 3443ca93-....json
├── locks/
│ ├── main-branch/ # atomic dir = held lock
│ │ ├── owner # session id
│ │ └── record.json # lock record from §3.2
│ └── engram-serve/
└── messages/
├── inbox-c87d1c5c/ # one dir per destination session
│ └── 2026-04-07T15-42-00Z-001.json
└── inbox-3443ca93/
- Atomic lock:
mkdirof the resource dir is the atomic primitive. Ifmkdirsucceeds, you own the lock. Release viarm -rf. - Message ordering: filename prefix is an ISO timestamp + sequence number, so sorted directory listing is the delivery order.
- Schema file: single-line version. All sessions writing under
/tmp/claude-coord/MUST checkcat schemafirst and refuse if mismatched.
5. Operation set (the stable API) #
| Op | Shell | Redis equivalent (v1) | Notes |
|---|---|---|---|
| register | coord register [--task ...] | HSET coord:session:<id> ...; ZADD coord:sessions <now> <id> | Idempotent. |
| heartbeat | coord heartbeat | HSET coord:session:<id> last_heartbeat <now>; ZADD coord:sessions <now> <id> | Call every 30s. |
| peers | coord peers [--live] | ZRANGEBYSCORE coord:sessions <now-5m> +inf | --live filters stale. |
| lock | coord lock <resource> [--ttl 1h] [--reason ...] | SET coord:lock:<r> <id> NX EX <ttl-sec> | Blocks until acquired or timeout. |
| unlock | coord unlock <resource> | Lua: del only if value == <id> | Owner-only. |
| steal | coord steal <resource> | Lua: del only if ttl expired OR owner pid dead | Recovers stuck locks. |
| send | coord send <peer> <kind> <subject> [--body ...] | LPUSH coord:inbox:<peer> <msg> | Fire and forget. |
| recv | coord recv [--drain] | RPOP coord:inbox:<self> (or LRANGE) | Non-blocking. |
| update | coord update [--task ...] [--status ...] [--blob ...] [--merge-blob] | HSET coord:session:<id> ... | Patch own session record (task, status, blob) without full re-register. |
| broadcast | coord broadcast <kind> <subject> [--body ...] [--live-only] | for peer in SMEMBERS coord:sessions: LPUSH coord:inbox:<peer> <msg> | Send a message to every peer (optionally only live ones). |
| dereg | coord dereg | DEL coord:session:<id>; ZREM coord:sessions <id> | On session end. Alias: coord deregister. |
| log | coord log [tail|stats|locks|session] [...] | — | Query the audit log. Subcommands: tail (filter + tail entries), stats (op counts + lock-hold percentiles), locks (lock/unlock/steal history), session <id> (all events for a session). |
| metrics | coord metrics | — | Prometheus-format metrics: live sessions, held locks, message counters. |
| project-lock | coord project-lock <project> [--area X] | SET coord:lock:project:<p>[:<a>] <id> NX EX <ttl> | Convenience over lock; tags scope=project. |
| project-unlock | coord project-unlock <project> [--area X] | (same as unlock) | Owner-only release. |
| status | coord status | — | Human-readable dump of own state. |
All operations return exit code 0 on success, non-zero on error, with JSON on stdout and human-readable text on stderr. Machine-friendly, human-inspectable.
6. Liveness and stale cleanup #
A session is stale if:
now - last_heartbeat > STALE_HEARTBEAT_THRESHOLD(default 5 minutes), ORkill -0 $claude_pidfails (process gone), ORclaude_pidis reused by a different command (check/proc/$pid/commif Linux; fall back to process age if macOS).
A lock is revocable if:
- Its owner session is stale, AND
now > record.expires_at, AND- At least 10 minutes have elapsed since acquisition.
Any live peer can call coord steal <resource> to break a stuck lock. The
steal is logged in the broken lock’s record.json before unlink so there’s an
audit trail.
7. Coordination rules for Claude sessions #
At session start #
coord register --task <description>— write own session filecoord peers— enumerate live peers; if any, surface them in the first response to the user (e.g., “There are 2 other Claude sessions running on this repo: c87d… in /projects/reverie, 3443… resuming my session.”)- Start a heartbeat loop:
coord heartbeatevery 30s via a backgrounded shell process or a hook firing on every tool call.
Before any shared-state action #
| Action | Lock | Rationale |
|---|---|---|
git checkout main | main-branch | Prevent lost work |
git push main | main-push | Prevent race |
gh pr merge N | pr-merge-queue | Prevent UNKNOWN mergeability races |
cargo build on main worktree | cargo-build | Prevent CPU thrash + target/ corruption |
Modify ~/.claude/* | claude-config | Prevent lost writes |
kill or mv on ~/.local/bin/engram | engram-serve | Daemon cutover safety |
| Spawn a background agent | (register under owned_resources.agents) | Peer awareness |
| Edit/merge files in a shared project tree | coord project-lock <project> [--area X] | Serialize peer sessions on the same files without going through the coarse main-branch lock |
Project merge locks #
Two cooperating sessions touching the same repo collide most often inside a
single file region (e.g. both editing engram_compat.rs while rebasing
adjacent PRs). The main-branch lock is too coarse — it blocks unrelated work
on other crates. Project locks give a middle granularity:
coord project-lock <project>— whole-project lock; resource idproject:<project>(e.g.project:reverie). Use when rewriting many files or running a global format pass.coord project-lock <project> --area <area>— area-scoped lock; resource idproject:<project>:<area>(e.g.project:reverie:engram_compat,project:reverie:dream-runner). Use when you know exactly which file or module you’re going to touch.
area is a free-form slug — convention is the crate or module name, no path
separators. Two sessions on different --area values acquire independently;
two on the same --area serialize. The whole-project lock and any area lock
are independent — they do not currently nest. If you need exclusion across
all areas, take the whole-project lock.
Lock records carry scope: "project" so peers can filter project locks out of
coord status when they only care about cross-cutting locks like
pr-merge-queue.
On tool errors that look suspicious #
Before retrying a failed git/gh/cargo op, run coord peers to check if a peer
is doing the same thing. If so, wait for their lock to release instead of
fighting.
At session end #
coord dereg removes own session file and releases all held locks. Session
crashes (no dereg) are handled by the stale cleanup rule in §6.
8. Multi-host evolution path #
v0 (this doc) — local filesystem #
- Backend:
/tmp/claude-coord/ - Works across WSL/Linux/macOS on one host
- No dependencies, zero setup
- Single-user (filesystem permissions)
v1 — Redis #
Swap coord binary’s backend from filesystem to Redis via a backend enum
selected by COORD_BACKEND=redis + COORD_REDIS_URL=redis://....
Mapping (already canonicalized in §5):
coord register → HSET coord:session:<id>; ZADD coord:sessions <now> <id>
coord heartbeat → HSET coord:session:<id> last_heartbeat <now>; ZADD ...
coord peers → ZRANGEBYSCORE coord:sessions <now-5m> +inf
coord lock → SET coord:lock:<r> <id> NX EX <ttl>
coord unlock → EVAL "if redis.call('GET', K) == ARGV[1] then DEL K end"
coord steal → Lua: DEL only if ttl expired
coord send → LPUSH coord:inbox:<peer> <msg>
coord recv → LRANGE then DEL (or RPOP in loop)
coord dereg → DEL + ZREM
Benefits:
- Atomic ops are single commands — no
mkdirgymnastics BLPOPgives real push semantics (recv can block with timeout)PUBLISH/SUBSCRIBEenables push notifications for cross-session events- Cluster-capable (multiple hosts coordinate via shared Redis)
Cost:
- Requires Redis running (single-node for dev; replicated for prod)
- Network hop per op (negligible on LAN, painful on high-latency VPN)
v1 alt — NATS JetStream #
Similar mapping but uses NATS primitives:
- KV bucket for sessions and locks (atomic ops built-in)
- Stream per inbox (persistent, replayable, TTL-managed)
- PubSub for real-time peer events
Better for “truly multi-host, potentially across WAN” because NATS has first-class clustering + leaf nodes + JetStream file-backed persistence.
v1 alt — Postgres #
For installations that already run Postgres:
- Table
coord_sessionswith trigger on UPDATE for stale cleanup SELECT ... FOR UPDATE SKIP LOCKEDfor lock acquisitionLISTEN/NOTIFYfor push events- Full SQL observability — any peer can query history
v2 — HTTP gateway #
For sessions in browser tabs or locked-down environments where local FS + Redis both unavailable: stand up a small HTTP gateway (Cloudflare Worker + KV for storage) that exposes the coord API as JSON-RPC. Latency ~50ms vs ~1ms local, but unblocks hostile environments.
8a. Schema evolution and migrations #
The v0 schema is intentionally shaped like a protobuf message — field numbers
are reserved in comments on the JSON Schema, and a draft .proto file ships
alongside it at docs/coord/coord.proto. When v1 Redis/NATS lands, the
migration to real protobuf is mechanical (protoc --rust_out=...), not a
design exercise.
Full migration rules, version history, compatibility matrix, and the
coord migrate subcommand design: docs/coord/migrations.md.
TL;DR:
- JSON for v0, protobuf-shaped.
.protoco-located, not compiled. - Never rename a field. Never reuse a field number. Add new; deprecate old.
schema: <int>in every record. Bump on breaking changes.- Unknown fields preserved via
blob._unknownfor forward compat. coord migrateis a no-op today; real migrations land alongside schema 2.- Protobuf cut-over happens at v1 (first network-backed implementation).
9. Binary + schema artifacts #
Shipped alongside this doc:
~/.claude/bin/coord— shell implementation of the v0 backend (bash +jq). ~200 LOC. Single file, self-contained. Drop into$PATH.~/.claude/coord/schema-v0.json— JSON Schema (draft 2020-12) for the session + lock + message records in §3. Machine-validatable.~/.claude/CLAUDE.mdaddendum — global rule that all Claude Code sessions on this machine MUST register + heartbeat + acquire locks before shared-state actions.
See ~/.claude/bin/coord --help and ~/.claude/coord/schema-v0.json for the
concrete artifacts.
10. Forward compatibility (v0 → v1) #
The shell binary, schema, and global rule ship TODAY. v1 is a pure backend swap — no caller (Claude session) code changes required. The trigger to ship v1 is:
- Two Claude sessions on different hosts need to coordinate, OR
- Local filesystem becomes a performance bottleneck (hundreds of heartbeats per minute), OR
- A user wants audit history of coordination events (v0 forgets on reboot)
Until one of those is true, v0 is enough.
11. Open questions #
- Should
coordbe a Rust binary (compiled fromcrates/reverie-coord) instead of shell? Pros: cross-platform without bash, typed, testable. Cons: compilation step, more moving parts for a v0. Decision: start with shell, port to Rust when the shell hits its limits (probably around the v1 Redis backend when we need a Redis client). - How do we handle Claude Code itself crashing between
registeranddereg? The stale cleanup in §6 handles it eventually (5 min + 10 min lock timeout), but there’s a window where peers see a ghost. Mitigation: emit a dereg on the Claude Codeexithook if one exists, fall back to heartbeat expiry. - Locking granularity: is
main-branchtoo coarse? Should we havemain-readvsmain-write? Decision: start coarse, refine when contention actually warrants it. - What about worktrees? Each worktree has its own index and can hold its
own locks without colliding on the primary’s. We track worktrees under
owned_resources.worktreesbut don’t lock them separately for v0.
12. Trigger conditions for this doc #
File as a follow-up Linear ticket. The doc + binary ship when:
- User explicitly approves (this session), OR
- A third Claude session joins the primary + peer and causes a visible collision
Target: ship v0 tonight, observe for 1 week, then either promote to v1 Redis or declare v0 sufficient.