Claude Session Coordination Protocol — v0

Status: draft, ship-today scope (local filesystem backend)
Target: cooperating Claude Code sessions on one host, with a clear path to multi-host (Redis / NATS / Postgres) in v1+.

1. Problem

Two or more Claude Code sessions running against the same workspace (either the same user in multiple VSCode windows, or via claude --resume) collide in subtle, silent ways. Examples observed 2026-04-07 during the MVP-B push:

  1. Git state contention — peer session created a reverie-wt-tod-412 worktree and modified main.rs; the primary session didn’t discover the drift until cargo surfaced compile errors against the composed tree.
  2. gh pr merge race — post-push mergeability is async on GitHub’s side; back-to-back merges from different sessions hit UNKNOWN / UNSTABLE and error non-deterministically.
  3. ~/.claude/ file races — both sessions tried to write hooks and settings simultaneously; last writer silently won.
  4. Branch-base drift — peer branched from an older main, primary branched from a newer one; neither session could see that a shared helper struct had gained a field, and the merge train hit type errors at compose time.
  5. Daemon ownership ambiguity — the cutover engram serve process is shared state, but neither session owned it; SIGTERM during in-place upgrade could kill the other session’s in-flight requests.
  6. Pre-commit hook stash dance — cargo-fmt’s stash-then-restore silently dropped merge commits when the primary session was actively editing, because the peer session had unstaged changes the hook didn’t know about.

Missing layer: peer-discoverable structured state that any cooperating Claude session can read/write without stepping on each other.

2. Design principles

  1. Filesystem-backed for v0. Zero infrastructure. Survives session crashes (the file is still there). Inspectable by cat + jq.
  2. Operation set isomorphic to Redis primitives. Every v0 filesystem operation maps cleanly to a single Redis command (or Lua script), so v1 is a backend swap, not a protocol rewrite.
  3. Opt-in, non-blocking. A session that doesn’t implement coordination isn’t broken — it just gets no peer awareness. Coord-aware sessions handle coord-unaware peers by treating the whole tree as “potentially contended”.
  4. Arbitrary opaque blobs. Every message carries a blob field that neither sender nor receiver parses — use for experimental / future-proof payloads without schema bumps.
  5. Short schemas, loud version bumps. Schema version is prominent in every file; mismatch → other sessions ignore you gracefully.
  6. Liveness is heartbeat-based, not TCP-based. File mtime + heartbeat timestamp + process check. Cheap and distributed-safe.
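Principle 5 can be sketched as a reader that skips anything it does not understand. This is a minimal illustration, not the shipped implementation: the function name `load_peer_records` and the constant `SUPPORTED_SCHEMA` are hypothetical, and the only facts taken from this doc are the `schema` field and the per-session `*.json` files.

```python
import json
from pathlib import Path

SUPPORTED_SCHEMA = 1  # matches the "schema": 1 field shown in the wire format


def load_peer_records(sessions_dir: Path) -> list[dict]:
    """Return the session records we understand; silently skip everything else."""
    records = []
    for path in sorted(sessions_dir.glob("*.json")):
        try:
            record = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable or half-written file: ignore it, do not crash
        if not isinstance(record, dict):
            continue
        if record.get("schema") != SUPPORTED_SCHEMA:
            continue  # schema mismatch: ignore the peer gracefully
        records.append(record)
    return records
```

A reader built this way never fails because of a newer peer; it simply sees fewer sessions.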

3. Wire format

3.1 Session record

{
  "schema": 1,
  "session_id": "c87d1c5c-ee1c-480d-9384-ca2481aa143b",
  "claude_pid": 5865,
  "claude_version": "2.1.92",
  "cwd": "/home/ctodie/projects/reverie",
  "bin": "/home/ctodie/.vscode-server/.../native-binary/claude",
  "started_at": "2026-04-07T09:47:00-04:00",
  "last_heartbeat": "2026-04-07T15:38:00-04:00",
  "owned_resources": {
    "worktrees": ["/home/ctodie/projects/reverie-tod-406"],
    "branches":  ["chris/tod-406-dream-classify"],
    "prs":       [34, 35],
    "processes": {
      "engram_serve_pid": 46183
    },
    "files": []
  },
  "current_task": {
    "ticket": "TOD-406",
    "phase":  "merging PR #34",
    "status": "in_progress"
  },
  "blob": {
    "schema_hint": "claude-coord-v0",
    "notes":       "free-form text",
    "handoffs":    [],
    "announcements": [],
    "custom":      {}
  }
}
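Because problem 3 in §1 was a silent last-writer-wins race, a session record is best written so that readers never observe a half-written file. The sketch below shows one common way to do that (temp file plus atomic rename). The helper names `write_session_record` and `heartbeat` are hypothetical, and the UTC timestamp format is an assumption; the doc's own examples use a local offset.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_session_record(sessions_dir: Path, record: dict) -> Path:
    """Write <session_id>.json via temp file + rename so readers never see a partial file."""
    sessions_dir.mkdir(parents=True, exist_ok=True)
    target = sessions_dir / f"{record['session_id']}.json"
    fd, tmp_name = tempfile.mkstemp(dir=sessions_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(record, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave temp files behind on failure
        raise
    return target


def heartbeat(sessions_dir: Path, session_id: str) -> dict:
    """Refresh last_heartbeat on an existing record and rewrite it atomically."""
    path = sessions_dir / f"{session_id}.json"
    record = json.loads(path.read_text())
    record["last_heartbeat"] = datetime.now(timezone.utc).isoformat(timespec="seconds")
    write_session_record(sessions_dir, record)
    return record
```

The temp file uses a `.tmp` suffix so that a peer globbing for `*.json` does not pick it up.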

3.2 Lock record

{
  "schema": 1,
  "resource": "main-branch",
  "owner_session_id": "c87d1c5c-...",
  "owner_pid": 5865,
  "acquired_at": "2026-04-07T15:40:00-04:00",
  "expires_at":  "2026-04-07T16:40:00-04:00",
  "reason":  "merging PR #34 to main"
}

3.3 Message record

{
  "schema": 1,
  "from_session_id": "c87d1c5c-...",
  "to_session_id":   "3443ca93-...",
  "sent_at": "2026-04-07T15:42:00-04:00",
  "kind":    "handoff",
  "subject": "TOD-407 is unblocked",
  "body":    "finished TOD-406 merge, TOD-407 place phase is all yours — scan/classify types are stable in phases/mod.rs now",
  "blob":    {}
}
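A filesystem send/recv pair for this record could look like the sketch below. It assumes one inbox directory per destination session under `messages/`, as in §4. The function names, the use of the full session id in the directory name, and the microsecond-plus-random-suffix filename are illustrative assumptions, not the shipped naming scheme.

```python
import json
import os
import uuid
from datetime import datetime, timezone
from pathlib import Path


def send(root: Path, to_id: str, from_id: str, kind: str, subject: str, body: str = "") -> Path:
    """Drop one message file into the recipient's inbox directory (fire and forget)."""
    inbox = root / "messages" / f"inbox-{to_id}"
    inbox.mkdir(parents=True, exist_ok=True)
    now = datetime.now(timezone.utc)
    message = {
        "schema": 1,
        "from_session_id": from_id,
        "to_session_id": to_id,
        "sent_at": now.isoformat(timespec="seconds"),
        "kind": kind,
        "subject": subject,
        "body": body,
        "blob": {},
    }
    # Timestamp prefix keeps files sortable; the random suffix avoids name clashes.
    name = f"{now.strftime('%Y-%m-%dT%H-%M-%S.%fZ')}-{uuid.uuid4().hex[:8]}.json"
    tmp = inbox / f".{name}.tmp"
    tmp.write_text(json.dumps(message))
    os.replace(tmp, inbox / name)  # publish atomically
    return inbox / name


def recv(root: Path, self_id: str, drain: bool = False) -> list[dict]:
    """Return the oldest message, or all of them when drain=True; never blocks."""
    inbox = root / "messages" / f"inbox-{self_id}"
    if not inbox.is_dir():
        return []
    paths = sorted(inbox.glob("*.json"))
    if not drain:
        paths = paths[:1]
    messages = []
    for path in paths:
        messages.append(json.loads(path.read_text()))
        path.unlink()  # consume the message
    return messages
```

Each inbox is assumed to have a single reader (its owning session), so `recv` does not guard against two consumers racing on the same file.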

Message kinds (open set, recognized by convention):

4. Filesystem layout (v0 backend)

/tmp/claude-coord/
├── schema                              # file containing "1" — version marker
├── sessions/
│   ├── c87d1c5c-....json                # per-session state
│   └── 3443ca93-....json
├── locks/
│   ├── main-branch/                     # atomic dir = held lock
│   │   ├── owner                        # session id
│   │   └── record.json                  # lock record from §3.2
│   └── engram-serve/
└── messages/
    ├── inbox-c87d1c5c/                  # one dir per destination session
    │   └── 2026-04-07T15-42-00Z-001.json
    └── inbox-3443ca93/
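The "atomic dir = held lock" idea in the tree above rests on `mkdir` either creating the directory or failing, with no in-between state. A minimal sketch of that technique follows. The names `try_lock` and `unlock` are hypothetical, resource names are assumed to be simple slugs with no path separators, and this version makes a single attempt rather than blocking like the `lock` operation in §5.

```python
import json
import os
import shutil
from datetime import datetime, timedelta, timezone
from pathlib import Path


def try_lock(root: Path, resource: str, session_id: str, pid: int,
             ttl: timedelta = timedelta(hours=1), reason: str = "") -> bool:
    """Try once to take the lock; return False if another session already holds it."""
    locks_dir = root / "locks"
    locks_dir.mkdir(parents=True, exist_ok=True)
    lock_dir = locks_dir / resource
    try:
        os.mkdir(lock_dir)  # atomic: exactly one concurrent caller succeeds
    except FileExistsError:
        return False
    now = datetime.now(timezone.utc)
    record = {
        "schema": 1,
        "resource": resource,
        "owner_session_id": session_id,
        "owner_pid": pid,
        "acquired_at": now.isoformat(timespec="seconds"),
        "expires_at": (now + ttl).isoformat(timespec="seconds"),
        "reason": reason,
    }
    (lock_dir / "owner").write_text(session_id)
    (lock_dir / "record.json").write_text(json.dumps(record, indent=2))
    return True


def unlock(root: Path, resource: str, session_id: str) -> bool:
    """Release the lock only if this session owns it."""
    lock_dir = root / "locks" / resource
    try:
        owner = (lock_dir / "owner").read_text().strip()
    except FileNotFoundError:
        return False
    if owner != session_id:
        return False
    shutil.rmtree(lock_dir)
    return True
```

One caveat of this sketch: there is a short window after `mkdir` in which the directory exists but `owner` has not been written yet, so readers should treat a lock directory without an `owner` file as held.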

5. Operation set (the stable API)

| Op | Shell | Redis equivalent (v1) | Notes |
|---|---|---|---|
| register | coord register [--task ...] | HSET coord:session:<id> ...; ZADD coord:sessions <now> <id> | Idempotent. |
| heartbeat | coord heartbeat | HSET coord:session:<id> last_heartbeat <now>; ZADD coord:sessions <now> <id> | Call every 30s. |
| peers | coord peers [--live] | ZRANGEBYSCORE coord:sessions <now-5m> +inf | --live filters stale. |
| lock | coord lock <resource> [--ttl 1h] [--reason ...] | SET coord:lock:<r> <id> NX EX <ttl-sec> | Blocks until acquired or timeout. |
| unlock | coord unlock <resource> | Lua: del only if value == <id> | Owner-only. |
| steal | coord steal <resource> | Lua: del only if ttl expired OR owner pid dead | Recovers stuck locks. |
| send | coord send <peer> <kind> <subject> [--body ...] | LPUSH coord:inbox:<peer> <msg> | Fire and forget. |
| recv | coord recv [--drain] | RPOP coord:inbox:<self> (or LRANGE) | Non-blocking. |
| update | coord update [--task ...] [--status ...] [--blob ...] [--merge-blob] | HSET coord:session:<id> ... | Patch own session record (task, status, blob) without full re-register. |
| broadcast | coord broadcast <kind> <subject> [--body ...] [--live-only] | for peer in ZRANGE coord:sessions 0 -1: LPUSH coord:inbox:<peer> <msg> | Send a message to every peer (optionally only live ones). |
| dereg | coord dereg | DEL coord:session:<id>; ZREM coord:sessions <id> | On session end. Alias: coord deregister. |
| log | coord log [tail\|stats\|locks\|session] [...] | | Query the audit log. Subcommands: tail (filter + tail entries), stats (op counts + lock-hold percentiles), locks (lock/unlock/steal history), session <id> (all events for a session). |
| metrics | coord metrics | | Prometheus-format metrics: live sessions, held locks, message counters. |
| project-lock | coord project-lock <project> [--area X] | SET coord:lock:project:<p>[:<a>] <id> NX EX <ttl> | Convenience over lock; tags scope=project. |
| project-unlock | coord project-unlock <project> [--area X] | (same as unlock) | Owner-only release. |
| status | coord status | | Human-readable dump of own state. |

All operations return exit code 0 on success, non-zero on error, with JSON on stdout and human-readable text on stderr. Machine-friendly, human-inspectable.
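From a caller's side, that convention can be consumed with a small wrapper: check the exit code, parse stdout as JSON, and surface stderr in the error. This is a sketch under assumptions. `run_json` and `CoordError` are hypothetical names, and the shape of the JSON payload is not specified in this doc, so the wrapper returns whatever was parsed.

```python
import json
import subprocess


class CoordError(RuntimeError):
    """Raised when a coord-style command exits non-zero."""


def run_json(argv: list[str]) -> object:
    """Run a command that follows the exit-code/stdout/stderr convention and return parsed stdout."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        # stderr carries the human-readable explanation
        raise CoordError(f"exit {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)
```

With the binary on PATH, a call might look like `run_json(["coord", "peers", "--live"])`.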

6. Liveness and stale cleanup

A session is stale if:

A lock is revocable if:

Any live peer can call coord steal <resource> to break a stuck lock. The steal is logged in the broken lock’s record.json before unlink so there’s an audit trail.
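The checks behind "stale" and "revocable" might be sketched as follows. The exact criteria used here are assumptions pieced together from other sections, not the authoritative rules: the 5-minute threshold mirrors the `<now-5m>` window in the `peers` row of §5, the process check follows design principle 6, and the revocability condition follows the `steal` row ("ttl expired OR owner pid dead"). The function names are hypothetical and the process probe is POSIX-only.

```python
import os
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=5)  # assumption: mirrors the <now-5m> window in section 5


def pid_alive(pid: int) -> bool:
    """POSIX process check: signal 0 probes for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def session_is_stale(record: dict, now: datetime) -> bool:
    """Stale when the heartbeat is too old or the owning process is gone."""
    last = datetime.fromisoformat(record["last_heartbeat"])
    return now - last > STALE_AFTER or not pid_alive(record["claude_pid"])


def lock_is_revocable(lock: dict, now: datetime) -> bool:
    """Revocable when the TTL has passed or the owner process is gone."""
    expired = now >= datetime.fromisoformat(lock["expires_at"])
    return expired or not pid_alive(lock["owner_pid"])
```

`now` must be timezone-aware (for example `datetime.now(timezone.utc)`), since the records carry UTC offsets.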

7. Coordination rules for Claude sessions

At session start

  1. coord register --task <description> — write own session file
  2. coord peers — enumerate live peers; if any, surface them in the first response to the user (e.g., “There are 2 other Claude sessions running on this repo: c87d… in /projects/reverie, 3443… resuming my session.”)
  3. Start a heartbeat loop: coord heartbeat every 30s via a backgrounded shell process or a hook firing on every tool call.
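Step 3 describes a backgrounded shell process or a hook. For a host program that prefers an in-process loop, the same idea can be sketched with a daemon thread. `start_heartbeat` is a hypothetical helper, and `beat` stands in for whatever actually performs the heartbeat (for instance, shelling out to `coord heartbeat`).

```python
import threading
from typing import Callable


def start_heartbeat(beat: Callable[[], None], interval_s: float = 30.0) -> threading.Event:
    """Call beat() every interval_s seconds on a daemon thread until the returned event is set."""
    stop = threading.Event()

    def loop() -> None:
        # wait() returns True as soon as stop is set, which ends the loop promptly
        while not stop.wait(interval_s):
            try:
                beat()
            except Exception:
                pass  # a failed heartbeat must not kill the loop

    threading.Thread(target=loop, name="coord-heartbeat", daemon=True).start()
    return stop
```

Calling `.set()` on the returned event stops the loop; because the thread is a daemon, it also disappears if the session process exits without cleanup, at which point the stale rule in §6 takes over.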

Before any shared-state action

| Action | Lock | Rationale |
|---|---|---|
| git checkout main | main-branch | Prevent lost work |
| git push main | main-push | Prevent race |
| gh pr merge N | pr-merge-queue | Prevent UNKNOWN mergeability races |
| cargo build on main worktree | cargo-build | Prevent CPU thrash + target/ corruption |
| Modify ~/.claude/* | claude-config | Prevent lost writes |
| kill or mv on ~/.local/bin/engram | engram-serve | Daemon cutover safety |
| Spawn a background agent | (register under owned_resources.agents) | Peer awareness |
| Edit/merge files in a shared project tree | coord project-lock <project> [--area X] | Serialize peer sessions on the same files without going through the coarse main-branch lock |

Project merge locks

Two cooperating sessions touching the same repo collide most often inside a single file region (e.g. both editing engram_compat.rs while rebasing adjacent PRs). The main-branch lock is too coarse — it blocks unrelated work on other crates. Project locks give a middle granularity:

area is a free-form slug — convention is the crate or module name, no path separators. Two sessions on different --area values acquire independently; two on the same --area serialize. The whole-project lock and any area lock are independent — they do not currently nest. If you need exclusion across all areas, take the whole-project lock.

Lock records carry scope: "project" so peers can filter project locks out of coord status when they only care about cross-cutting locks like pr-merge-queue.
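The independence of whole-project and area locks follows directly from how the lock key is built. The sketch below derives the key from the §5 pattern `coord:lock:project:<p>[:<a>]`. The function name is hypothetical, and rejecting `:` in slugs is an extra assumption made here to keep keys unambiguous; the doc itself only rules out path separators.

```python
from typing import Optional


def project_lock_key(project: str, area: Optional[str] = None) -> str:
    """Build the lock key following the pattern coord:lock:project:<p>[:<a>]."""
    for label, value in (("project", project), ("area", area)):
        if value is None:
            continue
        if not value or "/" in value or "\\" in value or ":" in value:
            raise ValueError(f"{label} must be a non-empty slug without separators: {value!r}")
    key = f"coord:lock:project:{project}"
    return key if area is None else f"{key}:{area}"
```

Two different areas produce two different keys, so they never contend; the whole-project key is a third, unrelated key, which is why it does not nest with area locks.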

On tool errors that look suspicious

Before retrying a failed git/gh/cargo op, run coord peers to check if a peer is doing the same thing. If so, wait for their lock to release instead of fighting.
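"Wait for their lock to release" can be as simple as polling the lock directory from §4 with a deadline. This is a minimal sketch for the filesystem backend only; `wait_for_release` and its default timings are hypothetical.

```python
import time
from pathlib import Path


def wait_for_release(lock_dir: Path, timeout_s: float = 60.0, poll_s: float = 0.5) -> bool:
    """Poll until the lock directory disappears; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while lock_dir.exists():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

On a `False` result the caller can decide whether the lock looks revocable under §6 rather than retrying the failed operation blindly.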

At session end

coord dereg removes own session file and releases all held locks. Session crashes (no dereg) are handled by the stale cleanup rule in §6.

8. Multi-host evolution path

v0 (this doc) — local filesystem

v1 — Redis

Swap the coord binary’s backend from filesystem to Redis via a backend enum selected by COORD_BACKEND=redis and COORD_REDIS_URL=redis://....
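The env-driven backend selection could be sketched like this. Only the variable names `COORD_BACKEND` and `COORD_REDIS_URL` and the value `redis` come from this doc; the `Backend` enum, the `select_backend` function, and the default value `filesystem` are assumptions for illustration.

```python
from collections.abc import Mapping
from enum import Enum
from typing import Optional


class Backend(Enum):
    FILESYSTEM = "filesystem"  # assumed name for the v0 default
    REDIS = "redis"


def select_backend(env: Mapping[str, str]) -> tuple[Backend, Optional[str]]:
    """Pick the backend from the environment; Redis additionally needs a URL."""
    backend = Backend(env.get("COORD_BACKEND", "filesystem"))  # ValueError on unknown names
    if backend is Backend.REDIS:
        url = env.get("COORD_REDIS_URL")
        if not url:
            raise ValueError("COORD_BACKEND=redis requires COORD_REDIS_URL")
        return backend, url
    return backend, None
```

Passing `os.environ` as `env` gives the real behaviour; taking a mapping instead of reading the environment directly keeps the function easy to test.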

Mapping (already canonicalized in §5):

coord register  → HSET coord:session:<id>; ZADD coord:sessions <now> <id>
coord heartbeat → HSET coord:session:<id> last_heartbeat <now>; ZADD ...
coord peers     → ZRANGEBYSCORE coord:sessions <now-5m> +inf
coord lock      → SET coord:lock:<r> <id> NX EX <ttl>
coord unlock    → EVAL "if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('DEL', KEYS[1]) else return 0 end" 1 coord:lock:<r> <id>
coord steal     → Lua: DEL only if ttl expired
coord send      → LPUSH coord:inbox:<peer> <msg>
coord recv      → LRANGE then DEL (or RPOP in loop)
coord dereg     → DEL + ZREM

Benefits:

Cost:

v1 alt — NATS JetStream

Similar mapping but uses NATS primitives:

Better for “truly multi-host, potentially across WAN” because NATS has first-class clustering + leaf nodes + JetStream file-backed persistence.

v1 alt — Postgres

For installations that already run Postgres:

v2 — HTTP gateway

For sessions in browser tabs or locked-down environments where local FS and Redis are both unavailable: stand up a small HTTP gateway (Cloudflare Worker + KV for storage) that exposes the coord API as JSON-RPC. Latency is ~50ms vs ~1ms local, but this unblocks hostile environments.

8a. Schema evolution and migrations

The v0 schema is intentionally shaped like a protobuf message — field numbers are reserved in comments on the JSON Schema, and a draft .proto file ships alongside it at docs/coord/coord.proto. When v1 Redis/NATS lands, the migration to real protobuf is mechanical (protoc --rust_out=...), not a design exercise.

Full migration rules, version history, compatibility matrix, and the coord migrate subcommand design: docs/coord/migrations.md.

TL;DR:

9. Binary + schema artifacts

Shipped alongside this doc:

See ~/.claude/bin/coord --help and ~/.claude/coord/schema-v0.json for the concrete artifacts.

10. Forward compatibility (v0 → v1)

The shell binary, schema, and global rule ship TODAY. v1 is a pure backend swap — no caller (Claude session) code changes required. The trigger to ship v1 is:

  1. Two Claude sessions on different hosts need to coordinate, OR
  2. Local filesystem becomes a performance bottleneck (hundreds of heartbeats per minute), OR
  3. A user wants audit history of coordination events (v0 forgets on reboot)

Until one of those is true, v0 is enough.

11. Open questions

12. Trigger conditions for this doc

File as a follow-up Linear ticket. The doc + binary ship when:

Target: ship v0 tonight, observe for 1 week, then either promote to v1 Redis or declare v0 sufficient.