Kernel-level tracing for the reverie env-map #

Status: research (2026-04-08, control-room lane)
Source: background research subagent + control-room synthesis
Cross-refs: docs/research/app-tracing-enforcement.md, Part H env-map

TL;DR #

Zero-root /proc sampling is the primary per-PID metric source: one pass over ~10 PIDs costs ~5 ms per 2s tick. eBPF turns out to be viable on this box after all — not inside the user distro, but via a privileged bpftrace sidecar container running against Docker Desktop's VM kernel. gdb/strace/perf remain manual, opt-in forensic tools kept off the hot path.

Part 1 — WSL2 6.6 tracing surface #

Kernel: Linux 6.6.87.2-microsoft-standard-WSL2 (June 2025). The full Linux tracing subsystem is present, but most of it is locked down without CAP_SYS_ADMIN.

Available without root:

Locked behind CAP_SYS_ADMIN:

Available but broken on WSL2:

Part 2 — Zero-root per-PID sampling (primary source) #

A handful of procfs reads per target PID per tick gives a rich metric set:

| Field | Source | Type | Meaning |
|---|---|---|---|
| state | /proc/&lt;pid&gt;/stat field 3 | enum | R/S/D/Z/T — running/sleeping/disk-wait/zombie/stopped |
| rss_bytes | /proc/&lt;pid&gt;/status VmRSS | gauge | Resident memory |
| vm_peak | /proc/&lt;pid&gt;/status VmPeak | gauge | Peak virtual memory |
| utime_ms | /proc/&lt;pid&gt;/stat field 14 | counter | User CPU time |
| stime_ms | /proc/&lt;pid&gt;/stat field 15 | counter | Kernel CPU time |
| num_threads | /proc/&lt;pid&gt;/stat field 20 | gauge | Thread count |
| voluntary_ctxt_switches | /proc/&lt;pid&gt;/status | counter | Voluntary context switches |
| nonvoluntary_ctxt_switches | /proc/&lt;pid&gt;/status | counter | Preempted context switches |
| io_read_bytes | /proc/&lt;pid&gt;/io read_bytes | counter | Bytes read through disk layer |
| io_write_bytes | /proc/&lt;pid&gt;/io write_bytes | counter | Bytes written through disk layer |
| io_syscr | /proc/&lt;pid&gt;/io syscr | counter | Read syscall count |
| io_syscw | /proc/&lt;pid&gt;/io syscw | counter | Write syscall count |
| wchan | /proc/&lt;pid&gt;/wchan | string | Kernel function this task is sleeping in |
| fd_count | ls /proc/&lt;pid&gt;/fd | gauge | Open file descriptor count |
| runtime_ns | /proc/&lt;pid&gt;/schedstat field 1 | counter | Time spent running on CPU |
| wait_sum_ns | /proc/&lt;pid&gt;/schedstat field 2 | counter | Time spent waiting to run |

Sampling cost at 2s tick for ~10 PIDs: ~5 ms total (measured). Negligible.
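The read itself is trivial; the only parsing subtlety is that comm (stat field 2) may contain spaces and parens, so the line must be split after the *last* `)`. A minimal std-only sketch of that parse (parse_stat is a hypothetical helper; the procfs crate in Part 5 does this properly, and returns ticks rather than ms):

```rust
/// Extract state, utime, stime and num_threads from a raw /proc/<pid>/stat
/// line. Field numbers follow proc(5); times are in clock ticks, not ms.
fn parse_stat(raw: &str) -> Option<(char, u64, u64, u32)> {
    // comm (field 2) may contain spaces/parens, so resume after the last ')'.
    let rest = raw.get(raw.rfind(')')? + 2..)?;
    let f: Vec<&str> = rest.split_whitespace().collect();
    // `rest` starts at field 3 (state), so stat field N lives at f[N - 3].
    let state = f.first()?.chars().next()?;       // field 3
    let utime: u64 = f.get(11)?.parse().ok()?;    // field 14
    let stime: u64 = f.get(12)?.parse().ok()?;    // field 15
    let threads: u32 = f.get(17)?.parse().ok()?;  // field 20
    Some((state, utime, stime, threads))
}

fn main() {
    // Smoke test against our own stat line (Linux only).
    if let Ok(raw) = std::fs::read_to_string("/proc/self/stat") {
        println!("{:?}", parse_stat(&raw));
    }
}
```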

Part 3 — eBPF / bpftrace (VIABLE via privileged Docker sidecar — 2026-04-08 update) #

Originally this section said eBPF wasn’t viable on WSL2 because the user distro lacks CAP + kernel headers. That’s still true for running bpftrace directly inside the user distro. But a privileged Docker sidecar container bypasses both limits, because Docker Desktop’s VM hosts the kernel, not the user distro, and the quay.io/iovisor/bpftrace:latest image ships its own matching headers + BTF.

Confirmed experiment (2026-04-08T08:10Z):

docker run --rm -d --name reverie-bpf \
    --privileged --pid=host \
    --cap-add=SYS_ADMIN --cap-add=PERFMON --cap-add=BPF \
    -v /sys:/sys:ro -v /lib/modules:/lib/modules:ro \
    --entrypoint sleep \
    quay.io/iovisor/bpftrace:latest 300

docker exec reverie-bpf bpftrace -e '
    tracepoint:syscalls:sys_enter_* { @[probe] = count(); }
    interval:s:3 { exit(); }
'
# → Attaching 347 probes...
# → @[tracepoint:syscalls:sys_enter_<name>]: <count>
# → full histogram over 3s across ALL host PIDs

Result: 347 syscall tracepoints attached cleanly, full histogram returned. eBPF programs loaded into the host (Docker Desktop VM) kernel, executed, and produced output. WSL2 + Docker Desktop is fine for eBPF if you’re willing to run a privileged sidecar.

What this unlocks #

The Part 6 integration plan now has a viable kernel-level path — alongside the /proc/&lt;pid&gt;/* polling inside reveried, we run a long-lived reverie-bpf sidecar container that:

  1. Starts with reveried-compose up (add it to the existing obs stack compose file)
  2. Runs a persistent bpftrace program that emits per-pid summaries every 10s
  3. Writes summaries to a shared volume /var/lib/reverie/bpf-summaries/<ts>.json
  4. Reveried’s env_ticker reads the latest summary and merges it into KernelHooks.per_pid[pid]
  5. meshctl TUI renders syscall rate, IO latency histograms, scheduler wakeups
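Step 4 reduces to "find the newest file in the summaries dir". A sketch, assuming the &lt;ts&gt; filename stem is unix seconds (that naming is an assumption; latest_summary is a hypothetical helper):

```rust
// Pick the newest <ts>.json summary from the shared volume's file listing.
// Non-matching filenames (no .json suffix, non-numeric stem) are skipped.
fn latest_summary(names: &[&str]) -> Option<String> {
    names
        .iter()
        .filter_map(|n| {
            let ts: i64 = n.strip_suffix(".json")?.parse().ok()?;
            Some((ts, n.to_string()))
        })
        .max_by_key(|&(ts, _)| ts)
        .map(|(_, n)| n)
}

fn main() {
    let names = ["1744099200.json", "1744099210.json", "notes.txt"];
    println!("{:?}", latest_summary(&names));
}
```

In reveried this would run over `std::fs::read_dir` results on each tick; the numeric sort (rather than mtime) keeps it deterministic while the sidecar is mid-write.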

Real probes worth running in the sidecar #

# Per-pid syscall rate histogram (the one we already tested)
tracepoint:syscalls:sys_enter_* { @[pid, probe] = count(); }
interval:s:10 { print(@); clear(@); }

# Block IO latency histogram per-pid (us)
tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
    @lat_us[pid] = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]);
}
interval:s:10 { print(@lat_us); clear(@lat_us); }

# TCP connect latency per-pid (us)
kprobe:tcp_v4_connect { @start[tid] = nsecs; }
kretprobe:tcp_v4_connect /@start[tid]/ {
    @connect_us[pid] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}
interval:s:10 { print(@connect_us); clear(@connect_us); }

# Scheduler wakeups per-pid (context switches)
tracepoint:sched:sched_wakeup { @wakeups[pid] = count(); }
interval:s:10 { print(@wakeups); clear(@wakeups); }

# Memory allocation rate per pid (via mmap)
tracepoint:syscalls:sys_enter_mmap { @mmap[pid] = count(); }
interval:s:10 { print(@mmap); clear(@mmap); }

Caveats discovered #

Revised recommendation #

The earlier Part 7 recommendation #6 (“long-running bpftrace child [hours, blocked]”) is no longer blocked. Promote it:

6. (REVISED) [1–2 hours] Long-running privileged bpftrace sidecar container

This changes the effort/value calculation significantly. Kernel-level observability is on the table.


Original (pre-sidecar) notes #

If root + kernel headers were available in the user distro, the following one-liners would complement /proc sampling:

# syscall rate per pid
bpftrace -e 'tracepoint:syscalls:sys_enter_* /pid == 65170/ { @[probe] = count(); }'

# block IO latency histogram
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
             tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
                 @lat = hist(nsecs - @start[args->dev, args->sector]);
                 delete(@start[args->dev, args->sector]);
             }'

# tcp connect latency per pid
bpftrace -e 'kprobe:tcp_v4_connect { @start[tid] = nsecs; }
             kretprobe:tcp_v4_connect /@start[tid]/ {
                 @[comm, pid] = hist(nsecs - @start[tid]);
                 delete(@start[tid]);
             }'

# scheduler wakeups
bpftrace -e 'tracepoint:sched:sched_wakeup { @[comm] = count(); }'

Integration shape if/when available: run bpftrace as a long-running child of reveried, emit summaries to /tmp/reverie-bpf-summary.json every 10s, reveried reads and merges into PidHooks.syscall_latency_p50/p99 on next tick. bpftrace startup cost is ~1s so it can’t be re-launched per tick.
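Because hist() emits power-of-two buckets, the merged p50/p99 values can only ever be bucket-resolution approximations. A sketch of that reduction (pctl is a hypothetical helper; buckets are (upper_bound, count) pairs as parsed from bpftrace's printed map):

```rust
// Return the upper bound of the bucket containing the q-th quantile.
// `buckets` must be sorted ascending by upper bound.
fn pctl(buckets: &[(u64, u64)], q: f64) -> Option<u64> {
    let total: u64 = buckets.iter().map(|&(_, c)| c).sum();
    if total == 0 {
        return None;
    }
    // Index (1-based) of the sample sitting at quantile q.
    let target = (q * total as f64).ceil() as u64;
    let mut seen = 0;
    for &(upper, count) in buckets {
        seen += count;
        if seen >= target {
            return Some(upper);
        }
    }
    None
}

fn main() {
    // e.g. latency buckets in us: <=64: 90 samples, <=128: 8, <=256: 2
    let b = [(64, 90), (128, 8), (256, 2)];
    println!("p50={:?} p99={:?}", pctl(&b, 0.50), pctl(&b, 0.99));
}
```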

Verdict for running bpftrace directly in the user distro: still skip — headers + CAP will probably never be available on this box. The privileged Docker sidecar above is the route that works today.

Part 4 — Benevolent introspection (gdb / ptrace / strace / perf) #

Your own process tree is fair game. These tools don’t run on the hot path — they’re attached manually from a meshctl keybind for forensic snapshots.

| Tool | Use case | Root? | Invocation |
|---|---|---|---|
| gcore &lt;pid&gt; | Snapshot live process to core file | No (same-user) | gcore -o /tmp/reveried.core 65170 |
| gdb -p &lt;pid&gt; | Attach, inspect symbols, detach | No (same-user, YAMA-permitting) | batch mode: gdb -batch -ex 'p event_manager.in_flight' -ex 'detach' -p 65170 |
| strace -p &lt;pid&gt; -c | Sample syscall histogram for N seconds | No | strace -p 65170 -c -S calls --summary-wall-clock --absolute-timestamps -- sleep 10 |
| perf record -p &lt;pid&gt; | CPU flamegraph, on-demand | Often yes (perf_event_paranoid) | perf record -F 99 -p 65170 -g --call-graph dwarf -- sleep 10 |
| /proc/&lt;pid&gt;/maps | Memory map visualization | No | Parse + render per-pid bar chart |
| /proc/&lt;pid&gt;/smaps | Per-VMA RSS/PSS/USS breakdown | No | Expensive but precise |
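A meshctl keybind handler would shell out with the exact flags from the table. A sketch of building the strace sampling invocation (build_strace_argv is a hypothetical helper; spawn the result with std::process::Command):

```rust
// Build the argv for the table's strace sampling row: attach to `pid`,
// count syscalls (-c, sorted by call count), and use `sleep` as the
// sampling-window timer.
fn build_strace_argv(pid: i32, seconds: u32) -> Vec<String> {
    vec![
        "strace".into(), "-p".into(), pid.to_string(),
        "-c".into(), "-S".into(), "calls".into(),
        "--summary-wall-clock".into(), "--absolute-timestamps".into(),
        "--".into(), "sleep".into(), seconds.to_string(),
    ]
}

fn main() {
    // e.g. Command::new(&argv[0]).args(&argv[1..]).status()
    println!("{}", build_strace_argv(65170, 10).join(" "));
}
```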

Integration into env-map + TUI:

Safety invariant: all of these are opt-in and manual. Don’t invoke them from the 2s tick — they’re expensive and intrusive.

Part 5 — Rust ecosystem #

| Crate | Version | Purpose | Verdict |
|---|---|---|---|
| procfs | 0.16 | Pure-Rust /proc parser | Pick this. Zero deps, thorough, well-maintained. |
| libbpf-rs | 0.24 | Rust wrapper for libbpf | Requires kernel headers + BTF + CAP. Not viable in-process on WSL2 (the Part 3 sidecar sidesteps this). |
| aya | 0.13 | Pure-Rust eBPF framework | More ergonomic than libbpf-rs. Same in-process WSL2 blockers. |
| perf-event | 0.4 | perf_event_open() wrapper | Needs CAP_SYS_ADMIN or paranoid=-1. Marginal. |
| nix / rustix | latest | ptrace/waitpid wrappers | For manual gdb-like introspection. Pick rustix (lighter). |

Part 6 — Proposed integration #

Extend EnvSnapshot (Part H) with:

#[derive(Clone, Serialize)]
pub struct KernelHooks {
    pub sampled_at: i64,             // unix seconds
    pub per_pid: HashMap<i32, PidHooks>,
    pub global: GlobalHooks,
}

#[derive(Clone, Serialize)]
pub struct PidHooks {
    pub pid: i32,
    pub cmd: String,
    pub state: char,
    pub rss_bytes: u64,
    pub utime_ms: u64,
    pub stime_ms: u64,
    pub num_threads: u32,
    pub vol_ctxt_sw: u64,
    pub nonvol_ctxt_sw: u64,
    pub io_read_bytes: u64,
    pub io_write_bytes: u64,
    pub io_syscr: u64,
    pub io_syscw: u64,
    pub wchan: Option<String>,
    pub fd_count: usize,
    pub runtime_ns: u64,
    pub wait_sum_ns: u64,
}

#[derive(Clone, Serialize)]
pub struct GlobalHooks {
    pub loadavg_1: f64,
    pub loadavg_5: f64,
    pub loadavg_15: f64,
    pub mem_total_kb: u64,
    pub mem_available_kb: u64,
    pub swap_used_kb: u64,
    pub cpu_count: usize,
    pub boot_time_unix: i64,
}
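Filling GlobalHooks is one read each of /proc/loadavg and /proc/meminfo. For example, the three load averages (std-only sketch; parse_loadavg is a hypothetical helper):

```rust
/// Parse the first three fields of /proc/loadavg, which looks like
/// "0.52 0.58 0.59 1/467 12345" (1/5/15 min averages, runnable/total, last pid).
fn parse_loadavg(raw: &str) -> Option<(f64, f64, f64)> {
    let mut it = raw.split_whitespace();
    Some((
        it.next()?.parse().ok()?,
        it.next()?.parse().ok()?,
        it.next()?.parse().ok()?,
    ))
}

fn main() {
    // Smoke test against the live file (Linux only).
    if let Ok(raw) = std::fs::read_to_string("/proc/loadavg") {
        println!("{:?}", parse_loadavg(&raw));
    }
}
```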

Sampling trigger: the env_ticker (Part H) on each 2s tick calls sample_kernel_hooks(&[65170, <each claude session pid>, <redis>, <memcached>, <ollama>]) and populates env.kernel_hooks.

TUI pane: new view ViewMode::Kernel bound to k in the meshctl status hotkey map. Renders a table:

┌ kernel · per-pid sample ────────────────────────────────────────┐
│ PID     CMD          STATE  RSS      CPU%  CTXSW   IO    WCHAN  │
│ 65170   reveried     S      124 MiB  3.1%  142/12  0/0   futex  │
│ 21209   reveried     R      88 MiB   0.0%  0/0     0/0   -      │
│ 58793   claude       S      512 MiB  8.4%  89/3    2k/0  poll   │
│ ...                                                             │
└─────────────────────────────────────────────────────────────────┘
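Note the CPU% column is derived, not sampled: utime_ms/stime_ms are counters, so the pane rates the delta between consecutive ticks. A sketch of that derivation (cpu_percent is a hypothetical helper; assumes the 2 s tick and millisecond counters from PidHooks):

```rust
// CPU% over one tick: delta of (utime_ms + stime_ms) divided by the tick
// length. saturating_sub guards against counter resets (pid reuse).
fn cpu_percent(prev_ms: u64, curr_ms: u64, tick_ms: u64) -> f64 {
    if tick_ms == 0 {
        return 0.0;
    }
    100.0 * curr_ms.saturating_sub(prev_ms) as f64 / tick_ms as f64
}

fn main() {
    // 62 ms of CPU burned over a 2000 ms tick → the 3.1% shown in the pane.
    println!("{:.1}%", cpu_percent(10_000, 10_062, 2_000));
}
```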

Part 7 — Ranked recommendations #

  1. [15 min] Add procfs dep to reveried + sample_kernel_hooks() function. No root, no headers. Produces PidHooks for each passed-in PID. Plugs directly into the Part H env-map.
  2. [15 min] meshctl ViewMode::Kernel pane. New hotkey k, ratatui Table rendering the kernel_hooks map. Colors: RSS growth yellow > 50%, red > 2× baseline; ctx-switch delta outliers flagged.
  3. [5 min] Per-thread breakdown. Extend PidHooks with tasks: HashMap<u32, TaskHooks>. Call procfs::process::Process::tasks() per pid. Unlocks “which tokio worker is starved”.
  4. [45 min] Prometheus exporter. Add kernel_rss_bytes{pid, cmd} etc. to reveried /metrics. Grafana dashboard + long-term history + alerting.
  5. [15 min] Benevolent introspection keybinds. F flamegraph, S strace sample, G gcore snapshot. All opt-in, all invoked from the focused peer row in the peers table.
  6. [1–2 hours] Long-running privileged bpftrace sidecar container. Previously blocked on root + kernel headers in the user distro; the Part 3 Docker sidecar finding removes that blocker.

WSL2 gotchas #


Control-room lane · research only · informs Part H env-map design.