App tracing enforcement — audit + plan #
Status: research (2026-04-08, control-room lane)
Source: background research subagent + control-room synthesis
Cross-refs: docs/research/kernel-tracing.md, docs/research/hotswap-listener-design.md
TL;DR #
Reveried has a well-configured OTLP → Tempo exporter foundation but it’s critically under-used. About 193 public functions across the workspace have only ~20 tracing spans between them. HTTP middleware covers all 21 axum handlers at a coarse grain via the tower layer at crates/reverie-store/src/http/metrics.rs:244–280, but handler bodies, database ops, event publishes, and child tokio tasks are all invisible.
Quick win: 5.5 hours of work → 60% coverage improvement across the top-five gaps. No new deps, no architectural changes.
Part 1 — Current coverage (actual state) #
Tracing initialization — solid foundation #
crates/reveried/src/main.rs:126–219 sets up:
tracing_subscriber::fmt()+ JSON format (LOG_FORMAT=json) or pretty formattracing_opentelemetry::layer()bolted on top- OTLP/HTTP exporter pointed at
otel_endpoint(default Tempo:4318/v1/traces) - Resource attributes:
service.name=reveried,service.version,deployment.environment - Simple exporter (not batch) because
init_tracing()runs before the tokio runtime starts — batch exporter requires an async runtime and can’t be bolted on pre-main
Verdict: the foundation is right. Spans that get opened will reach Tempo. The problem is that not enough spans are being opened.
HTTP middleware — the one thing that works #
crates/reverie-store/src/http/metrics.rs:244–280 defines a tower middleware that opens an http_request span per request with fields method, route, status. Applied at the router root so every handler is covered at that level.
Problem: the span only wraps the outermost request boundary. Handler bodies (lock contention, store calls, context building, MCP tool dispatch) are inside that span as bare code with no child spans. Tempo shows http_request · POST /mem_save → nothing, 120 ms, done. No visibility into where the 120 ms went.
Per-crate coverage #
Approximate #[instrument] / tracing::span! count by crate:
| Crate | Pub fns | Spans | Coverage |
|---|---|---|---|
reveried | ~30 | ~12 (dream cycle + phase spans manually opened) | ~40% |
reverie-store | ~75 | ~5 (HTTP middleware + a few event/metrics sites) | ~7% |
reverie-dream | ~35 | ~5 (cycle runner top-level spans) | ~14% |
reverie-gate | ~15 | 0 | 0% |
reverie-sync | ~20 | 0 | 0% |
reverie-bench | ~10 | 0 | 0% |
meshctl | ~8 | 0 | 0% |
Total: ~193 pub fns, ~22 spans. ~11% coverage.
Dream cycle — structured but shallow #
crates/reverie-dream/src/runner.rs:103–150 manually opens cycle_span at the top and per-phase spans for each of the 6 phases (scan, replay, interleave, reconsolidate, downscale, promote). Good structured approach. Problem: the implementations of each phase have zero internal spans, so Tempo shows “scan phase, 2.3s” with no breakdown.
Part 2 — Six major gaps #
-
Missing
#[instrument]on 21 HTTP handlers. The tower middleware wraps the outside; handler bodies are dark. Lock contention, store queries, context builder phases — all invisible. Impact: debugging a slow/searchrequires reading source code and adding print statements. -
Missing span context on background tokio tasks.
tokio::spawndoes not inherit the current tracing span by default. Two sites that violate this:metrics::spawn_coord_sessions_exporterinreverie-store/src/http/mod.rs:259— the 10s coord session scanner spawns orphaned spans.- Any
tokio::spawninside dream phase implementations (none today, but plural planned). Fix: wrap with.in_current_span()or useSpan::current().in_scope(|| tokio::spawn(...)).
-
Missing trace ID in structured log lines.
tracing-subscriber+tracing-opentelemetryshould injecttrace_idandspan_idinto every JSON log line automatically via the OTEL layer. Unverified — agent couldn’t confirm in the current setup. Needs a smoke test: log from inside a span, grep the output fortrace_id=. -
Missing W3C Trace Context on outbound HTTP. Reveried makes outbound calls to Prom (
:9090), ollama (:11434), anthropic API, openrouter. None of them propagate the current trace via thetraceparentheader. Result: every outbound call starts a fresh trace ID, breaking the end-to-end distributed picture. Fix: wrap reqwest clients withreqwest-tracing::TracingMiddlewareor manually inject viatracing_opentelemetry::OpenTelemetrySpanExt::context(). -
Missing spans on
EventManager::publishpath. The in-proc event bus is the hot path for cross-subsystem work. ~200 events/min at steady state per design target. Zero spans = zero visibility into what events are firing and which subscribers are slow. Should be at DEBUG level (low volume in traces but queryable). -
Missing cross-session distributed tracing. Session A writes via
/mem_save. Session B reads via/context/smart. Both participate in the same logical “user asked a question” trace, but today they’re completely disconnected. Coord protocol messages should carry atraceparentfield. Receiver extracts and opens a span linked back to the sender’s trace ID viaSpanContext::new_with_remote.
Part 3 — Enforcement mechanisms #
How to make tracing coverage a ratchet instead of a slope:
Static lint (3 hours, medium risk) #
A tools/check-tracing.sh script that greps every pub fn / pub async fn in the workspace and flags those without one of:
#[instrument]attribute#[tracing::instrument]attribute- A
tracing::span!(...)call in the body - A comment
// tracing::noopto explicitly opt out
Runs in CI. Fails the build if coverage drops. Allowlist for files that don’t need instrumentation (tests, build scripts, examples).
Runtime counter (1 hour, low risk) #
Add a custom tracing_subscriber::filter::Filter that counts un-instrumented operations (functions that took > 10 ms but didn’t open a span). Emit as reveried_uninstrumented_operations_total prometheus counter. Grafana panel for trend. Not enforcement per se but a visible gauge.
Proc-macro wrapper (4 hours, medium risk) #
A #[reverie::traced] attribute macro that expands to #[instrument(skip(big_args), level = "debug")] with sensible defaults. Encourages adoption by reducing boilerplate.
CI gate (3 hours, medium risk) #
.github/workflows/tracing-coverage.yml that runs the static lint + fails the PR if a public function regresses to un-instrumented.
Part 4 — Span taxonomy #
Standardize on level-per-scope so Tempo queries stay sensible:
| Level | Scope | Examples |
|---|---|---|
ERROR | Fatal | daemon startup error, migration failure |
WARN | Degraded | Prom unreachable, Tempo unreachable, fallback to engram_compat backend |
INFO | Request boundary | HTTP request, dream cycle start/end, coord message send/recv, event publish |
DEBUG | Sub-request | store query boundary, sqlite tx begin/commit, context builder phase |
TRACE | Hot path | tight inner loops — don’t enforce, sample opt-in |
Default subscriber filter: INFO,reveried=DEBUG,reverie_store=DEBUG,reverie_dream=DEBUG.
Part 5 — Cross-session distributed trace via coord #
Propose: every coord message carries a traceparent field built from the current span’s W3C Trace Context representation.
Sender #
use tracing_opentelemetry::OpenTelemetrySpanExt;
let span = tracing::info_span!("coord_send", to = %peer, kind = %kind);
let _guard = span.enter();
let cx = span.context(); // opentelemetry::Context
let mut carrier = HashMap::new();
opentelemetry::global::get_text_map_propagator(|prop| {
prop.inject_context(&cx, &mut carrier);
});
// carrier now contains "traceparent" and "tracestate" keys
msg.traceparent = carrier.remove("traceparent");
msg.tracestate = carrier.remove("tracestate");
coord_send(msg);
Receiver #
let msg = coord_recv()?;
let mut carrier = HashMap::new();
if let Some(tp) = &msg.traceparent {
carrier.insert("traceparent".into(), tp.clone());
}
if let Some(ts) = &msg.tracestate {
carrier.insert("tracestate".into(), ts.clone());
}
let parent_cx = opentelemetry::global::get_text_map_propagator(|prop| {
prop.extract(&carrier)
});
let span = tracing::info_span!("coord_recv", kind = %msg.kind);
span.set_parent(parent_cx);
let _guard = span.enter();
// handle msg
Result: a single Tempo trace spans Session A’s /mem_save → Session A’s coord send → Session B’s coord recv → Session B’s dream consolidation → Session B’s engram write. The entire “user asked, memory formed” flow is one trace.
Part 6 — Env-map integration #
Expose active trace IDs in the env-map snapshot so meshctl TUI can show “what’s reveried doing right now?”. Sample the last 10 trace IDs + span names + duration + status.
Two sources:
- Tempo HTTP API at
:3200/api/search?tags=service.name=reveried&limit=10. Tempo supports tag-based search. Pro: already exists. Con: depends on Tempo being up, adds a network hop. - Local tracing bus via a custom
tracing_subscriber::Layerthat pushes completed spans into a bounded ring buffer reveried exposes as JSON at/traces/recent. Pro: no external dependency, real-time. Con: more code.
Recommendation: ship the Tempo HTTP query first (zero new code), fall back to the local layer if Tempo is unreachable.
Part 7 — Ranked recommendations #
| # | Action | Impact | Effort | Risk |
|---|---|---|---|---|
| 1 | Add #[instrument] to 21 HTTP handlers | HIGH | 2h | Low |
| 2 | Wrap tokio::spawn sites with .in_current_span() | HIGH | 30m | Low |
| 3 | Add spans to dream phase implementations | MEDIUM | 4h | Low |
| 4 | W3C Trace Context on outbound HTTP (reqwest middleware) | MEDIUM | 2h | Low |
| 5 | Coord message traceparent field + extract on recv | MEDIUM | 4h | Low |
| 6 | Static lint in CI (tools/check-tracing.sh) | MEDIUM | 3h | Medium |
| 7 | Emit per-event span on EventManager::publish | MEDIUM | 1h | Low |
| 8 | Env-map /traces/recent endpoint | LOW | 2h | Low |
Quick win (1 + 2 + 7): 3.5 hours → coverage jumps from ~11% to ~55% at the hot paths that matter most. Full pass (1–8): ~18 hours → comprehensive coverage, CI-enforced, cross-session distributed traces, visible in env-map.
Open questions #
- Is Tempo scrape actually working end-to-end today? The tower middleware opens spans, but the user flow is: span → tracing-opentelemetry bridge → OTLP/HTTP exporter → Tempo ingester → queryable via Grafana Explore. Any link broken = no traces land. Needs a smoke test:
curl :7437/health, then query Tempo for the span. - Batch vs simple exporter: current setup uses the simple exporter (no async runtime at init time). Batch exporter is better for throughput. Could be initialized post-startup once tokio is running, at the cost of losing the first few seconds of spans.
- Sampling strategy: currently head-based sampling via subscriber filter. Tempo supports tail-based sampling (keep traces with errors, drop boring ones) via a processor. Worth enabling once we have volume.
Control-room lane · research + audit · informs Part C.7 metrics + env-map integration.