diff --git a/docs/plans/2026-06-15-session-resilience-design.md b/docs/plans/2026-06-15-session-resilience-design.md new file mode 100644 index 0000000..013c3e5 --- /dev/null +++ b/docs/plans/2026-06-15-session-resilience-design.md @@ -0,0 +1,233 @@ +# Session Resilience Epic — Design + +**Date:** 2026-06-15 +**Branch:** `feat/session-resilience` +**Source:** `stillpending.md` §2 (intentional v1-deferred items), scoped into a real feature design. +**Status:** Design approved; implementation plan to follow. + +## Goal + +Lift four deliberately-deferred v1 limitations into supported features, built on +one shared foundation: + +1. **Multi-event-subscriber fan-out** (§2: plumbed but validator-blocked). +2. **Reconnectable sessions** (§2: 1:1 session↔connection today). +3. **Per-session dashboard ACL** (§2 / EventsHub `TODO(per-session-acl)`). +4. **Orphan-worker reattach on gateway restart** (§2 — **overturns a hard + CLAUDE.md rule**, see "Documented-rule changes"). + +These are not peers: fan-out is the keystone, reconnect and reattach reuse its +machinery, and three of the four need a new session-ownership concept. + +## Documented-rule changes (explicit, owner-approved) + +This epic deliberately reverses three documented v1 decisions. Each reversal is a +required deliverable in the same change as the code: + +- **CLAUDE.md:77** "Gateway restart does not reattach orphan workers… do not + design code paths that assume reattachment." → reattach becomes supported, + bounded, and opt-in. +- **`docs/DesignDecisions.md:63-73`** "no reconnectable sessions for v1." → + reconnect becomes supported within a bounded detach-grace window. +- **`docs/DesignDecisions.md:75-80`** single event subscriber per session. → + multi-subscriber fan-out becomes supported, capped. + +The owner explicitly accepted overturning the reattach rule during design. + +## Current-state seams (verified by recon, with citations) + +- `GatewaySession.AttachEventSubscriber(bool allowMultipleSubscribers)` + (`Sessions/GatewaySession.cs:386-408`) guards on a single int + `_activeEventSubscriberCount` (`:16`) under `_syncRoot`; a second subscriber + throws `EventSubscriberAlreadyActive` (`:398`). +- `GatewayOptionsValidator.cs:181-185` hard-rejects + `AllowMultipleEventSubscribers` ("not supported until event fan-out is + implemented"); option bound at `SessionOptions.cs:26-29`. +- `EventStreamService.StreamEventsAsync` (`Grpc/EventStreamService.cs:27-101`) + creates **a new bounded `Channel` per RPC call** (`:43-50`) and + `ProduceEventsAsync` drains `session.ReadEventsAsync()` directly — a + **destructive, single-consumer drain**. Two RPCs would fight over one queue. +- Backpressure: `ProduceEventsAsync` uses non-blocking `TryWrite`; on overflow + with `EventBackpressurePolicy.FailFast` (default, `EventOptions`) it calls + `session.MarkFaulted` (`EventStreamService.cs:143-162`) — faulting the **whole + session**, not just the slow consumer. +- `DashboardEventBroadcaster.Publish` (`Dashboard/Hubs/DashboardEventBroadcaster.cs:13-44`) + is called **inside** the per-RPC producer loop (`EventStreamService.cs:131-141`) + — so the dashboard only mirrors events while a gRPC client is streaming. Latent + bug: no gRPC subscriber ⇒ dashboard feed is dark. +- Pipe name `mxaccess-gateway-{Environment.ProcessId}-{sessionId}` + (`SessionManager.cs:433`); session id `session-{Guid:N}` (`:479`), in-memory + `SessionRegistry` only (`SessionRegistry.cs:12`), **not persisted**. +- `OrphanWorkerTerminator` (`Workers/OrphanWorkerTerminator.cs:49-112`) discovers + orphans by executable name/path (x64 gateway cannot introspect the x86 worker + module → image-name fallback) and **terminates** them; rationale comment at + `:9-16`. +- Pipe fault → `WorkerClient` read loop detects `EndOfStream`, session → + `Faulted` (`WorkerClient.cs:376-381`); no reattach. Worker launch passes the + per-session nonce via `MXGATEWAY_WORKER_NONCE` env var + (`WorkerProcessLauncher.cs:180-182`). +- Sessions store `ClientIdentity` (informational only, `GatewaySession.cs:114`); + **no `OwnerKeyId`, no per-session ACL.** gRPC `StreamEvents` enforces per-item + read constraints but **no session-level access gate** — any caller who knows a + session id can stream it. +- `EventsHub.SubscribeSession(string)` (`Dashboard/Hubs/EventsHub.cs:46-54`) joins + group `session:{id}`; only hub-level `[Authorize(HubClientsPolicy)]` gates it, + so **any** Admin/Viewer can subscribe to **any** session. `TODO(per-session-acl)` + at `:39-43`. `SnapshotHub`/`AlarmsHub` broadcast to all. Hub bearer + (`HubTokenService`, 30-min) carries name + roles only, **no session scope**. +- `StreamEventsRequest.AfterWorkerSequence` already exists on the wire (the + reconnect replay contract is half-built). + +## Shared foundation + +### A. `SessionEventDistributor` (one pump, N per-subscriber channels) + +Per `GatewaySession`, replace the per-RPC direct drain with a single owned +distributor: + +- One background **pump task** drains `ReadEventsAsync()` exactly once. +- Each event is (1) stamped with its worker sequence, (2) appended to a **bounded + replay ring buffer** (retain last `ReplayBufferCapacity` events or + `ReplayRetentionSeconds`, whichever first), and (3) `TryWrite`-fanned to every + registered subscriber's own bounded `Channel`. +- **Per-subscriber backpressure isolation:** overflow completes only that + subscriber's channel (policy `DisconnectSubscriber`); the session and peers are + untouched. `FailFast`→`MarkFaulted` is retained only for the legacy + single-subscriber config path, for backward compatibility. +- **Constraint filtering stays per-subscriber:** the pump fans *raw* events; each + subscriber's read loop applies its own API-key read subtree/glob filter exactly + as today. No change to constraint semantics. +- `AttachEventSubscriber` returns a lease carrying that subscriber's channel + reader + its start sequence (for replay). `EventStreamService` reads the lease + channel instead of creating its own channel and draining the session. + +### B. Session ownership + +Record an authoritative **`OwnerKeyId`** (the creating API key id) on the session +at `OpenSession`, alongside the existing informational `ClientIdentity`. This one +field underpins ACL, reconnect re-validation, and reattach adoption. + +## Feature designs + +### 1. Multi-subscriber fan-out + +- Remove the `GatewayOptionsValidator.cs:181-185` rejection; keep the option but + allow `true`. +- `_activeEventSubscriberCount` → a subscriber-lease collection on the + distributor. New cap `MaxEventSubscribersPerSession` (default 8) → reject the + N+1 attach with `EventSubscriberLimitReached`. +- Dashboard broadcaster registers as a distributor subscriber (removing the inline + tap), fixing the dashboard-dark-without-gRPC bug. +- **No proto change.** + +### 2. Reconnectable sessions + +- On stream drop, a session in **detach-grace** mode is retained (not closed) for + `DetachGraceSeconds` (separate from the session lease). New session + disconnect-policy value `DetachGrace`. +- On reconnect: client calls `StreamEvents` with the same session id + + `AfterWorkerSequence = lastSeen`. The distributor replays ring-buffer events + with `sequence > AfterWorkerSequence`, then resumes live. +- If the requested sequence is older than the ring's oldest retained event (gone + too long / ring overflowed), the server signals **`ReplayGap`** so the client + re-snapshots. **Contract addition** (a `ReplayGap` status / response marker) → + codegen ripple across all 5 clients. +- Reconnect re-validates caller `OwnerKeyId` == session owner → else + `PermissionDenied`. + +### 3. Per-session ACL + +- **gRPC (real security win, no proto change):** `Invoke` / `StreamEvents` / + `CloseSession` gated to the owning API key, OR a key holding a new + all-sessions admin scope → else `PermissionDenied`. Enforced in + `MxAccessGatewayService` against session `OwnerKeyId`. +- **Dashboard:** identity-domain mismatch (LDAP Admin/Viewer users vs API-key + sessions) means no natural owner link. + - **Decision required (flagged, not hard-coded):** default proposal — + **Admin** sees all sessions; **Viewer** scoped via config + `Dashboard:GroupToSessionTag` matched against an optional session `Tag`. + Enforced at `EventsHub.SubscribeSession` and in the `/hubs/token` mint + (token gains an allowed-session-tag claim). The owner may instead choose a + strict default (Viewers see nothing unless granted). + +### 4. Orphan-worker reattach + +- **Stable pipe naming:** drop `{gatewayPid}`; use a persisted stable + gateway-instance id. Replaces the pid's collision-avoidance role. +- **Adoption manifest:** persist a minimal record per live session + (`sessionId → workerPid, nonce, ownerKeyId, pipeName`) in the existing SQLite + store. This is the *only* persisted session state; COM/advise state stays in + the worker. +- **Worker phones home:** the worker runs a reconnect loop with bounded backoff; + the restarted gateway re-opens pipe servers for manifest entries and the + surviving worker re-attaches, presenting its **nonce**. Gateway validates the + nonce against the manifest and **rejects impostors / foreign workers**. +- **Resync, not replay:** the in-memory ring buffer is lost on restart, so a + reattached session's subscribers get `ReplayGap` and re-snapshot. Gateway + resyncs worker view via the now-implemented `GetSessionState` / `GetWorkerInfo` + commands. +- **Safety net retained:** workers self-terminate after `MaxOrphanLifetime` with + no re-adoption; `OrphanWorkerTerminator` stays as the fallback for un-adoptable + or foreign workers. Reattach is opt-in (`Workers:EnableOrphanReattach`, + default off) so the documented-safe behavior remains the default. +- **Pipe protocol:** add an **adopt/reconnect frame** to `mxaccess_worker.proto` + → worker codegen regen + commit `Generated/` (net48 regen rule applies). + +## Contract / codegen impact + +Unlike the prior epic, this is **not zero-proto**: + +- `mxaccess_gateway.proto` — `ReplayGap` signal for reconnect (Feature 2). +- `mxaccess_worker.proto` — adopt/reconnect frame (Feature 4). + +Per the repo rule: regenerate `Generated/`, commit it, rebuild gateway + worker + +every generated client touched, and update affected docs in the same change. + +## Error handling + +- Per-subscriber overflow → disconnect that subscriber only; session survives. +- Reconnect past the ring horizon → `ReplayGap`, client re-snapshots (no silent + loss). +- Reattach nonce mismatch → reject + fall back to termination. +- ACL denial → `PermissionDenied` (gRPC) / hub subscribe refused (dashboard). +- All worker COM/STA interactions keep MXAccess parity — no synthesized events, + no "fixing" surprising returns. + +## Testing & cross-platform verification + +| Area | Test | Host | +|---|---|---| +| Distributor fan-out, per-sub backpressure, replay ring | unit, `FakeWorkerHarness` | local (macOS) | +| Reconnect replay + `ReplayGap` | unit + fake-worker integration | local | +| Session ownership / gRPC ACL | unit + gateway integration | local | +| Dashboard per-session ACL | LDAP test users (`multi-role`/`gw-viewer`) | local + live LDAP | +| Worker adopt frame, reattach handshake | worker unit (net48/x86) | **windev** | +| Gateway-restart reattach round-trip | integration | **windev** + live worker | +| Client `ReplayGap` handling | per-client tests; Java on macOS JDK 21 | local | + +TDD throughout; per-task commits; `Generated/` regenerated+committed on proto +changes; docs (incl. the three documented-rule reversals) updated in the same +change as source. + +## Delivery order (dependency stack) + +Each phase is independently shippable: + +1. **Foundation** — `SessionEventDistributor` + replay ring + session `OwnerKeyId` + (refactor with no external behavior change; dashboard-dark bug fixed). +2. **Fan-out** — remove validator block, subscriber-lease list, cap, dashboard as + subscriber. +3. **Reconnect** — detach-grace, replay-on-reconnect, `ReplayGap` contract + + client handling. +4. **Per-session ACL** — gRPC owner gate + dashboard scoping. +5. **Reattach** — stable pipe naming, adoption manifest, worker phone-home + + adopt frame, resync, safety net; documented-rule reversals. + +## Out of scope + +- Cross-gateway / clustered session sharing (single gateway instance only). +- Event persistence beyond the in-memory ring (no durable event log). +- Reconnect across a gateway *restart* with zero event gap (restart always yields + `ReplayGap` by design — the ring is in-memory). +- Per-session ACL on `SnapshotHub` / `AlarmsHub` (they broadcast aggregate state; + only `EventsHub` is session-scoped).