# Session Resilience Epic — Design **Date:** 2026-06-15 **Branch:** `feat/session-resilience` **Source:** `stillpending.md` §2 (intentional v1-deferred items), scoped into a real feature design. **Status:** Design approved; implementation plan to follow. ## Goal Lift four deliberately-deferred v1 limitations into supported features, built on one shared foundation: 1. **Multi-event-subscriber fan-out** (§2: plumbed but validator-blocked). 2. **Reconnectable sessions** (§2: 1:1 session↔connection today). 3. **Per-session dashboard ACL** (§2 / EventsHub `TODO(per-session-acl)`). 4. **Orphan-worker reattach on gateway restart** (§2 — **overturns a hard CLAUDE.md rule**, see "Documented-rule changes"). These are not peers: fan-out is the keystone, reconnect and reattach reuse its machinery, and three of the four need a new session-ownership concept. ## Documented-rule changes (explicit, owner-approved) This epic deliberately reverses three documented v1 decisions. Each reversal is a required deliverable in the same change as the code: - **CLAUDE.md:77** "Gateway restart does not reattach orphan workers… do not design code paths that assume reattachment." → reattach becomes supported, bounded, and opt-in. - **`docs/DesignDecisions.md:63-73`** "no reconnectable sessions for v1." → reconnect becomes supported within a bounded detach-grace window. - **`docs/DesignDecisions.md:75-80`** single event subscriber per session. → multi-subscriber fan-out becomes supported, capped. The owner explicitly accepted overturning the reattach rule during design. ## Current-state seams (verified by recon, with citations) - `GatewaySession.AttachEventSubscriber(bool allowMultipleSubscribers)` (`Sessions/GatewaySession.cs:386-408`) guards on a single int `_activeEventSubscriberCount` (`:16`) under `_syncRoot`; a second subscriber throws `EventSubscriberAlreadyActive` (`:398`). - `GatewayOptionsValidator.cs:181-185` hard-rejects `AllowMultipleEventSubscribers` ("not supported until event fan-out is implemented"); option bound at `SessionOptions.cs:26-29`. - `EventStreamService.StreamEventsAsync` (`Grpc/EventStreamService.cs:27-101`) creates **a new bounded `Channel` per RPC call** (`:43-50`) and `ProduceEventsAsync` drains `session.ReadEventsAsync()` directly — a **destructive, single-consumer drain**. Two RPCs would fight over one queue. - Backpressure: `ProduceEventsAsync` uses non-blocking `TryWrite`; on overflow with `EventBackpressurePolicy.FailFast` (default, `EventOptions`) it calls `session.MarkFaulted` (`EventStreamService.cs:143-162`) — faulting the **whole session**, not just the slow consumer. - `DashboardEventBroadcaster.Publish` (`Dashboard/Hubs/DashboardEventBroadcaster.cs:13-44`) is called **inside** the per-RPC producer loop (`EventStreamService.cs:131-141`) — so the dashboard only mirrors events while a gRPC client is streaming. Latent bug: no gRPC subscriber ⇒ dashboard feed is dark. - Pipe name `mxaccess-gateway-{Environment.ProcessId}-{sessionId}` (`SessionManager.cs:433`); session id `session-{Guid:N}` (`:479`), in-memory `SessionRegistry` only (`SessionRegistry.cs:12`), **not persisted**. - `OrphanWorkerTerminator` (`Workers/OrphanWorkerTerminator.cs:49-112`) discovers orphans by executable name/path (x64 gateway cannot introspect the x86 worker module → image-name fallback) and **terminates** them; rationale comment at `:9-16`. - Pipe fault → `WorkerClient` read loop detects `EndOfStream`, session → `Faulted` (`WorkerClient.cs:376-381`); no reattach. Worker launch passes the per-session nonce via `MXGATEWAY_WORKER_NONCE` env var (`WorkerProcessLauncher.cs:180-182`). - Sessions store `ClientIdentity` (informational only, `GatewaySession.cs:114`); **no `OwnerKeyId`, no per-session ACL.** gRPC `StreamEvents` enforces per-item read constraints but **no session-level access gate** — any caller who knows a session id can stream it. - `EventsHub.SubscribeSession(string)` (`Dashboard/Hubs/EventsHub.cs:46-54`) joins group `session:{id}`; only hub-level `[Authorize(HubClientsPolicy)]` gates it, so **any** Admin/Viewer can subscribe to **any** session. `TODO(per-session-acl)` at `:39-43`. `SnapshotHub`/`AlarmsHub` broadcast to all. Hub bearer (`HubTokenService`, 30-min) carries name + roles only, **no session scope**. - `StreamEventsRequest.AfterWorkerSequence` already exists on the wire (the reconnect replay contract is half-built). ## Shared foundation ### A. `SessionEventDistributor` (one pump, N per-subscriber channels) Per `GatewaySession`, replace the per-RPC direct drain with a single owned distributor: - One background **pump task** drains `ReadEventsAsync()` exactly once. - Each event is (1) stamped with its worker sequence, (2) appended to a **bounded replay ring buffer** (retain last `ReplayBufferCapacity` events or `ReplayRetentionSeconds`, whichever first), and (3) `TryWrite`-fanned to every registered subscriber's own bounded `Channel`. - **Per-subscriber backpressure isolation:** overflow completes only that subscriber's channel (policy `DisconnectSubscriber`); the session and peers are untouched. `FailFast`→`MarkFaulted` is retained only for the legacy single-subscriber config path, for backward compatibility. - **Constraint filtering stays per-subscriber:** the pump fans *raw* events; each subscriber's read loop applies its own API-key read subtree/glob filter exactly as today. No change to constraint semantics. - `AttachEventSubscriber` returns a lease carrying that subscriber's channel reader + its start sequence (for replay). `EventStreamService` reads the lease channel instead of creating its own channel and draining the session. ### B. Session ownership Record an authoritative **`OwnerKeyId`** (the creating API key id) on the session at `OpenSession`, alongside the existing informational `ClientIdentity`. This one field underpins ACL, reconnect re-validation, and reattach adoption. ## Feature designs ### 1. Multi-subscriber fan-out - Remove the `GatewayOptionsValidator.cs:181-185` rejection; keep the option but allow `true`. - `_activeEventSubscriberCount` → a subscriber-lease collection on the distributor. New cap `MaxEventSubscribersPerSession` (default 8) → reject the N+1 attach with `EventSubscriberLimitReached`. - Dashboard broadcaster registers as a distributor subscriber (removing the inline tap), fixing the dashboard-dark-without-gRPC bug. - **No proto change.** ### 2. Reconnectable sessions - On stream drop, a session in **detach-grace** mode is retained (not closed) for `DetachGraceSeconds` (separate from the session lease). New session disconnect-policy value `DetachGrace`. - On reconnect: client calls `StreamEvents` with the same session id + `AfterWorkerSequence = lastSeen`. The distributor replays ring-buffer events with `sequence > AfterWorkerSequence`, then resumes live. - If the requested sequence is older than the ring's oldest retained event (gone too long / ring overflowed), the server signals **`ReplayGap`** so the client re-snapshots. **Contract addition** (a `ReplayGap` status / response marker) → codegen ripple across all 5 clients. - Reconnect re-validates caller `OwnerKeyId` == session owner → else `PermissionDenied`. ### 3. Per-session ACL - **gRPC (real security win, no proto change):** `Invoke` / `StreamEvents` / `CloseSession` gated to the owning API key, OR a key holding a new all-sessions admin scope → else `PermissionDenied`. Enforced in `MxAccessGatewayService` against session `OwnerKeyId`. - **Dashboard:** identity-domain mismatch (LDAP Admin/Viewer users vs API-key sessions) means no natural owner link. - **Decision required (flagged, not hard-coded):** default proposal — **Admin** sees all sessions; **Viewer** scoped via config `Dashboard:GroupToSessionTag` matched against an optional session `Tag`. Enforced at `EventsHub.SubscribeSession` and in the `/hubs/token` mint (token gains an allowed-session-tag claim). The owner may instead choose a strict default (Viewers see nothing unless granted). ### 4. Orphan-worker reattach - **Stable pipe naming:** drop `{gatewayPid}`; use a persisted stable gateway-instance id. Replaces the pid's collision-avoidance role. - **Adoption manifest:** persist a minimal record per live session (`sessionId → workerPid, nonce, ownerKeyId, pipeName`) in the existing SQLite store. This is the *only* persisted session state; COM/advise state stays in the worker. - **Worker phones home:** the worker runs a reconnect loop with bounded backoff; the restarted gateway re-opens pipe servers for manifest entries and the surviving worker re-attaches, presenting its **nonce**. Gateway validates the nonce against the manifest and **rejects impostors / foreign workers**. - **Resync, not replay:** the in-memory ring buffer is lost on restart, so a reattached session's subscribers get `ReplayGap` and re-snapshot. Gateway resyncs worker view via the now-implemented `GetSessionState` / `GetWorkerInfo` commands. - **Safety net retained:** workers self-terminate after `MaxOrphanLifetime` with no re-adoption; `OrphanWorkerTerminator` stays as the fallback for un-adoptable or foreign workers. Reattach is opt-in (`Workers:EnableOrphanReattach`, default off) so the documented-safe behavior remains the default. - **Pipe protocol:** add an **adopt/reconnect frame** to `mxaccess_worker.proto` → worker codegen regen + commit `Generated/` (net48 regen rule applies). ## Contract / codegen impact Unlike the prior epic, this is **not zero-proto**: - `mxaccess_gateway.proto` — `ReplayGap` signal for reconnect (Feature 2). - `mxaccess_worker.proto` — adopt/reconnect frame (Feature 4). Per the repo rule: regenerate `Generated/`, commit it, rebuild gateway + worker + every generated client touched, and update affected docs in the same change. ## Error handling - Per-subscriber overflow → disconnect that subscriber only; session survives. - Reconnect past the ring horizon → `ReplayGap`, client re-snapshots (no silent loss). - Reattach nonce mismatch → reject + fall back to termination. - ACL denial → `PermissionDenied` (gRPC) / hub subscribe refused (dashboard). - All worker COM/STA interactions keep MXAccess parity — no synthesized events, no "fixing" surprising returns. ## Testing & cross-platform verification | Area | Test | Host | |---|---|---| | Distributor fan-out, per-sub backpressure, replay ring | unit, `FakeWorkerHarness` | local (macOS) | | Reconnect replay + `ReplayGap` | unit + fake-worker integration | local | | Session ownership / gRPC ACL | unit + gateway integration | local | | Dashboard per-session ACL | LDAP test users (`multi-role`/`gw-viewer`) | local + live LDAP | | Worker adopt frame, reattach handshake | worker unit (net48/x86) | **windev** | | Gateway-restart reattach round-trip | integration | **windev** + live worker | | Client `ReplayGap` handling | per-client tests; Java on macOS JDK 21 | local | TDD throughout; per-task commits; `Generated/` regenerated+committed on proto changes; docs (incl. the three documented-rule reversals) updated in the same change as source. ## Delivery order (dependency stack) Each phase is independently shippable: 1. **Foundation** — `SessionEventDistributor` + replay ring + session `OwnerKeyId` (refactor with no external behavior change; dashboard-dark bug fixed). 2. **Fan-out** — remove validator block, subscriber-lease list, cap, dashboard as subscriber. 3. **Reconnect** — detach-grace, replay-on-reconnect, `ReplayGap` contract + client handling. 4. **Per-session ACL** — gRPC owner gate + dashboard scoping. 5. **Reattach** — stable pipe naming, adoption manifest, worker phone-home + adopt frame, resync, safety net; documented-rule reversals. ## Out of scope - Cross-gateway / clustered session sharing (single gateway instance only). - Event persistence beyond the in-memory ring (no durable event log). - Reconnect across a gateway *restart* with zero event gap (restart always yields `ReplayGap` by design — the ring is in-memory). - Per-session ACL on `SnapshotHub` / `AlarmsHub` (they broadcast aggregate state; only `EventsHub` is session-scoped).