Files
mxaccessgw/docs/plans/2026-06-15-session-resilience-design.md
T

12 KiB

Session Resilience Epic — Design

Date: 2026-06-15 Branch: feat/session-resilience Source: stillpending.md §2 (intentional v1-deferred items), scoped into a real feature design. Status: Design approved; implementation plan to follow.

Goal

Lift four deliberately-deferred v1 limitations into supported features, built on one shared foundation:

  1. Multi-event-subscriber fan-out (§2: plumbed but validator-blocked).
  2. Reconnectable sessions (§2: 1:1 session↔connection today).
  3. Per-session dashboard ACL (§2 / EventsHub TODO(per-session-acl)).
  4. Orphan-worker reattach on gateway restart (§2 — overturns a hard CLAUDE.md rule, see "Documented-rule changes").

These are not peers: fan-out is the keystone, reconnect and reattach reuse its machinery, and three of the four need a new session-ownership concept.

Documented-rule changes (explicit, owner-approved)

This epic deliberately reverses three documented v1 decisions. Each reversal is a required deliverable in the same change as the code:

  • CLAUDE.md:77 "Gateway restart does not reattach orphan workers… do not design code paths that assume reattachment." → reattach becomes supported, bounded, and opt-in.
  • docs/DesignDecisions.md:63-73 "no reconnectable sessions for v1." → reconnect becomes supported within a bounded detach-grace window.
  • docs/DesignDecisions.md:75-80 single event subscriber per session. → multi-subscriber fan-out becomes supported, capped.

The owner explicitly accepted overturning the reattach rule during design.

Current-state seams (verified by recon, with citations)

  • GatewaySession.AttachEventSubscriber(bool allowMultipleSubscribers) (Sessions/GatewaySession.cs:386-408) guards on a single int _activeEventSubscriberCount (:16) under _syncRoot; a second subscriber throws EventSubscriberAlreadyActive (:398).
  • GatewayOptionsValidator.cs:181-185 hard-rejects AllowMultipleEventSubscribers ("not supported until event fan-out is implemented"); option bound at SessionOptions.cs:26-29.
  • EventStreamService.StreamEventsAsync (Grpc/EventStreamService.cs:27-101) creates a new bounded Channel<MxEvent> per RPC call (:43-50) and ProduceEventsAsync drains session.ReadEventsAsync() directly — a destructive, single-consumer drain. Two RPCs would fight over one queue.
  • Backpressure: ProduceEventsAsync uses non-blocking TryWrite; on overflow with EventBackpressurePolicy.FailFast (default, EventOptions) it calls session.MarkFaulted (EventStreamService.cs:143-162) — faulting the whole session, not just the slow consumer.
  • DashboardEventBroadcaster.Publish (Dashboard/Hubs/DashboardEventBroadcaster.cs:13-44) is called inside the per-RPC producer loop (EventStreamService.cs:131-141) — so the dashboard only mirrors events while a gRPC client is streaming. Latent bug: no gRPC subscriber ⇒ dashboard feed is dark.
  • Pipe name mxaccess-gateway-{Environment.ProcessId}-{sessionId} (SessionManager.cs:433); session id session-{Guid:N} (:479), in-memory SessionRegistry only (SessionRegistry.cs:12), not persisted.
  • OrphanWorkerTerminator (Workers/OrphanWorkerTerminator.cs:49-112) discovers orphans by executable name/path (x64 gateway cannot introspect the x86 worker module → image-name fallback) and terminates them; rationale comment at :9-16.
  • Pipe fault → WorkerClient read loop detects EndOfStream, session → Faulted (WorkerClient.cs:376-381); no reattach. Worker launch passes the per-session nonce via MXGATEWAY_WORKER_NONCE env var (WorkerProcessLauncher.cs:180-182).
  • Sessions store ClientIdentity (informational only, GatewaySession.cs:114); no OwnerKeyId, no per-session ACL. gRPC StreamEvents enforces per-item read constraints but no session-level access gate — any caller who knows a session id can stream it.
  • EventsHub.SubscribeSession(string) (Dashboard/Hubs/EventsHub.cs:46-54) joins group session:{id}; only hub-level [Authorize(HubClientsPolicy)] gates it, so any Admin/Viewer can subscribe to any session. TODO(per-session-acl) at :39-43. SnapshotHub/AlarmsHub broadcast to all. Hub bearer (HubTokenService, 30-min) carries name + roles only, no session scope.
  • StreamEventsRequest.AfterWorkerSequence already exists on the wire (the reconnect replay contract is half-built).

Shared foundation

A. SessionEventDistributor (one pump, N per-subscriber channels)

Per GatewaySession, replace the per-RPC direct drain with a single owned distributor:

  • One background pump task drains ReadEventsAsync() exactly once.
  • Each event is (1) stamped with its worker sequence, (2) appended to a bounded replay ring buffer (retain last ReplayBufferCapacity events or ReplayRetentionSeconds, whichever first), and (3) TryWrite-fanned to every registered subscriber's own bounded Channel<MxEvent>.
  • Per-subscriber backpressure isolation: overflow completes only that subscriber's channel (policy DisconnectSubscriber); the session and peers are untouched. FailFastMarkFaulted is retained only for the legacy single-subscriber config path, for backward compatibility.
  • Constraint filtering stays per-subscriber: the pump fans raw events; each subscriber's read loop applies its own API-key read subtree/glob filter exactly as today. No change to constraint semantics.
  • AttachEventSubscriber returns a lease carrying that subscriber's channel reader + its start sequence (for replay). EventStreamService reads the lease channel instead of creating its own channel and draining the session.

B. Session ownership

Record an authoritative OwnerKeyId (the creating API key id) on the session at OpenSession, alongside the existing informational ClientIdentity. This one field underpins ACL, reconnect re-validation, and reattach adoption.

Feature designs

1. Multi-subscriber fan-out

  • Remove the GatewayOptionsValidator.cs:181-185 rejection; keep the option but allow true.
  • _activeEventSubscriberCount → a subscriber-lease collection on the distributor. New cap MaxEventSubscribersPerSession (default 8) → reject the N+1 attach with EventSubscriberLimitReached.
  • Dashboard broadcaster registers as a distributor subscriber (removing the inline tap), fixing the dashboard-dark-without-gRPC bug.
  • No proto change.

2. Reconnectable sessions

  • On stream drop, a session in detach-grace mode is retained (not closed) for DetachGraceSeconds (separate from the session lease). New session disconnect-policy value DetachGrace.
  • On reconnect: client calls StreamEvents with the same session id + AfterWorkerSequence = lastSeen. The distributor replays ring-buffer events with sequence > AfterWorkerSequence, then resumes live.
  • If the requested sequence is older than the ring's oldest retained event (gone too long / ring overflowed), the server signals ReplayGap so the client re-snapshots. Contract addition (a ReplayGap status / response marker) → codegen ripple across all 5 clients.
  • Reconnect re-validates caller OwnerKeyId == session owner → else PermissionDenied.

3. Per-session ACL

  • gRPC (real security win, no proto change): Invoke / StreamEvents / CloseSession gated to the owning API key, OR a key holding a new all-sessions admin scope → else PermissionDenied. Enforced in MxAccessGatewayService against session OwnerKeyId.
  • Dashboard: identity-domain mismatch (LDAP Admin/Viewer users vs API-key sessions) means no natural owner link.
    • Decision required (flagged, not hard-coded): default proposal — Admin sees all sessions; Viewer scoped via config Dashboard:GroupToSessionTag matched against an optional session Tag. Enforced at EventsHub.SubscribeSession and in the /hubs/token mint (token gains an allowed-session-tag claim). The owner may instead choose a strict default (Viewers see nothing unless granted).

4. Orphan-worker reattach

  • Stable pipe naming: drop {gatewayPid}; use a persisted stable gateway-instance id. Replaces the pid's collision-avoidance role.
  • Adoption manifest: persist a minimal record per live session (sessionId → workerPid, nonce, ownerKeyId, pipeName) in the existing SQLite store. This is the only persisted session state; COM/advise state stays in the worker.
  • Worker phones home: the worker runs a reconnect loop with bounded backoff; the restarted gateway re-opens pipe servers for manifest entries and the surviving worker re-attaches, presenting its nonce. Gateway validates the nonce against the manifest and rejects impostors / foreign workers.
  • Resync, not replay: the in-memory ring buffer is lost on restart, so a reattached session's subscribers get ReplayGap and re-snapshot. Gateway resyncs worker view via the now-implemented GetSessionState / GetWorkerInfo commands.
  • Safety net retained: workers self-terminate after MaxOrphanLifetime with no re-adoption; OrphanWorkerTerminator stays as the fallback for un-adoptable or foreign workers. Reattach is opt-in (Workers:EnableOrphanReattach, default off) so the documented-safe behavior remains the default.
  • Pipe protocol: add an adopt/reconnect frame to mxaccess_worker.proto → worker codegen regen + commit Generated/ (net48 regen rule applies).

Contract / codegen impact

Unlike the prior epic, this is not zero-proto:

  • mxaccess_gateway.protoReplayGap signal for reconnect (Feature 2).
  • mxaccess_worker.proto — adopt/reconnect frame (Feature 4).

Per the repo rule: regenerate Generated/, commit it, rebuild gateway + worker + every generated client touched, and update affected docs in the same change.

Error handling

  • Per-subscriber overflow → disconnect that subscriber only; session survives.
  • Reconnect past the ring horizon → ReplayGap, client re-snapshots (no silent loss).
  • Reattach nonce mismatch → reject + fall back to termination.
  • ACL denial → PermissionDenied (gRPC) / hub subscribe refused (dashboard).
  • All worker COM/STA interactions keep MXAccess parity — no synthesized events, no "fixing" surprising returns.

Testing & cross-platform verification

Area Test Host
Distributor fan-out, per-sub backpressure, replay ring unit, FakeWorkerHarness local (macOS)
Reconnect replay + ReplayGap unit + fake-worker integration local
Session ownership / gRPC ACL unit + gateway integration local
Dashboard per-session ACL LDAP test users (multi-role/gw-viewer) local + live LDAP
Worker adopt frame, reattach handshake worker unit (net48/x86) windev
Gateway-restart reattach round-trip integration windev + live worker
Client ReplayGap handling per-client tests; Java on macOS JDK 21 local

TDD throughout; per-task commits; Generated/ regenerated+committed on proto changes; docs (incl. the three documented-rule reversals) updated in the same change as source.

Delivery order (dependency stack)

Each phase is independently shippable:

  1. FoundationSessionEventDistributor + replay ring + session OwnerKeyId (refactor with no external behavior change; dashboard-dark bug fixed).
  2. Fan-out — remove validator block, subscriber-lease list, cap, dashboard as subscriber.
  3. Reconnect — detach-grace, replay-on-reconnect, ReplayGap contract + client handling.
  4. Per-session ACL — gRPC owner gate + dashboard scoping.
  5. Reattach — stable pipe naming, adoption manifest, worker phone-home + adopt frame, resync, safety net; documented-rule reversals.

Out of scope

  • Cross-gateway / clustered session sharing (single gateway instance only).
  • Event persistence beyond the in-memory ring (no durable event log).
  • Reconnect across a gateway restart with zero event gap (restart always yields ReplayGap by design — the ring is in-memory).
  • Per-session ACL on SnapshotHub / AlarmsHub (they broadcast aggregate state; only EventsHub is session-scoped).