12 KiB
Session Resilience Epic — Design
Date: 2026-06-15
Branch: feat/session-resilience
Source: stillpending.md §2 (intentional v1-deferred items), scoped into a real feature design.
Status: Design approved; implementation plan to follow.
Goal
Lift four deliberately-deferred v1 limitations into supported features, built on one shared foundation:
- Multi-event-subscriber fan-out (§2: plumbed but validator-blocked).
- Reconnectable sessions (§2: 1:1 session↔connection today).
- Per-session dashboard ACL (§2 / EventsHub
TODO(per-session-acl)). - Orphan-worker reattach on gateway restart (§2 — overturns a hard CLAUDE.md rule, see "Documented-rule changes").
These are not peers: fan-out is the keystone, reconnect and reattach reuse its machinery, and three of the four need a new session-ownership concept.
Documented-rule changes (explicit, owner-approved)
This epic deliberately reverses three documented v1 decisions. Each reversal is a required deliverable in the same change as the code:
- CLAUDE.md:77 "Gateway restart does not reattach orphan workers… do not design code paths that assume reattachment." → reattach becomes supported, bounded, and opt-in.
docs/DesignDecisions.md:63-73"no reconnectable sessions for v1." → reconnect becomes supported within a bounded detach-grace window.docs/DesignDecisions.md:75-80single event subscriber per session. → multi-subscriber fan-out becomes supported, capped.
The owner explicitly accepted overturning the reattach rule during design.
Current-state seams (verified by recon, with citations)
GatewaySession.AttachEventSubscriber(bool allowMultipleSubscribers)(Sessions/GatewaySession.cs:386-408) guards on a single int_activeEventSubscriberCount(:16) under_syncRoot; a second subscriber throwsEventSubscriberAlreadyActive(:398).GatewayOptionsValidator.cs:181-185hard-rejectsAllowMultipleEventSubscribers("not supported until event fan-out is implemented"); option bound atSessionOptions.cs:26-29.EventStreamService.StreamEventsAsync(Grpc/EventStreamService.cs:27-101) creates a new boundedChannel<MxEvent>per RPC call (:43-50) andProduceEventsAsyncdrainssession.ReadEventsAsync()directly — a destructive, single-consumer drain. Two RPCs would fight over one queue.- Backpressure:
ProduceEventsAsyncuses non-blockingTryWrite; on overflow withEventBackpressurePolicy.FailFast(default,EventOptions) it callssession.MarkFaulted(EventStreamService.cs:143-162) — faulting the whole session, not just the slow consumer. DashboardEventBroadcaster.Publish(Dashboard/Hubs/DashboardEventBroadcaster.cs:13-44) is called inside the per-RPC producer loop (EventStreamService.cs:131-141) — so the dashboard only mirrors events while a gRPC client is streaming. Latent bug: no gRPC subscriber ⇒ dashboard feed is dark.- Pipe name
mxaccess-gateway-{Environment.ProcessId}-{sessionId}(SessionManager.cs:433); session idsession-{Guid:N}(:479), in-memorySessionRegistryonly (SessionRegistry.cs:12), not persisted. OrphanWorkerTerminator(Workers/OrphanWorkerTerminator.cs:49-112) discovers orphans by executable name/path (x64 gateway cannot introspect the x86 worker module → image-name fallback) and terminates them; rationale comment at:9-16.- Pipe fault →
WorkerClientread loop detectsEndOfStream, session →Faulted(WorkerClient.cs:376-381); no reattach. Worker launch passes the per-session nonce viaMXGATEWAY_WORKER_NONCEenv var (WorkerProcessLauncher.cs:180-182). - Sessions store
ClientIdentity(informational only,GatewaySession.cs:114); noOwnerKeyId, no per-session ACL. gRPCStreamEventsenforces per-item read constraints but no session-level access gate — any caller who knows a session id can stream it. EventsHub.SubscribeSession(string)(Dashboard/Hubs/EventsHub.cs:46-54) joins groupsession:{id}; only hub-level[Authorize(HubClientsPolicy)]gates it, so any Admin/Viewer can subscribe to any session.TODO(per-session-acl)at:39-43.SnapshotHub/AlarmsHubbroadcast to all. Hub bearer (HubTokenService, 30-min) carries name + roles only, no session scope.StreamEventsRequest.AfterWorkerSequencealready exists on the wire (the reconnect replay contract is half-built).
Shared foundation
A. SessionEventDistributor (one pump, N per-subscriber channels)
Per GatewaySession, replace the per-RPC direct drain with a single owned
distributor:
- One background pump task drains
ReadEventsAsync()exactly once. - Each event is (1) stamped with its worker sequence, (2) appended to a bounded
replay ring buffer (retain last
ReplayBufferCapacityevents orReplayRetentionSeconds, whichever first), and (3)TryWrite-fanned to every registered subscriber's own boundedChannel<MxEvent>. - Per-subscriber backpressure isolation: overflow completes only that
subscriber's channel (policy
DisconnectSubscriber); the session and peers are untouched.FailFast→MarkFaultedis retained only for the legacy single-subscriber config path, for backward compatibility. - Constraint filtering stays per-subscriber: the pump fans raw events; each subscriber's read loop applies its own API-key read subtree/glob filter exactly as today. No change to constraint semantics.
AttachEventSubscriberreturns a lease carrying that subscriber's channel reader + its start sequence (for replay).EventStreamServicereads the lease channel instead of creating its own channel and draining the session.
B. Session ownership
Record an authoritative OwnerKeyId (the creating API key id) on the session
at OpenSession, alongside the existing informational ClientIdentity. This one
field underpins ACL, reconnect re-validation, and reattach adoption.
Feature designs
1. Multi-subscriber fan-out
- Remove the
GatewayOptionsValidator.cs:181-185rejection; keep the option but allowtrue. _activeEventSubscriberCount→ a subscriber-lease collection on the distributor. New capMaxEventSubscribersPerSession(default 8) → reject the N+1 attach withEventSubscriberLimitReached.- Dashboard broadcaster registers as a distributor subscriber (removing the inline tap), fixing the dashboard-dark-without-gRPC bug.
- No proto change.
2. Reconnectable sessions
- On stream drop, a session in detach-grace mode is retained (not closed) for
DetachGraceSeconds(separate from the session lease). New session disconnect-policy valueDetachGrace. - On reconnect: client calls
StreamEventswith the same session id +AfterWorkerSequence = lastSeen. The distributor replays ring-buffer events withsequence > AfterWorkerSequence, then resumes live. - If the requested sequence is older than the ring's oldest retained event (gone
too long / ring overflowed), the server signals
ReplayGapso the client re-snapshots. Contract addition (aReplayGapstatus / response marker) → codegen ripple across all 5 clients. - Reconnect re-validates caller
OwnerKeyId== session owner → elsePermissionDenied.
3. Per-session ACL
- gRPC (real security win, no proto change):
Invoke/StreamEvents/CloseSessiongated to the owning API key, OR a key holding a new all-sessions admin scope → elsePermissionDenied. Enforced inMxAccessGatewayServiceagainst sessionOwnerKeyId. - Dashboard: identity-domain mismatch (LDAP Admin/Viewer users vs API-key
sessions) means no natural owner link.
- Decision required (flagged, not hard-coded): default proposal —
Admin sees all sessions; Viewer scoped via config
Dashboard:GroupToSessionTagmatched against an optional sessionTag. Enforced atEventsHub.SubscribeSessionand in the/hubs/tokenmint (token gains an allowed-session-tag claim). The owner may instead choose a strict default (Viewers see nothing unless granted).
- Decision required (flagged, not hard-coded): default proposal —
Admin sees all sessions; Viewer scoped via config
4. Orphan-worker reattach
- Stable pipe naming: drop
{gatewayPid}; use a persisted stable gateway-instance id. Replaces the pid's collision-avoidance role. - Adoption manifest: persist a minimal record per live session
(
sessionId → workerPid, nonce, ownerKeyId, pipeName) in the existing SQLite store. This is the only persisted session state; COM/advise state stays in the worker. - Worker phones home: the worker runs a reconnect loop with bounded backoff; the restarted gateway re-opens pipe servers for manifest entries and the surviving worker re-attaches, presenting its nonce. Gateway validates the nonce against the manifest and rejects impostors / foreign workers.
- Resync, not replay: the in-memory ring buffer is lost on restart, so a
reattached session's subscribers get
ReplayGapand re-snapshot. Gateway resyncs worker view via the now-implementedGetSessionState/GetWorkerInfocommands. - Safety net retained: workers self-terminate after
MaxOrphanLifetimewith no re-adoption;OrphanWorkerTerminatorstays as the fallback for un-adoptable or foreign workers. Reattach is opt-in (Workers:EnableOrphanReattach, default off) so the documented-safe behavior remains the default. - Pipe protocol: add an adopt/reconnect frame to
mxaccess_worker.proto→ worker codegen regen + commitGenerated/(net48 regen rule applies).
Contract / codegen impact
Unlike the prior epic, this is not zero-proto:
mxaccess_gateway.proto—ReplayGapsignal for reconnect (Feature 2).mxaccess_worker.proto— adopt/reconnect frame (Feature 4).
Per the repo rule: regenerate Generated/, commit it, rebuild gateway + worker +
every generated client touched, and update affected docs in the same change.
Error handling
- Per-subscriber overflow → disconnect that subscriber only; session survives.
- Reconnect past the ring horizon →
ReplayGap, client re-snapshots (no silent loss). - Reattach nonce mismatch → reject + fall back to termination.
- ACL denial →
PermissionDenied(gRPC) / hub subscribe refused (dashboard). - All worker COM/STA interactions keep MXAccess parity — no synthesized events, no "fixing" surprising returns.
Testing & cross-platform verification
| Area | Test | Host |
|---|---|---|
| Distributor fan-out, per-sub backpressure, replay ring | unit, FakeWorkerHarness |
local (macOS) |
Reconnect replay + ReplayGap |
unit + fake-worker integration | local |
| Session ownership / gRPC ACL | unit + gateway integration | local |
| Dashboard per-session ACL | LDAP test users (multi-role/gw-viewer) |
local + live LDAP |
| Worker adopt frame, reattach handshake | worker unit (net48/x86) | windev |
| Gateway-restart reattach round-trip | integration | windev + live worker |
Client ReplayGap handling |
per-client tests; Java on macOS JDK 21 | local |
TDD throughout; per-task commits; Generated/ regenerated+committed on proto
changes; docs (incl. the three documented-rule reversals) updated in the same
change as source.
Delivery order (dependency stack)
Each phase is independently shippable:
- Foundation —
SessionEventDistributor+ replay ring + sessionOwnerKeyId(refactor with no external behavior change; dashboard-dark bug fixed). - Fan-out — remove validator block, subscriber-lease list, cap, dashboard as subscriber.
- Reconnect — detach-grace, replay-on-reconnect,
ReplayGapcontract + client handling. - Per-session ACL — gRPC owner gate + dashboard scoping.
- Reattach — stable pipe naming, adoption manifest, worker phone-home + adopt frame, resync, safety net; documented-rule reversals.
Out of scope
- Cross-gateway / clustered session sharing (single gateway instance only).
- Event persistence beyond the in-memory ring (no durable event log).
- Reconnect across a gateway restart with zero event gap (restart always yields
ReplayGapby design — the ring is in-memory). - Per-session ACL on
SnapshotHub/AlarmsHub(they broadcast aggregate state; onlyEventsHubis session-scoped).