docs: session-resilience epic design (fan-out, reconnect, ACL, reattach)

This commit is contained in:
Joseph Doherty
2026-06-15 12:11:46 -04:00
parent 0e4843612b
commit 3fc6ccad30
@@ -0,0 +1,233 @@
# Session Resilience Epic — Design
**Date:** 2026-06-15
**Branch:** `feat/session-resilience`
**Source:** `stillpending.md` §2 (intentional v1-deferred items), scoped into a real feature design.
**Status:** Design approved; implementation plan to follow.
## Goal
Lift four deliberately-deferred v1 limitations into supported features, built on
one shared foundation:
1. **Multi-event-subscriber fan-out** (§2: plumbed but validator-blocked).
2. **Reconnectable sessions** (§2: 1:1 session↔connection today).
3. **Per-session dashboard ACL** (§2 / EventsHub `TODO(per-session-acl)`).
4. **Orphan-worker reattach on gateway restart** (§2 — **overturns a hard
CLAUDE.md rule**, see "Documented-rule changes").
These are not peers: fan-out is the keystone, reconnect and reattach reuse its
machinery, and three of the four need a new session-ownership concept.
## Documented-rule changes (explicit, owner-approved)
This epic deliberately reverses three documented v1 decisions. Each reversal is a
required deliverable in the same change as the code:
- **CLAUDE.md:77** "Gateway restart does not reattach orphan workers… do not
design code paths that assume reattachment." → reattach becomes supported,
bounded, and opt-in.
- **`docs/DesignDecisions.md:63-73`** "no reconnectable sessions for v1." →
reconnect becomes supported within a bounded detach-grace window.
- **`docs/DesignDecisions.md:75-80`** single event subscriber per session. →
multi-subscriber fan-out becomes supported, capped.
The owner explicitly accepted overturning the reattach rule during design.
## Current-state seams (verified by recon, with citations)
- `GatewaySession.AttachEventSubscriber(bool allowMultipleSubscribers)`
(`Sessions/GatewaySession.cs:386-408`) guards on a single int
`_activeEventSubscriberCount` (`:16`) under `_syncRoot`; a second subscriber
throws `EventSubscriberAlreadyActive` (`:398`).
- `GatewayOptionsValidator.cs:181-185` hard-rejects
`AllowMultipleEventSubscribers` ("not supported until event fan-out is
implemented"); option bound at `SessionOptions.cs:26-29`.
- `EventStreamService.StreamEventsAsync` (`Grpc/EventStreamService.cs:27-101`)
creates **a new bounded `Channel<MxEvent>` per RPC call** (`:43-50`) and
`ProduceEventsAsync` drains `session.ReadEventsAsync()` directly — a
**destructive, single-consumer drain**. Two RPCs would fight over one queue.
- Backpressure: `ProduceEventsAsync` uses non-blocking `TryWrite`; on overflow
with `EventBackpressurePolicy.FailFast` (default, `EventOptions`) it calls
`session.MarkFaulted` (`EventStreamService.cs:143-162`) — faulting the **whole
session**, not just the slow consumer.
- `DashboardEventBroadcaster.Publish` (`Dashboard/Hubs/DashboardEventBroadcaster.cs:13-44`)
is called **inside** the per-RPC producer loop (`EventStreamService.cs:131-141`)
— so the dashboard only mirrors events while a gRPC client is streaming. Latent
bug: no gRPC subscriber ⇒ dashboard feed is dark.
- Pipe name `mxaccess-gateway-{Environment.ProcessId}-{sessionId}`
(`SessionManager.cs:433`); session id `session-{Guid:N}` (`:479`), in-memory
`SessionRegistry` only (`SessionRegistry.cs:12`), **not persisted**.
- `OrphanWorkerTerminator` (`Workers/OrphanWorkerTerminator.cs:49-112`) discovers
orphans by executable name/path (x64 gateway cannot introspect the x86 worker
module → image-name fallback) and **terminates** them; rationale comment at
`:9-16`.
- Pipe fault → `WorkerClient` read loop detects `EndOfStream`, session →
`Faulted` (`WorkerClient.cs:376-381`); no reattach. Worker launch passes the
per-session nonce via `MXGATEWAY_WORKER_NONCE` env var
(`WorkerProcessLauncher.cs:180-182`).
- Sessions store `ClientIdentity` (informational only, `GatewaySession.cs:114`);
**no `OwnerKeyId`, no per-session ACL.** gRPC `StreamEvents` enforces per-item
read constraints but **no session-level access gate** — any caller who knows a
session id can stream it.
- `EventsHub.SubscribeSession(string)` (`Dashboard/Hubs/EventsHub.cs:46-54`) joins
group `session:{id}`; only hub-level `[Authorize(HubClientsPolicy)]` gates it,
so **any** Admin/Viewer can subscribe to **any** session. `TODO(per-session-acl)`
at `:39-43`. `SnapshotHub`/`AlarmsHub` broadcast to all. Hub bearer
(`HubTokenService`, 30-min) carries name + roles only, **no session scope**.
- `StreamEventsRequest.AfterWorkerSequence` already exists on the wire (the
reconnect replay contract is half-built).
## Shared foundation
### A. `SessionEventDistributor` (one pump, N per-subscriber channels)
Per `GatewaySession`, replace the per-RPC direct drain with a single owned
distributor:
- One background **pump task** drains `ReadEventsAsync()` exactly once.
- Each event is (1) stamped with its worker sequence, (2) appended to a **bounded
replay ring buffer** (retain last `ReplayBufferCapacity` events or
`ReplayRetentionSeconds`, whichever first), and (3) `TryWrite`-fanned to every
registered subscriber's own bounded `Channel<MxEvent>`.
- **Per-subscriber backpressure isolation:** overflow completes only that
subscriber's channel (policy `DisconnectSubscriber`); the session and peers are
untouched. `FailFast``MarkFaulted` is retained only for the legacy
single-subscriber config path, for backward compatibility.
- **Constraint filtering stays per-subscriber:** the pump fans *raw* events; each
subscriber's read loop applies its own API-key read subtree/glob filter exactly
as today. No change to constraint semantics.
- `AttachEventSubscriber` returns a lease carrying that subscriber's channel
reader + its start sequence (for replay). `EventStreamService` reads the lease
channel instead of creating its own channel and draining the session.
### B. Session ownership
Record an authoritative **`OwnerKeyId`** (the creating API key id) on the session
at `OpenSession`, alongside the existing informational `ClientIdentity`. This one
field underpins ACL, reconnect re-validation, and reattach adoption.
## Feature designs
### 1. Multi-subscriber fan-out
- Remove the `GatewayOptionsValidator.cs:181-185` rejection; keep the option but
allow `true`.
- `_activeEventSubscriberCount` → a subscriber-lease collection on the
distributor. New cap `MaxEventSubscribersPerSession` (default 8) → reject the
N+1 attach with `EventSubscriberLimitReached`.
- Dashboard broadcaster registers as a distributor subscriber (removing the inline
tap), fixing the dashboard-dark-without-gRPC bug.
- **No proto change.**
### 2. Reconnectable sessions
- On stream drop, a session in **detach-grace** mode is retained (not closed) for
`DetachGraceSeconds` (separate from the session lease). New session
disconnect-policy value `DetachGrace`.
- On reconnect: client calls `StreamEvents` with the same session id +
`AfterWorkerSequence = lastSeen`. The distributor replays ring-buffer events
with `sequence > AfterWorkerSequence`, then resumes live.
- If the requested sequence is older than the ring's oldest retained event (gone
too long / ring overflowed), the server signals **`ReplayGap`** so the client
re-snapshots. **Contract addition** (a `ReplayGap` status / response marker) →
codegen ripple across all 5 clients.
- Reconnect re-validates caller `OwnerKeyId` == session owner → else
`PermissionDenied`.
### 3. Per-session ACL
- **gRPC (real security win, no proto change):** `Invoke` / `StreamEvents` /
`CloseSession` gated to the owning API key, OR a key holding a new
all-sessions admin scope → else `PermissionDenied`. Enforced in
`MxAccessGatewayService` against session `OwnerKeyId`.
- **Dashboard:** identity-domain mismatch (LDAP Admin/Viewer users vs API-key
sessions) means no natural owner link.
- **Decision required (flagged, not hard-coded):** default proposal —
**Admin** sees all sessions; **Viewer** scoped via config
`Dashboard:GroupToSessionTag` matched against an optional session `Tag`.
Enforced at `EventsHub.SubscribeSession` and in the `/hubs/token` mint
(token gains an allowed-session-tag claim). The owner may instead choose a
strict default (Viewers see nothing unless granted).
### 4. Orphan-worker reattach
- **Stable pipe naming:** drop `{gatewayPid}`; use a persisted stable
gateway-instance id. Replaces the pid's collision-avoidance role.
- **Adoption manifest:** persist a minimal record per live session
(`sessionId → workerPid, nonce, ownerKeyId, pipeName`) in the existing SQLite
store. This is the *only* persisted session state; COM/advise state stays in
the worker.
- **Worker phones home:** the worker runs a reconnect loop with bounded backoff;
the restarted gateway re-opens pipe servers for manifest entries and the
surviving worker re-attaches, presenting its **nonce**. Gateway validates the
nonce against the manifest and **rejects impostors / foreign workers**.
- **Resync, not replay:** the in-memory ring buffer is lost on restart, so a
reattached session's subscribers get `ReplayGap` and re-snapshot. Gateway
resyncs worker view via the now-implemented `GetSessionState` / `GetWorkerInfo`
commands.
- **Safety net retained:** workers self-terminate after `MaxOrphanLifetime` with
no re-adoption; `OrphanWorkerTerminator` stays as the fallback for un-adoptable
or foreign workers. Reattach is opt-in (`Workers:EnableOrphanReattach`,
default off) so the documented-safe behavior remains the default.
- **Pipe protocol:** add an **adopt/reconnect frame** to `mxaccess_worker.proto`
→ worker codegen regen + commit `Generated/` (net48 regen rule applies).
## Contract / codegen impact
Unlike the prior epic, this is **not zero-proto**:
- `mxaccess_gateway.proto``ReplayGap` signal for reconnect (Feature 2).
- `mxaccess_worker.proto` — adopt/reconnect frame (Feature 4).
Per the repo rule: regenerate `Generated/`, commit it, rebuild gateway + worker +
every generated client touched, and update affected docs in the same change.
## Error handling
- Per-subscriber overflow → disconnect that subscriber only; session survives.
- Reconnect past the ring horizon → `ReplayGap`, client re-snapshots (no silent
loss).
- Reattach nonce mismatch → reject + fall back to termination.
- ACL denial → `PermissionDenied` (gRPC) / hub subscribe refused (dashboard).
- All worker COM/STA interactions keep MXAccess parity — no synthesized events,
no "fixing" surprising returns.
## Testing & cross-platform verification
| Area | Test | Host |
|---|---|---|
| Distributor fan-out, per-sub backpressure, replay ring | unit, `FakeWorkerHarness` | local (macOS) |
| Reconnect replay + `ReplayGap` | unit + fake-worker integration | local |
| Session ownership / gRPC ACL | unit + gateway integration | local |
| Dashboard per-session ACL | LDAP test users (`multi-role`/`gw-viewer`) | local + live LDAP |
| Worker adopt frame, reattach handshake | worker unit (net48/x86) | **windev** |
| Gateway-restart reattach round-trip | integration | **windev** + live worker |
| Client `ReplayGap` handling | per-client tests; Java on macOS JDK 21 | local |
TDD throughout; per-task commits; `Generated/` regenerated+committed on proto
changes; docs (incl. the three documented-rule reversals) updated in the same
change as source.
## Delivery order (dependency stack)
Each phase is independently shippable:
1. **Foundation**`SessionEventDistributor` + replay ring + session `OwnerKeyId`
(refactor with no external behavior change; dashboard-dark bug fixed).
2. **Fan-out** — remove validator block, subscriber-lease list, cap, dashboard as
subscriber.
3. **Reconnect** — detach-grace, replay-on-reconnect, `ReplayGap` contract +
client handling.
4. **Per-session ACL** — gRPC owner gate + dashboard scoping.
5. **Reattach** — stable pipe naming, adoption manifest, worker phone-home +
adopt frame, resync, safety net; documented-rule reversals.
## Out of scope
- Cross-gateway / clustered session sharing (single gateway instance only).
- Event persistence beyond the in-memory ring (no durable event log).
- Reconnect across a gateway *restart* with zero event gap (restart always yields
`ReplayGap` by design — the ring is in-memory).
- Per-session ACL on `SnapshotHub` / `AlarmsHub` (they broadcast aggregate state;
only `EventsHub` is session-scoped).