docs: session-resilience epic design (fan-out, reconnect, ACL, reattach)
This commit is contained in:
@@ -0,0 +1,233 @@
|
||||
# Session Resilience Epic — Design
|
||||
|
||||
**Date:** 2026-06-15
|
||||
**Branch:** `feat/session-resilience`
|
||||
**Source:** `stillpending.md` §2 (intentional v1-deferred items), scoped into a real feature design.
|
||||
**Status:** Design approved; implementation plan to follow.
|
||||
|
||||
## Goal
|
||||
|
||||
Lift four deliberately-deferred v1 limitations into supported features, built on
|
||||
one shared foundation:
|
||||
|
||||
1. **Multi-event-subscriber fan-out** (§2: plumbed but validator-blocked).
|
||||
2. **Reconnectable sessions** (§2: 1:1 session↔connection today).
|
||||
3. **Per-session dashboard ACL** (§2 / EventsHub `TODO(per-session-acl)`).
|
||||
4. **Orphan-worker reattach on gateway restart** (§2 — **overturns a hard
|
||||
CLAUDE.md rule**, see "Documented-rule changes").
|
||||
|
||||
These are not peers: fan-out is the keystone, reconnect and reattach reuse its
|
||||
machinery, and three of the four need a new session-ownership concept.
|
||||
|
||||
## Documented-rule changes (explicit, owner-approved)
|
||||
|
||||
This epic deliberately reverses three documented v1 decisions. Each reversal is a
|
||||
required deliverable in the same change as the code:
|
||||
|
||||
- **CLAUDE.md:77** "Gateway restart does not reattach orphan workers… do not
|
||||
design code paths that assume reattachment." → reattach becomes supported,
|
||||
bounded, and opt-in.
|
||||
- **`docs/DesignDecisions.md:63-73`** "no reconnectable sessions for v1." →
|
||||
reconnect becomes supported within a bounded detach-grace window.
|
||||
- **`docs/DesignDecisions.md:75-80`** single event subscriber per session. →
|
||||
multi-subscriber fan-out becomes supported, capped.
|
||||
|
||||
The owner explicitly accepted overturning the reattach rule during design.
|
||||
|
||||
## Current-state seams (verified by recon, with citations)
|
||||
|
||||
- `GatewaySession.AttachEventSubscriber(bool allowMultipleSubscribers)`
|
||||
(`Sessions/GatewaySession.cs:386-408`) guards on a single int
|
||||
`_activeEventSubscriberCount` (`:16`) under `_syncRoot`; a second subscriber
|
||||
throws `EventSubscriberAlreadyActive` (`:398`).
|
||||
- `GatewayOptionsValidator.cs:181-185` hard-rejects
|
||||
`AllowMultipleEventSubscribers` ("not supported until event fan-out is
|
||||
implemented"); option bound at `SessionOptions.cs:26-29`.
|
||||
- `EventStreamService.StreamEventsAsync` (`Grpc/EventStreamService.cs:27-101`)
|
||||
creates **a new bounded `Channel<MxEvent>` per RPC call** (`:43-50`) and
|
||||
`ProduceEventsAsync` drains `session.ReadEventsAsync()` directly — a
|
||||
**destructive, single-consumer drain**. Two RPCs would fight over one queue.
|
||||
- Backpressure: `ProduceEventsAsync` uses non-blocking `TryWrite`; on overflow
|
||||
with `EventBackpressurePolicy.FailFast` (default, `EventOptions`) it calls
|
||||
`session.MarkFaulted` (`EventStreamService.cs:143-162`) — faulting the **whole
|
||||
session**, not just the slow consumer.
|
||||
- `DashboardEventBroadcaster.Publish` (`Dashboard/Hubs/DashboardEventBroadcaster.cs:13-44`)
|
||||
is called **inside** the per-RPC producer loop (`EventStreamService.cs:131-141`)
|
||||
— so the dashboard only mirrors events while a gRPC client is streaming. Latent
|
||||
bug: no gRPC subscriber ⇒ dashboard feed is dark.
|
||||
- Pipe name `mxaccess-gateway-{Environment.ProcessId}-{sessionId}`
|
||||
(`SessionManager.cs:433`); session id `session-{Guid:N}` (`:479`), in-memory
|
||||
`SessionRegistry` only (`SessionRegistry.cs:12`), **not persisted**.
|
||||
- `OrphanWorkerTerminator` (`Workers/OrphanWorkerTerminator.cs:49-112`) discovers
|
||||
orphans by executable name/path (x64 gateway cannot introspect the x86 worker
|
||||
module → image-name fallback) and **terminates** them; rationale comment at
|
||||
`:9-16`.
|
||||
- Pipe fault → `WorkerClient` read loop detects `EndOfStream`, session →
|
||||
`Faulted` (`WorkerClient.cs:376-381`); no reattach. Worker launch passes the
|
||||
per-session nonce via `MXGATEWAY_WORKER_NONCE` env var
|
||||
(`WorkerProcessLauncher.cs:180-182`).
|
||||
- Sessions store `ClientIdentity` (informational only, `GatewaySession.cs:114`);
|
||||
**no `OwnerKeyId`, no per-session ACL.** gRPC `StreamEvents` enforces per-item
|
||||
read constraints but **no session-level access gate** — any caller who knows a
|
||||
session id can stream it.
|
||||
- `EventsHub.SubscribeSession(string)` (`Dashboard/Hubs/EventsHub.cs:46-54`) joins
|
||||
group `session:{id}`; only hub-level `[Authorize(HubClientsPolicy)]` gates it,
|
||||
so **any** Admin/Viewer can subscribe to **any** session. `TODO(per-session-acl)`
|
||||
at `:39-43`. `SnapshotHub`/`AlarmsHub` broadcast to all. Hub bearer
|
||||
(`HubTokenService`, 30-min) carries name + roles only, **no session scope**.
|
||||
- `StreamEventsRequest.AfterWorkerSequence` already exists on the wire (the
|
||||
reconnect replay contract is half-built).
|
||||
|
||||
## Shared foundation
|
||||
|
||||
### A. `SessionEventDistributor` (one pump, N per-subscriber channels)
|
||||
|
||||
Per `GatewaySession`, replace the per-RPC direct drain with a single owned
|
||||
distributor:
|
||||
|
||||
- One background **pump task** drains `ReadEventsAsync()` exactly once.
|
||||
- Each event is (1) stamped with its worker sequence, (2) appended to a **bounded
|
||||
replay ring buffer** (retain last `ReplayBufferCapacity` events or
|
||||
`ReplayRetentionSeconds`, whichever first), and (3) `TryWrite`-fanned to every
|
||||
registered subscriber's own bounded `Channel<MxEvent>`.
|
||||
- **Per-subscriber backpressure isolation:** overflow completes only that
|
||||
subscriber's channel (policy `DisconnectSubscriber`); the session and peers are
|
||||
untouched. `FailFast`→`MarkFaulted` is retained only for the legacy
|
||||
single-subscriber config path, for backward compatibility.
|
||||
- **Constraint filtering stays per-subscriber:** the pump fans *raw* events; each
|
||||
subscriber's read loop applies its own API-key read subtree/glob filter exactly
|
||||
as today. No change to constraint semantics.
|
||||
- `AttachEventSubscriber` returns a lease carrying that subscriber's channel
|
||||
reader + its start sequence (for replay). `EventStreamService` reads the lease
|
||||
channel instead of creating its own channel and draining the session.
|
||||
|
||||
### B. Session ownership
|
||||
|
||||
Record an authoritative **`OwnerKeyId`** (the creating API key id) on the session
|
||||
at `OpenSession`, alongside the existing informational `ClientIdentity`. This one
|
||||
field underpins ACL, reconnect re-validation, and reattach adoption.
|
||||
|
||||
## Feature designs
|
||||
|
||||
### 1. Multi-subscriber fan-out
|
||||
|
||||
- Remove the `GatewayOptionsValidator.cs:181-185` rejection; keep the option but
|
||||
allow `true`.
|
||||
- `_activeEventSubscriberCount` → a subscriber-lease collection on the
|
||||
distributor. New cap `MaxEventSubscribersPerSession` (default 8) → reject the
|
||||
N+1 attach with `EventSubscriberLimitReached`.
|
||||
- Dashboard broadcaster registers as a distributor subscriber (removing the inline
|
||||
tap), fixing the dashboard-dark-without-gRPC bug.
|
||||
- **No proto change.**
|
||||
|
||||
### 2. Reconnectable sessions
|
||||
|
||||
- On stream drop, a session in **detach-grace** mode is retained (not closed) for
|
||||
`DetachGraceSeconds` (separate from the session lease). New session
|
||||
disconnect-policy value `DetachGrace`.
|
||||
- On reconnect: client calls `StreamEvents` with the same session id +
|
||||
`AfterWorkerSequence = lastSeen`. The distributor replays ring-buffer events
|
||||
with `sequence > AfterWorkerSequence`, then resumes live.
|
||||
- If the requested sequence is older than the ring's oldest retained event (gone
|
||||
too long / ring overflowed), the server signals **`ReplayGap`** so the client
|
||||
re-snapshots. **Contract addition** (a `ReplayGap` status / response marker) →
|
||||
codegen ripple across all 5 clients.
|
||||
- Reconnect re-validates caller `OwnerKeyId` == session owner → else
|
||||
`PermissionDenied`.
|
||||
|
||||
### 3. Per-session ACL
|
||||
|
||||
- **gRPC (real security win, no proto change):** `Invoke` / `StreamEvents` /
|
||||
`CloseSession` gated to the owning API key, OR a key holding a new
|
||||
all-sessions admin scope → else `PermissionDenied`. Enforced in
|
||||
`MxAccessGatewayService` against session `OwnerKeyId`.
|
||||
- **Dashboard:** identity-domain mismatch (LDAP Admin/Viewer users vs API-key
|
||||
sessions) means no natural owner link.
|
||||
- **Decision required (flagged, not hard-coded):** default proposal —
|
||||
**Admin** sees all sessions; **Viewer** scoped via config
|
||||
`Dashboard:GroupToSessionTag` matched against an optional session `Tag`.
|
||||
Enforced at `EventsHub.SubscribeSession` and in the `/hubs/token` mint
|
||||
(token gains an allowed-session-tag claim). The owner may instead choose a
|
||||
strict default (Viewers see nothing unless granted).
|
||||
|
||||
### 4. Orphan-worker reattach
|
||||
|
||||
- **Stable pipe naming:** drop `{gatewayPid}`; use a persisted stable
|
||||
gateway-instance id. Replaces the pid's collision-avoidance role.
|
||||
- **Adoption manifest:** persist a minimal record per live session
|
||||
(`sessionId → workerPid, nonce, ownerKeyId, pipeName`) in the existing SQLite
|
||||
store. This is the *only* persisted session state; COM/advise state stays in
|
||||
the worker.
|
||||
- **Worker phones home:** the worker runs a reconnect loop with bounded backoff;
|
||||
the restarted gateway re-opens pipe servers for manifest entries and the
|
||||
surviving worker re-attaches, presenting its **nonce**. Gateway validates the
|
||||
nonce against the manifest and **rejects impostors / foreign workers**.
|
||||
- **Resync, not replay:** the in-memory ring buffer is lost on restart, so a
|
||||
reattached session's subscribers get `ReplayGap` and re-snapshot. Gateway
|
||||
resyncs worker view via the now-implemented `GetSessionState` / `GetWorkerInfo`
|
||||
commands.
|
||||
- **Safety net retained:** workers self-terminate after `MaxOrphanLifetime` with
|
||||
no re-adoption; `OrphanWorkerTerminator` stays as the fallback for un-adoptable
|
||||
or foreign workers. Reattach is opt-in (`Workers:EnableOrphanReattach`,
|
||||
default off) so the documented-safe behavior remains the default.
|
||||
- **Pipe protocol:** add an **adopt/reconnect frame** to `mxaccess_worker.proto`
|
||||
→ worker codegen regen + commit `Generated/` (net48 regen rule applies).
|
||||
|
||||
## Contract / codegen impact
|
||||
|
||||
Unlike the prior epic, this is **not zero-proto**:
|
||||
|
||||
- `mxaccess_gateway.proto` — `ReplayGap` signal for reconnect (Feature 2).
|
||||
- `mxaccess_worker.proto` — adopt/reconnect frame (Feature 4).
|
||||
|
||||
Per the repo rule: regenerate `Generated/`, commit it, rebuild gateway + worker +
|
||||
every generated client touched, and update affected docs in the same change.
|
||||
|
||||
## Error handling
|
||||
|
||||
- Per-subscriber overflow → disconnect that subscriber only; session survives.
|
||||
- Reconnect past the ring horizon → `ReplayGap`, client re-snapshots (no silent
|
||||
loss).
|
||||
- Reattach nonce mismatch → reject + fall back to termination.
|
||||
- ACL denial → `PermissionDenied` (gRPC) / hub subscribe refused (dashboard).
|
||||
- All worker COM/STA interactions keep MXAccess parity — no synthesized events,
|
||||
no "fixing" surprising returns.
|
||||
|
||||
## Testing & cross-platform verification
|
||||
|
||||
| Area | Test | Host |
|
||||
|---|---|---|
|
||||
| Distributor fan-out, per-sub backpressure, replay ring | unit, `FakeWorkerHarness` | local (macOS) |
|
||||
| Reconnect replay + `ReplayGap` | unit + fake-worker integration | local |
|
||||
| Session ownership / gRPC ACL | unit + gateway integration | local |
|
||||
| Dashboard per-session ACL | LDAP test users (`multi-role`/`gw-viewer`) | local + live LDAP |
|
||||
| Worker adopt frame, reattach handshake | worker unit (net48/x86) | **windev** |
|
||||
| Gateway-restart reattach round-trip | integration | **windev** + live worker |
|
||||
| Client `ReplayGap` handling | per-client tests; Java on macOS JDK 21 | local |
|
||||
|
||||
TDD throughout; per-task commits; `Generated/` regenerated+committed on proto
|
||||
changes; docs (incl. the three documented-rule reversals) updated in the same
|
||||
change as source.
|
||||
|
||||
## Delivery order (dependency stack)
|
||||
|
||||
Each phase is independently shippable:
|
||||
|
||||
1. **Foundation** — `SessionEventDistributor` + replay ring + session `OwnerKeyId`
|
||||
(refactor with no external behavior change; dashboard-dark bug fixed).
|
||||
2. **Fan-out** — remove validator block, subscriber-lease list, cap, dashboard as
|
||||
subscriber.
|
||||
3. **Reconnect** — detach-grace, replay-on-reconnect, `ReplayGap` contract +
|
||||
client handling.
|
||||
4. **Per-session ACL** — gRPC owner gate + dashboard scoping.
|
||||
5. **Reattach** — stable pipe naming, adoption manifest, worker phone-home +
|
||||
adopt frame, resync, safety net; documented-rule reversals.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Cross-gateway / clustered session sharing (single gateway instance only).
|
||||
- Event persistence beyond the in-memory ring (no durable event log).
|
||||
- Reconnect across a gateway *restart* with zero event gap (restart always yields
|
||||
`ReplayGap` by design — the ring is in-memory).
|
||||
- Per-session ACL on `SnapshotHub` / `AlarmsHub` (they broadcast aggregate state;
|
||||
only `EventsHub` is session-scoped).
|
||||
Reference in New Issue
Block a user