Session Resilience Epic — Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development (same session) or executing-plans (parallel session) to implement this plan task-by-task.

Goal: Lift four deferred v1 limitations — multi-subscriber fan-out, reconnectable sessions, per-session ACL, orphan-worker reattach — onto one shared event-distribution foundation.

Architecture: A per-session SessionEventDistributor (one pump → N per-subscriber bounded channels + a bounded replay ring) replaces today's per-RPC destructive drain. Session ownership (OwnerKeyId) underpins ACL, reconnect re-validation, and reattach adoption. See docs/plans/2026-06-15-session-resilience-design.md.

Tech Stack: .NET 10 gateway (x64), .NET Framework 4.8 worker (x86, windev), SQLite auth/manifest store, gRPC + protobuf contracts (net10.0;net48), 5 language clients, Blazor/SignalR dashboard, LDAP dashboard auth.

Cross-platform: Gateway, dotnet/Go/Rust/Python clients, and the Java client build/test locally on macOS (JDK 21 at ~/.local/jdks/jdk-21.0.11+10/Contents/Home). The net48/x86 worker and worker tests build/test on windev (ssh alias, PowerShell). Proto changes: regenerate Generated/, commit it, rebuild every touched component.

Standing rules (from CLAUDE.md): never log secrets/credentials/values; MXAccess parity (no synthesized events, no "fixing" surprising returns); no init-only props/positional records in net48 worker; update docs in the same change as source; branch already created (feat/session-resilience); per-task commits; build+test affected components before marking done.


Phase 1 — Foundation (refactor; no external behavior change except the dashboard-dark fix)

Task 1: Add OwnerKeyId to the session

Classification: small Estimated implement time: ~4 min Parallelizable with: none (other phase-1 tasks build on the session type)

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs (add OwnerKeyId readonly prop near ClientIdentity:114)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs (set OwnerKeyId from the request identity at OpenSession, near CreateSessionId:479)
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs

Steps: TDD — failing test asserting an opened session records the creating API key id → add the property + assignment from IGatewayRequestIdentityAccessor.Current → green → dotnet build src/ZB.MOM.WW.MxGateway.Server + run session tests → commit.

Task 2: SessionEventDistributor skeleton (single pump, subscriber registry)

Classification: high-risk (concurrency / actor model) Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Create: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionEventDistributorTests.cs

Design: One background pump Task draining session.ReadEventsAsync() exactly once; a thread-safe subscriber collection where each subscriber owns a bounded Channel<MxEvent> (SingleReader=true, FullMode=Wait for the per-sub channel, but writes use non-blocking TryWrite). Register(startSequence) returns a lease (channel reader + dispose). Pump fans each drained event to all subscriber channels via TryWrite.

Steps: Failing test: two registered subscribers both receive the same fanned event; disposing one stops its delivery without affecting the other. Implement pump + registry. Green. Build + test. Commit.
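The fan-out shape in the Design note can be sketched as follows. This is a language-neutral illustration written in Python, not the gateway's C# code: `Distributor`, `register`, `dispose`, and `fan_out` are hypothetical names, and `queue.Queue.put_nowait` stands in for a non-blocking `TryWrite` on a bounded channel. The pump loop itself is omitted; `fan_out` is what the single pump would call per drained event.

```python
import queue
import threading


class Distributor:
    """Hypothetical stand-in for SessionEventDistributor (illustration only)."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._lock = threading.Lock()
        self._subscribers = {}  # lease id -> bounded queue
        self._next_id = 0

    def register(self):
        """Create a bounded per-subscriber queue and return (lease_id, queue)."""
        with self._lock:
            lease_id = self._next_id
            self._next_id += 1
            q = queue.Queue(maxsize=self._capacity)
            self._subscribers[lease_id] = q
            return lease_id, q

    def dispose(self, lease_id):
        """Stop delivery to one subscriber without touching the others."""
        with self._lock:
            self._subscribers.pop(lease_id, None)

    def fan_out(self, event):
        """Offer the event to every subscriber without blocking.

        Returns the lease ids whose queue was full; the overflow policy is
        left to the caller in this sketch.
        """
        with self._lock:
            snapshot = list(self._subscribers.items())
        full = []
        for lease_id, q in snapshot:
            try:
                q.put_nowait(event)  # analogue of a non-blocking TryWrite
            except queue.Full:
                full.append(lease_id)
        return full
```

Taking a snapshot under the lock keeps registration and disposal safe while the pump iterates, which is one simple way to get the thread-safe registry the design calls for.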

Task 3: Bounded replay ring buffer

Classification: standard Estimated implement time: ~4 min Parallelizable with: none (extends Task 2)

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/EventOptions.cs (add ReplayBufferCapacity, ReplayRetentionSeconds)
  • Test: SessionEventDistributorTests.cs

Design: Append each fanned event to a ring keyed by worker sequence, evicting by count (ReplayBufferCapacity) or age (ReplayRetentionSeconds), whichever first. Expose TryGetReplayFrom(afterSequence, out events, out gap).

Steps: Failing test: events evicted past capacity; TryGetReplayFrom returns gap=true when requested sequence is older than the oldest retained. Implement. Green. Build+test. Commit.
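A minimal sketch of the ring semantics described above, in Python for brevity. `ReplayRing` and its method names are hypothetical, the clock is passed in explicitly to keep the example deterministic, and the sketch assumes worker sequences are strictly increasing. A gap is reported when at least one event newer than the requested sequence has already been evicted.

```python
from collections import deque


class ReplayRing:
    """Hypothetical bounded replay buffer keyed by worker sequence."""

    def __init__(self, capacity, retention_seconds):
        self._capacity = capacity
        self._retention = retention_seconds
        self._items = deque()       # (sequence, appended_at, event), oldest first
        self._last_evicted = None   # highest sequence evicted so far

    def append(self, sequence, event, now):
        self._items.append((sequence, now, event))
        self._evict(now)

    def _evict(self, now):
        # Evict by count or by age, whichever limit is hit first.
        while self._items and (
            len(self._items) > self._capacity
            or now - self._items[0][1] > self._retention
        ):
            self._last_evicted = self._items.popleft()[0]

    def try_get_replay_from(self, after_sequence, now):
        """Return (events, gap) for sequences strictly greater than after_sequence."""
        self._evict(now)
        gap = self._last_evicted is not None and self._last_evicted > after_sequence
        events = [event for seq, _, event in self._items if seq > after_sequence]
        return events, gap
```

With capacity 3 and sequences 1 through 5 appended, a request after sequence 3 replays 4 and 5 with no gap, while a request after sequence 1 reports a gap because sequence 2 is no longer retained.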

Task 4: Rewire AttachEventSubscriber + EventStreamService onto the distributor

Classification: high-risk (changes the live event path) Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:386-408 (own a SessionEventDistributor; AttachEventSubscriber returns a lease wrapping distributor.Register(...))
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:27-101 (read the lease's channel instead of creating a per-RPC channel and draining the session directly; remove the per-RPC Channel.CreateBounded at :43-50)
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs

Steps: Failing test: a single subscriber still streams events end-to-end through the distributor (regression parity with today). Rewire. Keep per-item constraint filtering in the subscriber read loop. Green. Build + run gateway event-stream tests. Commit.

Task 5: Per-subscriber backpressure isolation

Classification: standard Estimated implement time: ~4 min Parallelizable with: none (extends Tasks 2/4)

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs (overflow path :143-162)
  • Test: SessionEventDistributorTests.cs

Design: On a subscriber channel TryWrite failure, complete only that subscriber's channel with EventQueueOverflow (policy DisconnectSubscriber). Retain FailFastMarkFaulted only when the session is in legacy single-subscriber mode (back-compat).

Steps: Failing test: a slow subscriber overflows and is disconnected while a second subscriber keeps receiving and the session stays Ready. Implement. Green. Build+test. Commit.
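The isolation rule can be illustrated with a small Python sketch. It models only the DisconnectSubscriber policy; the legacy FailFastMarkFaulted path is not shown. `Subscriber`, its `fault` field, and the free function `fan_out` are hypothetical names, and setting `fault` stands in for completing that subscriber's channel with EventQueueOverflow.

```python
import queue


class Subscriber:
    """Hypothetical per-subscriber state: a bounded queue plus a terminal fault."""

    def __init__(self, capacity):
        self.events = queue.Queue(maxsize=capacity)
        self.fault = None  # set once when this subscriber is disconnected


def fan_out(subscribers, event):
    """Deliver to every live subscriber; disconnect only the ones that are full."""
    for sub in subscribers:
        if sub.fault is not None:
            continue  # already disconnected, skip without affecting others
        try:
            sub.events.put_nowait(event)  # non-blocking, like TryWrite
        except queue.Full:
            sub.fault = "EventQueueOverflow"


# A slow subscriber (capacity 1, never read) next to a healthy one.
slow, fast = Subscriber(capacity=1), Subscriber(capacity=8)
for seq in range(3):
    fan_out([slow, fast], seq)
```

After the loop, `slow` holds one event and carries the overflow fault, while `fast` has received all three events. Nothing in this path touches session-level state, which is the point of the policy.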

Task 6: Dashboard broadcaster becomes a distributor subscriber

Classification: standard Estimated implement time: ~4 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:131-141 (remove the inline dashboardEventBroadcaster.Publish tap)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs (register the dashboard broadcaster as a distributor subscriber on session start)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardEventBroadcaster.cs (consume from a distributor lease)
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardEventBroadcasterTests.cs

Steps: Failing test: dashboard receives session events even with no active gRPC subscriber (fixes the latent dark-feed bug). Implement. Green. Build + dashboard tests. Commit.


Phase 2 — Multi-subscriber fan-out

Task 7: Remove the validator block + add the subscriber cap option

Classification: small Estimated implement time: ~3 min Parallelizable with: none (Task 8 is sequential; same files)

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/GatewayOptionsValidator.cs:181-185 (delete the rejection)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs (add MaxEventSubscribersPerSession, default 8)
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Configuration/GatewayOptionsValidatorTests.cs

Steps: Failing test: AllowMultipleEventSubscribers=true now validates clean. Remove rule, add option. Green. Build+test. Commit.

Task 8: Subscriber-lease collection + cap enforcement

Classification: standard Estimated implement time: ~4 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs (replace _activeEventSubscriberCount:16 with a lease collection; honor allowMultipleSubscribers; reject N+1 with new SessionManagerErrorCode.EventSubscriberLimitReached)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManagerErrorCode.cs
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs

Steps: Failing tests: N subscribers attach concurrently up to the cap; N+1 throws EventSubscriberLimitReached; single-subscriber mode still rejects the 2nd. Implement. Green. Build+test. Commit.
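A rough Python sketch of cap enforcement over a lease collection. `SessionLeases`, `attach`, and `detach` are hypothetical names. One assumption to flag: the sketch raises the same error for the second subscriber in single-subscriber mode as for subscriber N+1 in multi-subscriber mode, whereas the real gateway may keep a distinct error code for the legacy case.

```python
import threading


class EventSubscriberLimitReached(Exception):
    """Stand-in for SessionManagerErrorCode.EventSubscriberLimitReached."""


class SessionLeases:
    """Hypothetical lease collection replacing a bare subscriber counter."""

    def __init__(self, allow_multiple, max_subscribers):
        # Single-subscriber mode is modelled as a cap of one.
        self._limit = max_subscribers if allow_multiple else 1
        self._lock = threading.Lock()
        self._leases = set()
        self._next_id = 0

    def attach(self):
        """Check the cap and add the lease atomically, so concurrent attaches cannot overshoot."""
        with self._lock:
            if len(self._leases) >= self._limit:
                raise EventSubscriberLimitReached(
                    "limit of %d subscriber(s) reached" % self._limit
                )
            self._next_id += 1
            self._leases.add(self._next_id)
            return self._next_id

    def detach(self, lease_id):
        with self._lock:
            self._leases.discard(lease_id)
```

Doing the check and the insert under one lock is what makes the concurrent-attach test meaningful; a check-then-add without it could admit more than the cap.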

Task 9: Multi-subscriber end-to-end test via FakeWorkerHarness

Classification: standard Estimated implement time: ~4 min Parallelizable with: none

Files:

  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs

Steps: Two concurrent StreamEvents RPCs on one session both receive every worker event; one cancels, the other continues. Build + full fake-worker suite. Commit.


Phase 3 — Reconnectable sessions

Task 10: Proto — ReplayGap signal (contract change)

Classification: high-risk (contracts → all clients) Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_gateway.proto (add a ReplayGap marker — a replay_gap bool + oldest_available_sequence on the stream response, or a dedicated leading status frame)
  • Regenerate: dotnet build src/ZB.MOM.WW.MxGateway.Contracts/ZB.MOM.WW.MxGateway.Contracts.csproj; commit src/ZB.MOM.WW.MxGateway.Contracts/Generated/* (net48 regen rule — see project_proto_codegen_regen)
  • Test: contracts build both TFMs (net10.0;net48)

Steps: Add field(s), regenerate (delete Generated/*.cs first if needed to force regeneration), commit the generated files. Build contracts for both TFMs. Commit. This unblocks Task 12 and Task 14, which emit and consume the ReplayGap marker.

Task 11: Detach-grace session retention

Classification: high-risk (session lifecycle) Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs (add DetachGrace retention: on last-subscriber-drop, keep session alive for DetachGraceSeconds instead of closing)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs (DetachGraceSeconds; new disconnect-policy value DetachGrace)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionLeaseMonitorHostedService.cs (sweep expired detach-grace windows)
  • Test: GatewaySessionTests.cs

Steps: Failing test: subscriber drop under DetachGrace keeps the session Ready until the window expires, then closes. Implement. Green. Build + session/lease tests. Commit.
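The retention window can be sketched as a small state tracker, shown here in Python with an injected clock. `DetachGraceTracker` and its methods are hypothetical; `should_close` is the check a periodic sweep (such as the lease monitor) might run.

```python
class DetachGraceTracker:
    """Hypothetical tracker for the detach-grace window of one session."""

    def __init__(self, grace_seconds):
        self._grace = grace_seconds
        self._subscribers = 0
        self._detached_at = None  # time the last subscriber dropped, if any

    def on_attach(self):
        self._subscribers += 1
        self._detached_at = None  # a reattach cancels the pending window

    def on_detach(self, now):
        self._subscribers -= 1
        if self._subscribers == 0:
            self._detached_at = now  # start the grace window

    def should_close(self, now):
        """True once the session has had no subscriber for the full grace period."""
        return (
            self._detached_at is not None
            and now - self._detached_at >= self._grace
        )
```

A reconnect inside the window resets the tracker, so the session is only closed when the window expires with no subscriber attached.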

Task 12: Replay-on-reconnect + emit ReplayGap

Classification: high-risk Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs (on attach with AfterWorkerSequence, call distributor.TryGetReplayFrom; replay buffered events then resume live; if gap, send the ReplayGap marker first)
  • Test: EventStreamServiceTests.cs

Steps: Failing tests: reconnect with a known sequence replays only newer events; reconnect past the ring horizon yields ReplayGap. Implement. Green. Build + test. Commit.
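The ordering on resume (gap marker first, then buffered events, then live events) can be sketched as a Python generator. `resume_stream` and the tuple shapes are hypothetical. The sequence-based de-duplication between the replayed tail and the live feed is an assumption of this sketch, not something the plan specifies.

```python
def resume_stream(after_sequence, replayed, gap, live):
    """Yield ("replay_gap", None, None) if needed, then replayed, then live events.

    replayed and live are iterables of (sequence, event) in ascending order.
    """
    if gap:
        # Tell the client up front that it missed events and should re-snapshot.
        yield ("replay_gap", None, None)
    last = after_sequence
    for seq, event in replayed:
        yield ("event", seq, event)
        last = seq
    for seq, event in live:
        if seq <= last:
            continue  # already delivered from the replay buffer
        yield ("event", seq, event)
        last = seq
```

For example, resuming after sequence 2 with a buffered tail of 3 and 4 and a live feed that starts at 4 yields events 3, 4, 5 exactly once each.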

Task 13: Owner re-validation on reconnect

Classification: small Estimated implement time: ~3 min Parallelizable with: none (different assertion in the same service as Task 12; sequence after 12)

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs (reconnect requires caller OwnerKeyId == session owner → PermissionDenied)
  • Test: EventStreamServiceTests.cs

Steps: Failing test: a different API key cannot resume someone else's session. Implement. Green. Build+test. Commit.

Task 14: Client ReplayGap handling — all 5 clients

Classification: standard Estimated implement time: ~5 min each (dispatch as 5 parallel sub-tasks; disjoint files) Parallelizable with: each other (14a–14e)

Files (one client each):

  • 14a dotnet: clients/dotnet/.../ stream consumer + test
  • 14b Go: clients/go/mxgateway/ + go test
  • 14c Python: clients/python/src/.../ + pytest
  • 14d Rust: clients/rust/crates/.../ + cargo test/clippy
  • 14e Java: clients/java/.../ + gradle test (macOS JDK 21; revert generated MxaccessGateway.java churn per project_java_generated_churn)

Steps (each): Regenerate client stubs from the updated proto; surface ReplayGap to the caller (callback/return marker) so apps know to re-snapshot; test the gap path. Build+test that client. Commit per client.

Task 15: Reconnect integration test (fake worker)

Classification: standard Estimated implement time: ~4 min Parallelizable with: none

Files:

  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs

Steps: Stream, drop, reconnect within grace with last sequence → no gap; reconnect after ring overflow → ReplayGap. Build + suite. Commit.


Phase 4 — Per-session ACL

Task 16: gRPC session-owner gate + all-sessions admin scope

Classification: high-risk (security) Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/MxAccessGatewayService.cs (Invoke/StreamEvents/CloseSession require caller key == session OwnerKeyId, or a key bearing a new all-sessions scope)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Security/Authorization/ (define the all-sessions scope; map it)
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs

Steps: Failing tests: foreign key gets PermissionDenied on another key's session; owner and all-sessions-scoped key succeed. Implement. Green. Build + gateway tests. Commit.
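The gate reduces to a two-branch check, sketched here in Python. The scope string `"all-sessions"` and the function name are placeholders (the plan does not fix the scope's identifier), and `PermissionError` stands in for a gRPC PermissionDenied status.

```python
ALL_SESSIONS_SCOPE = "all-sessions"  # hypothetical scope identifier


def require_session_access(caller_key_id, caller_scopes, session_owner_key_id):
    """Allow the session owner or any key bearing the all-sessions scope."""
    if caller_key_id == session_owner_key_id:
        return
    if ALL_SESSIONS_SCOPE in caller_scopes:
        return
    # Avoid echoing key ids or other credential material in the message.
    raise PermissionError("caller does not own this session")
```

The same helper would be called at the top of Invoke, StreamEvents, and CloseSession so the three entry points cannot drift apart.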

Task 17: Session Tag + dashboard group→tag config

Classification: small Estimated implement time: ~3 min Parallelizable with: Task 16 (disjoint files)

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs (plus SessionManager, which sets an optional Tag from the open request)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/ Dashboard options (GroupToSessionTag map)
  • Test: config-binding test

Steps: Failing test: a session carries its tag; config map binds. Implement. Green. Build+test. Commit.

Task 18: EventsHub per-session ACL + hub-token session-tag claim

Classification: standard Estimated implement time: ~5 min Parallelizable with: none (depends on 17)

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:39-54 (replace TODO(per-session-acl): Admin sees all; Viewer allowed only if the session's tag is in the user's GroupToSessionTag-derived allowed set)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs (mint an allowed-session-tag claim)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenAuthenticationHandler.cs (carry the claim back)
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/EventsHubTests.cs

Decision flagged in the design doc: default is "Admin all / Viewer by tag map." If the owner chose the strict variant (Viewers see nothing unless granted), invert the default here — the executor must confirm which before implementing.

Steps: Failing tests: Viewer without the tag is refused SubscribeSession; Admin allowed; Viewer with the mapped tag allowed. Implement. Green. Build + dashboard tests. Commit.
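A sketch of the default policy ("Admin all / Viewer by tag map") in Python. It assumes `GroupToSessionTag` maps one group name to one tag and that an untagged session is not visible to Viewers; both are assumptions of this example, and the group and tag names in the usage are invented.

```python
def allowed_session_tags(user_groups, group_to_session_tag):
    """Derive the set of tags a user may see from their group memberships."""
    return {
        group_to_session_tag[group]
        for group in user_groups
        if group in group_to_session_tag
    }


def can_subscribe(role, user_groups, session_tag, group_to_session_tag):
    """Admin sees every session; Viewer only sessions whose tag is in their set."""
    if role == "Admin":
        return True
    if role == "Viewer":
        tags = allowed_session_tags(user_groups, group_to_session_tag)
        return session_tag is not None and session_tag in tags
    return False  # unknown roles are refused
```

In the hub-token flow, the set returned by `allowed_session_tags` is roughly what the minted claim would carry, so the hub can decide without another directory lookup.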

Task 19: ACL tests incl. live LDAP users

Classification: standard Estimated implement time: ~4 min Parallelizable with: none

Files:

  • Test: src/ZB.MOM.WW.MxGateway.IntegrationTests/DashboardLdapLiveTests.cs (extend; gated MXGATEWAY_RUN_LIVE_LDAP_TESTS=1)

Steps: With multi-role (Admin) vs gw-viewer (Viewer), assert subscribe authorization differs by session tag. Document if skipped (no live LDAP). Commit.


Phase 5 — Orphan-worker reattach (overturns the CLAUDE.md rule)

Task 20: Stable gateway-instance id + stable pipe naming

Classification: standard Estimated implement time: ~4 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:433 (pipe name uses a persisted stable gateway-instance id instead of Environment.ProcessId)
  • Create: gateway-instance-id persistence (small file/SQLite row under C:\ProgramData\MxGateway\)
  • Test: SessionManagerTests.cs / a new instance-id test

Steps: Failing test: pipe name is stable across simulated restarts (same instance id). Implement. Green. Build + tests. Commit.
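One way to picture the stable id is a load-or-create helper backed by a small file, sketched in Python. The file-based store, the function names, and the pipe-name format are all hypothetical; the plan leaves open whether the id lives in a file or a SQLite row.

```python
import uuid


def load_or_create_instance_id(path):
    """Return the persisted gateway-instance id, creating it on first use."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            value = handle.read().strip()
            if value:
                return value
    except FileNotFoundError:
        pass
    value = uuid.uuid4().hex
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(value)
    return value


def pipe_name(instance_id, session_id):
    """Hypothetical naming scheme: depends on the instance id, not the process id."""
    return "mxgateway-%s-%s" % (instance_id, session_id)
```

Because the id survives a process restart, the name computed for a given session is the same before and after, which is what lets a surviving worker find the pipe again.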

Task 21: Adoption manifest store (SQLite)

Classification: high-risk (persistence, security material) Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Create: src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerAdoptionManifest.cs (persist sessionId → workerPid, nonce, ownerKeyId, pipeName; upsert on launch, delete on clean close)
  • Modify: gateway-auth SQLite schema/migration
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerAdoptionManifestTests.cs

Nonce is security material — store it like other secrets (no plaintext logging; standing rule).

Steps: Failing test: manifest round-trips an entry; clean close removes it. Implement. Green. Build + tests. Commit.
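The manifest's round-trip behaviour can be sketched with Python's built-in `sqlite3` module. The table name, column names, and class are hypothetical, and the sketch stores the raw nonce only for brevity; the real store should follow whatever handling the project already applies to other secrets.

```python
import sqlite3


class AdoptionManifest:
    """Hypothetical sessionId -> (workerPid, nonce, ownerKeyId, pipeName) store."""

    def __init__(self, conn):
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS worker_adoption ("
            " session_id TEXT PRIMARY KEY,"
            " worker_pid INTEGER NOT NULL,"
            " nonce BLOB NOT NULL,"
            " owner_key_id TEXT NOT NULL,"
            " pipe_name TEXT NOT NULL)"
        )
        conn.commit()

    def upsert(self, session_id, worker_pid, nonce, owner_key_id, pipe_name):
        """Insert or overwrite the entry when a worker is launched."""
        self._conn.execute(
            "INSERT OR REPLACE INTO worker_adoption VALUES (?, ?, ?, ?, ?)",
            (session_id, worker_pid, nonce, owner_key_id, pipe_name),
        )
        self._conn.commit()

    def get(self, session_id):
        """Return (worker_pid, nonce, owner_key_id, pipe_name) or None."""
        return self._conn.execute(
            "SELECT worker_pid, nonce, owner_key_id, pipe_name"
            " FROM worker_adoption WHERE session_id = ?",
            (session_id,),
        ).fetchone()

    def remove(self, session_id):
        """Delete the entry on clean close so it is never offered for adoption."""
        self._conn.execute(
            "DELETE FROM worker_adoption WHERE session_id = ?", (session_id,)
        )
        self._conn.commit()
```

Usage: `AdoptionManifest(sqlite3.connect(":memory:"))` gives a throwaway store suitable for the round-trip test described in the steps.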

Task 22: Proto — worker adopt/reconnect frame (contract change)

Classification: high-risk (contracts → worker) Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_worker.proto (add an adopt/reconnect WorkerEnvelope frame: worker presents sessionId + nonce; gateway ACK/NACK)
  • Regenerate + commit Generated/* (net48 rule)
  • Test: contracts build both TFMs

Steps: Add frame, regen, commit generated, build both TFMs. Commit. Unblocks Tasks 24–25.

Task 23: Worker phone-home reconnect loop + self-terminate

Classification: high-risk (worker, net48/x86 — windev) Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Worker/Ipc/WorkerPipeClient.cs (on pipe drop: reconnect loop with bounded backoff to the stable pipe name; present the adopt frame)
  • Modify: src/ZB.MOM.WW.MxGateway.Worker/ runtime (self-terminate after MaxOrphanLifetime with no adoption)
  • Test: src/ZB.MOM.WW.MxGateway.Worker.Tests/ (net48/x86 on windev)

Steps: Failing test (fake pipe server): worker retries and adopts; gives up + self-terminates past the lifetime. Build x86 + worker tests on windev. Commit. (net48: no init-only/positional records.)
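The retry-then-give-up behaviour can be sketched in Python with the clock and sleep injected so it runs without real waiting. The real worker is net48 C#; the function name, the doubling backoff, and the default delays here are illustrative assumptions, with `try_adopt` standing in for "connect to the stable pipe name and present the adopt frame".

```python
def reconnect_until_adopted(try_adopt, max_orphan_lifetime, now, sleep,
                            initial_delay=0.5, max_delay=8.0):
    """Retry adoption with bounded exponential backoff.

    Returns True once adopted, or False when the orphan lifetime has elapsed,
    at which point the caller is expected to self-terminate.
    """
    deadline = now() + max_orphan_lifetime
    delay = initial_delay
    while now() < deadline:
        if try_adopt():
            return True
        remaining = deadline - now()
        if remaining <= 0:
            break
        sleep(min(delay, remaining))       # never sleep past the deadline
        delay = min(delay * 2, max_delay)  # bounded backoff
    return False
```

Capping each sleep at the remaining lifetime keeps the self-terminate deadline firm even when the backoff delay has grown large.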

Task 24: Gateway adoption — re-open pipes, nonce-validate, reject impostors

Classification: high-risk (security, lifecycle — windev for live) Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Create: src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerAdopter.cs (startup: read manifest, re-open pipe servers, accept adopt frames, validate nonce → adopt or reject)
  • Modify: gateway startup hosted-service order (adopter runs before OrphanWorkerTerminator; terminator handles only un-adoptable/foreign workers)
  • Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/OrphanWorkerAdopterTests.cs

Steps: Failing tests: matching nonce adopts and rebuilds the session; mismatched nonce is rejected and the worker terminated. Implement. Green. Build + tests. Commit.
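The adopt-or-reject decision can be sketched in a few lines of Python. The function name and the dict-shaped manifest are hypothetical. Using a constant-time comparison (`hmac.compare_digest`) is a choice made for this sketch rather than something the plan states.

```python
import hmac


def decide_adoption(manifest_nonces, session_id, presented_nonce):
    """Return "adopt" only when the presented nonce matches the stored one.

    manifest_nonces maps session id -> expected nonce (bytes).
    """
    expected = manifest_nonces.get(session_id)
    if expected is None:
        return "reject"  # unknown session: treat the worker as foreign
    if hmac.compare_digest(expected, presented_nonce):
        return "adopt"
    return "reject"      # impostor or stale worker; caller terminates it
```

A rejected worker falls through to the terminator path, matching the ordering in which the adopter runs first and the terminator handles whatever could not be adopted.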

Task 25: Resync adopted worker + ReplayGap to subscribers

Classification: standard Estimated implement time: ~4 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerAdopter.cs (after adoption, GetSessionState/GetWorkerInfo to resync; reattached subscribers get ReplayGap since the ring is gone)
  • Test: OrphanWorkerAdopterTests.cs

Steps: Failing test: adopted session reports resynced state; a resuming subscriber receives ReplayGap. Implement. Green. Build + tests. Commit.

Task 26: EnableOrphanReattach flag (default off) + terminator fallback

Classification: small Estimated implement time: ~3 min Parallelizable with: none

Files:

  • Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/WorkerOptions.cs (EnableOrphanReattach, default false)
  • Modify: src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerTerminator.cs (unchanged default behavior when reattach disabled)
  • Test: OrphanWorkerTerminatorTests.cs / adopter test

Steps: Failing test: with the flag off, startup terminates (today's behavior); on, it adopts. Implement. Green. Build + tests. Commit.

Task 27: Gateway-restart reattach round-trip (integration, windev + live worker)

Classification: high-risk Estimated implement time: ~5 min Parallelizable with: none

Files:

  • Test: src/ZB.MOM.WW.MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs (gated MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1)

Steps: Open session → simulate gateway restart → adopter re-adopts the surviving worker → session usable → subscriber gets ReplayGap then live events. Run on windev with live MXAccess. Document if skipped.

Task 28: Documented-rule reversals + stillpending refresh

Classification: trivial (doc-only) Estimated implement time: ~3 min Parallelizable with: none (final)

Files:

  • Modify: CLAUDE.md (line ~77 — reattach now supported, opt-in/bounded)
  • Modify: docs/DesignDecisions.md (:63-73 reconnect, :75-80 multi-subscriber, reattach rationale → mark superseded with this design's date/commit)
  • Modify: gateway.md (post-v1 revisit items — reflect what shipped)
  • Modify: stillpending.md (§2 items: mark fan-out/reconnect/ACL/reattach Resolved with commit refs)
  • Modify: docs/GatewayConfiguration.md (new options: MaxEventSubscribersPerSession, ReplayBufferCapacity, ReplayRetentionSeconds, DetachGraceSeconds, GroupToSessionTag, EnableOrphanReattach, MaxOrphanLifetime)

Steps: Edit docs to match shipped behavior. Commit.


Verification matrix

| Phase | Build/test | Host |
|---|---|---|
| 1–4 (gateway, clients) | dotnet build + gateway/fake-worker tests; per-client go/pytest/cargo/gradle/dotnet test | local (macOS) |
| 3/5 proto changes | regen + commit Generated/; build contracts both TFMs; rebuild touched clients | local |
| 5 worker (net48/x86) | dotnet build -p:Platform=x86 + Worker.Tests | windev |
| 5 live reattach + Phase-4 LDAP | opt-in gated integration tests | windev / live LDAP |

Final integration review

After all tasks: dispatch a final integration reviewer over git diff main..HEAD focusing on the live event path, concurrency in SessionEventDistributor, security gates (ACL + nonce adoption), and the three documented-rule reversals. Then use superpowers-extended-cc:finishing-a-development-branch.