Session Resilience Epic — Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development (same session) or executing-plans (parallel session) to implement this plan task-by-task.
Goal: Lift four deferred v1 limitations — multi-subscriber fan-out, reconnectable sessions, per-session ACL, orphan-worker reattach — onto one shared event-distribution foundation.
Architecture: A per-session SessionEventDistributor (one pump → N per-subscriber bounded channels + a bounded replay ring) replaces today's per-RPC destructive drain. Session ownership (OwnerKeyId) underpins ACL, reconnect re-validation, and reattach adoption. See docs/plans/2026-06-15-session-resilience-design.md.
Tech Stack: .NET 10 gateway (x64), .NET Framework 4.8 worker (x86, windev), SQLite auth/manifest store, gRPC + protobuf contracts (net10.0;net48), 5 language clients, Blazor/SignalR dashboard, LDAP dashboard auth.
Cross-platform: Gateway, dotnet/Go/Rust/Python clients, and the Java client build/test locally on macOS (JDK 21 at ~/.local/jdks/jdk-21.0.11+10/Contents/Home). The net48/x86 worker and worker tests build/test on windev (ssh alias, PowerShell). Proto changes: regenerate Generated/, commit it, rebuild every touched component.
Standing rules (from CLAUDE.md): never log secrets/credentials/values; MXAccess parity (no synthesized events, no "fixing" surprising returns); no init-only props/positional records in net48 worker; update docs in the same change as source; branch already created (feat/session-resilience); per-task commits; build+test affected components before marking done.
Phase 1 — Foundation (refactor; no external behavior change except the dashboard-dark fix)
Task 1: Add OwnerKeyId to the session
Classification: small
Estimated implement time: ~4 min
Parallelizable with: none (other phase-1 tasks build on the session type)
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs (add OwnerKeyId readonly prop near ClientIdentity:114)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs (set OwnerKeyId from the request identity at OpenSession, near CreateSessionId:479)
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs
Steps: TDD — failing test asserting an opened session records the creating API key id → add the property + assignment from IGatewayRequestIdentityAccessor.Current → green → dotnet build src/ZB.MOM.WW.MxGateway.Server + run session tests → commit.
Task 2: SessionEventDistributor skeleton (single pump, subscriber registry)
Classification: high-risk (concurrency / actor model)
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Create: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionEventDistributorTests.cs
Design: One background pump Task draining session.ReadEventsAsync() exactly once; a thread-safe subscriber collection where each subscriber owns a bounded Channel<MxEvent> (SingleReader=true, FullMode=Wait for the per-sub channel, but writes use non-blocking TryWrite). Register(startSequence) returns a lease (channel reader + dispose). Pump fans each drained event to all subscriber channels via TryWrite.
Steps: Failing test: two registered subscribers both receive the same fanned event; disposing one stops its delivery without affecting the other. Implement pump + registry. Green. Build + test. Commit.
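The fan-out shape described in the Design note can be sketched in a few lines. This is an illustrative Python sketch, not the planned C# type: Distributor, Lease, and pump are hypothetical names, the pump is called synchronously here rather than running as a background task, and the startSequence argument to Register is left out.

```python
import queue
import threading


class Lease:
    """Returned by register(); holds the subscriber's queue until disposed."""

    def __init__(self, distributor, sub_id, events):
        self._distributor = distributor
        self._sub_id = sub_id
        self.events = events  # bounded queue, read by one consumer

    def dispose(self):
        self._distributor.unregister(self._sub_id)


class Distributor:
    def __init__(self, capacity=16):
        self._capacity = capacity
        self._lock = threading.Lock()
        self._subscribers = {}  # sub_id -> bounded queue
        self._next_id = 0

    def register(self):
        with self._lock:
            sub_id = self._next_id
            self._next_id += 1
            events = queue.Queue(maxsize=self._capacity)
            self._subscribers[sub_id] = events
        return Lease(self, sub_id, events)

    def unregister(self, sub_id):
        with self._lock:
            self._subscribers.pop(sub_id, None)

    def pump(self, source):
        """Drain the source once, offering each event to every subscriber."""
        for event in source:
            with self._lock:
                targets = list(self._subscribers.values())
            for events in targets:
                try:
                    events.put_nowait(event)  # non-blocking, like TryWrite
                except queue.Full:
                    pass  # overflow policy is out of scope for this sketch


dist = Distributor()
first, second = dist.register(), dist.register()
dist.pump(iter(["event-1"]))
second.dispose()  # stops delivery to the second subscriber only
dist.pump(iter(["event-2"]))
```

Snapshotting the subscriber list under the lock and writing outside it keeps a slow registry update from stalling the pump, which is one reasonable way to honor the "single pump" constraint.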
Task 3: Bounded replay ring buffer
Classification: standard
Estimated implement time: ~4 min
Parallelizable with: none (extends Task 2)
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs
- Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/EventOptions.cs (add ReplayBufferCapacity, ReplayRetentionSeconds)
- Test: SessionEventDistributorTests.cs
Design: Append each fanned event to a ring keyed by worker sequence, evicting by count (ReplayBufferCapacity) or age (ReplayRetentionSeconds), whichever first. Expose TryGetReplayFrom(afterSequence, out events, out gap).
Steps: Failing test: events evicted past capacity; TryGetReplayFrom returns gap=true when requested sequence is older than the oldest retained. Implement. Green. Build+test. Commit.
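One possible reading of the eviction and gap rules, sketched in Python for illustration. ReplayRing and its method names are hypothetical, sequences are assumed to increase monotonically, and the clock is passed in explicitly so the behavior is deterministic. The gap rule used here (a gap exists when an event newer than the requested sequence has already been evicted) is an assumption about what "older than the oldest retained" should mean.

```python
from collections import deque


class ReplayRing:
    def __init__(self, capacity, retention_seconds):
        self._capacity = capacity
        self._retention = retention_seconds
        self._items = deque()  # (sequence, timestamp, event), oldest first
        self._highest_evicted = None  # newest sequence dropped so far

    def append(self, sequence, event, now):
        self._items.append((sequence, now, event))
        self._evict(now)

    def _evict(self, now):
        # Evict by count or by age, whichever applies first.
        while self._items and (
            len(self._items) > self._capacity
            or now - self._items[0][1] > self._retention
        ):
            self._highest_evicted = self._items.popleft()[0]

    def try_get_replay_from(self, after_sequence, now):
        """Return (events newer than after_sequence, gap flag)."""
        self._evict(now)
        gap = (
            self._highest_evicted is not None
            and after_sequence < self._highest_evicted
        )
        events = [(seq, ev) for seq, _, ev in self._items if seq > after_sequence]
        return events, gap


ring = ReplayRing(capacity=3, retention_seconds=10.0)
for seq in range(1, 6):  # sequences 1..5; capacity keeps only 3, 4, 5
    ring.append(seq, f"event-{seq}", now=float(seq))

recent, recent_gap = ring.try_get_replay_from(3, now=5.0)
stale, stale_gap = ring.try_get_replay_from(1, now=5.0)
```

Here `recent` holds sequences 4 and 5 with no gap, while `stale` holds 3 to 5 with the gap flag set because sequence 2 was evicted.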
Task 4: Rewire AttachEventSubscriber + EventStreamService onto the distributor
Classification: high-risk (changes the live event path)
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:386-408 (own a SessionEventDistributor; AttachEventSubscriber returns a lease wrapping distributor.Register(...))
- Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:27-101 (read the lease's channel instead of creating a per-RPC channel and draining the session directly; remove the per-RPC Channel.CreateBounded at :43-50)
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs
Steps: Failing test: a single subscriber still streams events end-to-end through the distributor (regression parity with today). Rewire. Keep per-item constraint filtering in the subscriber read loop. Green. Build + run gateway event-stream tests. Commit.
Task 5: Per-subscriber backpressure isolation
Classification: standard
Estimated implement time: ~4 min
Parallelizable with: none (extends Tasks 2/4)
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs
- Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs (overflow path :143-162)
- Test: SessionEventDistributorTests.cs
Design: On a subscriber channel TryWrite failure, complete only that subscriber's channel with EventQueueOverflow (policy DisconnectSubscriber). Retain FailFast→MarkFaulted only when the session is in legacy single-subscriber mode (back-compat).
Steps: Failing test: a slow subscriber overflows and is disconnected while a second subscriber keeps receiving and the session stays Ready. Implement. Green. Build+test. Commit.
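The DisconnectSubscriber policy can be pictured as follows. This is a Python sketch under stated assumptions: Subscriber and fan_out are hypothetical names, and "completing the channel with EventQueueOverflow" is modeled as recording a close reason on the one subscriber whose queue was full.

```python
import queue

OVERFLOW = "EventQueueOverflow"


class Subscriber:
    def __init__(self, capacity):
        self.events = queue.Queue(maxsize=capacity)
        self.closed_reason = None  # set once, when this subscriber is disconnected


def fan_out(subscribers, event):
    """Deliver to every open subscriber; disconnect only those that overflow."""
    for sub in subscribers:
        if sub.closed_reason is not None:
            continue  # already disconnected, skip without touching the others
        try:
            sub.events.put_nowait(event)
        except queue.Full:
            sub.closed_reason = OVERFLOW  # isolate the slow reader


slow = Subscriber(capacity=1)   # never drained in this example
fast = Subscriber(capacity=10)
for sequence in range(3):
    fan_out([slow, fast], sequence)
```

The slow subscriber ends up closed with one buffered event, while the fast one holds all three and stays open; nothing in the loop touches session-level state.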
Task 6: Dashboard broadcaster becomes a distributor subscriber
Classification: standard
Estimated implement time: ~4 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:131-141 (remove the inline dashboardEventBroadcaster.Publish tap)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs (register the dashboard broadcaster as a distributor subscriber on session start)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardEventBroadcaster.cs (consume from a distributor lease)
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardEventBroadcasterTests.cs
Steps: Failing test: dashboard receives session events even with no active gRPC subscriber (fixes the latent dark-feed bug). Implement. Green. Build + dashboard tests. Commit.
Phase 2 — Multi-subscriber fan-out
Task 7: Remove the validator block + add the subscriber cap option
Classification: small
Estimated implement time: ~3 min
Parallelizable with: none (Task 8 is sequential; same files)
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/GatewayOptionsValidator.cs:181-185 (delete the rejection)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs (add MaxEventSubscribersPerSession, default 8)
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Configuration/GatewayOptionsValidatorTests.cs
Steps: Failing test: AllowMultipleEventSubscribers=true now validates clean. Remove rule, add option. Green. Build+test. Commit.
Task 8: Subscriber-lease collection + cap enforcement
Classification: standard
Estimated implement time: ~4 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs (replace _activeEventSubscriberCount:16 with a lease collection; honor allowMultipleSubscribers; reject N+1 with new SessionManagerErrorCode.EventSubscriberLimitReached)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManagerErrorCode.cs
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs
Steps: Failing tests: N subscribers attach concurrently up to the cap; N+1 throws EventSubscriberLimitReached; single-subscriber mode still rejects the 2nd. Implement. Green. Build+test. Commit.
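A rough Python sketch of the cap check, for illustration only. SubscriberLeases and try_attach are hypothetical names; the exception stands in for the new error code, and modeling single-subscriber mode as a cap of one (raising the same error) is an assumption, since the plan does not say which error the legacy path returns.

```python
import threading


class EventSubscriberLimitReached(Exception):
    """Stands in for SessionManagerErrorCode.EventSubscriberLimitReached."""


class SubscriberLeases:
    def __init__(self, allow_multiple, max_subscribers=8):
        # Single-subscriber mode is modeled as a cap of one.
        self._limit = max_subscribers if allow_multiple else 1
        self._lock = threading.Lock()
        self._leases = set()
        self._next_id = 0

    def attach(self):
        # Check and add under one lock so racing attaches cannot exceed the cap.
        with self._lock:
            if len(self._leases) >= self._limit:
                raise EventSubscriberLimitReached(self._limit)
            lease_id = self._next_id
            self._next_id += 1
            self._leases.add(lease_id)
            return lease_id

    def detach(self, lease_id):
        with self._lock:
            self._leases.discard(lease_id)


def try_attach(leases):
    """Return a lease id, or None when the cap refuses the attach."""
    try:
        return leases.attach()
    except EventSubscriberLimitReached:
        return None


multi = SubscriberLeases(allow_multiple=True, max_subscribers=2)
granted = [try_attach(multi) for _ in range(3)]  # third attach is refused
multi.detach(granted[0])
after_detach = try_attach(multi)  # a freed slot can be reused

single = SubscriberLeases(allow_multiple=False)
single_granted = [try_attach(single) for _ in range(2)]
```

Replacing a bare counter with a collection of lease ids makes detach idempotent: disposing the same lease twice cannot drive the count below the true number of attached subscribers.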
Task 9: Multi-subscriber end-to-end test via FakeWorkerHarness
Classification: standard
Estimated implement time: ~4 min
Parallelizable with: none
Files:
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs
Steps: Two concurrent StreamEvents RPCs on one session both receive every worker event; one cancels, the other continues. Build + full fake-worker suite. Commit.
Phase 3 — Reconnectable sessions
Task 10: Proto — ReplayGap signal (contract change)
Classification: high-risk (contracts → all clients)
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_gateway.proto (add a ReplayGap marker: a replay_gap bool + oldest_available_sequence on the stream response, or a dedicated leading status frame)
- Regenerate: dotnet build src/ZB.MOM.WW.MxGateway.Contracts/ZB.MOM.WW.MxGateway.Contracts.csproj; commit src/ZB.MOM.WW.MxGateway.Contracts/Generated/* (net48 regen rule; see project_proto_codegen_regen)
- Test: contracts build both TFMs (net10.0;net48)
Steps: Add field(s), regenerate (delete Generated/*.cs first if needed to force regeneration), commit the generated code. Build contracts for both TFMs. Commit. This unblocks Task 12 and Task 14.
Task 11: Detach-grace session retention
Classification: high-risk (session lifecycle)
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs (add DetachGrace retention: on last-subscriber-drop, keep session alive for DetachGraceSeconds instead of closing)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs (DetachGraceSeconds; new disconnect-policy value DetachGrace)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionLeaseMonitorHostedService.cs (sweep expired detach-grace windows)
- Test: GatewaySessionTests.cs
Steps: Failing test: subscriber drop under DetachGrace keeps the session Ready until the window expires, then closes. Implement. Green. Build + session/lease tests. Commit.
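The retention window might behave roughly like this Python sketch. DetachGraceSession is a hypothetical name, the state strings mirror the Ready/closed wording above, and time is injected so the sweep is deterministic; how the real lease monitor schedules its sweep is not modeled.

```python
class DetachGraceSession:
    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.subscriber_count = 0
        self.detached_at = None  # when the last subscriber dropped, if any
        self.state = "Ready"

    def on_attach(self):
        self.subscriber_count += 1
        self.detached_at = None  # a reconnect cancels the pending window

    def on_detach(self, now):
        self.subscriber_count -= 1
        if self.subscriber_count == 0:
            self.detached_at = now  # start the grace window

    def sweep(self, now):
        """Close the session once its grace window has fully elapsed."""
        expired = (
            self.detached_at is not None
            and now - self.detached_at >= self.grace_seconds
        )
        if self.state == "Ready" and expired:
            self.state = "Closed"
        return self.state


abandoned = DetachGraceSession(grace_seconds=30.0)
abandoned.on_attach()
abandoned.on_detach(now=100.0)
state_inside_window = abandoned.sweep(now=120.0)
state_after_window = abandoned.sweep(now=131.0)

resumed = DetachGraceSession(grace_seconds=30.0)
resumed.on_attach()
resumed.on_detach(now=100.0)
resumed.on_attach()  # reconnect within the window
state_resumed = resumed.sweep(now=500.0)
```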
Task 12: Replay-on-reconnect + emit ReplayGap
Classification: high-risk
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs (on attach with AfterWorkerSequence, call distributor.TryGetReplayFrom; replay buffered events then resume live; if gap, send the ReplayGap marker first)
- Test: EventStreamServiceTests.cs
Steps: Failing tests: reconnect with a known sequence replays only newer events; reconnect past the ring horizon yields ReplayGap. Implement. Green. Build + test. Commit.
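The ordering on reconnect (gap marker first, then buffered events, then live) can be sketched as a generator. This is illustrative Python with hypothetical names and frame shapes; skipping live events whose sequence was already replayed is an assumption added here to show one way of avoiding duplicates at the replay/live seam, not something the plan specifies.

```python
def resume_stream(after_sequence, replay, gap, oldest_available, live):
    """Yield frames for a resuming subscriber in the planned order."""
    if gap:
        # Leading marker so the client knows to re-snapshot.
        yield ("replay_gap", oldest_available, None)
    last_sent = after_sequence
    for sequence, event in replay:
        yield ("event", sequence, event)
        last_sent = sequence
    for sequence, event in live:
        if sequence > last_sent:  # assumed de-duplication at the seam
            yield ("event", sequence, event)
            last_sent = sequence


no_gap = list(
    resume_stream(5, [(6, "f"), (7, "g")], False, 6, [(7, "g"), (8, "h")])
)
with_gap = list(resume_stream(2, [(6, "f")], True, 6, [(7, "g")]))
```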
Task 13: Owner re-validation on reconnect
Classification: small
Estimated implement time: ~3 min
Parallelizable with: none (same service as Task 12, different assertion; sequence after Task 12)
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs (reconnect requires caller OwnerKeyId == session owner → PermissionDenied)
- Test: EventStreamServiceTests.cs
Steps: Failing test: a different API key cannot resume someone else's session. Implement. Green. Build+test. Commit.
Task 14: Client ReplayGap handling — all 5 clients
Classification: standard
Estimated implement time: ~5 min each (dispatch as 5 parallel sub-tasks; disjoint files)
Parallelizable with: each other (14a–14e)
Files (one client each):
- 14a dotnet: clients/dotnet/.../ stream consumer + test
- 14b Go: clients/go/mxgateway/ + go test
- 14c Python: clients/python/src/.../ + pytest
- 14d Rust: clients/rust/crates/.../ + cargo test/clippy
- 14e Java: clients/java/.../ + gradle test (macOS JDK 21; revert generated MxaccessGateway.java churn per project_java_generated_churn)
Steps (each): Regenerate client stubs from the updated proto; surface ReplayGap to the caller (callback/return marker) so apps know to re-snapshot; test the gap path. Build+test that client. Commit per client.
Task 15: Reconnect integration test (fake worker)
Classification: standard
Estimated implement time: ~4 min
Parallelizable with: none
Files:
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs
Steps: Stream, drop, reconnect within grace with last sequence → no gap; reconnect after ring overflow → ReplayGap. Build + suite. Commit.
Phase 4 — Per-session ACL
Task 16: gRPC session-owner gate + all-sessions admin scope
Classification: high-risk (security)
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Grpc/MxAccessGatewayService.cs (Invoke/StreamEvents/CloseSession require caller key == session OwnerKeyId, or a key bearing a new all-sessions scope)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Security/Authorization/ (define the all-sessions scope; map it)
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs
Steps: Failing tests: foreign key gets PermissionDenied on another key's session; owner and all-sessions-scoped key succeed. Implement. Green. Build + gateway tests. Commit.
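The gate reduces to a small predicate, sketched here in Python for illustration. The scope string "sessions:all" and the function name are hypothetical; the plan only says a new all-sessions scope will be defined.

```python
ALL_SESSIONS_SCOPE = "sessions:all"  # hypothetical scope name


def is_session_call_authorized(caller_key_id, caller_scopes, owner_key_id):
    """Allow the session owner, or any key bearing the all-sessions scope."""
    if ALL_SESSIONS_SCOPE in caller_scopes:
        return True
    # Require a real owner so two missing ids never compare as equal.
    return owner_key_id is not None and caller_key_id == owner_key_id


owner_ok = is_session_call_authorized("key-a", set(), "key-a")
foreign_denied = is_session_call_authorized("key-b", set(), "key-a")
admin_ok = is_session_call_authorized("key-b", {ALL_SESSIONS_SCOPE}, "key-a")
ownerless_denied = is_session_call_authorized(None, set(), None)
```

A denied result would map to PermissionDenied on Invoke, StreamEvents, and CloseSession alike, so keeping the check in one helper avoids the three call sites drifting apart.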
Task 17: Session Tag + dashboard group→tag config
Classification: small
Estimated implement time: ~3 min
Parallelizable with: Task 16 (disjoint files)
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs (+ SessionManager to set an optional Tag from the open request)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/ Dashboard options (GroupToSessionTag map)
- Test: config-binding test
Steps: Failing test: a session carries its tag; config map binds. Implement. Green. Build+test. Commit.
Task 18: EventsHub per-session ACL + hub-token session-tag claim
Classification: standard
Estimated implement time: ~5 min
Parallelizable with: none (depends on 17)
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:39-54 (replace TODO(per-session-acl): Admin sees all; Viewer allowed only if the session's tag is in the user's GroupToSessionTag-derived allowed set)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs (mint an allowed-session-tag claim)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenAuthenticationHandler.cs (carry the claim back)
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/EventsHubTests.cs
Decision flagged in the design doc: default is "Admin all / Viewer by tag map." If the owner chose the strict variant (Viewers see nothing unless granted), invert the default here — the executor must confirm which before implementing.
Steps: Failing tests: Viewer without the tag is refused SubscribeSession; Admin allowed; Viewer with the mapped tag allowed. Implement. Green. Build + dashboard tests. Commit.
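The default variant ("Admin all / Viewer by tag map") could look like this Python sketch. The function names, group names, and tags are hypothetical, and the map is assumed to hold one tag per group; the real GroupToSessionTag shape may differ.

```python
def allowed_session_tags(user_groups, group_to_session_tag):
    """Collect the tags granted by the user's groups (one tag per group assumed)."""
    return {
        group_to_session_tag[group]
        for group in user_groups
        if group in group_to_session_tag
    }


def can_subscribe(role, user_groups, session_tag, group_to_session_tag):
    if role == "Admin":
        return True  # Admin sees all sessions
    if role == "Viewer":
        if session_tag is None:
            return False  # untagged sessions are not visible to Viewers here
        return session_tag in allowed_session_tags(user_groups, group_to_session_tag)
    return False


tag_map = {"line-1-operators": "line-1", "line-2-operators": "line-2"}
admin_allowed = can_subscribe("Admin", [], "line-2", tag_map)
viewer_allowed = can_subscribe("Viewer", ["line-1-operators"], "line-1", tag_map)
viewer_refused = can_subscribe("Viewer", ["line-1-operators"], "line-2", tag_map)
```

Under the strict variant mentioned in the decision note, only the grant rules would change; the derived allowed set would be computed the same way.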
Task 19: ACL tests incl. live LDAP users
Classification: standard
Estimated implement time: ~4 min
Parallelizable with: none
Files:
- Test: src/ZB.MOM.WW.MxGateway.IntegrationTests/DashboardLdapLiveTests.cs (extend; gated MXGATEWAY_RUN_LIVE_LDAP_TESTS=1)
Steps: With multi-role (Admin) vs gw-viewer (Viewer), assert subscribe authorization differs by session tag. Document if skipped (no live LDAP). Commit.
Phase 5 — Orphan-worker reattach (overturns the CLAUDE.md rule)
Task 20: Stable gateway-instance id + stable pipe naming
Classification: standard
Estimated implement time: ~4 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:433 (pipe name uses a persisted stable gateway-instance id instead of Environment.ProcessId)
- Create: gateway-instance-id persistence (small file/SQLite row under C:\ProgramData\MxGateway\)
- Test: SessionManagerTests.cs / a new instance-id test
Steps: Failing test: pipe name is stable across simulated restarts (same instance id). Implement. Green. Build + tests. Commit.
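A file-backed variant of the persisted id, sketched in Python for illustration. The helper names and the pipe-name format are hypothetical, and a temporary directory stands in for the real storage location; the plan leaves the choice between a small file and a SQLite row open.

```python
import os
import tempfile
import uuid


def load_or_create_instance_id(path):
    """Return the persisted id, creating it on first use."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            existing = handle.read().strip()
        if existing:
            return existing
    except FileNotFoundError:
        pass
    instance_id = uuid.uuid4().hex
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(instance_id)
    return instance_id


def pipe_name(instance_id, session_id):
    # Hypothetical format: the point is that no process id appears in it.
    return f"mxgateway-{instance_id}-{session_id}"


with tempfile.TemporaryDirectory() as directory:
    id_path = os.path.join(directory, "instance-id")
    before_restart = load_or_create_instance_id(id_path)
    after_restart = load_or_create_instance_id(id_path)  # simulated restart
```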
Task 21: Adoption manifest store (SQLite)
Classification: high-risk (persistence, security material)
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Create: src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerAdoptionManifest.cs (persist sessionId → workerPid, nonce, ownerKeyId, pipeName; upsert on launch, delete on clean close)
- Modify: gateway-auth SQLite schema/migration
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerAdoptionManifestTests.cs
Nonce is security material — store it like other secrets (no plaintext logging; standing rule).
Steps: Failing test: manifest round-trips an entry; clean close removes it. Implement. Green. Build + tests. Commit.
Task 22: Proto — worker adopt/reconnect frame (contract change)
Classification: high-risk (contracts → worker)
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_worker.proto (add an adopt/reconnect WorkerEnvelope frame: worker presents sessionId + nonce; gateway ACK/NACK)
- Regenerate + commit Generated/* (net48 rule)
- Test: contracts build both TFMs
Steps: Add frame, regen, commit generated, build both TFMs. Commit. Unblocks Tasks 24–25.
Task 23: Worker phone-home reconnect loop + self-terminate
Classification: high-risk (worker, net48/x86, windev)
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Worker/Ipc/WorkerPipeClient.cs (on pipe drop: reconnect loop with bounded backoff to the stable pipe name; present the adopt frame)
- Modify: src/ZB.MOM.WW.MxGateway.Worker/ runtime (self-terminate after MaxOrphanLifetime with no adoption)
- Test: src/ZB.MOM.WW.MxGateway.Worker.Tests/ (net48/x86 on windev)
Steps: Failing test (fake pipe server): worker retries and adopts; gives up + self-terminates past the lifetime. Build x86 + worker tests on windev. Commit. (net48: no init-only/positional records.)
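The retry-then-give-up behavior can be sketched with an injected clock. This is illustrative Python, not the net48 worker code: reconnect_until_adopted, FakeClock, and the delay parameters are hypothetical, and try_connect stands in for "connect to the stable pipe and present the adopt frame".

```python
def reconnect_until_adopted(try_connect, max_orphan_lifetime, clock, sleep,
                            initial_delay=1.0, max_delay=4.0):
    """Retry with capped exponential backoff; False means self-terminate."""
    deadline = clock() + max_orphan_lifetime
    delay = initial_delay
    while clock() < deadline:
        if try_connect():
            return True  # adopted by the gateway
        remaining = deadline - clock()
        if remaining <= 0:
            break
        sleep(min(delay, remaining))  # never sleep past the deadline
        delay = min(delay * 2, max_delay)  # bounded backoff
    return False


class FakeClock:
    """Deterministic clock so the loop can be exercised without waiting."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


attempts = iter([False, False, True])
adopt_clock = FakeClock()
adopted = reconnect_until_adopted(
    lambda: next(attempts), 10.0, adopt_clock.time, adopt_clock.sleep
)

orphan_clock = FakeClock()
gave_up = not reconnect_until_adopted(
    lambda: False, 10.0, orphan_clock.time, orphan_clock.sleep
)
```

Injecting the clock and sleep functions mirrors the fake-pipe-server test idea: the lifetime bound can be verified without real delays.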
Task 24: Gateway adoption — re-open pipes, nonce-validate, reject impostors
Classification: high-risk (security, lifecycle; windev for live)
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Create: src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerAdopter.cs (startup: read manifest, re-open pipe servers, accept adopt frames, validate nonce → adopt or reject)
- Modify: gateway startup hosted-service order (adopter runs before OrphanWorkerTerminator; terminator handles only un-adoptable/foreign workers)
- Test: src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/OrphanWorkerAdopterTests.cs
Steps: Failing tests: matching nonce adopts and rebuilds the session; mismatched nonce is rejected and the worker terminated. Implement. Green. Build + tests. Commit.
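The nonce check itself is small; a Python sketch under stated assumptions follows. The manifest is modeled as an in-memory dict rather than SQLite, the function names are hypothetical, and the constant-time comparison is a suggestion added here, not something the plan mandates.

```python
import hmac
import secrets


def issue_nonce():
    """Random per-session secret handed to the worker at launch."""
    return secrets.token_bytes(32)


def validate_adopt_frame(manifest, session_id, presented_nonce):
    """Adopt only when the session is known and the nonce matches."""
    entry = manifest.get(session_id)
    if entry is None:
        return False  # unknown session: treat as a foreign worker
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(entry["nonce"], presented_nonce)


stored = issue_nonce()
manifest = {"session-1": {"worker_pid": 4242, "nonce": stored}}

# An impostor's nonce: same length, first byte flipped.
forged = bytes([stored[0] ^ 1]) + stored[1:]

genuine_adopted = validate_adopt_frame(manifest, "session-1", stored)
impostor_rejected = not validate_adopt_frame(manifest, "session-1", forged)
unknown_rejected = not validate_adopt_frame(manifest, "session-2", stored)
```

Nothing in the sketch prints or logs the nonce, in keeping with the standing rule about secrets.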
Task 25: Resync adopted worker + ReplayGap to subscribers
Classification: standard
Estimated implement time: ~4 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerAdopter.cs (after adoption, GetSessionState/GetWorkerInfo to resync; reattached subscribers get ReplayGap since the ring is gone)
- Test: OrphanWorkerAdopterTests.cs
Steps: Failing test: adopted session reports resynced state; a resuming subscriber receives ReplayGap. Implement. Green. Build + tests. Commit.
Task 26: EnableOrphanReattach flag (default off) + terminator fallback
Classification: small
Estimated implement time: ~3 min
Parallelizable with: none
Files:
- Modify: src/ZB.MOM.WW.MxGateway.Server/Configuration/WorkerOptions.cs (EnableOrphanReattach, default false)
- Modify: src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerTerminator.cs (unchanged default behavior when reattach disabled)
- Test: OrphanWorkerTerminatorTests.cs / adopter test
Steps: Failing test: with the flag off, startup terminates (today's behavior); on, it adopts. Implement. Green. Build + tests. Commit.
Task 27: Gateway-restart reattach round-trip (integration, windev + live worker)
Classification: high-risk
Estimated implement time: ~5 min
Parallelizable with: none
Files:
- Test: src/ZB.MOM.WW.MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs (gated MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1)
Steps: Open session → simulate gateway restart → adopter re-adopts the surviving worker → session usable → subscriber gets ReplayGap then live events. Run on windev with live MXAccess. Document if skipped.
Task 28: Documented-rule reversals + stillpending refresh
Classification: trivial (doc-only)
Estimated implement time: ~3 min
Parallelizable with: none (final)
Files:
- Modify: CLAUDE.md (line ~77: reattach now supported, opt-in/bounded)
- Modify: docs/DesignDecisions.md (:63-73 reconnect, :75-80 multi-subscriber, reattach rationale → mark superseded with this design's date/commit)
- Modify: gateway.md (post-v1 revisit items: reflect what shipped)
- Modify: stillpending.md (§2 items: mark fan-out/reconnect/ACL/reattach Resolved with commit refs)
- Modify: docs/GatewayConfiguration.md (new options: MaxEventSubscribersPerSession, ReplayBufferCapacity, ReplayRetentionSeconds, DetachGraceSeconds, GroupToSessionTag, EnableOrphanReattach, MaxOrphanLifetime)
Steps: Edit docs to match shipped behavior. Commit.
Verification matrix
| Phase | Build/test | Host |
|---|---|---|
| 1–4 (gateway, clients) | dotnet build + gateway/fake-worker tests; per-client go/pytest/cargo/gradle/dotnet test | local (macOS) |
| 3/5 proto changes | regen + commit Generated/; build contracts both TFMs; rebuild touched clients | local |
| 5 worker (net48/x86) | dotnet build -p:Platform=x86 + Worker.Tests | windev |
| 5 live reattach + Phase-4 LDAP | opt-in gated integration tests | windev / live LDAP |
Final integration review
After all tasks: dispatch a final integration reviewer over git diff main..HEAD focusing on the live event path, concurrency in SessionEventDistributor, security gates (ACL + nonce adoption), and the three documented-rule reversals. Then use superpowers-extended-cc:finishing-a-development-branch.