diff --git a/docs/plans/2026-06-15-session-resilience.md b/docs/plans/2026-06-15-session-resilience.md new file mode 100644 index 0000000..098cda1 --- /dev/null +++ b/docs/plans/2026-06-15-session-resilience.md @@ -0,0 +1,417 @@ +# Session Resilience Epic — Implementation Plan + +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development (same session) or executing-plans (parallel session) to implement this plan task-by-task. + +**Goal:** Lift four deferred v1 limitations — multi-subscriber fan-out, reconnectable sessions, per-session ACL, orphan-worker reattach — onto one shared event-distribution foundation. + +**Architecture:** A per-session `SessionEventDistributor` (one pump → N per-subscriber bounded channels + a bounded replay ring) replaces today's per-RPC destructive drain. Session ownership (`OwnerKeyId`) underpins ACL, reconnect re-validation, and reattach adoption. See `docs/plans/2026-06-15-session-resilience-design.md`. + +**Tech Stack:** .NET 10 gateway (x64), .NET Framework 4.8 worker (x86, windev), SQLite auth/manifest store, gRPC + protobuf contracts (net10.0;net48), 5 language clients, Blazor/SignalR dashboard, LDAP dashboard auth. + +**Cross-platform:** Gateway, dotnet/Go/Rust/Python clients, and the Java client build/test locally on macOS (JDK 21 at `~/.local/jdks/jdk-21.0.11+10/Contents/Home`). The net48/x86 worker and worker tests build/test on **windev** (ssh alias, PowerShell). Proto changes: regenerate `Generated/`, commit it, rebuild every touched component. + +**Standing rules (from CLAUDE.md):** never log secrets/credentials/values; MXAccess parity (no synthesized events, no "fixing" surprising returns); no init-only props/positional records in net48 worker; update docs in the same change as source; branch already created (`feat/session-resilience`); per-task commits; build+test affected components before marking done. + +--- + +## Phase 1 — Foundation (refactor; no external behavior change except the dashboard-dark fix) + +### Task 1: Add `OwnerKeyId` to the session + +**Classification:** small +**Estimated implement time:** ~4 min +**Parallelizable with:** none (other phase-1 tasks build on the session type) + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (add `OwnerKeyId` readonly prop near `ClientIdentity:114`) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs` (set `OwnerKeyId` from the request identity at `OpenSession`, near `CreateSessionId:479`) +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs` + +**Steps:** TDD — failing test asserting an opened session records the creating API key id → add the property + assignment from `IGatewayRequestIdentityAccessor.Current` → green → `dotnet build src/ZB.MOM.WW.MxGateway.Server` + run session tests → commit. + +### Task 2: `SessionEventDistributor` skeleton (single pump, subscriber registry) + +**Classification:** high-risk (concurrency / actor model) +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Create: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs` +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionEventDistributorTests.cs` + +**Design:** One background pump `Task` draining `session.ReadEventsAsync()` exactly once; a thread-safe subscriber collection where each subscriber owns a bounded `Channel` (`SingleReader=true`, `FullMode=Wait` for the per-sub channel, but writes use non-blocking `TryWrite`). `Register(startSequence)` returns a lease (channel reader + dispose). Pump fans each drained event to all subscriber channels via `TryWrite`. + +**Steps:** Failing test: two registered subscribers both receive the same fanned event; disposing one stops its delivery without affecting the other. Implement pump + registry. Green. Build + test. Commit. + +### Task 3: Bounded replay ring buffer + +**Classification:** standard +**Estimated implement time:** ~4 min +**Parallelizable with:** none (extends Task 2) + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs` +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/EventOptions.cs` (add `ReplayBufferCapacity`, `ReplayRetentionSeconds`) +- Test: `SessionEventDistributorTests.cs` + +**Design:** Append each fanned event to a ring keyed by worker sequence, evicting by count (`ReplayBufferCapacity`) or age (`ReplayRetentionSeconds`), whichever first. Expose `TryGetReplayFrom(afterSequence, out events, out gap)`. + +**Steps:** Failing test: events evicted past capacity; `TryGetReplayFrom` returns `gap=true` when requested sequence is older than the oldest retained. Implement. Green. Build+test. Commit. + +### Task 4: Rewire `AttachEventSubscriber` + `EventStreamService` onto the distributor + +**Classification:** high-risk (changes the live event path) +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:386-408` (own a `SessionEventDistributor`; `AttachEventSubscriber` returns a lease wrapping `distributor.Register(...)`) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:27-101` (read the lease's channel instead of creating a per-RPC channel and draining the session directly; remove the per-RPC `Channel.CreateBounded` at `:43-50`) +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs` + +**Steps:** Failing test: a single subscriber still streams events end-to-end through the distributor (regression parity with today). Rewire. Keep per-item constraint filtering in the subscriber read loop. Green. Build + run gateway event-stream tests. Commit. + +### Task 5: Per-subscriber backpressure isolation + +**Classification:** standard +**Estimated implement time:** ~4 min +**Parallelizable with:** none (extends Tasks 2/4) + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs` +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs` (overflow path `:143-162`) +- Test: `SessionEventDistributorTests.cs` + +**Design:** On a subscriber channel `TryWrite` failure, complete only that subscriber's channel with `EventQueueOverflow` (policy `DisconnectSubscriber`). Retain `FailFast`→`MarkFaulted` only when the session is in legacy single-subscriber mode (back-compat). + +**Steps:** Failing test: a slow subscriber overflows and is disconnected while a second subscriber keeps receiving and the session stays `Ready`. Implement. Green. Build+test. Commit. + +### Task 6: Dashboard broadcaster becomes a distributor subscriber + +**Classification:** standard +**Estimated implement time:** ~4 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:131-141` (remove the inline `dashboardEventBroadcaster.Publish` tap) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (register the dashboard broadcaster as a distributor subscriber on session start) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardEventBroadcaster.cs` (consume from a distributor lease) +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardEventBroadcasterTests.cs` + +**Steps:** Failing test: dashboard receives session events even with **no** active gRPC subscriber (fixes the latent dark-feed bug). Implement. Green. Build + dashboard tests. Commit. + +--- + +## Phase 2 — Multi-subscriber fan-out + +### Task 7: Remove the validator block + add the subscriber cap option + +**Classification:** small +**Estimated implement time:** ~3 min +**Parallelizable with:** Task 8 is sequential (same files); none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/GatewayOptionsValidator.cs:181-185` (delete the rejection) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs` (add `MaxEventSubscribersPerSession`, default 8) +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Configuration/GatewayOptionsValidatorTests.cs` + +**Steps:** Failing test: `AllowMultipleEventSubscribers=true` now validates clean. Remove rule, add option. Green. Build+test. Commit. + +### Task 8: Subscriber-lease collection + cap enforcement + +**Classification:** standard +**Estimated implement time:** ~4 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (replace `_activeEventSubscriberCount:16` with a lease collection; honor `allowMultipleSubscribers`; reject N+1 with new `SessionManagerErrorCode.EventSubscriberLimitReached`) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManagerErrorCode.cs` +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs` + +**Steps:** Failing tests: N subscribers attach concurrently up to the cap; N+1 throws `EventSubscriberLimitReached`; single-subscriber mode still rejects the 2nd. Implement. Green. Build+test. Commit. + +### Task 9: Multi-subscriber end-to-end test via FakeWorkerHarness + +**Classification:** standard +**Estimated implement time:** ~4 min +**Parallelizable with:** none + +**Files:** +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs` + +**Steps:** Two concurrent `StreamEvents` RPCs on one session both receive every worker event; one cancels, the other continues. Build + full fake-worker suite. Commit. + +--- + +## Phase 3 — Reconnectable sessions + +### Task 10: Proto — `ReplayGap` signal (contract change) + +**Classification:** high-risk (contracts → all clients) +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_gateway.proto` (add a `ReplayGap` marker — a `replay_gap` bool + `oldest_available_sequence` on the stream response, or a dedicated leading status frame) +- Regenerate: `dotnet build src/ZB.MOM.WW.MxGateway.Contracts/ZB.MOM.WW.MxGateway.Contracts.csproj`; **commit** `src/ZB.MOM.WW.MxGateway.Contracts/Generated/*` (net48 regen rule — see `project_proto_codegen_regen`) +- Test: contracts build both TFMs (net10.0;net48) + +**Steps:** Add field(s), regen, `del Generated/*.cs` if needed to force regen, commit generated. Build contracts both TFMs. Commit. **This unblocks Task 11 and Task 14.** + +### Task 11: Detach-grace session retention + +**Classification:** high-risk (session lifecycle) +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (add `DetachGrace` retention: on last-subscriber-drop, keep session alive for `DetachGraceSeconds` instead of closing) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs` (`DetachGraceSeconds`; new disconnect-policy value `DetachGrace`) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionLeaseMonitorHostedService.cs` (sweep expired detach-grace windows) +- Test: `GatewaySessionTests.cs` + +**Steps:** Failing test: subscriber drop under `DetachGrace` keeps the session `Ready` until the window expires, then closes. Implement. Green. Build + session/lease tests. Commit. + +### Task 12: Replay-on-reconnect + emit `ReplayGap` + +**Classification:** high-risk +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs` (on attach with `AfterWorkerSequence`, call `distributor.TryGetReplayFrom`; replay buffered events then resume live; if `gap`, send the `ReplayGap` marker first) +- Test: `EventStreamServiceTests.cs` + +**Steps:** Failing tests: reconnect with a known sequence replays only newer events; reconnect past the ring horizon yields `ReplayGap`. Implement. Green. Build + test. Commit. + +### Task 13: Owner re-validation on reconnect + +**Classification:** small +**Estimated implement time:** ~3 min +**Parallelizable with:** Task 12 (different assertion in same service — sequence after 12) + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs` (reconnect requires caller `OwnerKeyId` == session owner → `PermissionDenied`) +- Test: `EventStreamServiceTests.cs` + +**Steps:** Failing test: a different API key cannot resume someone else's session. Implement. Green. Build+test. Commit. + +### Task 14: Client `ReplayGap` handling — all 5 clients + +**Classification:** standard +**Estimated implement time:** ~5 min each (dispatch as 5 parallel sub-tasks; disjoint files) +**Parallelizable with:** each other (14a–14e) + +**Files (one client each):** +- 14a dotnet: `clients/dotnet/.../` stream consumer + test +- 14b Go: `clients/go/mxgateway/` + `go test` +- 14c Python: `clients/python/src/.../` + `pytest` +- 14d Rust: `clients/rust/crates/.../` + `cargo test`/clippy +- 14e Java: `clients/java/.../` + `gradle test` (macOS JDK 21; **revert generated `MxaccessGateway.java` churn** per `project_java_generated_churn`) + +**Steps (each):** Regenerate client stubs from the updated proto; surface `ReplayGap` to the caller (callback/return marker) so apps know to re-snapshot; test the gap path. Build+test that client. Commit per client. + +### Task 15: Reconnect integration test (fake worker) + +**Classification:** standard +**Estimated implement time:** ~4 min +**Parallelizable with:** none + +**Files:** +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs` + +**Steps:** Stream, drop, reconnect within grace with last sequence → no gap; reconnect after ring overflow → `ReplayGap`. Build + suite. Commit. + +--- + +## Phase 4 — Per-session ACL + +### Task 16: gRPC session-owner gate + all-sessions admin scope + +**Classification:** high-risk (security) +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/MxAccessGatewayService.cs` (`Invoke`/`StreamEvents`/`CloseSession` require caller key == session `OwnerKeyId`, or a key bearing a new all-sessions scope) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Security/Authorization/` (define the all-sessions scope; map it) +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs` + +**Steps:** Failing tests: foreign key gets `PermissionDenied` on another key's session; owner and all-sessions-scoped key succeed. Implement. Green. Build + gateway tests. Commit. + +### Task 17: Session `Tag` + dashboard group→tag config + +**Classification:** small +**Estimated implement time:** ~3 min +**Parallelizable with:** Task 16 (disjoint files) + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (+`SessionManager` to set an optional `Tag` from the open request) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/` Dashboard options (`GroupToSessionTag` map) +- Test: config-binding test + +**Steps:** Failing test: a session carries its tag; config map binds. Implement. Green. Build+test. Commit. + +### Task 18: EventsHub per-session ACL + hub-token session-tag claim + +**Classification:** standard +**Estimated implement time:** ~5 min +**Parallelizable with:** none (depends on 17) + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:39-54` (replace `TODO(per-session-acl)`: Admin sees all; Viewer allowed only if the session's tag is in the user's `GroupToSessionTag`-derived allowed set) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs` (mint an allowed-session-tag claim) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenAuthenticationHandler.cs` (carry the claim back) +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/EventsHubTests.cs` + +> **Decision flagged in the design doc:** default is "Admin all / Viewer by tag map." If the owner chose the strict variant (Viewers see nothing unless granted), invert the default here — the executor must confirm which before implementing. + +**Steps:** Failing tests: Viewer without the tag is refused `SubscribeSession`; Admin allowed; Viewer with the mapped tag allowed. Implement. Green. Build + dashboard tests. Commit. + +### Task 19: ACL tests incl. live LDAP users + +**Classification:** standard +**Estimated implement time:** ~4 min +**Parallelizable with:** none + +**Files:** +- Test: `src/ZB.MOM.WW.MxGateway.IntegrationTests/DashboardLdapLiveTests.cs` (extend; gated `MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`) + +**Steps:** With `multi-role` (Admin) vs `gw-viewer` (Viewer), assert subscribe authorization differs by session tag. Document if skipped (no live LDAP). Commit. + +--- + +## Phase 5 — Orphan-worker reattach (overturns the CLAUDE.md rule) + +### Task 20: Stable gateway-instance id + stable pipe naming + +**Classification:** standard +**Estimated implement time:** ~4 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:433` (pipe name uses a persisted stable gateway-instance id instead of `Environment.ProcessId`) +- Create: gateway-instance-id persistence (small file/SQLite row under `C:\ProgramData\MxGateway\`) +- Test: `SessionManagerTests.cs` / a new instance-id test + +**Steps:** Failing test: pipe name is stable across simulated restarts (same instance id). Implement. Green. Build + tests. Commit. + +### Task 21: Adoption manifest store (SQLite) + +**Classification:** high-risk (persistence, security material) +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Create: `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerAdoptionManifest.cs` (persist `sessionId → workerPid, nonce, ownerKeyId, pipeName`; upsert on launch, delete on clean close) +- Modify: gateway-auth SQLite schema/migration +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerAdoptionManifestTests.cs` + +> Nonce is security material — store it like other secrets (no plaintext logging; standing rule). + +**Steps:** Failing test: manifest round-trips an entry; clean close removes it. Implement. Green. Build + tests. Commit. + +### Task 22: Proto — worker adopt/reconnect frame (contract change) + +**Classification:** high-risk (contracts → worker) +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_worker.proto` (add an adopt/reconnect `WorkerEnvelope` frame: worker presents `sessionId` + `nonce`; gateway ACK/NACK) +- Regenerate + **commit** `Generated/*` (net48 rule) +- Test: contracts build both TFMs + +**Steps:** Add frame, regen, commit generated, build both TFMs. Commit. **Unblocks Tasks 24–25.** + +### Task 23: Worker phone-home reconnect loop + self-terminate + +**Classification:** high-risk (worker, net48/x86 — **windev**) +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Worker/Ipc/WorkerPipeClient.cs` (on pipe drop: reconnect loop with bounded backoff to the stable pipe name; present the adopt frame) +- Modify: `src/ZB.MOM.WW.MxGateway.Worker/` runtime (self-terminate after `MaxOrphanLifetime` with no adoption) +- Test: `src/ZB.MOM.WW.MxGateway.Worker.Tests/` (net48/x86 on windev) + +**Steps:** Failing test (fake pipe server): worker retries and adopts; gives up + self-terminates past the lifetime. Build x86 + worker tests on **windev**. Commit. *(net48: no init-only/positional records.)* + +### Task 24: Gateway adoption — re-open pipes, nonce-validate, reject impostors + +**Classification:** high-risk (security, lifecycle — **windev** for live) +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Create: `src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerAdopter.cs` (startup: read manifest, re-open pipe servers, accept adopt frames, validate nonce → adopt or reject) +- Modify: gateway startup hosted-service order (adopter runs **before** `OrphanWorkerTerminator`; terminator handles only un-adoptable/foreign workers) +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/OrphanWorkerAdopterTests.cs` + +**Steps:** Failing tests: matching nonce adopts and rebuilds the session; mismatched nonce is rejected and the worker terminated. Implement. Green. Build + tests. Commit. + +### Task 25: Resync adopted worker + `ReplayGap` to subscribers + +**Classification:** standard +**Estimated implement time:** ~4 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerAdopter.cs` (after adoption, `GetSessionState`/`GetWorkerInfo` to resync; reattached subscribers get `ReplayGap` since the ring is gone) +- Test: `OrphanWorkerAdopterTests.cs` + +**Steps:** Failing test: adopted session reports resynced state; a resuming subscriber receives `ReplayGap`. Implement. Green. Build + tests. Commit. + +### Task 26: `EnableOrphanReattach` flag (default off) + terminator fallback + +**Classification:** small +**Estimated implement time:** ~3 min +**Parallelizable with:** none + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/WorkerOptions.cs` (`EnableOrphanReattach`, default `false`) +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerTerminator.cs` (unchanged default behavior when reattach disabled) +- Test: `OrphanWorkerTerminatorTests.cs` / adopter test + +**Steps:** Failing test: with the flag off, startup terminates (today's behavior); on, it adopts. Implement. Green. Build + tests. Commit. + +### Task 27: Gateway-restart reattach round-trip (integration, **windev** + live worker) + +**Classification:** high-risk +**Estimated implement time:** ~5 min +**Parallelizable with:** none + +**Files:** +- Test: `src/ZB.MOM.WW.MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` (gated `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`) + +**Steps:** Open session → simulate gateway restart → adopter re-adopts the surviving worker → session usable → subscriber gets `ReplayGap` then live events. Run on **windev** with live MXAccess. Document if skipped. + +### Task 28: Documented-rule reversals + stillpending refresh + +**Classification:** trivial (doc-only) +**Estimated implement time:** ~3 min +**Parallelizable with:** none (final) + +**Files:** +- Modify: `CLAUDE.md` (line ~77 — reattach now supported, opt-in/bounded) +- Modify: `docs/DesignDecisions.md` (`:63-73` reconnect, `:75-80` multi-subscriber, reattach rationale → mark superseded with this design's date/commit) +- Modify: `gateway.md` (post-v1 revisit items — reflect what shipped) +- Modify: `stillpending.md` (§2 items: mark fan-out/reconnect/ACL/reattach Resolved with commit refs) +- Modify: `docs/GatewayConfiguration.md` (new options: `MaxEventSubscribersPerSession`, `ReplayBufferCapacity`, `ReplayRetentionSeconds`, `DetachGraceSeconds`, `GroupToSessionTag`, `EnableOrphanReattach`, `MaxOrphanLifetime`) + +**Steps:** Edit docs to match shipped behavior. Commit. + +--- + +## Verification matrix + +| Phase | Build/test | Host | +|---|---|---| +| 1–4 (gateway, clients) | `dotnet build` + gateway/fake-worker tests; per-client `go/pytest/cargo/gradle/dotnet test` | local (macOS) | +| 3/5 proto changes | regen + commit `Generated/`; build contracts both TFMs; rebuild touched clients | local | +| 5 worker (net48/x86) | `dotnet build -p:Platform=x86` + `Worker.Tests` | **windev** | +| 5 live reattach + Phase-4 LDAP | opt-in gated integration tests | **windev** / live LDAP | + +## Final integration review + +After all tasks: dispatch a final integration reviewer over `git diff main..HEAD` focusing on the live event path, concurrency in `SessionEventDistributor`, security gates (ACL + nonce adoption), and the three documented-rule reversals. Then use superpowers-extended-cc:finishing-a-development-branch. diff --git a/docs/plans/2026-06-15-session-resilience.md.tasks.json b/docs/plans/2026-06-15-session-resilience.md.tasks.json new file mode 100644 index 0000000..fe90793 --- /dev/null +++ b/docs/plans/2026-06-15-session-resilience.md.tasks.json @@ -0,0 +1,34 @@ +{ + "planPath": "docs/plans/2026-06-15-session-resilience.md", + "tasks": [ + {"id": 108, "subject": "Task 1: Add OwnerKeyId to the session", "status": "pending"}, + {"id": 109, "subject": "Task 2: SessionEventDistributor skeleton", "status": "pending", "blockedBy": [108]}, + {"id": 110, "subject": "Task 3: Bounded replay ring buffer", "status": "pending", "blockedBy": [109]}, + {"id": 111, "subject": "Task 4: Rewire AttachEventSubscriber + EventStreamService onto distributor", "status": "pending", "blockedBy": [110]}, + {"id": 112, "subject": "Task 5: Per-subscriber backpressure isolation", "status": "pending", "blockedBy": [111]}, + {"id": 113, "subject": "Task 6: Dashboard broadcaster becomes a distributor subscriber", "status": "pending", "blockedBy": [111]}, + {"id": 114, "subject": "Task 7: Remove validator block + add subscriber cap option", "status": "pending", "blockedBy": [112]}, + {"id": 115, "subject": "Task 8: Subscriber-lease collection + cap enforcement", "status": "pending", "blockedBy": [114]}, + {"id": 116, "subject": "Task 9: Multi-subscriber end-to-end test (FakeWorkerHarness)", "status": "pending", "blockedBy": [115]}, + {"id": 117, "subject": "Task 10: Proto - ReplayGap signal", "status": "pending", "blockedBy": [116]}, + {"id": 118, "subject": "Task 11: Detach-grace session retention", "status": "pending", "blockedBy": [117]}, + {"id": 119, "subject": "Task 12: Replay-on-reconnect + emit ReplayGap", "status": "pending", "blockedBy": [118, 110]}, + {"id": 120, "subject": "Task 13: Owner re-validation on reconnect", "status": "pending", "blockedBy": [119, 108]}, + {"id": 121, "subject": "Task 14: Client ReplayGap handling - all 5 clients", "status": "pending", "blockedBy": [117]}, + {"id": 122, "subject": "Task 15: Reconnect integration test (fake worker)", "status": "pending", "blockedBy": [119]}, + {"id": 123, "subject": "Task 16: gRPC session-owner gate + all-sessions admin scope", "status": "pending", "blockedBy": [116, 108]}, + {"id": 124, "subject": "Task 17: Session Tag + dashboard group-to-tag config", "status": "pending", "blockedBy": [116]}, + {"id": 125, "subject": "Task 18: EventsHub per-session ACL + hub-token tag claim", "status": "pending", "blockedBy": [124]}, + {"id": 126, "subject": "Task 19: ACL tests incl. live LDAP users", "status": "pending", "blockedBy": [125]}, + {"id": 127, "subject": "Task 20: Stable gateway-instance id + stable pipe naming", "status": "pending", "blockedBy": [126]}, + {"id": 128, "subject": "Task 21: Adoption manifest store (SQLite)", "status": "pending", "blockedBy": [127]}, + {"id": 129, "subject": "Task 22: Proto - worker adopt/reconnect frame", "status": "pending", "blockedBy": [128]}, + {"id": 130, "subject": "Task 23: Worker phone-home reconnect loop + self-terminate", "status": "pending", "blockedBy": [129]}, + {"id": 131, "subject": "Task 24: Gateway adoption - re-open pipes, nonce-validate, reject impostors", "status": "pending", "blockedBy": [130]}, + {"id": 132, "subject": "Task 25: Resync adopted worker + ReplayGap to subscribers", "status": "pending", "blockedBy": [131, 119]}, + {"id": 133, "subject": "Task 26: EnableOrphanReattach flag (default off) + terminator fallback", "status": "pending", "blockedBy": [131]}, + {"id": 134, "subject": "Task 27: Gateway-restart reattach round-trip (WINDEV + live worker)", "status": "pending", "blockedBy": [132, 133]}, + {"id": 135, "subject": "Task 28: Documented-rule reversals + stillpending refresh", "status": "pending", "blockedBy": [134]} + ], + "lastUpdated": "2026-06-15" +}