docs: session-resilience implementation plan (28 tasks, 5 phases)

Joseph Doherty
2026-06-15 12:15:34 -04:00
parent 3fc6ccad30
commit 00c849e63b
2 changed files with 451 additions and 0 deletions
@@ -0,0 +1,417 @@
# Session Resilience Epic — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development (same session) or executing-plans (parallel session) to implement this plan task-by-task.
**Goal:** Lift four deferred v1 limitations — multi-subscriber fan-out, reconnectable sessions, per-session ACL, orphan-worker reattach — onto one shared event-distribution foundation.
**Architecture:** A per-session `SessionEventDistributor` (one pump → N per-subscriber bounded channels + a bounded replay ring) replaces today's per-RPC destructive drain. Session ownership (`OwnerKeyId`) underpins ACL, reconnect re-validation, and reattach adoption. See `docs/plans/2026-06-15-session-resilience-design.md`.
**Tech Stack:** .NET 10 gateway (x64), .NET Framework 4.8 worker (x86, windev), SQLite auth/manifest store, gRPC + protobuf contracts (net10.0;net48), 5 language clients, Blazor/SignalR dashboard, LDAP dashboard auth.
**Cross-platform:** Gateway, dotnet/Go/Rust/Python clients, and the Java client build/test locally on macOS (JDK 21 at `~/.local/jdks/jdk-21.0.11+10/Contents/Home`). The net48/x86 worker and worker tests build/test on **windev** (ssh alias, PowerShell). Proto changes: regenerate `Generated/`, commit it, rebuild every touched component.
**Standing rules (from CLAUDE.md):** never log secrets/credentials/values; MXAccess parity (no synthesized events, no "fixing" surprising returns); no init-only props/positional records in net48 worker; update docs in the same change as source; branch already created (`feat/session-resilience`); per-task commits; build+test affected components before marking done.
---
## Phase 1 — Foundation (refactor; no external behavior change except the dashboard-dark fix)
### Task 1: Add `OwnerKeyId` to the session
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** none (other phase-1 tasks build on the session type)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (add `OwnerKeyId` readonly prop near `ClientIdentity:114`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs` (set `OwnerKeyId` from the request identity at `OpenSession`, near `CreateSessionId:479`)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs`
**Steps:** TDD — failing test asserting an opened session records the creating API key id → add the property + assignment from `IGatewayRequestIdentityAccessor.Current` → green → `dotnet build src/ZB.MOM.WW.MxGateway.Server` + run session tests → commit.
### Task 2: `SessionEventDistributor` skeleton (single pump, subscriber registry)
**Classification:** high-risk (concurrency / actor model)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Create: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs`
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionEventDistributorTests.cs`
**Design:** One background pump `Task` draining `session.ReadEventsAsync()` exactly once; a thread-safe subscriber collection where each subscriber owns a bounded `Channel<MxEvent>` (`SingleReader=true`, `FullMode=Wait` for the per-sub channel, but writes use non-blocking `TryWrite`). `Register(startSequence)` returns a lease (channel reader + dispose). Pump fans each drained event to all subscriber channels via `TryWrite`.
**Steps:** Failing test: two registered subscribers both receive the same fanned event; disposing one stops its delivery without affecting the other. Implement pump + registry. Green. Build + test. Commit.
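The fan-out described in the design can be modelled in a few lines. This is an illustrative Python sketch of the logic only, not the C# implementation: `Distributor`, `register`, `dispose`, `fan_out`, and `pump` are hypothetical names, and `queue.Queue.put_nowait` stands in for the non-blocking `TryWrite`.

```python
import queue
import threading


class Distributor:
    """Toy model of the fan-out: one producer, N bounded per-subscriber queues."""

    def __init__(self, capacity=4):
        self._capacity = capacity
        self._subscribers = {}          # lease id -> bounded queue
        self._lock = threading.Lock()   # guards the registry only
        self._next_id = 0

    def register(self):
        """Return (lease_id, queue); the queue plays the role of the channel reader."""
        with self._lock:
            lease_id = self._next_id
            self._next_id += 1
            q = queue.Queue(maxsize=self._capacity)
            self._subscribers[lease_id] = q
            return lease_id, q

    def dispose(self, lease_id):
        """Stop delivery to one subscriber without touching the others."""
        with self._lock:
            self._subscribers.pop(lease_id, None)

    def fan_out(self, event):
        # Snapshot the registry so delivery happens outside the lock.
        with self._lock:
            targets = list(self._subscribers.values())
        for q in targets:
            try:
                q.put_nowait(event)     # non-blocking write
            except queue.Full:
                pass                    # overflow policy is handled in Task 5

    def pump(self, source):
        """Drain the source exactly once, fanning each event to all subscribers."""
        for event in source:
            self.fan_out(event)
```

Snapshotting the registry before writing keeps `register`/`dispose` from blocking behind a slow delivery loop, which is the property the two-subscriber test exercises.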
### Task 3: Bounded replay ring buffer
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none (extends Task 2)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs`
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/EventOptions.cs` (add `ReplayBufferCapacity`, `ReplayRetentionSeconds`)
- Test: `SessionEventDistributorTests.cs`
**Design:** Append each fanned event to a ring keyed by worker sequence, evicting by count (`ReplayBufferCapacity`) or age (`ReplayRetentionSeconds`), whichever first. Expose `TryGetReplayFrom(afterSequence, out events, out gap)`.
**Steps:** Failing test: events evicted past capacity; `TryGetReplayFrom` returns `gap=true` when requested sequence is older than the oldest retained. Implement. Green. Build+test. Commit.
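A minimal model of the ring and its gap rule, written in Python for illustration (the real type is C#). It assumes worker sequences are contiguous and reports a gap when at least one event newer than `after_sequence` has already been evicted; `ReplayRing` and the injected `clock` are hypothetical.

```python
import time
from collections import deque


class ReplayRing:
    """Bounded replay buffer evicting by count or age, whichever comes first."""

    def __init__(self, capacity, retention_seconds, clock=time.monotonic):
        self._capacity = capacity
        self._retention = retention_seconds
        self._clock = clock
        self._items = deque()        # (sequence, appended_at, event), ascending
        self._last_evicted = None    # highest sequence dropped so far

    def append(self, sequence, event):
        self._items.append((sequence, self._clock(), event))
        self._evict()

    def _evict(self):
        now = self._clock()
        while self._items and (
            len(self._items) > self._capacity
            or now - self._items[0][1] > self._retention
        ):
            sequence, _, _ = self._items.popleft()
            self._last_evicted = sequence

    def try_get_replay_from(self, after_sequence):
        """Return (events, gap) for everything newer than after_sequence."""
        self._evict()
        gap = self._last_evicted is not None and after_sequence < self._last_evicted
        events = [e for seq, _, e in self._items if seq > after_sequence]
        return events, gap
```

Injecting the clock keeps the age-eviction test deterministic; a test can advance a fake clock instead of sleeping.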
### Task 4: Rewire `AttachEventSubscriber` + `EventStreamService` onto the distributor
**Classification:** high-risk (changes the live event path)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:386-408` (own a `SessionEventDistributor`; `AttachEventSubscriber` returns a lease wrapping `distributor.Register(...)`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:27-101` (read the lease's channel instead of creating a per-RPC channel and draining the session directly; remove the per-RPC `Channel.CreateBounded` at `:43-50`)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/EventStreamServiceTests.cs`
**Steps:** Failing test: a single subscriber still streams events end-to-end through the distributor (regression parity with today). Rewire. Keep per-item constraint filtering in the subscriber read loop. Green. Build + run gateway event-stream tests. Commit.
### Task 5: Per-subscriber backpressure isolation
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none (extends Tasks 2/4)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionEventDistributor.cs`
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs` (overflow path `:143-162`)
- Test: `SessionEventDistributorTests.cs`
**Design:** On a subscriber channel `TryWrite` failure, complete only that subscriber's channel with `EventQueueOverflow` (policy `DisconnectSubscriber`). Retain `FailFast` → `MarkFaulted` only when the session is in legacy single-subscriber mode (back-compat).
**Steps:** Failing test: a slow subscriber overflows and is disconnected while a second subscriber keeps receiving and the session stays `Ready`. Implement. Green. Build+test. Commit.
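The isolation rule can be sketched as follows. This Python model is illustrative only; `Subscriber`, `completed_with`, and `fan_out` are hypothetical names, and the legacy `FailFast` path is deliberately left out.

```python
import queue


class Subscriber:
    def __init__(self, capacity):
        self.queue = queue.Queue(maxsize=capacity)
        self.completed_with = None   # None while the subscriber is live


def fan_out(subscribers, event):
    """Deliver to every live subscriber; disconnect only the ones that overflow."""
    for sub in subscribers:
        if sub.completed_with is not None:
            continue                 # already disconnected, skip silently
        try:
            sub.queue.put_nowait(event)
        except queue.Full:
            # DisconnectSubscriber policy: fail this subscriber, not the session.
            sub.completed_with = "EventQueueOverflow"
```

The session state is never touched in this path, which is what lets the second subscriber keep receiving while the session stays `Ready`.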
### Task 6: Dashboard broadcaster becomes a distributor subscriber
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs:131-141` (remove the inline `dashboardEventBroadcaster.Publish` tap)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (register the dashboard broadcaster as a distributor subscriber on session start)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/DashboardEventBroadcaster.cs` (consume from a distributor lease)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardEventBroadcasterTests.cs`
**Steps:** Failing test: dashboard receives session events even with **no** active gRPC subscriber (fixes the latent dark-feed bug). Implement. Green. Build + dashboard tests. Commit.
---
## Phase 2 — Multi-subscriber fan-out
### Task 7: Remove the validator block + add the subscriber cap option
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** none (Task 8 is sequential: same files)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/GatewayOptionsValidator.cs:181-185` (delete the rejection)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs` (add `MaxEventSubscribersPerSession`, default 8)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Configuration/GatewayOptionsValidatorTests.cs`
**Steps:** Failing test: `AllowMultipleEventSubscribers=true` now validates clean. Remove rule, add option. Green. Build+test. Commit.
### Task 8: Subscriber-lease collection + cap enforcement
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (replace `_activeEventSubscriberCount:16` with a lease collection; honor `allowMultipleSubscribers`; reject N+1 with new `SessionManagerErrorCode.EventSubscriberLimitReached`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManagerErrorCode.cs`
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/GatewaySessionTests.cs`
**Steps:** Failing tests: N subscribers attach concurrently up to the cap; N+1 throws `EventSubscriberLimitReached`; single-subscriber mode still rejects the 2nd. Implement. Green. Build+test. Commit.
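A sketch of the cap check, in Python for illustration. `SubscriberLeases` is a hypothetical stand-in for the lease collection, and reusing one exception for the single-subscriber rejection is an assumption; the plan does not say which error that mode returns.

```python
import threading


class EventSubscriberLimitReached(Exception):
    """Stand-in for SessionManagerErrorCode.EventSubscriberLimitReached."""


class SubscriberLeases:
    def __init__(self, allow_multiple, max_subscribers):
        # Single-subscriber mode is modelled as a cap of one.
        self._limit = max_subscribers if allow_multiple else 1
        self._leases = set()
        self._lock = threading.Lock()
        self._next_id = 0

    def attach(self):
        # Check and insert under one lock so concurrent attaches cannot exceed the cap.
        with self._lock:
            if len(self._leases) >= self._limit:
                raise EventSubscriberLimitReached(f"limit {self._limit} reached")
            lease = self._next_id
            self._next_id += 1
            self._leases.add(lease)
            return lease

    def detach(self, lease):
        with self._lock:
            self._leases.discard(lease)
```

Doing the count check and the insert inside the same critical section is what makes the concurrent-attach test meaningful; a separate counter read followed by an add would race.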
### Task 9: Multi-subscriber end-to-end test via FakeWorkerHarness
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs`
**Steps:** Two concurrent `StreamEvents` RPCs on one session both receive every worker event; one cancels, the other continues. Build + full fake-worker suite. Commit.
---
## Phase 3 — Reconnectable sessions
### Task 10: Proto — `ReplayGap` signal (contract change)
**Classification:** high-risk (contracts → all clients)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_gateway.proto` (add a `ReplayGap` marker — a `replay_gap` bool + `oldest_available_sequence` on the stream response, or a dedicated leading status frame)
- Regenerate: `dotnet build src/ZB.MOM.WW.MxGateway.Contracts/ZB.MOM.WW.MxGateway.Contracts.csproj`; **commit** `src/ZB.MOM.WW.MxGateway.Contracts/Generated/*` (net48 regen rule — see `project_proto_codegen_regen`)
- Test: contracts build both TFMs (net10.0;net48)
**Steps:** Add field(s), regen, `del Generated/*.cs` if needed to force regen, commit generated. Build contracts both TFMs. Commit. **This unblocks Task 11 and Task 14.**
### Task 11: Detach-grace session retention
**Classification:** high-risk (session lifecycle)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (add `DetachGrace` retention: on last-subscriber-drop, keep session alive for `DetachGraceSeconds` instead of closing)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/SessionOptions.cs` (`DetachGraceSeconds`; new disconnect-policy value `DetachGrace`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionLeaseMonitorHostedService.cs` (sweep expired detach-grace windows)
- Test: `GatewaySessionTests.cs`
**Steps:** Failing test: subscriber drop under `DetachGrace` keeps the session `Ready` until the window expires, then closes. Implement. Green. Build + session/lease tests. Commit.
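The retention window can be modelled as a timestamp set on last-subscriber-drop and cleared on re-attach. This Python sketch is illustrative; `Session`, `sweep`, and the injected `clock` are hypothetical, and the real sweep lives in the lease monitor.

```python
class Session:
    """Toy session that survives a detach for a bounded grace window."""

    def __init__(self, detach_grace_seconds, clock):
        self.state = "Ready"
        self._grace = detach_grace_seconds
        self._clock = clock
        self._subscribers = 0
        self._detached_at = None     # set when the last subscriber drops

    def attach(self):
        if self.state != "Ready":
            raise RuntimeError("session is closed")
        self._subscribers += 1
        self._detached_at = None     # a reconnect cancels the grace window

    def detach(self):
        self._subscribers -= 1
        if self._subscribers == 0:
            self._detached_at = self._clock()

    def sweep(self):
        """Periodic check: close the session once the grace window has expired."""
        if self._detached_at is None:
            return
        if self._clock() - self._detached_at >= self._grace:
            self.state = "Closed"
```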
### Task 12: Replay-on-reconnect + emit `ReplayGap`
**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs` (on attach with `AfterWorkerSequence`, call `distributor.TryGetReplayFrom`; replay buffered events then resume live; if `gap`, send the `ReplayGap` marker first)
- Test: `EventStreamServiceTests.cs`
**Steps:** Failing tests: reconnect with a known sequence replays only newer events; reconnect past the ring horizon yields `ReplayGap`. Implement. Green. Build + test. Commit.
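The frame ordering on resume (gap marker, then replay, then live) can be sketched as a generator. This is an illustrative Python model; `resume_stream` and the dict-shaped frames are hypothetical, with field names borrowed from the Task 10 proposal. The duplicate filter on the live stream is an assumption about how the replay-to-live handoff avoids re-sending an event.

```python
def resume_stream(after_sequence, replay, gap, live):
    """Yield frames for a resuming subscriber.

    replay: list of (sequence, payload) newer than after_sequence, ascending.
    gap:    True when older events were already evicted from the ring.
    live:   iterable of (sequence, payload) from the live channel.
    """
    if gap:
        oldest = replay[0][0] if replay else None
        yield {"replay_gap": True, "oldest_available_sequence": oldest}

    last = after_sequence
    for sequence, payload in replay:
        yield {"sequence": sequence, "payload": payload}
        last = sequence

    for sequence, payload in live:
        if sequence <= last:
            continue                 # already delivered during replay
        yield {"sequence": sequence, "payload": payload}
        last = sequence
```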
### Task 13: Owner re-validation on reconnect
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** none (adds a different assertion to the same service as Task 12; sequence after Task 12)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/EventStreamService.cs` (reconnect requires caller `OwnerKeyId` == session owner → `PermissionDenied`)
- Test: `EventStreamServiceTests.cs`
**Steps:** Failing test: a different API key cannot resume someone else's session. Implement. Green. Build+test. Commit.
### Task 14: Client `ReplayGap` handling — all 5 clients
**Classification:** standard
**Estimated implement time:** ~5 min each (dispatch as 5 parallel sub-tasks; disjoint files)
**Parallelizable with:** each other (14a–14e)
**Files (one client each):**
- 14a dotnet: `clients/dotnet/.../` stream consumer + test
- 14b Go: `clients/go/mxgateway/` + `go test`
- 14c Python: `clients/python/src/.../` + `pytest`
- 14d Rust: `clients/rust/crates/.../` + `cargo test`/clippy
- 14e Java: `clients/java/.../` + `gradle test` (macOS JDK 21; **revert generated `MxaccessGateway.java` churn** per `project_java_generated_churn`)
**Steps (each):** Regenerate client stubs from the updated proto; surface `ReplayGap` to the caller (callback/return marker) so apps know to re-snapshot; test the gap path. Build+test that client. Commit per client.
### Task 15: Reconnect integration test (fake worker)
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/GatewayEndToEndFakeWorkerSmokeTests.cs`
**Steps:** Stream, drop, reconnect within grace with last sequence → no gap; reconnect after ring overflow → `ReplayGap`. Build + suite. Commit.
---
## Phase 4 — Per-session ACL
### Task 16: gRPC session-owner gate + all-sessions admin scope
**Classification:** high-risk (security)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Grpc/MxAccessGatewayService.cs` (`Invoke`/`StreamEvents`/`CloseSession` require caller key == session `OwnerKeyId`, or a key bearing a new all-sessions scope)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Security/Authorization/` (define the all-sessions scope; map it)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs`
**Steps:** Failing tests: foreign key gets `PermissionDenied` on another key's session; owner and all-sessions-scoped key succeed. Implement. Green. Build + gateway tests. Commit.
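The gate reduces to a two-branch check. A minimal Python sketch, assuming a scope string named `sessions:all`; that name is invented here, and the real scope is defined in this task.

```python
class PermissionDenied(Exception):
    """Stand-in for the gRPC PermissionDenied status."""


ALL_SESSIONS_SCOPE = "sessions:all"  # hypothetical scope name


def authorize_session_call(caller_key_id, caller_scopes, session_owner_key_id):
    """Allow the session owner, or any key carrying the all-sessions scope."""
    if caller_key_id == session_owner_key_id:
        return
    if ALL_SESSIONS_SCOPE in caller_scopes:
        return
    raise PermissionDenied("caller does not own this session")
```

The same check would apply to `Invoke`, `StreamEvents`, and `CloseSession`, so keeping it in one function avoids three slightly different gates.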
### Task 17: Session `Tag` + dashboard group→tag config
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** Task 16 (disjoint files)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs` (+`SessionManager` to set an optional `Tag` from the open request)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/` Dashboard options (`GroupToSessionTag` map)
- Test: config-binding test
**Steps:** Failing test: a session carries its tag; config map binds. Implement. Green. Build+test. Commit.
### Task 18: EventsHub per-session ACL + hub-token session-tag claim
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** none (depends on 17)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Hubs/EventsHub.cs:39-54` (replace `TODO(per-session-acl)`: Admin sees all; Viewer allowed only if the session's tag is in the user's `GroupToSessionTag`-derived allowed set)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs` (mint an allowed-session-tag claim)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenAuthenticationHandler.cs` (carry the claim back)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/EventsHubTests.cs`
> **Decision flagged in the design doc:** default is "Admin all / Viewer by tag map." If the owner chose the strict variant (Viewers see nothing unless granted), invert the default here — the executor must confirm which before implementing.
**Steps:** Failing tests: Viewer without the tag is refused `SubscribeSession`; Admin allowed; Viewer with the mapped tag allowed. Implement. Green. Build + dashboard tests. Commit.
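The default variant ("Admin all / Viewer by tag map") can be sketched as below. This Python model is illustrative: it assumes `GroupToSessionTag` maps each group to a list of tags and that an untagged session is hidden from Viewers; neither detail is confirmed by the plan.

```python
def allowed_tags(user_groups, group_to_session_tag):
    """Union of the session tags granted by each of the user's groups."""
    tags = set()
    for group in user_groups:
        tags.update(group_to_session_tag.get(group, []))
    return tags


def can_subscribe(role, user_allowed_tags, session_tag):
    """Admin sees every session; Viewer only sessions whose tag is allowed."""
    if role == "Admin":
        return True
    if role == "Viewer":
        return session_tag is not None and session_tag in user_allowed_tags
    return False
```

In the hub-token flow, the result of `allowed_tags` is what would be minted into the claim, so the hub only needs `can_subscribe` at `SubscribeSession` time.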
### Task 19: ACL tests incl. live LDAP users
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Test: `src/ZB.MOM.WW.MxGateway.IntegrationTests/DashboardLdapLiveTests.cs` (extend; gated `MXGATEWAY_RUN_LIVE_LDAP_TESTS=1`)
**Steps:** With `multi-role` (Admin) vs `gw-viewer` (Viewer), assert subscribe authorization differs by session tag. Document if skipped (no live LDAP). Commit.
---
## Phase 5 — Orphan-worker reattach (overturns the CLAUDE.md rule)
### Task 20: Stable gateway-instance id + stable pipe naming
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:433` (pipe name uses a persisted stable gateway-instance id instead of `Environment.ProcessId`)
- Create: gateway-instance-id persistence (small file/SQLite row under `C:\ProgramData\MxGateway\`)
- Test: `SessionManagerTests.cs` / a new instance-id test
**Steps:** Failing test: pipe name is stable across simulated restarts (same instance id). Implement. Green. Build + tests. Commit.
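A load-or-create sketch for the instance id, in Python for illustration. The file location, the id format (a random UUID), and the `mxgateway-{instance}-{session}` pipe-name pattern are all assumptions made for the example.

```python
import uuid
from pathlib import Path


def load_or_create_instance_id(path: Path) -> str:
    """Return the persisted gateway-instance id, creating it on first run."""
    if path.exists():
        existing = path.read_text(encoding="utf-8").strip()
        if existing:
            return existing
    instance_id = uuid.uuid4().hex
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(instance_id, encoding="utf-8")
    return instance_id


def pipe_name(instance_id: str, session_id: str) -> str:
    # Hypothetical naming scheme: stable across restarts because it depends
    # on the persisted id rather than the process id.
    return f"mxgateway-{instance_id}-{session_id}"
```

Calling `load_or_create_instance_id` twice against the same path simulates a restart and should yield the same pipe name for a given session.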
### Task 21: Adoption manifest store (SQLite)
**Classification:** high-risk (persistence, security material)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Create: `src/ZB.MOM.WW.MxGateway.Server/Workers/WorkerAdoptionManifest.cs` (persist `sessionId → workerPid, nonce, ownerKeyId, pipeName`; upsert on launch, delete on clean close)
- Modify: gateway-auth SQLite schema/migration
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/WorkerAdoptionManifestTests.cs`
> Nonce is security material — store it like other secrets (no plaintext logging; standing rule).
**Steps:** Failing test: manifest round-trips an entry; clean close removes it. Implement. Green. Build + tests. Commit.
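The round-trip behaviour can be sketched against an in-memory SQLite database. This Python model is illustrative: the table and column names are hypothetical, and the nonce is stored as given; how it should be protected at rest is left to the existing secret-handling rules.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS worker_adoption_manifest (
    session_id   TEXT PRIMARY KEY,
    worker_pid   INTEGER NOT NULL,
    nonce        TEXT NOT NULL,
    owner_key_id TEXT NOT NULL,
    pipe_name    TEXT NOT NULL
)
"""


class WorkerAdoptionManifest:
    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        self._conn.execute(SCHEMA)

    def upsert(self, session_id, worker_pid, nonce, owner_key_id, pipe_name):
        """Insert or overwrite the entry for a session (called on worker launch)."""
        self._conn.execute(
            "INSERT OR REPLACE INTO worker_adoption_manifest "
            "(session_id, worker_pid, nonce, owner_key_id, pipe_name) "
            "VALUES (?, ?, ?, ?, ?)",
            (session_id, worker_pid, nonce, owner_key_id, pipe_name),
        )
        self._conn.commit()

    def get(self, session_id):
        """Return (worker_pid, nonce, owner_key_id, pipe_name) or None."""
        return self._conn.execute(
            "SELECT worker_pid, nonce, owner_key_id, pipe_name "
            "FROM worker_adoption_manifest WHERE session_id = ?",
            (session_id,),
        ).fetchone()

    def remove(self, session_id):
        """Delete the entry (called on clean close)."""
        self._conn.execute(
            "DELETE FROM worker_adoption_manifest WHERE session_id = ?",
            (session_id,),
        )
        self._conn.commit()
```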
### Task 22: Proto — worker adopt/reconnect frame (contract change)
**Classification:** high-risk (contracts → worker)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_worker.proto` (add an adopt/reconnect `WorkerEnvelope` frame: worker presents `sessionId` + `nonce`; gateway ACK/NACK)
- Regenerate + **commit** `Generated/*` (net48 rule)
- Test: contracts build both TFMs
**Steps:** Add frame, regen, commit generated, build both TFMs. Commit. **Unblocks Tasks 24–25.**
### Task 23: Worker phone-home reconnect loop + self-terminate
**Classification:** high-risk (worker, net48/x86 — **windev**)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Worker/Ipc/WorkerPipeClient.cs` (on pipe drop: reconnect loop with bounded backoff to the stable pipe name; present the adopt frame)
- Modify: `src/ZB.MOM.WW.MxGateway.Worker/` runtime (self-terminate after `MaxOrphanLifetime` with no adoption)
- Test: `src/ZB.MOM.WW.MxGateway.Worker.Tests/` (net48/x86 on windev)
**Steps:** Failing test (fake pipe server): worker retries and adopts; gives up + self-terminates past the lifetime. Build x86 + worker tests on **windev**. Commit. *(net48: no init-only/positional records.)*
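The retry-then-give-up behaviour can be sketched as a loop with capped exponential backoff and a hard deadline. This Python model is illustrative (the worker itself is net48 C#); `reconnect_until_adopted`, its parameters, and the specific delays are hypothetical. A `False` return is the point where the worker would self-terminate.

```python
import time


def reconnect_until_adopted(
    try_adopt,
    max_orphan_lifetime,
    clock=time.monotonic,
    sleep=time.sleep,
    initial_delay=0.5,
    max_delay=8.0,
):
    """Retry try_adopt() with bounded backoff until adopted or the lifetime expires.

    Returns True on adoption, False when the orphan lifetime has elapsed.
    """
    deadline = clock() + max_orphan_lifetime
    delay = initial_delay
    while True:
        if try_adopt():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        sleep(min(delay, remaining))          # never sleep past the deadline
        delay = min(delay * 2, max_delay)     # bounded exponential backoff
```

Passing `clock` and `sleep` as parameters lets a test drive the loop with a fake clock, which mirrors the fake-pipe-server approach in the steps above.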
### Task 24: Gateway adoption — re-open pipes, nonce-validate, reject impostors
**Classification:** high-risk (security, lifecycle — **windev** for live)
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Create: `src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerAdopter.cs` (startup: read manifest, re-open pipe servers, accept adopt frames, validate nonce → adopt or reject)
- Modify: gateway startup hosted-service order (adopter runs **before** `OrphanWorkerTerminator`; terminator handles only un-adoptable/foreign workers)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Workers/OrphanWorkerAdopterTests.cs`
**Steps:** Failing tests: matching nonce adopts and rebuilds the session; mismatched nonce is rejected and the worker terminated. Implement. Green. Build + tests. Commit.
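The accept/reject decision can be sketched as below. This Python model is illustrative; `decide_adoption` and the dict-shaped manifest are hypothetical. Using a constant-time comparison for the nonce is a suggestion of this sketch, not something the plan specifies.

```python
import hmac


def decide_adoption(manifest, session_id, presented_nonce):
    """Return "adopt" when the presented nonce matches the manifest, else "reject".

    manifest: dict mapping session_id -> expected nonce (string).
    """
    expected = manifest.get(session_id)
    if expected is None:
        return "reject"              # unknown session: treat as foreign worker
    # compare_digest avoids leaking how many leading characters matched.
    if hmac.compare_digest(expected.encode("utf-8"), presented_nonce.encode("utf-8")):
        return "adopt"
    return "reject"
```

A `"reject"` result corresponds to the path where the worker is handed to `OrphanWorkerTerminator`.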
### Task 25: Resync adopted worker + `ReplayGap` to subscribers
**Classification:** standard
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerAdopter.cs` (after adoption, `GetSessionState`/`GetWorkerInfo` to resync; reattached subscribers get `ReplayGap` since the ring is gone)
- Test: `OrphanWorkerAdopterTests.cs`
**Steps:** Failing test: adopted session reports resynced state; a resuming subscriber receives `ReplayGap`. Implement. Green. Build + tests. Commit.
### Task 26: `EnableOrphanReattach` flag (default off) + terminator fallback
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** none
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Configuration/WorkerOptions.cs` (`EnableOrphanReattach`, default `false`)
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Workers/OrphanWorkerTerminator.cs` (unchanged default behavior when reattach disabled)
- Test: `OrphanWorkerTerminatorTests.cs` / adopter test
**Steps:** Failing test: with the flag off, startup terminates (today's behavior); on, it adopts. Implement. Green. Build + tests. Commit.
### Task 27: Gateway-restart reattach round-trip (integration, **windev** + live worker)
**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** none
**Files:**
- Test: `src/ZB.MOM.WW.MxGateway.IntegrationTests/WorkerLiveMxAccessSmokeTests.cs` (gated `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`)
**Steps:** Open session → simulate gateway restart → adopter re-adopts the surviving worker → session usable → subscriber gets `ReplayGap` then live events. Run on **windev** with live MXAccess. Document if skipped.
### Task 28: Documented-rule reversals + stillpending refresh
**Classification:** trivial (doc-only)
**Estimated implement time:** ~3 min
**Parallelizable with:** none (final)
**Files:**
- Modify: `CLAUDE.md` (line ~77 — reattach now supported, opt-in/bounded)
- Modify: `docs/DesignDecisions.md` (`:63-73` reconnect, `:75-80` multi-subscriber, reattach rationale → mark superseded with this design's date/commit)
- Modify: `gateway.md` (post-v1 revisit items — reflect what shipped)
- Modify: `stillpending.md` (§2 items: mark fan-out/reconnect/ACL/reattach Resolved with commit refs)
- Modify: `docs/GatewayConfiguration.md` (new options: `MaxEventSubscribersPerSession`, `ReplayBufferCapacity`, `ReplayRetentionSeconds`, `DetachGraceSeconds`, `GroupToSessionTag`, `EnableOrphanReattach`, `MaxOrphanLifetime`)
**Steps:** Edit docs to match shipped behavior. Commit.
---
## Verification matrix
| Phase | Build/test | Host |
|---|---|---|
| 1–4 (gateway, clients) | `dotnet build` + gateway/fake-worker tests; per-client `go/pytest/cargo/gradle/dotnet test` | local (macOS) |
| 3/5 proto changes | regen + commit `Generated/`; build contracts both TFMs; rebuild touched clients | local |
| 5 worker (net48/x86) | `dotnet build -p:Platform=x86` + `Worker.Tests` | **windev** |
| 5 live reattach + Phase-4 LDAP | opt-in gated integration tests | **windev** / live LDAP |
## Final integration review
After all tasks: dispatch a final integration reviewer over `git diff main..HEAD` focusing on the live event path, concurrency in `SessionEventDistributor`, security gates (ACL + nonce adoption), and the three documented-rule reversals. Then use superpowers-extended-cc:finishing-a-development-branch.
@@ -0,0 +1,34 @@
{
"planPath": "docs/plans/2026-06-15-session-resilience.md",
"tasks": [
{"id": 108, "subject": "Task 1: Add OwnerKeyId to the session", "status": "pending"},
{"id": 109, "subject": "Task 2: SessionEventDistributor skeleton", "status": "pending", "blockedBy": [108]},
{"id": 110, "subject": "Task 3: Bounded replay ring buffer", "status": "pending", "blockedBy": [109]},
{"id": 111, "subject": "Task 4: Rewire AttachEventSubscriber + EventStreamService onto distributor", "status": "pending", "blockedBy": [110]},
{"id": 112, "subject": "Task 5: Per-subscriber backpressure isolation", "status": "pending", "blockedBy": [111]},
{"id": 113, "subject": "Task 6: Dashboard broadcaster becomes a distributor subscriber", "status": "pending", "blockedBy": [111]},
{"id": 114, "subject": "Task 7: Remove validator block + add subscriber cap option", "status": "pending", "blockedBy": [112]},
{"id": 115, "subject": "Task 8: Subscriber-lease collection + cap enforcement", "status": "pending", "blockedBy": [114]},
{"id": 116, "subject": "Task 9: Multi-subscriber end-to-end test (FakeWorkerHarness)", "status": "pending", "blockedBy": [115]},
{"id": 117, "subject": "Task 10: Proto - ReplayGap signal", "status": "pending", "blockedBy": [116]},
{"id": 118, "subject": "Task 11: Detach-grace session retention", "status": "pending", "blockedBy": [117]},
{"id": 119, "subject": "Task 12: Replay-on-reconnect + emit ReplayGap", "status": "pending", "blockedBy": [118, 110]},
{"id": 120, "subject": "Task 13: Owner re-validation on reconnect", "status": "pending", "blockedBy": [119, 108]},
{"id": 121, "subject": "Task 14: Client ReplayGap handling - all 5 clients", "status": "pending", "blockedBy": [117]},
{"id": 122, "subject": "Task 15: Reconnect integration test (fake worker)", "status": "pending", "blockedBy": [119]},
{"id": 123, "subject": "Task 16: gRPC session-owner gate + all-sessions admin scope", "status": "pending", "blockedBy": [116, 108]},
{"id": 124, "subject": "Task 17: Session Tag + dashboard group-to-tag config", "status": "pending", "blockedBy": [116]},
{"id": 125, "subject": "Task 18: EventsHub per-session ACL + hub-token tag claim", "status": "pending", "blockedBy": [124]},
{"id": 126, "subject": "Task 19: ACL tests incl. live LDAP users", "status": "pending", "blockedBy": [125]},
{"id": 127, "subject": "Task 20: Stable gateway-instance id + stable pipe naming", "status": "pending", "blockedBy": [126]},
{"id": 128, "subject": "Task 21: Adoption manifest store (SQLite)", "status": "pending", "blockedBy": [127]},
{"id": 129, "subject": "Task 22: Proto - worker adopt/reconnect frame", "status": "pending", "blockedBy": [128]},
{"id": 130, "subject": "Task 23: Worker phone-home reconnect loop + self-terminate", "status": "pending", "blockedBy": [129]},
{"id": 131, "subject": "Task 24: Gateway adoption - re-open pipes, nonce-validate, reject impostors", "status": "pending", "blockedBy": [130]},
{"id": 132, "subject": "Task 25: Resync adopted worker + ReplayGap to subscribers", "status": "pending", "blockedBy": [131, 119]},
{"id": 133, "subject": "Task 26: EnableOrphanReattach flag (default off) + terminator fallback", "status": "pending", "blockedBy": [131]},
{"id": 134, "subject": "Task 27: Gateway-restart reattach round-trip (WINDEV + live worker)", "status": "pending", "blockedBy": [132, 133]},
{"id": 135, "subject": "Task 28: Documented-rule reversals + stillpending refresh", "status": "pending", "blockedBy": [134]}
],
"lastUpdated": "2026-06-15"
}