docs(deploy): record T18/T19 plan refinement + live-smoke fixes + task state

2026-06-26 17:35:07 -04:00
parent fd22f5ce0a
commit f48a748f37
3 changed files with 45 additions and 8 deletions
@@ -168,6 +168,17 @@ Token-gated internal endpoint; constant-time compare; short TTL; scoped to exact
 - New EF migration for `PendingDeployment` (auto-apply in dev; SQL script for prod). No site SQLite schema change.
 - Add `CentralFetchBaseUrl` to central appsettings in `docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`; confirm site→central HTTP reachability (fine on the co-located test host/docker; a firewall port to open in a hub-spoke prod). Update `deploy/wonder-app-vd03/RUNBOOK.md`.

+## 11a. Live validation + post-implementation fixes (2026-06-26)
+
+Smoke-tested on the docker cluster (rebuilt from this branch). Validated end-to-end:
+- Migration applied; all 8 nodes healthy/ready.
+- **Deploy notify-and-fetch**: central notify → active node fetch+apply → id-only replicate → standby fetch → `Success` in **~0.11 s** (the exact path that previously hung 120 s).
+- **Startup reconciliation**: every node self-heals on startup — single-node gap heals; the **concurrent both-nodes-empty** race heals both.
+
+The smoke test surfaced two real bugs in the reconciliation path. The unit/integration tests missed them because they had neither a second concurrent node nor a lingering expired row. Both are fixed:
+1. **Concurrent-gap omit** — when two nodes were concurrently missing the same instance, the second node's `StagePendingIfAbsentAsync` returned false and the handler *omitted* the item, leaving that node unhealed. Fix: on false, return the **existing** pending row's deploymentId + token (multi-use within TTL) so all concurrently-missing nodes heal in the same round.
+2. **Expired pending row blocks self-heal** — `StagePendingIfAbsentAsync` checked existence by `InstanceId` ignoring expiry, so an expired-but-unpurged row (the periodic purge is still a deferred TODO) blocked a fresh stage *and* would collide with the snapshot's reused `DeploymentId` on the unique index. Fix: **expiry-aware staging** — delete expired rows for the instance first, then check only live rows; `GetPendingDeploymentByInstanceIdAsync` filters by expiry. This also opportunistically cleans expired rows, reducing reliance on the deferred periodic purge.
+
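The two fixes in section 11a can be illustrated with a minimal sketch. It is written in Python rather than the project's C#, and it is not the actual repository code: `PendingStore`, `PendingRow`, the in-memory dictionary, and the 5-minute TTL are all hypothetical stand-ins for the `PendingDeployment` table and `StagePendingIfAbsentAsync`.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PendingRow:
    instance_id: str
    deployment_id: str
    token: str
    expires_at: datetime


class PendingStore:
    """In-memory stand-in for the PendingDeployment table (hypothetical)."""

    def __init__(self, ttl: timedelta = timedelta(minutes=5)):  # TTL value is an assumption
        self._rows: dict[str, PendingRow] = {}  # keyed by instance_id
        self._ttl = ttl

    def stage_pending_if_absent(self, instance_id: str, deployment_id: str, now: datetime):
        """Return (row, staged). A live pending row is never superseded."""
        existing = self._rows.get(instance_id)

        # Fix 2 (expiry-aware staging): drop an expired row first so it can
        # neither block a fresh stage nor collide on the reused deployment id.
        if existing is not None and existing.expires_at <= now:
            del self._rows[instance_id]
            existing = None

        # Fix 1 (concurrent gap): a live row already exists, so hand back its
        # deployment id + token instead of omitting the item.
        if existing is not None:
            return existing, False

        row = PendingRow(instance_id, deployment_id,
                         secrets.token_urlsafe(32), now + self._ttl)
        self._rows[instance_id] = row
        return row, True
```

In this sketch the second caller within the TTL receives the same token as the first, which is what lets two concurrently empty nodes heal in the same round. A call made after expiry stages a fresh row with a new token.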
 ## 12. Affected files (for the plan)

 - `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Deployment/` — new `PendingDeployment` entity.
@@ -760,15 +760,38 @@ private void HandleApplyConfigDeploy(ApplyConfigDeploy msg)

 **Rationale:** Replication is best-effort with no retry and no startup reconciliation; a standby that is *down* during a deploy permanently misses that instance until its next deploy (pre-existing gap, independent of frame size). This makes replication self-healing.

+**Auth decision (resolved):** NO static shared key. Reconciliation reuses the same trust model as deploy: the capability (fetch token) is delivered over the **trusted Akka ClusterClient** channel, and the bulk config over HTTP. Each node, on startup, sends central its **local inventory** over Akka; central diffs it against the deployed snapshots and replies with **fresh fetch tokens only for the gap** (missing/stale instances). The node fetches the gap configs over the existing token-gated HTTP endpoint, which serves the re-staged `DeployedConfigSnapshot`. Central re-stages ONLY the gap (usually nothing), using `stage-if-absent` so that an in-flight deploy's pending row is never clobbered. Reconciliation runs on **every** node (per-node, not the singleton) so a down standby self-heals on its next startup. The node fetches missing/stale instances and **logs** orphans (never deletes them).
+
 **Files:**
- Create: `src/ZB.MOM.WW.ScadaBridge.ManagementService/DeploymentConfigEndpoints.cs` (add `GET /api/internal/sites/{siteId}/deployments` → `[{instanceUniqueName, revisionHash}]`, same anonymous+internal style — token or central-auth gated)
- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` (on startup / peer `MemberUp`, reconcile local `deployed_configurations` against central's expected set: fetch missing/stale by id, drop orphans)
- Modify: `IDeploymentManagerRepository` / impl — already has `GetExpectedDeploymentsForSiteAsync` (Task 2).
- Test: reconciliation unit test (missing instance fetched; stale revision refreshed; orphan removed).
+- Create: Commons reconcile messages — `ReconcileSiteRequest(SiteIdentifier, NodeId, IReadOnlyDictionary<string,string> LocalNameToRevisionHash)` + `ReconcileSiteResponse(IReadOnlyList<ReconcileGapItem> Gap, IReadOnlyList<string> OrphanNames, string CentralFetchBaseUrl)` where `ReconcileGapItem(InstanceUniqueName, DeploymentId, RevisionHash, FetchToken)`.
+- Modify: `IDeploymentManagerRepository` / impl — add `GetExpectedDeploymentsForSiteAsync(siteId)` returning `(InstanceUniqueName, RevisionHash, DeploymentId)` from `DeployedConfigSnapshot`⋈`Instance` (by site); + a `StagePendingIfAbsentAsync` (insert-if-no-pending-row-for-instance, from the snapshot's config, fresh token + TTL) — does NOT supersede.
+- Modify: `CentralCommunicationActor` (or a small central reconcile handler) — handle `ReconcileSiteRequest`: compute expected set, diff vs reported inventory, `StagePendingIfAbsent` the gap with fresh tokens, reply `ReconcileSiteResponse`. Reuses the existing token + config-fetch endpoint.
+- Create: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/SiteReconciliationActor.cs` (per-node) — on startup (after local configs load), build the local inventory, Ask central via the site's central client, fetch the gap via `IDeploymentConfigFetcher`, guarded-write (`StoreDeployedConfigIfNewerAsync`), log orphans. Wire in `AkkaHostedService` per node + give it the central client + fetcher + storage.
+- Modify: site appsettings — add `ScadaBridge:Communication:CentralFetchBaseUrl` to the SITE files (the site now initiates fetches without a notify carrying it). NOTE: the response also carries `CentralFetchBaseUrl` from central, so site config is a fallback.
+- Test: central handler (gap diff + stage-if-absent + tokens), site reconcile (missing fetched, stale refreshed, orphan logged-not-deleted, no-gap = no fetch).

-**Steps:** TDD as above; gate the per-instance fetch through the same `IDeploymentConfigFetcher`. Because the site needs central's base URL at startup (no notify in hand), add a site option `Communication:CentralFetchBaseUrl` (or reuse a site-config value) for the reconciliation path, and a reconciliation auth token scheme (a static internal token, or extend the endpoint to accept the cluster's shared secret). **Decide the reconciliation auth during this task** (the per-deployment token model doesn't apply to a cold-start pull) — surface options before implementing.
+**Steps:** TDD; gate the gap fetch through the existing `IDeploymentConfigFetcher` + endpoint; one Akka round-trip; re-stage only the gap. Conservative: never delete orphans.

-**Commit** — `feat(site): startup/rejoin reconciliation of deployed configs against central`
+**Commit** — `feat(site): startup reconciliation of deployed configs (Akka inventory + gap fetch)`
+
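The central-side diff described for Task 18 (expected set versus the node's reported inventory) can be sketched as follows. This is an illustrative Python sketch, not the C# handler: `Expected` and `diff_inventory` are hypothetical names, and token staging is left out so that only the gap/orphan computation is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expected:
    """One row of central's expected set for a site (hypothetical shape)."""
    instance_name: str
    deployment_id: str
    revision_hash: str


def diff_inventory(expected, local):
    """Compare central's expected deployments with a node's local inventory.

    expected: list of Expected rows for the site.
    local:    dict mapping instance name -> locally stored revision hash.
    Returns (gap, orphans): gap holds expected rows that are missing or stale
    on the node; orphans holds local names central does not expect.
    """
    # Missing instances yield None from .get(), stale ones yield a different
    # hash; both cases differ from the expected revision hash.
    gap = [e for e in expected if local.get(e.instance_name) != e.revision_hash]

    expected_names = {e.instance_name for e in expected}
    # Orphans are only reported so the node can log them; nothing is deleted.
    orphans = sorted(name for name in local if name not in expected_names)
    return gap, orphans
```

When the inventory already matches, both lists are empty, which corresponds to the "no-gap = no fetch" test case: nothing is re-staged and the node makes no HTTP request.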
+---
+
+### Task 19: Topology page — fast load (staleness off the live loop)
+
+**Classification:** standard
+**Estimated implement time:** ~5 min
+**Parallelizable with:** none (independent of T18; can run after)
+
+**Rationale:** `Topology.razor` (`/deployment/topology`) already reads deployed state from central DB (it does NOT query sites). It's slow because the **staleness** badge loops over every deployed instance calling `DeploymentService.GetDeploymentComparisonAsync` → `FlattenAndValidateAsync` (a full re-flatten per instance) — on initial load AND again every 15 s via the live-updates timer.
+
+**Files:**
+- Modify: `IDeploymentManagerRepository` / impl — add a bulk `GetDeployedSnapshotsBySiteAsync`/`GetAllDeployedSnapshotsAsync` (one query, avoids N snapshot lookups).
+- Modify: `src/ZB.MOM.WW.ScadaBridge.CentralUI/Components/Pages/Deployment/Topology.razor` — (a) load deployed state + the bulk snapshots fast; (b) take the staleness re-flatten OFF the 15 s live-update loop (live update refreshes only the cheap deployed state); (c) compute staleness once on initial load, **parallelized** across instances (not sequential awaits), and on an explicit "Re-check staleness" button.
+- Test: `tests/ZB.MOM.WW.ScadaBridge.CentralUI.Tests/TopologyPageTests.cs` — assert the live-update path does NOT call `GetDeploymentComparisonAsync`; deployed state still renders; staleness computed on load + manual refresh.
+
+**Steps:** TDD; keep the Stale/Current badge accurate (it only changes on edit/redeploy, so it doesn't belong on a 15 s poll); the deployed state (State + deployed-at) is the only thing the live timer refreshes.
+
+**Commit** — `perf(ui): topology page — staleness off the live loop + bulk snapshot query`
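
The Task 19 split between the cheap live refresh and the expensive staleness check can be sketched as below. It uses Python's `asyncio` purely for illustration (the real page is a Razor component); `TopologyModel` and the two injected callables are hypothetical stand-ins for the bulk deployed-state query and `GetDeploymentComparisonAsync`.

```python
import asyncio


class TopologyModel:
    """Hypothetical view-model separating cheap and expensive refresh paths."""

    def __init__(self, load_deployed_state, compare_instance):
        self._load_deployed_state = load_deployed_state  # cheap: one bulk query
        self._compare_instance = compare_instance        # expensive: re-flatten per instance
        self.deployed = {}  # instance name -> deployed state
        self.stale = {}     # instance name -> True if stale

    async def initial_load(self):
        # Initial load does both: the cheap state, then staleness once.
        await self.live_tick()
        await self.recheck_staleness()

    async def live_tick(self):
        # Called by the periodic timer: refreshes only the deployed state and
        # never touches the staleness comparison.
        self.deployed = await self._load_deployed_state()

    async def recheck_staleness(self):
        # Called on initial load and from the manual "Re-check staleness"
        # button. Comparisons run concurrently, not as sequential awaits.
        names = list(self.deployed)
        results = await asyncio.gather(*(self._compare_instance(n) for n in names))
        self.stale = dict(zip(names, results))
```

Because `asyncio.gather` returns results in the order of its arguments, zipping them back onto `names` is safe. The cached `stale` map is what the badge would render between explicit re-checks.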

 ---

@@ -17,8 +17,11 @@
    {"id": 37, "subject": "Task 14: Retire fat DeployInstanceCommand wire path", "status": "completed", "blockedBy": [29, 30, 33, 34]},
    {"id": 38, "subject": "Task 15: appsettings CentralFetchBaseUrl + RUNBOOK", "status": "completed"},
    {"id": 39, "subject": "Task 16: Integration test — large config, supersession, token", "status": "completed", "blockedBy": [29, 33, 35, 37]},
-    {"id": 40, "subject": "Task 17: Live smoke on docker cluster", "status": "pending", "blockedBy": [39]},
-    {"id": 41, "subject": "Task 18 (FOLLOW-UP): standby/startup reconciliation", "status": "pending", "blockedBy": [39]}
+    {"id": 40, "subject": "Task 17: Live smoke on docker cluster (found+fixed 2 reconcile bugs)", "status": "completed", "blockedBy": [42, 43, 44]},
+    {"id": 41, "subject": "Task 18a: reconcile messages + repo (expected-set + stage-if-absent)", "status": "completed", "blockedBy": [39]},
+    {"id": 42, "subject": "Task 18b: central reconcile handler", "status": "completed", "blockedBy": [41]},
+    {"id": 43, "subject": "Task 18c: site reconciliation actor (per-node) + wiring", "status": "completed", "blockedBy": [41]},
+    {"id": 44, "subject": "Task 19: Topology page fast load (staleness off live loop)", "status": "completed"}
  ],
  "lastUpdated": "2026-06-26"
 }