diff --git a/docs/plans/2026-06-26-deploy-config-notify-and-fetch-design.md b/docs/plans/2026-06-26-deploy-config-notify-and-fetch-design.md
index a6b58710..6658aaa9 100644
--- a/docs/plans/2026-06-26-deploy-config-notify-and-fetch-design.md
+++ b/docs/plans/2026-06-26-deploy-config-notify-and-fetch-design.md
@@ -168,6 +168,17 @@ Token-gated internal endpoint; constant-time compare; short TTL; scoped to exact
 - New EF migration for `PendingDeployment` (auto-apply in dev; SQL script for prod). No site SQLite schema change.
 - Add `CentralFetchBaseUrl` to central appsettings in `docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`; confirm site→central HTTP reachability (fine on the co-located test host/docker; a firewall port to open in a hub-spoke prod). Update `deploy/wonder-app-vd03/RUNBOOK.md`.
 
+## 11a. Live validation + post-implementation fixes (2026-06-26)
+
+Smoke-tested on the docker cluster (rebuilt from this branch). Validated end-to-end:
+- Migration applied; all 8 nodes healthy/ready.
+- **Deploy notify-and-fetch**: central notify → active node fetch+apply → id-only replicate → standby fetch → `Success` in **~0.11 s** (the exact path that previously hung 120 s).
+- **Startup reconciliation**: every node self-heals on startup — single-node gap heals; the **concurrent both-nodes-empty** race heals both.
+
+The smoke surfaced two real bugs in the reconciliation path (missed by unit/integration tests because those didn't have a second concurrent node or a lingering expired row), both fixed:
+1. **Concurrent-gap omit** — when two nodes were concurrently missing the same instance, the second node's `StagePendingIfAbsentAsync` returned false and the handler *omitted* the item, leaving that node unhealed. Fix: on false, return the **existing** pending row's deploymentId + token (multi-use within TTL) so all concurrently-missing nodes heal in the same round.
+2. **Expired pending row blocks self-heal** — `StagePendingIfAbsentAsync` checked existence by `InstanceId` ignoring expiry, so an expired-but-unpurged row (the periodic purge is still a deferred TODO) blocked a fresh stage *and* would collide with the snapshot's reused `DeploymentId` on the unique index. Fix: **expiry-aware staging** — delete expired rows for the instance first, then check only live rows; `GetPendingDeploymentByInstanceIdAsync` filters by expiry. This also opportunistically cleans expired rows, reducing reliance on the deferred periodic purge.
+
 ## 12. Affected files (for the plan)
 
 - `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Deployment/` — new `PendingDeployment` entity.
diff --git a/docs/plans/2026-06-26-deploy-config-notify-and-fetch.md b/docs/plans/2026-06-26-deploy-config-notify-and-fetch.md
index 32c8188b..0f36656d 100644
--- a/docs/plans/2026-06-26-deploy-config-notify-and-fetch.md
+++ b/docs/plans/2026-06-26-deploy-config-notify-and-fetch.md
@@ -760,15 +760,38 @@ private void HandleApplyConfigDeploy(ApplyConfigDeploy msg)
 
 **Rationale:** Replication is best-effort with no retry and no startup reconciliation; a standby that is *down* during a deploy permanently misses that instance until its next deploy (pre-existing gap, independent of frame size). This makes replication self-healing.
 
+**Auth decision (resolved):** NO static shared key. Reconciliation reuses the same trust model as deploy — the capability (fetch token) is delivered over the **trusted Akka ClusterClient** channel, the bulk config over HTTP. Each node, on startup, sends central its **local inventory** over Akka; central diffs it against the deployed snapshots and replies with **fresh fetch tokens only for the gap** (missing/stale instances). The node fetches the gap configs over the existing token-gated HTTP endpoint, which serves the re-staged `DeployedConfigSnapshot`. Central re-stages ONLY the gap (usually nothing), `stage-if-absent` so an in-flight deploy's pending row is never clobbered. Runs on **every** node (per-node, not the singleton) so a down standby self-heals. Fetch missing/stale; **log** orphans (never delete).
+
 **Files:**
-- Create: `src/ZB.MOM.WW.ScadaBridge.ManagementService/DeploymentConfigEndpoints.cs` (add `GET /api/internal/sites/{siteId}/deployments` → `[{instanceUniqueName, revisionHash}]`, same anonymous+internal style — token or central-auth gated)
-- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` (on startup / peer `MemberUp`, reconcile local `deployed_configurations` against central's expected set: fetch missing/stale by id, drop orphans)
-- Modify: `IDeploymentManagerRepository` / impl — already has `GetExpectedDeploymentsForSiteAsync` (Task 2).
-- Test: reconciliation unit test (missing instance fetched; stale revision refreshed; orphan removed).
+- Create: Commons reconcile messages — `ReconcileSiteRequest(SiteIdentifier, NodeId, IReadOnlyDictionary LocalNameToRevisionHash)` + `ReconcileSiteResponse(IReadOnlyList Gap, IReadOnlyList OrphanNames, string CentralFetchBaseUrl)` where `ReconcileGapItem(InstanceUniqueName, DeploymentId, RevisionHash, FetchToken)`.
+- Modify: `IDeploymentManagerRepository` / impl — add `GetExpectedDeploymentsForSiteAsync(siteId)` returning `(InstanceUniqueName, RevisionHash, DeploymentId)` from `DeployedConfigSnapshot`⋈`Instance` (by site); + a `StagePendingIfAbsentAsync` (insert-if-no-pending-row-for-instance, from the snapshot's config, fresh token + TTL) — does NOT supersede.
+- Modify: `CentralCommunicationActor` (or a small central reconcile handler) — handle `ReconcileSiteRequest`: compute expected set, diff vs reported inventory, `StagePendingIfAbsent` the gap with fresh tokens, reply `ReconcileSiteResponse`. Reuses the existing token + config-fetch endpoint.
+- Create: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/SiteReconciliationActor.cs` (per-node) — on startup (after local configs load), build the local inventory, Ask central via the site's central client, fetch the gap via `IDeploymentConfigFetcher`, guarded-write (`StoreDeployedConfigIfNewerAsync`), log orphans. Wire in `AkkaHostedService` per node + give it the central client + fetcher + storage.
+- Modify: site appsettings — add `ScadaBridge:Communication:CentralFetchBaseUrl` to the SITE files (the site now initiates fetches without a notify carrying it). NOTE: the response also carries `CentralFetchBaseUrl` from central, so site config is a fallback.
+- Test: central handler (gap diff + stage-if-absent + tokens), site reconcile (missing fetched, stale refreshed, orphan logged-not-deleted, no-gap = no fetch).
 
-**Steps:** TDD as above; gate the per-instance fetch through the same `IDeploymentConfigFetcher`. Because the site needs central's base URL at startup (no notify in hand), add a site option `Communication:CentralFetchBaseUrl` (or reuse a site-config value) for the reconciliation path, and a reconciliation auth token scheme (a static internal token, or extend the endpoint to accept the cluster's shared secret). **Decide the reconciliation auth during this task** (the per-deployment token model doesn't apply to a cold-start pull) — surface options before implementing.
+**Steps:** TDD; gate the gap fetch through the existing `IDeploymentConfigFetcher` + endpoint; one Akka round-trip; re-stage only the gap. Conservative: never delete orphans.
 
-**Commit** — `feat(site): startup/rejoin reconciliation of deployed configs against central`
+**Commit** — `feat(site): startup reconciliation of deployed configs (Akka inventory + gap fetch)`
+
+---
+
+### Task 19: Topology page — fast load (staleness off the live loop)
+
+**Classification:** standard
+**Estimated implement time:** ~5 min
+**Parallelizable with:** none (independent of T18; can run after)
+
+**Rationale:** `Topology.razor` (`/deployment/topology`) already reads deployed state from central DB (it does NOT query sites). It's slow because the **staleness** badge loops over every deployed instance calling `DeploymentService.GetDeploymentComparisonAsync` → `FlattenAndValidateAsync` (a full re-flatten per instance) — on initial load AND again every 15 s via the live-updates timer.
+
+**Files:**
+- Modify: `IDeploymentManagerRepository` / impl — add a bulk `GetDeployedSnapshotsBySiteAsync`/`GetAllDeployedSnapshotsAsync` (one query, avoids N snapshot lookups).
+- Modify: `src/ZB.MOM.WW.ScadaBridge.CentralUI/Components/Pages/Deployment/Topology.razor` — (a) load deployed state + the bulk snapshots fast; (b) take the staleness re-flatten OFF the 15 s live-update loop (live update refreshes only the cheap deployed state); (c) compute staleness once on initial load, **parallelized** across instances (not sequential awaits), and on an explicit "Re-check staleness" button.
+- Test: `tests/ZB.MOM.WW.ScadaBridge.CentralUI.Tests/TopologyPageTests.cs` — assert the live-update path does NOT call `GetDeploymentComparisonAsync`; deployed state still renders; staleness computed on load + manual refresh.
+
+**Steps:** TDD; keep the Stale/Current badge accurate (it only changes on edit/redeploy, so it doesn't belong on a 15 s poll); the deployed state (State + deployed-at) is the only thing the live timer refreshes.
+
+**Commit** — `perf(ui): topology page — staleness off the live loop + bulk snapshot query`
 
 ---
 
diff --git a/docs/plans/2026-06-26-deploy-config-notify-and-fetch.md.tasks.json b/docs/plans/2026-06-26-deploy-config-notify-and-fetch.md.tasks.json
index d52691fb..f6ed99f0 100644
--- a/docs/plans/2026-06-26-deploy-config-notify-and-fetch.md.tasks.json
+++ b/docs/plans/2026-06-26-deploy-config-notify-and-fetch.md.tasks.json
@@ -17,8 +17,11 @@
     {"id": 37, "subject": "Task 14: Retire fat DeployInstanceCommand wire path", "status": "completed", "blockedBy": [29, 30, 33, 34]},
     {"id": 38, "subject": "Task 15: appsettings CentralFetchBaseUrl + RUNBOOK", "status": "completed"},
     {"id": 39, "subject": "Task 16: Integration test — large config, supersession, token", "status": "completed", "blockedBy": [29, 33, 35, 37]},
-    {"id": 40, "subject": "Task 17: Live smoke on docker cluster", "status": "pending", "blockedBy": [39]},
-    {"id": 41, "subject": "Task 18 (FOLLOW-UP): standby/startup reconciliation", "status": "pending", "blockedBy": [39]}
+    {"id": 40, "subject": "Task 17: Live smoke on docker cluster (found+fixed 2 reconcile bugs)", "status": "completed", "blockedBy": [42, 43, 44]},
+    {"id": 41, "subject": "Task 18a: reconcile messages + repo (expected-set + stage-if-absent)", "status": "completed", "blockedBy": [39]},
+    {"id": 42, "subject": "Task 18b: central reconcile handler", "status": "completed", "blockedBy": [41]},
+    {"id": 43, "subject": "Task 18c: site reconciliation actor (per-node) + wiring", "status": "completed", "blockedBy": [41]},
+    {"id": 44, "subject": "Task 19: Topology page fast load (staleness off live loop)", "status": "completed"}
   ],
   "lastUpdated": "2026-06-26"
 }