docs(deploy): record T18/T19 plan refinement + live-smoke fixes + task state
@@ -760,15 +760,38 @@ private void HandleApplyConfigDeploy(ApplyConfigDeploy msg)
|
||||
|
||||
**Rationale:** Replication is best-effort, with no retry and no startup reconciliation. A standby that is *down* during a deploy misses that instance and does not recover it until the instance's next deploy (a pre-existing gap, independent of frame size). Startup reconciliation makes replication self-healing.
|
||||
|
||||
**Auth decision (resolved):** NO static shared key. Reconciliation reuses the same trust model as deploy: the capability (fetch token) is delivered over the **trusted Akka ClusterClient** channel, and the bulk config over HTTP. On startup, each node sends central its **local inventory** over Akka; central diffs it against the deployed snapshots and replies with **fresh fetch tokens only for the gap** (missing/stale instances). The node fetches the gap configs over the existing token-gated HTTP endpoint, which serves the re-staged `DeployedConfigSnapshot`. Central re-stages ONLY the gap (usually nothing) and uses `stage-if-absent`, so an in-flight deploy's pending row is never clobbered. Reconciliation runs on **every** node (per-node, not the singleton), so a down standby self-heals. Missing/stale instances are fetched; orphans are **logged** (never deleted).
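
The diff that central performs can be sketched as below. This is a minimal illustration, not the planned handler: `ComputeGap`, the tuple shapes, and the sample instance names are hypothetical, and the only rule taken from the plan is "gap = missing or stale by revision hash, orphans = reported by the node but unknown to central".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: diff central's expected set against a node's reported inventory.
// expected: instance name -> (revision hash, deployment id), as read from the deployed snapshots.
// local:    instance name -> revision hash, as reported by the node.
static (List<string> Gap, List<string> Orphans) ComputeGap(
    IReadOnlyDictionary<string, (string RevisionHash, string DeploymentId)> expected,
    IReadOnlyDictionary<string, string> local)
{
    // Gap: missing on the node, or present with a different revision hash (stale).
    var gap = expected
        .Where(e => !local.TryGetValue(e.Key, out var hash) || hash != e.Value.RevisionHash)
        .Select(e => e.Key)
        .OrderBy(n => n, StringComparer.Ordinal)
        .ToList();

    // Orphans: present on the node but unknown to central. Reported only, never deleted.
    var orphans = local.Keys
        .Where(n => !expected.ContainsKey(n))
        .OrderBy(n => n, StringComparer.Ordinal)
        .ToList();

    return (gap, orphans);
}

// Sample data (made up for the example).
var expectedAtCentral = new Dictionary<string, (string RevisionHash, string DeploymentId)>
{
    ["pump-a"] = ("r2", "d-10"),
    ["pump-b"] = ("r1", "d-11"),
    ["pump-c"] = ("r5", "d-12"),
};
var reportedByNode = new Dictionary<string, string>
{
    ["pump-a"] = "r1",   // stale
    ["pump-b"] = "r1",   // current
    ["pump-x"] = "r9",   // orphan
};

var (gapNames, orphanNames) = ComputeGap(expectedAtCentral, reportedByNode);
Console.WriteLine($"gap: {string.Join(", ", gapNames)}");         // gap: pump-a, pump-c
Console.WriteLine($"orphans: {string.Join(", ", orphanNames)}");  // orphans: pump-x
```

Only the names in the gap would receive fresh fetch tokens; when the inventory already matches, the gap is empty and nothing is re-staged.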
|
||||
|
||||
**Files:**
|
||||
- Create: `src/ZB.MOM.WW.ScadaBridge.ManagementService/DeploymentConfigEndpoints.cs` (add `GET /api/internal/sites/{siteId}/deployments` → `[{instanceUniqueName, revisionHash}]`, same anonymous+internal style — token or central-auth gated)
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` (on startup / peer `MemberUp`, reconcile local `deployed_configurations` against central's expected set: fetch missing/stale by id, drop orphans)
|
||||
- Modify: `IDeploymentManagerRepository` / impl — already has `GetExpectedDeploymentsForSiteAsync` (Task 2).
|
||||
- Test: reconciliation unit test (missing instance fetched; stale revision refreshed; orphan removed).
|
||||
- Create: Commons reconcile messages — `ReconcileSiteRequest(SiteIdentifier, NodeId, IReadOnlyDictionary<string,string> LocalNameToRevisionHash)` + `ReconcileSiteResponse(IReadOnlyList<ReconcileGapItem> Gap, IReadOnlyList<string> OrphanNames, string CentralFetchBaseUrl)` where `ReconcileGapItem(InstanceUniqueName, DeploymentId, RevisionHash, FetchToken)`.
|
||||
- Modify: `IDeploymentManagerRepository` / impl — add `GetExpectedDeploymentsForSiteAsync(siteId)` returning `(InstanceUniqueName, RevisionHash, DeploymentId)` from `DeployedConfigSnapshot`⋈`Instance` (by site); + a `StagePendingIfAbsentAsync` (insert-if-no-pending-row-for-instance, from the snapshot's config, fresh token + TTL) — does NOT supersede.
|
||||
- Modify: `CentralCommunicationActor` (or a small central reconcile handler) — handle `ReconcileSiteRequest`: compute expected set, diff vs reported inventory, `StagePendingIfAbsent` the gap with fresh tokens, reply `ReconcileSiteResponse`. Reuses the existing token + config-fetch endpoint.
|
||||
- Create: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/SiteReconciliationActor.cs` (per-node) — on startup (after local configs load), build the local inventory, Ask central via the site's central client, fetch the gap via `IDeploymentConfigFetcher`, guarded-write (`StoreDeployedConfigIfNewerAsync`), log orphans. Wire in `AkkaHostedService` per node + give it the central client + fetcher + storage.
|
||||
- Modify: site appsettings — add `ScadaBridge:Communication:CentralFetchBaseUrl` to the SITE files (the site now initiates fetches without a notify carrying it). NOTE: the response also carries `CentralFetchBaseUrl` from central, so site config is a fallback.
|
||||
- Test: central handler (gap diff + stage-if-absent + tokens), site reconcile (missing fetched, stale refreshed, orphan logged-not-deleted, no-gap = no fetch).
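
The `stage-if-absent` rule from the repository item above can be sketched with an in-memory table. Everything here is an assumption made for illustration: the dictionary stands in for the pending-row storage, `TryStagePendingIfAbsent` is a hypothetical name, and the token format and TTL are placeholders. A real implementation would also need the check and insert to be atomic (for example a unique constraint or a transaction).

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-in for the pending-deployment table, keyed by instance name (assumption).
var pendingRows = new Dictionary<string, (string Token, DateTimeOffset ExpiresAt)>();

// Hypothetical stage-if-absent: insert a pending row only when the instance has none.
// Returns false and leaves the existing row untouched when one is already pending.
bool TryStagePendingIfAbsent(string instanceName, TimeSpan ttl, DateTimeOffset now, out string token)
{
    if (pendingRows.ContainsKey(instanceName))
    {
        token = string.Empty;   // an in-flight deploy owns this row; do not supersede it
        return false;
    }

    token = Guid.NewGuid().ToString("N");   // fresh token, illustrative format only
    pendingRows[instanceName] = (token, now + ttl);
    return true;
}

var now = DateTimeOffset.UtcNow;
var ttl = TimeSpan.FromMinutes(5);          // TTL value is an assumption

// Simulate a deploy that is already in flight for pump-a.
pendingRows["pump-a"] = ("deploy-token", now + ttl);

var stagedA = TryStagePendingIfAbsent("pump-a", ttl, now, out _);
var stagedC = TryStagePendingIfAbsent("pump-c", ttl, now, out var tokenC);
var rowTokenA = pendingRows["pump-a"].Token;

Console.WriteLine($"pump-a staged: {stagedA}, row token: {rowTokenA}");
Console.WriteLine($"pump-c staged: {stagedC}, token issued: {tokenC.Length > 0}");
```

The in-flight row for `pump-a` keeps its original token, which is the "does NOT supersede" behaviour the plan calls for.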
|
||||
|
||||
**Steps:** TDD as above; gate the per-instance fetch through the same `IDeploymentConfigFetcher`. Because the site needs central's base URL at startup (no notify in hand), add a site option `Communication:CentralFetchBaseUrl` (or reuse a site-config value) for the reconciliation path, and a reconciliation auth token scheme (a static internal token, or extend the endpoint to accept the cluster's shared secret). **Decide the reconciliation auth during this task** (the per-deployment token model doesn't apply to a cold-start pull) — surface options before implementing.
|
||||
**Steps:** TDD; gate the gap fetch through the existing `IDeploymentConfigFetcher` + endpoint; one Akka round-trip; re-stage only the gap. Conservative: never delete orphans.
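
A rough sketch of the site-side behaviour the tests above describe (no gap means no fetch; orphans are logged, not deleted). The delegates are stand-ins: `fetchConfig` plays the role of `IDeploymentConfigFetcher` and `storeIfNewer` the role of the guarded write, but their signatures and the name `ApplyGapAsync` are hypothetical, and the fake store below writes unconditionally rather than comparing revisions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical site-side step: apply a reconcile response from central.
static async Task<int> ApplyGapAsync(
    IReadOnlyList<(string InstanceName, string FetchToken)> gapItems,
    IReadOnlyList<string> orphansFromCentral,
    Func<string, string, Task<string>> fetchConfig,
    Func<string, string, Task> storeIfNewer,
    Action<string> log)
{
    var fetched = 0;
    foreach (var (instanceName, fetchToken) in gapItems)
    {
        // One token-gated fetch per gap item, then the guarded write.
        var config = await fetchConfig(instanceName, fetchToken);
        await storeIfNewer(instanceName, config);
        fetched++;
    }

    // Conservative: orphans are only logged; local data is left in place.
    foreach (var orphan in orphansFromCentral)
        log($"orphan (kept, not deleted): {orphan}");

    return fetched;
}

// Fakes for the example.
var localStore = new Dictionary<string, string> { ["pump-x"] = "local-only config" };
var logLines = new List<string>();
var fetchCalls = 0;

Task<string> FakeFetch(string instanceName, string token)
{
    fetchCalls++;
    return Task.FromResult($"config for {instanceName}");
}

Task FakeStore(string instanceName, string config)
{
    localStore[instanceName] = config;   // a real guarded write would compare revisions first
    return Task.CompletedTask;
}

// No gap: nothing is fetched, the orphan is logged and kept.
var fetchedNoGap = await ApplyGapAsync(
    Array.Empty<(string, string)>(), new[] { "pump-x" }, FakeFetch, FakeStore, logLines.Add);

// One missing instance: exactly one fetch.
var fetchedOneGap = await ApplyGapAsync(
    new[] { ("pump-c", "tok-1") }, Array.Empty<string>(), FakeFetch, FakeStore, logLines.Add);

Console.WriteLine($"fetch calls: {fetchCalls}, log lines: {logLines.Count}");
```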
|
||||
|
||||
**Commit** — `feat(site): startup/rejoin reconciliation of deployed configs against central`
|
||||
**Commit** — `feat(site): startup reconciliation of deployed configs (Akka inventory + gap fetch)`
|
||||
|
||||
---
|
||||
|
||||
### Task 19: Topology page — fast load (staleness off the live loop)
|
||||
|
||||
**Classification:** standard
|
||||
**Estimated implement time:** ~5 min
|
||||
**Parallelizable with:** none (independent of T18; can run after)
|
||||
|
||||
**Rationale:** `Topology.razor` (`/deployment/topology`) already reads deployed state from central DB (it does NOT query sites). It's slow because the **staleness** badge loops over every deployed instance calling `DeploymentService.GetDeploymentComparisonAsync` → `FlattenAndValidateAsync` (a full re-flatten per instance) — on initial load AND again every 15 s via the live-updates timer.
|
||||
|
||||
**Files:**
|
||||
- Modify: `IDeploymentManagerRepository` / impl — add a bulk `GetDeployedSnapshotsBySiteAsync`/`GetAllDeployedSnapshotsAsync` (one query, avoids N snapshot lookups).
|
||||
- Modify: `src/ZB.MOM.WW.ScadaBridge.CentralUI/Components/Pages/Deployment/Topology.razor` — (a) load deployed state + the bulk snapshots fast; (b) take the staleness re-flatten OFF the 15 s live-update loop (live update refreshes only the cheap deployed state); (c) compute staleness once on initial load, **parallelized** across instances (not sequential awaits), and on an explicit "Re-check staleness" button.
|
||||
- Test: `tests/ZB.MOM.WW.ScadaBridge.CentralUI.Tests/TopologyPageTests.cs` — assert the live-update path does NOT call `GetDeploymentComparisonAsync`; deployed state still renders; staleness computed on load + manual refresh.
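
The "parallelized, not sequential awaits" part of item (c) could look roughly like the sketch below. `IsStaleAsync` is a hypothetical stand-in for the per-instance comparison (it does not call `GetDeploymentComparisonAsync`), and the instance names and the 50 ms delay are made up.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Stand-in for the per-instance comparison; the real call re-flattens the instance.
static async Task<bool> IsStaleAsync(string instanceName)
{
    await Task.Delay(50);   // simulated cost of one comparison
    return instanceName.EndsWith("-edited", StringComparison.Ordinal);
}

var instances = new[] { "pump-a", "pump-b-edited", "pump-c" };

// Start every comparison first, then await them together,
// instead of awaiting each one before starting the next.
var checks = instances.Select(async instance => (Name: instance, Stale: await IsStaleAsync(instance)));
var results = await Task.WhenAll(checks);
var staleByName = results.ToDictionary(r => r.Name, r => r.Stale);

foreach (var instance in instances)
{
    var badge = staleByName[instance] ? "Stale" : "Current";
    Console.WriteLine($"{instance}: {badge}");
}
```

If the real comparison shares a resource that is not safe for concurrent use (a single database context, for instance), each call may need its own scope or the fan-out may need a concurrency cap; whether that applies here is not stated in the plan.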
|
||||
|
||||
**Steps:** TDD; keep the Stale/Current badge accurate (it only changes on edit/redeploy, so it doesn't belong on a 15 s poll); the deployed state (State + deployed-at) is the only thing the live timer refreshes.
|
||||
|
||||
**Commit** — `perf(ui): topology page — staleness off the live loop + bulk snapshot query`
|
||||
|
||||
---
|
||||
|
||||
|
||||