docs(deploy): record T18/T19 plan refinement + live-smoke fixes + task state

2026-06-26 17:35:07 -04:00
parent fd22f5ce0a
commit f48a748f37
3 changed files with 45 additions and 8 deletions
@@ -168,6 +168,17 @@ Token-gated internal endpoint; constant-time compare; short TTL; scoped to exact
 - New EF migration for `PendingDeployment` (auto-apply in dev; SQL script for prod). No site SQLite schema change.
 - Add `CentralFetchBaseUrl` to central appsettings in `docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`; confirm site→central HTTP reachability (fine on the co-located test host/docker; in a hub-spoke prod deployment a firewall port has to be opened). Update `deploy/wonder-app-vd03/RUNBOOK.md`.

+## 11a. Live validation + post-implementation fixes (2026-06-26)
+
+Smoke-tested on the docker cluster (rebuilt from this branch). Validated end-to-end:
+- Migration applied; all 8 nodes healthy/ready.
+- **Deploy notify-and-fetch**: central notify → active node fetch+apply → id-only replicate → standby fetch → `Success` in **~0.11 s** (the exact path that previously hung 120 s).
+- **Startup reconciliation**: every node self-heals on startup. A single-node gap heals, and in the **concurrent both-nodes-empty** race both nodes heal.
+
+The smoke test surfaced two real bugs in the reconciliation path. The unit/integration tests missed them because they exercised neither a second concurrent node nor a lingering expired row. Both are now fixed:
+1. **Concurrent-gap omit**: when two nodes were concurrently missing the same instance, the second node's `StagePendingIfAbsentAsync` returned false and the handler *omitted* the item, leaving that node unhealed. Fix: on false, return the **existing** pending row's deploymentId + token (multi-use within TTL) so that all concurrently-missing nodes heal in the same round.
+2. **Expired pending row blocks self-heal**: `StagePendingIfAbsentAsync` checked existence by `InstanceId` without considering expiry, so an expired-but-unpurged row (the periodic purge is still a deferred TODO) blocked a fresh stage *and* would collide with the snapshot's reused `DeploymentId` on the unique index. Fix: **expiry-aware staging**. Expired rows for the instance are deleted first, and only live rows are then checked; `GetPendingDeploymentByInstanceIdAsync` likewise filters by expiry. As a side effect, staging opportunistically cleans up expired rows, reducing reliance on the deferred periodic purge.
+
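The two fixes above can be illustrated with a small in-memory model. This is a hedged sketch, not the project's C#/EF code: `PendingStore`, `PendingRow`, `stage_if_absent`, `get_live_by_instance`, and `reconcile_item` are hypothetical stand-ins for the pending-deployment table, `StagePendingIfAbsentAsync`, `GetPendingDeploymentByInstanceIdAsync`, and the reconciliation handler. The token format and the dict-keyed "unique index" are assumptions made only for illustration.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


@dataclass
class PendingRow:
    instance_id: str
    deployment_id: str
    token: str
    expires_at: datetime


class PendingStore:
    """In-memory stand-in for the pending-deployment table (hypothetical)."""

    def __init__(self) -> None:
        # Keyed by deployment_id to model the unique index on DeploymentId.
        self._rows: Dict[str, PendingRow] = {}

    def get_live_by_instance(self, instance_id: str, now: datetime) -> Optional[PendingRow]:
        # Expiry-aware lookup: expired rows are never returned.
        for row in self._rows.values():
            if row.instance_id == instance_id and row.expires_at > now:
                return row
        return None

    def stage_if_absent(self, instance_id: str, deployment_id: str,
                        ttl: timedelta, now: datetime) -> bool:
        # Fix 2: delete expired rows for this instance first ...
        expired = [key for key, row in self._rows.items()
                   if row.instance_id == instance_id and row.expires_at <= now]
        for key in expired:
            del self._rows[key]
        # ... then check only live rows.
        if self.get_live_by_instance(instance_id, now) is not None:
            return False
        if deployment_id in self._rows:
            raise ValueError("unique index violation on deployment_id")
        self._rows[deployment_id] = PendingRow(
            instance_id, deployment_id, secrets.token_hex(16), now + ttl)
        return True


def reconcile_item(store: PendingStore, instance_id: str, deployment_id: str,
                   ttl: timedelta, now: datetime) -> Tuple[str, str]:
    # Fix 1: whether or not this caller staged the row, hand back the live
    # row's deployment id + token instead of omitting the item.
    store.stage_if_absent(instance_id, deployment_id, ttl, now)
    row = store.get_live_by_instance(instance_id, now)
    assert row is not None  # a live row exists after a stage attempt
    return row.deployment_id, row.token
```

In this model, two nodes reconciling the same instance within the TTL receive the same id and token, and a call after expiry replaces the stale row even though the `deployment_id` is reused.
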
 ## 12. Affected files (for the plan)

 - `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Deployment/` — new `PendingDeployment` entity.