docs(deploy): record T18/T19 plan refinement + live-smoke fixes + task state
@@ -168,6 +168,17 @@ Token-gated internal endpoint; constant-time compare; short TTL; scoped to exact
- New EF migration for `PendingDeployment` (auto-apply in dev; SQL script for prod). No site SQLite schema change.
- Add `CentralFetchBaseUrl` to the central appsettings in `docker/`, `docker-env2/`, and `deploy/wonder-app-vd03/`. Confirm site→central HTTP reachability: this already works on the co-located test host and in docker, but a hub-spoke prod deployment needs a firewall port opened. Update `deploy/wonder-app-vd03/RUNBOOK.md`.

## 11a. Live validation + post-implementation fixes (2026-06-26)

Smoke-tested on the docker cluster (rebuilt from this branch). Validated end-to-end:
- Migration applied; all 8 nodes healthy/ready.
- **Deploy notify-and-fetch**: central notify → active node fetch+apply → id-only replicate → standby fetch → `Success` in **~0.11 s** (the exact path that previously hung 120 s).
- **Startup reconciliation**: every node self-heals on startup. A single-node gap heals, and in the **concurrent both-nodes-empty** race both nodes heal.

The smoke test surfaced two real bugs in the reconciliation path. Unit and integration tests missed them because they had neither a second concurrent node nor a lingering expired row. Both are now fixed:
1. **Concurrent-gap omit** — when two nodes were concurrently missing the same instance, the second node's `StagePendingIfAbsentAsync` returned false and the handler *omitted* the item, leaving that node unhealed. Fix: on false, return the **existing** pending row's deploymentId + token (multi-use within TTL) so all concurrently-missing nodes heal in the same round.
2. **Expired pending row blocks self-heal** — `StagePendingIfAbsentAsync` checked existence by `InstanceId` ignoring expiry, so an expired-but-unpurged row (the periodic purge is still a deferred TODO) blocked a fresh stage *and* would collide with the snapshot's reused `DeploymentId` on the unique index. Fix: **expiry-aware staging** — delete expired rows for the instance first, then check only live rows; `GetPendingDeploymentByInstanceIdAsync` filters by expiry. This also opportunistically cleans expired rows, reducing reliance on the deferred periodic purge.
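
The two fixes above can be modelled in a few lines. This is only a sketch in Python against an in-memory list, not the project's C#/EF Core code. `PendingStore`, `PendingRow`, `handle_gap`, the token generation, and the simulated unique index are all hypothetical; `stage_pending_if_absent` and `get_pending_by_instance_id` stand in for `StagePendingIfAbsentAsync` and `GetPendingDeploymentByInstanceIdAsync`, whose real signatures are not shown in this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
import secrets


@dataclass
class PendingRow:
    # Hypothetical row shape; the real PendingDeployment entity may differ.
    instance_id: str
    deployment_id: str
    token: str
    expires_at: datetime


class PendingStore:
    """In-memory stand-in for the central pending-deployment table."""

    def __init__(self):
        self._rows = []

    def get_pending_by_instance_id(self, instance_id, now):
        # Expiry filter: only live rows are visible to callers.
        for row in self._rows:
            if row.instance_id == instance_id and row.expires_at > now:
                return row
        return None

    def stage_pending_if_absent(self, instance_id, deployment_id, ttl, now):
        # Expiry-aware staging: drop expired rows for this instance first,
        # so they can neither block a fresh stage nor collide on deployment_id.
        self._rows = [
            r for r in self._rows
            if not (r.instance_id == instance_id and r.expires_at <= now)
        ]
        if self.get_pending_by_instance_id(instance_id, now) is not None:
            return False  # a live row already exists
        # Simulated unique index on deployment_id (assumption).
        if any(r.deployment_id == deployment_id for r in self._rows):
            raise ValueError("duplicate deployment_id")
        self._rows.append(
            PendingRow(instance_id, deployment_id,
                       secrets.token_urlsafe(32), now + ttl)
        )
        return True


def handle_gap(store, instance_id, snapshot_deployment_id, ttl, now):
    """Return (deployment_id, token) for a node that is missing an instance."""
    store.stage_pending_if_absent(instance_id, snapshot_deployment_id, ttl, now)
    # Whether or not this call staged the row, hand back the live row, so a
    # second concurrently-missing node is no longer omitted.
    row = store.get_pending_by_instance_id(instance_id, now)
    return row.deployment_id, row.token
```

In this model, two nodes reporting the same gap within the TTL receive the same deploymentId and token. A request arriving after expiry stages a fresh row that reuses the snapshot's deploymentId without tripping the uniqueness check.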
## 12. Affected files (for the plan)
- `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Deployment/` — new `PendingDeployment` entity.