docs(deploy): notify-and-fetch design + implementation plan + task state

2026-06-26 15:09:12 -04:00
parent 7c7989d176
commit e5503857df
3 changed files with 985 additions and 0 deletions
@@ -0,0 +1,184 @@
+# Deploy + Replication Config Transport — Notify-and-Fetch Design
+
+**Date:** 2026-06-26 · **Status:** APPROVED (design) · **Area:** Deployment Manager / Site Runtime / Cluster Communication
+**Fixes:** [`docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md`](../known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md)
+
+## 1. Problem
+
+The flattened instance config travels over Akka on **two** hops today, and both are bounded by Akka.Remote's default `maximum-frame-size` (**128 KB / `128000b`**). A large config (e.g. a 3rd composition of the same base template) silently breaks both:
+
+1. **Central → site (deploy).** `DeploymentService` serializes the whole flattened config into `DeployInstanceCommand.FlattenedConfigurationJson` and Asks the site over ClusterClient. An oversized frame is **silently dropped**; the central `Ask` times out after 120 s (`Communication:DeploymentTimeout`); the instance becomes un-deployable until the config shrinks.
+
+2. **Site active → standby (replication).** After a successful apply, `DeploymentManagerActor` does `_replicationActor.Tell(new ReplicateConfigDeploy(instanceName, command.FlattenedConfigurationJson, …))`; `SiteReplicationActor` relays it node-to-node via `ActorSelection(RootActorPath(peer)/"user"/"site-replication").Tell(…)`. This is **fire-and-forget** — an oversized `ApplyConfigDeploy` is silently dropped with *no timeout and no error*. The deploy "succeeds," the active node runs fine, and the standby silently lacks the instance — surfacing only at **failover**, when the now-active node recovers from its SQLite and the instance simply isn't there.
+
+The JSON is double-escaped inside the Akka envelope (default serializer), so every quote and backslash in the config costs extra bytes on the wire, and ~60–70 KB of raw config is enough to exceed the 128 KB frame. `log-frame-size-exceeding` defaults to off, so not even a warning is logged.
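+
+To make the size inflation concrete, here is a minimal sketch of what two escaping passes do to a JSON payload. It is illustrative only: `FrameSizeSketch` and `EscapeAsJsonString` are hypothetical names, and the exact escaping performed by the Akka serializer is assumed rather than taken from its source.
+
+```csharp
+using System;
+using System.Text;
+
+public static class FrameSizeSketch
+{
+    // Escapes a string as it would appear inside a JSON string literal.
+    // Only backslash and double quote are handled (assumption: these dominate
+    // the growth for a config that is itself JSON).
+    public static string EscapeAsJsonString(string s) =>
+        s.Replace("\\", "\\\\").Replace("\"", "\\\"");
+
+    // UTF-8 byte count, i.e. what the payload would occupy on the wire.
+    public static int Utf8Size(string s) => Encoding.UTF8.GetByteCount(s);
+}
+```
+
+For the 7-character input `{"a":1}`, one pass yields 9 characters and a second pass yields 13, because the second pass must also escape the backslashes added by the first. Quote-heavy configs therefore grow much faster than their raw size suggests.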
+
+## 2. Goal & guiding principle
+
+Eliminate the frame-size failure on **both** hops, permanently and at any config size, **without** raising `maximum-frame-size`.
+
+**Principle:** *small control messages stay on Akka; the bulk config never touches Akka.* The config always moves over **HTTP from central** (the single source of truth) to whichever node needs it — the active node (which applies + runs it) and the standby node (which persists it for future recovery).
+
+This mirrors how the site already works at startup: a site recovers its instances **purely from local SQLite** (`deployed_configurations`), never contacting central. A deploy thus becomes an *on-demand, single-instance version of startup recovery*: "your definition changed — fetch it and refresh." Only the config **source** changes; the apply path is unchanged.
+
+### Why notify-and-fetch over alternatives
+
+- **Push over gRPC (central→site):** would need host-aware routing (the `DeploymentManager` singleton runs only on the active node; the gRPC server runs on both), and still leaves the *intra-site* replication hop broken.
+- **Raise Akka `maximum-frame-size`:** only raises the ceiling, doesn't decouple size from transport, and changes remoting config globally.
+- **Sites read central MS SQL directly:** rejected — it tears down the enforced hub-spoke boundary (sites = local SQLite, central = MS SQL, enforced in DI *and* `StartupValidator`), gives every site node a login that can read central's *entire* config DB (huge blast radius), and couples sites to central's EF schema.
+
+A small `RefreshDeploymentCommand` routes to the active-node singleton **for free** via the existing `ClusterSingletonProxy` (no host detection, no frame risk), and the same small-message pattern fixes replication. The only genuinely new piece is the **fetch channel**, implemented as an authenticated `GET` against central's **existing** HTTP management API.
+
+## 3. Architecture overview
+
+```
+                         central MS SQL
+                       ┌──────────────────┐
+                       │ PendingDeployment │  (configJson, token, TTL, by deploymentId; ≤1 per instance)
+                       └──────────────────┘
+                          ▲            │ read (token-gated)
+            persist before notify      │
+                          │            ▼
+   DeploymentService ──Akka small notify──▶ site (active node, via singleton proxy)
+   RefreshDeploymentCommand(id, name, hash,        │
+     deployedBy, ts, centralFetchBaseUrl, token)   │ 1. HTTP GET config (token)  ◀── central mgmt API
+                          ▲                         │ 2. apply (existing path): Instance Actor + SQLite
+                          │ DeploymentStatusResponse│ 3. reply success/failure
+                          └─────────────────────────┘
+                                                    │ 4. tell standby (small, id-only)
+                                                    ▼
+                                       SiteReplicationActor (standby)
+                                         HTTP GET config (token) ──▶ central mgmt API
+                                         write deployed_configurations SQLite (no Instance Actor)
+```
+
+Every Akka hop now carries only small control messages. The bulk config moves only over HTTP, point-to-point, to the node that needs it.
+
+## 4. Component changes
+
+### 4.1 Central — pending-config store
+
+New table **`PendingDeployment`** in central MS SQL (EF entity + migration):
+
+| column | type | notes |
+|---|---|---|
+| `DeploymentId` | `varchar` PK | matches the deploy's GUID (`N` format) |
+| `InstanceId` | `int`, indexed | for supersession lookup |
+| `RevisionHash` | `varchar` | staleness/version marker |
+| `ConfigJson` | `nvarchar(max)` | the flattened config |
+| `Token` | `varchar` | random per-deployment fetch secret |
+| `CreatedAtUtc` | `datetimeoffset` | |
+| `ExpiresAtUtc` | `datetimeoffset`, indexed | TTL (default 5 min, configurable) |
+
+Kept **separate** from `DeployedConfigSnapshot` so the existing "deployed = success" semantics stay clean. On success the pending row's config is promoted into `DeployedConfigSnapshot` (existing logic); on failure/timeout/TTL it's purged. A bounded periodic purge removes expired rows.
+
+**Supersession (requested):** when persisting a new pending row for an instance, **first delete any existing pending rows for that same `InstanceId`** (delete-then-insert, one transaction) → at most one pending row per instance, always the latest. Race-free because the existing **per-instance operation lock** already serializes same-instance deploys.
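+
+The delete-then-insert rule can be sketched with an in-memory stand-in for the table. This is an assumption-laden illustration, not the EF implementation: `PendingRow` and `InMemoryPendingStore` are hypothetical, and a `lock` plays the role of the single database transaction.
+
+```csharp
+#nullable enable
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+public sealed record PendingRow(
+    string DeploymentId, int InstanceId, string ConfigJson, DateTimeOffset ExpiresAtUtc);
+
+public sealed class InMemoryPendingStore
+{
+    private readonly object _gate = new();
+    private readonly Dictionary<string, PendingRow> _rows = new();
+
+    // Delete-then-insert as one atomic step: afterwards at most one row
+    // exists for incoming.InstanceId, and it is the incoming one.
+    public void Supersede(PendingRow incoming)
+    {
+        lock (_gate)
+        {
+            var stale = _rows.Values
+                .Where(r => r.InstanceId == incoming.InstanceId)
+                .Select(r => r.DeploymentId)
+                .ToList();
+            foreach (var id in stale) _rows.Remove(id);
+            _rows[incoming.DeploymentId] = incoming;
+        }
+    }
+
+    // Returns the row only if it is still present and not expired.
+    public PendingRow? Find(string deploymentId, DateTimeOffset nowUtc)
+    {
+        lock (_gate)
+        {
+            return _rows.TryGetValue(deploymentId, out var row) && row.ExpiresAtUtc > nowUtc
+                ? row
+                : null;
+        }
+    }
+}
+```
+
+After two `Supersede` calls for the same `InstanceId`, looking up the first `DeploymentId` returns `null`, which is the behaviour the serve endpoint relies on to answer 404 for superseded deploys.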
+
+### 4.2 Central — serve endpoint
+
+`GET /api/internal/deployments/{deploymentId}/config` on the existing management API:
+
+- `[AllowAnonymous]` + **explicit token validation** (constant-time compare against the pending row's `Token`), TTL-checked. Machine-to-machine; never user-session-gated.
+- Returns the row's `ConfigJson` (200) — or **404** when the `deploymentId` is unknown, expired, or **superseded** (no longer the current pending row for its instance). The 404-on-superseded is what makes stale fetches fail closed (see §5).
+- Both central nodes read the same MS SQL, so the request works through Traefik's active-node LB.
+- HTTPS strongly recommended in transit (config carries already-encrypted `SecretsBlock`s; the token is single-deployment + short-TTL).
+- Companion endpoint for the follow-up (§6): `GET /api/internal/sites/{siteId}/deployments` → `[{instanceUniqueName, revisionHash}]`.
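+
+A minimal sketch of the token handling described above, using only .NET base-class-library calls (`RandomNumberGenerator.GetBytes` and `Convert.ToHexString` need .NET 6 or later). `FetchTokenSketch`, the 32-byte token length, and the 401 answer for a wrong token are assumptions; the design only fixes the 200 and 404 cases.
+
+```csharp
+#nullable enable
+using System;
+using System.Security.Cryptography;
+using System.Text;
+
+public static class FetchTokenSketch
+{
+    // Random per-deployment secret, hex-encoded (64 characters for 32 bytes).
+    public static string NewToken() =>
+        Convert.ToHexString(RandomNumberGenerator.GetBytes(32));
+
+    // Constant-time comparison; returns false when the lengths differ.
+    public static bool TokensMatch(string expected, string presented) =>
+        CryptographicOperations.FixedTimeEquals(
+            Encoding.UTF8.GetBytes(expected),
+            Encoding.UTF8.GetBytes(presented));
+
+    // storedToken is null when no current pending row exists for the id
+    // (unknown or superseded). Status for a wrong token is an assumption.
+    public static int StatusFor(
+        string? storedToken, DateTimeOffset expiresAtUtc,
+        string presentedToken, DateTimeOffset nowUtc)
+    {
+        if (storedToken is null || expiresAtUtc <= nowUtc) return 404;
+        return TokensMatch(storedToken, presentedToken) ? 200 : 401;
+    }
+}
+```
+
+Checking existence and expiry before the token keeps unknown, expired, and superseded ids indistinguishable to the caller, which matches the fail-closed intent.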
+
+### 4.3 Central — send small notify
+
+New message:
+
+```csharp
+public record RefreshDeploymentCommand(
+    string DeploymentId,
+    string InstanceUniqueName,
+    string RevisionHash,
+    string DeployedBy,
+    DateTimeOffset Timestamp,
+    string CentralFetchBaseUrl,   // central's LB/Traefik URL — carried so the site needs no new standing config
+    string FetchToken);
+```
+
+`DeploymentService` flow:
+
+1. Flatten + hash (unchanged).
+2. Persist `PendingDeployment` (config + token + TTL, superseding any prior row).
+3. Send `RefreshDeploymentCommand` over the existing ClusterClient path (replacing the fat `DeployInstanceCommand`).
+4. Await `DeploymentStatusResponse` (reply path unchanged).
+5. On success, promote pending→snapshot.
+
+The `Ask` and `DeploymentTimeout` stay, but now carry a tiny request/reply, so there is no frame risk. `CentralFetchBaseUrl` is a new central option (e.g. `Communication:CentralFetchBaseUrl`).
+
+### 4.4 Site — deploy (active node)
+
+- `SiteCommunicationActor` routes `RefreshDeploymentCommand` to the `deployment-manager-proxy` (replacing the `DeployInstanceCommand` forward). The proxy delivers to the singleton on the **active node automatically** — small message, no host detection.
+- `DeploymentManagerActor` handles `RefreshDeploymentCommand`: fetch the config via an injected `IDeploymentConfigFetcher` (HttpClient wrapper, registered for the site role), then run the **existing `ApplyDeployment`** verbatim (create/replace Instance Actor, write `deployed_configurations`, reply `DeploymentStatusResponse`). Only the config source changes.
+- Fetch failures reply `Failed` with a `Communication failure:`-prefixed message (transient/comm) so central's DeploymentManager-006 query-site reconciliation fires; apply failures stay generic deployment errors.
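+
+One possible shape for the fetcher, as a sketch under stated assumptions: the interface name comes from this design, but its member, the `FetchResult` record, the `HttpDeploymentConfigFetcher` class, and the `X-Fetch-Token` header are all hypothetical (the design does not say how the token is presented).
+
+```csharp
+#nullable enable
+using System;
+using System.Net;
+using System.Net.Http;
+using System.Threading;
+using System.Threading.Tasks;
+
+public sealed record FetchResult(bool Found, string? ConfigJson);
+
+public interface IDeploymentConfigFetcher
+{
+    Task<FetchResult> FetchAsync(
+        string baseUrl, string deploymentId, string token, CancellationToken ct);
+}
+
+public sealed class HttpDeploymentConfigFetcher : IDeploymentConfigFetcher
+{
+    private readonly HttpClient _http;
+
+    public HttpDeploymentConfigFetcher(HttpClient http) => _http = http;
+
+    public async Task<FetchResult> FetchAsync(
+        string baseUrl, string deploymentId, string token, CancellationToken ct)
+    {
+        // Normalise the base so a trailing slash (or its absence) does not matter.
+        var root = new Uri(baseUrl.TrimEnd('/') + "/");
+        var uri = new Uri(root,
+            $"api/internal/deployments/{Uri.EscapeDataString(deploymentId)}/config");
+
+        using var request = new HttpRequestMessage(HttpMethod.Get, uri);
+        request.Headers.Add("X-Fetch-Token", token); // hypothetical header name
+
+        using var response = await _http.SendAsync(request, ct);
+
+        // 404 means unknown, expired, or superseded: report "not found", do not throw.
+        if (response.StatusCode == HttpStatusCode.NotFound)
+            return new FetchResult(false, null);
+
+        // Any other non-success status surfaces as HttpRequestException.
+        response.EnsureSuccessStatusCode();
+        return new FetchResult(true, await response.Content.ReadAsStringAsync(ct));
+    }
+}
+```
+
+Separating "not found" from thrown transport errors lets the caller skip a superseded deploy quietly while still mapping real communication failures to the `Communication failure:` reply.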
+
+### 4.5 Site — replication via fetch (active → standby), unified
+
+- `ReplicateConfigDeploy` / `ApplyConfigDeploy` **drop `ConfigJson`** and gain `(DeploymentId, RevisionHash, IsEnabled, CentralFetchBaseUrl, FetchToken)`. After applying, the active node sends the id-only message to the standby.
+- The standby's `SiteReplicationActor` **fetches** the config via the same `IDeploymentConfigFetcher`, then writes `deployed_configurations` via `SiteStorageService.StoreDeployedConfigAsync` — **no Instance Actor** (it's standby). Best-effort, but the fetch gets a small bounded retry (cheap/async).
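+- **Defensive write guard:** the standby only overwrites a `deployed_configurations` row when the incoming deploy is newer than the stored one, so a delayed older replicate can't clobber a newer config. The ordering has to come from `DeployedAt`: a `RevisionHash` shows that two configs differ, not which one is newer.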
+- A stale replicate (superseded deploymentId) fetches → 404 → standby skips it; the newer deploy's replicate-notify delivers the correct config (§5).
+- These records are intra-site (same Host binary on both nodes) → no cross-version concern.
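+
+The defensive write guard can be sketched as a pure function. `StoredConfig` and `StandbyWriteGuard` are hypothetical names, and ordering by `DeployedAt` alone is an assumption, since a hash has no natural order.
+
+```csharp
+#nullable enable
+using System;
+
+public sealed record StoredConfig(string RevisionHash, DateTimeOffset DeployedAt);
+
+public static class StandbyWriteGuard
+{
+    // True when the incoming config may replace the stored row.
+    // A missing row is always written; otherwise only a strictly newer
+    // DeployedAt wins, so a delayed older replicate is ignored.
+    public static bool ShouldOverwrite(StoredConfig? existing, StoredConfig incoming) =>
+        existing is null || incoming.DeployedAt > existing.DeployedAt;
+}
+```
+
+Using a strict comparison means a duplicate replicate with the same timestamp is a no-op rather than a redundant write.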
+
+### 4.6 Secondary bug — timeout classification
+
+`DeploymentService.cs` (~line 312): add `AskTimeoutException` to the timeout branch (it does **not** derive from `System.TimeoutException`), alongside `OperationCanceledException`, so a no-reply deploy is classified as a timeout and triggers the query-site reconciliation.
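+
+A sketch of the classification, written so it compiles without Akka.NET: the `AskTimeoutException` below is a local stand-in for `Akka.Actor.AskTimeoutException`, and `DeployFailureClassifier` is a hypothetical helper, not the actual code at that line.
+
+```csharp
+using System;
+
+// Stand-in for Akka.Actor.AskTimeoutException (assumption: used here only so the
+// sketch is self-contained). Like the real type, it is not a System.TimeoutException.
+public sealed class AskTimeoutException : Exception
+{
+    public AskTimeoutException(string message) : base(message) { }
+}
+
+public static class DeployFailureClassifier
+{
+    // Treats all three "no reply in time" shapes as a timeout.
+    // TaskCanceledException is covered because it derives from OperationCanceledException.
+    public static bool IsTimeout(Exception ex) =>
+        ex is TimeoutException or OperationCanceledException or AskTimeoutException;
+}
+```
+
+In a `catch` block the same test can be used as an exception filter, for example `catch (Exception ex) when (DeployFailureClassifier.IsTimeout(ex))`.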
+
+### 4.7 Retire the fat message
+
+`DeployInstanceCommand.FlattenedConfigurationJson` is no longer sent cross-cluster. Keep the record only if convenient as an internal apply DTO; otherwise remove. After this change, **no Akka message carries config.**
+
+## 5. Supersession & ordering semantics
+
+- **At central:** delete-then-insert under the per-instance lock ⇒ ≤1 pending row per instance, always the newest deploy.
+- **Fetch fails closed:** the serve endpoint returns 404 for any `deploymentId` that isn't the current pending row (unknown / expired / superseded / already promoted). So an older in-flight deploy notify or an older replicate-notify can never fetch a stale config.
+- **Standby last-write-wins + guard:** `deployed_configurations` upserts by `instance_unique_name`; the standby additionally refuses to overwrite a row with an older deploy (ordered by `DeployedAt`; `RevisionHash` identifies the revision but carries no order). Net: the standby converges on the latest config, never a stale one.
+
+## 6. Follow-up (in the plan, separate & lower-priority): standby/startup reconciliation
+
+Replication is best-effort with no retry and **no startup reconciliation** today — a standby that is *down* during a deploy permanently misses that instance until its next deploy (a pre-existing gap, independent of frame size). To make replication self-healing:
+
+- On node start / peer rejoin, the node reconciles its `deployed_configurations` against central's **expected set for this site** (via `GET /api/internal/sites/{siteId}/deployments` → `{instanceUniqueName, revisionHash}` pairs), fetching missing/stale configs (reusing the §4.2 config endpoint) and dropping orphans central no longer has.
+
+Marked as a distinct task, sequenced **after** the core frame-size fix.
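+
+The reconciliation step reduces to a set comparison between central's expected pairs and the local rows. A minimal sketch, assuming both sides are available as `instanceUniqueName → revisionHash` maps; `ReconcilePlan` and `SiteReconciler` are hypothetical names.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+public sealed record ReconcilePlan(IReadOnlyList<string> ToFetch, IReadOnlyList<string> ToDrop);
+
+public static class SiteReconciler
+{
+    // expected: what central says this site should hold.
+    // local:    what deployed_configurations currently holds on this node.
+    public static ReconcilePlan Plan(
+        IReadOnlyDictionary<string, string> expected,
+        IReadOnlyDictionary<string, string> local)
+    {
+        // Missing locally, or present with a different hash: fetch from central.
+        var toFetch = expected
+            .Where(e => !local.TryGetValue(e.Key, out var hash) || hash != e.Value)
+            .Select(e => e.Key)
+            .OrderBy(n => n, StringComparer.Ordinal)
+            .ToList();
+
+        // Present locally but no longer expected by central: orphan, drop it.
+        var toDrop = local.Keys
+            .Where(n => !expected.ContainsKey(n))
+            .OrderBy(n => n, StringComparer.Ordinal)
+            .ToList();
+
+        return new ReconcilePlan(toFetch, toDrop);
+    }
+}
+```
+
+Keeping the plan computation free of I/O makes it easy to unit-test separately from the HTTP fetch and the SQLite writes.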
+
+## 7. Configuration & options
+
+- **Central:** `Communication:CentralFetchBaseUrl` (LB/Traefik URL); `Deployment:PendingDeploymentTtl` (default 5 min); token length; purge interval.
+- **Site:** `HttpClient` registration for the site role; fetch timeout; replication fetch retry count.
+- **Akka `maximum-frame-size`:** intentionally **unchanged** — the fix doesn't rely on it.
+- gRPC message size: not involved (HTTP carries the config).
+
+## 8. Failover & edge cases
+
+- Site failover: the notify routes to the new active node automatically via the singleton proxy.
+- Active node dies mid-fetch: central's `Ask` times out → reconciliation.
+- Central node failover during a fetch: Traefik routes to the active central node; the pending row is in shared MS SQL → still served.
+- Standby down during deploy: misses the replicate-notify → covered by the §6 follow-up.
+- Token TTL must comfortably cover both nodes' fetches within the deploy window; the 5 min default is well above the 120 s `DeploymentTimeout`.
+
+## 9. Security
+
+Token-gated internal endpoint; constant-time compare; short TTL; scoped to exactly one deployment's config; not tied to a user session; HTTPS recommended. Config `SecretsBlock`s are already encrypted at rest.
+
+## 10. Testing
+
+- **Unit:** `PendingDeployment` store incl. supersession (delete-then-insert) + TTL/purge; token validation (valid / expired / wrong / superseded, constant-time); `IDeploymentConfigFetcher` (200 / 401 / 404 / timeout); `DeploymentService` notify→promote flow; AskTimeout classification; standby fetch-on-replicate + older-write guard.
+- **Integration:** the exact failing case — a **>128 KB** config deploys end-to-end; standby obtains it via fetch; kill the active node → failover recovers it from the standby's SQLite; redeploy a *newer* version → pending superseded, stale fetch 404s.
+- **Live smoke (docker):** redeploy the 3rd-composition config that originally hung; confirm both nodes hold it.
+
+## 11. Migration / deploy
+
+- Single Host binary serves both roles → one rebuild + redeploy covers central and site. Message-contract changes (`RefreshDeploymentCommand` new; `ReplicateConfigDeploy` drops `ConfigJson`) are safe since all nodes upgrade together.
+- New EF migration for `PendingDeployment` (auto-apply in dev; SQL script for prod). No site SQLite schema change.
+- Add `CentralFetchBaseUrl` to central appsettings in `docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`; confirm site→central HTTP reachability (already available on the co-located test host/docker; in a hub-spoke production deployment a firewall port must be opened). Update `deploy/wonder-app-vd03/RUNBOOK.md`.
+
+## 12. Affected files (for the plan)
+
+- `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Deployment/` — new `PendingDeployment` entity.
+- `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IDeploymentManagerRepository.cs` — pending CRUD + supersession + expected-set query.
+- `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/` — new `RefreshDeploymentCommand`; retire `DeployInstanceCommand` wire use.
+- `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/` — EF mapping, repository impl, migration.
+- `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs` — persist pending + supersede, send notify, promote on success, AskTimeout classification.
+- `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs` + `Actors/{Central,Site}CommunicationActor.cs` — route `RefreshDeploymentCommand`.
+- `src/ZB.MOM.WW.ScadaBridge.CentralUI` (or management API project) — `/api/internal/deployments/{id}/config` + `/api/internal/sites/{id}/deployments` endpoints.
+- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` — handle `RefreshDeploymentCommand`, fetch, replicate id-only.
+- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/SiteReplicationActor.cs` + `Messages/ReplicationMessages.cs` — id-only replication, fetch on standby, older-write guard.
+- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/` — `IDeploymentConfigFetcher` + HttpClient registration; site reconciliation (follow-up).
+- appsettings (`docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`) + `RUNBOOK.md`.
+- Tests under `tests/`.