docs(deploy): notify-and-fetch design + implementation plan + task state

Joseph Doherty
2026-06-26 15:09:12 -04:00
parent 7c7989d176
commit e5503857df
3 changed files with 985 additions and 0 deletions
@@ -0,0 +1,184 @@
# Deploy + Replication Config Transport — Notify-and-Fetch Design
**Date:** 2026-06-26 · **Status:** APPROVED (design) · **Area:** Deployment Manager / Site Runtime / Cluster Communication
**Fixes:** [`docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md`](../known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md)
## 1. Problem
The flattened instance config travels over Akka on **two** hops today, and both are bounded by Akka.Remote's default `maximum-frame-size` (**128 KB / `128000b`**). A large config (e.g. a 3rd composition of the same base template) silently breaks both:
1. **Central → site (deploy).** `DeploymentService` serializes the whole flattened config into `DeployInstanceCommand.FlattenedConfigurationJson` and Asks the site over ClusterClient. An oversized frame is **silently dropped**; the central `Ask` times out after 120 s (`Communication:DeploymentTimeout`); the instance becomes un-deployable until the config shrinks.
2. **Site active → standby (replication).** After a successful apply, `DeploymentManagerActor` does `_replicationActor.Tell(new ReplicateConfigDeploy(instanceName, command.FlattenedConfigurationJson, …))`; `SiteReplicationActor` relays it node-to-node via `ActorSelection(RootActorPath(peer)/"user"/"site-replication").Tell(…)`. This is **fire-and-forget** — an oversized `ApplyConfigDeploy` is silently dropped with *no timeout and no error*. The deploy "succeeds," the active node runs fine, and the standby silently lacks the instance — surfacing only at **failover**, when the now-active node recovers from its SQLite and the instance simply isn't there.
The JSON is double-escaped inside the Akka envelope (default serializer), so ~60–70 KB of raw config is enough to blow the 128 KB frame. `log-frame-size-exceeding` defaults off, so there isn't even a warning.
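To make the size inflation concrete, the sketch below applies JSON string escaping twice and compares lengths. It assumes plain `\"` and `\\` escaping only; the actual envelope layout of the default serializer is not shown in this document, and `EscapeDemo` is a hypothetical name used purely for illustration.

```csharp
using System;
using System.Text;

public static class EscapeDemo
{
    // Escape a string the way a JSON string literal body would: '"' -> '\"', '\' -> '\\'.
    // Control characters are ignored to keep the sketch short.
    public static string EscapeOnce(string s)
    {
        var sb = new StringBuilder(s.Length + 16);
        foreach (var c in s)
        {
            if (c == '"' || c == '\\') sb.Append('\\');
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Growth factor after the payload has been escaped twice (a string nested in a string).
    public static double DoubleEscapedRatio(string json)
    {
        var twice = EscapeOnce(EscapeOnce(json));
        return (double)twice.Length / json.Length;
    }
}
```

Under this assumption each original quote grows from one character to four, so a config dominated by short keys and string values can roughly double in size before any envelope overhead is added.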
## 2. Goal & guiding principle
Eliminate the frame-size failure on **both** hops, permanently and at any config size, **without** raising `maximum-frame-size`.
**Principle:** *small control messages stay on Akka; the bulk config never touches Akka.* The config always moves over **HTTP from central** (the single source of truth) to whichever node needs it — the active node (which applies + runs it) and the standby node (which persists it for future recovery).
This mirrors how the site already works at startup: a site recovers its instances **purely from local SQLite** (`deployed_configurations`), never contacting central. A deploy thus becomes an *on-demand, single-instance version of startup recovery*: "your definition changed — fetch it and refresh." Only the config **source** changes; the apply path is unchanged.
### Why notify-and-fetch over alternatives
- **Push over gRPC (central→site):** would need host-aware routing (the `DeploymentManager` singleton runs only on the active node; the gRPC server runs on both), and still leaves the *intra-site* replication hop broken.
- **Raise Akka `maximum-frame-size`:** only raises the ceiling, doesn't decouple size from transport, and changes remoting config globally.
- **Sites read central MS SQL directly:** rejected — it tears down the enforced hub-spoke boundary (sites = local SQLite, central = MS SQL, enforced in DI *and* `StartupValidator`), gives every site node a login that can read central's *entire* config DB (huge blast radius), and couples sites to central's EF schema.
A small `RefreshDeploymentCommand` routes to the active-node singleton **for free** via the existing `ClusterSingletonProxy` (no host detection, no frame risk), and the same small-message pattern fixes replication. The only genuinely new piece is the **fetch channel**, implemented as an authenticated `GET` against central's **existing** HTTP management API.
## 3. Architecture overview
```
                            central MS SQL
                  ┌─────────────────────────────────┐
                  │        PendingDeployment        │  (configJson, token, TTL, by deploymentId; ≤1 per instance)
                  └─────────────────────────────────┘
                      ▲                         │ read (token-gated)
persist before notify │                         │
                      │                         ▼
      DeploymentService ──Akka small notify──▶ site (active node, via singleton proxy)
  RefreshDeploymentCommand(id, name, hash,      │
    deployedBy, ts, centralFetchBaseUrl, token) │ 1. HTTP GET config (token) ◀── central mgmt API
                      ▲                         │ 2. apply (existing path): Instance Actor + SQLite
                      │ DeploymentStatusResponse│ 3. reply success/failure
                      └─────────────────────────┤
                                                │ 4. tell standby (small, id-only)
                                      SiteReplicationActor (standby)
                                        HTTP GET config (token) ──▶ central mgmt API
                                        write deployed_configurations SQLite (no Instance Actor)
```
Every Akka hop now carries only small control messages. The bulk config moves only over HTTP, point-to-point, to the node that needs it.
## 4. Component changes
### 4.1 Central — pending-config store
New table **`PendingDeployment`** in central MS SQL (EF entity + migration):
| column | type | notes |
|---|---|---|
| `DeploymentId` | `varchar` PK | matches the deploy's GUID (`N` format) |
| `InstanceId` | `int`, indexed | for supersession lookup |
| `RevisionHash` | `varchar` | staleness/version marker |
| `ConfigJson` | `nvarchar(max)` | the flattened config |
| `Token` | `varchar` | random per-deployment fetch secret |
| `CreatedAtUtc` | `datetimeoffset` | |
| `ExpiresAtUtc` | `datetimeoffset`, indexed | TTL (default 5 min, configurable) |
Kept **separate** from `DeployedConfigSnapshot` so the existing "deployed = success" semantics stay clean. On success the pending row's config is promoted into `DeployedConfigSnapshot` (existing logic); on failure/timeout/TTL it's purged. A bounded periodic purge removes expired rows.
**Supersession (requested):** when persisting a new pending row for an instance, **first delete any existing pending rows for that same `InstanceId`** (delete-then-insert, one transaction) → at most one pending row per instance, always the latest. Race-free because the existing **per-instance operation lock** already serializes same-instance deploys.
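The delete-then-insert rule can be sketched with an in-memory stand-in for the table. This is only an illustration under stated assumptions: the real store is an EF entity in MS SQL, the `lock` stands in for the single transaction, and `PendingRow` / `InMemoryPendingStore` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
#nullable enable

public sealed record PendingRow(
    string DeploymentId, int InstanceId, string RevisionHash, string ConfigJson,
    string Token, DateTimeOffset CreatedAtUtc, DateTimeOffset ExpiresAtUtc);

// In-memory stand-in for the PendingDeployment table (assumption for illustration only).
public sealed class InMemoryPendingStore
{
    private readonly object _gate = new();
    private readonly Dictionary<string, PendingRow> _rows = new();

    public PendingRow Supersede(string deploymentId, int instanceId, string revisionHash,
                                string configJson, DateTimeOffset now, TimeSpan ttl)
    {
        lock (_gate) // stands in for "one transaction"
        {
            // Delete every earlier pending row for this instance first...
            var stale = _rows.Values.Where(r => r.InstanceId == instanceId)
                                    .Select(r => r.DeploymentId).ToList();
            foreach (var id in stale) _rows.Remove(id);

            // ...then insert the new row with a fresh random token and an expiry.
            var token = Convert.ToHexString(RandomNumberGenerator.GetBytes(32));
            var row = new PendingRow(deploymentId, instanceId, revisionHash,
                                     configJson, token, now, now + ttl);
            _rows[deploymentId] = row;
            return row;
        }
    }

    public PendingRow? Find(string deploymentId)
    {
        lock (_gate) return _rows.TryGetValue(deploymentId, out var r) ? r : null;
    }

    // Periodic purge: removes rows whose TTL has elapsed and returns how many were removed.
    public int PurgeExpired(DateTimeOffset now)
    {
        lock (_gate)
        {
            var expired = _rows.Values.Where(r => r.ExpiresAtUtc <= now)
                                      .Select(r => r.DeploymentId).ToList();
            foreach (var id in expired) _rows.Remove(id);
            return expired.Count;
        }
    }
}
```

After two calls to `Supersede` for the same `InstanceId`, only the second `DeploymentId` can still be found, which is the property the fetch endpoint relies on.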
### 4.2 Central — serve endpoint
`GET /api/internal/deployments/{deploymentId}/config` on the existing management API:
- `[AllowAnonymous]` + **explicit token validation** (constant-time compare against the pending row's `Token`), TTL-checked. Machine-to-machine; never user-session-gated.
- Returns the row's `ConfigJson` (200) — or **404** when the `deploymentId` is unknown, expired, or **superseded** (no longer the current pending row for its instance). The 404-on-superseded is what makes stale fetches fail closed (see §5).
- Both central nodes read the same MS SQL, so the request works through Traefik's active-node LB.
- HTTPS strongly recommended in transit (config carries already-encrypted `SecretsBlock`s; the token is single-deployment + short-TTL).
- Companion endpoint for the follow-up (§6): `GET /api/internal/sites/{siteId}/deployments` → `[{instanceUniqueName, revisionHash}]`.
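The endpoint's decision rules might look like the following pure function. It is a sketch, not the endpoint itself: `ConfigServeRules` is a hypothetical name, the 401 for a wrong token is an assumption taken from the status codes listed in §10, and the order of the token and TTL checks is not specified by this design.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ConfigServeRules
{
    // Constant-time comparison of the stored and presented tokens.
    public static bool TokensMatch(string expected, string presented) =>
        CryptographicOperations.FixedTimeEquals(
            Encoding.UTF8.GetBytes(expected), Encoding.UTF8.GetBytes(presented));

    // Returns the HTTP status code the endpoint would answer with.
    // rowExists is false when the deploymentId is unknown, purged, or superseded.
    public static int Decide(bool rowExists, string rowToken, DateTimeOffset rowExpiresAtUtc,
                             string presentedToken, DateTimeOffset now)
    {
        if (!rowExists) return 404;                              // fail closed
        if (!TokensMatch(rowToken, presentedToken)) return 401;  // assumed status for a bad token
        if (rowExpiresAtUtc <= now) return 404;                  // TTL elapsed
        return 200;
    }
}
```

Because a superseded row has already been deleted, it falls into the first branch and the caller sees the same 404 as for an unknown id.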
### 4.3 Central — send small notify
New message:
```csharp
public record RefreshDeploymentCommand(
string DeploymentId,
string InstanceUniqueName,
string RevisionHash,
string DeployedBy,
DateTimeOffset Timestamp,
string CentralFetchBaseUrl, // central's LB/Traefik URL — carried so the site needs no new standing config
string FetchToken);
```
`DeploymentService` flow: flatten + hash (unchanged) → persist `PendingDeployment` (config + token + TTL, superseding prior) → send `RefreshDeploymentCommand` over the existing ClusterClient path (replacing the fat `DeployInstanceCommand`) → await `DeploymentStatusResponse` (reply path unchanged) → on success promote pending→snapshot. The `Ask`/`DeploymentTimeout` stay, but now carry a tiny request/reply — no frame risk. `CentralFetchBaseUrl` is a new central option (e.g. `Communication:CentralFetchBaseUrl`).
### 4.4 Site — deploy (active node)
- `SiteCommunicationActor` routes `RefreshDeploymentCommand` to the `deployment-manager-proxy` (replacing the `DeployInstanceCommand` forward). The proxy delivers to the singleton on the **active node automatically** — small message, no host detection.
- `DeploymentManagerActor` handles `RefreshDeploymentCommand`: fetch the config via an injected `IDeploymentConfigFetcher` (HttpClient wrapper, registered for the site role), then run the **existing `ApplyDeployment`** verbatim (create/replace Instance Actor, write `deployed_configurations`, reply `DeploymentStatusResponse`). Only the config source changes.
- Fetch failures reply `Failed` with a `Communication failure:`-prefixed message (transient/comm) so central's DeploymentManager-006 query-site reconciliation fires; apply failures stay generic deployment errors.
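A minimal sketch of what `IDeploymentConfigFetcher` could look like follows. The interface shape, the `FetchResult` type, and the `X-Fetch-Token` header name are assumptions made for illustration; this design does not say how the token is presented or what the fetcher returns.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
#nullable enable

public enum FetchOutcome { Ok, NotCurrent, CommunicationFailure }

public sealed record FetchResult(FetchOutcome Outcome, string? ConfigJson, string? Error);

public interface IDeploymentConfigFetcher
{
    Task<FetchResult> FetchAsync(string baseUrl, string deploymentId, string token, CancellationToken ct);
}

public sealed class HttpDeploymentConfigFetcher : IDeploymentConfigFetcher
{
    private readonly HttpClient _http;
    public HttpDeploymentConfigFetcher(HttpClient http) => _http = http;

    // Builds the fetch URL from the base URL carried in the notify message.
    public static string BuildConfigUrl(string baseUrl, string deploymentId) =>
        $"{baseUrl.TrimEnd('/')}/api/internal/deployments/{Uri.EscapeDataString(deploymentId)}/config";

    public async Task<FetchResult> FetchAsync(
        string baseUrl, string deploymentId, string token, CancellationToken ct)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, BuildConfigUrl(baseUrl, deploymentId));
        request.Headers.Add("X-Fetch-Token", token); // hypothetical header name

        try
        {
            using var response = await _http.SendAsync(request, ct);
            if (response.StatusCode == HttpStatusCode.NotFound)
                return new FetchResult(FetchOutcome.NotCurrent, null,
                    "deployment is unknown, expired, or superseded");
            if (!response.IsSuccessStatusCode)
                return new FetchResult(FetchOutcome.CommunicationFailure, null,
                    $"Communication failure: HTTP {(int)response.StatusCode}");
            return new FetchResult(FetchOutcome.Ok, await response.Content.ReadAsStringAsync(ct), null);
        }
        catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
        {
            // Network errors and HttpClient timeouts are reported as transient failures.
            return new FetchResult(FetchOutcome.CommunicationFailure, null,
                $"Communication failure: {ex.Message}");
        }
    }
}
```

Keeping 404 as its own outcome lets the standby skip a superseded replicate while the active node can still map other failures to the `Communication failure:` prefix.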
### 4.5 Site — replication via fetch (active → standby), unified
- `ReplicateConfigDeploy` / `ApplyConfigDeploy` **drop `ConfigJson`** and gain `(DeploymentId, RevisionHash, IsEnabled, CentralFetchBaseUrl, FetchToken)`. After applying, the active node sends the id-only message to the standby.
- The standby's `SiteReplicationActor` **fetches** the config via the same `IDeploymentConfigFetcher`, then writes `deployed_configurations` via `SiteStorageService.StoreDeployedConfigAsync`, with **no Instance Actor** (it's the standby). Best-effort, but the fetch gets a small bounded retry (cheap/async).
- **Defensive write guard:** the standby only overwrites a `deployed_configurations` row when the incoming `RevisionHash`/`DeployedAt` is newer, so a delayed older replicate can't clobber a newer config.
- A stale replicate (superseded deploymentId) fetches → 404 → standby skips it; the newer deploy's replicate-notify delivers the correct config (§5).
- These records are intra-site (same Host binary on both nodes) → no cross-version concern.
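The defensive write guard could be expressed as a small predicate. Since a hash carries no ordering, this sketch assumes `DeployedAt` decides which config is newer and `RevisionHash` is only used to detect a redelivery of the same revision; that split, and the names `StoredConfig` and `StandbyWriteGuard`, are assumptions.

```csharp
using System;
#nullable enable

public sealed record StoredConfig(string InstanceUniqueName, string RevisionHash, DateTimeOffset DeployedAt);

public static class StandbyWriteGuard
{
    // True when the incoming config may replace what the standby already holds.
    public static bool ShouldOverwrite(StoredConfig? existing, StoredConfig incoming)
    {
        if (existing is null) return true;                                  // nothing stored yet
        if (existing.RevisionHash == incoming.RevisionHash) return false;   // same revision, skip
        return incoming.DeployedAt > existing.DeployedAt;                   // only strictly newer wins
    }
}
```

A delayed replicate for an older deploy therefore returns `false` and leaves the newer row untouched.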
### 4.6 Secondary bug — timeout classification
`DeploymentService.cs` (~line 312): add `AskTimeoutException` to the timeout branch (it does **not** derive from `System.TimeoutException`), alongside `OperationCanceledException`, so a no-reply deploy is classified as a timeout and triggers the query-site reconciliation.
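The classification change amounts to widening one type test. In the sketch below, the local `AskTimeoutException` is a stand-in so the snippet compiles without the Akka package (the real type is Akka's), and `DeployFailureClassifier` is a hypothetical name; the existing branch in `DeploymentService.cs` may be structured differently.

```csharp
using System;

// Stand-in for Akka's AskTimeoutException so this sketch has no package dependency.
// Like the real type, it does not derive from System.TimeoutException.
public sealed class AskTimeoutException : Exception
{
    public AskTimeoutException(string message) : base(message) { }
}

public static class DeployFailureClassifier
{
    // True when a failed deploy should be treated as a timeout
    // and trigger the query-site reconciliation.
    public static bool IsTimeout(Exception ex) =>
        ex is TimeoutException or OperationCanceledException or AskTimeoutException;
}
```

`TaskCanceledException` is covered as well, because it derives from `OperationCanceledException`.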
### 4.7 Retire the fat message
`DeployInstanceCommand.FlattenedConfigurationJson` is no longer sent cross-cluster. Keep the record only if convenient as an internal apply DTO; otherwise remove. After this change, **no Akka message carries config.**
## 5. Supersession & ordering semantics
- **At central:** delete-then-insert under the per-instance lock ⇒ ≤1 pending row per instance, always the newest deploy.
- **Fetch fails closed:** the serve endpoint returns 404 for any `deploymentId` that isn't the current pending row (unknown / expired / superseded / already promoted). So an older in-flight deploy notify or an older replicate-notify can never fetch a lower-version config.
- **Standby last-write-wins + guard:** `deployed_configurations` upserts by `instance_unique_name`; the standby additionally refuses to overwrite with an older `RevisionHash`/`DeployedAt`. Net: the standby converges on the latest config, never a stale one.
## 6. Follow-up (in the plan, separate & lower-priority): standby/startup reconciliation
Replication is best-effort with no retry and **no startup reconciliation** today — a standby that is *down* during a deploy permanently misses that instance until its next deploy (a pre-existing gap, independent of frame size). To make replication self-healing:
- On node start / peer rejoin, the node reconciles its `deployed_configurations` against central's **expected set for this site** (via `GET /api/internal/sites/{siteId}/deployments` → `{instanceUniqueName, revisionHash}` pairs), fetching missing/stale configs (reusing the §4.2 config endpoint) and dropping orphans central no longer has.
Marked as a distinct task, sequenced **after** the core frame-size fix.
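The reconciliation step reduces to a set difference over `{instanceUniqueName, revisionHash}` pairs. The sketch below shows one way to compute it, assuming both sides are available as name-to-hash dictionaries; `SiteReconciler` and `ReconcilePlan` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ReconcilePlan(IReadOnlyList<string> ToFetch, IReadOnlyList<string> ToDrop);

public static class SiteReconciler
{
    // local:    instanceUniqueName -> revisionHash held in the node's SQLite
    // expected: instanceUniqueName -> revisionHash reported by central for this site
    public static ReconcilePlan Plan(
        IReadOnlyDictionary<string, string> local,
        IReadOnlyDictionary<string, string> expected)
    {
        // Missing locally, or present with a different hash: fetch from central.
        var toFetch = expected
            .Where(e => !local.TryGetValue(e.Key, out var hash) || hash != e.Value)
            .Select(e => e.Key)
            .OrderBy(n => n, StringComparer.Ordinal)
            .ToList();

        // Present locally but unknown to central: orphan, drop it.
        var toDrop = local.Keys
            .Where(name => !expected.ContainsKey(name))
            .OrderBy(n => n, StringComparer.Ordinal)
            .ToList();

        return new ReconcilePlan(toFetch, toDrop);
    }
}
```

For example, with local `{a:h1, b:h2, c:h3}` and expected `{a:h1, b:hX, d:h4}`, the plan fetches `b` and `d` and drops `c`.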
## 7. Configuration & options
- **Central:** `Communication:CentralFetchBaseUrl` (LB/Traefik URL); `Deployment:PendingDeploymentTtl` (default 5 min); token length; purge interval.
- **Site:** `HttpClient` registration for the site role; fetch timeout; replication fetch retry count.
- **Akka `maximum-frame-size`:** intentionally **unchanged** — the fix doesn't rely on it.
- gRPC message size: not involved (HTTP carries the config).
## 8. Failover & edge cases
- Site failover: the notify routes to the new active node automatically via the singleton proxy.
- Active node dies mid-fetch: central's `Ask` times out → reconciliation.
- Central node failover during a fetch: Traefik routes to the active central node; the pending row is in shared MS SQL → still served.
- Standby down during deploy: misses the replicate-notify → covered by the §6 follow-up.
- Token TTL must comfortably cover both nodes' fetches within the deploy window (default 5 min is ample).
## 9. Security
Token-gated internal endpoint; constant-time compare; short TTL; scoped to exactly one deployment's config; not tied to a user session; HTTPS recommended. Config `SecretsBlock`s are already encrypted at rest.
## 10. Testing
- **Unit:** `PendingDeployment` store incl. supersession (delete-then-insert) + TTL/purge; token validation (valid / expired / wrong / superseded, constant-time); `IDeploymentConfigFetcher` (200 / 401 / 404 / timeout); `DeploymentService` notify→promote flow; AskTimeout classification; standby fetch-on-replicate + older-write guard.
- **Integration:** the exact failing case — a **>128 KB** config deploys end-to-end; standby obtains it via fetch; kill the active node → failover recovers it from the standby's SQLite; redeploy a *newer* version → pending superseded, stale fetch 404s.
- **Live smoke (docker):** redeploy the 3rd-composition config that originally hung; confirm both nodes hold it.
## 11. Migration / deploy
- Single Host binary serves both roles → one rebuild + redeploy covers central and site. Message-contract changes (`RefreshDeploymentCommand` new; `ReplicateConfigDeploy` drops `ConfigJson`) are safe since all nodes upgrade together.
- New EF migration for `PendingDeployment` (auto-apply in dev; SQL script for prod). No site SQLite schema change.
- Add `CentralFetchBaseUrl` to central appsettings in `docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`; confirm site→central HTTP reachability (fine on the co-located test host/docker; a firewall port to open in a hub-spoke prod). Update `deploy/wonder-app-vd03/RUNBOOK.md`.
## 12. Affected files (for the plan)
- `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Deployment/` — new `PendingDeployment` entity.
- `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IDeploymentManagerRepository.cs` — pending CRUD + supersession + expected-set query.
- `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/` — new `RefreshDeploymentCommand`; retire `DeployInstanceCommand` wire use.
- `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/` — EF mapping, repository impl, migration.
- `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs` — persist pending + supersede, send notify, promote on success, AskTimeout classification.
- `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs` + `Actors/{Central,Site}CommunicationActor.cs` — route `RefreshDeploymentCommand`.
- `src/ZB.MOM.WW.ScadaBridge.CentralUI` (or management API project) — `/api/internal/deployments/{id}/config` + `/api/internal/sites/{id}/deployments` endpoints.
- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` — handle `RefreshDeploymentCommand`, fetch, replicate id-only.
- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/SiteReplicationActor.cs` + `Messages/ReplicationMessages.cs` — id-only replication, fetch on standby, older-write guard.
- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/`: `IDeploymentConfigFetcher` + HttpClient registration; site reconciliation (follow-up).
- appsettings (`docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`) + `RUNBOOK.md`.
- Tests under `tests/`.