# Deploy + Replication Config Transport — Notify-and-Fetch Design

**Date:** 2026-06-26 · **Status:** APPROVED (design) · **Area:** Deployment Manager / Site Runtime / Cluster Communication

**Fixes:** [`docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md`](../known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md)

## 1. Problem

The flattened instance config travels over Akka on **two** hops today, and both are bounded by Akka.Remote's default `maximum-frame-size` (**128 KB / `128000b`**). A large config (e.g. a 3rd composition of the same base template) silently breaks both:

1. **Central → site (deploy).** `DeploymentService` serializes the whole flattened config into `DeployInstanceCommand.FlattenedConfigurationJson` and Asks the site over ClusterClient. An oversized frame is **silently dropped**; the central `Ask` times out after 120 s (`Communication:DeploymentTimeout`); the instance becomes un-deployable until the config shrinks.
2. **Site active → standby (replication).** After a successful apply, `DeploymentManagerActor` does `_replicationActor.Tell(new ReplicateConfigDeploy(instanceName, command.FlattenedConfigurationJson, …))`; `SiteReplicationActor` relays it node-to-node via `ActorSelection(RootActorPath(peer)/"user"/"site-replication").Tell(…)`. This is **fire-and-forget** — an oversized `ApplyConfigDeploy` is silently dropped with *no timeout and no error*. The deploy "succeeds," the active node runs fine, and the standby silently lacks the instance — surfacing only at **failover**, when the now-active node recovers from its SQLite and the instance simply isn't there.

The JSON is double-escaped inside the Akka envelope (default serializer), so ~60–70 KB of raw config is enough to blow the 128 KB frame. `log-frame-size-exceeding` defaults off, so there isn't even a warning.

## 2. Goal & guiding principle

Eliminate the frame-size failure on **both** hops, permanently and at any config size, **without** raising `maximum-frame-size`.

**Principle:** *small control messages stay on Akka; the bulk config never touches Akka.* The config always moves over **HTTP from central** (the single source of truth) to whichever node needs it — the active node (which applies + runs it) and the standby node (which persists it for future recovery).

This mirrors how the site already works at startup: a site recovers its instances **purely from local SQLite** (`deployed_configurations`), never contacting central. A deploy thus becomes an *on-demand, single-instance version of startup recovery*: "your definition changed — fetch it and refresh." Only the config **source** changes; the apply path is unchanged.

### Why notify-and-fetch over alternatives

- **Push over gRPC (central→site):** would need host-aware routing (the `DeploymentManager` singleton runs only on the active node; the gRPC server runs on both), and still leaves the *intra-site* replication hop broken.
- **Raise Akka `maximum-frame-size`:** only raises the ceiling, doesn't decouple size from transport, and changes remoting config globally.
- **Sites read central MS SQL directly:** rejected — it tears down the enforced hub-spoke boundary (sites = local SQLite, central = MS SQL, enforced in DI *and* `StartupValidator`), gives every site node a login that can read central's *entire* config DB (huge blast radius), and couples sites to central's EF schema.

A small `RefreshDeploymentCommand` routes to the active-node singleton **for free** via the existing `ClusterSingletonProxy` (no host detection, no frame risk), and the same small-message pattern fixes replication. The only genuinely new piece is the **fetch channel**, implemented as an authenticated `GET` against central's **existing** HTTP management API.

## 3. Architecture overview

```
                       central MS SQL
                    ┌───────────────────┐
                    │ PendingDeployment │  (configJson, token, TTL, by deploymentId; ≤1 per instance)
                    └───────────────────┘
                      ▲               │ read (token-gated)
persist before notify │               │
                                      ▼
DeploymentService ──Akka small notify──▶ site (active node, via singleton proxy)
  RefreshDeploymentCommand(id, name, hash,      │
  deployedBy, ts, centralFetchBaseUrl, token)   │ 1. HTTP GET config (token) ◀── central mgmt API
                      ▲                         │ 2. apply (existing path): Instance Actor + SQLite
                      │ DeploymentStatusResponse│ 3. reply success/failure
                      └─────────────────────────┘
                                                │ 4. tell standby (small, id-only)
                                                ▼
                                  SiteReplicationActor (standby)
                                    HTTP GET config (token) ──▶ central mgmt API
                                    write deployed_configurations SQLite (no Instance Actor)
```

Every Akka hop now carries only small control messages. The bulk config moves only over HTTP, point-to-point, to the node that needs it.

## 4. Component changes

### 4.1 Central — pending-config store

New table **`PendingDeployment`** in central MS SQL (EF entity + migration):

| column | type | notes |
|---|---|---|
| `DeploymentId` | `varchar` PK | matches the deploy's GUID (`N` format) |
| `InstanceId` | `int`, indexed | for supersession lookup |
| `RevisionHash` | `varchar` | staleness/version marker |
| `ConfigJson` | `nvarchar(max)` | the flattened config |
| `Token` | `varchar` | random per-deployment fetch secret |
| `CreatedAtUtc` | `datetimeoffset` | |
| `ExpiresAtUtc` | `datetimeoffset`, indexed | TTL (default 5 min, configurable) |

Kept **separate** from `DeployedConfigSnapshot` so the existing "deployed = success" semantics stay clean. On success the pending row's config is promoted into `DeployedConfigSnapshot` (existing logic); on failure/timeout/TTL it's purged. A bounded periodic purge removes expired rows.
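
As a rough illustration of the table above, the row and the checks a fetch has to pass (TTL not elapsed, token match) might look like the following. The class shape, the `PendingDeploymentRules` helper, and the 32-byte token length are assumptions made for this sketch, not the actual entity; `RandomNumberGenerator.GetBytes`, `Convert.ToHexString`, and `CryptographicOperations.FixedTimeEquals` are standard .NET APIs.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical shape mirroring the PendingDeployment columns above.
public sealed class PendingDeployment
{
    public string DeploymentId { get; init; } = "";   // GUID in "N" format
    public int InstanceId { get; init; }
    public string RevisionHash { get; init; } = "";
    public string ConfigJson { get; init; } = "";
    public string Token { get; init; } = "";
    public DateTimeOffset CreatedAtUtc { get; init; }
    public DateTimeOffset ExpiresAtUtc { get; init; }
}

public static class PendingDeploymentRules
{
    // Random per-deployment fetch secret; 32 bytes is an assumed length.
    public static string NewToken(int bytes = 32) =>
        Convert.ToHexString(RandomNumberGenerator.GetBytes(bytes));

    // True only when the row is unexpired and the presented token matches.
    public static bool CanServe(PendingDeployment row, string presentedToken, DateTimeOffset nowUtc)
    {
        if (nowUtc >= row.ExpiresAtUtc) return false;
        var expected = Encoding.UTF8.GetBytes(row.Token);
        var actual = Encoding.UTF8.GetBytes(presentedToken ?? string.Empty);
        // FixedTimeEquals takes time that depends on input length, not on content.
        return CryptographicOperations.FixedTimeEquals(expected, actual);
    }
}
```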

**Supersession (requested):** when persisting a new pending row for an instance, **first delete any existing pending rows for that same `InstanceId`** (delete-then-insert, one transaction) → at most one pending row per instance, always the latest. Race-free because the existing **per-instance operation lock** already serializes same-instance deploys.

### 4.2 Central — serve endpoint

`GET /api/internal/deployments/{deploymentId}/config` on the existing management API:

- `[AllowAnonymous]` + **explicit token validation** (constant-time compare against the pending row's `Token`), TTL-checked. Machine-to-machine; never user-session-gated.
- Returns the row's `ConfigJson` (200) — or **404** when the `deploymentId` is unknown, expired, or **superseded** (no longer the current pending row for its instance). The 404-on-superseded is what makes stale fetches fail closed (see §5).
- Both central nodes read the same MS SQL, so the request works through Traefik's active-node LB.
- HTTPS strongly recommended in transit (config carries already-encrypted `SecretsBlock`s; the token is single-deployment + short-TTL).
- Companion endpoint for the follow-up (§6): `GET /api/internal/sites/{siteId}/deployments` → `[{instanceUniqueName, revisionHash}]`.

### 4.3 Central — send small notify

New message:

```csharp
public record RefreshDeploymentCommand(
    string DeploymentId,
    string InstanceUniqueName,
    string RevisionHash,
    string DeployedBy,
    DateTimeOffset Timestamp,
    string CentralFetchBaseUrl, // central's LB/Traefik URL — carried so the site needs no new standing config
    string FetchToken);
```

`DeploymentService` flow: flatten + hash (unchanged) → persist `PendingDeployment` (config + token + TTL, superseding prior) → send `RefreshDeploymentCommand` over the existing ClusterClient path (replacing the fat `DeployInstanceCommand`) → await `DeploymentStatusResponse` (reply path unchanged) → on success promote pending→snapshot.
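
The flow above, including delete-then-insert supersession, can be sketched with an in-memory stand-in for the pending store. Everything here (`PendingRow`, `InMemoryPendingStore`, `DeployFlow`, the `notifySite` and `promoteToSnapshot` delegates) is hypothetical scaffolding for illustration: the real design persists to MS SQL through EF and sends `RefreshDeploymentCommand` over ClusterClient. The sketch deliberately leaves open when a promoted row stops being served.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public sealed record PendingRow(string DeploymentId, int InstanceId, string ConfigJson, string Token);

// In-memory stand-in for the PendingDeployment table; the lock plays the role of a transaction.
public sealed class InMemoryPendingStore
{
    private readonly object _gate = new();
    private readonly List<PendingRow> _rows = new();

    // Delete-then-insert: any older pending row for the same instance is removed first.
    public void Supersede(PendingRow row)
    {
        lock (_gate)
        {
            _rows.RemoveAll(r => r.InstanceId == row.InstanceId);
            _rows.Add(row);
        }
    }

    // Returns null for unknown or superseded ids, which a serve endpoint would map to 404.
    public PendingRow? Find(string deploymentId)
    {
        lock (_gate) { return _rows.FirstOrDefault(r => r.DeploymentId == deploymentId); }
    }

    public void Remove(string deploymentId)
    {
        lock (_gate) { _rows.RemoveAll(r => r.DeploymentId == deploymentId); }
    }
}

public static class DeployFlow
{
    // persist pending -> small notify -> promote on success, purge on failure.
    public static async Task<bool> DeployAsync(
        InMemoryPendingStore store,
        PendingRow row,
        Func<string, Task<bool>> notifySite,     // stands in for the Ask over ClusterClient
        Action<PendingRow> promoteToSnapshot)    // stands in for the existing snapshot logic
    {
        store.Supersede(row);
        var succeeded = await notifySite(row.DeploymentId);
        if (succeeded) promoteToSnapshot(row);
        else store.Remove(row.DeploymentId);
        return succeeded;
    }
}
```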
The `Ask`/`DeploymentTimeout` stay, but now carry a tiny request/reply — no frame risk. `CentralFetchBaseUrl` is a new central option (e.g. `Communication:CentralFetchBaseUrl`).

### 4.4 Site — deploy (active node)

- `SiteCommunicationActor` routes `RefreshDeploymentCommand` to the `deployment-manager-proxy` (replacing the `DeployInstanceCommand` forward). The proxy delivers to the singleton on the **active node automatically** — small message, no host detection.
- `DeploymentManagerActor` handles `RefreshDeploymentCommand`: fetch the config via an injected `IDeploymentConfigFetcher` (HttpClient wrapper, registered for the site role), then run the **existing `ApplyDeployment`** verbatim (create/replace Instance Actor, write `deployed_configurations`, reply `DeploymentStatusResponse`). Only the config source changes.
- Fetch failures reply `Failed` with a `Communication failure:`-prefixed message (transient/comm) so central's DeploymentManager-006 query-site reconciliation fires; apply failures stay generic deployment errors.

### 4.5 Site — replication via fetch (active → standby), unified

- `ReplicateConfigDeploy` / `ApplyConfigDeploy` **drop `ConfigJson`** and gain `(DeploymentId, RevisionHash, IsEnabled, CentralFetchBaseUrl, FetchToken)`. After applying, the active node sends the id-only message to the standby.
- The standby's `SiteReplicationActor` **fetches** the config via the same `IDeploymentConfigFetcher`, then writes `deployed_configurations` via `SiteStorageService.StoreDeployedConfigAsync` — **no Instance Actor** (it's standby). Best-effort, but the fetch gets a small bounded retry (cheap/async).
- **Defensive write guard:** the standby only overwrites a `deployed_configurations` row when the incoming `RevisionHash`/`DeployedAt` is newer, so a delayed older replicate can't clobber a newer config.
- A stale replicate (superseded deploymentId) fetches → 404 → standby skips it; the newer deploy's replicate-notify delivers the correct config (§5).
- These records are intra-site (same Host binary on both nodes) → no cross-version concern.

### 4.6 Secondary bug — timeout classification

`DeploymentService.cs` (~line 312): add `AskTimeoutException` to the timeout branch (it does **not** derive from `System.TimeoutException`), alongside `OperationCanceledException`, so a no-reply deploy is classified as a timeout and triggers the query-site reconciliation.

### 4.7 Retire the fat message

`DeployInstanceCommand.FlattenedConfigurationJson` is no longer sent cross-cluster. Keep the record only if convenient as an internal apply DTO; otherwise remove. After this change, **no Akka message carries config.**

## 5. Supersession & ordering semantics

- **At central:** delete-then-insert under the per-instance lock ⇒ ≤1 pending row per instance, always the newest deploy.
- **Fetch fails closed:** the serve endpoint returns 404 for any `deploymentId` that isn't the current pending row (unknown / expired / superseded / already promoted). So an older in-flight deploy notify or an older replicate-notify can never fetch a lower-version config.
- **Standby last-write-wins + guard:** `deployed_configurations` upserts by `instance_unique_name`; the standby additionally refuses to overwrite with an older `RevisionHash`/`DeployedAt`. Net: the standby converges on the latest config, never a stale one.

## 6. Follow-up (in the plan, separate & lower-priority): standby/startup reconciliation

Replication is best-effort with no retry and **no startup reconciliation** today — a standby that is *down* during a deploy permanently misses that instance until its next deploy (a pre-existing gap, independent of frame size).
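
The self-healing step this section proposes amounts to comparing central's expected `{instanceUniqueName, revisionHash}` pairs with the node's local `deployed_configurations` rows. A minimal sketch of that comparison follows, assuming both sides are already loaded into dictionaries keyed by instance name; `SiteReconciler` and `ReconcilePlan` are hypothetical names, and the HTTP calls and SQLite access are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ReconcilePlan(IReadOnlyList<string> ToFetch, IReadOnlyList<string> ToDrop);

public static class SiteReconciler
{
    // expected: central's instanceUniqueName -> revisionHash for this site.
    // local:    the node's deployed_configurations rows, in the same shape.
    public static ReconcilePlan Plan(
        IReadOnlyDictionary<string, string> expected,
        IReadOnlyDictionary<string, string> local)
    {
        // Missing locally, or present with a different hash: fetch from central.
        var toFetch = expected
            .Where(e => !local.TryGetValue(e.Key, out var hash) || hash != e.Value)
            .Select(e => e.Key)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();

        // Present locally but no longer known to central: an orphan to drop.
        var toDrop = local.Keys
            .Where(name => !expected.ContainsKey(name))
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();

        return new ReconcilePlan(toFetch, toDrop);
    }
}
```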
To make replication self-healing:

- On node start / peer rejoin, the node reconciles its `deployed_configurations` against central's **expected set for this site** (via `GET /api/internal/sites/{siteId}/deployments` → `{instanceUniqueName, revisionHash}` pairs), fetching missing/stale configs (reusing the §4.2 config endpoint) and dropping orphans central no longer has.

Marked as a distinct task, sequenced **after** the core frame-size fix.

## 7. Configuration & options

- **Central:** `Communication:CentralFetchBaseUrl` (LB/Traefik URL); `Deployment:PendingDeploymentTtl` (default 5 min); token length; purge interval.
- **Site:** `HttpClient` registration for the site role; fetch timeout; replication fetch retry count.
- **Akka `maximum-frame-size`:** intentionally **unchanged** — the fix doesn't rely on it.
- gRPC message size: not involved (HTTP carries the config).

## 8. Failover & edge cases

- Site failover: the notify routes to the new active node automatically via the singleton proxy.
- Active node dies mid-fetch: central's `Ask` times out → reconciliation.
- Central node failover during a fetch: Traefik routes to the active central node; the pending row is in shared MS SQL → still served.
- Standby down during deploy: misses the replicate-notify → covered by the §6 follow-up.
- Token TTL must comfortably cover both nodes' fetches within the deploy window (default 5 min is ample).

## 9. Security

Token-gated internal endpoint; constant-time compare; short TTL; scoped to exactly one deployment's config; not tied to a user session; HTTPS recommended. Config `SecretsBlock`s are already encrypted at rest.

## 10. Testing

- **Unit:** `PendingDeployment` store incl. supersession (delete-then-insert) + TTL/purge; token validation (valid / expired / wrong / superseded, constant-time); `IDeploymentConfigFetcher` (200 / 401 / 404 / timeout); `DeploymentService` notify→promote flow; AskTimeout classification; standby fetch-on-replicate + older-write guard.
- **Integration:** the exact failing case — a **>128 KB** config deploys end-to-end; standby obtains it via fetch; kill the active node → failover recovers it from the standby's SQLite; redeploy a *newer* version → pending superseded, stale fetch 404s.
- **Live smoke (docker):** redeploy the 3rd-composition config that originally hung; confirm both nodes hold it.

## 11. Migration / deploy

- Single Host binary serves both roles → one rebuild + redeploy covers central and site. Message-contract changes (`RefreshDeploymentCommand` new; `ReplicateConfigDeploy` drops `ConfigJson`) are safe since all nodes upgrade together.
- New EF migration for `PendingDeployment` (auto-apply in dev; SQL script for prod). No site SQLite schema change.
- Add `CentralFetchBaseUrl` to central appsettings in `docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`; confirm site→central HTTP reachability (fine on the co-located test host/docker; a firewall port to open in a hub-spoke prod). Update `deploy/wonder-app-vd03/RUNBOOK.md`.

## 12. Affected files (for the plan)

- `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Deployment/` — new `PendingDeployment` entity.
- `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IDeploymentManagerRepository.cs` — pending CRUD + supersession + expected-set query.
- `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/` — new `RefreshDeploymentCommand`; retire `DeployInstanceCommand` wire use.
- `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/` — EF mapping, repository impl, migration.
- `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs` — persist pending + supersede, send notify, promote on success, AskTimeout classification.
- `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs` + `Actors/{Central,Site}CommunicationActor.cs` — route `RefreshDeploymentCommand`.
- `src/ZB.MOM.WW.ScadaBridge.CentralUI` (or management API project) — `/api/internal/deployments/{id}/config` + `/api/internal/sites/{id}/deployments` endpoints.
- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` — handle `RefreshDeploymentCommand`, fetch, replicate id-only.
- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/SiteReplicationActor.cs` + `Messages/ReplicationMessages.cs` — id-only replication, fetch on standby, older-write guard.
- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/` — `IDeploymentConfigFetcher` + HttpClient registration; site reconciliation (follow-up).
- appsettings (`docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`) + `RUNBOOK.md`.
- Tests under `tests/`.