Deploy + Replication Config Transport — Notify-and-Fetch Design
Date: 2026-06-26 · Status: APPROVED (design) · Area: Deployment Manager / Site Runtime / Cluster Communication
Fixes: docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md
1. Problem
The flattened instance config travels over Akka on two hops today, and both are bounded by Akka.Remote's default maximum-frame-size (128 KB / 128000b). A large config (e.g. a 3rd composition of the same base template) silently breaks both:
- Central → site (deploy). DeploymentService serializes the whole flattened config into DeployInstanceCommand.FlattenedConfigurationJson and Asks the site over ClusterClient. An oversized frame is silently dropped; the central Ask times out after 120 s (Communication:DeploymentTimeout); the instance becomes un-deployable until the config shrinks.
- Site active → standby (replication). After a successful apply, DeploymentManagerActor calls _replicationActor.Tell(new ReplicateConfigDeploy(instanceName, command.FlattenedConfigurationJson, …)); SiteReplicationActor then relays it node-to-node via ActorSelection(RootActorPath(peer)/"user"/"site-replication").Tell(…). This is fire-and-forget: an oversized ApplyConfigDeploy is silently dropped with no timeout and no error. The deploy "succeeds," the active node runs fine, and the standby silently lacks the instance, which surfaces only at failover, when the now-active node recovers from its SQLite and the instance simply isn't there.
The JSON is double-escaped inside the Akka envelope (default serializer), so ~60–70 KB of raw config is enough to blow the 128 KB frame. log-frame-size-exceeding defaults off, so there isn't even a warning.
2. Goal & guiding principle
Eliminate the frame-size failure on both hops, permanently and at any config size, without raising maximum-frame-size.
Principle: small control messages stay on Akka; the bulk config never touches Akka. The config always moves over HTTP from central (the single source of truth) to whichever node needs it — the active node (which applies + runs it) and the standby node (which persists it for future recovery).
This mirrors how the site already works at startup: a site recovers its instances purely from local SQLite (deployed_configurations), never contacting central. A deploy thus becomes an on-demand, single-instance version of startup recovery: "your definition changed — fetch it and refresh." Only the config source changes; the apply path is unchanged.
Why notify-and-fetch over alternatives
- Push over gRPC (central→site): would need host-aware routing (the DeploymentManager singleton runs only on the active node; the gRPC server runs on both), and still leaves the intra-site replication hop broken.
- Raise Akka maximum-frame-size: only raises the ceiling, doesn't decouple size from transport, and changes remoting config globally.
- Sites read central MS SQL directly: rejected. It tears down the enforced hub-spoke boundary (sites = local SQLite, central = MS SQL, enforced in DI and StartupValidator), gives every site node a login that can read central's entire config DB (huge blast radius), and couples sites to central's EF schema.
A small RefreshDeploymentCommand routes to the active-node singleton for free via the existing ClusterSingletonProxy (no host detection, no frame risk), and the same small-message pattern fixes replication. The only genuinely new piece is the fetch channel, implemented as an authenticated GET against central's existing HTTP management API.
3. Architecture overview
central MS SQL
┌──────────────────┐
│ PendingDeployment │ (configJson, token, TTL, by deploymentId; ≤1 per instance)
└──────────────────┘
▲ │ read (token-gated)
persist before notify │
│ ▼
DeploymentService ──Akka small notify──▶ site (active node, via singleton proxy)
RefreshDeploymentCommand(id, name, hash, │
deployedBy, ts, centralFetchBaseUrl, token) │ 1. HTTP GET config (token) ◀── central mgmt API
▲ │ 2. apply (existing path): Instance Actor + SQLite
│ DeploymentStatusResponse│ 3. reply success/failure
└─────────────────────────┘
│ 4. tell standby (small, id-only)
▼
SiteReplicationActor (standby)
HTTP GET config (token) ──▶ central mgmt API
write deployed_configurations SQLite (no Instance Actor)
Every Akka hop now carries only small control messages. The bulk config moves only over HTTP, point-to-point, to the node that needs it.
4. Component changes
4.1 Central — pending-config store
New table PendingDeployment in central MS SQL (EF entity + migration):
| column | type | notes |
|---|---|---|
| DeploymentId | varchar PK | matches the deploy's GUID (N format) |
| InstanceId | int, indexed | for supersession lookup |
| RevisionHash | varchar | staleness/version marker |
| ConfigJson | nvarchar(max) | the flattened config |
| Token | varchar | random per-deployment fetch secret |
| CreatedAtUtc | datetimeoffset | |
| ExpiresAtUtc | datetimeoffset, indexed | TTL (default 5 min, configurable) |
Kept separate from DeployedConfigSnapshot so the existing "deployed = success" semantics stay clean. On success the pending row's config is promoted into DeployedConfigSnapshot (existing logic); on failure/timeout/TTL it's purged. A bounded periodic purge removes expired rows.
Supersession (requested): when persisting a new pending row for an instance, first delete any existing pending rows for that same InstanceId (delete-then-insert, one transaction) → at most one pending row per instance, always the latest. Race-free because the existing per-instance operation lock already serializes same-instance deploys.
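The delete-then-insert rule can be illustrated with a small in-memory stand-in for the table. This is only a sketch: InMemoryPendingStore and its methods are hypothetical names, and the lock stands in for the single database transaction; the real implementation would be an EF repository method running under the existing per-instance operation lock.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory mirror of the PendingDeployment columns from the table above.
public sealed record PendingDeployment(
    string DeploymentId, int InstanceId, string RevisionHash,
    string ConfigJson, string Token,
    DateTimeOffset CreatedAtUtc, DateTimeOffset ExpiresAtUtc);

public sealed class InMemoryPendingStore
{
    private readonly object _gate = new();               // stands in for the DB transaction
    private readonly List<PendingDeployment> _rows = new();

    // Delete-then-insert: drop every pending row for the instance, then add the new one.
    public void Supersede(PendingDeployment incoming)
    {
        lock (_gate)
        {
            _rows.RemoveAll(r => r.InstanceId == incoming.InstanceId);
            _rows.Add(incoming);
        }
    }

    // Lookup by deployment id; returns null once the row has been superseded.
    public PendingDeployment? Find(string deploymentId)
    {
        lock (_gate) return _rows.FirstOrDefault(r => r.DeploymentId == deploymentId);
    }

    public int CountForInstance(int instanceId)
    {
        lock (_gate) return _rows.Count(r => r.InstanceId == instanceId);
    }
}
```

Staging two deployments for the same InstanceId leaves exactly one row, and a lookup by the older DeploymentId returns null, which is the condition the serve endpoint turns into a 404.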
4.2 Central — serve endpoint
GET /api/internal/deployments/{deploymentId}/config on the existing management API:
- [AllowAnonymous] + explicit token validation (constant-time compare against the pending row's Token), TTL-checked. Machine-to-machine; never user-session-gated.
- Returns the row's ConfigJson (200), or 404 when the deploymentId is unknown, expired, or superseded (no longer the current pending row for its instance). The 404-on-superseded is what makes stale fetches fail closed (see §5).
- Both central nodes read the same MS SQL, so the request works through Traefik's active-node LB.
- HTTPS strongly recommended in transit (config carries already-encrypted SecretsBlocks; the token is single-deployment + short-TTL).
- Companion endpoint for the follow-up (§6): GET /api/internal/sites/{siteId}/deployments → [{instanceUniqueName, revisionHash}].
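The endpoint's decision logic can be sketched as a pure function, separate from any web framework. PendingRow and ConfigServeCheck are hypothetical names. Answering 401 for a wrong token is an assumption: the design fixes only 200 and 404 here, though the test list in §10 mentions a 401 case for the fetcher. CryptographicOperations.FixedTimeEquals is the standard .NET constant-time comparison.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical projection of the pending row; null models "unknown or superseded".
public sealed record PendingRow(string DeploymentId, string ConfigJson, string Token, DateTimeOffset ExpiresAtUtc);

public static class ConfigServeCheck
{
    // Returns the HTTP status the endpoint would answer with.
    // 401 for a wrong token is an assumption; the design only specifies 200 and 404.
    public static int Evaluate(PendingRow? row, string? presentedToken, DateTimeOffset nowUtc)
    {
        // Unknown, superseded, or expired rows all fail closed with 404.
        if (row is null || nowUtc >= row.ExpiresAtUtc) return 404;

        byte[] expected = Encoding.UTF8.GetBytes(row.Token);
        byte[] actual = Encoding.UTF8.GetBytes(presentedToken ?? string.Empty);

        // FixedTimeEquals does not short-circuit on the first differing byte.
        return CryptographicOperations.FixedTimeEquals(expected, actual) ? 200 : 401;
    }
}
```

Keeping the check free of framework types makes the valid / expired / wrong / superseded cases straightforward to unit-test.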
4.3 Central — send small notify
New message:
public record RefreshDeploymentCommand(
string DeploymentId,
string InstanceUniqueName,
string RevisionHash,
string DeployedBy,
DateTimeOffset Timestamp,
string CentralFetchBaseUrl, // central's LB/Traefik URL — carried so the site needs no new standing config
string FetchToken);
DeploymentService flow: flatten + hash (unchanged) → persist PendingDeployment (config + token + TTL, superseding prior) → send RefreshDeploymentCommand over the existing ClusterClient path (replacing the fat DeployInstanceCommand) → await DeploymentStatusResponse (reply path unchanged) → on success promote pending→snapshot. The Ask/DeploymentTimeout stay, but now carry a tiny request/reply — no frame risk. CentralFetchBaseUrl is a new central option (e.g. Communication:CentralFetchBaseUrl).
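A minimal sketch of the values central prepares before sending the notify: an N-format deployment id, a random fetch token, and the URL the site will call. FetchNotify and its members are hypothetical names, and the 32-byte token length is an assumption (the design lists token length as configurable).

```csharp
using System;
using System.Security.Cryptography;

public static class FetchNotify
{
    // GUID in "N" format (32 hex digits, no dashes), matching the DeploymentId column note.
    public static string NewDeploymentId() => Guid.NewGuid().ToString("N");

    // Cryptographically random token, hex-encoded. 32 bytes is an assumed default.
    public static string NewFetchToken(int byteLength = 32) =>
        Convert.ToHexString(RandomNumberGenerator.GetBytes(byteLength));

    // Joins CentralFetchBaseUrl with the serve endpoint path from §4.2.
    public static Uri BuildConfigUri(string centralFetchBaseUrl, string deploymentId) =>
        new Uri($"{centralFetchBaseUrl.TrimEnd('/')}/api/internal/deployments/{Uri.EscapeDataString(deploymentId)}/config");
}
```

Trimming the trailing slash keeps the result stable whether or not the configured base URL ends with one.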
4.4 Site — deploy (active node)
- SiteCommunicationActor routes RefreshDeploymentCommand to the deployment-manager-proxy (replacing the DeployInstanceCommand forward). The proxy delivers to the singleton on the active node automatically: small message, no host detection.
- DeploymentManagerActor handles RefreshDeploymentCommand: fetch the config via an injected IDeploymentConfigFetcher (HttpClient wrapper, registered for the site role), then run the existing ApplyDeployment verbatim (create/replace Instance Actor, write deployed_configurations, reply DeploymentStatusResponse). Only the config source changes.
- Fetch failures reply Failed with a Communication failure:-prefixed message (transient/comm) so central's DeploymentManager-006 query-site reconciliation fires; apply failures stay generic deployment errors.
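One possible shape for the fetcher is sketched below. The design names IDeploymentConfigFetcher but does not give its signature, so FetchResult, FetchAsync, and the bearer-header transport for the token are all assumptions. StubHandler is a test helper that answers without touching the network.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading;
using System.Threading.Tasks;

public sealed record FetchResult(bool Success, string? ConfigJson, string? Error);

public sealed class DeploymentConfigFetcher
{
    private readonly HttpClient _http;
    public DeploymentConfigFetcher(HttpClient http) => _http = http;

    public async Task<FetchResult> FetchAsync(Uri configUri, string fetchToken, CancellationToken ct = default)
    {
        try
        {
            using var request = new HttpRequestMessage(HttpMethod.Get, configUri);
            // Assumption: the token travels as a bearer header; the design does not specify this.
            request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", fetchToken);
            using var response = await _http.SendAsync(request, ct);
            if (response.StatusCode == HttpStatusCode.OK)
                return new FetchResult(true, await response.Content.ReadAsStringAsync(ct), null);
            // Any non-200 (401, 404, ...) is reported with the comm-failure prefix.
            return new FetchResult(false, null,
                $"Communication failure: config fetch returned {(int)response.StatusCode}");
        }
        catch (Exception ex) when (ex is HttpRequestException or OperationCanceledException)
        {
            // Network errors and HttpClient timeouts (TaskCanceledException) land here.
            return new FetchResult(false, null, $"Communication failure: {ex.Message}");
        }
    }
}

// Test helper: answers every request with a fixed status and body.
public sealed class StubHandler : HttpMessageHandler
{
    private readonly HttpStatusCode _status;
    private readonly string _body;
    public StubHandler(HttpStatusCode status, string body) { _status = status; _body = body; }
    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken ct) =>
        Task.FromResult(new HttpResponseMessage(_status) { Content = new StringContent(_body) });
}
```

Returning a result object instead of throwing lets the actor map every fetch failure to a Failed reply with the prefix that central's reconciliation looks for.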
4.5 Site — replication via fetch (active → standby), unified
- ReplicateConfigDeploy/ApplyConfigDeploy drop ConfigJson and gain (DeploymentId, RevisionHash, IsEnabled, CentralFetchBaseUrl, FetchToken). After applying, the active node sends the id-only message to the standby.
- The standby's SiteReplicationActor fetches the config via the same IDeploymentConfigFetcher, then writes deployed_configurations via SiteStorageService.StoreDeployedConfigAsync, with no Instance Actor (it's standby). Best-effort, but the fetch gets a small bounded retry (cheap/async).
- Defensive write guard: the standby only overwrites a deployed_configurations row when the incoming RevisionHash/DeployedAt is newer, so a delayed older replicate can't clobber a newer config.
- A stale replicate (superseded deploymentId) fetches → 404 → standby skips it; the newer deploy's replicate-notify delivers the correct config (§5).
- These records are intra-site (same Host binary on both nodes) → no cross-version concern.
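The defensive write guard can be read as a small predicate. The sketch below is one plausible interpretation, not the confirmed rule: it treats an identical RevisionHash as "nothing to do" and otherwise lets only a strictly newer DeployedAt win. StoredConfig and StandbyWriteGuard are hypothetical names.

```csharp
using System;

// Hypothetical view of the two fields the guard compares.
public sealed record StoredConfig(string RevisionHash, DateTimeOffset DeployedAt);

public static class StandbyWriteGuard
{
    // True when the incoming replicate should overwrite the stored row.
    public static bool ShouldOverwrite(StoredConfig? existing, StoredConfig incoming)
    {
        if (existing is null) return true;                                 // nothing stored yet
        if (existing.RevisionHash == incoming.RevisionHash) return false;  // same revision already stored
        return incoming.DeployedAt > existing.DeployedAt;                  // only strictly newer deploys win
    }
}
```

A delayed replicate for an older deploy therefore returns false and leaves the newer row in place.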
4.6 Secondary bug — timeout classification
DeploymentService.cs (~line 312): add AskTimeoutException to the timeout branch (it does not derive from System.TimeoutException), alongside OperationCanceledException, so a no-reply deploy is classified as a timeout and triggers the query-site reconciliation.
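The classification change amounts to widening one type test. In the sketch below, the local AskTimeoutException is a stand-in for Akka.Actor.AskTimeoutException so the example compiles without the Akka package; like the real type, it does not derive from System.TimeoutException. DeployFailureClassifier is a hypothetical name, not the actual code in DeploymentService.cs.

```csharp
using System;

// Stand-in for Akka.Actor.AskTimeoutException (assumption: used only to keep the sketch self-contained).
public class AskTimeoutException : Exception
{
    public AskTimeoutException(string message) : base(message) { }
}

public static class DeployFailureClassifier
{
    // A no-reply Ask must count as a timeout so the query-site reconciliation fires.
    public static bool IsTimeout(Exception ex) =>
        ex is TimeoutException or OperationCanceledException or AskTimeoutException;
}
```

Because TaskCanceledException derives from OperationCanceledException, cancelled awaits are covered by the same test.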
4.7 Retire the fat message
DeployInstanceCommand.FlattenedConfigurationJson is no longer sent cross-cluster. Keep the record only if convenient as an internal apply DTO; otherwise remove. After this change, no Akka message carries config.
5. Supersession & ordering semantics
- At central: delete-then-insert under the per-instance lock ⇒ ≤1 pending row per instance, always the newest deploy.
- Fetch fails closed: the serve endpoint returns 404 for any deploymentId that isn't the current pending row (unknown / expired / superseded / already promoted). So an older in-flight deploy notify or an older replicate-notify can never fetch a lower-version config.
- Standby last-write-wins + guard: deployed_configurations upserts by instance_unique_name; the standby additionally refuses to overwrite with an older RevisionHash/DeployedAt. Net: the standby converges on the latest config, never a stale one.
6. Follow-up (in the plan, separate & lower-priority): standby/startup reconciliation
Replication is best-effort with no retry and no startup reconciliation today — a standby that is down during a deploy permanently misses that instance until its next deploy (a pre-existing gap, independent of frame size). To make replication self-healing:
- On node start / peer rejoin, the node reconciles its deployed_configurations against central's expected set for this site (via GET /api/internal/sites/{siteId}/deployments → {instanceUniqueName, revisionHash} pairs), fetching missing/stale configs (reusing the §4.2 config endpoint) and dropping orphans central no longer has.
Marked as a distinct task, sequenced after the core frame-size fix.
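The reconciliation step reduces to a comparison of two name-to-hash maps. The sketch below assumes both sides are already loaded into dictionaries keyed by instanceUniqueName; ReconcilePlan and SiteReconciler are hypothetical names, and the actual fetching and deleting are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ReconcilePlan(IReadOnlyList<string> ToFetch, IReadOnlyList<string> Orphans);

public static class SiteReconciler
{
    // local and expected map instanceUniqueName -> revisionHash.
    public static ReconcilePlan Plan(
        IReadOnlyDictionary<string, string> local,
        IReadOnlyDictionary<string, string> expected)
    {
        // Missing locally, or present with a different hash: fetch from central.
        var toFetch = expected
            .Where(e => !local.TryGetValue(e.Key, out var hash) || hash != e.Value)
            .Select(e => e.Key)
            .OrderBy(n => n, StringComparer.Ordinal)
            .ToList();

        // Present locally but no longer expected by central: drop.
        var orphans = local.Keys
            .Where(n => !expected.ContainsKey(n))
            .OrderBy(n => n, StringComparer.Ordinal)
            .ToList();

        return new ReconcilePlan(toFetch, orphans);
    }
}
```

For example, with local {A:h1, B:h2, C:h3} and expected {A:h1, B:h9, D:h4}, the plan fetches B and D and drops C.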
7. Configuration & options
- Central: Communication:CentralFetchBaseUrl (LB/Traefik URL); Deployment:PendingDeploymentTtl (default 5 min); token length; purge interval.
- Site: HttpClient registration for the site role; fetch timeout; replication fetch retry count.
- Akka maximum-frame-size: intentionally unchanged; the fix doesn't rely on it.
- gRPC message size: not involved (HTTP carries the config).
8. Failover & edge cases
- Site failover: the notify routes to the new active node automatically via the singleton proxy.
- Active node dies mid-fetch: central's Ask times out → reconciliation.
- Central node failover during a fetch: Traefik routes to the active central node; the pending row is in shared MS SQL → still served.
- Standby down during deploy: misses the replicate-notify → covered by the §6 follow-up.
- Token TTL must comfortably cover both nodes' fetches within the deploy window (default 5 min is ample).
9. Security
Token-gated internal endpoint; constant-time compare; short TTL; scoped to exactly one deployment's config; not tied to a user session; HTTPS recommended. Config SecretsBlocks are already encrypted at rest.
10. Testing
- Unit: PendingDeployment store incl. supersession (delete-then-insert) + TTL/purge; token validation (valid / expired / wrong / superseded, constant-time); IDeploymentConfigFetcher (200 / 401 / 404 / timeout); DeploymentService notify→promote flow; AskTimeout classification; standby fetch-on-replicate + older-write guard.
- Integration (the exact failing case): a >128 KB config deploys end-to-end; standby obtains it via fetch; kill the active node → failover recovers it from the standby's SQLite; redeploy a newer version → pending superseded, stale fetch 404s.
- Live smoke (docker): redeploy the 3rd-composition config that originally hung; confirm both nodes hold it.
11. Migration / deploy
- Single Host binary serves both roles → one rebuild + redeploy covers central and site. Message-contract changes (RefreshDeploymentCommand new; ReplicateConfigDeploy drops ConfigJson) are safe since all nodes upgrade together.
- New EF migration for PendingDeployment (auto-apply in dev; SQL script for prod). No site SQLite schema change.
- Add CentralFetchBaseUrl to central appsettings in docker/, docker-env2/, deploy/wonder-app-vd03/; confirm site→central HTTP reachability (fine on the co-located test host/docker; a firewall port to open in a hub-spoke prod). Update deploy/wonder-app-vd03/RUNBOOK.md.
11a. Live validation + post-implementation fixes (2026-06-26)
Smoke-tested on the docker cluster (rebuilt from this branch). Validated end-to-end:
- Migration applied; all 8 nodes healthy/ready.
- Deploy notify-and-fetch: central notify → active node fetch+apply → id-only replicate → standby fetch → Success in ~0.11 s (the exact path that previously hung 120 s).
- Startup reconciliation: every node self-heals on startup. A single-node gap heals; the concurrent both-nodes-empty race heals both.
The smoke surfaced two real bugs in the reconciliation path (missed by unit/integration tests because those didn't have a second concurrent node or a lingering expired row), both fixed:
- Concurrent-gap omit: when two nodes were concurrently missing the same instance, the second node's StagePendingIfAbsentAsync returned false and the handler omitted the item, leaving that node unhealed. Fix: on false, return the existing pending row's deploymentId + token (multi-use within TTL) so all concurrently-missing nodes heal in the same round.
- Expired pending row blocks self-heal: StagePendingIfAbsentAsync checked existence by InstanceId ignoring expiry, so an expired-but-unpurged row (the periodic purge is still a deferred TODO) blocked a fresh stage and would collide with the snapshot's reused DeploymentId on the unique index. Fix: expiry-aware staging. Delete expired rows for the instance first, then check only live rows; GetPendingDeploymentByInstanceIdAsync filters by expiry. This also opportunistically cleans expired rows, reducing reliance on the deferred periodic purge.
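Both fixes can be seen together in a small in-memory sketch. It is not the real StagePendingIfAbsentAsync (which the text describes as returning a bool); PendingStaging, StagedRow, and the tuple return shape are hypothetical and chosen only to make the two behaviours visible. The class is not thread-safe, where the real code would rely on a transaction.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical subset of the pending row relevant to staging.
public sealed record StagedRow(string DeploymentId, int InstanceId, string Token, DateTimeOffset ExpiresAtUtc);

public sealed class PendingStaging
{
    private readonly List<StagedRow> _rows = new();

    // Returns the row a reconciling node should fetch with, and whether this call created it.
    public (StagedRow Row, bool Staged) StageIfAbsent(StagedRow candidate, DateTimeOffset nowUtc)
    {
        // Fix 2: drop expired rows for this instance so they cannot block a fresh stage.
        _rows.RemoveAll(r => r.InstanceId == candidate.InstanceId && r.ExpiresAtUtc <= nowUtc);

        // Only live rows count as "already staged".
        var live = _rows.FirstOrDefault(r => r.InstanceId == candidate.InstanceId);

        // Fix 1: hand back the existing id + token so a second concurrent node can still heal.
        if (live is not null) return (live, false);

        _rows.Add(candidate);
        return (candidate, true);
    }
}
```

A second node staging the same instance within the TTL receives the first node's row instead of being skipped, and a stage attempted after expiry replaces the stale row rather than colliding with it.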
12. Affected files (for the plan)
- src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Deployment/: new PendingDeployment entity.
- src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IDeploymentManagerRepository.cs: pending CRUD + supersession + expected-set query.
- src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/: new RefreshDeploymentCommand; retire DeployInstanceCommand wire use.
- src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/: EF mapping, repository impl, migration.
- src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs: persist pending + supersede, send notify, promote on success, AskTimeout classification.
- src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs + Actors/{Central,Site}CommunicationActor.cs: route RefreshDeploymentCommand.
- src/ZB.MOM.WW.ScadaBridge.CentralUI (or management API project): /api/internal/deployments/{id}/config + /api/internal/sites/{id}/deployments endpoints.
- src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs: handle RefreshDeploymentCommand, fetch, replicate id-only.
- src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/SiteReplicationActor.cs + Messages/ReplicationMessages.cs: id-only replication, fetch on standby, older-write guard.
- src/ZB.MOM.WW.ScadaBridge.SiteRuntime/: IDeploymentConfigFetcher + HttpClient registration; site reconciliation (follow-up).
- appsettings (docker/, docker-env2/, deploy/wonder-app-vd03/) + RUNBOOK.md.
- Tests under tests/.