Joseph Doherty f48a748f37 docs(deploy): record T18/T19 plan refinement + live-smoke fixes + task state

2026-06-26 17:35:07 -04:00

Deploy + Replication Config Transport — Notify-and-Fetch Design

Date: 2026-06-26 · Status: APPROVED (design) · Area: Deployment Manager / Site Runtime / Cluster Communication

Fixes: docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md

1. Problem

The flattened instance config travels over Akka on two hops today, and both are bounded by Akka.Remote's default maximum-frame-size (128 KB / 128000b). A large config (e.g. a 3rd composition of the same base template) silently breaks both:

Central → site (deploy). DeploymentService serializes the whole flattened config into DeployInstanceCommand.FlattenedConfigurationJson and Asks the site over ClusterClient. An oversized frame is silently dropped; the central Ask times out after 120 s (Communication:DeploymentTimeout); the instance becomes un-deployable until the config shrinks.
Site active → standby (replication). After a successful apply, DeploymentManagerActor does _replicationActor.Tell(new ReplicateConfigDeploy(instanceName, command.FlattenedConfigurationJson, …)); SiteReplicationActor relays it node-to-node via ActorSelection(RootActorPath(peer)/"user"/"site-replication").Tell(…). This is fire-and-forget — an oversized ApplyConfigDeploy is silently dropped with no timeout and no error. The deploy "succeeds," the active node runs fine, and the standby silently lacks the instance — surfacing only at failover, when the now-active node recovers from its SQLite and the instance simply isn't there.

The JSON is double-escaped inside the Akka envelope (default serializer), so ~60–70 KB of raw config is enough to exceed the 128 KB frame. log-frame-size-exceeding defaults to off, so there isn't even a warning.
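To get a feel for how re-escaping inflates a payload, the following is a rough estimator, not a model of Akka's actual wire format. It uses System.Text.Json as a stand-in for the remoting serializer, and `EnvelopeSizeEstimate`, `EscapedBytes`, and `nestingLevels` are hypothetical names introduced here. Each nesting level re-serializes the JSON text as a JSON string, which adds a backslash in front of every quote and backslash.

```csharp
using System.Text;
using System.Text.Encodings.Web;
using System.Text.Json;

public static class EnvelopeSizeEstimate
{
    // Default Akka.Remote maximum-frame-size quoted in the text (128000 bytes).
    public const int DefaultMaxFrameBytes = 128_000;

    // Relaxed escaping writes a quote as \" (2 bytes) instead of \u0022 (6 bytes),
    // so the estimate errs on the small side.
    private static readonly JsonSerializerOptions Options = new()
    {
        Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping
    };

    // Size in UTF-8 bytes after embedding the JSON text as a string
    // nestingLevels times (0 = raw text, 2 = "double-escaped").
    public static int EscapedBytes(string json, int nestingLevels)
    {
        var current = json;
        for (var i = 0; i < nestingLevels; i++)
        {
            current = JsonSerializer.Serialize(current, Options);
        }
        return Encoding.UTF8.GetByteCount(current);
    }

    public static bool ExceedsDefaultFrame(string json, int nestingLevels) =>
        EscapedBytes(json, nestingLevels) > DefaultMaxFrameBytes;
}
```

Running `EscapedBytes` over a real flattened config at levels 0, 1, and 2 shows how close a given instance is to the limit; the growth depends on how quote-dense the config is.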

2. Goal & guiding principle

Eliminate the frame-size failure on both hops, permanently and at any config size, without raising maximum-frame-size.

Principle: small control messages stay on Akka; the bulk config never touches Akka. The config always moves over HTTP from central (the single source of truth) to whichever node needs it — the active node (which applies + runs it) and the standby node (which persists it for future recovery).

This mirrors how the site already works at startup: a site recovers its instances purely from local SQLite (deployed_configurations), never contacting central. A deploy thus becomes an on-demand, single-instance version of startup recovery: "your definition changed — fetch it and refresh." Only the config source changes; the apply path is unchanged.

Why notify-and-fetch over alternatives

Push over gRPC (central→site): would need host-aware routing (the DeploymentManager singleton runs only on the active node; the gRPC server runs on both), and still leaves the intra-site replication hop broken.
Raise Akka maximum-frame-size: only raises the ceiling, doesn't decouple size from transport, and changes remoting config globally.
Sites read central MS SQL directly: rejected — it tears down the enforced hub-spoke boundary (sites = local SQLite, central = MS SQL, enforced in DI and StartupValidator), gives every site node a login that can read central's entire config DB (huge blast radius), and couples sites to central's EF schema.

A small RefreshDeploymentCommand routes to the active-node singleton for free via the existing ClusterSingletonProxy (no host detection, no frame risk), and the same small-message pattern fixes replication. The only genuinely new piece is the fetch channel, implemented as an authenticated GET against central's existing HTTP management API.

3. Architecture overview

                         central MS SQL
                       ┌──────────────────┐
                       │ PendingDeployment │  (configJson, token, TTL, by deploymentId; ≤1 per instance)
                       └──────────────────┘
                          ▲            │ read (token-gated)
            persist before notify      │
                          │            ▼
   DeploymentService ──Akka small notify──▶ site (active node, via singleton proxy)
   RefreshDeploymentCommand(id, name, hash,        │
     deployedBy, ts, centralFetchBaseUrl, token)   │ 1. HTTP GET config (token)  ◀── central mgmt API
                          ▲                         │ 2. apply (existing path): Instance Actor + SQLite
                          │ DeploymentStatusResponse│ 3. reply success/failure
                          └─────────────────────────┘
                                                    │ 4. tell standby (small, id-only)
                                                    ▼
                                       SiteReplicationActor (standby)
                                         HTTP GET config (token) ──▶ central mgmt API
                                         write deployed_configurations SQLite (no Instance Actor)

Every Akka hop now carries only small control messages. The bulk config moves only over HTTP, point-to-point, to the node that needs it.

4. Component changes

4.1 Central — pending-config store

New table PendingDeployment in central MS SQL (EF entity + migration):

| Column | Type | Notes |
| --- | --- | --- |
| `DeploymentId` | `varchar` PK | matches the deploy's GUID (`N` format) |
| `InstanceId` | `int`, indexed | for supersession lookup |
| `RevisionHash` | `varchar` | staleness/version marker |
| `ConfigJson` | `nvarchar(max)` | the flattened config |
| `Token` | `varchar` | random per-deployment fetch secret |
| `CreatedAtUtc` | `datetimeoffset` | |
| `ExpiresAtUtc` | `datetimeoffset`, indexed | TTL (default 5 min, configurable) |

Kept separate from DeployedConfigSnapshot so the existing "deployed = success" semantics stay clean. On success the pending row's config is promoted into DeployedConfigSnapshot (existing logic); on failure/timeout/TTL it's purged. A bounded periodic purge removes expired rows.

Supersession (requested): when persisting a new pending row for an instance, first delete any existing pending rows for that same InstanceId (delete-then-insert, one transaction) → at most one pending row per instance, always the latest. Race-free because the existing per-instance operation lock already serializes same-instance deploys.
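The table and the delete-then-insert rule can be sketched as below. The record's property names come from the table; `InMemoryPendingStore`, `Supersede`, `Find`, and `PurgeExpired` are hypothetical names, and the lock only stands in for the single database transaction. A real implementation would be an EF entity and repository against MS SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Shape taken from the PendingDeployment column list.
public sealed record PendingDeployment(
    string DeploymentId,
    int InstanceId,
    string RevisionHash,
    string ConfigJson,
    string Token,
    DateTimeOffset CreatedAtUtc,
    DateTimeOffset ExpiresAtUtc);

// Hypothetical in-memory stand-in for the central table.
public sealed class InMemoryPendingStore
{
    private readonly object _gate = new();
    private readonly Dictionary<string, PendingDeployment> _rows = new();

    // Delete-then-insert as one atomic step: at most one row per InstanceId.
    public void Supersede(PendingDeployment incoming)
    {
        lock (_gate)
        {
            var stale = _rows.Values
                .Where(r => r.InstanceId == incoming.InstanceId)
                .Select(r => r.DeploymentId)
                .ToList();
            foreach (var id in stale)
            {
                _rows.Remove(id);
            }
            _rows[incoming.DeploymentId] = incoming;
        }
    }

    // Returns null for unknown or superseded ids.
    public PendingDeployment? Find(string deploymentId)
    {
        lock (_gate)
        {
            return _rows.TryGetValue(deploymentId, out var row) ? row : null;
        }
    }

    // Bounded periodic purge: removes rows whose TTL has passed.
    public int PurgeExpired(DateTimeOffset nowUtc)
    {
        lock (_gate)
        {
            var expired = _rows.Values
                .Where(r => r.ExpiresAtUtc <= nowUtc)
                .Select(r => r.DeploymentId)
                .ToList();
            foreach (var id in expired)
            {
                _rows.Remove(id);
            }
            return expired.Count;
        }
    }
}
```

Because a superseded row is physically gone, a later lookup by its `DeploymentId` finds nothing, which is what lets the serve endpoint answer 404 for stale fetches.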

4.2 Central — serve endpoint

GET /api/internal/deployments/{deploymentId}/config on the existing management API:

[AllowAnonymous] + explicit token validation (constant-time compare against the pending row's Token), TTL-checked. Machine-to-machine; never user-session-gated.
Returns the row's ConfigJson (200) — or 404 when the deploymentId is unknown, expired, or superseded (no longer the current pending row for its instance). The 404-on-superseded is what makes stale fetches fail closed (see §5).
Both central nodes read the same MS SQL, so the request works through Traefik's active-node LB.
HTTPS strongly recommended in transit (config carries already-encrypted SecretsBlocks; the token is single-deployment + short-TTL).
Companion endpoint for the follow-up (§6): GET /api/internal/sites/{siteId}/deployments → [{instanceUniqueName, revisionHash}].
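The endpoint's accept/reject decision might look like the sketch below, written as a pure function so it can be tested without a web host. `ConfigServeGate`, `PendingTokenRow`, and `FetchOutcome` are hypothetical names. Answering 401 for a wrong token is an assumption: the text only fixes 200 and 404, though the testing section lists 401 among the fetcher's cases.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public enum FetchOutcome { Ok200, Unauthorized401, NotFound404 }

// Only the two fields the gate needs from the pending row.
public sealed record PendingTokenRow(string Token, DateTimeOffset ExpiresAtUtc);

public static class ConfigServeGate
{
    // row is null when the deploymentId is unknown or was superseded (row deleted).
    public static FetchOutcome Evaluate(
        PendingTokenRow? row, string presentedToken, DateTimeOffset nowUtc)
    {
        if (row is null || row.ExpiresAtUtc <= nowUtc)
        {
            return FetchOutcome.NotFound404; // fail closed
        }

        // Constant-time comparison of the token bytes.
        var stored = Encoding.UTF8.GetBytes(row.Token);
        var presented = Encoding.UTF8.GetBytes(presentedToken);
        return CryptographicOperations.FixedTimeEquals(stored, presented)
            ? FetchOutcome.Ok200
            : FetchOutcome.Unauthorized401; // assumption, see lead-in
    }
}
```

`FixedTimeEquals` returns false immediately when the lengths differ, so this hides the token's content but not its length; fixed-length tokens avoid that leak.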

4.3 Central — send small notify

New message:

public record RefreshDeploymentCommand(
    string DeploymentId,
    string InstanceUniqueName,
    string RevisionHash,
    string DeployedBy,
    DateTimeOffset Timestamp,
    string CentralFetchBaseUrl,   // central's LB/Traefik URL — carried so the site needs no new standing config
    string FetchToken);

DeploymentService flow: flatten + hash (unchanged) → persist PendingDeployment (config + token + TTL, superseding prior) → send RefreshDeploymentCommand over the existing ClusterClient path (replacing the fat DeployInstanceCommand) → await DeploymentStatusResponse (reply path unchanged) → on success promote pending→snapshot. The Ask/DeploymentTimeout stay, but now carry a tiny request/reply — no frame risk. CentralFetchBaseUrl is a new central option (e.g. Communication:CentralFetchBaseUrl).
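A minimal sketch of how the notify could be assembled, assuming a 32-byte random token rendered as hex (the document leaves token length as an option). `DeployNotifyFactory`, `NewFetchToken`, and `Create` are hypothetical names; the record is repeated only so the block compiles on its own.

```csharp
using System;
using System.Security.Cryptography;

public record RefreshDeploymentCommand(
    string DeploymentId,
    string InstanceUniqueName,
    string RevisionHash,
    string DeployedBy,
    DateTimeOffset Timestamp,
    string CentralFetchBaseUrl,
    string FetchToken);

public static class DeployNotifyFactory
{
    // Cryptographically random per-deployment secret, hex-encoded.
    public static string NewFetchToken(int byteLength = 32) =>
        Convert.ToHexString(RandomNumberGenerator.GetBytes(byteLength));

    // Builds the small control message; the config itself is never included.
    public static RefreshDeploymentCommand Create(
        string instanceUniqueName,
        string revisionHash,
        string deployedBy,
        string centralFetchBaseUrl,
        DateTimeOffset nowUtc) =>
        new(
            DeploymentId: Guid.NewGuid().ToString("N"), // GUID in N format
            InstanceUniqueName: instanceUniqueName,
            RevisionHash: revisionHash,
            DeployedBy: deployedBy,
            Timestamp: nowUtc,
            CentralFetchBaseUrl: centralFetchBaseUrl,
            FetchToken: NewFetchToken());
}
```

The same `DeploymentId` and `FetchToken` would be written to the `PendingDeployment` row before the command is sent, so the site's fetch can be matched and validated.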

4.4 Site — deploy (active node)

SiteCommunicationActor routes RefreshDeploymentCommand to the deployment-manager-proxy (replacing the DeployInstanceCommand forward). The proxy delivers to the singleton on the active node automatically — small message, no host detection.
DeploymentManagerActor handles RefreshDeploymentCommand: fetch the config via an injected IDeploymentConfigFetcher (HttpClient wrapper, registered for the site role), then run the existing ApplyDeployment verbatim (create/replace Instance Actor, write deployed_configurations, reply DeploymentStatusResponse). Only the config source changes.
Fetch failures reply Failed with a message prefixed "Communication failure:" (transient/communication class), so central's DeploymentManager-006 query-site reconciliation fires; apply failures stay generic deployment errors.
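One possible shape for the fetcher is sketched below. The interface name is from the text; its method signature, the `ConfigFetchException` type, and the `X-Deployment-Fetch-Token` header are assumptions, since the document does not say how the token is transmitted.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public interface IDeploymentConfigFetcher
{
    Task<string> FetchAsync(
        string centralFetchBaseUrl, string deploymentId, string fetchToken,
        CancellationToken ct);
}

// Hypothetical exception whose message carries the "Communication failure:" prefix.
public sealed class ConfigFetchException : Exception
{
    public ConfigFetchException(string detail, Exception? inner = null)
        : base("Communication failure: " + detail, inner) { }
}

public sealed class HttpDeploymentConfigFetcher : IDeploymentConfigFetcher
{
    private readonly HttpClient _http;

    public HttpDeploymentConfigFetcher(HttpClient http) => _http = http;

    // Joins the base URL with the config endpoint path from section 4.2.
    public static Uri BuildConfigUri(string centralFetchBaseUrl, string deploymentId) =>
        new Uri(
            new Uri(centralFetchBaseUrl.TrimEnd('/') + "/"),
            $"api/internal/deployments/{Uri.EscapeDataString(deploymentId)}/config");

    public async Task<string> FetchAsync(
        string centralFetchBaseUrl, string deploymentId, string fetchToken,
        CancellationToken ct)
    {
        using var request = new HttpRequestMessage(
            HttpMethod.Get, BuildConfigUri(centralFetchBaseUrl, deploymentId));
        request.Headers.Add("X-Deployment-Fetch-Token", fetchToken); // assumed header

        try
        {
            using var response = await _http.SendAsync(request, ct);
            if (response.StatusCode != HttpStatusCode.OK)
            {
                throw new ConfigFetchException(
                    $"config fetch returned HTTP {(int)response.StatusCode}");
            }
            return await response.Content.ReadAsStringAsync(ct);
        }
        catch (HttpRequestException ex)
        {
            throw new ConfigFetchException(ex.Message, ex);
        }
        catch (TaskCanceledException ex) when (!ct.IsCancellationRequested)
        {
            // HttpClient signals its own timeout this way.
            throw new ConfigFetchException("config fetch timed out", ex);
        }
    }
}
```

The actor would catch `ConfigFetchException` and put its message into the Failed reply, while exceptions from the apply path stay on the generic deployment-error branch.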

4.5 Site — replication via fetch (active → standby), unified

ReplicateConfigDeploy / ApplyConfigDeploy drop ConfigJson and gain (DeploymentId, RevisionHash, IsEnabled, CentralFetchBaseUrl, FetchToken). After applying, the active node sends the id-only message to the standby.
The standby's SiteReplicationActor fetches the config via the same IDeploymentConfigFetcher, then writes deployed_configurations via SiteStorageService.StoreDeployedConfigAsync — no Instance Actor (it's standby). Best-effort, but the fetch gets a small bounded retry (cheap/async).
Defensive write guard: the standby only overwrites a deployed_configurations row when the incoming RevisionHash/DeployedAt is newer, so a delayed older replicate can't clobber a newer config.
A stale replicate (superseded deploymentId) fetches → 404 → standby skips it; the newer deploy's replicate-notify delivers the correct config (§5).
These records are intra-site (same Host binary on both nodes) → no cross-version concern.
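The defensive write guard could be as small as the sketch below. A hash has no natural ordering, so this version assumes `DeployedAt` is the ordering key and carries `RevisionHash` only as the version label; that choice, and the names `StoredConfig` and `StandbyWriteGuard`, are assumptions rather than the documented design.

```csharp
using System;

// The fields of a deployed_configurations row that the guard inspects.
public sealed record StoredConfig(
    string InstanceUniqueName, string RevisionHash, DateTimeOffset DeployedAt);

public static class StandbyWriteGuard
{
    // True when the standby may write the incoming config.
    public static bool ShouldOverwrite(StoredConfig? existing, StoredConfig incoming)
    {
        if (existing is null)
        {
            return true; // nothing stored yet for this instance
        }

        // Strictly newer only: a delayed older replicate is ignored.
        return incoming.DeployedAt > existing.DeployedAt;
    }
}
```

With this rule, replicates that arrive out of order leave the row at the newest `DeployedAt` regardless of arrival order.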

4.6 Secondary bug — timeout classification

DeploymentService.cs (~line 312): add AskTimeoutException to the timeout branch (it does not derive from System.TimeoutException), alongside OperationCanceledException, so a no-reply deploy is classified as a timeout and triggers the query-site reconciliation.
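The classification change amounts to widening one type test. In the sketch below, `AskTimeoutException` is a local stand-in declared only so the block compiles without the Akka package; the real code would reference Akka's type. `DeployFailureClassifier` and `IsTimeout` are hypothetical names.

```csharp
using System;

// Stand-in for Akka's AskTimeoutException, which per the text does not
// derive from System.TimeoutException.
public sealed class AskTimeoutException : Exception
{
    public AskTimeoutException(string message) : base(message) { }
}

public static class DeployFailureClassifier
{
    // True for every exception that means "the site never replied in time".
    public static bool IsTimeout(Exception ex) =>
        ex is TimeoutException
            or OperationCanceledException
            or AskTimeoutException;
}
```

`TaskCanceledException` derives from `OperationCanceledException`, so it is already covered by the existing branch.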

4.7 Retire the fat message

DeployInstanceCommand.FlattenedConfigurationJson is no longer sent cross-cluster. Keep the record only if convenient as an internal apply DTO; otherwise remove. After this change, no Akka message carries config.

5. Supersession & ordering semantics

At central: delete-then-insert under the per-instance lock ⇒ ≤1 pending row per instance, always the newest deploy.
Fetch fails closed: the serve endpoint returns 404 for any deploymentId that isn't the current pending row (unknown / expired / superseded / already promoted). So an older in-flight deploy notify or an older replicate-notify can never fetch a lower-version config.
Standby last-write-wins + guard: deployed_configurations upserts by instance_unique_name; the standby additionally refuses to overwrite with an older RevisionHash/DeployedAt. Net: the standby converges on the latest config, never a stale one.

6. Follow-up (in the plan, separate & lower-priority): standby/startup reconciliation

Replication is best-effort with no retry and no startup reconciliation today — a standby that is down during a deploy permanently misses that instance until its next deploy (a pre-existing gap, independent of frame size). To make replication self-healing:

On node start / peer rejoin, the node reconciles its deployed_configurations against central's expected set for this site (via GET /api/internal/sites/{siteId}/deployments → {instanceUniqueName, revisionHash} pairs), fetching missing/stale configs (reusing the §4.2 config endpoint) and dropping orphans central no longer has.
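The reconciliation step reduces to a set difference between what the node holds and what central expects. A minimal sketch, assuming both sides are available as maps from instance unique name to revision hash; `SiteReconciler` and `ReconcilePlan` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ReconcilePlan(
    IReadOnlyList<string> ToFetch, IReadOnlyList<string> ToDrop);

public static class SiteReconciler
{
    // local:    instanceUniqueName -> revisionHash from deployed_configurations
    // expected: instanceUniqueName -> revisionHash from central's expected set
    public static ReconcilePlan Plan(
        IReadOnlyDictionary<string, string> local,
        IReadOnlyDictionary<string, string> expected)
    {
        // Missing locally, or present with a different hash (stale).
        var toFetch = expected
            .Where(e => !local.TryGetValue(e.Key, out var hash) || hash != e.Value)
            .Select(e => e.Key)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();

        // Held locally but no longer known to central (orphans).
        var toDrop = local.Keys
            .Where(name => !expected.ContainsKey(name))
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();

        return new ReconcilePlan(toFetch, toDrop);
    }
}
```

Keeping the diff pure makes it easy to unit-test separately from the HTTP calls and SQLite writes that act on the plan.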

Marked as a distinct task, sequenced after the core frame-size fix.

7. Configuration & options

Central: Communication:CentralFetchBaseUrl (LB/Traefik URL); Deployment:PendingDeploymentTtl (default 5 min); token length; purge interval.
Site: HttpClient registration for the site role; fetch timeout; replication fetch retry count.
Akka maximum-frame-size: intentionally unchanged — the fix doesn't rely on it.
gRPC message size: not involved (HTTP carries the config).

8. Failover & edge cases

Site failover: the notify routes to the new active node automatically via the singleton proxy.
Active node dies mid-fetch: central's Ask times out → reconciliation.
Central node failover during a fetch: Traefik routes to the active central node; the pending row is in shared MS SQL → still served.
Standby down during deploy: misses the replicate-notify → covered by the §6 follow-up.
Token TTL must comfortably cover both nodes' fetches within the deploy window (default 5 min is ample).

9. Security

Token-gated internal endpoint; constant-time compare; short TTL; scoped to exactly one deployment's config; not tied to a user session; HTTPS recommended. Config SecretsBlocks are already encrypted at rest.

10. Testing

Unit: PendingDeployment store incl. supersession (delete-then-insert) + TTL/purge; token validation (valid / expired / wrong / superseded, constant-time); IDeploymentConfigFetcher (200 / 401 / 404 / timeout); DeploymentService notify→promote flow; AskTimeout classification; standby fetch-on-replicate + older-write guard.
Integration: the exact failing case — a >128 KB config deploys end-to-end; standby obtains it via fetch; kill the active node → failover recovers it from the standby's SQLite; redeploy a newer version → pending superseded, stale fetch 404s.
Live smoke (docker): redeploy the 3rd-composition config that originally hung; confirm both nodes hold it.

11. Migration / deploy

Single Host binary serves both roles → one rebuild + redeploy covers central and site. Message-contract changes (RefreshDeploymentCommand new; ReplicateConfigDeploy drops ConfigJson) are safe since all nodes upgrade together.
New EF migration for PendingDeployment (auto-apply in dev; SQL script for prod). No site SQLite schema change.
Add CentralFetchBaseUrl to central appsettings in docker/, docker-env2/, deploy/wonder-app-vd03/; confirm site→central HTTP reachability (fine on the co-located test host/docker; in a hub-spoke production deployment a firewall port must be opened). Update deploy/wonder-app-vd03/RUNBOOK.md.

11a. Live validation + post-implementation fixes (2026-06-26)

Smoke-tested on the docker cluster (rebuilt from this branch). Validated end-to-end:

Migration applied; all 8 nodes healthy/ready.
Deploy notify-and-fetch: central notify → active node fetch+apply → id-only replicate → standby fetch → Success in ~0.11 s (the exact path that previously hung 120 s).
Startup reconciliation: every node self-heals on startup — single-node gap heals; the concurrent both-nodes-empty race heals both.

The smoke surfaced two real bugs in the reconciliation path (missed by unit/integration tests because those didn't have a second concurrent node or a lingering expired row), both fixed:

Concurrent-gap omit — when two nodes were concurrently missing the same instance, the second node's StagePendingIfAbsentAsync returned false and the handler omitted the item, leaving that node unhealed. Fix: on false, return the existing pending row's deploymentId + token (multi-use within TTL) so all concurrently-missing nodes heal in the same round.
Expired pending row blocks self-heal — StagePendingIfAbsentAsync checked existence by InstanceId ignoring expiry, so an expired-but-unpurged row (the periodic purge is still a deferred TODO) blocked a fresh stage and would collide with the snapshot's reused DeploymentId on the unique index. Fix: expiry-aware staging — delete expired rows for the instance first, then check only live rows; GetPendingDeploymentByInstanceIdAsync filters by expiry. This also opportunistically cleans expired rows, reducing reliance on the deferred periodic purge.
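Both fixes can be pictured together in an in-memory sketch. The real `StagePendingIfAbsentAsync` is asynchronous and database-backed; the synchronous signature, the `StageResult` shape, and the `PendingStaging` class below are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record PendingRow(
    string DeploymentId, int InstanceId, string Token, DateTimeOffset ExpiresAtUtc);

// Staged = false means a live row already existed and its id/token are returned.
public sealed record StageResult(bool Staged, string DeploymentId, string Token);

public sealed class PendingStaging
{
    private readonly object _gate = new();
    private readonly List<PendingRow> _rows = new();

    public StageResult StagePendingIfAbsent(PendingRow candidate, DateTimeOffset nowUtc)
    {
        lock (_gate)
        {
            // Fix 2: drop expired rows for this instance before checking existence.
            _rows.RemoveAll(r =>
                r.InstanceId == candidate.InstanceId && r.ExpiresAtUtc <= nowUtc);

            var live = _rows.FirstOrDefault(r => r.InstanceId == candidate.InstanceId);
            if (live is not null)
            {
                // Fix 1: hand back the existing row so a second node can also fetch.
                return new StageResult(false, live.DeploymentId, live.Token);
            }

            _rows.Add(candidate);
            return new StageResult(true, candidate.DeploymentId, candidate.Token);
        }
    }
}
```

Two nodes that stage the same instance within the TTL end up with the same `DeploymentId` and token, so both can heal in one round.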

12. Affected files (for the plan)

src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Deployment/ — new PendingDeployment entity.
src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IDeploymentManagerRepository.cs — pending CRUD + supersession + expected-set query.
src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/ — new RefreshDeploymentCommand; retire DeployInstanceCommand wire use.
src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/ — EF mapping, repository impl, migration.
src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs — persist pending + supersede, send notify, promote on success, AskTimeout classification.
src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs + Actors/{Central,Site}CommunicationActor.cs — route RefreshDeploymentCommand.
src/ZB.MOM.WW.ScadaBridge.CentralUI (or management API project) — /api/internal/deployments/{id}/config + /api/internal/sites/{id}/deployments endpoints.
src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs — handle RefreshDeploymentCommand, fetch, replicate id-only.
src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/SiteReplicationActor.cs + Messages/ReplicationMessages.cs — id-only replication, fetch on standby, older-write guard.
src/ZB.MOM.WW.ScadaBridge.SiteRuntime/ — IDeploymentConfigFetcher + HttpClient registration; site reconciliation (follow-up).
appsettings (docker/, docker-env2/, deploy/wonder-app-vd03/) + RUNBOOK.md.
Tests under tests/.
