ScadaBridge/docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md
Joseph Doherty d9f5fbb664 docs(known-issues): mark deploy-config frame-size bug RESOLVED via notify-and-fetch
The 128 KB Akka frame-size deploy trap is fixed and merged. Record the
resolution at the top of the writeup (notify-and-fetch on both the
central->site deploy hop and the site active->standby replication hop;
AskTimeout classification fix; startup reconciliation + topology perf fix),
link the design + plan docs, and note the live-smoke validation. The
diagnosis is retained as historical record.
2026-06-26 23:09:24 -04:00

Deploy hangs (120 s) when the flattened instance config exceeds Akka's default frame size

Date: 2026-06-26 · Status: RESOLVED (2026-06-26) · Severity: High (silently un-deployable instances) · Area: Deployment Manager / Cluster Communication

Resolution (2026-06-26)

Fixed via the notify-and-fetch rework (the primary recommendation below), not the frame-size stopgap. The full flattened config no longer travels inside any Akka message: central stages it in a PendingDeployment row and sends a small RefreshDeploymentCommand; the site fetches the config over HTTP using a per-deployment token. Both frame-trapped Akka hops are fixed — the central→site deploy hop and the site active→standby SiteReplicationActor hop (which carried the same config across an intra-site Akka hop with the same silent-drop trap). The secondary AskTimeoutException mis-classification (below) is fixed, and a startup SiteReconciliationActor self-heal plus a topology-page performance fix shipped alongside.

The diagnosis below is retained as the historical record of how the bug was found and reasoned about.

Summary

Deploying an instance whose flattened configuration JSON is large enough that the DeployInstanceCommand Akka message exceeds Akka.Remote's default maximum-frame-size (128 KB / 128000b) causes the message to be silently dropped in transit between the central and site nodes. The central's deploy Ask then times out after exactly 120 s (Communication:DeploymentTimeout), the deployment is recorded as failed, and the instance cannot be (re)deployed at all until the config shrinks back under the limit.

It was discovered while adding a third composition of the same base template (a generic DelmiaReceiver added to CvdReactor, which already composes LeftDelmiaReceiver + RightDelmiaReceiver). Each composition fully expands the composed template's members into the flattened config, so the 3rd copy tipped the message over 128 KB.

How it manifests

  • CLI/UI deploy returns {"error":"Deployment error: Timeout after 120.00 seconds","code":"COMMAND_FAILED"} (or a 504 via the management API) after ~120 s.
  • deploy status shows the record going InProgress (1) → Failed (3) with errorMessage = "Deployment error: Timeout after 120.00 seconds".
  • The running instance is unaffected — deploys are atomic, so it keeps serving its last successful deployment.
  • Smaller configs deploy normally and fast (the working baseline here, the 2-receiver side-based config, completed in 0.04 s).

Root cause

The deploy ships the entire flattened config inline as a JSON string inside the Akka message, and Akka.Remote's frame size is left at the un-configured default:

  1. Full config carried in the message: DeploymentService.cs serializes the flattened config (JsonSerializer.Serialize(flattenedConfig)) and puts it in DeployInstanceCommand.FlattenedConfigurationJson (src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs).
  2. Sent central → site over Akka via ClusterClient (CentralCommunicationActor → ClusterClient.Send("/user/site-communication", …)).
  3. No frame-size override: AkkaHostedService.BuildHocon (src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs, ~lines 240-244) sets only dot-netty.tcp { hostname; port }. There is no maximum-frame-size and no appsettings knob for it, so the Akka.Remote 1.5.62 default of 128000b applies (and log-frame-size-exceeding defaults to off, so there isn't even a warning).
  4. Amplifier: no custom Akka serializer / serialization-bindings is configured, so the already-serialized configJson is JSON-escaped again inside the envelope by the default serializer (every " becomes \"), roughly doubling that portion of the on-wire payload. So the raw flattened JSON only needs to be ~60-70 KB to blow the 128 KB frame.
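
The amplifier in step 4 can be illustrated with a minimal sketch. It is written in Python for brevity (the project itself is C#), the config contents are hypothetical, and it models only the quote-escaping mechanism, not the actual Akka.NET wire format or the sizes measured in this incident. The growth factor depends on how quote-dense the real config is.

```python
import json

# Hypothetical flattened config: many short, path-qualified entries.
flattened_config = {
    f"Receiver{i}.Attribute{j}": {"type": "String", "value": ""}
    for i in range(3)
    for j in range(200)
}

# Step 1: the config is serialized once into a JSON string.
config_json = json.dumps(flattened_config, separators=(",", ":"))

# Step 2: that string becomes a field of an outer message which is itself
# serialized as JSON, so every '"' inside config_json is escaped to '\"'.
envelope = json.dumps(
    {"FlattenedConfigurationJson": config_json}, separators=(",", ":")
)

raw_size = len(config_json.encode("utf-8"))
wire_size = len(envelope.encode("utf-8"))
quotes = config_json.count('"')  # each quote costs one extra byte when escaped

print(f"raw config JSON : {raw_size} bytes")
print(f"inside envelope : {wire_size} bytes")
print(f"escaped quotes  : {quotes}")
print(f"growth factor   : {wire_size / raw_size:.2f}x")
```

The extra bytes are exactly one backslash per quote plus the small fixed envelope, which is why a config made mostly of short quoted keys and values inflates far more than one dominated by long string bodies.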

When the encoded frame exceeds the limit, the transport drops that one message without tearing down the association (heartbeats/telemetry keep flowing, so the site still reports healthy), the site never receives the deploy, and the central's Ask (timeout = DeploymentTimeout = 2 min) expires.

Why the 3rd same-base composition is the trigger

The flattener (FlatteningService) fully expands each composition's attributes/alarms/scripts, path-qualified — so N compositions of the same base ≈ N× those members in the config. The config grows roughly linearly with composition count; here 2 DelmiaReceiver copies sat under the limit, the 3rd pushed it over. Removing the composed scripts did not help, because the attribute set alone (×3) already exceeded the frame.
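
A minimal sketch of that linear growth, in Python for brevity. The `flatten` function, the 300-attribute base template, and the member shapes are all hypothetical stand-ins, not FlatteningService's real behavior; only the composition names come from this writeup.

```python
import json


def flatten(base_members: dict, compositions: list[str]) -> dict:
    """Expand every base member once per composition, path-qualified."""
    return {
        f"{composition}.{member}": definition
        for composition in compositions
        for member, definition in base_members.items()
    }


# Hypothetical base template with 300 attributes.
base = {f"Attr{i:03d}": {"type": "String", "default": ""} for i in range(300)}


def size_for(compositions: list[str]) -> int:
    """Serialized size in bytes of the flattened config."""
    return len(json.dumps(flatten(base, compositions)).encode("utf-8"))


two = size_for(["LeftDelmiaReceiver", "RightDelmiaReceiver"])
three = size_for(["LeftDelmiaReceiver", "RightDelmiaReceiver", "DelmiaReceiver"])

print(f"2 compositions: {two} bytes")
print(f"3 compositions: {three} bytes ({three / two:.2f}x)")
```

Because every composition repeats the full member set, the third copy adds roughly half again on top of the two-copy size, with no sharing between copies.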

Evidence (live logs, wonder-app-vd03, 2026-06-26)

Central log:

07:44:28.033 [INF] Sending deployment 82a39ef5… for instance Z28061Sim to site site-a
   … (no further deploy activity; only DB/health/telemetry) …
07:46:28.038 [ERR] Deployment 82a39ef5… for instance Z28061Sim failed
07:46:28.039 [ERR] Management command MgmtDeployInstanceCommand failed
07:46:28.039 [WRN] Dead letter: ManagementError from akka://…/temp/3d to deadLetters

Site log (same window): nothing — no deploy receipt, no error. The last deploy the site ever received was the smaller side-based config:

05:51:46.429 [INF] Deploying instance Z28061Sim, deploymentId=e531d216…   ← worked

No OversizedPayloadException, serialization error, or deserialization error appears on either node — consistent with a silent transport-level frame drop. The differentiator is purely config size (2-receiver config delivered & applied; 3-receiver config never arrived).

Secondary bug (found during the same investigation)

AskTimeoutException does not derive from System.TimeoutException, so the catch in DeploymentService.cs (~line 312) mis-classifies a genuine site Ask-timeout as a generic "Deployment error: …" instead of the "Communication failure: …" / timeout branch. Consequence: the DeploymentManager-006 "query-site-before-redeploy" reconciliation (keyed off the Communication failure: prefix) does not trigger after these failures. Fix the classification so AskTimeoutException/OperationCanceledException are treated as timeouts.
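
The mis-classification can be sketched with a language-neutral analog in Python. `AskTimeoutError` and `OperationCancelledError` are hypothetical stand-ins for the .NET exception types, and the two classifier functions are illustrative only, not the code in DeploymentService.cs.

```python
class AskTimeoutError(Exception):
    """Stand-in for AskTimeoutException: deliberately NOT a TimeoutError."""


class OperationCancelledError(Exception):
    """Stand-in for OperationCanceledException."""


def classify_before(exc: Exception) -> str:
    # Only the built-in timeout type is recognized, so an Ask timeout
    # falls through to the generic branch.
    if isinstance(exc, TimeoutError):
        return f"Communication failure: {exc}"
    return f"Deployment error: {exc}"


TIMEOUT_TYPES = (TimeoutError, AskTimeoutError, OperationCancelledError)


def classify_after(exc: Exception) -> str:
    # All timeout-like types take the communication-failure branch.
    if isinstance(exc, TIMEOUT_TYPES):
        return f"Communication failure: {exc}"
    return f"Deployment error: {exc}"


def triggers_site_query(message: str) -> bool:
    # The reconciliation is keyed off this prefix.
    return message.startswith("Communication failure:")


failure = AskTimeoutError("Timeout after 120.00 seconds")
print(classify_before(failure), "->", triggers_site_query(classify_before(failure)))
print(classify_after(failure), "->", triggers_site_query(classify_after(failure)))
```

The point is that the reconciliation depends on a message prefix, so a type check that misses one timeout class silently disables it.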

Recommended fix

Shipped: the Primary option below (notify-and-fetch) was implemented and merged on 2026-06-26; see the Resolution section at the top. The stopgap frame-size bump and the config-only workaround were NOT taken.

Stop shipping the full flattened config inside the Akka message. Instead:

  1. Central persists the pending flattened config (it already persists DeployedConfigSnapshots on success — extend to a pending/by-deploymentId store).
  2. Central sends the site a small "apply deployment <deploymentId> for instance <id> (revisionHash <hash>)" notification.
  3. The site pulls the config by deploymentId and applies it, then sends a completion/status response.

This decouples the Akka message size from the config size entirely (no instance is ever un-deployable due to size), and it fits the existing pull-based gRPC telemetry pattern the site already uses (PullSiteCalls, PullAuditEvents over SiteStreamService). A new PullDeploymentConfig/SiteStreamService RPC (or reuse of the management API) is the natural home. Keep the completion ack so the central still gets authoritative success/failure.
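
The three steps can be sketched as an in-memory model, in Python for brevity. All class, method, and field names here are hypothetical, the in-process `fetch` call stands in for whatever transport carries the pull, and the per-deployment token follows the Resolution section. It is a sketch of the pattern under those assumptions, not the shipped implementation.

```python
import hashlib
import json
import secrets
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ApplyDeployment:
    """Small notification: identifiers only, never the config itself."""
    deployment_id: str
    instance: str
    revision_hash: str
    fetch_token: str


class Central:
    def __init__(self) -> None:
        self._pending: dict[str, tuple[str, str]] = {}  # id -> (token, config JSON)
        self.results: dict[str, bool] = {}

    def stage(self, deployment_id: str, instance: str, config: dict) -> ApplyDeployment:
        """Step 1: persist the pending config; return the small notification."""
        config_json = json.dumps(config, sort_keys=True)
        token = secrets.token_urlsafe(16)
        self._pending[deployment_id] = (token, config_json)
        revision_hash = hashlib.sha256(config_json.encode("utf-8")).hexdigest()
        return ApplyDeployment(deployment_id, instance, revision_hash, token)

    def fetch(self, deployment_id: str, token: str) -> str:
        """Step 3a: the site pulls the config by deployment id and token."""
        expected, config_json = self._pending[deployment_id]
        if not secrets.compare_digest(expected, token):
            raise PermissionError("invalid fetch token")
        return config_json

    def complete(self, deployment_id: str, success: bool) -> None:
        """Step 3b: completion ack; the pending entry is no longer needed."""
        self._pending.pop(deployment_id, None)
        self.results[deployment_id] = success


class Site:
    def __init__(self, central: Central) -> None:
        self._central = central
        self.applied: dict[str, dict] = {}

    def handle(self, note: ApplyDeployment) -> bool:
        """Step 2 arrives here: fetch, verify the hash, apply, then ack."""
        config_json = self._central.fetch(note.deployment_id, note.fetch_token)
        digest = hashlib.sha256(config_json.encode("utf-8")).hexdigest()
        ok = digest == note.revision_hash
        if ok:
            self.applied[note.instance] = json.loads(config_json)
        self._central.complete(note.deployment_id, ok)
        return ok


central = Central()
site = Site(central)
large_config = {f"Receiver.Attr{i}": "x" * 100 for i in range(5000)}
note = central.stage("dep-1", "Z28061Sim", large_config)
print("notification bytes:", len(json.dumps(asdict(note))))
print("applied:", site.handle(note))
```

The notification stays a few hundred bytes regardless of how large the staged config is, which is the property that removes the frame-size trap.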

Stopgap (until the above ships)

Add an explicit frame size + matching buffers to the dot-netty.tcp block in AkkaHostedService.BuildHocon, e.g.:

remote {
  log-frame-size-exceeding = 1000000b  # remote-level setting (not a transport setting): log payloads above ~1 MB
  dot-netty.tcp {
    hostname = …
    port = …
    maximum-frame-size = 4000000b      # 4 MB
    send-buffer-size = 8000000b
    receive-buffer-size = 8000000b
  }
}

This is a code change (no appsettings override exists) and must be deployed to both central and site (they share the Host); rebuild + redeploy per deploy/wonder-app-vd03/RUNBOOK.md (Upgrading). Consider exposing it via appsettings/CommunicationOptions so it's tunable without a rebuild. Also fix the AskTimeoutException classification.

Config-only workaround (specific to this case)

For the generic-DelmiaReceiver rollout specifically: removing the Left/Right DelmiaReceiver compositions and keeping only the single generic DelmiaReceiver makes the config smaller than the working 2-receiver baseline, so it deploys under the current limit — and it's the intended "no sides" end state anyway (each machine bound to its own generic Galaxy receiver, DelmiaReceiver_037/038/039). This unblocks without an engine change, but does not fix the underlying fragility for any other instance that legitimately needs a large config.

Current operational state (to clean up)

Engine fix resolves the blocker. With notify-and-fetch merged, a large flattened config is no longer size-capped, so the generic-DelmiaReceiver rollout (and the Download handshake re-add) can proceed on Z28061Sim without hitting this bug. The items below are the operational re-rollout steps, now unblocked — they are NOT engine work.

  • Z28061Sim is running on deployment id 55 (side-based, with the earlier side-crash fix) and is healthy, but is not currently redeployable: CvdReactor has the extra generic DelmiaRecv → DelmiaReceiver_037 composition + binding, so every redeploy hits this bug.
  • To restore redeployability, either apply the config-only workaround above (consolidate to one generic receiver) or roll back the generic composition + restore the side-based ProcessRecipeDownload.
  • The Download handshake method (mirroring MESReceiver.MoveIn) is currently removed from the DelmiaReceiver base template (template 12); re-add it once a deployable path is chosen.

Files

  • src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs (BuildHocon — missing maximum-frame-size)
  • src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs (inline config serialize; AskTimeout mis-classification)
  • src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs (FlattenedConfigurationJson)
  • src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs (ClusterClient send)
  • src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationOptions.cs (DeploymentTimeout = 2 min)
  • src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs (per-composition member expansion)