# Deploy hangs (120 s) when the flattened instance config exceeds Akka's default frame size

**Date:** 2026-06-26 · **Status:** RESOLVED (2026-06-26) · **Severity:** High (silently un-deployable instances) · **Area:** Deployment Manager / Cluster Communication

## Resolution (2026-06-26)

Fixed via the **notify-and-fetch** rework (the primary recommendation below), not the frame-size stopgap. The full flattened config no longer travels inside any Akka message: central stages it in a `PendingDeployment` row and sends a small `RefreshDeploymentCommand`; the site fetches the config over HTTP using a per-deployment token. Both frame-trapped Akka hops are fixed: the central→site deploy hop **and** the site active→standby `SiteReplicationActor` hop (which carried the same config across an intra-site Akka hop with the same silent-drop trap). The secondary `AskTimeoutException` mis-classification (below) is fixed, and a startup `SiteReconciliationActor` self-heal plus a topology-page performance fix shipped alongside.

- **Design:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch-design.md`](../plans/2026-06-26-deploy-config-notify-and-fetch-design.md)
- **Plan:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch.md`](../plans/2026-06-26-deploy-config-notify-and-fetch.md)
- **Validated:** live docker-cluster smoke test: a previously-hanging deploy now completes in ~0.11 s; reconciliation heals single-node and concurrent-both-missing gaps.

The diagnosis below is retained as the historical record of how the bug was found and reasoned about.

## Summary

Deploying an instance whose **flattened configuration JSON is large enough** that the `DeployInstanceCommand` Akka message exceeds Akka.Remote's default `maximum-frame-size` (**128 KB / `128000b`**) causes the message to be **silently dropped in transit** between the central and site nodes.
The central's deploy `Ask` then times out after exactly **120 s** (`Communication:DeploymentTimeout`), the deployment is recorded as failed, and **the instance cannot be (re)deployed at all** until the config shrinks back under the limit.

It was discovered while adding a **third composition of the same base template** (a generic `DelmiaReceiver` added to `CvdReactor`, which already composes `LeftDelmiaReceiver` + `RightDelmiaReceiver`). Each composition fully expands the composed template's members into the flattened config, so the 3rd copy tipped the message over 128 KB.

## How it manifests

- CLI/UI deploy returns `{"error":"Deployment error: Timeout after 120.00 seconds","code":"COMMAND_FAILED"}` (or a 504 via the management API) after ~120 s.
- `deploy status` shows the record going `InProgress (1) → Failed (3)` with `errorMessage = "Deployment error: Timeout after 120.00 seconds"`.
- The **running instance is unaffected**: deploys are atomic, so it keeps serving its last successful deployment.
- Smaller configs deploy normally and fast (the working baseline here, the 2-receiver side-based config, completed in **0.04 s**).

## Root cause

The deploy ships the **entire flattened config inline** as a JSON string inside the Akka message, and Akka.Remote's frame size is left at the un-configured default:

1. **Full config carried in the message**: `DeploymentService.cs` serializes the flattened config (`JsonSerializer.Serialize(flattenedConfig)`) and puts it in `DeployInstanceCommand.FlattenedConfigurationJson` (`src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs`).
2. **Sent central → site over Akka**: via ClusterClient (`CentralCommunicationActor` → `ClusterClient.Send("/user/site-communication", …)`).
3. **No frame-size override**: `AkkaHostedService.BuildHocon` (`src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs`, ~lines 240-244) sets only `dot-netty.tcp { hostname; port }`.
   There is **no `maximum-frame-size`** and **no appsettings knob** for it, so the Akka.Remote 1.5.62 default of `128000b` applies (and `log-frame-size-exceeding` defaults to `off`, so there isn't even a warning).
4. **Amplifier**: no custom Akka serializer / `serialization-bindings` is configured, so the already-serialized `configJson` is JSON-escaped **again** inside the envelope by the default serializer (every `"` → `\"`), roughly doubling that portion of the on-wire payload. So the raw flattened JSON only needs to be ~60-70 KB to blow the 128 KB frame.

When the encoded frame exceeds the limit, the transport drops **that one message** without tearing down the association (heartbeats/telemetry keep flowing, so the site still reports healthy), the site never receives the deploy, and the central's `Ask` (timeout = `DeploymentTimeout` = 2 min) expires.

### Why the 3rd same-base composition is the trigger

The flattener (`FlatteningService`) fully expands **each** composition's attributes/alarms/scripts, path-qualified, so N compositions of the same base ≈ N× those members in the config. The config grows roughly linearly with composition count; here **2 `DelmiaReceiver` copies sat under the limit, the 3rd pushed it over**. Removing the composed *scripts* did not help, because the *attribute* set alone (×3) already exceeded the frame.

## Evidence (live logs, wonder-app-vd03, 2026-06-26)

Central log:

```
07:44:28.033 [INF] Sending deployment 82a39ef5… for instance Z28061Sim to site site-a
… (no further deploy activity; only DB/health/telemetry) …
07:46:28.038 [ERR] Deployment 82a39ef5… for instance Z28061Sim failed
07:46:28.039 [ERR] Management command MgmtDeployInstanceCommand failed
07:46:28.039 [WRN] Dead letter: ManagementError from akka://…/temp/3d to deadLetters
```

Site log (same window): **nothing** (no deploy receipt, no error).
The last deploy the site ever received was the smaller side-based config:

```
05:51:46.429 [INF] Deploying instance Z28061Sim, deploymentId=e531d216…   ← worked
```

No `OversizedPayloadException`, serialization error, or deserialization error appears on **either** node, which is consistent with a silent transport-level frame drop. The differentiator is purely config size (2-receiver config delivered & applied; 3-receiver config never arrived).

## Secondary bug (found during the same investigation)

`AskTimeoutException` does **not** derive from `System.TimeoutException`, so the catch in `DeploymentService.cs` (~line 312) mis-classifies a genuine site Ask-timeout as a generic `"Deployment error: …"` instead of the `"Communication failure: …"` / timeout branch. Consequence: the DeploymentManager-006 "query-site-before-redeploy" reconciliation (keyed off the `Communication failure:` prefix) does **not** trigger after these failures. Fix the classification so `AskTimeoutException`/`OperationCanceledException` are treated as timeouts.

## Recommended fix

> **Shipped:** the **Primary** option below (notify-and-fetch) was implemented and merged on 2026-06-26; see the Resolution section at the top. The stopgap frame-size bump and the config-only workaround were NOT taken.

### Primary (recommended): notify-and-fetch instead of push

Stop shipping the full flattened config inside the Akka message. Instead:

1. Central persists the pending flattened config (it already persists `DeployedConfigSnapshots` on success; extend this to a pending/by-`deploymentId` store).
2. Central sends the site a **small** "apply deployment" notification that identifies the deployment, the instance, and the revisionHash, but carries no config body.
3. The **site pulls** the config by `deploymentId` and applies it, then sends a completion/status response.
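
The three steps above can be sketched as a small in-memory simulation. This is an illustrative Python model under stated assumptions, not the project's C# implementation: `Central`, `Site`, `RefreshNotification`, and every method name are hypothetical, the fetch is a direct method call standing in for the out-of-band request, and the per-deployment token follows the description in the Resolution section.

```python
import hashlib
import json
import secrets
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RefreshNotification:
    """Hypothetical small message: the only payload on the size-limited transport."""
    deployment_id: str
    instance: str
    revision_hash: str
    token: str  # per-deployment fetch token


def _digest(config_json: str) -> str:
    return hashlib.sha256(config_json.encode("utf-8")).hexdigest()


class Central:
    def __init__(self) -> None:
        self._pending: dict[str, tuple[str, str]] = {}  # id -> (token, config JSON)
        self.results: dict[str, str] = {}               # id -> completion status

    def stage(self, deployment_id: str, instance: str, config: dict) -> RefreshNotification:
        """Steps 1 and 2: persist the pending config, return the small notification."""
        config_json = json.dumps(config, sort_keys=True)
        token = secrets.token_hex(16)
        self._pending[deployment_id] = (token, config_json)
        return RefreshNotification(deployment_id, instance, _digest(config_json), token)

    def fetch(self, deployment_id: str, token: str) -> str:
        """Stand-in for the out-of-band fetch endpoint the site calls."""
        expected, config_json = self._pending[deployment_id]
        if not secrets.compare_digest(expected, token):
            raise PermissionError("token does not match this deployment")
        return config_json

    def complete(self, deployment_id: str, status: str) -> None:
        """Completion ack: central records the authoritative outcome."""
        self.results[deployment_id] = status
        self._pending.pop(deployment_id, None)


class Site:
    def __init__(self, central: Central) -> None:
        self._central = central
        self.applied: dict[str, dict] = {}

    def on_notification(self, note: RefreshNotification) -> None:
        """Step 3: pull by deployment id, verify the hash, apply, report back."""
        config_json = self._central.fetch(note.deployment_id, note.token)
        if _digest(config_json) != note.revision_hash:
            self._central.complete(note.deployment_id, "failed: revision hash mismatch")
            return
        self.applied[note.instance] = json.loads(config_json)
        self._central.complete(note.deployment_id, "ok")


# Demo: the config is far larger than a 128000-byte frame, but only the
# notification would travel over the size-limited channel.
central = Central()
site = Site(central)
big_config = {f"attr_{i}": "x" * 100 for i in range(2000)}
note = central.stage("dep-1", "Z28061Sim", big_config)
notification_bytes = len(json.dumps(asdict(note)).encode("utf-8"))
config_bytes = len(json.dumps(big_config).encode("utf-8"))
site.on_notification(note)
```

In this model the notification size depends only on the identifiers it carries, so it stays constant however large `big_config` grows, while the completion call preserves the authoritative success/failure signal on the central side.
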

This decouples the Akka message size from the config size entirely (no instance is ever un-deployable due to size), and it fits the **existing pull-based gRPC telemetry pattern** the site already uses (`PullSiteCalls`, `PullAuditEvents` over `SiteStreamService`). A new `PullDeploymentConfig`/`SiteStreamService` RPC (or reuse of the management API) is the natural home. Keep the completion ack so the central still gets authoritative success/failure.

### Stopgap (until the above ships)

Add an explicit frame size + matching buffers to the `dot-netty.tcp` block in `AkkaHostedService.BuildHocon`, and enable `log-frame-size-exceeding` (an `akka.remote`-level setting, so it sits beside `dot-netty.tcp` rather than inside it), e.g.:

```hocon
remote {
  # akka.remote-level setting: log message types whose payload exceeds this
  # size, so oversized messages are visible before they are silently dropped
  log-frame-size-exceeding = 1000000b

  dot-netty.tcp {
    hostname = …
    port = …
    maximum-frame-size = 4000000b     # 4 MB
    send-buffer-size = 8000000b
    receive-buffer-size = 8000000b
  }
}
```

This is a **code** change (no appsettings override exists) and must be deployed to **both** central and site (they share the Host); rebuild + redeploy per `deploy/wonder-app-vd03/RUNBOOK.md` (Upgrading). Consider exposing it via `appsettings`/`CommunicationOptions` so it's tunable without a rebuild. Also fix the `AskTimeoutException` classification.

### Config-only workaround (specific to this case)

For the generic-`DelmiaReceiver` rollout specifically: removing the `Left`/`Right` `DelmiaReceiver` compositions and keeping **only** the single generic `DelmiaReceiver` makes the config *smaller* than the working 2-receiver baseline, so it deploys under the current limit. It is also the intended "no sides" end state anyway (each machine bound to its own generic Galaxy receiver, `DelmiaReceiver_037/038/039`). This unblocks without an engine change, but does not fix the underlying fragility for any other instance that legitimately needs a large config.
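
Whether a given config would clear the limit can be estimated offline before attempting a deploy. The sketch below is a rough Python model of the double-encoding effect described under Root cause, not a measurement of the real wire format: the single-field envelope is an assumed stand-in for the default serializer's output, and `estimate_sizes`, `fits_in_frame`, and `sample_config` are hypothetical helpers.

```python
import json

DEFAULT_MAX_FRAME_BYTES = 128_000  # Akka.Remote default cited in this report


def estimate_sizes(flattened_config: dict) -> tuple[int, int]:
    """Return (raw_bytes, embedded_bytes) for a flattened config.

    raw_bytes is the config serialized once; embedded_bytes is that JSON
    string embedded in a JSON envelope, which escapes every quote again.
    """
    raw = json.dumps(flattened_config, separators=(",", ":"))
    # Assumption: one string field approximates the command envelope.
    envelope = json.dumps({"FlattenedConfigurationJson": raw}, separators=(",", ":"))
    return len(raw.encode("utf-8")), len(envelope.encode("utf-8"))


def fits_in_frame(flattened_config: dict, limit: int = DEFAULT_MAX_FRAME_BYTES) -> bool:
    """True if the estimated embedded size is within the frame limit."""
    _, embedded = estimate_sizes(flattened_config)
    return embedded <= limit


def sample_config(compositions: int, members: int = 300) -> dict:
    """Synthetic config: each composition repeats the same path-qualified members."""
    return {
        f"Receiver{c}.Attr{m}": {"type": "String", "value": ""}
        for c in range(compositions)
        for m in range(members)
    }


if __name__ == "__main__":
    for n in (1, 2, 3):
        config = sample_config(n)
        raw, embedded = estimate_sizes(config)
        print(f"{n} composition(s): raw={raw} embedded={embedded} fits={fits_in_frame(config)}")
```

The synthetic member shape and count are arbitrary, so the printed sizes only show the trend (near-linear growth with composition count, plus the escaping overhead); feeding in the real flattened JSON for an instance gives a more useful estimate.
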
## Current operational state (to clean up)

> **Engine fix resolves the blocker.** With notify-and-fetch merged, a large flattened config is no longer size-capped, so the generic-`DelmiaReceiver` rollout (and the `Download` handshake re-add) can proceed on `Z28061Sim` without hitting this bug. The items below are the operational re-rollout steps, now unblocked; they are NOT engine work.

- `Z28061Sim` is running on deployment **id 55** (side-based, with the earlier `side`-crash fix) and is healthy, but is **not currently redeployable**: `CvdReactor` has the extra generic `DelmiaRecv`→`DelmiaReceiver_037` composition + binding, so every redeploy hits this bug.
- To restore redeployability, either apply the config-only workaround above (consolidate to one generic receiver) or roll back the generic composition + restore the side-based `ProcessRecipeDownload`.
- The `Download` handshake method (mirroring `MESReceiver.MoveIn`) is currently removed from the `DelmiaReceiver` base template (template 12); re-add it once a deployable path is chosen.

## Files

- `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs` (BuildHocon: missing `maximum-frame-size`)
- `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs` (inline config serialize; AskTimeout mis-classification)
- `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs` (`FlattenedConfigurationJson`)
- `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs` (ClusterClient send)
- `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationOptions.cs` (`DeploymentTimeout = 2 min`)
- `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs` (per-composition member expansion)