diff --git a/docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md b/docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md
new file mode 100644
index 00000000..680273ae
--- /dev/null
+++ b/docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md
@@ -0,0 +1,170 @@
+# Deploy hangs (120 s) when the flattened instance config exceeds Akka's default frame size
+
+**Date:** 2026-06-26 · **Status:** RESOLVED (2026-06-26) · **Severity:** High (silently un-deployable instances) · **Area:** Deployment Manager / Cluster Communication
+
+## Resolution (2026-06-26)
+
+Fixed via the **notify-and-fetch** rework (the primary recommendation below), not the frame-size stopgap. The full flattened config no longer travels inside any Akka message: central stages it in a `PendingDeployment` row and sends a small `RefreshDeploymentCommand`; the site fetches the config over HTTP using a per-deployment token. Both frame-trapped Akka hops are fixed — the central→site deploy hop **and** the site active→standby `SiteReplicationActor` hop (which carried the same config across an intra-site Akka hop with the same silent-drop trap). The secondary `AskTimeoutException` mis-classification (below) is fixed, and a startup `SiteReconciliationActor` self-heal plus a topology-page performance fix shipped alongside.
+
+- **Design:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch-design.md`](../plans/2026-06-26-deploy-config-notify-and-fetch-design.md)
+- **Plan:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch.md`](../plans/2026-06-26-deploy-config-notify-and-fetch.md)
+- **Validated:** live docker-cluster smoke — a previously-hanging deploy now completes in ~0.11 s; reconciliation heals single-node and concurrent-both-missing gaps.
+
+The diagnosis below is retained as the historical record of how the bug was found and reasoned about.
+
+## Summary
+
+Deploying an instance whose **flattened configuration JSON is large enough** that the
+`DeployInstanceCommand` Akka message exceeds Akka.Remote's default `maximum-frame-size`
+(**128 KB / `128000b`**) causes the message to be **silently dropped in transit** between the
+central and site nodes. The central's deploy `Ask` then times out after exactly **120 s**
+(`Communication:DeploymentTimeout`), the deployment is recorded as failed, and **the instance
+cannot be (re)deployed at all** until the config shrinks back under the limit.
+
+It was discovered while adding a **third composition of the same base template** (a generic
+`DelmiaReceiver` added to `CvdReactor`, which already composes `LeftDelmiaReceiver` +
+`RightDelmiaReceiver`). Each composition fully expands the composed template's members into the
+flattened config, so the 3rd copy tipped the message over 128 KB.
+
+## How it manifests
+
+- CLI/UI deploy returns `{"error":"Deployment error: Timeout after 120.00 seconds","code":"COMMAND_FAILED"}` (or a 504 via the management API) after ~120 s.
+- `deploy status` shows the record going `InProgress (1) → Failed (3)` with `errorMessage = "Deployment error: Timeout after 120.00 seconds"`.
+- The **running instance is unaffected** — deploys are atomic, so it keeps serving its last successful deployment.
+- Smaller configs deploy normally and fast (the working baseline here, the 2-receiver side-based config, completed in **0.04 s**).
+
+## Root cause
+
+The deploy ships the **entire flattened config inline** as a JSON string inside the Akka message,
+and Akka.Remote's frame size is left at the un-configured default:
+
+1. **Full config carried in the message** — `DeploymentService.cs` serializes the flattened config
+   (`JsonSerializer.Serialize(flattenedConfig)`) and puts it in
+   `DeployInstanceCommand.FlattenedConfigurationJson`
+   (`src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs`).
+2. **Sent central → site over Akka** — via ClusterClient
+   (`CentralCommunicationActor` → `ClusterClient.Send("/user/site-communication", …)`).
+3. **No frame-size override** — `AkkaHostedService.BuildHocon`
+   (`src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs`, ~lines 240-244) sets only
+   `dot-netty.tcp { hostname; port }`. There is **no `maximum-frame-size`** and **no appsettings
+   knob** for it, so the Akka.Remote 1.5.62 default of `128000b` applies (and
+   `log-frame-size-exceeding` defaults to `off`, so there isn't even a warning).
+4. **Amplifier** — no custom Akka serializer / `serialization-bindings` is configured, so the
+   already-serialized `configJson` is JSON-escaped **again** inside the envelope by the default
+   serializer (every `"` → `\"`), roughly doubling that portion of the on-wire payload. So the raw
+   flattened JSON only needs to be ~60-70 KB to blow the 128 KB frame.
+
+When the encoded frame exceeds the limit, the transport drops **that one message** without tearing
+down the association (heartbeats/telemetry keep flowing, so the site still reports healthy), the
+site never receives the deploy, and the central's `Ask` (timeout = `DeploymentTimeout` = 2 min)
+expires.
+
+### Why the 3rd same-base composition is the trigger
+
+The flattener (`FlatteningService`) fully expands **each** composition's attributes/alarms/scripts,
+path-qualified — so N compositions of the same base ≈ N× those members in the config. The config
+grows roughly linearly with composition count; here **2 `DelmiaReceiver` copies sat under the limit,
+the 3rd pushed it over**. Removing the composed *scripts* did not help, because the *attribute* set
+alone (×3) already exceeded the frame.
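The amplifier and linear-growth points above can be illustrated with a rough size estimate. This is a minimal Python sketch under stated assumptions, not a measurement of the real system: `json.dumps` stands in for the default Akka serializer, and the attribute shape, the `flattened_config` and `wire_size` helpers, and the count of 250 attributes per composition are all hypothetical. It only shows how embedding an already-serialized JSON string inside a second JSON envelope inflates the payload, and how the payload scales with the number of compositions.

```python
import json

MAX_FRAME_BYTES = 128_000  # Akka.Remote default maximum-frame-size cited above


def flattened_config(compositions: int, attrs_per_composition: int = 250) -> dict:
    """Build a hypothetical flattened config: each composition repeats the base members."""
    return {
        f"Receiver{c}.Attr{a}": {"type": "String", "value": "", "path": f"Receiver{c}/Attr{a}"}
        for c in range(compositions)
        for a in range(attrs_per_composition)
    }


def wire_size(config: dict) -> tuple[int, int]:
    """Return (raw JSON bytes, bytes once that JSON is embedded as a string in an envelope)."""
    raw = json.dumps(config)
    # The inner JSON becomes a string value, so every '"' in it is escaped to '\"'.
    envelope = json.dumps({"deploymentId": "example", "flattenedConfigurationJson": raw})
    return len(raw.encode("utf-8")), len(envelope.encode("utf-8"))


if __name__ == "__main__":
    for n in (1, 2, 3):
        raw, framed = wire_size(flattened_config(n))
        verdict = "exceeds" if framed > MAX_FRAME_BYTES else "fits in"
        print(f"{n} composition(s): raw={raw} B, enveloped={framed} B, {verdict} the frame")
```

The exact crossover point depends on the real member shapes and on the actual serializer, so the printed numbers are only indicative of the trend.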
+
+## Evidence (live logs, wonder-app-vd03, 2026-06-26)
+
+Central log:
+```
+07:44:28.033 [INF] Sending deployment 82a39ef5… for instance Z28061Sim to site site-a
+    … (no further deploy activity; only DB/health/telemetry) …
+07:46:28.038 [ERR] Deployment 82a39ef5… for instance Z28061Sim failed
+07:46:28.039 [ERR] Management command MgmtDeployInstanceCommand failed
+07:46:28.039 [WRN] Dead letter: ManagementError from akka://…/temp/3d to deadLetters
+```
+Site log (same window): **nothing** — no deploy receipt, no error. The last deploy the site ever
+received was the smaller side-based config:
+```
+05:51:46.429 [INF] Deploying instance Z28061Sim, deploymentId=e531d216… ← worked
+```
+No `OversizedPayloadException`, serialization error, or deserialization error appears on **either**
+node — consistent with a silent transport-level frame drop. The differentiator is purely config size
+(2-receiver config delivered & applied; 3-receiver config never arrived).
+
+## Secondary bug (found during the same investigation)
+
+`AskTimeoutException` does **not** derive from `System.TimeoutException`, so the catch in
+`DeploymentService.cs` (~line 312) mis-classifies a genuine site Ask-timeout as a generic
+`"Deployment error: …"` instead of the `"Communication failure: …"` / timeout branch. Consequence:
+the DeploymentManager-006 "query-site-before-redeploy" reconciliation (keyed off the
+`Communication failure:` prefix) does **not** trigger after these failures. Fix the classification so
+`AskTimeoutException`/`OperationCanceledException` are treated as timeouts.
+
+## Recommended fix
+
+> **Shipped:** the **Primary** option below (notify-and-fetch) was implemented and merged on 2026-06-26 — see the Resolution section at the top. The stopgap frame-size bump and the config-only workaround were NOT taken.
+
+### Primary (recommended): notify-and-fetch instead of push
+
+Stop shipping the full flattened config inside the Akka message. Instead:
+
+1. Central persists the pending flattened config (it already persists `DeployedConfigSnapshots` on
+   success — extend to a pending/by-`deploymentId` store).
+2. Central sends the site a **small** "apply deployment" notification that carries only the
+   `deploymentId`, the target instance, and the config's revisionHash.
+3. The **site pulls** the config by `deploymentId` and applies it, then sends a completion/status
+   response.
+
+This decouples the Akka message size from the config size entirely (no instance is ever
+un-deployable due to size), and it fits the **existing pull-based gRPC telemetry pattern** the site
+already uses (`PullSiteCalls`, `PullAuditEvents` over `SiteStreamService`). A new
+`PullDeploymentConfig`/`SiteStreamService` RPC (or reuse of the management API) is the natural home.
+Keep the completion ack so the central still gets authoritative success/failure.
+
+### Stopgap (until the above ships)
+
+Add an explicit frame size + matching buffers to the `dot-netty.tcp` block in
+`AkkaHostedService.BuildHocon`, e.g.:
+```hocon
+remote {
+  dot-netty.tcp {
+    hostname = …
+    port = …
+    maximum-frame-size = 4000000b        # 4 MB
+    send-buffer-size = 8000000b
+    receive-buffer-size = 8000000b
+    log-frame-size-exceeding = 1000000b  # warn before silently dropping
+  }
+}
+```
+This is a **code** change (no appsettings override exists) and must be deployed to **both** central
+and site (they share the Host); rebuild + redeploy per `deploy/wonder-app-vd03/RUNBOOK.md`
+(Upgrading). Consider exposing it via `appsettings`/`CommunicationOptions` so it's tunable without a
+rebuild. Also fix the `AskTimeoutException` classification.
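The primary recommendation (notify-and-fetch) can be sketched in a few lines to show why the notification stays small no matter how large the config grows. This is an in-memory Python illustration under stated assumptions, not the shipped implementation: the names `Central`, `Notification`, and `site_apply` are hypothetical, a dictionary stands in for the pending-deployment store, and a direct method call stands in for the site's pull over the network.

```python
import hashlib
import json
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    """Small message sent over the size-limited channel (hypothetical shape)."""
    deployment_id: str
    instance: str
    revision_hash: str
    token: str


class Central:
    def __init__(self) -> None:
        # deployment_id -> (token, config_json); stands in for a persisted pending store
        self._pending: dict[str, tuple[str, str]] = {}

    def stage(self, deployment_id: str, instance: str, config: dict) -> Notification:
        """Store the config out of band and return only a small notification."""
        config_json = json.dumps(config, sort_keys=True)
        token = secrets.token_urlsafe(16)
        self._pending[deployment_id] = (token, config_json)
        revision = hashlib.sha256(config_json.encode("utf-8")).hexdigest()
        return Notification(deployment_id, instance, revision, token)

    def fetch(self, deployment_id: str, token: str) -> str:
        """Stand-in for the pull endpoint: return the config only for a matching token."""
        expected, config_json = self._pending[deployment_id]
        if not secrets.compare_digest(expected, token):
            raise PermissionError("token does not match this deployment")
        return config_json


def site_apply(central: Central, note: Notification) -> dict:
    """Site side: pull the config, check it against the notified hash, then apply it."""
    config_json = central.fetch(note.deployment_id, note.token)
    if hashlib.sha256(config_json.encode("utf-8")).hexdigest() != note.revision_hash:
        raise ValueError("fetched config does not match the notified revision")
    return json.loads(config_json)
```

In this sketch the `Notification` has a fixed, small size (an id, an instance name, a hash, and a token), so only the out-of-band fetch scales with the config. A completion ack back to central, which the text above asks to keep, is omitted for brevity.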
+
+### Config-only workaround (specific to this case)
+
+For the generic-`DelmiaReceiver` rollout specifically: removing the `Left`/`Right` `DelmiaReceiver`
+compositions and keeping **only** the single generic `DelmiaReceiver` makes the config *smaller* than
+the working 2-receiver baseline, so it deploys under the current limit — and it's the intended
+"no sides" end state anyway (each machine bound to its own generic Galaxy receiver,
+`DelmiaReceiver_037/038/039`). This unblocks without an engine change, but does not fix the
+underlying fragility for any other instance that legitimately needs a large config.
+
+## Current operational state (to clean up)
+
+> **Engine fix resolves the blocker.** With notify-and-fetch merged, a large flattened config is no longer size-capped, so the generic-`DelmiaReceiver` rollout (and the `Download` handshake re-add) can proceed on `Z28061Sim` without hitting this bug. The items below are the operational re-rollout steps, now unblocked — they are NOT engine work.
+
+- `Z28061Sim` is running on deployment **id 55** (side-based, with the earlier `side`-crash fix) and
+  is healthy, but is **not currently redeployable**: `CvdReactor` has the extra generic
+  `DelmiaRecv`→`DelmiaReceiver_037` composition + binding, so every redeploy hits this bug.
+- To restore redeployability, either apply the config-only workaround above (consolidate to one
+  generic receiver) or roll back the generic composition + restore the side-based
+  `ProcessRecipeDownload`.
+- The `Download` handshake method (mirroring `MESReceiver.MoveIn`) is currently removed from the
+  `DelmiaReceiver` base template (template 12); re-add it once a deployable path is chosen.
+
+## Files
+
+- `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs` (BuildHocon — missing `maximum-frame-size`)
+- `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs` (inline config serialize; AskTimeout mis-classification)
+- `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs` (`FlattenedConfigurationJson`)
+- `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs` (ClusterClient send)
+- `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationOptions.cs` (`DeploymentTimeout = 2 min`)
+- `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs` (per-composition member expansion)
+