docs(known-issues): mark deploy-config frame-size bug RESOLVED via notify-and-fetch

The 128 KB Akka frame-size deploy trap is fixed and merged. Record the resolution at the top of the writeup (notify-and-fetch on both the central->site deploy hop and the site active->standby replication hop; AskTimeout classification fix; startup reconciliation + topology perf fix), link the design + plan docs, and note the live-smoke validation. The diagnosis is retained as historical record.
2026-06-26 23:09:24 -04:00
parent f48a748f37
commit d9f5fbb664
1 changed files with 170 additions and 0 deletions
@@ -0,0 +1,170 @@
+# Deploy hangs (120 s) when the flattened instance config exceeds Akka's default frame size
+
+**Date:** 2026-06-26 · **Status:** RESOLVED (2026-06-26) · **Severity:** High (silently un-deployable instances) · **Area:** Deployment Manager / Cluster Communication
+
+## Resolution (2026-06-26)
+
+Fixed via the **notify-and-fetch** rework (the primary recommendation below), not the frame-size stopgap. The full flattened config no longer travels inside any Akka message: central stages it in a `PendingDeployment` row and sends a small `RefreshDeploymentCommand`; the site fetches the config over HTTP using a per-deployment token. Both frame-trapped Akka hops are fixed — the central→site deploy hop **and** the site active→standby `SiteReplicationActor` hop (which carried the same config across an intra-site Akka hop with the same silent-drop trap). The secondary `AskTimeoutException` mis-classification (below) is fixed, and a startup `SiteReconciliationActor` self-heal plus a topology-page performance fix shipped alongside.
+
+- **Design:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch-design.md`](../plans/2026-06-26-deploy-config-notify-and-fetch-design.md)
+- **Plan:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch.md`](../plans/2026-06-26-deploy-config-notify-and-fetch.md)
+- **Validated:** live docker-cluster smoke — a previously hanging deploy now completes in ~0.11 s; reconciliation heals single-node and concurrent-both-missing gaps.
+
+The diagnosis below is retained as the historical record of how the bug was found and reasoned about.
+
+## Summary
+
+When an instance's **flattened configuration JSON is large enough** to push the
+`DeployInstanceCommand` Akka message past Akka.Remote's default `maximum-frame-size`
+(**128 KB / `128000b`**), the message is **silently dropped in transit** between the
+central and site nodes. The central's deploy `Ask` then times out after exactly **120 s**
+(`Communication:DeploymentTimeout`), the deployment is recorded as failed, and **the instance
+cannot be (re)deployed at all** until the config shrinks back under the limit.
+
+The bug was discovered while adding a **third composition of the same base template** (a generic
+`DelmiaReceiver` added to `CvdReactor`, which already composes `LeftDelmiaReceiver` +
+`RightDelmiaReceiver`). Each composition fully expands the composed template's members into the
+flattened config, so the 3rd copy tipped the message over 128 KB.
+
+## How it manifests
+
+- CLI/UI deploy returns `{"error":"Deployment error: Timeout after 120.00 seconds","code":"COMMAND_FAILED"}` (or a 504 via the management API) after ~120 s.
+- `deploy status` shows the record going `InProgress (1) → Failed (3)` with `errorMessage = "Deployment error: Timeout after 120.00 seconds"`.
+- The **running instance is unaffected** — deploys are atomic, so it keeps serving its last successful deployment.
+- Smaller configs deploy normally and fast (the working baseline here, the 2-receiver side-based config, completed in **0.04 s**).
+
+## Root cause
+
+The deploy ships the **entire flattened config inline** as a JSON string inside the Akka message,
+and Akka.Remote's frame size is left at the un-configured default:
+
+1. **Full config carried in the message** — `DeploymentService.cs` serializes the flattened config
+   (`JsonSerializer.Serialize(flattenedConfig)`) and puts it in
+   `DeployInstanceCommand.FlattenedConfigurationJson`
+   (`src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs`).
+2. **Sent central → site over Akka** — via ClusterClient
+   (`CentralCommunicationActor` → `ClusterClient.Send("/user/site-communication", …)`).
+3. **No frame-size override** — `AkkaHostedService.BuildHocon`
+   (`src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs`, ~lines 240-244) sets only
+   `dot-netty.tcp { hostname; port }`. There is **no `maximum-frame-size`** and **no appsettings
+   knob** for it, so the Akka.Remote 1.5.62 default of `128000b` applies (and
+   `log-frame-size-exceeding` defaults to `off`, so there isn't even a warning).
+4. **Amplifier** — no custom Akka serializer / `serialization-bindings` is configured, so the
+   already-serialized `configJson` is JSON-escaped **again** inside the envelope by the default
+   serializer (every `"` → `\"`), roughly doubling that portion of the on-wire payload. So the raw
+   flattened JSON only needs to be ~60-70 KB to blow the 128 KB frame.
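+
+A rough way to measure the escaping overhead from item 4 on a given config is sketched below. It
+is an approximation under stated assumptions: it uses `System.Text.Json` with relaxed escaping to
+mimic the `"` → `\"` behaviour, not Akka's actual default serializer, and the sample config is a
+hypothetical stand-in. Only `FlattenedConfigurationJson` and the `128000b` limit come from this
+writeup.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Text;
+using System.Text.Encodings.Web;
+using System.Text.Json;
+
+// Hypothetical stand-in for a flattened config (the real shape is not shown in this writeup).
+var flattened = new Dictionary<string, string>();
+for (int i = 0; i < 2000; i++)
+    flattened[$"DelmiaReceiver.Attr{i}"] = "value";
+
+// Relaxed escaping writes " as \" instead of System.Text.Json's default \u0022,
+// which is closer to the escaping described in item 4.
+var options = new JsonSerializerOptions { Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping };
+
+// First pass: the config serialized to a JSON string.
+string inner = JsonSerializer.Serialize(flattened, options);
+
+// Second pass: that string embedded in an envelope, so its quotes are escaped again.
+string outer = JsonSerializer.Serialize(new { FlattenedConfigurationJson = inner }, options);
+
+int innerBytes = Encoding.UTF8.GetByteCount(inner);
+int outerBytes = Encoding.UTF8.GetByteCount(outer);
+const int DefaultMaxFrameBytes = 128_000; // Akka.Remote default cited above
+
+Console.WriteLine($"raw={innerBytes} B, enveloped={outerBytes} B, ratio={(double)outerBytes / innerBytes:F2}");
+Console.WriteLine(outerBytes > DefaultMaxFrameBytes
+    ? "enveloped payload exceeds the default frame size"
+    : "enveloped payload fits in the default frame size");
+```
+
+The real on-wire size also includes the serializer's own envelope and type metadata, so treat the
+printed ratio as a lower-bound estimate rather than the exact frame size.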
+
+When the encoded frame exceeds the limit, the transport drops **that one message** without tearing
+down the association (heartbeats/telemetry keep flowing, so the site still reports healthy), the
+site never receives the deploy, and the central's `Ask` (timeout = `DeploymentTimeout` = 2 min)
+expires.
+
+### Why the 3rd same-base composition is the trigger
+
+The flattener (`FlatteningService`) fully expands **each** composition's attributes/alarms/scripts,
+path-qualified — so N compositions of the same base ≈ N× those members in the config. The config
+grows roughly linearly with composition count; here **2 `DelmiaReceiver` copies sat under the limit,
+the 3rd pushed it over**. Removing the composed *scripts* did not help, because the *attribute* set
+alone (×3) already exceeded the frame.
+
+## Evidence (live logs, wonder-app-vd03, 2026-06-26)
+
+Central log:
+```
+07:44:28.033 [INF] Sending deployment 82a39ef5… for instance Z28061Sim to site site-a
+   … (no further deploy activity; only DB/health/telemetry) …
+07:46:28.038 [ERR] Deployment 82a39ef5… for instance Z28061Sim failed
+07:46:28.039 [ERR] Management command MgmtDeployInstanceCommand failed
+07:46:28.039 [WRN] Dead letter: ManagementError from akka://…/temp/3d to deadLetters
+```
+Site log (same window): **nothing** — no deploy receipt, no error. The last deploy the site ever
+received was the smaller side-based config:
+```
+05:51:46.429 [INF] Deploying instance Z28061Sim, deploymentId=e531d216…   ← worked
+```
+No `OversizedPayloadException`, serialization error, or deserialization error appears on **either**
+node — consistent with a silent transport-level frame drop. The differentiator is purely config size
+(2-receiver config delivered & applied; 3-receiver config never arrived).
+
+## Secondary bug (found during the same investigation)
+
+`AskTimeoutException` does **not** derive from `System.TimeoutException`, so the catch in
+`DeploymentService.cs` (~line 312) mis-classifies a genuine site Ask-timeout as a generic
+`"Deployment error: …"` instead of the `"Communication failure: …"` / timeout branch. Consequence:
+the DeploymentManager-006 "query-site-before-redeploy" reconciliation (keyed off the
+`Communication failure:` prefix) does **not** trigger after these failures. Fix the classification so
+`AskTimeoutException`/`OperationCanceledException` are treated as timeouts.
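+
+A minimal sketch of the intended classification, not the actual `DeploymentService.cs` code.
+`DeployFailureClassifier` and `Classify` are hypothetical names, and `AskTimeoutException` is
+redeclared here as a stand-in so the sketch compiles without the Akka package:
+
+```csharp
+using System;
+
+// Usage: an Ask timeout now lands in the communication-failure branch.
+Console.WriteLine(DeployFailureClassifier.Classify(
+    new AskTimeoutException("Timeout after 120.00 seconds")));
+// prints "Communication failure: Timeout after 120.00 seconds"
+
+// Stand-in for Akka.Actor.AskTimeoutException, which (as noted above) does not
+// derive from System.TimeoutException.
+public sealed class AskTimeoutException : Exception
+{
+    public AskTimeoutException(string message) : base(message) { }
+}
+
+// Hypothetical helper: maps an exception to the error-message prefix.
+public static class DeployFailureClassifier
+{
+    public static string Classify(Exception ex) => ex switch
+    {
+        // All three are treated as timeouts / communication failures.
+        TimeoutException or AskTimeoutException or OperationCanceledException
+            => $"Communication failure: {ex.Message}",
+        // Everything else stays a generic deployment error.
+        _ => $"Deployment error: {ex.Message}",
+    };
+}
+```
+
+Matching on the exception types explicitly avoids relying on an inheritance relationship that
+does not exist, and keeps the `Communication failure:` prefix that the reconciliation keys off.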
+
+## Recommended fix
+
+> **Shipped:** the **Primary** option below (notify-and-fetch) was implemented and merged on 2026-06-26 — see the Resolution section at the top. The stopgap frame-size bump and the config-only workaround were NOT taken.
+
+### Primary (recommended): notify-and-fetch instead of push
+
+Stop shipping the full flattened config inside the Akka message. Instead:
+
+1. Central persists the pending flattened config (it already persists `DeployedConfigSnapshots` on
+   success — extend to a pending/by-`deploymentId` store).
+2. Central sends the site a **small** "apply deployment `<deploymentId>` for instance `<id>`
+   (revisionHash `<hash>`)" notification.
+3. The **site pulls** the config by `deploymentId` and applies it, then sends a completion/status
+   response.
+
+This decouples the Akka message size from the config size entirely (no instance is ever
+un-deployable due to size), and it fits the **existing pull-based gRPC telemetry pattern** the site
+already uses (`PullSiteCalls`, `PullAuditEvents` over `SiteStreamService`). A new
+`PullDeploymentConfig`/`SiteStreamService` RPC (or reuse of the management API) is the natural home.
+Keep the completion ack so the central still gets authoritative success/failure.
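+
+The three steps above can be sketched as follows. This is an in-memory illustration under stated
+assumptions, not the shipped implementation: `ApplyDeploymentNotification`, `DeploymentAck`,
+`Site.Handle`, and the dictionary standing in for the pending store are all hypothetical, and the
+fetch is a plain delegate rather than any particular transport.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Step 1: central stages the (large) config keyed by deploymentId.
+var pendingStore = new Dictionary<string, string>();   // stand-in for a pending-deployment store
+string bigConfig = new string('x', 500_000);            // stand-in for a ~500 KB flattened config
+
+// Step 2: central sends only a small notification; its size does not depend on the config.
+var note = new ApplyDeploymentNotification("dep-1", "Z28061Sim", "hash-abc");
+pendingStore[note.DeploymentId] = bigConfig;
+
+// Step 3: the site pulls the config by deploymentId, applies it, and acks.
+DeploymentAck ack = Site.Handle(note, id => pendingStore[id]);
+Console.WriteLine($"{ack.DeploymentId}: success={ack.Success}");
+
+public sealed record ApplyDeploymentNotification(string DeploymentId, string InstanceId, string RevisionHash);
+public sealed record DeploymentAck(string DeploymentId, bool Success, string Error);
+
+public static class Site
+{
+    // fetchConfig abstracts the pull (gRPC, HTTP, ...); it is injected so the sketch stays transport-agnostic.
+    public static DeploymentAck Handle(ApplyDeploymentNotification note, Func<string, string> fetchConfig)
+    {
+        try
+        {
+            string config = fetchConfig(note.DeploymentId);
+            // Applying the config to the running instance is omitted in this sketch.
+            _ = config.Length;
+            return new DeploymentAck(note.DeploymentId, true, "");
+        }
+        catch (Exception ex)
+        {
+            // A failed fetch or apply is reported back so central gets an authoritative result.
+            return new DeploymentAck(note.DeploymentId, false, ex.Message);
+        }
+    }
+}
+```
+
+Because the notification carries only identifiers and a hash, it stays far below the frame limit
+whatever the config size, and the explicit ack preserves the success/failure signal central had
+with the original push.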
+
+### Stopgap (until the above ships)
+
+Add an explicit frame size + matching buffers to the `dot-netty.tcp` block in
+`AkkaHostedService.BuildHocon`, e.g.:
+```hocon
+remote {
+  dot-netty.tcp {
+    hostname = …
+    port = …
+    maximum-frame-size = 4000000b      # 4 MB
+    send-buffer-size = 8000000b
+    receive-buffer-size = 8000000b
+    log-frame-size-exceeding = 1000000b  # warn before silently dropping
+  }
+}
+```
+This is a **code** change (no appsettings override exists) and must be deployed to **both** central
+and site (they share the Host); rebuild + redeploy per `deploy/wonder-app-vd03/RUNBOOK.md`
+(Upgrading). Consider exposing it via `appsettings`/`CommunicationOptions` so it's tunable without a
+rebuild. Also fix the `AskTimeoutException` classification.
+
+### Config-only workaround (specific to this case)
+
+For the generic-`DelmiaReceiver` rollout specifically: removing the `Left`/`Right` `DelmiaReceiver`
+compositions and keeping **only** the single generic `DelmiaReceiver` makes the config *smaller* than
+the working 2-receiver baseline, so it deploys under the current limit — and it's the intended
+"no sides" end state anyway (each machine bound to its own generic Galaxy receiver,
+`DelmiaReceiver_037/038/039`). This unblocks without an engine change, but does not fix the
+underlying fragility for any other instance that legitimately needs a large config.
+
+## Current operational state (to clean up)
+
+> **Engine fix resolves the blocker.** With notify-and-fetch merged, a large flattened config is no longer size-capped, so the generic-`DelmiaReceiver` rollout (and the `Download` handshake re-add) can proceed on `Z28061Sim` without hitting this bug. The items below are the operational re-rollout steps, now unblocked — they are NOT engine work.
+
+- `Z28061Sim` is running on deployment **id 55** (side-based, with the earlier `side`-crash fix) and
+  is healthy, but is **not currently redeployable**: `CvdReactor` has the extra generic
+  `DelmiaRecv`→`DelmiaReceiver_037` composition + binding, so every redeploy hits this bug.
+- To restore redeployability, either apply the config-only workaround above (consolidate to one
+  generic receiver) or roll back the generic composition + restore the side-based
+  `ProcessRecipeDownload`.
+- The `Download` handshake method (mirroring `MESReceiver.MoveIn`) is currently removed from the
+  `DelmiaReceiver` base template (template 12); re-add it once a deployable path is chosen.
+
+## Files
+
+- `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs` (BuildHocon — missing `maximum-frame-size`)
+- `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs` (inline config serialize; AskTimeout mis-classification)
+- `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs` (`FlattenedConfigurationJson`)
+- `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs` (ClusterClient send)
+- `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationOptions.cs` (`DeploymentTimeout = 2 min`)
+- `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs` (per-composition member expansion)
+</content>