docs(known-issues): mark deploy-config frame-size bug RESOLVED via notify-and-fetch

The 128 KB Akka frame-size deploy trap is fixed and merged. Record the
resolution at the top of the writeup (notify-and-fetch on both the
central->site deploy hop and the site active->standby replication hop;
AskTimeout classification fix; startup reconciliation + topology perf fix),
link the design + plan docs, and note the live-smoke validation. The
diagnosis is retained as historical record.
Joseph Doherty
2026-06-26 23:09:24 -04:00
parent f48a748f37
commit d9f5fbb664
@@ -0,0 +1,170 @@
# Deploy hangs (120 s) when the flattened instance config exceeds Akka's default frame size
**Date:** 2026-06-26 · **Status:** RESOLVED (2026-06-26) · **Severity:** High (silently un-deployable instances) · **Area:** Deployment Manager / Cluster Communication
## Resolution (2026-06-26)
Fixed via the **notify-and-fetch** rework (the primary recommendation below), not the frame-size stopgap. The full flattened config no longer travels inside any Akka message: central stages it in a `PendingDeployment` row and sends a small `RefreshDeploymentCommand`; the site fetches the config over HTTP using a per-deployment token. Both frame-trapped Akka hops are fixed — the central→site deploy hop **and** the site active→standby `SiteReplicationActor` hop (which carried the same config across an intra-site Akka hop with the same silent-drop trap). The secondary `AskTimeoutException` mis-classification (below) is fixed, and a startup `SiteReconciliationActor` self-heal plus a topology-page performance fix shipped alongside.
- **Design:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch-design.md`](../plans/2026-06-26-deploy-config-notify-and-fetch-design.md)
- **Plan:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch.md`](../plans/2026-06-26-deploy-config-notify-and-fetch.md)
- **Validated:** live docker-cluster smoke — a previously-hanging deploy now completes in ~0.11 s; reconciliation heals single-node and concurrent-both-missing gaps.
The diagnosis below is retained as the historical record of how the bug was found and reasoned about.
## Summary
Deploying an instance whose **flattened configuration JSON is large enough** that the
`DeployInstanceCommand` Akka message exceeds Akka.Remote's default `maximum-frame-size`
(**128 KB / `128000b`**) causes the message to be **silently dropped in transit** between the
central and site nodes. The central's deploy `Ask` then times out after exactly **120 s**
(`Communication:DeploymentTimeout`), the deployment is recorded as failed, and **the instance
cannot be (re)deployed at all** until the config shrinks back under the limit.
The bug was discovered while adding a **third composition of the same base template** (a generic
`DelmiaReceiver` added to `CvdReactor`, which already composes `LeftDelmiaReceiver` +
`RightDelmiaReceiver`). Each composition fully expands the composed template's members into the
flattened config, so the 3rd copy tipped the message over 128 KB.
## How it manifests
- CLI/UI deploy returns `{"error":"Deployment error: Timeout after 120.00 seconds","code":"COMMAND_FAILED"}` (or a 504 via the management API) after ~120 s.
- `deploy status` shows the record going `InProgress (1) → Failed (3)` with `errorMessage = "Deployment error: Timeout after 120.00 seconds"`.
- The **running instance is unaffected** — deploys are atomic, so it keeps serving its last successful deployment.
- Smaller configs deploy normally and fast (the working baseline here, the 2-receiver side-based config, completed in **0.04 s**).
## Root cause
The deploy ships the **entire flattened config inline** as a JSON string inside the Akka message,
and Akka.Remote's frame size is left at the un-configured default:
1. **Full config carried in the message:** `DeploymentService.cs` serializes the flattened config
(`JsonSerializer.Serialize(flattenedConfig)`) and puts it in
`DeployInstanceCommand.FlattenedConfigurationJson`
(`src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs`).
2. **Sent central → site over Akka:** via ClusterClient
(`CentralCommunicationActor` → `ClusterClient.Send("/user/site-communication", …)`).
3. **No frame-size override:** `AkkaHostedService.BuildHocon`
(`src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs`, ~lines 240-244) sets only
`dot-netty.tcp { hostname; port }`. There is **no `maximum-frame-size`** and **no appsettings
knob** for it, so the Akka.Remote 1.5.62 default of `128000b` applies (and
`log-frame-size-exceeding` defaults to `off`, so there isn't even a warning).
4. **Amplifier:** no custom Akka serializer / `serialization-bindings` is configured, so the
already-serialized `configJson` is JSON-escaped **again** inside the envelope by the default
serializer (every `"` → `\"`), roughly doubling that portion of the on-wire payload. So the raw
flattened JSON only needs to be ~60-70 KB to blow the 128 KB frame.
When the encoded frame exceeds the limit, the transport drops **that one message** without tearing
down the association (heartbeats/telemetry keep flowing, so the site still reports healthy), the
site never receives the deploy, and the central's `Ask` (timeout = `DeploymentTimeout` = 2 min)
expires.
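The amplifier in step 4 can be reproduced outside Akka. The sketch below is a minimal illustration in Python (not the project's C# code), assuming the default serializer carries the pre-serialized config as an ordinary JSON string field. The helper `envelope_sizes` and the sample config are hypothetical; only the field name `FlattenedConfigurationJson` comes from the source.

```python
import json

def envelope_sizes(config: dict) -> tuple[int, int]:
    """Return (raw_bytes, enveloped_bytes) for a config carried as a JSON string field."""
    # First serialization: the flattened config becomes a JSON string.
    raw = json.dumps(config, separators=(",", ":"))
    # Second serialization: that string is embedded in an outer JSON envelope,
    # so every '"' inside it is escaped to '\"' (one extra byte per quote).
    envelope = json.dumps({"FlattenedConfigurationJson": raw}, separators=(",", ":"))
    return len(raw.encode("utf-8")), len(envelope.encode("utf-8"))

# Hypothetical quote-dense config: many short keys and string values.
sample = {f"Receiver.Attr{i}": {"type": "string", "value": ""} for i in range(1000)}
raw_bytes, enveloped_bytes = envelope_sizes(sample)
print(f"raw={raw_bytes} B, enveloped={enveloped_bytes} B, "
      f"ratio={enveloped_bytes / raw_bytes:.2f}")
```

The inflation depends on quote density: each `"` costs one extra byte after re-escaping, so configs built from many short keys and values grow the most.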
### Why the 3rd same-base composition is the trigger
The flattener (`FlatteningService`) fully expands **each** composition's attributes/alarms/scripts,
path-qualified — so N compositions of the same base ≈ N× those members in the config. The config
grows roughly linearly with composition count; here **2 `DelmiaReceiver` copies sat under the limit,
the 3rd pushed it over**. Removing the composed *scripts* did not help, because the *attribute* set
alone (×3) already exceeded the frame.
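The linear growth can be illustrated with a toy flattener. This is a rough Python sketch, not the real `FlatteningService`: the member shape, the 200-member base template, and the helper names are all assumptions made up for the example. It only shows that N compositions of one base yield N path-qualified copies of every member.

```python
import json

def flatten(compositions: list[str], base_members: list[str]) -> dict:
    """Toy flattener: each composition gets its own path-qualified copy of every base member."""
    return {
        f"{composition}.{member}": {"kind": "attribute", "value": None}
        for composition in compositions
        for member in base_members
    }

def flattened_bytes(compositions: list[str], base_members: list[str]) -> int:
    """Size in bytes of the flattened config once serialized to JSON."""
    return len(json.dumps(flatten(compositions, base_members)).encode("utf-8"))

# Hypothetical base template with 200 members, composed one, two, then three times.
members = [f"Attr{i:03d}" for i in range(200)]
names = ["LeftDelmiaReceiver", "RightDelmiaReceiver", "DelmiaReceiver"]
for n in range(1, 4):
    print(n, "composition(s):", flattened_bytes(names[:n], members), "bytes")
```

Because each added composition contributes a near-constant number of bytes, a config that sits comfortably under a fixed limit at N copies can cross it at N+1 with no other change.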
## Evidence (live logs, wonder-app-vd03, 2026-06-26)
Central log:
```
07:44:28.033 [INF] Sending deployment 82a39ef5… for instance Z28061Sim to site site-a
… (no further deploy activity; only DB/health/telemetry) …
07:46:28.038 [ERR] Deployment 82a39ef5… for instance Z28061Sim failed
07:46:28.039 [ERR] Management command MgmtDeployInstanceCommand failed
07:46:28.039 [WRN] Dead letter: ManagementError from akka://…/temp/3d to deadLetters
```
Site log (same window): **nothing** — no deploy receipt, no error. The last deploy the site ever
received was the smaller side-based config:
```
05:51:46.429 [INF] Deploying instance Z28061Sim, deploymentId=e531d216… ← worked
```
No `OversizedPayloadException`, serialization error, or deserialization error appears on **either**
node — consistent with a silent transport-level frame drop. The differentiator is purely config size
(2-receiver config delivered & applied; 3-receiver config never arrived).
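As a quick check that the failure is the `Ask` timeout and nothing else, the gap between the send line and the failure line in the central log can be computed directly. The helper below is a small hypothetical Python sketch; the two timestamps are copied from the excerpt above.

```python
from datetime import datetime

def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two same-day log timestamps in HH:MM:SS.mmm form."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()

# "Sending deployment" line vs. "Deployment ... failed" line from the central log.
print(elapsed_seconds("07:44:28.033", "07:46:28.038"))  # about 120.005
```

The result is within a few milliseconds of the 2-minute `DeploymentTimeout`, which fits a message that was never answered rather than one that was rejected.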
## Secondary bug (found during the same investigation)
`AskTimeoutException` does **not** derive from `System.TimeoutException`, so the catch in
`DeploymentService.cs` (~line 312) mis-classifies a genuine site Ask-timeout as a generic
`"Deployment error: …"` instead of the `"Communication failure: …"` / timeout branch. Consequence:
the DeploymentManager-006 "query-site-before-redeploy" reconciliation (keyed off the
`Communication failure:` prefix) does **not** trigger after these failures. Fix the classification so
`AskTimeoutException`/`OperationCanceledException` are treated as timeouts.
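The mis-classification follows a general pattern: a timeout-like exception that does not inherit from the platform's timeout type slips past a handler written for that type. The sketch below is a Python analogue, not the actual C# catch block. `AskTimeoutError` and `OperationCancelledError` are hypothetical stand-ins for `AskTimeoutException` and `OperationCanceledException`, and the message prefixes mirror the ones quoted above.

```python
class AskTimeoutError(Exception):
    """Stand-in for Akka's AskTimeoutException: NOT a subclass of TimeoutError."""

class OperationCancelledError(Exception):
    """Stand-in for .NET's OperationCanceledException."""

def classify_buggy(exc: Exception) -> str:
    # Only the built-in timeout type is recognized, so an ask-timeout falls through.
    if isinstance(exc, TimeoutError):
        return f"Communication failure: {exc}"
    return f"Deployment error: {exc}"

def classify_fixed(exc: Exception) -> str:
    # Treat every timeout-like exception as a communication failure.
    if isinstance(exc, (TimeoutError, AskTimeoutError, OperationCancelledError)):
        return f"Communication failure: {exc}"
    return f"Deployment error: {exc}"

timeout = AskTimeoutError("Timeout after 120.00 seconds")
print(classify_buggy(timeout))  # Deployment error: Timeout after 120.00 seconds
print(classify_fixed(timeout))  # Communication failure: Timeout after 120.00 seconds
```

Any downstream logic keyed on the `Communication failure:` prefix only fires once the classification is widened this way.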
## Recommended fix
> **Shipped:** the **Primary** option below (notify-and-fetch) was implemented and merged on 2026-06-26 — see the Resolution section at the top. The stopgap frame-size bump and the config-only workaround were NOT taken.
### Primary (recommended): notify-and-fetch instead of push
Stop shipping the full flattened config inside the Akka message. Instead:
1. Central persists the pending flattened config (it already persists `DeployedConfigSnapshots` on
success — extend to a pending/by-`deploymentId` store).
2. Central sends the site a **small** "apply deployment `<deploymentId>` for instance `<id>`
(revisionHash `<hash>`)" notification.
3. The **site pulls** the config by `deploymentId` and applies it, then sends a completion/status
response.
This decouples the Akka message size from the config size entirely (no instance is ever
un-deployable due to size), and it fits the **existing pull-based gRPC telemetry pattern** the site
already uses (`PullSiteCalls`, `PullAuditEvents` over `SiteStreamService`). A new
`PullDeploymentConfig`/`SiteStreamService` RPC (or reuse of the management API) is the natural home.
Keep the completion ack so the central still gets authoritative success/failure.
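The three steps above can be sketched as an in-memory model. This is a simplified Python illustration under stated assumptions, not the shipped design: the class and method names are hypothetical, the pending store is a plain dict, and the fetch is a direct call standing in for whatever transport carries the pull. The per-deployment token follows the approach described in the Resolution section.

```python
import hmac
import json
import secrets
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class DeployNotification:
    """All that crosses the size-limited channel: identifiers and a token, never the config."""
    deployment_id: str
    instance_id: str
    revision_hash: str
    fetch_token: str

class Central:
    def __init__(self) -> None:
        # deployment_id -> (token, serialized config); stands in for a pending-deployment store.
        self._pending: dict[str, tuple[str, str]] = {}

    def stage(self, deployment_id: str, instance_id: str,
              revision_hash: str, config: dict) -> DeployNotification:
        """Step 1 and 2: persist the config, then build the small notification."""
        token = secrets.token_urlsafe(32)
        self._pending[deployment_id] = (token, json.dumps(config))
        return DeployNotification(deployment_id, instance_id, revision_hash, token)

    def fetch(self, deployment_id: str, token: str) -> str:
        """Serve the staged config only to a caller holding the matching token."""
        expected, config_json = self._pending[deployment_id]
        if not hmac.compare_digest(expected, token):
            raise PermissionError("invalid fetch token")
        return config_json

class Site:
    def __init__(self, central: Central) -> None:
        self.central = central
        self.applied: dict[str, dict] = {}

    def on_notification(self, note: DeployNotification) -> str:
        """Step 3: pull the config by deployment ID, apply it, and report status."""
        config = json.loads(self.central.fetch(note.deployment_id, note.fetch_token))
        self.applied[note.instance_id] = config
        return "Success"

central = Central()
big_config = {f"Attr{i}": "x" * 100 for i in range(5000)}  # well over 128 KB when serialized
note = central.stage("dep-1", "Z28061Sim", "hash-1", big_config)
print(Site(central).on_notification(note), len(json.dumps(asdict(note))), "bytes notified")
```

The notification stays a few hundred bytes however large the config grows, which is the property that removes the frame-size ceiling.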
### Stopgap (until the above ships)
Add an explicit frame size + matching buffers to the `dot-netty.tcp` block in
`AkkaHostedService.BuildHocon`, e.g.:
```hocon
remote {
dot-netty.tcp {
hostname = …
port = …
maximum-frame-size = 4000000b # 4 MB
send-buffer-size = 8000000b
receive-buffer-size = 8000000b
log-frame-size-exceeding = 1000000b # warn before silently dropping
}
}
```
This is a **code** change (no appsettings override exists) and must be deployed to **both** central
and site (they share the Host); rebuild + redeploy per `deploy/wonder-app-vd03/RUNBOOK.md`
(Upgrading). Consider exposing it via `appsettings`/`CommunicationOptions` so it's tunable without a
rebuild. Also fix the `AskTimeoutException` classification.
### Config-only workaround (specific to this case)
For the generic-`DelmiaReceiver` rollout specifically: removing the `Left`/`Right` `DelmiaReceiver`
compositions and keeping **only** the single generic `DelmiaReceiver` makes the config *smaller* than
the working 2-receiver baseline, so it deploys under the current limit — and it's the intended
"no sides" end state anyway (each machine bound to its own generic Galaxy receiver,
`DelmiaReceiver_037/038/039`). This unblocks without an engine change, but does not fix the
underlying fragility for any other instance that legitimately needs a large config.
## Current operational state (to clean up)
> **Engine fix resolves the blocker.** With notify-and-fetch merged, a large flattened config is no longer size-capped, so the generic-`DelmiaReceiver` rollout (and the `Download` handshake re-add) can proceed on `Z28061Sim` without hitting this bug. The items below are the operational re-rollout steps, now unblocked — they are NOT engine work.
- `Z28061Sim` is running on deployment **id 55** (side-based, with the earlier `side`-crash fix) and
is healthy, but is **not currently redeployable**: `CvdReactor` has the extra generic
`DelmiaRecv` → `DelmiaReceiver_037` composition + binding, so every redeploy hits this bug.
- To restore redeployability, either apply the config-only workaround above (consolidate to one
generic receiver) or roll back the generic composition + restore the side-based
`ProcessRecipeDownload`.
- The `Download` handshake method (mirroring `MESReceiver.MoveIn`) is currently removed from the
`DelmiaReceiver` base template (template 12); re-add it once a deployable path is chosen.
## Files
- `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs` (BuildHocon — missing `maximum-frame-size`)
- `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs` (inline config serialize; AskTimeout mis-classification)
- `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs` (`FlattenedConfigurationJson`)
- `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs` (ClusterClient send)
- `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationOptions.cs` (`DeploymentTimeout = 2 min`)
- `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs` (per-composition member expansion)