ScadaBridge/docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md
Joseph Doherty d9f5fbb664 docs(known-issues): mark deploy-config frame-size bug RESOLVED via notify-and-fetch
The 128 KB Akka frame-size deploy trap is fixed and merged. Record the
resolution at the top of the writeup (notify-and-fetch on both the
central->site deploy hop and the site active->standby replication hop;
AskTimeout classification fix; startup reconciliation + topology perf fix),
link the design + plan docs, and note the live-smoke validation. The
diagnosis is retained as historical record.
2026-06-26 23:09:24 -04:00

# Deploy hangs (120 s) when the flattened instance config exceeds Akka's default frame size
**Date:** 2026-06-26 · **Status:** RESOLVED (2026-06-26) · **Severity:** High (silently un-deployable instances) · **Area:** Deployment Manager / Cluster Communication
## Resolution (2026-06-26)
Fixed via the **notify-and-fetch** rework (the primary recommendation below), not the frame-size stopgap. The full flattened config no longer travels inside any Akka message: central stages it in a `PendingDeployment` row and sends a small `RefreshDeploymentCommand`; the site fetches the config over HTTP using a per-deployment token. Both frame-trapped Akka hops are fixed — the central→site deploy hop **and** the site active→standby `SiteReplicationActor` hop (which carried the same config across an intra-site Akka hop with the same silent-drop trap). The secondary `AskTimeoutException` mis-classification (below) is fixed, and a startup `SiteReconciliationActor` self-heal plus a topology-page performance fix shipped alongside.
- **Design:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch-design.md`](../plans/2026-06-26-deploy-config-notify-and-fetch-design.md)
- **Plan:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch.md`](../plans/2026-06-26-deploy-config-notify-and-fetch.md)
- **Validated:** live docker-cluster smoke — a previously-hanging deploy now completes in ~0.11 s; reconciliation heals single-node and concurrent-both-missing gaps.
The diagnosis below is retained as the historical record of how the bug was found and reasoned about.
## Summary
Deploying an instance whose **flattened configuration JSON is large enough** that the
`DeployInstanceCommand` Akka message exceeds Akka.Remote's default `maximum-frame-size`
(**128 KB / `128000b`**) causes the message to be **silently dropped in transit** between the
central and site nodes. The central's deploy `Ask` then times out after exactly **120 s**
(`Communication:DeploymentTimeout`), the deployment is recorded as failed, and **the instance
cannot be (re)deployed at all** until the config shrinks back under the limit.
The bug was discovered while adding a **third composition of the same base template** (a generic
`DelmiaReceiver` added to `CvdReactor`, which already composes `LeftDelmiaReceiver` +
`RightDelmiaReceiver`). Each composition fully expands the composed template's members into the
flattened config, so the 3rd copy tipped the message over 128 KB.
## How it manifests
- CLI/UI deploy returns `{"error":"Deployment error: Timeout after 120.00 seconds","code":"COMMAND_FAILED"}` (or a 504 via the management API) after ~120 s.
- `deploy status` shows the record going `InProgress (1) → Failed (3)` with `errorMessage = "Deployment error: Timeout after 120.00 seconds"`.
- The **running instance is unaffected** — deploys are atomic, so it keeps serving its last successful deployment.
- Smaller configs deploy normally and fast (the working baseline here, the 2-receiver side-based config, completed in **0.04 s**).
## Root cause
The deploy ships the **entire flattened config inline** as a JSON string inside the Akka message,
and Akka.Remote's frame size is left at the un-configured default:
1. **Full config carried in the message** — `DeploymentService.cs` serializes the flattened config
(`JsonSerializer.Serialize(flattenedConfig)`) and puts it in
`DeployInstanceCommand.FlattenedConfigurationJson`
(`src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs`).
2. **Sent central → site over Akka** — via ClusterClient
(`CentralCommunicationActor` → `ClusterClient.Send("/user/site-communication", …)`).
3. **No frame-size override** — `AkkaHostedService.BuildHocon`
(`src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs`, ~lines 240-244) sets only
`dot-netty.tcp { hostname; port }`. There is **no `maximum-frame-size`** and **no appsettings
knob** for it, so the Akka.Remote 1.5.62 default of `128000b` applies (and
`log-frame-size-exceeding` defaults to `off`, so there isn't even a warning).
4. **Amplifier** — no custom Akka serializer / `serialization-bindings` is configured, so the
already-serialized `configJson` is JSON-escaped **again** inside the envelope by the default
serializer (every `"` → `\"`), roughly doubling that portion of the on-wire payload. So the raw
flattened JSON only needs to be ~60-70 KB to blow the 128 KB frame.
When the encoded frame exceeds the limit, the transport drops **that one message** without tearing
down the association (heartbeats/telemetry keep flowing, so the site still reports healthy), the
site never receives the deploy, and the central's `Ask` (timeout = `DeploymentTimeout` = 2 min)
expires.
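
The amplification in step 4 can be checked with a quick size estimate. The sketch below is a Python stand-in, not the project's C# serializer: it assumes a hypothetical JSON envelope with a single `FlattenedConfigurationJson` string field and uses the 128000-byte limit cited above. The real Akka envelope probably adds further overhead, so the numbers are only indicative.

```python
import json

FRAME_LIMIT_BYTES = 128_000  # Akka.Remote default maximum-frame-size cited above


def envelope_size(config: dict) -> tuple[int, int]:
    """Return (raw_bytes, enveloped_bytes) for a config shipped as a JSON string field."""
    raw = json.dumps(config)
    # Hypothetical envelope: the already-serialized JSON is embedded as a string,
    # so every '"' in it is escaped to '\"' a second time.
    envelope = json.dumps({"FlattenedConfigurationJson": raw})
    return len(raw.encode("utf-8")), len(envelope.encode("utf-8"))


def fits_in_frame(config: dict, limit: int = FRAME_LIMIT_BYTES) -> bool:
    """True if the enveloped message stays within the frame limit."""
    _, enveloped = envelope_size(config)
    return enveloped <= limit
```

For the tiny config `{"a": "b"}`, `envelope_size` reports 10 raw bytes and 48 enveloped bytes: 4 extra bytes from the re-escaped quotes plus 34 bytes of wrapper. Configs dominated by short quoted keys and values see the largest relative growth.
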
### Why the 3rd same-base composition is the trigger
The flattener (`FlatteningService`) fully expands **each** composition's attributes/alarms/scripts,
path-qualified — so N compositions of the same base ≈ N× those members in the config. The config
grows roughly linearly with composition count; here **2 `DelmiaReceiver` copies sat under the limit,
the 3rd pushed it over**. Removing the composed *scripts* did not help, because the *attribute* set
alone (×3) already exceeded the frame.
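
The linear-growth argument reduces to simple arithmetic. The sketch below assumes the on-wire size is roughly `base + N * per_composition`; the function name and the byte counts in the usage note are illustrative assumptions, not measurements from this incident.

```python
def max_compositions(base_bytes: int, per_composition_bytes: int,
                     limit_bytes: int = 128_000) -> int:
    """Largest N such that base + N * per_composition still fits within the limit."""
    if per_composition_bytes <= 0:
        raise ValueError("per_composition_bytes must be positive")
    if base_bytes > limit_bytes:
        return 0  # even zero compositions would not fit
    return (limit_bytes - base_bytes) // per_composition_bytes
```

With made-up figures of 20000 bytes for the rest of the config and 45000 on-wire bytes per composition, `max_compositions(20_000, 45_000)` returns 2, which reproduces the "two fit, the third does not" pattern described above.
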
## Evidence (live logs, wonder-app-vd03, 2026-06-26)
Central log:
```
07:44:28.033 [INF] Sending deployment 82a39ef5… for instance Z28061Sim to site site-a
… (no further deploy activity; only DB/health/telemetry) …
07:46:28.038 [ERR] Deployment 82a39ef5… for instance Z28061Sim failed
07:46:28.039 [ERR] Management command MgmtDeployInstanceCommand failed
07:46:28.039 [WRN] Dead letter: ManagementError from akka://…/temp/3d to deadLetters
```
Site log (same window): **nothing** — no deploy receipt, no error. The last deploy the site ever
received was the smaller side-based config:
```
05:51:46.429 [INF] Deploying instance Z28061Sim, deploymentId=e531d216… ← worked
```
No `OversizedPayloadException`, serialization error, or deserialization error appears on **either**
node — consistent with a silent transport-level frame drop. The differentiator is purely config size
(2-receiver config delivered & applied; 3-receiver config never arrived).
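
The gap between the "Sending deployment" and "failed" lines can be checked against the 120 s `DeploymentTimeout`. A minimal sketch, assuming both timestamps fall on the same day and use the `HH:MM:SS.mmm` format shown in the excerpt (`gap_seconds` is a hypothetical helper):

```python
from datetime import datetime


def gap_seconds(start: str, end: str) -> float:
    """Seconds between two same-day HH:MM:SS.mmm log timestamps."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Timestamps copied from the central log excerpt above.
sent = "07:44:28.033"
failed = "07:46:28.038"
elapsed = gap_seconds(sent, failed)
```

Here `elapsed` comes out at about 120.005 s, so the failure fires 5 ms after the configured timeout, which is what an expired `Ask` would look like rather than an error returned by the site.
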
## Secondary bug (found during the same investigation)
`AskTimeoutException` does **not** derive from `System.TimeoutException`, so the catch in
`DeploymentService.cs` (~line 312) mis-classifies a genuine site Ask-timeout as a generic
`"Deployment error: …"` instead of the `"Communication failure: …"` / timeout branch. Consequence:
the DeploymentManager-006 "query-site-before-redeploy" reconciliation (keyed off the
`Communication failure:` prefix) does **not** trigger after these failures. Fix the classification so
`AskTimeoutException`/`OperationCanceledException` are treated as timeouts.
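
The shape of the classification fix can be sketched as follows. This is a Python analogy, not the C# code in `DeploymentService.cs`: the two exception classes are stand-ins that only borrow the .NET names, and `classify_failure` is a hypothetical helper. The point is to match timeout types from an explicit list instead of relying on inheritance from the built-in timeout type.

```python
class AskTimeoutException(Exception):
    """Stand-in for Akka's AskTimeoutException: deliberately NOT a TimeoutError subclass."""


class OperationCanceledException(Exception):
    """Stand-in for .NET's OperationCanceledException."""


# Explicit list of exception types that mean "the site did not answer in time".
TIMEOUT_TYPES = (TimeoutError, AskTimeoutException, OperationCanceledException)


def classify_failure(exc: Exception) -> str:
    """Map an exception to the error-message prefix that reconciliation keys off."""
    if isinstance(exc, TIMEOUT_TYPES):
        return f"Communication failure: {exc}"
    return f"Deployment error: {exc}"
```

A handler that only checked `TimeoutError` would send the stand-in `AskTimeoutException` down the generic `Deployment error:` branch, which mirrors the mis-classification described above.
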
## Recommended fix
> **Shipped:** the **Primary** option below (notify-and-fetch) was implemented and merged on 2026-06-26 — see the Resolution section at the top. The stopgap frame-size bump and the config-only workaround were NOT taken.
### Primary (recommended): notify-and-fetch instead of push
Stop shipping the full flattened config inside the Akka message. Instead:
1. Central persists the pending flattened config (it already persists `DeployedConfigSnapshots` on
success — extend to a pending/by-`deploymentId` store).
2. Central sends the site a **small** "apply deployment `<deploymentId>` for instance `<id>`
(revisionHash `<hash>`)" notification.
3. The **site pulls** the config by `deploymentId` and applies it, then sends a completion/status
response.
This decouples the Akka message size from the config size entirely (no instance is ever
un-deployable due to size), and it fits the **existing pull-based gRPC telemetry pattern** the site
already uses (`PullSiteCalls`, `PullAuditEvents` over `SiteStreamService`). A new
`PullDeploymentConfig`/`SiteStreamService` RPC (or reuse of the management API) is the natural home.
Keep the completion ack so the central still gets authoritative success/failure.
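
The three steps above can be sketched end to end with an in-memory model. Everything here is an assumption made for illustration: the `Central` class, the `site_apply` function, the notification field names, the `Success`/`Failed` status strings, and the idea that the fetch token travels inside the notification. A direct method call stands in for the network fetch.

```python
import hashlib
import json
import secrets


class Central:
    """Hypothetical central node: stages configs and hands out small notifications."""

    def __init__(self) -> None:
        # deployment_id -> (token, config_json)
        self._pending: dict[str, tuple[str, str]] = {}

    def stage(self, deployment_id: str, instance: str, config: dict) -> dict:
        config_json = json.dumps(config, sort_keys=True)
        token = secrets.token_hex(16)  # per-deployment fetch token
        self._pending[deployment_id] = (token, config_json)
        # The notification carries identifiers only, never the config itself.
        return {
            "deploymentId": deployment_id,
            "instance": instance,
            "revisionHash": hashlib.sha256(config_json.encode("utf-8")).hexdigest(),
            "token": token,
        }

    def fetch(self, deployment_id: str, token: str) -> str:
        expected_token, config_json = self._pending[deployment_id]
        if not secrets.compare_digest(token, expected_token):
            raise PermissionError("invalid deployment token")
        return config_json


def site_apply(central: Central, notification: dict) -> dict:
    """Site side: pull the config, verify it against the hash, then acknowledge."""
    config_json = central.fetch(notification["deploymentId"], notification["token"])
    actual = hashlib.sha256(config_json.encode("utf-8")).hexdigest()
    if actual != notification["revisionHash"]:
        return {"deploymentId": notification["deploymentId"], "status": "Failed"}
    # A real site would apply json.loads(config_json) here.
    return {"deploymentId": notification["deploymentId"], "status": "Success"}
```

The notification stays a few hundred bytes regardless of how large the staged config is, so the message that crosses the Akka hop no longer depends on config size.
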
### Stopgap (until the above ships)
Add an explicit frame size + matching buffers to the `dot-netty.tcp` block in
`AkkaHostedService.BuildHocon`, e.g.:
```hocon
remote {
  dot-netty.tcp {
    hostname = …
    port = …
    maximum-frame-size = 4000000b        # 4 MB
    send-buffer-size = 8000000b
    receive-buffer-size = 8000000b
    log-frame-size-exceeding = 1000000b  # warn before silently dropping
  }
}
```
This is a **code** change (no appsettings override exists) and must be deployed to **both** central
and site (they share the Host); rebuild + redeploy per `deploy/wonder-app-vd03/RUNBOOK.md`
(Upgrading). Consider exposing it via `appsettings`/`CommunicationOptions` so it's tunable without a
rebuild. Also fix the `AskTimeoutException` classification.
### Config-only workaround (specific to this case)
For the generic-`DelmiaReceiver` rollout specifically: removing the `Left`/`Right` `DelmiaReceiver`
compositions and keeping **only** the single generic `DelmiaReceiver` makes the config *smaller* than
the working 2-receiver baseline, so it deploys under the current limit — and it's the intended
"no sides" end state anyway (each machine bound to its own generic Galaxy receiver,
`DelmiaReceiver_037/038/039`). This unblocks without an engine change, but does not fix the
underlying fragility for any other instance that legitimately needs a large config.
## Current operational state (to clean up)
> **Engine fix resolves the blocker.** With notify-and-fetch merged, a large flattened config is no longer size-capped, so the generic-`DelmiaReceiver` rollout (and the `Download` handshake re-add) can proceed on `Z28061Sim` without hitting this bug. The items below are the operational re-rollout steps, now unblocked — they are NOT engine work.
- `Z28061Sim` is running on deployment **id 55** (side-based, with the earlier `side`-crash fix) and
is healthy, but is **not currently redeployable**: `CvdReactor` has the extra generic
`DelmiaRecv` → `DelmiaReceiver_037` composition + binding, so every redeploy hits this bug.
- To restore redeployability, either apply the config-only workaround above (consolidate to one
generic receiver) or roll back the generic composition + restore the side-based
`ProcessRecipeDownload`.
- The `Download` handshake method (mirroring `MESReceiver.MoveIn`) is currently removed from the
`DelmiaReceiver` base template (template 12); re-add it once a deployable path is chosen.
## Files
- `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs` (BuildHocon — missing `maximum-frame-size`)
- `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs` (inline config serialize; AskTimeout mis-classification)
- `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs` (`FlattenedConfigurationJson`)
- `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs` (ClusterClient send)
- `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationOptions.cs` (`DeploymentTimeout = 2 min`)
- `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs` (per-composition member expansion)
</content>