docs(known-issues): mark deploy-config frame-size bug RESOLVED via notify-and-fetch
The 128 KB Akka frame-size deploy trap is fixed and merged. Record the resolution at the top of the writeup (notify-and-fetch on both the central->site deploy hop and the site active->standby replication hop; AskTimeout classification fix; startup reconciliation + topology perf fix), link the design + plan docs, and note the live-smoke validation. The diagnosis is retained as historical record.
@@ -0,0 +1,170 @@
# Deploy hangs (120 s) when the flattened instance config exceeds Akka's default frame size

**Date:** 2026-06-26 · **Status:** RESOLVED (2026-06-26) · **Severity:** High (silently un-deployable instances) · **Area:** Deployment Manager / Cluster Communication

## Resolution (2026-06-26)

Fixed via the **notify-and-fetch** rework (the primary recommendation below), not the frame-size stopgap. The full flattened config no longer travels inside any Akka message: central stages it in a `PendingDeployment` row and sends a small `RefreshDeploymentCommand`; the site fetches the config over HTTP using a per-deployment token. Both frame-trapped Akka hops are fixed: the central→site deploy hop **and** the site active→standby `SiteReplicationActor` hop (which carried the same config across an intra-site Akka hop with the same silent-drop trap). The secondary `AskTimeoutException` mis-classification (below) is also fixed, and a startup `SiteReconciliationActor` self-heal plus a topology-page performance fix shipped alongside.

- **Design:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch-design.md`](../plans/2026-06-26-deploy-config-notify-and-fetch-design.md)
- **Plan:** [`docs/plans/2026-06-26-deploy-config-notify-and-fetch.md`](../plans/2026-06-26-deploy-config-notify-and-fetch.md)
- **Validated:** live docker-cluster smoke test: a previously hanging deploy now completes in ~0.11 s; reconciliation heals single-node and concurrent-both-missing gaps.

The diagnosis below is retained as the historical record of how the bug was found and reasoned about.

## Summary

Deploying an instance whose **flattened configuration JSON is large enough** that the
`DeployInstanceCommand` Akka message exceeds Akka.Remote's default `maximum-frame-size`
(**128 KB / `128000b`**) causes the message to be **silently dropped in transit** between the
central and site nodes. The central's deploy `Ask` then times out after exactly **120 s**
(`Communication:DeploymentTimeout`), the deployment is recorded as failed, and **the instance
cannot be (re)deployed at all** until the config shrinks back under the limit.

The bug was discovered while adding a **third composition of the same base template** (a generic
`DelmiaReceiver` added to `CvdReactor`, which already composes `LeftDelmiaReceiver` +
`RightDelmiaReceiver`). Each composition fully expands the composed template's members into the
flattened config, so the 3rd copy tipped the message over 128 KB.

## How it manifests

- CLI/UI deploy returns `{"error":"Deployment error: Timeout after 120.00 seconds","code":"COMMAND_FAILED"}` (or a 504 via the management API) after ~120 s.
- `deploy status` shows the record going `InProgress (1) → Failed (3)` with `errorMessage = "Deployment error: Timeout after 120.00 seconds"`.
- The **running instance is unaffected**: deploys are atomic, so it keeps serving its last successful deployment.
- Smaller configs deploy normally and fast (the working baseline here, the 2-receiver side-based config, completed in **0.04 s**).

## Root cause

The deploy ships the **entire flattened config inline** as a JSON string inside the Akka message,
and Akka.Remote's frame size is left at the un-configured default:

1. **Full config carried in the message:** `DeploymentService.cs` serializes the flattened config
   (`JsonSerializer.Serialize(flattenedConfig)`) and puts it in
   `DeployInstanceCommand.FlattenedConfigurationJson`
   (`src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs`).
2. **Sent central → site over Akka:** via ClusterClient
   (`CentralCommunicationActor` → `ClusterClient.Send("/user/site-communication", …)`).
3. **No frame-size override:** `AkkaHostedService.BuildHocon`
   (`src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs`, ~lines 240-244) sets only
   `dot-netty.tcp { hostname; port }`. There is **no `maximum-frame-size`** and **no appsettings
   knob** for it, so the Akka.Remote 1.5.62 default of `128000b` applies (and
   `log-frame-size-exceeding` defaults to `off`, so there isn't even a warning).
4. **Amplifier:** no custom Akka serializer / `serialization-bindings` is configured, so the
   already-serialized `configJson` is JSON-escaped **again** inside the envelope by the default
   serializer (every `"` → `\"`), roughly doubling that portion of the on-wire payload. So the raw
   flattened JSON only needs to be ~60-70 KB to blow the 128 KB frame.

When the encoded frame exceeds the limit, the transport drops **that one message** without tearing
down the association (heartbeats/telemetry keep flowing, so the site still reports healthy), the
site never receives the deploy, and the central's `Ask` (timeout = `DeploymentTimeout` = 2 min)
expires.

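The amplifier in step 4 can be reproduced outside Akka. The sketch below is written in Python rather than the project's C#, purely as an illustration: `FRAME_LIMIT_BYTES` mirrors the `128000b` default quoted above, while the envelope shape, the sample config, and the helper names are assumptions made for the example. It serializes a config once, embeds that string in an outer JSON envelope, and compares the two byte counts. The overhead it reports depends on how quote-dense the sample is, so treat the numbers as indicative only.

```python
import json

FRAME_LIMIT_BYTES = 128_000  # Akka.Remote default maximum-frame-size quoted above


def envelope_size(config: dict) -> tuple[int, int]:
    """Return (raw_bytes, enveloped_bytes) for a config embedded as a JSON string."""
    raw = json.dumps(config, separators=(",", ":"))
    # Hypothetical envelope: the already-serialized JSON travels as a string
    # field, so the outer serializer escapes every double quote inside it.
    envelope = json.dumps({"FlattenedConfigurationJson": raw}, separators=(",", ":"))
    return len(raw.encode("utf-8")), len(envelope.encode("utf-8"))


def exceeds_frame(enveloped_bytes: int, limit: int = FRAME_LIMIT_BYTES) -> bool:
    """True when the encoded message would not fit in one frame."""
    return enveloped_bytes > limit


# Made-up sample: many short, path-qualified attributes (quote-dense JSON).
sample = {f"Recv{i}.Attr{j}": {"t": "s", "v": ""} for i in range(3) for j in range(400)}
raw_bytes, enveloped_bytes = envelope_size(sample)
print(raw_bytes, enveloped_bytes, exceeds_frame(enveloped_bytes))
```

A check of this kind, run before sending, would have turned the silent drop into an immediate and explicit failure.
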
### Why the 3rd same-base composition is the trigger

The flattener (`FlatteningService`) fully expands **each** composition's attributes/alarms/scripts,
path-qualified, so N compositions of the same base ≈ N× those members in the config. The config
grows roughly linearly with composition count; here **2 `DelmiaReceiver` copies sat under the limit,
the 3rd pushed it over**. Removing the composed *scripts* did not help, because the *attribute* set
alone (×3) already exceeded the frame.

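The linear-growth argument can be made concrete with a small model. This is a hedged sketch in Python: the byte counts and the inflation factor are invented to reproduce the shape of the problem (two copies fit, the third does not) and are not measurements from `Z28061Sim`; the function names are hypothetical.

```python
def flattened_size(base_bytes: int, per_composition_bytes: int, compositions: int) -> int:
    """Linear model: fixed instance overhead plus one full member set per composition."""
    return base_bytes + per_composition_bytes * compositions


def first_failing_count(
    base_bytes: int,
    per_composition_bytes: int,
    inflation: float,
    limit: int = 128_000,
) -> int:
    """Smallest composition count whose on-wire size exceeds the frame limit."""
    if per_composition_bytes <= 0 or inflation <= 0:
        raise ValueError("per_composition_bytes and inflation must be positive")
    count = 0
    while flattened_size(base_bytes, per_composition_bytes, count) * inflation <= limit:
        count += 1
    return count


# Illustrative numbers only: 10 KB of base config, 22 KB per composition,
# and a 1.9x envelope inflation from the double JSON encoding.
print(first_failing_count(10_000, 22_000, 1.9))
```

With these assumed inputs the model crosses the limit at the third composition, which matches the observed behaviour without claiming to explain the exact sizes involved.
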
## Evidence (live logs, wonder-app-vd03, 2026-06-26)

Central log:
```
07:44:28.033 [INF] Sending deployment 82a39ef5… for instance Z28061Sim to site site-a
… (no further deploy activity; only DB/health/telemetry) …
07:46:28.038 [ERR] Deployment 82a39ef5… for instance Z28061Sim failed
07:46:28.039 [ERR] Management command MgmtDeployInstanceCommand failed
07:46:28.039 [WRN] Dead letter: ManagementError from akka://…/temp/3d to deadLetters
```
Site log (same window): **nothing**, neither a deploy receipt nor an error. The last deploy the site ever
received was the smaller side-based config:
```
05:51:46.429 [INF] Deploying instance Z28061Sim, deploymentId=e531d216…   ← worked
```
No `OversizedPayloadException`, serialization error, or deserialization error appears on **either**
node, which is consistent with a silent transport-level frame drop. The differentiator is purely config size
(2-receiver config delivered & applied; 3-receiver config never arrived).

## Secondary bug (found during the same investigation)

`AskTimeoutException` does **not** derive from `System.TimeoutException`, so the catch in
`DeploymentService.cs` (~line 312) mis-classifies a genuine site Ask-timeout as a generic
`"Deployment error: …"` instead of the `"Communication failure: …"` / timeout branch. Consequence:
the DeploymentManager-006 "query-site-before-redeploy" reconciliation (keyed off the
`Communication failure:` prefix) does **not** trigger after these failures. Fix the classification so
`AskTimeoutException`/`OperationCanceledException` are treated as timeouts.

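The mis-classification comes down to a type check that is too narrow. The sketch below illustrates it in Python with stand-in exception classes; it is not the code in `DeploymentService.cs`. `AskTimeoutError` and `OperationCanceledError` are hypothetical analogues of the .NET exceptions named above, deliberately declared outside the `TimeoutError` hierarchy to mirror the real relationship.

```python
class AskTimeoutError(Exception):
    """Stand-in for Akka's AskTimeoutException: not a subclass of TimeoutError."""


class OperationCanceledError(Exception):
    """Stand-in for .NET's OperationCanceledException."""


# Every exception type that should be reported as a communication timeout.
TIMEOUT_TYPES = (TimeoutError, AskTimeoutError, OperationCanceledError)


def classify_buggy(exc: Exception) -> str:
    # Only the base timeout type is recognised, so an Ask timeout falls through.
    if isinstance(exc, TimeoutError):
        return f"Communication failure: {exc}"
    return f"Deployment error: {exc}"


def classify_fixed(exc: Exception) -> str:
    # All timeout-like types share the prefix that reconciliation keys off.
    if isinstance(exc, TIMEOUT_TYPES):
        return f"Communication failure: {exc}"
    return f"Deployment error: {exc}"


ask_timeout = AskTimeoutError("Timeout after 120.00 seconds")
print(classify_buggy(ask_timeout))
print(classify_fixed(ask_timeout))
```

Listing the timeout-like types explicitly keeps the `Communication failure:` prefix, and therefore the reconciliation trigger, independent of how a given library arranges its exception hierarchy.
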
## Recommended fix

> **Shipped:** the **Primary** option below (notify-and-fetch) was implemented and merged on 2026-06-26; see the Resolution section at the top. The stopgap frame-size bump and the config-only workaround were NOT taken.

### Primary (recommended): notify-and-fetch instead of push

Stop shipping the full flattened config inside the Akka message. Instead:

1. Central persists the pending flattened config (it already persists `DeployedConfigSnapshots` on
   success; extend that to a pending/by-`deploymentId` store).
2. Central sends the site a **small** "apply deployment `<deploymentId>` for instance `<id>`
   (revisionHash `<hash>`)" notification.
3. The **site pulls** the config by `deploymentId` and applies it, then sends a completion/status
   response.

This decouples the Akka message size from the config size entirely (no instance is ever
un-deployable due to size), and it fits the **existing pull-based gRPC telemetry pattern** the site
already uses (`PullSiteCalls`, `PullAuditEvents` over `SiteStreamService`). A new
`PullDeploymentConfig`/`SiteStreamService` RPC (or reuse of the management API) is the natural home.
Keep the completion ack so the central still gets authoritative success/failure.

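The three steps above can be sketched as a small in-memory simulation. This is an illustrative Python model, not the shipped C# implementation: the class and method names (`Central`, `Site`, `RefreshNotification`, `stage`, `fetch`, `complete`) are hypothetical, the per-deployment token is an assumption modelled on the Resolution notes, and `fetch` is a direct method call standing in for whatever transport carries the pull.

```python
import hashlib
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class RefreshNotification:
    """Small message sent over the size-limited channel; carries no config."""
    deployment_id: str
    instance: str
    revision_hash: str
    fetch_token: str


class Central:
    def __init__(self) -> None:
        self._pending: dict[str, tuple[str, str]] = {}  # id -> (token, config JSON)
        self.results: dict[str, bool] = {}

    def stage(self, deployment_id: str, instance: str, config_json: str) -> RefreshNotification:
        """Step 1 and 2: persist the pending config, return the small notification."""
        token = secrets.token_urlsafe(16)  # per-deployment fetch token
        self._pending[deployment_id] = (token, config_json)
        digest = hashlib.sha256(config_json.encode("utf-8")).hexdigest()
        return RefreshNotification(deployment_id, instance, digest, token)

    def fetch(self, deployment_id: str, token: str) -> str:
        """Pull endpoint: hand out the staged config only for the right token."""
        expected, config_json = self._pending[deployment_id]
        if not secrets.compare_digest(expected.encode(), token.encode()):
            raise PermissionError("bad fetch token")
        return config_json

    def complete(self, deployment_id: str, ok: bool) -> None:
        """Completion ack: record the outcome and drop the staged copy."""
        self.results[deployment_id] = ok
        self._pending.pop(deployment_id, None)


class Site:
    def __init__(self, central: Central) -> None:
        self.central = central
        self.applied: dict[str, str] = {}

    def on_notification(self, note: RefreshNotification) -> None:
        """Step 3: pull by deployment id, verify the hash, apply, then ack."""
        config_json = self.central.fetch(note.deployment_id, note.fetch_token)
        ok = hashlib.sha256(config_json.encode("utf-8")).hexdigest() == note.revision_hash
        if ok:
            self.applied[note.instance] = config_json
        self.central.complete(note.deployment_id, ok)


central = Central()
site = Site(central)
big_config = "x" * 500_000  # far larger than a 128000-byte frame
note = central.stage("dep-1", "Z28061Sim", big_config)
site.on_notification(note)
print(central.results["dep-1"], len(site.applied["Z28061Sim"]))
```

The notification stays a fixed, small size regardless of how large the staged config grows, which is the property that removes the frame-size trap.
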
### Stopgap (until the above ships)

Add an explicit frame size + matching buffers to the `dot-netty.tcp` block in
`AkkaHostedService.BuildHocon`, e.g.:
```hocon
remote {
  dot-netty.tcp {
    hostname = …
    port = …
    maximum-frame-size = 4000000b      # 4 MB
    send-buffer-size = 8000000b
    receive-buffer-size = 8000000b
    log-frame-size-exceeding = 1000000b  # warn before silently dropping
  }
}
```
This is a **code** change (no appsettings override exists) and must be deployed to **both** central
and site (they share the Host); rebuild + redeploy per `deploy/wonder-app-vd03/RUNBOOK.md`
(Upgrading). Consider exposing it via `appsettings`/`CommunicationOptions` so it's tunable without a
rebuild. Also fix the `AskTimeoutException` classification.

### Config-only workaround (specific to this case)

For the generic-`DelmiaReceiver` rollout specifically: removing the `Left`/`Right` `DelmiaReceiver`
compositions and keeping **only** the single generic `DelmiaReceiver` makes the config *smaller* than
the working 2-receiver baseline, so it deploys under the current limit, and it is the intended
"no sides" end state anyway (each machine bound to its own generic Galaxy receiver,
`DelmiaReceiver_037/038/039`). This unblocks without an engine change, but does not fix the
underlying fragility for any other instance that legitimately needs a large config.

## Current operational state (to clean up)

> **Engine fix resolves the blocker.** With notify-and-fetch merged, a large flattened config is no longer size-capped, so the generic-`DelmiaReceiver` rollout (and the `Download` handshake re-add) can proceed on `Z28061Sim` without hitting this bug. The items below are the operational re-rollout steps, now unblocked; they are NOT engine work.

- `Z28061Sim` is running on deployment **id 55** (side-based, with the earlier `side`-crash fix) and
  is healthy, but is **not currently redeployable**: `CvdReactor` has the extra generic
  `DelmiaRecv`→`DelmiaReceiver_037` composition + binding, so every redeploy hits this bug.
- To restore redeployability, either apply the config-only workaround above (consolidate to one
  generic receiver) or roll back the generic composition + restore the side-based
  `ProcessRecipeDownload`.
- The `Download` handshake method (mirroring `MESReceiver.MoveIn`) is currently removed from the
  `DelmiaReceiver` base template (template 12); re-add it once a deployable path is chosen.

## Files

- `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs` (`BuildHocon`: missing `maximum-frame-size`)
- `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs` (inline config serialize; AskTimeout mis-classification)
- `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/DeployInstanceCommand.cs` (`FlattenedConfigurationJson`)
- `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs` (ClusterClient send)
- `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationOptions.cs` (`DeploymentTimeout = 2 min`)
- `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs` (per-composition member expansion)