# Deploy + Replication Config Transport — Notify-and-Fetch Design
**Date:** 2026-06-26 · **Status:** APPROVED (design) · **Area:** Deployment Manager / Site Runtime / Cluster Communication
**Fixes:** [`docs/known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md`](../known-issues/2026-06-26-deploy-config-exceeds-akka-frame-size.md)
## 1. Problem
The flattened instance config travels over Akka on **two** hops today, and both are bounded by Akka.Remote's default `maximum-frame-size` (**128 KB / `128000b`**). A large config (e.g. a 3rd composition of the same base template) silently breaks both:
1. **Central → site (deploy).** `DeploymentService` serializes the whole flattened config into `DeployInstanceCommand.FlattenedConfigurationJson` and Asks the site over ClusterClient. An oversized frame is **silently dropped**; the central `Ask` times out after 120 s (`Communication:DeploymentTimeout`); the instance becomes un-deployable until the config shrinks.
2. **Site active → standby (replication).** After a successful apply, `DeploymentManagerActor` does `_replicationActor.Tell(new ReplicateConfigDeploy(instanceName, command.FlattenedConfigurationJson, …))`; `SiteReplicationActor` relays it node-to-node via `ActorSelection(RootActorPath(peer)/"user"/"site-replication").Tell(…)`. This is **fire-and-forget** — an oversized `ApplyConfigDeploy` is silently dropped with *no timeout and no error*. The deploy "succeeds," the active node runs fine, and the standby silently lacks the instance — surfacing only at **failover**, when the now-active node recovers from its SQLite and the instance simply isn't there.
The JSON is double-escaped inside the Akka envelope (default serializer), so ~60–70 KB of raw config is enough to blow the 128 KB frame. `log-frame-size-exceeding` defaults off, so there isn't even a warning.
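To make the size inflation concrete, here is a minimal sketch of how embedding JSON text as a string twice grows it. It assumes plain backslash-style escaping of `"` and `\` only; the real serializer may escape more characters or add envelope overhead, and `FrameSizeEstimate` is a hypothetical helper, not part of the codebase.

```csharp
using System;
using System.Text;

public static class FrameSizeEstimate
{
    // Escapes a string the way a JSON string literal would for the two most
    // common cases: backslash and double quote each gain a leading backslash.
    // Simplification (assumption): control characters are not handled.
    public static string EscapeOnce(string s)
    {
        var sb = new StringBuilder(s.Length + 16);
        foreach (char c in s)
        {
            if (c == '\\' || c == '"') sb.Append('\\');
            sb.Append(c);
        }
        return sb.ToString();
    }

    // UTF-8 byte count after the JSON text has been embedded as a string twice.
    public static int DoubleEscapedBytes(string json) =>
        Encoding.UTF8.GetByteCount(EscapeOnce(EscapeOnce(json)));
}
```

Under this assumption every `"` in the raw config costs four bytes after two rounds of escaping, so a config in which a fraction `q` of the characters are quotes grows by a factor of about `1 + 3q`.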
## 2. Goal & guiding principle
Eliminate the frame-size failure on **both** hops, permanently and at any config size, **without** raising `maximum-frame-size`.
**Principle:** *small control messages stay on Akka; the bulk config never touches Akka.* The config always moves over **HTTP from central** (the single source of truth) to whichever node needs it — the active node (which applies + runs it) and the standby node (which persists it for future recovery).
This mirrors how the site already works at startup: a site recovers its instances **purely from local SQLite** (`deployed_configurations`), never contacting central. A deploy thus becomes an *on-demand, single-instance version of startup recovery*: "your definition changed — fetch it and refresh." Only the config **source** changes; the apply path is unchanged.
### Why notify-and-fetch over alternatives
- **Push over gRPC (central→site):** would need host-aware routing (the `DeploymentManager` singleton runs only on the active node; the gRPC server runs on both), and still leaves the *intra-site* replication hop broken.
- **Raise Akka `maximum-frame-size`:** only raises the ceiling, doesn't decouple size from transport, and changes remoting config globally.
- **Sites read central MS SQL directly:** rejected — it tears down the enforced hub-spoke boundary (sites = local SQLite, central = MS SQL, enforced in DI *and* `StartupValidator`), gives every site node a login that can read central's *entire* config DB (huge blast radius), and couples sites to central's EF schema.
A small `RefreshDeploymentCommand` routes to the active-node singleton **for free** via the existing `ClusterSingletonProxy` (no host detection, no frame risk), and the same small-message pattern fixes replication. The only genuinely new piece is the **fetch channel**, implemented as an authenticated `GET` against central's **existing** HTTP management API.
## 3. Architecture overview
```
central MS SQL
┌──────────────────┐
│ PendingDeployment │ (configJson, token, TTL, by deploymentId; ≤1 per instance)
└──────────────────┘
▲ │ read (token-gated)
persist before notify │
│ ▼
DeploymentService ──Akka small notify──▶ site (active node, via singleton proxy)
RefreshDeploymentCommand(id, name, hash, │
deployedBy, ts, centralFetchBaseUrl, token) │ 1. HTTP GET config (token) ◀── central mgmt API
▲ │ 2. apply (existing path): Instance Actor + SQLite
│ DeploymentStatusResponse│ 3. reply success/failure
└─────────────────────────┘
│ 4. tell standby (small, id-only)
SiteReplicationActor (standby)
HTTP GET config (token) ──▶ central mgmt API
write deployed_configurations SQLite (no Instance Actor)
```
Every Akka hop now carries only small control messages. The bulk config moves only over HTTP, point-to-point, to the node that needs it.
## 4. Component changes
### 4.1 Central — pending-config store
New table **`PendingDeployment`** in central MS SQL (EF entity + migration):
| column | type | notes |
|---|---|---|
| `DeploymentId` | `varchar` PK | matches the deploy's GUID (`N` format) |
| `InstanceId` | `int`, indexed | for supersession lookup |
| `RevisionHash` | `varchar` | staleness/version marker |
| `ConfigJson` | `nvarchar(max)` | the flattened config |
| `Token` | `varchar` | random per-deployment fetch secret |
| `CreatedAtUtc` | `datetimeoffset` | |
| `ExpiresAtUtc` | `datetimeoffset`, indexed | TTL (default 5 min, configurable) |
Kept **separate** from `DeployedConfigSnapshot` so the existing "deployed = success" semantics stay clean. On success the pending row's config is promoted into `DeployedConfigSnapshot` (existing logic); on failure/timeout/TTL it's purged. A bounded periodic purge removes expired rows.
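As an illustration of the row described above, the following sketch builds a pending row with an `N`-format GUID and a random token. The record shape, the `PendingDeploymentFactory` name, and the 32-byte token length are assumptions; the design only fixes the column list and the 5-minute default TTL.

```csharp
using System;
using System.Security.Cryptography;

// Mirrors the columns in the table above; the record shape is an assumption.
public sealed record PendingDeployment(
    string DeploymentId,
    int InstanceId,
    string RevisionHash,
    string ConfigJson,
    string Token,
    DateTimeOffset CreatedAtUtc,
    DateTimeOffset ExpiresAtUtc);

public static class PendingDeploymentFactory
{
    public static PendingDeployment Create(
        int instanceId, string revisionHash, string configJson,
        DateTimeOffset nowUtc, TimeSpan ttl, int tokenBytes = 32)
    {
        // Cryptographically random token, hex-encoded (2 characters per byte).
        string token = Convert.ToHexString(RandomNumberGenerator.GetBytes(tokenBytes));

        return new PendingDeployment(
            DeploymentId: Guid.NewGuid().ToString("N"), // 32 hex digits, no dashes
            InstanceId: instanceId,
            RevisionHash: revisionHash,
            ConfigJson: configJson,
            Token: token,
            CreatedAtUtc: nowUtc,
            ExpiresAtUtc: nowUtc + ttl);
    }
}
```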
**Supersession (requested):** when persisting a new pending row for an instance, **first delete any existing pending rows for that same `InstanceId`** (delete-then-insert, one transaction) → at most one pending row per instance, always the latest. Race-free because the existing **per-instance operation lock** already serializes same-instance deploys.
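The delete-then-insert rule can be sketched with an in-memory stand-in for the table. This is only a model of the semantics: `InMemoryPendingStore` and `PendingRow` are hypothetical names, and a `lock` plays the role that the SQL transaction and the per-instance operation lock play in the real design.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record PendingRow(string DeploymentId, int InstanceId, string ConfigJson);

public sealed class InMemoryPendingStore
{
    private readonly object _gate = new();
    private readonly Dictionary<string, PendingRow> _byDeploymentId = new();

    // Delete-then-insert as one atomic step: every existing row for the same
    // instance is removed before the new row is added.
    public void Supersede(PendingRow row)
    {
        lock (_gate)
        {
            var stale = _byDeploymentId.Values
                .Where(r => r.InstanceId == row.InstanceId)
                .Select(r => r.DeploymentId)
                .ToList();
            foreach (var id in stale) _byDeploymentId.Remove(id);
            _byDeploymentId[row.DeploymentId] = row;
        }
    }

    // Returns null for unknown or superseded deployment ids.
    public PendingRow? Find(string deploymentId)
    {
        lock (_gate)
        {
            return _byDeploymentId.TryGetValue(deploymentId, out var row) ? row : null;
        }
    }

    public int CountFor(int instanceId)
    {
        lock (_gate)
        {
            return _byDeploymentId.Values.Count(r => r.InstanceId == instanceId);
        }
    }
}
```

After two deploys of the same instance, only the second `DeploymentId` resolves; a lookup of the first returns nothing, which is what lets a stale fetch fail closed.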
### 4.2 Central — serve endpoint
`GET /api/internal/deployments/{deploymentId}/config` on the existing management API:
- `[AllowAnonymous]` + **explicit token validation** (constant-time compare against the pending row's `Token`), TTL-checked. Machine-to-machine; never user-session-gated.
- Returns the row's `ConfigJson` (200) — or **404** when the `deploymentId` is unknown, expired, or **superseded** (no longer the current pending row for its instance). The 404-on-superseded is what makes stale fetches fail closed (see §5).
- Both central nodes read the same MS SQL, so the request works through Traefik's active-node LB.
- HTTPS strongly recommended in transit (config carries already-encrypted `SecretsBlock`s; the token is single-deployment + short-TTL).
- Companion endpoint for the follow-up (§6): `GET /api/internal/sites/{siteId}/deployments` → `[{instanceUniqueName, revisionHash}]`.
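The endpoint's accept/reject decision might look roughly like the sketch below. `ConfigServeGate` and `PendingView` are hypothetical names, and returning 401 for a wrong token is an assumption (the bullets above only specify 200 and 404; §10 lists 401 among the fetcher's test cases). The constant-time comparison uses `CryptographicOperations.FixedTimeEquals` from the .NET base library.

```csharp
#nullable enable
using System;
using System.Security.Cryptography;
using System.Text;

public enum ServeResult { Ok = 200, Unauthorized = 401, NotFound = 404 }

// The two fields of the current pending row that the gate needs.
public sealed record PendingView(string Token, DateTimeOffset ExpiresAtUtc);

public static class ConfigServeGate
{
    // Compares the token bytes without an early exit on the first mismatch.
    // FixedTimeEquals does return false right away when the lengths differ.
    public static bool TokensMatch(string presented, string expected) =>
        CryptographicOperations.FixedTimeEquals(
            Encoding.UTF8.GetBytes(presented),
            Encoding.UTF8.GetBytes(expected));

    // row == null models "unknown or superseded deploymentId".
    public static ServeResult Decide(PendingView? row, string presentedToken, DateTimeOffset nowUtc)
    {
        if (row is null) return ServeResult.NotFound;
        if (!TokensMatch(presentedToken, row.Token)) return ServeResult.Unauthorized; // assumed status
        if (nowUtc >= row.ExpiresAtUtc) return ServeResult.NotFound;                  // TTL elapsed
        return ServeResult.Ok;
    }
}
```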
### 4.3 Central — send small notify
New message:
```csharp
public record RefreshDeploymentCommand(
string DeploymentId,
string InstanceUniqueName,
string RevisionHash,
string DeployedBy,
DateTimeOffset Timestamp,
string CentralFetchBaseUrl, // central's LB/Traefik URL — carried so the site needs no new standing config
string FetchToken);
```
`DeploymentService` flow: flatten + hash (unchanged) → persist `PendingDeployment` (config + token + TTL, superseding prior) → send `RefreshDeploymentCommand` over the existing ClusterClient path (replacing the fat `DeployInstanceCommand`) → await `DeploymentStatusResponse` (reply path unchanged) → on success promote pending→snapshot. The `Ask`/`DeploymentTimeout` stay, but now carry a tiny request/reply — no frame risk. `CentralFetchBaseUrl` is a new central option (e.g. `Communication:CentralFetchBaseUrl`).
### 4.4 Site — deploy (active node)
- `SiteCommunicationActor` routes `RefreshDeploymentCommand` to the `deployment-manager-proxy` (replacing the `DeployInstanceCommand` forward). The proxy delivers to the singleton on the **active node automatically** — small message, no host detection.
- `DeploymentManagerActor` handles `RefreshDeploymentCommand`: fetch the config via an injected `IDeploymentConfigFetcher` (HttpClient wrapper, registered for the site role), then run the **existing `ApplyDeployment`** verbatim (create/replace Instance Actor, write `deployed_configurations`, reply `DeploymentStatusResponse`). Only the config source changes.
- Fetch failures reply `Failed` with a `Communication failure:`-prefixed message (transient/comm) so central's DeploymentManager-006 query-site reconciliation fires; apply failures stay generic deployment errors.
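A minimal sketch of what the fetcher could look like follows. The class name, the `FetchResult` shape, and especially the `X-Deployment-Fetch-Token` header are assumptions: the design does not say how the token travels on the request. Only the URL path and the `Communication failure:` prefix come from the text above.

```csharp
#nullable enable
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public sealed record FetchResult(bool Success, string? ConfigJson, string? Error);

public sealed class HttpDeploymentConfigFetcher
{
    // Hypothetical header name; the token transport is not specified in the design.
    public const string TokenHeader = "X-Deployment-Fetch-Token";

    private readonly HttpClient _http;

    public HttpDeploymentConfigFetcher(HttpClient http) => _http = http;

    public static Uri BuildUri(string centralFetchBaseUrl, string deploymentId) =>
        new Uri($"{centralFetchBaseUrl.TrimEnd('/')}/api/internal/deployments/" +
                $"{Uri.EscapeDataString(deploymentId)}/config");

    public static string DescribeFailure(HttpStatusCode status) =>
        $"Communication failure: config fetch returned HTTP {(int)status}";

    public async Task<FetchResult> FetchAsync(
        string centralFetchBaseUrl, string deploymentId, string token,
        CancellationToken ct = default)
    {
        using var request = new HttpRequestMessage(
            HttpMethod.Get, BuildUri(centralFetchBaseUrl, deploymentId));
        request.Headers.Add(TokenHeader, token);
        try
        {
            using var response = await _http.SendAsync(request, ct);
            if (!response.IsSuccessStatusCode)
                return new FetchResult(false, null, DescribeFailure(response.StatusCode));
            return new FetchResult(true, await response.Content.ReadAsStringAsync(ct), null);
        }
        catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
        {
            // Network errors and HttpClient timeouts are both reported as comm failures.
            return new FetchResult(false, null, $"Communication failure: {ex.Message}");
        }
    }
}
```

Returning a result object instead of throwing keeps the actor's handler simple: it can map `Success == false` straight to a `Failed` reply and pass the already-prefixed message through.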
### 4.5 Site — replication via fetch (active → standby), unified
- `ReplicateConfigDeploy` / `ApplyConfigDeploy` **drop `ConfigJson`** and gain `(DeploymentId, RevisionHash, IsEnabled, CentralFetchBaseUrl, FetchToken)`. After applying, the active node sends the id-only message to the standby.
- The standby's `SiteReplicationActor` **fetches** the config via the same `IDeploymentConfigFetcher`, then writes `deployed_configurations` via `SiteStorageService.StoreDeployedConfigAsync`, with **no Instance Actor** (it's standby). Best-effort, but the fetch gets a small bounded retry (cheap/async).
- **Defensive write guard:** the standby only overwrites a `deployed_configurations` row when the incoming `RevisionHash`/`DeployedAt` is newer, so a delayed older replicate can't clobber a newer config.
- A stale replicate (superseded deploymentId) fetches → 404 → standby skips it; the newer deploy's replicate-notify delivers the correct config (§5).
- These records are intra-site (same Host binary on both nodes) → no cross-version concern.
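The defensive write guard above can be sketched as a pure function. Since a hash carries no ordering of its own, this sketch orders deploys by `DeployedAt` alone, which is an assumption about how "newer" is decided; `StoredConfig` and `StandbyWriteGuard` are hypothetical names.

```csharp
#nullable enable
using System;

public sealed record StoredConfig(string RevisionHash, DateTimeOffset DeployedAt);

public static class StandbyWriteGuard
{
    // True when the incoming config may replace what the standby already holds.
    public static bool ShouldOverwrite(StoredConfig? existing, StoredConfig incoming)
    {
        if (existing is null) return true;                // nothing stored yet
        return incoming.DeployedAt > existing.DeployedAt; // only strictly newer deploys win
    }
}
```

With this rule a delayed replicate for an older deploy is ignored even if it somehow fetches successfully, so the guard backs up the 404-on-superseded behaviour rather than replacing it.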
### 4.6 Secondary bug — timeout classification
`DeploymentService.cs` (~line 312): add `AskTimeoutException` to the timeout branch (it does **not** derive from `System.TimeoutException`), alongside `OperationCanceledException`, so a no-reply deploy is classified as a timeout and triggers the query-site reconciliation.
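The classification fix amounts to widening one type test. In the sketch below, the local `AskTimeoutException` is a stand-in so the example compiles without the Akka package (real code would reference `Akka.Actor.AskTimeoutException`), and `DeployFailureClassifier` is a hypothetical name for wherever the branch lives.

```csharp
using System;
using System.Threading.Tasks;

// Stand-in for Akka.Actor.AskTimeoutException (an assumption made for
// self-containment). Like the real type, it does not derive from TimeoutException.
public sealed class AskTimeoutException : Exception
{
    public AskTimeoutException(string message) : base(message) { }
}

public static class DeployFailureClassifier
{
    // True when a deploy failure should be treated as "no reply from the site"
    // and therefore trigger the query-site reconciliation.
    public static bool IsTimeout(Exception ex) =>
        ex is TimeoutException
           or OperationCanceledException // also covers TaskCanceledException
           or AskTimeoutException;
}
```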
### 4.7 Retire the fat message
`DeployInstanceCommand.FlattenedConfigurationJson` is no longer sent cross-cluster. Keep the record only if convenient as an internal apply DTO; otherwise remove. After this change, **no Akka message carries config.**
## 5. Supersession & ordering semantics
- **At central:** delete-then-insert under the per-instance lock ⇒ ≤1 pending row per instance, always the newest deploy.
- **Fetch fails closed:** the serve endpoint returns 404 for any `deploymentId` that isn't the current pending row (unknown / expired / superseded / already promoted). So an older in-flight deploy notify or an older replicate-notify can never fetch a lower-version config.
- **Standby last-write-wins + guard:** `deployed_configurations` upserts by `instance_unique_name`; the standby additionally refuses to overwrite with an older `RevisionHash`/`DeployedAt`. Net: the standby converges on the latest config, never a stale one.
## 6. Follow-up (in the plan, separate & lower-priority): standby/startup reconciliation
Replication is best-effort with no retry and **no startup reconciliation** today — a standby that is *down* during a deploy permanently misses that instance until its next deploy (a pre-existing gap, independent of frame size). To make replication self-healing:
- On node start / peer rejoin, the node reconciles its `deployed_configurations` against central's **expected set for this site** (via `GET /api/internal/sites/{siteId}/deployments` → `{instanceUniqueName, revisionHash}` pairs), fetching missing/stale configs (reusing the §4.2 config endpoint) and dropping orphans central no longer has.
Marked as a distinct task, sequenced **after** the core frame-size fix.
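The reconciliation step reduces to a set comparison between the local rows and central's expected set. The sketch below is one possible shape, with hypothetical names (`SiteReconciler`, `ReconcilePlan`); both inputs map `instanceUniqueName` to `revisionHash`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ReconcilePlan(
    IReadOnlyList<string> ToFetch, // missing locally, or present with a different hash
    IReadOnlyList<string> ToDrop); // present locally, but no longer expected by central

public static class SiteReconciler
{
    public static ReconcilePlan Plan(
        IReadOnlyDictionary<string, string> local,
        IReadOnlyDictionary<string, string> expected)
    {
        var toFetch = expected
            .Where(e => !local.TryGetValue(e.Key, out var hash) || hash != e.Value)
            .Select(e => e.Key)
            .OrderBy(n => n, StringComparer.Ordinal)
            .ToList();

        var toDrop = local.Keys
            .Where(n => !expected.ContainsKey(n))
            .OrderBy(n => n, StringComparer.Ordinal)
            .ToList();

        return new ReconcilePlan(toFetch, toDrop);
    }
}
```

Sorting the names is only there to make the plan deterministic for tests and logs; the order in which configs are fetched does not matter for correctness.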
## 7. Configuration & options
- **Central:** `Communication:CentralFetchBaseUrl` (LB/Traefik URL); `Deployment:PendingDeploymentTtl` (default 5 min); token length; purge interval.
- **Site:** `HttpClient` registration for the site role; fetch timeout; replication fetch retry count.
- **Akka `maximum-frame-size`:** intentionally **unchanged** — the fix doesn't rely on it.
- gRPC message size: not involved (HTTP carries the config).
## 8. Failover & edge cases
- Site failover: the notify routes to the new active node automatically via the singleton proxy.
- Active node dies mid-fetch: central's `Ask` times out → reconciliation.
- Central node failover during a fetch: Traefik routes to the active central node; the pending row is in shared MS SQL → still served.
- Standby down during deploy: misses the replicate-notify → covered by the §6 follow-up.
- Token TTL must comfortably cover both nodes' fetches within the deploy window (default 5 min is ample).
## 9. Security
Token-gated internal endpoint; constant-time compare; short TTL; scoped to exactly one deployment's config; not tied to a user session; HTTPS recommended. Config `SecretsBlock`s are already encrypted at rest.
## 10. Testing
- **Unit:** `PendingDeployment` store incl. supersession (delete-then-insert) + TTL/purge; token validation (valid / expired / wrong / superseded, constant-time); `IDeploymentConfigFetcher` (200 / 401 / 404 / timeout); `DeploymentService` notify→promote flow; AskTimeout classification; standby fetch-on-replicate + older-write guard.
- **Integration:** the exact failing case — a **>128 KB** config deploys end-to-end; standby obtains it via fetch; kill the active node → failover recovers it from the standby's SQLite; redeploy a *newer* version → pending superseded, stale fetch 404s.
- **Live smoke (docker):** redeploy the 3rd-composition config that originally hung; confirm both nodes hold it.
## 11. Migration / deploy
- Single Host binary serves both roles → one rebuild + redeploy covers central and site. Message-contract changes (`RefreshDeploymentCommand` new; `ReplicateConfigDeploy` drops `ConfigJson`) are safe since all nodes upgrade together.
- New EF migration for `PendingDeployment` (auto-apply in dev; SQL script for prod). No site SQLite schema change.
- Add `CentralFetchBaseUrl` to central appsettings in `docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`; confirm site→central HTTP reachability (fine on the co-located test host/docker; a hub-spoke production deployment needs a firewall port opened). Update `deploy/wonder-app-vd03/RUNBOOK.md`.
## 12. Affected files (for the plan)
- `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Deployment/` — new `PendingDeployment` entity.
- `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IDeploymentManagerRepository.cs` — pending CRUD + supersession + expected-set query.
- `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Deployment/` — new `RefreshDeploymentCommand`; retire `DeployInstanceCommand` wire use.
- `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/` — EF mapping, repository impl, migration.
- `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs` — persist pending + supersede, send notify, promote on success, AskTimeout classification.
- `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs` + `Actors/{Central,Site}CommunicationActor.cs` — route `RefreshDeploymentCommand`.
- `src/ZB.MOM.WW.ScadaBridge.CentralUI` (or management API project) — `/api/internal/deployments/{id}/config` + `/api/internal/sites/{id}/deployments` endpoints.
- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` — handle `RefreshDeploymentCommand`, fetch, replicate id-only.
- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/SiteReplicationActor.cs` + `Messages/ReplicationMessages.cs` — id-only replication, fetch on standby, older-write guard.
- `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/` — `IDeploymentConfigFetcher` + HttpClient registration; site reconciliation (follow-up).
- appsettings (`docker/`, `docker-env2/`, `deploy/wonder-app-vd03/`) + `RUNBOOK.md`.
- Tests under `tests/`.