fix(host): register ActorSystem as DI singleton so health-probe scopes don't dispose it (HOST-021)

Per-probe health-check child scopes were disposing the AddTransient-bridged ActorSystem (IDisposable), terminating the live cluster node ~4s after boot and leaving every singleton-proxy Ask to hang the full 30s QueryTimeout — the central report pages (/notifications, /site-calls, /monitoring/health) loaded in ~30s. Bridge it as a singleton via a new lazy AkkaHostedService.GetOrCreateActorSystem() so child-scope disposal never touches it. Verified: 0 post-startup terminates, healthy active/standby, report pages ~0.05s, Playwright 68 passed / 0 failed.
2026-06-05 08:26:09 -04:00
parent 0783547a2d
commit d33617d65d
4 changed files with 328 additions and 39 deletions
@@ -0,0 +1,225 @@
+# Central report pages hang ~30s — NotificationOutbox / SiteCallAudit singleton query Asks never reply
+
+**Status:** FIXED — verified 2026-06-05 (pending commit) · **Severity:** High (real users see 30s page loads) · **Found:** 2026-06-05
+**Components:** Notification Outbox (#21), Site Call Audit (#22), Central UI (#9), Host/cluster (#15/#13)
+
+## FIX APPLIED & VERIFIED (2026-06-05)
+
+`HOST-021`. The Akka `ActorSystem` DI bridge was changed from `AddTransient` to a **singleton**
+routed through a new lazy, idempotent, thread-safe `AkkaHostedService.GetOrCreateActorSystem()`,
+which creates the system once on first call from either `StartAsync` or the DI factory. A
+singleton is owned by the root provider and is never disposed by a per-probe health-check child
+scope, so the `ActorSystem.Dispose()` → `Terminate()` path no longer fires. Routing through the
+creator (rather than a plain `AddSingleton(sp => …ActorSystem)` factory) avoids caching a
+`null` if a probe wins the startup race.
+
+Files:
+- `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs` — new `GetOrCreateActorSystem()`
+  + `_actorSystemLock`; `StartAsync` calls it instead of creating the system inline.
+- `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs` (central) and `SiteServiceRegistration.cs` (site)
+  — `AddTransient<ActorSystem>` → `AddSingleton<ActorSystem>(sp => …GetOrCreateActorSystem())`.
+
+Verification after `bash docker/deploy.sh`:
+- `ActorSystemTerminateReason` post-startup occurrences: **0** on both central nodes (was 1/boot).
+- `/health/active`: central-a **Healthy "Active node (cluster leader)"**, central-b **"Up but
+  not the cluster leader"** — correct active/standby (was both Standby Exiting/Removed).
+- Page render: `/notifications/report` **0.069s**, `/notifications/kpis` **0.091s**,
+  `/site-calls/report` **0.026s**, `/monitoring/health` **0.058s** (all were ~30s+).
+- Playwright E2E: **68 passed / 0 failed / 0 skipped** (was 62/6/0).
+
+## ROOT CAUSE (confirmed 2026-06-05 — supersedes the hypotheses below)
+
+The Akka `ActorSystem` is a **process singleton owned by `AkkaHostedService`**, but it is
+registered into DI as a **`Transient`** via a factory:
+
+```csharp
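+// Program.cs:211 (central) and SiteServiceRegistration.cs:82 (site)
+// Defective registration: the factory returns an IDisposable the container does not own,
+// so whichever scope resolves it also disposes it.
+builder.Services.AddTransient<Akka.Actor.ActorSystem>(sp =>
+    sp.GetRequiredService<AkkaHostedService>().ActorSystem!);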
+```
+
+`ActorSystem` is `IDisposable`. In Microsoft.Extensions.DependencyInjection, an `IDisposable`
+produced by a `Transient`/`Scoped` factory is **captured for disposal by the scope that
+resolved it**. The shared `ZB.MOM.WW.Health.Akka` checks (`AkkaClusterHealthCheck`,
+`ActiveNodeHealthCheck`) are registered with `AddTypeActivatedCheck` and resolve the system
+**lazily per probe** — `_serviceProvider.GetService<ActorSystem>()`
+(`AkkaClusterHealthCheck.cs:42`, `ActiveNodeHealthCheck.cs:102`). `HealthCheckService` runs
+each probe in its **own child scope**, so every `/health/ready` and `/health/active` probe:
+
+1. resolves the live `ActorSystem` (a `Transient`) into the probe's child scope,
+2. the probe completes and `HealthCheckService` disposes the scope,
+3. the container disposes the captured `ActorSystem` → `ActorSystem.Dispose()` →
+   `CoordinatedShutdown.Run(ActorSystemTerminateReason)` → the node Leaves → Exiting → the
+   actor system terminates.
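+
+The three steps above can be reproduced without Akka. The following is a minimal sketch, assuming
+only the `Microsoft.Extensions.DependencyInjection` package; `FakeSystem` and `ScopeDisposalDemo`
+are hypothetical stand-ins invented for illustration and are not part of the ScadaBridge code.
+
+```csharp
+using System;
+using Microsoft.Extensions.DependencyInjection;
+
+// Hypothetical stand-in for the externally owned ActorSystem; not an Akka type.
+public sealed class FakeSystem : IDisposable
+{
+    public bool Disposed { get; private set; }
+    public void Dispose() => Disposed = true;
+}
+
+public static class ScopeDisposalDemo
+{
+    // Resolves FakeSystem from a short-lived child scope (as a health probe would)
+    // and reports whether disposing that scope disposed the shared instance.
+    public static bool DisposedByChildScope(bool registerAsSingleton)
+    {
+        var owned = new FakeSystem(); // owned outside the container
+        var services = new ServiceCollection();
+        if (registerAsSingleton)
+            services.AddSingleton<FakeSystem>(_ => owned);
+        else
+            services.AddTransient<FakeSystem>(_ => owned);
+
+        // The root provider is deliberately left undisposed: disposing it
+        // would dispose the singleton too, which is not what is being measured.
+        var root = services.BuildServiceProvider();
+        using (var scope = root.CreateScope())
+        {
+            _ = scope.ServiceProvider.GetRequiredService<FakeSystem>();
+        } // the child scope is disposed here
+        return owned.Disposed;
+    }
+}
+```
+
+`DisposedByChildScope(false)` returns `true` (the transient is captured and disposed by the child
+scope), while `DisposedByChildScope(true)` returns `false` (the singleton is tracked by the root
+provider only).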
+
+The ASP.NET host process keeps running (only a DI-tracked transient was disposed; the
+hosted service's `StopAsync`/`ClrExitReason` path never runs), so the node is left
+**permanently dead**: member status is frozen at `Exiting` (central-a) / `Removed` (central-b),
+there is no `Up` member, and the cluster singletons have no host. Every Central UI page that
+`Ask`s a singleton proxy then waits while the message sits buffered until the 30s
+`QueryTimeout` expires. The health checks meant to *report* cluster status are what *kill*
+the cluster.
+
+**Evidence (clean redeploy, 2026-06-05 11:43):** central-a forms its cluster, goes `Up`, the
+singletons start + are identified (11:43:10.77); the first `GET /health/active` lands at
+11:43:14; `CoordinatedShutdown … ActorSystemTerminateReason … ExitCode:0` fires immediately
+(11:43:14.801); node Leaves → Exiting → "Successfully shut down" (11:43:24); process stays up
+serving HTTP. central-b shows the identical pattern at 11:43:17. `/health/ready` then returns 503
+on central-b, and `/health/active` reports `Standby: node is not Up (status: Exiting/Removed)` on
+both.
+No application code calls `.Terminate()` (grep), confirming the disposal path.
+
+**Why earlier analysis missed it:** the prior hypotheses examined the actor handler, proxy
+wiring, singleton lifecycle, and DB — all of which are correct. They are irrelevant because
+the `ActorSystem` is simply **dead** by the time a page queries it. "Deterministic, survives
+restart and full redeploy" is fully explained: it is a DI-lifetime code defect that
+re-triggers on the first post-`Up` health probe every boot.
+
+**Fix (task #48; since applied, see FIX APPLIED & VERIFIED above):** stop the container from disposing the externally-owned
+`ActorSystem`. It must be resolvable from DI as the live instance (the kit calls
+`GetService<ActorSystem>()`), re-readable (must not cache `null` during warmup), and never
+disposed by a child scope. A `Transient`/`Scoped` factory returning the `IDisposable` system
+is always captured by the resolving scope, and a plain `AddSingleton(factory)` caches whatever
+the first resolve sees (→ permanent `null` if a probe wins the warmup race). The chosen fix is
+a lazy, idempotent, thread-safe `AkkaHostedService.GetOrCreateActorSystem()` (creates the
+system once on first call from either `StartAsync` or the DI factory) registered as
+`AddSingleton<ActorSystem>(sp => sp.GetRequiredService<AkkaHostedService>().GetOrCreateActorSystem())`
+— a process singleton, so child-scope disposal never touches it, and never `null` because the
+first resolve creates it. Apply in **both** the central (`Program.cs`) and site
+(`SiteServiceRegistration.cs`) registrations.
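+
+The lazy creator described above could look roughly like the sketch below. It is an assumption
+about the shape of the code, not the committed implementation: the class name
+`AkkaHostedServiceSketch` and the system name `scadabridge-sketch` are hypothetical, and the real
+service builds its system from HOCON rather than from defaults. Only `_actorSystemLock` and
+`GetOrCreateActorSystem()` are names taken from this write-up.
+
+```csharp
+using Akka.Actor;
+
+// Hypothetical sketch of the lazy, idempotent, thread-safe creator.
+public sealed class AkkaHostedServiceSketch
+{
+    private readonly object _actorSystemLock = new object();
+    private ActorSystem? _actorSystem;
+
+    // Safe to call from StartAsync and from the DI factory, in either order:
+    // the first caller creates the system, every later caller gets the same one.
+    public ActorSystem GetOrCreateActorSystem()
+    {
+        lock (_actorSystemLock)
+        {
+            // Placeholder name and default config; the real code would pass its HOCON here.
+            return _actorSystem ??= ActorSystem.Create("scadabridge-sketch");
+        }
+    }
+}
+```
+
+Because the factory can only ever return a created system, a singleton registration built on it
+has no `null` to cache, whichever caller arrives first.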
+
+## Summary
+
+The Central UI pages that query the central **cluster singletons** (`/notifications/report`,
+`/notifications/kpis`, `/site-calls/report`, and the `/monitoring/health` KPI tiles) hang for
+**about 30 seconds** during server render, then render an empty/error state. The hang is the
+Akka `Ask` to the `notification-outbox` / `site-call-audit` singletons timing out at
+`CommunicationOptions.QueryTimeout` (30s): **the singleton never replies**. Every other page is
+fast. This is **deterministic** (it survived a clean cluster restart) and is **not** a test
+problem: the E2E tests that load these pages are correctly failing. When this summary was
+written, static analysis plus a restart had *not* pinned the root cause and the remaining step
+was runtime instrumentation (below); the confirmed cause is in the ROOT CAUSE section above.
+
+## Affected surface
+
+| Page | Server render time | Path |
+|------|--------------------|------|
+| `/admin/sites` | 0.026s | DB (`ISiteRepository`) |
+| `/audit/log` | 0.044s | DB |
+| `/deployment/deployments` | 0.018s | `CentralCommunicationActor` (local actor) Ask |
+| `/design/templates` | 0.013s | DB |
+| **`/notifications/report`** | **30.01s** | `GetNotificationOutbox()` singleton-proxy Ask |
+| **`/notifications/kpis`** | **30.05s** | `GetNotificationOutbox()` singleton-proxy Ask |
+| **`/site-calls/report`** | **30.02s** | `GetSiteCallAudit()` singleton-proxy Ask |
+| **`/monitoring/health`** | **>35s** | both singleton KPIs |
+
+Measured with an authenticated `curl` against the live cluster (`http://localhost:9000`), so it
+is a **server-side prerender** hang, independent of the browser.
+
+## Trigger path
+
+`NotificationReport.OnInitializedAsync` → `RefreshAll()` → `FetchPage()` →
+`CommunicationService.QueryNotificationOutboxAsync(request)` →
+`GetNotificationOutbox().Ask<NotificationOutboxQueryResponse>(request, _options.QueryTimeout)`.
+The page auto-queries the singleton on init (during prerender); the `Ask` times out at 30s.
+`SiteCallsReport` does the analogous thing for the `site-call-audit` singleton.
+
+- `CommunicationService.QueryNotificationOutboxAsync` → `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs:456`
+- `GetNotificationOutbox()` returns the cached `_notificationOutboxProxy` → `CommunicationService.cs:100`
+
+## What was verified correct / ruled out (with evidence)
+
+1. **The CentralCommunicationActor path is healthy.** `/deployment/deployments`
+   (`GetActor().Ask`, a node-local actor) returns in 0.018s. The cluster/ClusterClient transport
+   and the Ask machinery work. Only the **singleton-proxy** Asks hang. This is the key asymmetry.
+2. **The singletons start and are reachable by their proxies.** On the current boot the active
+   node (central-a) logged `NotificationOutbox singleton created and registered…`,
+   `SiteCallAuditActor singleton created and registered…`,
+   `ClusterSingletonManager state change [Start -> Oldest]`,
+   `Singleton manager started singleton actor [.../notification-outbox]`, and the proxy logged
+   `Singleton identified at [.../notification-outbox-singleton/notification-outbox]`.
+3. **Not cluster state — neither a restart nor a full redeploy fixes it.** Restarting the central
+   nodes (sequentially, then both together) re-formed a healthy active/standby cluster (central-a
+   active leader, central-b standby) with the singletons started + identified on the active node
+   (so the Ask is **local**), and the pages **still hung exactly 30s**. A subsequent **full
+   `docker/deploy.sh`** (fresh image rebuild + recreation of all containers) *also* left the pages
+   hanging exactly 30s. This rules out a stale-proxy / wedged-singleton cluster-state explanation,
+   a stale binary, and cross-node serialization (a local Ask is not serialized) — the defect is
+   **deterministic**.
+4. **The query handlers are correct.** Both `PipeTo` the async query to the captured `Sender`
+   with a **failure projection on every path** (a faulted query replies `Success:false`, it does
+   not hang):
+   - `NotificationOutboxActor.HandleQuery` → `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/NotificationOutboxActor.cs:760` (PipeTo at :765, failure arm :768)
+   - `SiteCallAuditActor.HandleQuery` → `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:224` (PipeTo at :229, failure arm :231)
+5. **The DB query would be instant.** `Notifications` and `SiteCalls` are **empty (0 rows, 0 ms)**
+   in the live DB, and the repository query is a plain EF `Where`/paginate
+   (`NotificationOutboxRepository.QueryAsync` → `…/Repositories/NotificationOutboxRepository.cs:132`).
+   So a query that actually executes returns in well under a second.
+6. **The proxy wiring is textbook.** `notification-outbox-proxy` is a standard
+   `ClusterSingletonProxy` for `/user/notification-outbox-singleton`, handed to
+   `CommunicationService.SetNotificationOutbox(...)` →
+   `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:367-379` (and `:515-524` for site-call-audit).
+   `GetNotificationOutbox()` is non-null (a null ref would throw fast, not hang).
+7. **No relevant exception/serialization error is logged** on either central node at query time
+   (`ActorInitializationException`, restart loop, `cannot be serialized`, `no serializer` — none).
+8. **Independent of the dispatch loop.** `NotificationOutbox` has a 5s dispatcher loop;
+   `SiteCallAudit` has **no** periodic loop at all (deferred) — yet both hang identically, so the
+   dispatch loop is not the shared cause. The loop is also fire-and-forget
+   (`RunDispatchPass(...).PipeTo(Self)`), so it cannot starve the mailbox.
+
+Net: handler, proxy wiring, singleton lifecycle, and DB query are all correct, and the table is
+empty — yet a **local** Ask to the singleton never replies within 30s. The defect is at the
+singleton **activation / message-processing boundary** on the live node, not in the visible code.
+
+## Leading hypotheses (not yet confirmed — need runtime instrumentation)
+
+1. **The singleton instance is not draining its mailbox** even though the
+   `ClusterSingletonManager` reports it started (a half-activated / perpetually-restarting / stuck
+   instance). The manager holds the name and the proxy "identifies" it, but messages are buffered
+   and never processed → 30s timeout with no per-query exception. Both singletons share
+   construction via `Props.Create(() => new …(_serviceProvider, …))`
+   (`AkkaHostedService.cs:357`, `:471`) and the `_serviceProvider.CreateScope()` +
+   `ScadaBridgeDbContext` pattern in their handlers — a shared activation-time defect would hit
+   both.
+2. **DB scope/connection acquisition from the actor's root-provider scope hangs** (e.g. a
+   leaked-connection / pool-wait specific to `_serviceProvider.CreateScope()` in the actor, vs the
+   request-scoped DbContext that `/admin/sites` uses successfully). The 30s is *exactly* the Ask
+   timeout, so any handler-side hang ≥30s presents identically.
+3. **The reply cannot be delivered back to the Ask's temporary actor** (less likely for a local
+   Ask, but not disproven).
+
+## How to confirm (next step)
+
+Bisect "message never reaches the singleton (routing)" vs "singleton receives but never replies
+(handler/DB)":
+
+- Turn on Akka receive logging for the run — `akka.loglevel = DEBUG` and
+  `akka.actor.debug.receive = on` in the Host's HOCON (`AkkaHostedService.BuildHocon`,
+  ~`AkkaHostedService.cs:171-216`) — or add a single `_log.Info("HandleQuery received …")` line at
+  the top of `NotificationOutboxActor.HandleQuery`, then `bash docker/deploy.sh` and hit
+  `/notifications/report` once.
+  - If the log line **does not** appear → the message isn't reaching the singleton (routing /
+    proxy / mailbox-stuck) → investigate the singleton activation + proxy delivery.
+  - If it **does** appear → the handler/`QueryOutboxAsync` is hanging → wrap with timing around
+    `CreateScope()`, `GetRequiredService`, and `await repository.QueryAsync(...)` to find which
+    step hangs.
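+
+For the timing step in the second branch, a small helper along these lines may be enough. It is
+a sketch only: `StepTimer` and `TimedAsync` are hypothetical names, and the `log` callback stands
+in for whatever logger the handler already has. It logs before the step as well as after, because
+a step that hangs forever never reaches the "took" line.
+
+```csharp
+using System;
+using System.Diagnostics;
+using System.Threading.Tasks;
+
+public static class StepTimer
+{
+    // Awaits one step, logging when it starts and how long it took.
+    // The duration is logged even if the step throws.
+    public static async Task<T> TimedAsync<T>(string step, Func<Task<T>> action, Action<string> log)
+    {
+        log($"{step} starting");
+        var sw = Stopwatch.StartNew();
+        try
+        {
+            return await action();
+        }
+        finally
+        {
+            log($"{step} took {sw.ElapsedMilliseconds} ms");
+        }
+    }
+}
+```
+
+A "starting" line with no matching "took" line identifies the step that hangs.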
+
+## Blocked tests
+
+All currently-failing Playwright tests are blocked solely by this hang (they load the affected
+pages):
+
+- `Audit.AuditLogPageTests.NotificationsPage_RendersAuditDrillInLinkPattern` (loads `/notifications/report`)
+- All `SiteCalls.SiteCallsPageTests` page tests (load `/site-calls/report`): `PageLoads_ForDeploymentUser`,
+  `FilterNarrowing_ChannelFilterShrinksGrid`, `RetryClickThrough_OnParkedRow_ConfirmsRelayAndShowsOutcomeToast`,
+  `RetryDiscard_VisibleOnlyOnParkedRows`, `DrillIn_ViewAuditHistory_NavigatesToPreFilteredAuditLog`.
+
+The rest of the suite is green (the Audit grid/drawer tests pass after the `AuditDataSeeder`
+canonical-schema fix landed in the same session).
+
+## Notes
+
+- Pre-existing: the hang was present before any test-suite or cluster-restart work this session,
+  and the restarts did not cause it (the cluster is healthy active/standby afterward).
+- Timeframe correlation only (not proven causal): this surfaced around the audit subsystem
+  re-architecture (`CollapseAuditLogToCanonical`) — but the NotificationOutbox/SiteCallAudit
+  handlers and repositories read the unchanged-and-empty `Notifications`/`SiteCalls` tables and
+  are themselves correct, so the defect is at the singleton hosting/messaging layer rather than in
+  the audit-table change.