Per-probe health-check child scopes were disposing the AddTransient-bridged ActorSystem (IDisposable), terminating the live cluster node ~4s after boot and leaving every singleton-proxy Ask to hang the full 30s QueryTimeout — the central report pages (/notifications, /site-calls, /monitoring/health) loaded in ~30s. Bridge it as a singleton via a new lazy AkkaHostedService.GetOrCreateActorSystem() so child-scope disposal never touches it. Verified: 0 post-startup terminates, healthy active/standby, report pages ~0.05s, Playwright 68 passed / 0 failed.
Central report pages hang ~30s — NotificationOutbox / SiteCallAudit singleton query Asks never reply
Status: FIXED, verified 2026-06-05 (pending commit) · Severity: High (real users see 30s page loads) · Found: 2026-06-05
Components: Notification Outbox (#21), Site Call Audit (#22), Central UI (#9), Host/cluster (#15/#13)
FIX APPLIED & VERIFIED (2026-06-05)
HOST-021. The Akka ActorSystem DI bridge was changed from AddTransient to a singleton
routed through a new lazy, idempotent, thread-safe AkkaHostedService.GetOrCreateActorSystem()
(creates the system once on first call from either StartAsync or the DI factory). A singleton
is resolved from the root provider and is never disposed by a per-probe health-check child
scope, so the ActorSystem.Dispose() → Terminate() no longer fires; routing through the
creator (rather than a plain AddSingleton(sp => …ActorSystem) factory) avoids caching a
null if a probe wins the startup race.
Files:
- src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs: new GetOrCreateActorSystem() and _actorSystemLock; StartAsync calls it instead of creating the system inline.
- src/ZB.MOM.WW.ScadaBridge.Host/Program.cs (central) and SiteServiceRegistration.cs (site): AddTransient<ActorSystem> → AddSingleton<ActorSystem>(sp => …GetOrCreateActorSystem()).
Verification after bash docker/deploy.sh:
- ActorSystemTerminateReason post-startup occurrences: 0 on both central nodes (was 1/boot).
- /health/active: central-a Healthy "Active node (cluster leader)", central-b "Up but not the cluster leader", which is the correct active/standby split (was both Standby Exiting/Removed).
- Page render: /notifications/report 0.069s, /notifications/kpis 0.091s, /site-calls/report 0.026s, /monitoring/health 0.058s (all were ~30s+).
- Playwright E2E: 68 passed / 0 failed / 0 skipped (was 62/6/0).
ROOT CAUSE (confirmed 2026-06-05 — supersedes the hypotheses below)
The Akka ActorSystem is a process singleton owned by AkkaHostedService, but it is
registered into DI as a Transient via a factory:
// Program.cs:211 (central) and SiteServiceRegistration.cs:82 (site)
builder.Services.AddTransient<Akka.Actor.ActorSystem>(sp =>
sp.GetRequiredService<AkkaHostedService>().ActorSystem!);
ActorSystem is IDisposable. In Microsoft.Extensions.DependencyInjection, an IDisposable
produced by a Transient/Scoped factory is captured for disposal by the scope that
resolved it. The shared ZB.MOM.WW.Health.Akka checks (AkkaClusterHealthCheck,
ActiveNodeHealthCheck) are registered with AddTypeActivatedCheck and resolve the system
lazily per probe — _serviceProvider.GetService<ActorSystem>()
(AkkaClusterHealthCheck.cs:42, ActiveNodeHealthCheck.cs:102). HealthCheckService runs
each probe in its own child scope, so every /health/ready and /health/active probe:
- resolves the live ActorSystem (a Transient) into the probe's child scope,
- the probe completes and HealthCheckService disposes the scope,
- the container disposes the captured ActorSystem → ActorSystem.Dispose() → CoordinatedShutdown.Run(ActorSystemTerminateReason) → the node Leaves → Exiting → the actor system terminates.
The ASP.NET host process keeps running (only a DI-tracked transient was disposed; the
hosted service's StopAsync/ClrExitReason path never runs), so the node is left
permanently dead — member status frozen at Exiting (central-a) / Removed (central-b),
no Up member, the cluster singletons have no host, and every Central UI page that Asks a
singleton proxy buffers the message until the 30s QueryTimeout. The health checks meant to
report cluster status are what kill the cluster.
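The capture-and-dispose behaviour described above can be reproduced without Akka. The following is a minimal sketch, assuming only the Microsoft.Extensions.DependencyInjection package; FakeSystem and TransientCaptureDemo are hypothetical stand-ins invented for illustration and are not part of the ScadaBridge code base. The child scope plays the role of one health probe.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical stand-in for an externally owned IDisposable such as ActorSystem.
public sealed class FakeSystem : IDisposable
{
    public int DisposeCount { get; private set; }
    public void Dispose() => DisposeCount++;
}

public static class TransientCaptureDemo
{
    // Resolves the shared instance inside one child scope, disposes that scope,
    // and returns how many times the shared instance has been disposed so far.
    public static int DisposeCountAfterOneProbe(bool registerAsSingleton)
    {
        var live = new FakeSystem(); // owned elsewhere (think: the hosted service)
        var services = new ServiceCollection();

        if (registerAsSingleton)
            services.AddSingleton<FakeSystem>(_ => live);
        else
            services.AddTransient<FakeSystem>(_ => live);

        using var root = services.BuildServiceProvider();
        using (var scope = root.CreateScope())
        {
            // The resolving scope tracks a transient IDisposable for disposal.
            _ = scope.ServiceProvider.GetRequiredService<FakeSystem>();
        } // scope disposal happens here, like the end of one probe

        // Read before the root provider itself is disposed on method exit.
        return live.DisposeCount;
    }
}
```

With the transient registration the count after one simulated probe is 1; with the singleton registration it is 0, because a singleton is tracked by the root provider rather than by the child scope.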
Evidence (clean redeploy, 2026-06-05 11:43): central-a forms its cluster, goes Up, the
singletons start + are identified (11:43:10.77); the first GET /health/active lands at
11:43:14; CoordinatedShutdown … ActorSystemTerminateReason … ExitCode:0 fires immediately
(11:43:14.801); node Leaves → Exiting → "Successfully shut down" (11:43:24); process stays up
serving HTTP. central-b shows the identical pattern at 11:43:17. /health/ready then = 503 on
central-b, and /health/active = Standby: node is not Up (status: Exiting/Removed) on both.
No application code calls .Terminate() (grep), confirming the disposal path.
Why earlier analysis missed it: the prior hypotheses examined the actor handler, proxy
wiring, singleton lifecycle, and DB — all of which are correct. They are irrelevant because
the ActorSystem is simply dead by the time a page queries it. "Deterministic, survives
restart and full redeploy" is fully explained: it is a DI-lifetime code defect that
re-triggers on the first post-Up health probe every boot.
Fix (pending, task #48): stop the container from disposing the externally-owned
ActorSystem. It must be resolvable from DI as the live instance (the kit calls
GetService<ActorSystem>()), re-readable (must not cache null during warmup), and never
disposed by a child scope. A Transient/Scoped factory returning the IDisposable system
is always captured by the resolving scope, and a plain AddSingleton(factory) caches whatever
the first resolve sees (→ permanent null if a probe wins the warmup race). The chosen fix is
a lazy, idempotent, thread-safe AkkaHostedService.GetOrCreateActorSystem() (creates the
system once on first call from either StartAsync or the DI factory) registered as
AddSingleton<ActorSystem>(sp => sp.GetRequiredService<AkkaHostedService>().GetOrCreateActorSystem())
— a process singleton, so child-scope disposal never touches it, and never null because the
first resolve creates it. Apply in both the central (Program.cs) and site
(SiteServiceRegistration.cs) registrations.
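The shape of the lazy creator can be sketched as follows. This is an illustration under stated assumptions, not the project's AkkaHostedService: StandInSystem replaces the real Akka ActorSystem so the block has no external dependencies, SystemOwner and CreateCount are hypothetical names, and only GetOrCreateActorSystem and _actorSystemLock are taken from the text above.

```csharp
#nullable enable
using System;
using System.Threading;

// Hypothetical stand-in for ActorSystem; the real type comes from Akka.NET.
public sealed class StandInSystem : IDisposable
{
    public void Dispose() { }
}

// Sketch of a hosted-service-like owner of the single system instance.
public sealed class SystemOwner
{
    private readonly object _actorSystemLock = new object();
    private StandInSystem? _actorSystem;
    private int _createCount;

    // Number of times the system was actually constructed (for illustration).
    public int CreateCount => Volatile.Read(ref _createCount);

    // Lazy, idempotent, thread-safe: the first caller creates the instance and
    // every later caller (StartAsync or the DI factory) receives the same one.
    public StandInSystem GetOrCreateActorSystem()
    {
        lock (_actorSystemLock)
        {
            if (_actorSystem is null)
            {
                _actorSystem = new StandInSystem();
                _createCount++;
            }
            return _actorSystem;
        }
    }
}
```

Because the creator never returns null, a singleton factory that calls it cannot cache a null even if a probe resolves the system before StartAsync has run.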
Summary
The Central UI pages that query the central cluster singletons (/notifications/report,
/notifications/kpis, /site-calls/report, and the /monitoring/health KPI tiles) hang for
~30 seconds during server render, then render an empty/error state. The hang is the
Akka Ask to the notification-outbox / site-call-audit singletons timing out at
CommunicationOptions.QueryTimeout (30s): the singleton never replies. Every other page is
fast. This is deterministic (it survived a clean cluster restart) and is not a test
problem — the E2E tests that load these pages are correctly failing. The root cause was not
pinned by static analysis + a restart; the remaining step is runtime instrumentation (below).
Affected surface
| Page | Server render time | Path |
|---|---|---|
| /admin/sites | 0.026s | DB (ISiteRepository) |
| /audit/log | 0.044s | DB |
| /deployment/deployments | 0.018s | CentralCommunicationActor (local actor) Ask |
| /design/templates | 0.013s | DB |
| /notifications/report | 30.01s | GetNotificationOutbox() singleton-proxy Ask |
| /notifications/kpis | 30.05s | GetNotificationOutbox() singleton-proxy Ask |
| /site-calls/report | 30.02s | GetSiteCallAudit() singleton-proxy Ask |
| /monitoring/health | >35s | both singleton KPIs |
Measured with an authenticated curl against the live cluster (http://localhost:9000), so it
is a server-side prerender hang, independent of the browser.
Trigger path
NotificationReport.OnInitializedAsync → RefreshAll() → FetchPage() →
CommunicationService.QueryNotificationOutboxAsync(request) →
GetNotificationOutbox().Ask<NotificationOutboxQueryResponse>(request, _options.QueryTimeout).
The page auto-queries the singleton on init (during prerender); the Ask times out at 30s.
SiteCallsReport does the analogous thing for the site-call-audit singleton.
- CommunicationService.QueryNotificationOutboxAsync → src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs:456
- GetNotificationOutbox() returns the cached _notificationOutboxProxy → CommunicationService.cs:100
What was verified correct / ruled out (with evidence)
- The CentralCommunicationActor path is healthy. /deployment/deployments (GetActor().Ask, a node-local actor) returns in 0.018s. The cluster/ClusterClient transport and the Ask machinery work. Only the singleton-proxy Asks hang. This is the key asymmetry.
- The singletons start and are reachable by their proxies. On the current boot the active node (central-a) logged NotificationOutbox singleton created and registered…, SiteCallAuditActor singleton created and registered…, ClusterSingletonManager state change [Start -> Oldest], Singleton manager started singleton actor [.../notification-outbox], and the proxy logged Singleton identified at [.../notification-outbox-singleton/notification-outbox].
- Not cluster state: neither a restart nor a full redeploy fixes it. Restarting the central nodes (sequentially, then both together) re-formed a healthy active/standby cluster (central-a active leader, central-b standby) with the singletons started + identified on the active node (so the Ask is local), and the pages still hung exactly 30s. A subsequent full docker/deploy.sh (fresh image rebuild + recreation of all containers) also left the pages hanging exactly 30s. This rules out a stale-proxy / wedged-singleton cluster-state explanation, a stale binary, and cross-node serialization (a local Ask is not serialized); the defect is deterministic.
- The query handlers are correct. Both PipeTo the async query to the captured Sender with a failure projection on every path (a faulted query replies Success:false, it does not hang):
  - NotificationOutboxActor.HandleQuery → src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/NotificationOutboxActor.cs:760 (PipeTo at :765, failure arm :768)
  - SiteCallAuditActor.HandleQuery → src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:224 (PipeTo at :229, failure arm :231)
- The DB query would be instant. Notifications and SiteCalls are empty (0 rows, 0 ms) in the live DB, and the repository query is a plain EF Where/paginate (NotificationOutboxRepository.QueryAsync → …/Repositories/NotificationOutboxRepository.cs:132). So a query that actually executes returns in well under a second.
- The proxy wiring is textbook. notification-outbox-proxy is a standard ClusterSingletonProxy for /user/notification-outbox-singleton, handed to CommunicationService.SetNotificationOutbox(...) → src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:367-379 (and :515-524 for site-call-audit). GetNotificationOutbox() is non-null (a null ref would throw fast, not hang).
- No relevant exception/serialization error is logged on either central node at query time (ActorInitializationException, restart loop, cannot be serialized, no serializer: none of these appear).
- Singleton-agnostic to the dispatch loop. NotificationOutbox has a 5s dispatcher loop; SiteCallAudit has no periodic loop at all (deferred), yet both hang identically, so the dispatch loop is not the shared cause. The loop is also fire-and-forget (RunDispatchPass(...).PipeTo(Self)), so it cannot starve the mailbox.
Net: handler, proxy wiring, singleton lifecycle, and DB query are all correct, and the table is empty — yet a local Ask to the singleton never replies within 30s. The defect is at the singleton activation / message-processing boundary on the live node, not in the visible code.
Leading hypotheses (not yet confirmed — need runtime instrumentation)
- The singleton instance is not draining its mailbox even though the ClusterSingletonManager reports it started (a half-activated / perpetually-restarting / stuck instance). The manager holds the name and the proxy "identifies" it, but messages are buffered and never processed → 30s timeout with no per-query exception. Both singletons share construction via Props.Create(() => new …(_serviceProvider, …)) (AkkaHostedService.cs:357, :471) and the _serviceProvider.CreateScope() + ScadaBridgeDbContext pattern in their handlers, so a shared activation-time defect would hit both.
- DB scope/connection acquisition from the actor's root-provider scope hangs (e.g. a leaked-connection / pool-wait specific to _serviceProvider.CreateScope() in the actor, vs the request-scoped DbContext that /admin/sites uses successfully). The 30s is exactly the Ask timeout, so any handler-side hang ≥30s presents identically.
- The reply cannot be delivered back to the Ask's temporary actor (less likely for a local Ask, but not disproven).
How to confirm (next step)
Bisect "message never reaches the singleton (routing)" vs "singleton receives but never replies (handler/DB)":
- Turn on Akka receive logging for the run: set akka.loglevel = DEBUG and akka.actor.debug.receive = on in the Host's HOCON (AkkaHostedService.BuildHocon, ~AkkaHostedService.cs:171-216), or add a single _log.Info("HandleQuery received …") line at the top of NotificationOutboxActor.HandleQuery. Then run bash docker/deploy.sh and hit /notifications/report once.
  - If the log line does not appear → the message isn't reaching the singleton (routing / proxy / mailbox-stuck) → investigate the singleton activation + proxy delivery.
  - If it does appear → the handler/QueryOutboxAsync is hanging → wrap CreateScope(), GetRequiredService, and await repository.QueryAsync(...) with timing to find which step stalls.
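The timing wrap mentioned in the second branch could look like the sketch below. StepTimer and TimeAsync are hypothetical names introduced for illustration; the block uses only the .NET base class library. It logs when a step starts as well as when it finishes, because a step that never completes would otherwise leave no trace at all.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class StepTimer
{
    // Awaits one step, logging a line before it starts and another with the
    // elapsed time once it completes or throws. A step that hangs forever
    // produces only the "started" line, which identifies the stalled await.
    public static async Task<T> TimeAsync<T>(string step, Func<Task<T>> action, Action<string> log)
    {
        log($"{step} started");
        var sw = Stopwatch.StartNew();
        try
        {
            return await action().ConfigureAwait(false);
        }
        finally
        {
            sw.Stop();
            log($"{step} finished after {sw.ElapsedMilliseconds} ms");
        }
    }
}
```

In a handler this would wrap each suspect call in turn, for example the repository query, with the log callback forwarding to the actor's logger; the last "started" line without a matching "finished" line marks the hanging step.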
Blocked tests
All currently-failing Playwright tests are blocked solely by this hang (they load the affected pages):
- Audit.AuditLogPageTests.NotificationsPage_RendersAuditDrillInLinkPattern (loads /notifications/report)
- All SiteCalls.SiteCallsPageTests page tests (load /site-calls/report): PageLoads_ForDeploymentUser, FilterNarrowing_ChannelFilterShrinksGrid, RetryClickThrough_OnParkedRow_ConfirmsRelayAndShowsOutcomeToast, RetryDiscard_VisibleOnlyOnParkedRows, DrillIn_ViewAuditHistory_NavigatesToPreFilteredAuditLog.
The rest of the suite is green (the Audit grid/drawer tests pass after the AuditDataSeeder
canonical-schema fix landed in the same session).
Notes
- Pre-existing: the hang was present before any test-suite or cluster-restart work this session, and the restarts did not cause it (the cluster is healthy active/standby afterward).
- Timeframe correlation only (not proven causal): this surfaced around the audit subsystem re-architecture (CollapseAuditLogToCanonical), but the NotificationOutbox/SiteCallAudit handlers and repositories read the unchanged-and-empty Notifications/SiteCalls tables and are themselves correct, so the defect is at the singleton hosting/messaging layer rather than in the audit-table change.