Joseph Doherty d33617d65d fix(host): register ActorSystem as DI singleton so health-probe scopes don't dispose it (HOST-021)

Per-probe health-check child scopes were disposing the AddTransient-bridged
ActorSystem (IDisposable), terminating the live cluster node ~4s after boot and
leaving every singleton-proxy Ask to hang the full 30s QueryTimeout — the central
report pages (/notifications, /site-calls, /monitoring/health) loaded in ~30s.
Bridge it as a singleton via a new lazy AkkaHostedService.GetOrCreateActorSystem()
so child-scope disposal never touches it. Verified: 0 post-startup terminates,
healthy active/standby, report pages ~0.05s, Playwright 68 passed / 0 failed.

2026-06-05 08:26:09 -04:00

Central report pages hang ~30s — NotificationOutbox / SiteCallAudit singleton query Asks never reply

Status: FIXED, verified 2026-06-05 (pending commit) · Severity: High (real users see 30s page loads) · Found: 2026-06-05 · Components: Notification Outbox (#21), Site Call Audit (#22), Central UI (#9), Host/cluster (#15/#13)

FIX APPLIED & VERIFIED (2026-06-05)

HOST-021. The Akka ActorSystem DI bridge was changed from AddTransient to a singleton routed through a new lazy, idempotent, thread-safe AkkaHostedService.GetOrCreateActorSystem() (creates the system once on first call from either StartAsync or the DI factory). A singleton is resolved from the root provider and is never disposed by a per-probe health-check child scope, so the ActorSystem.Dispose() → Terminate() no longer fires; routing through the creator (rather than a plain AddSingleton(sp => …ActorSystem) factory) avoids caching a null if a probe wins the startup race.

Files:

src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs: new GetOrCreateActorSystem() + _actorSystemLock; StartAsync calls it instead of creating the system inline.
src/ZB.MOM.WW.ScadaBridge.Host/Program.cs (central) and SiteServiceRegistration.cs (site) — AddTransient<ActorSystem> → AddSingleton<ActorSystem>(sp => …GetOrCreateActorSystem()).

Verification after bash docker/deploy.sh:

ActorSystemTerminateReason post-startup occurrences: 0 on both central nodes (was 1/boot).
/health/active: central-a Healthy "Active node (cluster leader)", central-b "Up but not the cluster leader" — correct active/standby (was both Standby Exiting/Removed).
Page render: /notifications/report 0.069s, /notifications/kpis 0.091s, /site-calls/report 0.026s, /monitoring/health 0.058s (all were ~30s+).
Playwright E2E: 68 passed / 0 failed / 0 skipped (was 62/6/0).

ROOT CAUSE (confirmed 2026-06-05 — supersedes the hypotheses below)

The Akka ActorSystem is a process singleton owned by AkkaHostedService, but it is registered into DI as a Transient via a factory:

// Program.cs:211 (central) and SiteServiceRegistration.cs:82 (site)
builder.Services.AddTransient<Akka.Actor.ActorSystem>(sp =>
    sp.GetRequiredService<AkkaHostedService>().ActorSystem!);

ActorSystem is IDisposable. In Microsoft.Extensions.DependencyInjection, an IDisposable produced by a Transient/Scoped factory is captured for disposal by the scope that resolved it. The shared ZB.MOM.WW.Health.Akka checks (AkkaClusterHealthCheck, ActiveNodeHealthCheck) are registered with AddTypeActivatedCheck and resolve the system lazily per probe — _serviceProvider.GetService<ActorSystem>() (AkkaClusterHealthCheck.cs:42, ActiveNodeHealthCheck.cs:102). HealthCheckService runs each probe in its own child scope, so every /health/ready and /health/active probe:

resolves the live ActorSystem (a Transient) into the probe's child scope,
the probe completes and HealthCheckService disposes the scope,
the container disposes the captured ActorSystem → ActorSystem.Dispose() → CoordinatedShutdown.Run(ActorSystemTerminateReason) → the node Leaves → Exiting → the actor system terminates.

The ASP.NET host process keeps running (only a DI-tracked transient was disposed; the hosted service's StopAsync/ClrExitReason path never runs), so the node is left permanently dead — member status frozen at Exiting (central-a) / Removed (central-b), no Up member, the cluster singletons have no host, and every Central UI page that Asks a singleton proxy buffers the message until the 30s QueryTimeout. The health checks meant to report cluster status are what kill the cluster.
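The capture-and-dispose behaviour described above can be reproduced without Akka. The following is a minimal sketch, assuming the Microsoft.Extensions.DependencyInjection package is referenced; FakeSystem and ScopeDisposalDemo are hypothetical stand-ins for illustration, not types from the ScadaBridge codebase or the health-check kit.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical stand-in for an externally owned IDisposable such as ActorSystem.
public sealed class FakeSystem : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

public static class ScopeDisposalDemo
{
    // Resolves the shared instance inside a child scope (as a health probe would),
    // disposes the scope, and reports whether the instance was disposed with it.
    public static bool DisposedByChildScope(bool registerAsSingleton)
    {
        var live = new FakeSystem();
        var services = new ServiceCollection();

        if (registerAsSingleton)
            services.AddSingleton<FakeSystem>(_ => live);
        else
            services.AddTransient<FakeSystem>(_ => live);

        using var root = services.BuildServiceProvider();
        using (var scope = root.CreateScope())
        {
            scope.ServiceProvider.GetRequiredService<FakeSystem>();
        }

        // Read before the root provider itself is disposed at method exit.
        return live.Disposed;
    }

    public static void Main()
    {
        Console.WriteLine($"transient: {DisposedByChildScope(false)}");
        Console.WriteLine($"singleton: {DisposedByChildScope(true)}");
    }
}
```

With the container behaviour described above, the transient registration should report that the instance was disposed when the child scope ended, while the singleton registration should leave it alive until the root provider is disposed.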

Evidence (clean redeploy, 2026-06-05 11:43): central-a forms its cluster, goes Up, the singletons start + are identified (11:43:10.77); the first GET /health/active lands at 11:43:14; CoordinatedShutdown … ActorSystemTerminateReason … ExitCode:0 fires immediately (11:43:14.801); node Leaves → Exiting → "Successfully shut down" (11:43:24); process stays up serving HTTP. central-b shows the identical pattern at 11:43:17. /health/ready then = 503 on central-b, and /health/active = Standby: node is not Up (status: Exiting/Removed) on both. No application code calls .Terminate() (grep), confirming the disposal path.

Why earlier analysis missed it: the prior hypotheses examined the actor handler, proxy wiring, singleton lifecycle, and DB — all of which are correct. They are irrelevant because the ActorSystem is simply dead by the time a page queries it. "Deterministic, survives restart and full redeploy" is fully explained: it is a DI-lifetime code defect that re-triggers on the first post-Up health probe every boot.

Fix (pending, task #48): stop the container from disposing the externally-owned ActorSystem. It must be resolvable from DI as the live instance (the kit calls GetService<ActorSystem>()), re-readable (must not cache null during warmup), and never disposed by a child scope. A Transient/Scoped factory returning the IDisposable system is always captured by the resolving scope, and a plain AddSingleton(factory) caches whatever the first resolve sees (→ permanent null if a probe wins the warmup race). The chosen fix is a lazy, idempotent, thread-safe AkkaHostedService.GetOrCreateActorSystem() (creates the system once on first call from either StartAsync or the DI factory) registered as AddSingleton<ActorSystem>(sp => sp.GetRequiredService<AkkaHostedService>().GetOrCreateActorSystem()) — a process singleton, so child-scope disposal never touches it, and never null because the first resolve creates it. Apply in both the central (Program.cs) and site (SiteServiceRegistration.cs) registrations.
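The lazy, idempotent, thread-safe creation pattern can be sketched in isolation. This is an illustration of the general technique under stated assumptions, not the actual AkkaHostedService code: FakeActorSystem, HostedServiceSketch, GetOrCreateSystem, and CreateCount are hypothetical names, and the real method creates an Akka ActorSystem rather than a plain object.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the externally owned, expensive-to-create resource.
public sealed class FakeActorSystem
{
    public string Name { get; }
    public FakeActorSystem(string name) => Name = name;
}

// Sketch of a hosted service that owns the resource and creates it on first use.
public sealed class HostedServiceSketch
{
    private readonly object _systemLock = new object();
    private FakeActorSystem? _system;

    // Exposed only so the sketch can show that creation happens once.
    public int CreateCount { get; private set; }

    // Safe to call from the startup path and from a DI factory in any order:
    // the first caller creates the instance, later callers get the same one.
    public FakeActorSystem GetOrCreateSystem()
    {
        lock (_systemLock)
        {
            if (_system is null)
            {
                CreateCount++;
                _system = new FakeActorSystem("sketch");
            }
            return _system;
        }
    }

    // Startup goes through the same method instead of creating inline.
    public void Start() => GetOrCreateSystem();
}
```

Because every caller goes through the lock, a resolve that arrives before startup creates the instance instead of observing null, and a later startup call reuses it.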

Summary

The Central UI pages that query the central cluster singletons (/notifications/report, /notifications/kpis, /site-calls/report, and the /monitoring/health KPI tiles) hang for almost exactly 30 seconds during server render, then render an empty/error state. The hang is the Akka Ask to the notification-outbox / site-call-audit singletons timing out at CommunicationOptions.QueryTimeout (30s): the singleton never replies. Every other page is fast. This is deterministic (it survived a clean cluster restart) and is not a test problem: the E2E tests that load these pages are correctly failing. The root cause was not pinned by static analysis + a restart; the remaining step is runtime instrumentation (below).

Affected surface

| Page | Server render time | Path |
| --- | --- | --- |
| `/admin/sites` | 0.026s | DB (`ISiteRepository`) |
| `/audit/log` | 0.044s | DB |
| `/deployment/deployments` | 0.018s | `CentralCommunicationActor` (local actor) Ask |
| `/design/templates` | 0.013s | DB |
| `/notifications/report` | 30.01s | `GetNotificationOutbox()` singleton-proxy Ask |
| `/notifications/kpis` | 30.05s | `GetNotificationOutbox()` singleton-proxy Ask |
| `/site-calls/report` | 30.02s | `GetSiteCallAudit()` singleton-proxy Ask |
| `/monitoring/health` | >35s | both singleton KPIs |

Measured with an authenticated curl against the live cluster (http://localhost:9000), so it is a server-side prerender hang, independent of the browser.

Trigger path

NotificationReport.OnInitializedAsync → RefreshAll() → FetchPage() → CommunicationService.QueryNotificationOutboxAsync(request) → GetNotificationOutbox().Ask<NotificationOutboxQueryResponse>(request, _options.QueryTimeout). The page auto-queries the singleton on init (during prerender); the Ask times out at 30s. SiteCallsReport does the analogous thing for the site-call-audit singleton.

CommunicationService.QueryNotificationOutboxAsync → src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs:456
GetNotificationOutbox() returns the cached _notificationOutboxProxy → CommunicationService.cs:100

What was verified correct / ruled out (with evidence)

The CentralCommunicationActor path is healthy. /deployment/deployments (GetActor().Ask, a node-local actor) returns in 0.018s. The cluster/ClusterClient transport and the Ask machinery work. Only the singleton-proxy Asks hang. This is the key asymmetry.
The singletons start and are reachable by their proxies. On the current boot the active node (central-a) logged NotificationOutbox singleton created and registered…, SiteCallAuditActor singleton created and registered…, ClusterSingletonManager state change [Start -> Oldest], Singleton manager started singleton actor [.../notification-outbox], and the proxy logged Singleton identified at [.../notification-outbox-singleton/notification-outbox].
Not cluster state — neither a restart nor a full redeploy fixes it. Restarting the central nodes (sequentially, then both together) re-formed a healthy active/standby cluster (central-a active leader, central-b standby) with the singletons started + identified on the active node (so the Ask is local), and the pages still hung exactly 30s. A subsequent full docker/deploy.sh (fresh image rebuild + recreation of all containers) also left the pages hanging exactly 30s. This rules out a stale-proxy / wedged-singleton cluster-state explanation, a stale binary, and cross-node serialization (a local Ask is not serialized) — the defect is deterministic.
The query handlers are correct. Both PipeTo the async query to the captured Sender with a failure projection on every path (a faulted query replies Success:false, it does not hang):
- NotificationOutboxActor.HandleQuery → src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/NotificationOutboxActor.cs:760 (PipeTo at :765, failure arm :768)
- SiteCallAuditActor.HandleQuery → src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:224 (PipeTo at :229, failure arm :231)
The DB query would be instant. Notifications and SiteCalls are empty (0 rows, 0 ms) in the live DB, and the repository query is a plain EF Where/paginate (NotificationOutboxRepository.QueryAsync → …/Repositories/NotificationOutboxRepository.cs:132). So a query that actually executes returns in well under a second.
The proxy wiring is textbook. notification-outbox-proxy is a standard ClusterSingletonProxy for /user/notification-outbox-singleton, handed to CommunicationService.SetNotificationOutbox(...) → src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:367-379 (and :515-524 for site-call-audit). GetNotificationOutbox() is non-null (a null ref would throw fast, not hang).
No relevant exception/serialization error is logged on either central node at query time (ActorInitializationException, restart loop, cannot be serialized, no serializer — none).
Singleton-agnostic to the dispatch loop. NotificationOutbox has a 5s dispatcher loop; SiteCallAudit has no periodic loop at all (deferred) — yet both hang identically, so the dispatch loop is not the shared cause. The loop is also fire-and-forget (RunDispatchPass(...).PipeTo(Self)), so it cannot starve the mailbox.

Net: handler, proxy wiring, singleton lifecycle, and DB query are all correct, and the table is empty — yet a local Ask to the singleton never replies within 30s. The defect is at the singleton activation / message-processing boundary on the live node, not in the visible code.

Leading hypotheses (not yet confirmed — need runtime instrumentation)

The singleton instance is not draining its mailbox even though the ClusterSingletonManager reports it started (a half-activated / perpetually-restarting / stuck instance). The manager holds the name and the proxy "identifies" it, but messages are buffered and never processed → 30s timeout with no per-query exception. Both singletons share construction via Props.Create(() => new …(_serviceProvider, …)) (AkkaHostedService.cs:357, :471) and the _serviceProvider.CreateScope() + ScadaBridgeDbContext pattern in their handlers — a shared activation-time defect would hit both.
DB scope/connection acquisition from the actor's root-provider scope hangs (e.g. a leaked-connection / pool-wait specific to _serviceProvider.CreateScope() in the actor, vs the request-scoped DbContext that /admin/sites uses successfully). The 30s is exactly the Ask timeout, so any handler-side hang ≥30s presents identically.
The reply cannot be delivered back to the Ask's temporary actor (less likely for a local Ask, but not disproven).

How to confirm (next step)

Bisect "message never reaches the singleton (routing)" vs "singleton receives but never replies (handler/DB)":

Turn on Akka receive logging for the run — akka.loglevel = DEBUG and akka.actor.debug.receive = on in the Host's HOCON (AkkaHostedService.BuildHocon, ~AkkaHostedService.cs:171-216) — or add a single _log.Info("HandleQuery received …") line at the top of NotificationOutboxActor.HandleQuery, then bash docker/deploy.sh and hit /notifications/report once.
- If the log line does not appear → the message isn't reaching the singleton (routing / proxy / mailbox-stuck) → investigate the singleton activation + proxy delivery.
- If it does appear → the handler/QueryOutboxAsync is hanging → wrap with timing around CreateScope(), GetRequiredService, and await repository.QueryAsync(...) to find which awaits.
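The "wrap with timing" step could look roughly like the helper below. It is a sketch only: StepTimer and TimeAsync are hypothetical names, not existing project code, and the log callback is assumed to forward to whatever logger the actor already uses.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class StepTimer
{
    // Logs when a step starts and how long it took once it completes or throws.
    // A "starting" line with no matching "finished" line marks the await that hangs.
    public static async Task<T> TimeAsync<T>(string step, Func<Task<T>> action, Action<string> log)
    {
        log($"{step} starting");
        var stopwatch = Stopwatch.StartNew();
        try
        {
            return await action();
        }
        finally
        {
            log($"{step} finished after {stopwatch.ElapsedMilliseconds} ms");
        }
    }
}
```

Wrapping each awaited call in the handler separately (scope creation, service resolution, the repository query) would narrow the hang to a single step.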

Blocked tests

All currently-failing Playwright tests are blocked solely by this hang (they load the affected pages):

Audit.AuditLogPageTests.NotificationsPage_RendersAuditDrillInLinkPattern (loads /notifications/report)
All SiteCalls.SiteCallsPageTests page tests (load /site-calls/report): PageLoads_ForDeploymentUser, FilterNarrowing_ChannelFilterShrinksGrid, RetryClickThrough_OnParkedRow_ConfirmsRelayAndShowsOutcomeToast, RetryDiscard_VisibleOnlyOnParkedRows, DrillIn_ViewAuditHistory_NavigatesToPreFilteredAuditLog.

The rest of the suite is green (the Audit grid/drawer tests pass after the AuditDataSeeder canonical-schema fix landed in the same session).

Notes

Pre-existing: the hang was present before any test-suite or cluster-restart work this session, and the restarts did not cause it (the cluster is healthy active/standby afterward).
Timeframe correlation only (not proven causal): this surfaced around the audit subsystem re-architecture (CollapseAuditLogToCanonical) — but the NotificationOutbox/SiteCallAudit handlers and repositories read the unchanged-and-empty Notifications/SiteCalls tables and are themselves correct, so the defect is at the singleton hosting/messaging layer rather than in the audit-table change.
