diff --git a/docs/known-issues/2026-06-05-central-report-pages-singleton-query-hang.md b/docs/known-issues/2026-06-05-central-report-pages-singleton-query-hang.md
new file mode 100644
index 00000000..f45cb553
--- /dev/null
+++ b/docs/known-issues/2026-06-05-central-report-pages-singleton-query-hang.md
@@ -0,0 +1,225 @@
+# Central report pages hang ~30s — NotificationOutbox / SiteCallAudit singleton query Asks never reply
+
+**Status:** FIXED — verified 2026-06-05 (pending commit) · **Severity:** High (real users see 30s page loads) · **Found:** 2026-06-05
+**Components:** Notification Outbox (#21), Site Call Audit (#22), Central UI (#9), Host/cluster (#15/#13)
+
+## FIX APPLIED & VERIFIED (2026-06-05)
+
+`HOST-021`. The Akka `ActorSystem` DI bridge was changed from `AddTransient` to a **singleton**
+routed through a new lazy, idempotent, thread-safe `AkkaHostedService.GetOrCreateActorSystem()`
+(creates the system once on first call from either `StartAsync` or the DI factory). A singleton
+is resolved from the root provider and is never disposed by a per-probe health-check child
+scope, so the `ActorSystem.Dispose()` → `Terminate()` no longer fires; routing through the
+creator (rather than a plain `AddSingleton(sp => …ActorSystem)` factory) avoids caching a
+`null` if a probe wins the startup race.
+
+Files:
+- `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs` — new `GetOrCreateActorSystem()` +
+  `_actorSystemLock`; `StartAsync` calls it instead of creating the system inline.
+- `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs` (central) and `SiteServiceRegistration.cs` (site)
+  — `AddTransient` → `AddSingleton(sp => …GetOrCreateActorSystem())`.
+
+Verification after `bash docker/deploy.sh`:
+- `ActorSystemTerminateReason` post-startup occurrences: **0** on both central nodes (was 1/boot).
+- `/health/active`: central-a **Healthy "Active node (cluster leader)"**, central-b **"Up but
+  not the cluster leader"** — correct active/standby (was both Standby Exiting/Removed).
+- Page render: `/notifications/report` **0.069s**, `/notifications/kpis` **0.091s**,
+  `/site-calls/report` **0.026s**, `/monitoring/health` **0.058s** (all were ~30s+).
+- Playwright E2E: **68 passed / 0 failed / 0 skipped** (was 62/6/0).
+
+## ROOT CAUSE (confirmed 2026-06-05 — supersedes the hypotheses below)
+
+The Akka `ActorSystem` is a **process singleton owned by `AkkaHostedService`**, but it is
+registered into DI as a **`Transient`** via a factory:
+
+```csharp
+// Program.cs:211 (central) and SiteServiceRegistration.cs:82 (site)
+builder.Services.AddTransient(sp =>
+    sp.GetRequiredService<AkkaHostedService>().ActorSystem!);
+```
+
+`ActorSystem` is `IDisposable`. In Microsoft.Extensions.DependencyInjection, an `IDisposable`
+produced by a `Transient`/`Scoped` factory is **captured for disposal by the scope that
+resolved it**. The shared `ZB.MOM.WW.Health.Akka` checks (`AkkaClusterHealthCheck`,
+`ActiveNodeHealthCheck`) are registered with `AddTypeActivatedCheck` and resolve the system
+**lazily per probe** — `_serviceProvider.GetService<ActorSystem>()`
+(`AkkaClusterHealthCheck.cs:42`, `ActiveNodeHealthCheck.cs:102`). `HealthCheckService` runs
+each probe in its **own child scope**, so every `/health/ready` and `/health/active` probe:
+
+1. resolves the live `ActorSystem` (a `Transient`) into the probe's child scope,
+2. the probe completes and `HealthCheckService` disposes the scope,
+3. the container disposes the captured `ActorSystem` → `ActorSystem.Dispose()` →
+   `CoordinatedShutdown.Run(ActorSystemTerminateReason)` → the node Leaves → Exiting → the
+   actor system terminates.
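
The capture-and-dispose sequence in the three steps above can be reproduced without Akka or a cluster. The following is a minimal sketch, not the project's code: `FakeActorSystem`, `FakeOwner`, and `ScopeDisposalDemo` are hypothetical stand-ins for `ActorSystem`, `AkkaHostedService`, and a health probe, and it assumes the `Microsoft.Extensions.DependencyInjection` package is referenced. It contrasts a transient factory bridge with a singleton bridge routed through a get-or-create method.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical stand-in for ActorSystem: records whether the container disposed it.
public sealed class FakeActorSystem : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

// Hypothetical stand-in for AkkaHostedService: owns the one instance and
// creates it lazily, exactly once, under a lock.
public sealed class FakeOwner
{
    private readonly object _gate = new();
    private FakeActorSystem? _system;

    public FakeActorSystem GetOrCreate()
    {
        lock (_gate)
        {
            return _system ??= new FakeActorSystem();
        }
    }
}

public static class ScopeDisposalDemo
{
    // Resolves the system inside a child scope (as a per-probe health check would),
    // disposes that scope, and reports whether the owned instance was disposed.
    public static bool DisposedAfterScope(bool registerAsTransient)
    {
        var services = new ServiceCollection();
        services.AddSingleton<FakeOwner>();

        if (registerAsTransient)
        {
            // The defective shape: a transient factory handing out the owned instance.
            services.AddTransient<FakeActorSystem>(
                sp => sp.GetRequiredService<FakeOwner>().GetOrCreate());
        }
        else
        {
            // The fixed shape: a singleton factory routed through get-or-create.
            services.AddSingleton<FakeActorSystem>(
                sp => sp.GetRequiredService<FakeOwner>().GetOrCreate());
        }

        using var root = services.BuildServiceProvider();
        var owned = root.GetRequiredService<FakeOwner>().GetOrCreate();

        using (var scope = root.CreateScope())
        {
            _ = scope.ServiceProvider.GetRequiredService<FakeActorSystem>();
        } // the child scope is disposed here, along with any disposables it captured

        // Evaluated before the root provider itself is disposed.
        return owned.Disposed;
    }

    public static void Main()
    {
        Console.WriteLine($"transient bridge disposed: {DisposedAfterScope(true)}");
        Console.WriteLine($"singleton bridge disposed: {DisposedAfterScope(false)}");
    }
}
```

Under those assumptions the transient registration reports `True` (the child scope disposed the shared instance) and the singleton registration reports `False` (the instance is tracked by the root provider and outlives the child scope).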
+
+The ASP.NET host process keeps running (only a DI-tracked transient was disposed; the
+hosted service's `StopAsync`/`ClrExitReason` path never runs), so the node is left
+**permanently dead** — member status frozen at `Exiting` (central-a) / `Removed` (central-b),
+no `Up` member, the cluster singletons have no host, and every Central UI page that `Ask`s a
+singleton proxy buffers the message until the 30s `QueryTimeout`. The health checks meant to
+*report* cluster status are what *kill* the cluster.
+
+**Evidence (clean redeploy, 2026-06-05 11:43):** central-a forms its cluster, goes `Up`, the
+singletons start + are identified (11:43:10.77); the first `GET /health/active` lands at
+11:43:14; `CoordinatedShutdown … ActorSystemTerminateReason … ExitCode:0` fires immediately
+(11:43:14.801); node Leaves → Exiting → "Successfully shut down" (11:43:24); process stays up
+serving HTTP. central-b shows the identical pattern at 11:43:17. `/health/ready` then = 503 on
+central-b, and `/health/active` = `Standby: node is not Up (status: Exiting/Removed)` on both.
+No application code calls `.Terminate()` (grep), confirming the disposal path.
+
+**Why earlier analysis missed it:** the prior hypotheses examined the actor handler, proxy
+wiring, singleton lifecycle, and DB — all of which are correct. They are irrelevant because
+the `ActorSystem` is simply **dead** by the time a page queries it. "Deterministic, survives
+restart and full redeploy" is fully explained: it is a DI-lifetime code defect that
+re-triggers on the first post-`Up` health probe every boot.
+
+**Fix (pending, task #48):** stop the container from disposing the externally-owned
+`ActorSystem`. It must be resolvable from DI as the live instance (the kit calls
+`GetService<ActorSystem>()`), re-readable (must not cache `null` during warmup), and never
+disposed by a child scope.
+A `Transient`/`Scoped` factory returning the `IDisposable` system
+is always captured by the resolving scope, and a plain `AddSingleton(factory)` caches whatever
+the first resolve sees (→ permanent `null` if a probe wins the warmup race). The chosen fix is
+a lazy, idempotent, thread-safe `AkkaHostedService.GetOrCreateActorSystem()` (creates the
+system once on first call from either `StartAsync` or the DI factory) registered as
+`AddSingleton(sp => sp.GetRequiredService<AkkaHostedService>().GetOrCreateActorSystem())`
+— a process singleton, so child-scope disposal never touches it, and never `null` because the
+first resolve creates it. Apply in **both** the central (`Program.cs`) and site
+(`SiteServiceRegistration.cs`) registrations.
+
+## Summary
+
+The Central UI pages that query the central **cluster singletons** — `/notifications/report`,
+`/notifications/kpis`, `/site-calls/report`, and the `/monitoring/health` KPI tiles — hang for
+**exactly ~30 seconds** during server render, then render an empty/error state. The hang is the
+Akka `Ask` to the `notification-outbox` / `site-call-audit` singletons timing out at
+`CommunicationOptions.QueryTimeout` (30s): **the singleton never replies**. Every other page is
+fast. This is **deterministic** (it survived a clean cluster restart) and is **not** a test
+problem — the E2E tests that load these pages are correctly failing. The root cause was *not*
+pinned by static analysis + a restart; the remaining step is runtime instrumentation (below).
+
+## Affected surface
+
+| Page | Server render time | Path |
+|------|--------------------|------|
+| `/admin/sites` | 0.026s | DB (`ISiteRepository`) |
+| `/audit/log` | 0.044s | DB |
+| `/deployment/deployments` | 0.018s | `CentralCommunicationActor` (local actor) Ask |
+| `/design/templates` | 0.013s | DB |
+| **`/notifications/report`** | **30.01s** | `GetNotificationOutbox()` singleton-proxy Ask |
+| **`/notifications/kpis`** | **30.05s** | `GetNotificationOutbox()` singleton-proxy Ask |
+| **`/site-calls/report`** | **30.02s** | `GetSiteCallAudit()` singleton-proxy Ask |
+| **`/monitoring/health`** | **>35s** | both singleton KPIs |
+
+Measured with an authenticated `curl` against the live cluster (`http://localhost:9000`), so it
+is a **server-side prerender** hang, independent of the browser.
+
+## Trigger path
+
+`NotificationReport.OnInitializedAsync` → `RefreshAll()` → `FetchPage()` →
+`CommunicationService.QueryNotificationOutboxAsync(request)` →
+`GetNotificationOutbox().Ask(request, _options.QueryTimeout)`.
+The page auto-queries the singleton on init (during prerender); the `Ask` times out at 30s.
+`SiteCallsReport` does the analogous thing for the `site-call-audit` singleton.
+
+- `CommunicationService.QueryNotificationOutboxAsync` → `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs:456`
+- `GetNotificationOutbox()` returns the cached `_notificationOutboxProxy` → `CommunicationService.cs:100`
+
+## What was verified correct / ruled out (with evidence)
+
+1. **The CentralCommunicationActor path is healthy.** `/deployment/deployments`
+   (`GetActor().Ask`, a node-local actor) returns in 0.018s. The cluster/ClusterClient transport
+   and the Ask machinery work. Only the **singleton-proxy** Asks hang. This is the key asymmetry.
+2.
+   **The singletons start and are reachable by their proxies.** On the current boot the active
+   node (central-a) logged `NotificationOutbox singleton created and registered…`,
+   `SiteCallAuditActor singleton created and registered…`,
+   `ClusterSingletonManager state change [Start -> Oldest]`,
+   `Singleton manager started singleton actor [.../notification-outbox]`, and the proxy logged
+   `Singleton identified at [.../notification-outbox-singleton/notification-outbox]`.
+3. **Not cluster state — neither a restart nor a full redeploy fixes it.** Restarting the central
+   nodes (sequentially, then both together) re-formed a healthy active/standby cluster (central-a
+   active leader, central-b standby) with the singletons started + identified on the active node
+   (so the Ask is **local**), and the pages **still hung exactly 30s**. A subsequent **full
+   `docker/deploy.sh`** (fresh image rebuild + recreation of all containers) *also* left the pages
+   hanging exactly 30s. This rules out a stale-proxy / wedged-singleton cluster-state explanation,
+   a stale binary, and cross-node serialization (a local Ask is not serialized) — the defect is
+   **deterministic**.
+4. **The query handlers are correct.** Both `PipeTo` the async query to the captured `Sender`
+   with a **failure projection on every path** (a faulted query replies `Success:false`, it does
+   not hang):
+   - `NotificationOutboxActor.HandleQuery` → `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/NotificationOutboxActor.cs:760` (PipeTo at :765, failure arm :768)
+   - `SiteCallAuditActor.HandleQuery` → `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:224` (PipeTo at :229, failure arm :231)
+5. **The DB query would be instant.** `Notifications` and `SiteCalls` are **empty (0 rows, 0 ms)**
+   in the live DB, and the repository query is a plain EF `Where`/paginate
+   (`NotificationOutboxRepository.QueryAsync` → `…/Repositories/NotificationOutboxRepository.cs:132`).
+   So a query that actually executes returns in well under a second.
+6. **The proxy wiring is textbook.** `notification-outbox-proxy` is a standard
+   `ClusterSingletonProxy` for `/user/notification-outbox-singleton`, handed to
+   `CommunicationService.SetNotificationOutbox(...)` →
+   `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:367-379` (and `:515-524` for site-call-audit).
+   `GetNotificationOutbox()` is non-null (a null ref would throw fast, not hang).
+7. **No relevant exception/serialization error is logged** on either central node at query time
+   (`ActorInitializationException`, restart loop, `cannot be serialized`, `no serializer` — none).
+8. **Singleton-agnostic to the dispatch loop.** `NotificationOutbox` has a 5s dispatcher loop;
+   `SiteCallAudit` has **no** periodic loop at all (deferred) — yet both hang identically, so the
+   dispatch loop is not the shared cause. The loop is also fire-and-forget
+   (`RunDispatchPass(...).PipeTo(Self)`), so it cannot starve the mailbox.
+
+Net: handler, proxy wiring, singleton lifecycle, and DB query are all correct, and the table is
+empty — yet a **local** Ask to the singleton never replies within 30s. The defect is at the
+singleton **activation / message-processing boundary** on the live node, not in the visible code.
+
+## Leading hypotheses (not yet confirmed — need runtime instrumentation)
+
+1. **The singleton instance is not draining its mailbox** even though the
+   `ClusterSingletonManager` reports it started (a half-activated / perpetually-restarting / stuck
+   instance). The manager holds the name and the proxy "identifies" it, but messages are buffered
+   and never processed → 30s timeout with no per-query exception. Both singletons share
+   construction via `Props.Create(() => new …(_serviceProvider, …))`
+   (`AkkaHostedService.cs:357`, `:471`) and the `_serviceProvider.CreateScope()` +
+   `ScadaBridgeDbContext` pattern in their handlers — a shared activation-time defect would hit
+   both.
+2.
+   **DB scope/connection acquisition from the actor's root-provider scope hangs** (e.g. a
+   leaked-connection / pool-wait specific to `_serviceProvider.CreateScope()` in the actor, vs the
+   request-scoped DbContext that `/admin/sites` uses successfully). The 30s is *exactly* the Ask
+   timeout, so any handler-side hang ≥30s presents identically.
+3. **The reply cannot be delivered back to the Ask's temporary actor** (less likely for a local
+   Ask, but not disproven).
+
+## How to confirm (next step)
+
+Bisect "message never reaches the singleton (routing)" vs "singleton receives but never replies
+(handler/DB)":
+
+- Turn on Akka receive logging for the run — `akka.loglevel = DEBUG` and
+  `akka.actor.debug.receive = on` in the Host's HOCON (`AkkaHostedService.BuildHocon`,
+  ~`AkkaHostedService.cs:171-216`) — or add a single `_log.Info("HandleQuery received …")` line at
+  the top of `NotificationOutboxActor.HandleQuery`, then `bash docker/deploy.sh` and hit
+  `/notifications/report` once.
+  - If the log line **does not** appear → the message isn't reaching the singleton (routing /
+    proxy / mailbox-stuck) → investigate the singleton activation + proxy delivery.
+  - If it **does** appear → the handler/`QueryOutboxAsync` is hanging → wrap with timing around
+    `CreateScope()`, `GetRequiredService`, and `await repository.QueryAsync(...)` to find which
+    awaits.
+
+## Blocked tests
+
+All currently-failing Playwright tests are blocked solely by this hang (they load the affected
+pages):
+
+- `Audit.AuditLogPageTests.NotificationsPage_RendersAuditDrillInLinkPattern` (loads `/notifications/report`)
+- All `SiteCalls.SiteCallsPageTests` page tests (load `/site-calls/report`): `PageLoads_ForDeploymentUser`,
+  `FilterNarrowing_ChannelFilterShrinksGrid`, `RetryClickThrough_OnParkedRow_ConfirmsRelayAndShowsOutcomeToast`,
+  `RetryDiscard_VisibleOnlyOnParkedRows`, `DrillIn_ViewAuditHistory_NavigatesToPreFilteredAuditLog`.
+
+The rest of the suite is green (the Audit grid/drawer tests pass after the `AuditDataSeeder`
+canonical-schema fix landed in the same session).
+
+## Notes
+
+- Pre-existing: the hang was present before any test-suite or cluster-restart work this session,
+  and the restarts did not cause it (the cluster is healthy active/standby afterward).
+- Timeframe correlation only (not proven causal): this surfaced around the audit subsystem
+  re-architecture (`CollapseAuditLogToCanonical`) — but the NotificationOutbox/SiteCallAudit
+  handlers and repositories read the unchanged-and-empty `Notifications`/`SiteCalls` tables and
+  are themselves correct, so the defect is at the singleton hosting/messaging layer rather than in
+  the audit-table change.
diff --git a/src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs b/src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs
index ebb01ee6..ab93ce56 100644
--- a/src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs
+++ b/src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs
@@ -34,6 +34,14 @@ public class AkkaHostedService : IHostedService
     private readonly CommunicationOptions _communicationOptions;
     private readonly ILogger _logger;
     private ActorSystem? _actorSystem;
+
+    /// <summary>
+    /// Guards the one-time creation of <see cref="_actorSystem"/> in
+    /// <see cref="GetOrCreateActorSystem"/> so <see cref="StartAsync"/> and a concurrent
+    /// health-probe resolution of the DI singleton race to create
+    /// it exactly once (HOST-021).
+    /// </summary>
+    private readonly object _actorSystemLock = new();
     /// <summary>
     /// Auxiliary IDisposables (e.g. the SiteAuditTelemetryStalledTracker)
     /// that this hosted service constructs at start time and must tear down
@@ -91,38 +99,18 @@ public class AkkaHostedService : IHostedService
     /// <returns>A task representing the asynchronous operation.</returns>
     public async Task StartAsync(CancellationToken cancellationToken)
     {
-        // For site nodes, include a site-specific role (e.g., "site-SiteA") alongside the base role
-        var roles = BuildRoles();
-
-        // WP-3: Transport heartbeat explicitly configured from CommunicationOptions (not framework defaults)
-        var transportHeartbeatSec = _communicationOptions.TransportHeartbeatInterval.TotalSeconds;
-        var transportFailureSec = _communicationOptions.TransportFailureThreshold.TotalSeconds;
-
-        // Host-006: HOCON is assembled in a dedicated builder that quotes/escapes every
-        // interpolated value, so a hostname, seed node or strategy containing a quote,
-        // backslash or whitespace cannot corrupt the configuration document.
-        var hocon = BuildHocon(_nodeOptions, _clusterOptions, roles,
-            _communicationOptions.TransportHeartbeatInterval,
-            _communicationOptions.TransportFailureThreshold);
-
-        var config = ConfigurationFactory.ParseString(hocon);
-        _actorSystem = ActorSystem.Create("scadabridge", config);
-
-        _logger.LogInformation(
-            "Akka.NET actor system 'scadabridge' started. Role={Role}, Roles={Roles}, Hostname={Hostname}, Port={Port}, " +
-            "TransportHeartbeat={TransportHeartbeat}s, TransportFailure={TransportFailure}s",
-            _nodeOptions.Role,
-            string.Join(", ", roles),
-            _nodeOptions.NodeHostname,
-            _nodeOptions.RemotingPort,
-            transportHeartbeatSec,
-            transportFailureSec);
+        // HOST-021: create (or reuse) the externally-owned, process-singleton ActorSystem. A
+        // health probe may already have created it via the DI singleton bridge
+        // (GetOrCreateActorSystem) before this hosted service's StartAsync ran; either way the
+        // call yields the one instance and sets _actorSystem. Actor registration below then
+        // runs on it.
+        var actorSystem = GetOrCreateActorSystem();
 
         // Register the dead letter monitor actor
         var loggerFactory = _serviceProvider.GetRequiredService<ILoggerFactory>();
         var dlmLogger = loggerFactory.CreateLogger();
         var dlmHealthCollector = _serviceProvider.GetService();
-        _actorSystem.ActorOf(
+        actorSystem.ActorOf(
             Props.Create(() => new DeadLetterMonitorActor(dlmLogger, dlmHealthCollector)),
             "dead-letter-monitor");
 
@@ -137,6 +125,72 @@ public class AkkaHostedService : IHostedService
         }
     }
 
+    /// <summary>
+    /// Returns the process-wide Akka <see cref="ActorSystem"/>, creating it on first call.
+    /// Idempotent and thread-safe: both <see cref="StartAsync"/> and the DI bridge that
+    /// exposes the system to the shared ZB.MOM.WW.Health.Akka checks call this, and
+    /// whichever runs first creates the system exactly once.
+    /// </summary>
+    /// <remarks>
+    /// HOST-021: the <see cref="ActorSystem"/> is an externally-owned process singleton — its
+    /// lifecycle is this hosted service's (created here, torn down via
+    /// CoordinatedShutdown in <see cref="StopAsync"/>). It MUST be registered in DI as a
+    /// singleton resolved through this method, never as a transient/scoped factory:
+    /// <see cref="ActorSystem"/> is <see cref="IDisposable"/>, and a transient/scoped factory
+    /// hands a fresh disposable to every resolving child scope (e.g. each per-probe
+    /// health-check scope), so the container disposes it when that scope ends —
+    /// ActorSystem.Dispose() runs CoordinatedShutdown(ActorSystemTerminateReason)
+    /// and tears the live cluster node down mid-flight, which is exactly the
+    /// "central report pages hang 30s" defect this method fixes. Creating the system here and
+    /// exposing it as a singleton keeps child-scope disposal away from it; routing the singleton
+    /// through this method (rather than a plain AddSingleton(sp => ...ActorSystem)
+    /// factory) also avoids caching a null if a health probe wins the startup race, since
+    /// the first resolve creates the system instead of capturing a not-yet-started reference.
+    /// </remarks>
+    /// <returns>The single live actor system.</returns>
+    public ActorSystem GetOrCreateActorSystem()
+    {
+        if (_actorSystem is not null)
+        {
+            return _actorSystem;
+        }
+
+        lock (_actorSystemLock)
+        {
+            if (_actorSystem is not null)
+            {
+                return _actorSystem;
+            }
+
+            // For site nodes, include a site-specific role (e.g., "site-SiteA") alongside the base role
+            var roles = BuildRoles();
+
+            // Host-006: HOCON is assembled in a dedicated builder that quotes/escapes every
+            // interpolated value, so a hostname, seed node or strategy containing a quote,
+            // backslash or whitespace cannot corrupt the configuration document.
+            var hocon = BuildHocon(_nodeOptions, _clusterOptions, roles,
+                _communicationOptions.TransportHeartbeatInterval,
+                _communicationOptions.TransportFailureThreshold);
+
+            var config = ConfigurationFactory.ParseString(hocon);
+            var system = ActorSystem.Create("scadabridge", config);
+
+            _logger.LogInformation(
+                "Akka.NET actor system 'scadabridge' started. Role={Role}, Roles={Roles}, Hostname={Hostname}, Port={Port}, " +
+                "TransportHeartbeat={TransportHeartbeat}s, TransportFailure={TransportFailure}s",
+                _nodeOptions.Role,
+                string.Join(", ", roles),
+                _nodeOptions.NodeHostname,
+                _nodeOptions.RemotingPort,
+                _communicationOptions.TransportHeartbeatInterval.TotalSeconds,
+                _communicationOptions.TransportFailureThreshold.TotalSeconds);
+
+            // Publish last so a concurrent reader never observes a half-constructed system.
+            _actorSystem = system;
+            return _actorSystem;
+        }
+    }
+
     /// <summary>
     /// Builds the Akka HOCON configuration document.
     /// Every interpolated value is
     /// routed through (string values) so a hostname,
diff --git a/src/ZB.MOM.WW.ScadaBridge.Host/Program.cs b/src/ZB.MOM.WW.ScadaBridge.Host/Program.cs
index 38d89120..8d3f36aa 100644
--- a/src/ZB.MOM.WW.ScadaBridge.Host/Program.cs
+++ b/src/ZB.MOM.WW.ScadaBridge.Host/Program.cs
@@ -204,12 +204,17 @@ try
     builder.Services.AddSingleton<AkkaHostedService>();
     builder.Services.AddHostedService(sp => sp.GetRequiredService<AkkaHostedService>());
 
-    // The shared ZB.MOM.WW.Health Akka checks resolve ActorSystem from DI. ScadaBridge owns the
-    // ActorSystem inside AkkaHostedService (not a DI singleton), so bridge it as TRANSIENT: each
-    // resolve re-reads the current value — null while warming up (checks → Degraded), live after.
-    // The factory must NOT throw: GetService<ActorSystem>() must return null (not raise) pre-start.
-    builder.Services.AddTransient(sp =>
-        sp.GetRequiredService<AkkaHostedService>().ActorSystem!);
+    // HOST-021: bridge the AkkaHostedService-owned ActorSystem to DI as a SINGLETON via
+    // GetOrCreateActorSystem(). The shared ZB.MOM.WW.Health Akka checks resolve ActorSystem
+    // from DI, per probe, inside a child scope. ActorSystem is IDisposable, so a TRANSIENT
+    // (or scoped) bridge is captured-and-disposed by each probe's scope — disposing the live
+    // system mid-flight (CoordinatedShutdown/ActorSystemTerminateReason) and wedging the
+    // central report pages at the 30s Ask timeout. A singleton is resolved from the root and
+    // never disposed by a child scope; routing through GetOrCreateActorSystem (instead of a
+    // plain singleton factory over .ActorSystem) means the first resolve CREATES the system
+    // rather than caching a null if a probe wins the startup race.
+    builder.Services.AddSingleton(sp =>
+        sp.GetRequiredService<AkkaHostedService>().GetOrCreateActorSystem());
 
     // InboundAPI-022: register the production IActiveNodeGate implementation so
     // standby-node gating is actually enforced (the InboundApiEndpointFilter
diff --git a/src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs b/src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs
index dd400206..70cfec10 100644
--- a/src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs
+++ b/src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs
@@ -75,12 +75,17 @@ public static class SiteServiceRegistration
         services.AddSingleton<AkkaHostedService>();
         services.AddHostedService(sp => sp.GetRequiredService<AkkaHostedService>());
 
-        // The shared ZB.MOM.WW.Health Akka checks resolve ActorSystem from DI. ScadaBridge owns the
-        // ActorSystem inside AkkaHostedService (not a DI singleton), so bridge it as TRANSIENT: each
-        // resolve re-reads the current value — null while warming up (checks → Degraded), live after.
-        // The factory must NOT throw: GetService<ActorSystem>() must return null (not raise) pre-start.
-        services.AddTransient(sp =>
-            sp.GetRequiredService<AkkaHostedService>().ActorSystem!);
+        // HOST-021: bridge the AkkaHostedService-owned ActorSystem to DI as a SINGLETON via
+        // GetOrCreateActorSystem(). The shared ZB.MOM.WW.Health Akka checks resolve ActorSystem
+        // from DI, per probe, inside a child scope. ActorSystem is IDisposable, so a TRANSIENT
+        // (or scoped) bridge is captured-and-disposed by each probe's scope — disposing the live
+        // system mid-flight (CoordinatedShutdown/ActorSystemTerminateReason) and tearing down the
+        // node. A singleton is resolved from the root and never disposed by a child scope; routing
+        // through GetOrCreateActorSystem (instead of a plain singleton factory over .ActorSystem)
+        // means the first resolve CREATES the system rather than caching a null if a probe wins
+        // the startup race.
+        services.AddSingleton(sp =>
+            sp.GetRequiredService<AkkaHostedService>().GetOrCreateActorSystem());
 
         // Cluster node status provider for health reports
         services.AddSingleton(sp =>