fix(host): register ActorSystem as DI singleton so health-probe scopes don't dispose it (HOST-021)
Per-probe health-check child scopes were disposing the AddTransient-bridged ActorSystem (IDisposable), terminating the live cluster node ~4s after boot and leaving every singleton-proxy Ask to hang the full 30s QueryTimeout — the central report pages (/notifications, /site-calls, /monitoring/health) loaded in ~30s. Bridge it as a singleton via a new lazy AkkaHostedService.GetOrCreateActorSystem() so child-scope disposal never touches it. Verified: 0 post-startup terminates, healthy active/standby, report pages ~0.05s, Playwright 68 passed / 0 failed.
@@ -0,0 +1,225 @@
# Central report pages hang ~30s — NotificationOutbox / SiteCallAudit singleton query Asks never reply

**Status:** FIXED — verified 2026-06-05 (pending commit) · **Severity:** High (real users see 30s page loads) · **Found:** 2026-06-05
**Components:** Notification Outbox (#21), Site Call Audit (#22), Central UI (#9), Host/cluster (#15/#13)

## FIX APPLIED & VERIFIED (2026-06-05)

`HOST-021`. The Akka `ActorSystem` DI bridge was changed from `AddTransient` to a **singleton**
routed through a new lazy, idempotent, thread-safe `AkkaHostedService.GetOrCreateActorSystem()`
(creates the system once on first call from either `StartAsync` or the DI factory). A singleton
is resolved from the root provider and is never disposed by a per-probe health-check child
scope, so the `ActorSystem.Dispose()` → `Terminate()` no longer fires; routing through the
creator (rather than a plain `AddSingleton(sp => …ActorSystem)` factory) avoids caching a
`null` if a probe wins the startup race.
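
The creation pattern described above can be sketched in isolation. This is a minimal illustration, not the committed code: `FakeSystem`, `SystemOwner`, and `LazyCreationDemo` are hypothetical stand-ins with no Akka dependency, and the explicit `Volatile` reads and writes are a choice made for this sketch only.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an expensive, disposable resource such as an ActorSystem.
public sealed class FakeSystem : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

// Hypothetical owner that creates the resource at most once, whoever asks first.
public sealed class SystemOwner
{
    private readonly object _gate = new();
    private FakeSystem? _system;
    private int _createCount;

    // Number of times the creation branch actually ran.
    public int CreateCount => Volatile.Read(ref _createCount);

    public FakeSystem GetOrCreate()
    {
        // Fast path: the instance is already published, so no lock is needed.
        var existing = Volatile.Read(ref _system);
        if (existing is not null) return existing;

        lock (_gate)
        {
            // Re-check under the lock: another caller may have won the race.
            if (_system is not null) return _system;

            var created = new FakeSystem();
            Interlocked.Increment(ref _createCount);

            // Publish last, after construction has finished.
            Volatile.Write(ref _system, created);
            return created;
        }
    }
}

public static class LazyCreationDemo
{
    // Calls GetOrCreate from many threads and reports whether all saw one instance.
    public static bool AllCallersShareOneInstance(int callers = 32)
    {
        var owner = new SystemOwner();
        var results = new FakeSystem[callers];
        Parallel.For(0, callers, i => results[i] = owner.GetOrCreate());
        return owner.CreateCount == 1
            && Array.TrueForAll(results, r => ReferenceEquals(r, results[0]));
    }
}
```

In this sketch it does not matter whether a startup path or a DI factory calls `GetOrCreate()` first: both receive the same instance, and the creation branch runs once.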

Files:
- `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs` — new `GetOrCreateActorSystem()`
  + `_actorSystemLock`; `StartAsync` calls it instead of creating the system inline.
- `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs` (central) and `SiteServiceRegistration.cs` (site)
  — `AddTransient<ActorSystem>` → `AddSingleton<ActorSystem>(sp => …GetOrCreateActorSystem())`.

Verification after `bash docker/deploy.sh`:
- `ActorSystemTerminateReason` post-startup occurrences: **0** on both central nodes (was 1/boot).
- `/health/active`: central-a **Healthy "Active node (cluster leader)"**, central-b **"Up but
  not the cluster leader"** — correct active/standby (was both Standby Exiting/Removed).
- Page render: `/notifications/report` **0.069s**, `/notifications/kpis` **0.091s**,
  `/site-calls/report` **0.026s**, `/monitoring/health` **0.058s** (all were ~30s+).
- Playwright E2E: **68 passed / 0 failed / 0 skipped** (was 62/6/0).

## ROOT CAUSE (confirmed 2026-06-05 — supersedes the hypotheses below)

The Akka `ActorSystem` is a **process singleton owned by `AkkaHostedService`**, but before the
fix it was registered into DI as a **`Transient`** via a factory:

```csharp
// Program.cs:211 (central) and SiteServiceRegistration.cs:82 (site)
builder.Services.AddTransient<Akka.Actor.ActorSystem>(sp =>
    sp.GetRequiredService<AkkaHostedService>().ActorSystem!);
```

`ActorSystem` is `IDisposable`. In Microsoft.Extensions.DependencyInjection, an `IDisposable`
produced by a `Transient`/`Scoped` factory is **captured for disposal by the scope that
resolved it**. The shared `ZB.MOM.WW.Health.Akka` checks (`AkkaClusterHealthCheck`,
`ActiveNodeHealthCheck`) are registered with `AddTypeActivatedCheck` and resolve the system
**lazily per probe** via `_serviceProvider.GetService<ActorSystem>()`
(`AkkaClusterHealthCheck.cs:42`, `ActiveNodeHealthCheck.cs:102`). `HealthCheckService` runs
each probe in its **own child scope**, so on every `/health/ready` and `/health/active` probe:

1. the check resolves the live `ActorSystem` (a `Transient`) into the probe's child scope,
2. the probe completes and `HealthCheckService` disposes the scope,
3. the container disposes the captured `ActorSystem` → `ActorSystem.Dispose()` →
   `CoordinatedShutdown.Run(ActorSystemTerminateReason)` → the node Leaves → Exiting → the
   actor system terminates.
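
The capture-and-dispose behaviour in the steps above can be reproduced without Akka. The sketch below assumes the `Microsoft.Extensions.DependencyInjection` package is referenced; `TrackedSystem` and `ScopeDisposalDemo` are hypothetical names, and the child scope stands in for a per-probe health-check scope.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical stand-in for the externally owned, disposable ActorSystem.
public sealed class TrackedSystem : IDisposable
{
    public int DisposeCalls { get; private set; }
    public void Dispose() => DisposeCalls++;
}

public static class ScopeDisposalDemo
{
    // Resolves TrackedSystem inside one child scope, disposes that scope, and
    // returns how many times the shared instance had been disposed at that point.
    public static int DisposeCallsAfterOneScope(bool registerAsSingleton)
    {
        var shared = new TrackedSystem(); // created and owned outside the container
        var services = new ServiceCollection();

        if (registerAsSingleton)
            services.AddSingleton<TrackedSystem>(_ => shared);
        else
            services.AddTransient<TrackedSystem>(_ => shared);

        using var root = services.BuildServiceProvider();
        using (var scope = root.CreateScope())
        {
            // Same shape as a health check calling GetService<ActorSystem>().
            _ = scope.ServiceProvider.GetService<TrackedSystem>();
        } // the scope disposes every disposable transient it handed out

        // Read the counter while the root provider is still alive.
        return shared.DisposeCalls;
    }
}
```

With the transient registration the method returns 1, because the child scope disposed the shared instance. With the singleton registration it returns 0: a factory-created singleton is tracked by the root provider and is disposed only when the root itself is disposed.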

The ASP.NET host process keeps running (only a DI-tracked transient was disposed; the
hosted service's `StopAsync`/`ClrExitReason` path never runs), so the node is left
**permanently dead**: member status is frozen at `Exiting` (central-a) / `Removed` (central-b),
there is no `Up` member, the cluster singletons have no host, and every Central UI page that
`Ask`s a singleton proxy has its message buffered by the proxy until the 30s `QueryTimeout`
expires. The health checks meant to *report* cluster status are what *kill* the cluster.

**Evidence (clean redeploy, 2026-06-05 11:43):** central-a forms its cluster, goes `Up`, the
singletons start + are identified (11:43:10.77); the first `GET /health/active` lands at
11:43:14; `CoordinatedShutdown … ActorSystemTerminateReason … ExitCode:0` fires immediately
(11:43:14.801); node Leaves → Exiting → "Successfully shut down" (11:43:24); process stays up
serving HTTP. central-b shows the identical pattern at 11:43:17. `/health/ready` then returns
503 on central-b, and `/health/active` reports
`Standby: node is not Up (status: Exiting/Removed)` on both.
No application code calls `.Terminate()` (grep), confirming the disposal path.

**Why earlier analysis missed it:** the prior hypotheses examined the actor handler, proxy
wiring, singleton lifecycle, and DB, all of which are correct. They are irrelevant because
the `ActorSystem` is simply **dead** by the time a page queries it. "Deterministic, survives
restart and full redeploy" is fully explained: it is a DI-lifetime code defect that
re-triggers on the first post-`Up` health probe every boot.

**Fix (task #48; since applied as `HOST-021`, see FIX APPLIED & VERIFIED above):** stop the
container from disposing the externally-owned `ActorSystem`. It must be resolvable from DI as
the live instance (the kit calls `GetService<ActorSystem>()`), re-readable (must not cache
`null` during warmup), and never disposed by a child scope. A `Transient`/`Scoped` factory
returning the `IDisposable` system is always captured by the resolving scope, and a plain
`AddSingleton(factory)` caches whatever the first resolve sees (→ permanent `null` if a probe
wins the warmup race). The chosen fix is a lazy, idempotent, thread-safe
`AkkaHostedService.GetOrCreateActorSystem()` (creates the system once on first call from either
`StartAsync` or the DI factory) registered as
`AddSingleton<ActorSystem>(sp => sp.GetRequiredService<AkkaHostedService>().GetOrCreateActorSystem())`.
This makes it a process singleton, so child-scope disposal never touches it, and it is never
`null` because the first resolve creates it. The change applies to **both** the central
(`Program.cs`) and site (`SiteServiceRegistration.cs`) registrations.

## Summary

The Central UI pages that query the central **cluster singletons** (`/notifications/report`,
`/notifications/kpis`, `/site-calls/report`, and the `/monitoring/health` KPI tiles) hang for
**almost exactly 30 seconds** during server render, then render an empty/error state. The hang
is the Akka `Ask` to the `notification-outbox` / `site-call-audit` singletons timing out at
`CommunicationOptions.QueryTimeout` (30s): **the singleton never replies**. Every other page is
fast. This is **deterministic** (it survived a clean cluster restart) and is **not** a test
problem: the E2E tests that load these pages are correctly failing. At the time this summary
was written, the root cause had *not* been pinned down by static analysis plus a restart, and
the remaining step was runtime instrumentation (below); the ROOT CAUSE section above
supersedes that assessment.

## Affected surface

| Page | Server render time | Path |
|------|--------------------|------|
| `/admin/sites` | 0.026s | DB (`ISiteRepository`) |
| `/audit/log` | 0.044s | DB |
| `/deployment/deployments` | 0.018s | `CentralCommunicationActor` (local actor) Ask |
| `/design/templates` | 0.013s | DB |
| **`/notifications/report`** | **30.01s** | `GetNotificationOutbox()` singleton-proxy Ask |
| **`/notifications/kpis`** | **30.05s** | `GetNotificationOutbox()` singleton-proxy Ask |
| **`/site-calls/report`** | **30.02s** | `GetSiteCallAudit()` singleton-proxy Ask |
| **`/monitoring/health`** | **>35s** | both singleton KPIs |

Measured with an authenticated `curl` against the live cluster (`http://localhost:9000`), so it
is a **server-side prerender** hang, independent of the browser.

## Trigger path

`NotificationReport.OnInitializedAsync` → `RefreshAll()` → `FetchPage()` →
`CommunicationService.QueryNotificationOutboxAsync(request)` →
`GetNotificationOutbox().Ask<NotificationOutboxQueryResponse>(request, _options.QueryTimeout)`.
The page auto-queries the singleton on init (during prerender); the `Ask` times out at 30s.
`SiteCallsReport` does the analogous thing for the `site-call-audit` singleton.

- `CommunicationService.QueryNotificationOutboxAsync` → `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs:456`
- `GetNotificationOutbox()` returns the cached `_notificationOutboxProxy` → `CommunicationService.cs:100`
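
Why the page is released only after the full timeout can be illustrated without Akka. The sketch below is an analogy, not Akka's `Ask` implementation: it models a request whose reply never arrives as a `TaskCompletionSource` that nobody completes, and uses `Task.WaitAsync` (.NET 6 or later) for the timeout. `AskTimeoutDemo` and its method are hypothetical names.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class AskTimeoutDemo
{
    // Emulates a request whose reply never arrives: the reply task stays pending,
    // so the caller is released only when the timeout elapses.
    public static async Task<(bool TimedOut, TimeSpan Waited)> AskNeverAnsweredAsync(TimeSpan timeout)
    {
        var reply = new TaskCompletionSource<string>(); // nothing ever completes this
        var sw = Stopwatch.StartNew();
        try
        {
            await reply.Task.WaitAsync(timeout);
            return (false, sw.Elapsed);
        }
        catch (TimeoutException)
        {
            // No error surfaces from the "handler" side; the only signal is the timeout.
            return (true, sw.Elapsed);
        }
    }
}
```

Because nothing fails on the receiving side, the caller observes no exception other than the timeout itself, which is consistent with render times clustering just above the configured 30s.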

## What was verified correct / ruled out (with evidence)

1. **The CentralCommunicationActor path is healthy.** `/deployment/deployments`
   (`GetActor().Ask`, a node-local actor) returns in 0.018s. The cluster/ClusterClient transport
   and the Ask machinery work. Only the **singleton-proxy** Asks hang. This is the key asymmetry.
2. **The singletons start and are reachable by their proxies.** On the current boot the active
   node (central-a) logged `NotificationOutbox singleton created and registered…`,
   `SiteCallAuditActor singleton created and registered…`,
   `ClusterSingletonManager state change [Start -> Oldest]`,
   `Singleton manager started singleton actor [.../notification-outbox]`, and the proxy logged
   `Singleton identified at [.../notification-outbox-singleton/notification-outbox]`.
3. **Not cluster state — neither a restart nor a full redeploy fixes it.** Restarting the central
   nodes (sequentially, then both together) re-formed a healthy active/standby cluster (central-a
   active leader, central-b standby) with the singletons started + identified on the active node
   (so the Ask is **local**), and the pages **still hung exactly 30s**. A subsequent **full
   `docker/deploy.sh`** (fresh image rebuild + recreation of all containers) *also* left the pages
   hanging exactly 30s. This rules out a stale-proxy / wedged-singleton cluster-state explanation,
   a stale binary, and cross-node serialization (a local Ask is not serialized) — the defect is
   **deterministic**.
4. **The query handlers are correct.** Both `PipeTo` the async query to the captured `Sender`
   with a **failure projection on every path** (a faulted query replies `Success:false`, it does
   not hang):
   - `NotificationOutboxActor.HandleQuery` → `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/NotificationOutboxActor.cs:760` (PipeTo at :765, failure arm :768)
   - `SiteCallAuditActor.HandleQuery` → `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:224` (PipeTo at :229, failure arm :231)
5. **The DB query would be instant.** `Notifications` and `SiteCalls` are **empty (0 rows, 0 ms)**
   in the live DB, and the repository query is a plain EF `Where`/paginate
   (`NotificationOutboxRepository.QueryAsync` → `…/Repositories/NotificationOutboxRepository.cs:132`).
   So a query that actually executes returns in well under a second.
6. **The proxy wiring is textbook.** `notification-outbox-proxy` is a standard
   `ClusterSingletonProxy` for `/user/notification-outbox-singleton`, handed to
   `CommunicationService.SetNotificationOutbox(...)` →
   `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:367-379` (and `:515-524` for site-call-audit).
   `GetNotificationOutbox()` is non-null (a null ref would throw fast, not hang).
7. **No relevant exception/serialization error is logged** on either central node at query time
   (`ActorInitializationException`, restart loop, `cannot be serialized`, `no serializer` — none).
8. **The hang is independent of the dispatch loop.** `NotificationOutbox` has a 5s dispatcher loop;
   `SiteCallAudit` has **no** periodic loop at all (deferred), yet both hang identically, so the
   dispatch loop is not the shared cause. The loop is also fire-and-forget
   (`RunDispatchPass(...).PipeTo(Self)`), so it cannot starve the mailbox.

Net: handler, proxy wiring, singleton lifecycle, and DB query are all correct, and the table is
empty — yet a **local** Ask to the singleton never replies within 30s. The defect is at the
singleton **activation / message-processing boundary** on the live node, not in the visible code.

## Leading hypotheses (not yet confirmed — need runtime instrumentation)

1. **The singleton instance is not draining its mailbox** even though the
   `ClusterSingletonManager` reports it started (a half-activated / perpetually-restarting / stuck
   instance). The manager holds the name and the proxy "identifies" it, but messages are buffered
   and never processed → 30s timeout with no per-query exception. Both singletons share
   construction via `Props.Create(() => new …(_serviceProvider, …))`
   (`AkkaHostedService.cs:357`, `:471`) and the `_serviceProvider.CreateScope()` +
   `ScadaBridgeDbContext` pattern in their handlers — a shared activation-time defect would hit
   both.
2. **DB scope/connection acquisition from the actor's root-provider scope hangs** (e.g. a
   leaked-connection / pool-wait specific to `_serviceProvider.CreateScope()` in the actor, vs the
   request-scoped DbContext that `/admin/sites` uses successfully). The 30s is *exactly* the Ask
   timeout, so any handler-side hang ≥30s presents identically.
3. **The reply cannot be delivered back to the Ask's temporary actor** (less likely for a local
   Ask, but not disproven).

## How to confirm (next step)

Bisect "message never reaches the singleton (routing)" vs "singleton receives but never replies
(handler/DB)":

- Turn on Akka receive logging for the run (`akka.loglevel = DEBUG` and
  `akka.actor.debug.receive = on` in the Host's HOCON, `AkkaHostedService.BuildHocon`,
  ~`AkkaHostedService.cs:171-216`), or add a single `_log.Info("HandleQuery received …")` line at
  the top of `NotificationOutboxActor.HandleQuery`, then `bash docker/deploy.sh` and hit
  `/notifications/report` once.
- If the log line **does not** appear → the message isn't reaching the singleton (routing /
  proxy / mailbox-stuck) → investigate the singleton activation + proxy delivery.
- If it **does** appear → the handler/`QueryOutboxAsync` is hanging → wrap with timing around
  `CreateScope()`, `GetRequiredService`, and `await repository.QueryAsync(...)` to find which
  await hangs.
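
The timing wrap suggested in the last step could look like the following. This is a minimal sketch under stated assumptions: `StepTimer.TimedAsync` is a hypothetical helper, and the `log` callback stands in for whatever logger the handler already has. It logs before the step as well as after it, because a step that hangs never reaches the "finished" line.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class StepTimer
{
    // Runs one awaited step, logging when it starts and how long it took to finish.
    // A step that never completes leaves a "started" line with no matching "finished".
    public static async Task<T> TimedAsync<T>(string label, Func<Task<T>> step, Action<string> log)
    {
        log($"{label}: started");
        var sw = Stopwatch.StartNew();
        try
        {
            return await step();
        }
        finally
        {
            log($"{label}: finished after {sw.ElapsedMilliseconds} ms");
        }
    }
}
```

A call site might read `await StepTimer.TimedAsync("QueryAsync", () => repository.QueryAsync(request), msg => _log.Info(msg))`, where `repository`, `request`, and `_log` are whatever the handler already has in scope. The last label that logs "started" without "finished" identifies the hanging await.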

## Blocked tests

All currently-failing Playwright tests are blocked solely by this hang (they load the affected
pages):

- `Audit.AuditLogPageTests.NotificationsPage_RendersAuditDrillInLinkPattern` (loads `/notifications/report`)
- All `SiteCalls.SiteCallsPageTests` page tests (load `/site-calls/report`): `PageLoads_ForDeploymentUser`,
  `FilterNarrowing_ChannelFilterShrinksGrid`, `RetryClickThrough_OnParkedRow_ConfirmsRelayAndShowsOutcomeToast`,
  `RetryDiscard_VisibleOnlyOnParkedRows`, `DrillIn_ViewAuditHistory_NavigatesToPreFilteredAuditLog`.

The rest of the suite is green (the Audit grid/drawer tests pass after the `AuditDataSeeder`
canonical-schema fix landed in the same session).

## Notes

- Pre-existing: the hang was present before any test-suite or cluster-restart work this session,
  and the restarts did not cause it (the cluster is healthy active/standby afterward).
- Timeframe correlation only (not proven causal): this surfaced around the audit subsystem
  re-architecture (`CollapseAuditLogToCanonical`) — but the NotificationOutbox/SiteCallAudit
  handlers and repositories read the unchanged-and-empty `Notifications`/`SiteCalls` tables and
  are themselves correct, so the defect is at the singleton hosting/messaging layer rather than in
  the audit-table change.
@@ -34,6 +34,14 @@ public class AkkaHostedService : IHostedService
     private readonly CommunicationOptions _communicationOptions;
     private readonly ILogger<AkkaHostedService> _logger;
     private ActorSystem? _actorSystem;
+
+    /// <summary>
+    /// Guards the one-time creation of <see cref="_actorSystem"/> in
+    /// <see cref="GetOrCreateActorSystem"/> so <see cref="StartAsync"/> and a concurrent
+    /// health-probe resolution of the DI <see cref="ActorSystem"/> singleton race to create
+    /// it exactly once (HOST-021).
+    /// </summary>
+    private readonly object _actorSystemLock = new();
     /// <summary>
     /// Auxiliary IDisposables (e.g. the SiteAuditTelemetryStalledTracker)
     /// that this hosted service constructs at start time and must tear down
@@ -91,38 +99,18 @@ public class AkkaHostedService : IHostedService
     /// <returns>A task representing the asynchronous operation.</returns>
     public async Task StartAsync(CancellationToken cancellationToken)
     {
-        // For site nodes, include a site-specific role (e.g., "site-SiteA") alongside the base role
-        var roles = BuildRoles();
-
-        // WP-3: Transport heartbeat explicitly configured from CommunicationOptions (not framework defaults)
-        var transportHeartbeatSec = _communicationOptions.TransportHeartbeatInterval.TotalSeconds;
-        var transportFailureSec = _communicationOptions.TransportFailureThreshold.TotalSeconds;
-
-        // Host-006: HOCON is assembled in a dedicated builder that quotes/escapes every
-        // interpolated value, so a hostname, seed node or strategy containing a quote,
-        // backslash or whitespace cannot corrupt the configuration document.
-        var hocon = BuildHocon(_nodeOptions, _clusterOptions, roles,
-            _communicationOptions.TransportHeartbeatInterval,
-            _communicationOptions.TransportFailureThreshold);
-
-        var config = ConfigurationFactory.ParseString(hocon);
-        _actorSystem = ActorSystem.Create("scadabridge", config);
-
-        _logger.LogInformation(
-            "Akka.NET actor system 'scadabridge' started. Role={Role}, Roles={Roles}, Hostname={Hostname}, Port={Port}, " +
-            "TransportHeartbeat={TransportHeartbeat}s, TransportFailure={TransportFailure}s",
-            _nodeOptions.Role,
-            string.Join(", ", roles),
-            _nodeOptions.NodeHostname,
-            _nodeOptions.RemotingPort,
-            transportHeartbeatSec,
-            transportFailureSec);
+        // HOST-021: create (or reuse) the externally-owned, process-singleton ActorSystem. A
+        // health probe may already have created it via the DI singleton bridge
+        // (GetOrCreateActorSystem) before this hosted service's StartAsync ran; either way the
+        // call yields the one instance and sets _actorSystem. Actor registration below then
+        // runs on it.
+        var actorSystem = GetOrCreateActorSystem();
 
         // Register the dead letter monitor actor
         var loggerFactory = _serviceProvider.GetRequiredService<ILoggerFactory>();
         var dlmLogger = loggerFactory.CreateLogger<DeadLetterMonitorActor>();
         var dlmHealthCollector = _serviceProvider.GetService<ZB.MOM.WW.ScadaBridge.HealthMonitoring.ISiteHealthCollector>();
-        _actorSystem.ActorOf(
+        actorSystem.ActorOf(
             Props.Create(() => new DeadLetterMonitorActor(dlmLogger, dlmHealthCollector)),
             "dead-letter-monitor");
 
@@ -137,6 +125,72 @@ public class AkkaHostedService : IHostedService
         }
     }
 
+    /// <summary>
+    /// Returns the process-wide Akka <see cref="ActorSystem"/>, creating it on first call.
+    /// Idempotent and thread-safe: both <see cref="StartAsync"/> and the DI bridge that
+    /// exposes the system to the shared <c>ZB.MOM.WW.Health.Akka</c> checks call this, and
+    /// whichever runs first creates the system exactly once.
+    /// </summary>
+    /// <remarks>
+    /// HOST-021: the <see cref="ActorSystem"/> is an externally-owned process singleton — its
+    /// lifecycle is this hosted service's (created here, torn down via
+    /// <c>CoordinatedShutdown</c> in <see cref="StopAsync"/>). It MUST be registered in DI as a
+    /// <b>singleton resolved through this method</b>, never as a transient/scoped factory:
+    /// <see cref="ActorSystem"/> is <see cref="IDisposable"/>, and a transient/scoped factory
+    /// hands a fresh disposable to every resolving child scope (e.g. each per-probe
+    /// health-check scope), so the container disposes it when that scope ends —
+    /// <c>ActorSystem.Dispose()</c> runs <c>CoordinatedShutdown(ActorSystemTerminateReason)</c>
+    /// and tears the live cluster node down mid-flight, which is exactly the
+    /// "central report pages hang 30s" defect this method fixes. Creating the system here and
+    /// exposing it as a singleton keeps child-scope disposal away from it; routing the singleton
+    /// through this method (rather than a plain <c>AddSingleton(sp => ...ActorSystem)</c>
+    /// factory) also avoids caching a <c>null</c> if a health probe wins the startup race, since
+    /// the first resolve creates the system instead of capturing a not-yet-started reference.
+    /// </remarks>
+    /// <returns>The single live actor system.</returns>
+    public ActorSystem GetOrCreateActorSystem()
+    {
+        if (_actorSystem is not null)
+        {
+            return _actorSystem;
+        }
+
+        lock (_actorSystemLock)
+        {
+            if (_actorSystem is not null)
+            {
+                return _actorSystem;
+            }
+
+            // For site nodes, include a site-specific role (e.g., "site-SiteA") alongside the base role
+            var roles = BuildRoles();
+
+            // Host-006: HOCON is assembled in a dedicated builder that quotes/escapes every
+            // interpolated value, so a hostname, seed node or strategy containing a quote,
+            // backslash or whitespace cannot corrupt the configuration document.
+            var hocon = BuildHocon(_nodeOptions, _clusterOptions, roles,
+                _communicationOptions.TransportHeartbeatInterval,
+                _communicationOptions.TransportFailureThreshold);
+
+            var config = ConfigurationFactory.ParseString(hocon);
+            var system = ActorSystem.Create("scadabridge", config);
+
+            _logger.LogInformation(
+                "Akka.NET actor system 'scadabridge' started. Role={Role}, Roles={Roles}, Hostname={Hostname}, Port={Port}, " +
+                "TransportHeartbeat={TransportHeartbeat}s, TransportFailure={TransportFailure}s",
+                _nodeOptions.Role,
+                string.Join(", ", roles),
+                _nodeOptions.NodeHostname,
+                _nodeOptions.RemotingPort,
+                _communicationOptions.TransportHeartbeatInterval.TotalSeconds,
+                _communicationOptions.TransportFailureThreshold.TotalSeconds);
+
+            // Publish last so a concurrent reader never observes a half-constructed system.
+            _actorSystem = system;
+            return _actorSystem;
+        }
+    }
+
     /// <summary>
     /// Builds the Akka HOCON configuration document. Every interpolated value is
     /// routed through <see cref="QuoteHocon"/> (string values) so a hostname,
@@ -204,12 +204,17 @@ try
     builder.Services.AddSingleton<AkkaHostedService>();
     builder.Services.AddHostedService(sp => sp.GetRequiredService<AkkaHostedService>());
 
-    // The shared ZB.MOM.WW.Health Akka checks resolve ActorSystem from DI. ScadaBridge owns the
-    // ActorSystem inside AkkaHostedService (not a DI singleton), so bridge it as TRANSIENT: each
-    // resolve re-reads the current value — null while warming up (checks → Degraded), live after.
-    // The factory must NOT throw: GetService<ActorSystem>() must return null (not raise) pre-start.
-    builder.Services.AddTransient<Akka.Actor.ActorSystem>(sp =>
-        sp.GetRequiredService<AkkaHostedService>().ActorSystem!);
+    // HOST-021: bridge the AkkaHostedService-owned ActorSystem to DI as a SINGLETON via
+    // GetOrCreateActorSystem(). The shared ZB.MOM.WW.Health Akka checks resolve ActorSystem
+    // from DI, per probe, inside a child scope. ActorSystem is IDisposable, so a TRANSIENT
+    // (or scoped) bridge is captured-and-disposed by each probe's scope — disposing the live
+    // system mid-flight (CoordinatedShutdown/ActorSystemTerminateReason) and wedging the
+    // central report pages at the 30s Ask timeout. A singleton is resolved from the root and
+    // never disposed by a child scope; routing through GetOrCreateActorSystem (instead of a
+    // plain singleton factory over .ActorSystem) means the first resolve CREATES the system
+    // rather than caching a null if a probe wins the startup race.
+    builder.Services.AddSingleton<Akka.Actor.ActorSystem>(sp =>
+        sp.GetRequiredService<AkkaHostedService>().GetOrCreateActorSystem());
 
     // InboundAPI-022: register the production IActiveNodeGate implementation so
     // standby-node gating is actually enforced (the InboundApiEndpointFilter
@@ -75,12 +75,17 @@ public static class SiteServiceRegistration
         services.AddSingleton<AkkaHostedService>();
         services.AddHostedService(sp => sp.GetRequiredService<AkkaHostedService>());
 
-        // The shared ZB.MOM.WW.Health Akka checks resolve ActorSystem from DI. ScadaBridge owns the
-        // ActorSystem inside AkkaHostedService (not a DI singleton), so bridge it as TRANSIENT: each
-        // resolve re-reads the current value — null while warming up (checks → Degraded), live after.
-        // The factory must NOT throw: GetService<ActorSystem>() must return null (not raise) pre-start.
-        services.AddTransient<Akka.Actor.ActorSystem>(sp =>
-            sp.GetRequiredService<AkkaHostedService>().ActorSystem!);
+        // HOST-021: bridge the AkkaHostedService-owned ActorSystem to DI as a SINGLETON via
+        // GetOrCreateActorSystem(). The shared ZB.MOM.WW.Health Akka checks resolve ActorSystem
+        // from DI, per probe, inside a child scope. ActorSystem is IDisposable, so a TRANSIENT
+        // (or scoped) bridge is captured-and-disposed by each probe's scope — disposing the live
+        // system mid-flight (CoordinatedShutdown/ActorSystemTerminateReason) and tearing down the
+        // node. A singleton is resolved from the root and never disposed by a child scope; routing
+        // through GetOrCreateActorSystem (instead of a plain singleton factory over .ActorSystem)
+        // means the first resolve CREATES the system rather than caching a null if a probe wins
+        // the startup race.
+        services.AddSingleton<Akka.Actor.ActorSystem>(sp =>
+            sp.GetRequiredService<AkkaHostedService>().GetOrCreateActorSystem());
 
         // Cluster node status provider for health reports
         services.AddSingleton<IClusterNodeProvider>(sp =>