Host
The Host is the single deployable binary for ScadaBridge. The same executable runs on every node — central and site alike — and selects its component set entirely from configuration, with no separate build targets or conditional compilation.
Overview
Host (#15) is the composition root: it reads ScadaBridge:Node:Role from appsettings.json (layered with a role-specific override file selected by the SCADABRIDGE_CONFIG environment variable), runs pre-DI startup validation, wires every applicable component into the DI container and Akka.NET actor system, and then hands off to ASP.NET Core's WebApplication host.
The component code lives in src/ZB.MOM.WW.ScadaBridge.Host/, split across:
- Program.cs: the entry point, covering configuration loading, StartupValidator, role-branched DI registration, Kestrel setup, middleware pipeline, and endpoint mapping.
- Actors/AkkaHostedService.cs: owns the ActorSystem lifetime; builds HOCON from bound options; registers role-specific actors as cluster singletons or plain ActorOf calls.
- Actors/DeadLetterMonitorActor.cs: subscribes to the DeadLetter event stream and increments the health metric.
- Health/ActiveNodeGate.cs: production IActiveNodeGate backed by Akka cluster leadership; used by the Inbound API endpoint filter to gate traffic on standby nodes.
- Health/AkkaClusterNodeProvider.cs: feeds IClusterNodeProvider from live Akka cluster membership for health reporting.
- SiteServiceRegistration.cs: extracted site-role DI registrations reused by both Program.cs and integration test harnesses.
- StartupValidator.cs: pre-DI configuration preflight that fails fast before any actor system is created.
- StartupRetry.cs: bounded exponential-backoff helper for startup preconditions (database migrations).
- LoggerConfigurationFactory.cs: builds the Serilog LoggerConfiguration with node-identity enrichment.
Key Concepts
Role selection via SCADABRIDGE_CONFIG
The configuration builder layers appsettings.json, then appsettings.{SCADABRIDGE_CONFIG}.json. The SCADABRIDGE_CONFIG environment variable selects the role-specific file (Central or Site); when it is absent, the builder falls back to DOTNET_ENVIRONMENT, and then to Production if neither is set. This lets DOTNET_ENVIRONMENT and ASPNETCORE_ENVIRONMENT stay at Development for dev tooling (static assets, EF migrations) regardless of which role is active.
```csharp
var scadabridgeConfig = Environment.GetEnvironmentVariable("SCADABRIDGE_CONFIG")
    ?? Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT")
    ?? "Production";

var configuration = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: false)
    .AddJsonFile($"appsettings.{scadabridgeConfig}.json", optional: true)
    .AddEnvironmentVariables()
    .AddCommandLine(args)
    .Build();
```
The resolved ScadaBridge:Node:Role value then branches the entire DI and Akka bootstrap.
Pre-DI startup validation
StartupValidator.Validate runs before any DI or actor system setup. It assembles all errors, then throws a single InvalidOperationException listing every problem. This avoids the confusing partial-startup failures that occur when validation is deferred to first resolve. Site nodes additionally validate that GrpcPort, MetricsPort, and RemotingPort are all distinct and that no seed-node entry points at the gRPC port.
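The collect-then-throw pattern described above can be sketched as follows. This is a minimal illustration, not the actual StartupValidator: the ValidateSiteNode function, its parameters, and the error wording are hypothetical, and only the site-node port and seed-node checks are shown. It is written as a top-level program so it can be pasted into an empty console project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for the site-node part of StartupValidator.Validate.
// Every problem is collected first; a single exception then reports all of them.
static void ValidateSiteNode(int remotingPort, int grpcPort, int metricsPort, IReadOnlyList<string> seedNodes)
{
    var errors = new List<string>();

    void CheckPort(string key, int port)
    {
        if (port < 1 || port > 65535)
            errors.Add($"{key} must be in range 1-65535 (was {port}).");
    }

    CheckPort("RemotingPort", remotingPort);
    CheckPort("GrpcPort", grpcPort);
    CheckPort("MetricsPort", metricsPort);

    if (new[] { remotingPort, grpcPort, metricsPort }.Distinct().Count() != 3)
        errors.Add("RemotingPort, GrpcPort, and MetricsPort must all be distinct.");

    // A seed node must reference a remoting port, never the gRPC port.
    foreach (var seed in seedNodes)
    {
        if (Uri.TryCreate(seed, UriKind.Absolute, out var uri) && uri.Port == grpcPort)
            errors.Add($"Seed node '{seed}' points at the gRPC port {grpcPort}.");
    }

    if (errors.Count > 0)
    {
        throw new InvalidOperationException(
            "Configuration validation failed:" + Environment.NewLine +
            string.Join(Environment.NewLine, errors.Select(e => " - " + e)));
    }
}

// Example: GrpcPort collides with RemotingPort, so several errors surface in one exception.
try
{
    ValidateSiteNode(
        remotingPort: 8081, grpcPort: 8081, metricsPort: 8084,
        seedNodes: new[] { "akka.tcp://scadabridge@site-a-1:8081", "akka.tcp://scadabridge@site-a-2:8081" });
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message);
}
```

Reporting every error at once means an operator can fix a misconfigured node in one pass instead of restarting once per mistake.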
Akka HOCON construction
AkkaHostedService.BuildHocon assembles the HOCON configuration document from strongly-typed options rather than inline strings. Every interpolated value passes through QuoteHocon (escapes backslashes and double-quotes) to prevent a hostname, seed-node URI, or split-brain strategy value from corrupting the document. Durations are rendered in milliseconds (DurationHocon) so sub-second timing values (e.g. a 750 ms heartbeat) are preserved exactly.
The actor system name is always scadabridge. Site nodes carry two cluster roles: the generic "Site" role and a per-site role ("site-{SiteId}") used to scope cluster singletons to a specific site.
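The quoting and duration rendering described in this section might look roughly like the sketch below. The helper names QuoteHocon and DurationHocon come from the text, but their bodies here are assumptions based on that description (escape backslashes and double quotes, render milliseconds), not the actual implementation. The hostname is a placeholder.

```csharp
using System;
using System.Globalization;

// Assumed behaviour: wrap the value in double quotes, escaping backslashes first
// and then double quotes, so the value cannot terminate the HOCON string early.
static string QuoteHocon(string value) =>
    "\"" + value.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";

// Assumed behaviour: render a duration as whole milliseconds with the "ms" unit,
// so sub-second values such as 750 ms are not rounded to seconds.
static string DurationHocon(TimeSpan value) =>
    ((long)value.TotalMilliseconds).ToString(CultureInfo.InvariantCulture) + "ms";

var hostname = QuoteHocon("node-a.example.local");
var heartbeat = DurationHocon(TimeSpan.FromMilliseconds(750));

var hocon = string.Join("\n",
    "akka.remote.dot-netty.tcp.hostname = " + hostname,
    "akka.cluster.failure-detector.heartbeat-interval = " + heartbeat);

Console.WriteLine(hocon);
```

Escaping backslashes before quotes matters: doing it the other way round would double the backslashes that the quote escaping just introduced.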
/health/ready — readiness gating
Central nodes register DatabaseHealthCheck<ScadaBridgeDbContext> (tagged Ready) and AkkaClusterHealthCheck (tagged Ready). The /health/ready endpoint returns 200 only when both pass. Readiness is explicitly not tied to cluster leadership: a fully operational standby central node still reports ready because ActiveNodeHealthCheck carries only the Active tag, not Ready.
Load balancers and orchestrators should poll /health/ready to determine when a freshly started or failed-over node can receive traffic.
/health/active — active-node routing for Traefik
ActiveNodeHealthCheck carries the Active tag and is served at /health/active. It returns 200 only on the cluster leader. Traefik polls this endpoint and routes inbound traffic — Central UI, Inbound API, Management API — exclusively to the node that answers 200. See TraefikProxy for the upstream routing rules.
The same leadership check backs ActiveNodeGate, the IActiveNodeGate implementation the Inbound API endpoint filter consults before executing a method script. A standby node therefore refuses inbound API calls even if traffic somehow reaches it directly.
```csharp
public bool IsActiveNode
{
    get
    {
        var system = _akkaService.ActorSystem;
        if (system == null)
            return false;

        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up)
            return false;

        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}
```
Architecture
Central composition root
Program.cs (Central branch) calls WebApplication.CreateBuilder, registers shared and central-only components, builds the WebApplication, applies or retries database migrations, and mounts the middleware pipeline and endpoints. The order is intentional: UseAuthentication and UseAuthorization run before UseAuditWriteMiddleware so HttpContext.User is populated when the audit row is written.
Before branching on role, AkkaHostedService.StartAsync creates one actor unconditionally on every node:
- DeadLetterMonitorActor: plain ActorOf; subscribes to the DeadLetter event stream on PreStart. Runs on both central and site nodes.
AkkaHostedService.RegisterCentralActors creates:
- CentralCommunicationActor: registered with ClusterClientReceptionist so site ClusterClients can reach it.
- ManagementActor: also registered with ClusterClientReceptionist; the CLI connects via ClusterClient without joining the cluster.
- NotificationOutboxActor: cluster singleton (no role scope); a proxy is handed to CentralCommunicationActor so forwarded NotificationSubmit messages from sites are routed to it.
- AuditLogIngestActor: cluster singleton; proxy registered with both CentralCommunicationActor and (if present) the SiteStreamGrpcServer.
- SiteCallAuditActor: cluster singleton; a graceful-stop task is added to the cluster-leave coordinated-shutdown phase with a 10-second drain window.
Site composition root
Program.cs (Site branch) calls WebApplication.CreateBuilder with a Kestrel configuration that binds two listeners: HTTP/2 only on GrpcPort (default 8083) for the gRPC server, and HTTP/1+2 on MetricsPort (default 8084) for the Prometheus /metrics scrape endpoint. The separation exists because a standard HTTP/1.1 Prometheus scraper cannot negotiate HTTP/2; the gRPC listener must stay pure HTTP/2.
SiteServiceRegistration.Configure registers the site-only components. AkkaHostedService.RegisterSiteActorsAsync creates:
- DeploymentManagerActor: cluster singleton scoped to "site-{SiteId}".
- SiteCommunicationActor: registered with ClusterClientReceptionist; creates a ClusterClient to configured central contact points.
- SiteReplicationActor: one per node (not a singleton); handles best-effort S&F replication to the standby.
- EventLogHandlerActor: cluster singleton scoped to "site-{SiteId}".
- ParkedMessageHandlerActor: bridges Akka to StoreAndForwardService.
- SiteAuditTelemetryActor: created on a dedicated audit-telemetry-dispatcher (2-thread ForkJoinDispatcher) so SQLite reads and gRPC pushes never contend with hot-path actors.
- DataConnectionManagerActor: created only if IDataConnectionFactory is registered.
Shutdown ordering for the site role is explicit: IHostApplicationLifetime.ApplicationStopping fires before IHostedService.StopAsync, so SiteStreamGrpcServer.CancelAllStreams is called first (clients observe a clean cancellation and reconnect), then AkkaHostedService runs CoordinatedShutdown and tears down actors.
```csharp
siteLifetime.ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams());
```
Database migration retry
On central nodes, StartupRetry.ExecuteWithRetryAsync wraps the migration step with up to 8 attempts and initial 2-second exponential backoff (capped at 30 seconds). Only connection-class faults (SocketException, SqlException, DbException, TimeoutException) are retried; a schema-version mismatch surfaces as an InvalidOperationException and fails immediately. The ApplicationStopping token is threaded into both the migration call and the inter-attempt Task.Delay so a SIGTERM during the retry window tears down cleanly.
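The retry policy above can be approximated with the sketch below. It is not the actual StartupRetry code: the signatures, the doubling factor, and the IsTransient helper are assumptions chosen to match the stated parameters (8 attempts, 2-second initial delay, 30-second cap, connection-class faults only). SqlException is covered through its base class DbException.

```csharp
using System;
using System.Data.Common;
using System.Linq;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Delay before the next attempt, assuming the delay doubles each time: 2 s, 4 s, 8 s, 16 s, then capped.
static TimeSpan BackoffDelay(int attempt, TimeSpan initial, TimeSpan cap)
{
    var seconds = initial.TotalSeconds * Math.Pow(2, attempt - 1);
    return TimeSpan.FromSeconds(Math.Min(seconds, cap.TotalSeconds));
}

// Only connection-class faults are retried; anything else propagates immediately.
static bool IsTransient(Exception ex) =>
    ex is SocketException or DbException or TimeoutException;

static async Task ExecuteWithRetryAsync(
    Func<CancellationToken, Task> action, int maxAttempts,
    TimeSpan initial, TimeSpan cap, CancellationToken ct)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            await action(ct);
            return;
        }
        catch (Exception ex) when (IsTransient(ex) && attempt < maxAttempts)
        {
            // The token also cancels the wait, so shutdown is not blocked by a pending retry.
            await Task.Delay(BackoffDelay(attempt, initial, cap), ct);
        }
    }
}

// 8 attempts means at most 7 waits: 2 + 4 + 8 + 16 + 30 + 30 + 30 seconds.
var totalBackoff = Enumerable.Range(1, 7)
    .Sum(n => BackoffDelay(n, TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(30)).TotalSeconds);
Console.WriteLine($"Worst-case cumulative backoff: {totalBackoff} s");

// Demo with short delays: the action fails twice with a transient fault, then succeeds.
var calls = 0;
await ExecuteWithRetryAsync(
    _ =>
    {
        if (++calls < 3) throw new TimeoutException("database not reachable yet");
        return Task.CompletedTask;
    },
    maxAttempts: 8,
    initial: TimeSpan.FromMilliseconds(10),
    cap: TimeSpan.FromMilliseconds(50),
    ct: CancellationToken.None);
Console.WriteLine($"Succeeded after {calls} attempts");
```

Under these assumptions the cumulative wait across all retries is two minutes, not counting the time each attempt itself takes.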
Usage
The Host is not consumed as a library; it is the executable entry point. Other components expose themselves to the Host via the extension-method convention:
- IServiceCollection.AddXxx(): registers DI services.
- AkkaHostedService.RegisterXxxActors() / inline ActorOf calls in AkkaHostedService: registers actors.
- WebApplication.MapXxx(): maps web endpoints (Central UI, Inbound API, Management API, Audit API).
Program.cs calls these methods; the component libraries own the registration logic. This keeps the Host thin and each component self-contained.
Component registration by role
| Component | Central | Site | AddXxx | Actors | MapXxx |
|---|---|---|---|---|---|
| ClusterInfrastructure | Yes | Yes | Yes | — | — |
| Communication | Yes | Yes | Yes | Yes | — |
| HealthMonitoring | Yes | Yes | Yes | — | — |
| ExternalSystemGateway | Yes | Yes | Yes | — | — |
| AuditLog | Yes | Yes | Yes | Yes | — |
| NotificationService | Yes | No | Yes | — | — |
| NotificationOutbox | Yes | No | Yes | Yes (singleton) | — |
| SiteCallAudit | Yes | No | Yes | Yes (singleton) | — |
| TemplateEngine | Yes | No | Yes | Yes | — |
| DeploymentManager | Yes | No | Yes | Yes | — |
| Security | Yes | No | Yes | — | — |
| CentralUI | Yes | No | Yes | — | Yes |
| InboundAPI | Yes | No | Yes | — | Yes |
| ManagementService | Yes | No | Yes | Yes | Yes |
| Transport | Yes | No | Yes | — | — |
| ConfigurationDatabase | Yes | No | Yes | — | — |
| SiteRuntime | No | Yes | Yes | Yes (singleton) | — |
| DataConnectionLayer | No | Yes | Yes | Yes | — |
| StoreAndForward | No | Yes | Yes | Yes | — |
| SiteEventLogging | No | Yes | Yes | Yes (singleton) | — |
AuditLog calls AddAuditLog on both roles; central additionally calls AddAuditLogCentralMaintenance. Site calls AddAuditLogHealthMetricsBridge to bridge write failures into the site health report.
Configuration
Options are bound via the .NET Options pattern (IOptions<T>). Each component owns its options class; the Host binds each section and passes the IConfiguration to component extension methods only where the component's own validator needs it at startup.
ScadaBridge:Node → NodeOptions
| Key | Default | Description |
|---|---|---|
| Role | (none) | "Central" or "Site". Validated by StartupValidator. |
| NodeHostname | (none) | Hostname or IP advertised to the Akka cluster and enriched on log entries. |
| NodeName | (none) | Free-form semantic name stamped as SourceNode on audit rows (e.g. "central-a", "node-b"). Empty normalises to null. |
| SiteId | (none) | Site identifier; required for Site nodes; used to scope cluster singletons and enrich telemetry. |
| RemotingPort | 8081 | Akka.NET remoting TCP port. Must be in range 1–65535. |
| GrpcPort | 8083 | Kestrel HTTP/2 port for the site gRPC stream server (Site nodes only). Must differ from RemotingPort. |
| MetricsPort | 8084 | Kestrel HTTP/1+2 port for the Prometheus /metrics scrape endpoint (Site nodes only). Must differ from both RemotingPort and GrpcPort. |
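As an illustration, a site node's ScadaBridge:Node section could look like the fragment below. The key names and port defaults come from the table above; the hostname, node name, and site identifier are placeholder values, and placing this in appsettings.Site.json is an assumption based on the role-specific override file described earlier.

```json
{
  "ScadaBridge": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site-a-node-1.example.local",
      "NodeName": "node-a",
      "SiteId": "site-a",
      "RemotingPort": 8081,
      "GrpcPort": 8083,
      "MetricsPort": 8084
    }
  }
}
```

All three ports are distinct here, which is what the site-node startup validation requires.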
ScadaBridge:Cluster → ClusterOptions
| Key | Default | Description |
|---|---|---|
| SeedNodes | (none) | List of Akka seed-node URIs (akka.tcp://scadabridge@host:port). At least 2 required. Must reference remoting ports, not gRPC ports. |
| SplitBrainResolverStrategy | keep-oldest | Active strategy name (e.g. "keep-oldest"). |
| StableAfter | "00:00:15" | Duration the cluster must be stable before the resolver acts. |
| HeartbeatInterval | "00:00:02" | Akka failure-detector heartbeat cadence. |
| FailureDetectionThreshold | "00:00:10" | Acceptable heartbeat pause before a node is considered unreachable. |
| MinNrOfMembers | 1 | Minimum number of cluster members required before the leader moves joining members to Up. |
| DownIfAlone | true | When using keep-oldest, whether a lone surviving node downs itself. |
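A ScadaBridge:Cluster section that spells out the documented defaults might look like this. The key names and default values come from the table above; the two seed-node hostnames are placeholders, and both URIs deliberately use the remoting port (8081), not the gRPC port.

```json
{
  "ScadaBridge": {
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@central-a.example.local:8081",
        "akka.tcp://scadabridge@central-b.example.local:8081"
      ],
      "SplitBrainResolverStrategy": "keep-oldest",
      "StableAfter": "00:00:15",
      "HeartbeatInterval": "00:00:02",
      "FailureDetectionThreshold": "00:00:10",
      "MinNrOfMembers": 1,
      "DownIfAlone": true
    }
  }
}
```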
ScadaBridge:Database → DatabaseOptions
| Key | Role | Description |
|---|---|---|
| ConfigurationDb | Central | MS SQL connection string for the central ScadaBridgeDbContext. Required; validated by StartupValidator. |
| SiteDbPath | Site | Filesystem path to the site-local SQLite database. Required for Site nodes. |
ScadaBridge:Logging → LoggingOptions
| Key | Default | Description |
|---|---|---|
| MinimumLevel | "Information" | Serilog minimum log level. Overrides any Serilog:MinimumLevel entry; a one-shot warning is emitted to stderr if both are present. Parsed case-insensitively; unrecognised values fall back to Information with a warning. |
Serilog sinks (console output template, file path, rolling interval) are configured under the standard Serilog JSON section and applied via ReadFrom.Configuration. Every log entry is enriched with SiteId, NodeHostname, and NodeRole properties from the resolved node configuration.
ScadaBridge:InboundApi:ApiKeyStore
| Key | Default | Description |
|---|---|---|
| SqlitePath | data/inbound-api-keys.sqlite under content root | Path to the SQLite store for inbound API keys. |
| TokenPrefix | "sbk" | Prefix for issued API key tokens. Fixed; injected by the Host as in-memory config. |
| PepperSecretName | "ScadaBridge:InboundApi:ApiKeyPepper" | Configuration key holding the peppered-HMAC secret. The pepper itself must be ≥ 16 characters; validated by StartupValidator. |
| RunMigrationsOnStartup | true | Whether the hosted service creates the SQLite schema on first run. |
All other per-component configuration sections (ScadaBridge:Communication, ScadaBridge:HealthMonitoring, ScadaBridge:Security, ScadaBridge:InboundApi, ScadaBridge:NotificationOutbox, ScadaBridge:Transport, ScadaBridge:DataConnection, ScadaBridge:StoreAndForward, ScadaBridge:SiteEventLog, ScadaBridge:SiteRuntime, ScadaBridge:Notification) are bound by their respective component extension methods. The Host binds them at the shared BindSharedOptions call or at the role-specific Configure<T> sites in Program.cs and SiteServiceRegistration.Configure.
Dependencies & Interactions
- All 19 component libraries: the Host project-references every component to call its extension methods. The Host is the only project with this fan-out; component libraries do not reference each other except where documented.
- Cluster Infrastructure (#13): the Host configures the underlying Akka.NET cluster (AkkaHostedService.BuildHocon); ClusterInfrastructure manages it at runtime.
- Configuration Database (#17): the Host registers ScadaBridgeDbContext and calls AddConfigurationDatabase (Central only); the StartupRetry-wrapped migration step runs before traffic is accepted.
- Central–Site Communication (#5): the Host creates CentralCommunicationActor and SiteCommunicationActor, registers them with ClusterClientReceptionist, and wires the ClusterClient for site→central messaging; the gRPC server is mapped at app.MapGrpcService<SiteStreamGrpcServer>().
- Health Monitoring (#11): the Host registers health checks (DatabaseHealthCheck, AkkaClusterHealthCheck, ActiveNodeHealthCheck) and mounts them via app.MapZbHealth() on central; site nodes register AddSiteHealthMonitoring and AkkaHealthReportTransport.
- Audit Log (#23): the Host calls AddAuditLog on both roles, AddAuditLogCentralMaintenance on central, and AddAuditLogHealthMetricsBridge on site; it creates the AuditLogIngestActor singleton and registers SiteAuditTelemetryActor on the dedicated dispatcher.
- Notification Outbox (#21): the Host creates the NotificationOutboxActor cluster singleton and hands its proxy to CentralCommunicationActor.
- Site Call Audit (#22): the Host creates the SiteCallAuditActor cluster singleton with a graceful-stop drain task registered in the cluster-leave coordinated-shutdown phase.
- Management Service (#18): the Host creates ManagementActor and registers it with ClusterClientReceptionist; maps the Management and Audit HTTP APIs.
- Traefik Proxy (#20): Traefik polls /health/active to determine which central node to route traffic to; the Host implements the ActiveNodeHealthCheck and ActiveNodeGate that back this endpoint.
- Design spec: Component-Host.md.
Troubleshooting
Node fails to start with validation errors
StartupValidator throws before any DI or actor system setup. The exception message lists all failing keys and their expected constraints. Common causes: missing ScadaBridge:Node:Role, a GrpcPort/RemotingPort collision on a site node, a seed-node URI that accidentally points at the gRPC port rather than the remoting port, or a missing ConfigurationDb connection string on a central node.
Central node loops on database migration
StartupRetry retries connection-class faults for up to 8 attempts (roughly 2 minutes of cumulative backoff in the worst case, not counting the time each attempt itself takes). If every attempt fails, the process exits with a Fatal log entry. Permanent errors (a schema-version mismatch detected by MigrationHelper) are not retried and exit on the first attempt. Check the SqlException details in the log to distinguish a connectivity failure from a schema fault.
Dead letters appearing at startup
A burst of dead letters during startup is normal: actors send messages before their targets finish PreStart. DeadLetterMonitorActor logs each at Warning and increments the health counter — these are observable on the site health report. Sustained dead letters after the cluster stabilises indicate a stale actor reference or a lifecycle race.
Standby central node receives traffic
If Traefik is not yet polling /health/active or its health-check interval has not elapsed after a failover, traffic may briefly reach the standby. ActiveNodeGate returns false on the standby, causing the Inbound API endpoint filter to respond 503 Service Unavailable. The response header X-ScadaBridge-Active: false is present so the condition is identifiable in access logs. No operator action is needed; Traefik will reroute on its next health-check cycle.