Joseph Doherty c5fb02d640 docs(components): accuracy fixes from deep review (batch 1)
Commons (third-party dep, 7 namespaces, retired ApiKey, repo SaveChanges
carve-out), ConfigurationDatabase (5 persisted + 1 non-persisted computed col),
ClusterInfrastructure (abbreviated HOCON note, RemotingPort default),
Host (component matrix: CI/HealthMonitoring/ExternalSystemGateway have no
actors; DeadLetterMonitorActor runs on both roles), Security (Bearer not
X-API-Key; ApiKeyAdmin registered by Host), Communication (Task.Run/Sender).
2026-06-03 16:32:01 -04:00

Host

The Host is the single deployable binary for ScadaBridge. The same executable runs on every node — central and site alike — and selects its component set entirely from configuration, with no separate build targets or conditional compilation.

Overview

Host (#15) is the composition root: it reads ScadaBridge:Node:Role from appsettings.json (layered with a role-specific override file selected by the SCADABRIDGE_CONFIG environment variable), runs pre-DI startup validation, wires every applicable component into the DI container and Akka.NET actor system, and then hands off to ASP.NET Core's WebApplication host.

The component code lives in src/ZB.MOM.WW.ScadaBridge.Host/, split across:

  • Program.cs — the entry point: configuration loading, StartupValidator, role-branched DI registration, Kestrel setup, middleware pipeline, and endpoint mapping.
  • Actors/AkkaHostedService.cs — owns the ActorSystem lifetime; builds HOCON from bound options; registers role-specific actors as cluster singletons or plain ActorOf calls.
  • Actors/DeadLetterMonitorActor.cs — subscribes to the DeadLetter event stream and increments the health metric.
  • Health/ActiveNodeGate.cs — production IActiveNodeGate backed by Akka cluster leadership; used by the Inbound API endpoint filter to gate traffic on standby nodes.
  • Health/AkkaClusterNodeProvider.cs — feeds IClusterNodeProvider from live Akka cluster membership for health reporting.
  • SiteServiceRegistration.cs — extracted site-role DI registrations reused by both Program.cs and integration test harnesses.
  • StartupValidator.cs — pre-DI configuration preflight that fails fast before any actor system is created.
  • StartupRetry.cs — bounded exponential-backoff helper for startup preconditions (database migrations).
  • LoggerConfigurationFactory.cs — builds the Serilog LoggerConfiguration with node-identity enrichment.

Key Concepts

Role selection via SCADABRIDGE_CONFIG

The configuration builder layers appsettings.json, then appsettings.{SCADABRIDGE_CONFIG}.json, then environment variables and command-line arguments. The SCADABRIDGE_CONFIG environment variable selects the role-specific file (Central or Site); when it is absent, the Host falls back to DOTNET_ENVIRONMENT, and then to Production. DOTNET_ENVIRONMENT and ASPNETCORE_ENVIRONMENT can therefore stay set to Development for dev tooling (static assets, EF migrations) regardless of which role is active.

// Role-specific file name: SCADABRIDGE_CONFIG, then DOTNET_ENVIRONMENT, then "Production".
var scadabridgeConfig = Environment.GetEnvironmentVariable("SCADABRIDGE_CONFIG")
    ?? Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT")
    ?? "Production";

// Later sources override earlier ones: base file, role file, environment, command line.
var configuration = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: false)
    .AddJsonFile($"appsettings.{scadabridgeConfig}.json", optional: true)
    .AddEnvironmentVariables()
    .AddCommandLine(args)
    .Build();

The resolved ScadaBridge:Node:Role value then branches the entire DI and Akka bootstrap.

Pre-DI startup validation

StartupValidator.Validate runs before any DI or actor system setup. It collects every validation error and then throws a single InvalidOperationException listing all of them, so one failed start reports every problem instead of one at a time. This also avoids the confusing partial-startup failures that occur when validation is deferred to first resolve. For Site nodes, the validator additionally checks that GrpcPort, MetricsPort, and RemotingPort are all distinct and that no seed-node entry points at the gRPC port.
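The collect-then-throw pattern described above can be sketched as follows. This is a minimal illustration, not the actual StartupValidator: the NodeSettings record, the StartupValidationSketch class, and the error message wording are all hypothetical, and only the rules stated in this document (role values, port range, seed-node count, site port distinctness, seed nodes not pointing at the gRPC port) are modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the bound node settings; the real options classes may differ.
public sealed record NodeSettings(
    string? Role, string? SiteId, int RemotingPort, int GrpcPort, int MetricsPort,
    IReadOnlyList<string> SeedNodes);

public static class StartupValidationSketch
{
    // Collects every problem first so a single run reports all of them.
    public static IReadOnlyList<string> Collect(NodeSettings s)
    {
        var errors = new List<string>();

        if (s.Role is not ("Central" or "Site"))
            errors.Add("ScadaBridge:Node:Role must be \"Central\" or \"Site\".");

        if (s.RemotingPort is < 1 or > 65535)
            errors.Add("ScadaBridge:Node:RemotingPort must be in range 1-65535.");

        if (s.SeedNodes.Count < 2)
            errors.Add("ScadaBridge:Cluster:SeedNodes needs at least 2 entries.");

        if (s.Role == "Site")
        {
            if (string.IsNullOrWhiteSpace(s.SiteId))
                errors.Add("ScadaBridge:Node:SiteId is required for Site nodes.");

            var ports = new[] { s.RemotingPort, s.GrpcPort, s.MetricsPort };
            if (ports.Distinct().Count() != ports.Length)
                errors.Add("RemotingPort, GrpcPort, and MetricsPort must all be distinct.");

            foreach (var seed in s.SeedNodes)
            {
                // Seed nodes look like akka.tcp://scadabridge@host:port.
                if (Uri.TryCreate(seed, UriKind.Absolute, out var uri) && uri.Port == s.GrpcPort)
                    errors.Add($"Seed node '{seed}' points at the gRPC port {s.GrpcPort}.");
            }
        }

        return errors;
    }

    // Throws one exception that lists every collected error.
    public static void Validate(NodeSettings s)
    {
        var errors = Collect(s);
        if (errors.Count > 0)
            throw new InvalidOperationException(
                "Startup validation failed:" + Environment.NewLine +
                string.Join(Environment.NewLine, errors.Select(e => " - " + e)));
    }
}
```

Splitting Collect from Validate is a design choice of the sketch: it keeps the rules testable without catching exceptions, while the throwing wrapper gives the fail-fast behaviour at startup.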

Akka HOCON construction

AkkaHostedService.BuildHocon assembles the HOCON configuration document from strongly-typed options rather than inline strings. Every interpolated value passes through QuoteHocon (escapes backslashes and double-quotes) to prevent a hostname, seed-node URI, or split-brain strategy value from corrupting the document. Durations are rendered in milliseconds (DurationHocon) so sub-second timing values (e.g. a 750 ms heartbeat) are preserved exactly.

The actor system name is always scadabridge. Site nodes carry two cluster roles: the generic "Site" role and a per-site role ("site-{SiteId}") used to scope cluster singletons to a specific site.
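The quoting, duration, and role rules above can be sketched as small pure helpers. The names QuoteHocon and DurationHocon come from this document, but the bodies below are assumptions about how such helpers might look; SiteRoles, RolesHocon, and the HoconSketch class are hypothetical names introduced for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class HoconSketch
{
    // Escape backslashes first, then double quotes, and wrap the result in quotes.
    public static string QuoteHocon(string value) =>
        "\"" + value.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";

    // Render a duration as whole milliseconds with the HOCON "ms" unit.
    public static string DurationHocon(TimeSpan value) =>
        ((long)value.TotalMilliseconds).ToString(CultureInfo.InvariantCulture) + "ms";

    // Site nodes carry the generic role plus a per-site role.
    public static IReadOnlyList<string> SiteRoles(string siteId) =>
        new[] { "Site", $"site-{siteId}" };

    // Render a role list as a HOCON array of quoted strings.
    public static string RolesHocon(IEnumerable<string> roles) =>
        "[" + string.Join(", ", roles.Select(QuoteHocon)) + "]";
}
```

Escaping the backslash before the double quote matters: doing it in the other order would double the backslashes that the quote escaping just introduced.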

/health/ready — readiness gating

Central nodes register DatabaseHealthCheck<ScadaBridgeDbContext> (tagged Ready) and AkkaClusterHealthCheck (tagged Ready). The /health/ready endpoint returns 200 only when both pass. Readiness is explicitly not tied to cluster leadership: a fully operational standby central node still reports ready because ActiveNodeHealthCheck carries only the Active tag, not Ready.

Load balancers and orchestrators should poll /health/ready to determine when a freshly started or failed-over node can receive traffic.
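The reason a standby can be ready without being active is purely a matter of which tag each endpoint filters on. The sketch below models that filtering with plain data; it does not use the real health-check types. CheckSketch and HealthEndpointSketch are hypothetical names, and the 503 status for a failing endpoint is an assumption (this document only states when the endpoints return 200).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of a registered health check: a name, its tags, and its current result.
public sealed record CheckSketch(string Name, string[] Tags, bool Healthy);

public static class HealthEndpointSketch
{
    // An endpoint evaluates only the checks carrying its tag:
    // 200 when all of them pass, 503 (assumed) otherwise.
    public static int StatusFor(string tag, IEnumerable<CheckSketch> checks) =>
        checks.Where(c => c.Tags.Contains(tag)).All(c => c.Healthy) ? 200 : 503;

    // The central check set named in this document; isLeader drives the Active-tagged check.
    public static CheckSketch[] CentralChecks(bool dbOk, bool clusterOk, bool isLeader) => new[]
    {
        new CheckSketch("DatabaseHealthCheck", new[] { "Ready" }, dbOk),
        new CheckSketch("AkkaClusterHealthCheck", new[] { "Ready" }, clusterOk),
        new CheckSketch("ActiveNodeHealthCheck", new[] { "Active" }, isLeader),
    };
}
```

For a healthy standby, CentralChecks(true, true, false) yields 200 for the "Ready" tag and a failing status for the "Active" tag, because the leadership check is never consulted by the readiness endpoint.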

/health/active — active-node routing for Traefik

ActiveNodeHealthCheck carries the Active tag and is served at /health/active. It returns 200 only on the cluster leader. Traefik polls this endpoint and routes inbound traffic — Central UI, Inbound API, Management API — exclusively to the node that answers 200. See TraefikProxy for the upstream routing rules.

The same leadership check backs ActiveNodeGate, the IActiveNodeGate implementation the Inbound API endpoint filter consults before executing a method script. A standby node therefore refuses inbound API calls even if traffic somehow reaches it directly.

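public bool IsActiveNode
{
    get
    {
        // No actor system yet (still starting) means this node cannot be active.
        var system = _akkaService.ActorSystem;
        if (system == null)
            return false;

        // A member that is not Up never serves traffic.
        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up)
            return false;

        // Active means this node's address is the current cluster leader.
        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}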

Architecture

Central composition root

Program.cs (Central branch) calls WebApplication.CreateBuilder, registers shared and central-only components, builds the WebApplication, applies or retries database migrations, and mounts the middleware pipeline and endpoints. The order is intentional: UseAuthentication and UseAuthorization run before UseAuditWriteMiddleware so HttpContext.User is populated when the audit row is written.

Before branching on role, AkkaHostedService.StartAsync creates one actor unconditionally on every node:

  • DeadLetterMonitorActor — plain ActorOf; subscribes to the DeadLetter event stream on PreStart. Runs on both central and site nodes.

AkkaHostedService.RegisterCentralActors creates:

  • CentralCommunicationActor — registered with ClusterClientReceptionist so site ClusterClients can reach it.
  • ManagementActor — also registered with ClusterClientReceptionist; the CLI connects via ClusterClient without joining the cluster.
  • NotificationOutboxActor — cluster singleton (no role scope); a proxy is handed to CentralCommunicationActor so forwarded NotificationSubmit messages from sites are routed to it.
  • AuditLogIngestActor — cluster singleton; proxy registered with both CentralCommunicationActor and (if present) the SiteStreamGrpcServer.
  • SiteCallAuditActor — cluster singleton; a graceful-stop task is added to the cluster-leave coordinated-shutdown phase with a 10-second drain window.

Site composition root

Program.cs (Site branch) calls WebApplication.CreateBuilder with a Kestrel configuration that binds two listeners: HTTP/2 only on GrpcPort (default 8083) for the gRPC server, and HTTP/1+2 on MetricsPort (default 8084) for the Prometheus /metrics scrape endpoint. The separation exists because a standard HTTP/1.1 Prometheus scraper cannot negotiate HTTP/2; the gRPC listener must stay pure HTTP/2.

SiteServiceRegistration.Configure registers the site-only components. AkkaHostedService.RegisterSiteActorsAsync creates:

  • DeploymentManagerActor — cluster singleton scoped to "site-{SiteId}".
  • SiteCommunicationActor — registered with ClusterClientReceptionist; creates a ClusterClient to configured central contact points.
  • SiteReplicationActor — one per node (not a singleton); handles best-effort S&F replication to the standby.
  • EventLogHandlerActor — cluster singleton scoped to "site-{SiteId}".
  • ParkedMessageHandlerActor — bridges Akka to StoreAndForwardService.
  • SiteAuditTelemetryActor — created on a dedicated audit-telemetry-dispatcher (2-thread ForkJoinDispatcher) so SQLite reads and gRPC pushes never contend with hot-path actors.
  • DataConnectionManagerActor — if IDataConnectionFactory is registered.

Shutdown ordering for the site role is explicit: IHostApplicationLifetime.ApplicationStopping fires before IHostedService.StopAsync, so SiteStreamGrpcServer.CancelAllStreams is called first (clients observe a clean cancellation and reconnect), then AkkaHostedService runs CoordinatedShutdown and tears down actors.

siteLifetime.ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams());

Database migration retry

On central nodes, StartupRetry.ExecuteWithRetryAsync wraps the migration step with up to 8 attempts and exponential backoff that starts at 2 seconds and is capped at 30 seconds. Only connection-class faults (SocketException, SqlException, DbException, TimeoutException) are retried; a schema-version mismatch surfaces as an InvalidOperationException and fails immediately. The ApplicationStopping token is threaded into both the migration call and the inter-attempt Task.Delay so a SIGTERM during the retry window tears down cleanly.
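A bounded retry of this shape can be sketched as below. This is not the real StartupRetry: the class name, method signatures, and the transient-fault predicate parameter are hypothetical, and the sketch assumes the delay doubles on each retry, which the document does not state explicitly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class StartupRetrySketch
{
    // Delay before retry number `retry` (1-based): initial * 2^(retry - 1), capped.
    // Doubling is an assumption of this sketch.
    public static TimeSpan DelayFor(int retry, TimeSpan initial, TimeSpan cap)
    {
        var factor = Math.Pow(2, retry - 1);
        var ms = Math.Min(initial.TotalMilliseconds * factor, cap.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }

    public static async Task ExecuteWithRetryAsync(
        Func<CancellationToken, Task> action,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan initial,
        TimeSpan cap,
        CancellationToken ct)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await action(ct);
                return;
            }
            // Non-transient faults and the final attempt fall through and propagate.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // The token is passed to the delay so shutdown interrupts the wait.
                await Task.Delay(DelayFor(attempt, initial, cap), ct);
            }
        }
    }
}
```

Under the doubling assumption, 8 attempts with a 2-second start and a 30-second cap wait 2 + 4 + 8 + 16 + 30 + 30 + 30 = 120 seconds in total between attempts, which matches the "roughly 2 minutes worst-case" figure given under Troubleshooting.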

Usage

The Host is not consumed as a library; it is the executable entry point. Other components expose themselves to the Host via the extension-method convention:

  • IServiceCollection.AddXxx() — registers DI services.
  • AkkaHostedService.RegisterXxxActors() / inline ActorOf calls in AkkaHostedService — registers actors.
  • WebApplication.MapXxx() — maps web endpoints (Central UI, Inbound API, Management API, Audit API).

Program.cs calls these methods; the component libraries own the registration logic. This keeps the Host thin and each component self-contained.

Component registration by role

Component Central Site AddXxx Actors MapXxx
ClusterInfrastructure Yes Yes Yes
Communication Yes Yes Yes Yes
HealthMonitoring Yes Yes Yes
ExternalSystemGateway Yes Yes Yes
AuditLog Yes Yes Yes Yes
NotificationService Yes No Yes
NotificationOutbox Yes No Yes Yes (singleton)
SiteCallAudit Yes No Yes Yes (singleton)
TemplateEngine Yes No Yes Yes
DeploymentManager Yes No Yes Yes
Security Yes No Yes
CentralUI Yes No Yes Yes
InboundAPI Yes No Yes Yes
ManagementService Yes No Yes Yes Yes
Transport Yes No Yes
ConfigurationDatabase Yes No Yes
SiteRuntime No Yes Yes Yes (singleton)
DataConnectionLayer No Yes Yes Yes
StoreAndForward No Yes Yes Yes
SiteEventLogging No Yes Yes Yes (singleton)

AuditLog calls AddAuditLog on both roles; central additionally calls AddAuditLogCentralMaintenance. Site calls AddAuditLogHealthMetricsBridge to bridge write failures into the site health report.

Configuration

Options are bound via the .NET Options pattern (IOptions<T>). Each component owns its options class; the Host binds each section and passes the IConfiguration to component extension methods only where the component's own validator needs it at startup.

ScadaBridge:Node (NodeOptions)

| Key | Default | Description |
|---|---|---|
| Role | | "Central" or "Site". Validated by StartupValidator. |
| NodeHostname | | Hostname or IP advertised to the Akka cluster and enriched on log entries. |
| NodeName | | Free-form semantic name stamped as SourceNode on audit rows (e.g. "central-a", "node-b"). Empty normalises to null. |
| SiteId | | Site identifier; required for Site nodes; used to scope cluster singletons and enrich telemetry. |
| RemotingPort | 8081 | Akka.NET remoting TCP port. Must be in range 1-65535. |
| GrpcPort | 8083 | Kestrel HTTP/2 port for the site gRPC stream server (Site nodes only). Must differ from RemotingPort. |
| MetricsPort | 8084 | Kestrel HTTP/1+2 port for the Prometheus /metrics scrape endpoint (Site nodes only). Must differ from both RemotingPort and GrpcPort. |

ScadaBridge:Cluster (ClusterOptions)

| Key | Default | Description |
|---|---|---|
| SeedNodes | | List of Akka seed-node URIs (akka.tcp://scadabridge@host:port). At least 2 required. Must reference remoting ports, not gRPC ports. |
| SplitBrainResolverStrategy | keep-oldest | Active strategy name (e.g. "keep-oldest"). |
| StableAfter | "00:00:15" | Duration the cluster must be stable before the resolver acts. |
| HeartbeatInterval | "00:00:02" | Akka failure-detector heartbeat cadence. |
| FailureDetectionThreshold | "00:00:10" | Acceptable heartbeat pause before a node is considered unreachable. |
| MinNrOfMembers | 1 | Minimum number of cluster members required before the leader moves joining members to Up. |
| DownIfAlone | true | When using keep-oldest, whether a lone surviving node downs itself. |

ScadaBridge:Database (DatabaseOptions)

| Key | Role | Description |
|---|---|---|
| ConfigurationDb | Central | MS SQL connection string for the central ScadaBridgeDbContext. Required; validated by StartupValidator. |
| SiteDbPath | Site | Filesystem path to the site-local SQLite database. Required for Site nodes. |

ScadaBridge:Logging (LoggingOptions)

| Key | Default | Description |
|---|---|---|
| MinimumLevel | "Information" | Serilog minimum log level. Overrides any Serilog:MinimumLevel entry; a one-shot warning is emitted to stderr if both are present. Parsed case-insensitively; unrecognised values fall back to Information with a warning. |

Serilog sinks (console output template, file path, rolling interval) are configured under the standard Serilog JSON section and applied via ReadFrom.Configuration. Every log entry is enriched with SiteId, NodeHostname, and NodeRole properties from the resolved node configuration.
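The MinimumLevel parsing rule (case-insensitive, falling back to Information with a warning) can be sketched as follows. To stay free of package dependencies the sketch uses a local LevelSketch enum as a stand-in for Serilog's level type; that enum, the LoggingLevelSketch class, the warn callback, and the warning text are all hypothetical.

```csharp
using System;

// Stand-in for Serilog's level enum so the sketch has no package dependency.
public enum LevelSketch { Verbose, Debug, Information, Warning, Error, Fatal }

public static class LoggingLevelSketch
{
    // Parses the configured level case-insensitively; anything unrecognised
    // falls back to Information and reports a warning through the callback.
    public static LevelSketch Parse(string? configured, Action<string> warn)
    {
        if (string.IsNullOrWhiteSpace(configured))
            return LevelSketch.Information; // documented default

        // IsDefined rejects numeric strings such as "42" that TryParse would accept.
        if (Enum.TryParse<LevelSketch>(configured, ignoreCase: true, out var level)
            && Enum.IsDefined(typeof(LevelSketch), level))
            return level;

        warn($"Unrecognised MinimumLevel '{configured}'; using Information.");
        return LevelSketch.Information;
    }
}
```

Passing the warning sink as a callback keeps the parser usable before any logger exists, which fits a value that is read while the logger itself is still being configured.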

ScadaBridge:InboundApi:ApiKeyStore

| Key | Default | Description |
|---|---|---|
| SqlitePath | data/inbound-api-keys.sqlite under content root | Path to the SQLite store for inbound API keys. |
| TokenPrefix | "sbk" | Prefix for issued API key tokens. Fixed; injected by the Host as in-memory config. |
| PepperSecretName | "ScadaBridge:InboundApi:ApiKeyPepper" | Configuration key holding the peppered-HMAC secret. The pepper itself must be ≥ 16 characters; validated by StartupValidator. |
| RunMigrationsOnStartup | true | Whether the hosted service creates the SQLite schema on first run. |

All other per-component configuration sections (ScadaBridge:Communication, ScadaBridge:HealthMonitoring, ScadaBridge:Security, ScadaBridge:InboundApi, ScadaBridge:NotificationOutbox, ScadaBridge:Transport, ScadaBridge:DataConnection, ScadaBridge:StoreAndForward, ScadaBridge:SiteEventLog, ScadaBridge:SiteRuntime, ScadaBridge:Notification) belong to their respective components, which own the options classes. The Host binds these sections in the shared BindSharedOptions call or at the role-specific Configure<T> sites in Program.cs and SiteServiceRegistration.Configure.

Dependencies & Interactions

  • All 19 component libraries — the Host project-references every component to call its extension methods. The Host is the only project with this fan-out; component libraries do not reference each other except where documented.
  • Cluster Infrastructure (#13) — the Host configures the underlying Akka.NET cluster (AkkaHostedService.BuildHocon); ClusterInfrastructure manages it at runtime.
  • Configuration Database (#17) — the Host registers ScadaBridgeDbContext and calls AddConfigurationDatabase (Central only); the StartupRetry-wrapped migration step runs before traffic is accepted.
  • Central-Site Communication (#5) — the Host creates CentralCommunicationActor and SiteCommunicationActor, registers them with ClusterClientReceptionist, and wires the ClusterClient for site→central messaging; the gRPC server is mapped at app.MapGrpcService<SiteStreamGrpcServer>().
  • Health Monitoring (#11) — the Host registers health checks (DatabaseHealthCheck, AkkaClusterHealthCheck, ActiveNodeHealthCheck) and mounts them via app.MapZbHealth() on central; site nodes register AddSiteHealthMonitoring and AkkaHealthReportTransport.
  • Audit Log (#23) — the Host calls AddAuditLog on both roles, AddAuditLogCentralMaintenance on central, and AddAuditLogHealthMetricsBridge on site; it creates the AuditLogIngestActor singleton and registers SiteAuditTelemetryActor on the dedicated dispatcher.
  • Notification Outbox (#21) — the Host creates the NotificationOutboxActor cluster singleton and hands its proxy to CentralCommunicationActor.
  • Site Call Audit (#22) — the Host creates the SiteCallAuditActor cluster singleton with a graceful-stop drain task registered in the cluster-leave coordinated-shutdown phase.
  • Management Service (#18) — the Host creates ManagementActor and registers it with ClusterClientReceptionist; maps the Management and Audit HTTP APIs.
  • Traefik Proxy (#20) — Traefik polls /health/active to determine which central node to route traffic to; the Host implements the ActiveNodeHealthCheck and ActiveNodeGate that back this endpoint.
  • Design spec: Component-Host.md.

Troubleshooting

Node fails to start with validation errors

StartupValidator throws before any DI or actor system setup. The exception message lists all failing keys and their expected constraints. Common causes: missing ScadaBridge:Node:Role, a GrpcPort/RemotingPort collision on a site node, a seed-node URI that accidentally points at the gRPC port rather than the remoting port, or a missing ConfigurationDb connection string on a central node.

Central node loops on database migration

StartupRetry retries connection-class faults up to 8 times (roughly 2 minutes worst-case). If the loop exhausts without success, the process exits with a Fatal log entry. Permanent errors (schema-version mismatch detected by MigrationHelper) are not retried and exit on the first attempt. Check SqlException details in the log to distinguish a connectivity failure from a schema fault.

Dead letters appearing at startup

A burst of dead letters during startup is normal: actors send messages before their targets finish PreStart. DeadLetterMonitorActor logs each at Warning and increments the health counter — these are observable on the site health report. Sustained dead letters after the cluster stabilises indicate a stale actor reference or a lifecycle race.

Standby central node receives traffic

If Traefik is not yet polling /health/active, or its health-check interval has not elapsed after a failover, traffic may briefly reach the standby. ActiveNodeGate.IsActiveNode returns false on the standby, so the Inbound API endpoint filter responds with 503 Service Unavailable. The response header X-ScadaBridge-Active: false is present so the condition is identifiable in access logs. No operator action is needed; Traefik will reroute on its next health-check cycle.