# Host

The Host is the single deployable binary for ScadaBridge. The same executable runs on every node — central and site alike — and selects its component set entirely from configuration, with no separate build targets or conditional compilation.

## Overview

Host (#15) is the composition root: it reads `ScadaBridge:Node:Role` from `appsettings.json` (layered with a role-specific override file selected by the `SCADABRIDGE_CONFIG` environment variable), runs pre-DI startup validation, wires every applicable component into the DI container and Akka.NET actor system, and then hands off to ASP.NET Core's `WebApplication` host.

The component code lives in `src/ZB.MOM.WW.ScadaBridge.Host/`, split across:

- `Program.cs` — the entry point: configuration loading, `StartupValidator`, role-branched DI registration, Kestrel setup, middleware pipeline, and endpoint mapping.
- `Actors/AkkaHostedService.cs` — owns the `ActorSystem` lifetime; builds HOCON from bound options; registers role-specific actors as cluster singletons or plain `ActorOf` calls.
- `Actors/DeadLetterMonitorActor.cs` — subscribes to the `DeadLetter` event stream and increments the health metric.
- `Health/ActiveNodeGate.cs` — production `IActiveNodeGate` backed by Akka cluster leadership; used by the Inbound API endpoint filter to gate traffic on standby nodes.
- `Health/AkkaClusterNodeProvider.cs` — feeds `IClusterNodeProvider` from live Akka cluster membership for health reporting.
- `SiteServiceRegistration.cs` — extracted site-role DI registrations reused by both `Program.cs` and integration test harnesses.
- `StartupValidator.cs` — pre-DI configuration preflight that fails fast before any actor system is created.
- `StartupRetry.cs` — bounded exponential-backoff helper for startup preconditions (database migrations).
- `LoggerConfigurationFactory.cs` — builds the Serilog `LoggerConfiguration` with node-identity enrichment.

## Key Concepts

### Role selection via `SCADABRIDGE_CONFIG`

The configuration builder layers `appsettings.json`, then `appsettings.{SCADABRIDGE_CONFIG}.json`, followed by environment variables and command-line arguments. The `SCADABRIDGE_CONFIG` environment variable selects the role-specific file (`Central` or `Site`); when absent, it falls back to `DOTNET_ENVIRONMENT`, then to `Production`. `DOTNET_ENVIRONMENT`/`ASPNETCORE_ENVIRONMENT` remain `Development` for dev tooling (static assets, EF migrations) independently of which role is active.

```csharp
// Role file selector: SCADABRIDGE_CONFIG wins, then DOTNET_ENVIRONMENT, then "Production".
var scadabridgeConfig = Environment.GetEnvironmentVariable("SCADABRIDGE_CONFIG")
    ?? Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT")
    ?? "Production";

// Later sources override earlier ones.
var configuration = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: false)
    .AddJsonFile($"appsettings.{scadabridgeConfig}.json", optional: true)
    .AddEnvironmentVariables()
    .AddCommandLine(args)
    .Build();
```

The resolved `ScadaBridge:Node:Role` value then branches the entire DI and Akka bootstrap.

### Pre-DI startup validation

`StartupValidator.Validate` runs before any DI or actor system setup. It collects all errors, then throws a single `InvalidOperationException` listing every problem. This avoids the confusing partial-startup failures that occur when validation is deferred to first resolve. Site nodes additionally validate that `GrpcPort`, `MetricsPort`, and `RemotingPort` are all distinct and that no seed-node entry points at the gRPC port.

### Akka HOCON construction

`AkkaHostedService.BuildHocon` assembles the HOCON configuration document from strongly-typed options rather than inline strings. Every interpolated value passes through `QuoteHocon` (escapes backslashes and double-quotes) to prevent a hostname, seed-node URI, or split-brain strategy value from corrupting the document. Durations are rendered in milliseconds (`DurationHocon`) so sub-second timing values (e.g. a 750 ms heartbeat) are preserved exactly. The actor system name is always `scadabridge`.

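
The quoting and duration rules above can be illustrated with a small sketch. `HoconSketch` and both method bodies are assumptions written for illustration, not the Host's actual `QuoteHocon` and `DurationHocon`; in particular, the `ms` suffix and the truncation to whole milliseconds are guesses at the rendered form.

```csharp
using System;
using System.Globalization;

// A seed-node URI passes through unchanged apart from the surrounding quotes.
Console.WriteLine(HoconSketch.QuoteHocon("akka.tcp://scadabridge@node-a:8081"));
// prints "akka.tcp://scadabridge@node-a:8081" (including the quotes)

// Backslashes and embedded quotes are escaped so they cannot end the string early.
Console.WriteLine(HoconSketch.QuoteHocon("C:\\data\\site \"A\""));
// prints "C:\\data\\site \"A\""

// A sub-second duration keeps its exact value.
Console.WriteLine(HoconSketch.DurationHocon(TimeSpan.FromMilliseconds(750)));
// prints 750ms

// Hypothetical stand-ins for the helpers named in the text.
static class HoconSketch
{
    // Escape backslashes first, then double quotes, then wrap the result in quotes.
    public static string QuoteHocon(string value) =>
        "\"" + value.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";

    // Render the duration as whole milliseconds with an explicit unit suffix.
    public static string DurationHocon(TimeSpan value) =>
        ((long)value.TotalMilliseconds).ToString(CultureInfo.InvariantCulture) + "ms";
}
```

The escape order matters: escaping quotes before backslashes would double the backslash that the quote escape itself introduces.
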
Site nodes carry two cluster roles: the generic `"Site"` role and a per-site role (`"site-{SiteId}"`) used to scope cluster singletons to a specific site.

### `/health/ready` — readiness gating

Central nodes register `DatabaseHealthCheck` (tagged `Ready`) and `AkkaClusterHealthCheck` (tagged `Ready`). The `/health/ready` endpoint returns 200 only when both pass. Readiness is explicitly not tied to cluster leadership: a fully operational standby central node still reports ready because `ActiveNodeHealthCheck` carries only the `Active` tag, not `Ready`. Load balancers and orchestrators should poll `/health/ready` to determine when a freshly started or failed-over node can receive traffic.

### `/health/active` — active-node routing for Traefik

`ActiveNodeHealthCheck` carries the `Active` tag and is served at `/health/active`. It returns 200 only on the cluster leader. Traefik polls this endpoint and routes inbound traffic — Central UI, Inbound API, Management API — exclusively to the node that answers 200. See [TraefikProxy](./TraefikProxy.md) for the upstream routing rules.

The same leadership check backs `ActiveNodeGate`, the `IActiveNodeGate` implementation the Inbound API endpoint filter consults before executing a method script. A standby node therefore refuses inbound API calls even if traffic somehow reaches it directly.

```csharp
public bool IsActiveNode
{
    get
    {
        // No actor system yet (startup or shutdown): treat the node as standby.
        var system = _akkaService.ActorSystem;
        if (system == null) return false;

        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;

        // A member that has not reached Up cannot be the active node.
        if (self.Status != MemberStatus.Up) return false;

        // Active means this node's address is the current cluster leader.
        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}
```

## Architecture

### Central composition root

`Program.cs` (Central branch) calls `WebApplication.CreateBuilder`, registers shared and central-only components, builds the `WebApplication`, applies or retries database migrations, and mounts the middleware pipeline and endpoints.

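
Why registration order matters in a middleware pipeline can be shown with a framework-free toy model. Nothing below is ASP.NET Core or ScadaBridge code: `ToyContext`, `RunPipeline`, and both steps are hypothetical stand-ins, and the sketch assumes the audit step reads the user on the way in, before calling the next step.

```csharp
using System;
using System.Collections.Generic;

var auditRows = new List<string>();

// Toy authentication step: populates the user, then continues the chain.
Action<ToyContext, Action> authenticate = (ctx, next) =>
{
    ctx.User = "operator1";
    next();
};

// Toy audit step: records whichever user is known at the moment it runs.
Action<ToyContext, Action> auditWrite = (ctx, next) =>
{
    auditRows.Add(ctx.User ?? "(anonymous)");
    next();
};

RunPipeline(new ToyContext(), authenticate, auditWrite); // authentication first
RunPipeline(new ToyContext(), auditWrite, authenticate); // reversed, for contrast

Console.WriteLine(string.Join(", ", auditRows)); // prints: operator1, (anonymous)

// Invokes each step in registration order; a step continues by calling next().
static void RunPipeline(ToyContext ctx, params Action<ToyContext, Action>[] steps)
{
    void Invoke(int index)
    {
        if (index < steps.Length) steps[index](ctx, () => Invoke(index + 1));
    }
    Invoke(0);
}

// Minimal request context: User stays null until an authentication step sets it.
sealed class ToyContext
{
    public string? User { get; set; }
}
```

Only the first run records the authenticated user; in the reversed run the audit step executes before any identity exists.
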
The middleware order is intentional: `UseAuthentication` and `UseAuthorization` run before `UseAuditWriteMiddleware` so `HttpContext.User` is populated when the audit row is written.

`AkkaHostedService.RegisterCentralActors` creates:

- `CentralCommunicationActor` — registered with `ClusterClientReceptionist` so site `ClusterClient`s can reach it.
- `ManagementActor` — also registered with `ClusterClientReceptionist`; the CLI connects via `ClusterClient` without joining the cluster.
- `NotificationOutboxActor` — cluster singleton (no role scope); a proxy is handed to `CentralCommunicationActor` so forwarded `NotificationSubmit` messages from sites are routed to it.
- `AuditLogIngestActor` — cluster singleton; proxy registered with both `CentralCommunicationActor` and (if present) the `SiteStreamGrpcServer`.
- `SiteCallAuditActor` — cluster singleton; a graceful-stop task is added to the `cluster-leave` coordinated-shutdown phase with a 10-second drain window.
- `DeadLetterMonitorActor` — plain `ActorOf`; subscribes to the `DeadLetter` event stream on `PreStart`.

### Site composition root

`Program.cs` (Site branch) calls `WebApplication.CreateBuilder` with a Kestrel configuration that binds two listeners: HTTP/2 only on `GrpcPort` (default 8083) for the gRPC server, and HTTP/1+2 on `MetricsPort` (default 8084) for the Prometheus `/metrics` scrape endpoint. The separation exists because a standard HTTP/1.1 Prometheus scraper cannot negotiate HTTP/2; the gRPC listener must stay pure HTTP/2.

`SiteServiceRegistration.Configure` registers the site-only components. `AkkaHostedService.RegisterSiteActorsAsync` creates:

- `DeploymentManagerActor` — cluster singleton scoped to `"site-{SiteId}"`.
- `SiteCommunicationActor` — registered with `ClusterClientReceptionist`; creates a `ClusterClient` to configured central contact points.
- `SiteReplicationActor` — one per node (not a singleton); handles best-effort S&F replication to the standby.
- `EventLogHandlerActor` — cluster singleton scoped to `"site-{SiteId}"`.
- `ParkedMessageHandlerActor` — bridges Akka to `StoreAndForwardService`.
- `SiteAuditTelemetryActor` — created on a dedicated `audit-telemetry-dispatcher` (2-thread `ForkJoinDispatcher`) so SQLite reads and gRPC pushes never contend with hot-path actors.
- `DataConnectionManagerActor` — created only if `IDataConnectionFactory` is registered.

Shutdown ordering for the site role is explicit: `IHostApplicationLifetime.ApplicationStopping` fires before `IHostedService.StopAsync`, so `SiteStreamGrpcServer.CancelAllStreams` is called first (clients observe a clean cancellation and reconnect), then `AkkaHostedService` runs `CoordinatedShutdown` and tears down actors.

```csharp
siteLifetime.ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams());
```

### Database migration retry

On central nodes, `StartupRetry.ExecuteWithRetryAsync` wraps the migration step with up to 8 attempts, using exponential backoff that starts at 2 seconds and is capped at 30 seconds. Only connection-class faults (`SocketException`, `SqlException`, `DbException`, `TimeoutException`) are retried; a schema-version mismatch surfaces as an `InvalidOperationException` and fails immediately. The `ApplicationStopping` token is threaded into both the migration call and the inter-attempt `Task.Delay` so a SIGTERM during the retry window tears down cleanly.

## Usage

The Host is not consumed as a library; it is the executable entry point. Other components expose themselves to the Host via the extension-method convention:

- `IServiceCollection.AddXxx()` — registers DI services.
- `AkkaHostedService.RegisterXxxActors()` / inline `ActorOf` calls in `AkkaHostedService` — registers actors.
- `WebApplication.MapXxx()` — maps web endpoints (Central UI, Inbound API, Management API, Audit API).

`Program.cs` calls these methods; the component libraries own the registration logic. This keeps the Host thin and each component self-contained.

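
As an illustration of the `AddXxx` convention, the sketch below shows a hypothetical component exposing its registrations and a host calling them. `ISampleClock`, `SystemSampleClock`, and `AddSampleComponent` are invented names, not ScadaBridge types, and the code assumes the `Microsoft.Extensions.DependencyInjection` package is referenced.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

// Host side: the composition root only calls the component's extension method.
var services = new ServiceCollection();
services.AddSampleComponent();

using var provider = services.BuildServiceProvider();
var clock = provider.GetRequiredService<ISampleClock>();
Console.WriteLine(clock.UtcNow.Kind); // prints: Utc

// Component side (hypothetical): the library owns its own registrations.
public interface ISampleClock
{
    DateTime UtcNow { get; }
}

internal sealed class SystemSampleClock : ISampleClock
{
    public DateTime UtcNow => DateTime.UtcNow;
}

public static class SampleComponentServiceCollectionExtensions
{
    // Registers everything the component needs and returns the collection for chaining.
    public static IServiceCollection AddSampleComponent(this IServiceCollection services)
    {
        services.AddSingleton<ISampleClock, SystemSampleClock>();
        return services;
    }
}
```

Because the implementation type stays `internal`, the host depends only on the extension method and the public interface, which is what lets each component change its wiring without touching `Program.cs`.
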
### Component registration by role

| Component | Central | Site | `AddXxx` | Actors | `MapXxx` |
|---|:---:|:---:|:---:|:---:|:---:|
| ClusterInfrastructure | Yes | Yes | Yes | Yes | — |
| Communication | Yes | Yes | Yes | Yes | — |
| HealthMonitoring | Yes | Yes | Yes | Yes | — |
| ExternalSystemGateway | Yes | Yes | Yes | Yes | — |
| AuditLog | Yes | Yes | Yes | Yes | — |
| NotificationService | Yes | No | Yes | — | — |
| NotificationOutbox | Yes | No | Yes | Yes (singleton) | — |
| SiteCallAudit | Yes | No | Yes | Yes (singleton) | — |
| TemplateEngine | Yes | No | Yes | Yes | — |
| DeploymentManager | Yes | No | Yes | Yes | — |
| Security | Yes | No | Yes | — | — |
| CentralUI | Yes | No | Yes | — | Yes |
| InboundAPI | Yes | No | Yes | — | Yes |
| ManagementService | Yes | No | Yes | Yes | Yes |
| Transport | Yes | No | Yes | — | — |
| ConfigurationDatabase | Yes | No | Yes | — | — |
| SiteRuntime | No | Yes | Yes | Yes (singleton) | — |
| DataConnectionLayer | No | Yes | Yes | Yes | — |
| StoreAndForward | No | Yes | Yes | Yes | — |
| SiteEventLogging | No | Yes | Yes | Yes (singleton) | — |

`AuditLog` calls `AddAuditLog` on both roles; central additionally calls `AddAuditLogCentralMaintenance`. Site calls `AddAuditLogHealthMetricsBridge` to bridge write failures into the site health report.

## Configuration

Options are bound via the .NET Options pattern (`IOptions<T>`). Each component owns its options class; the Host binds each section and passes the `IConfiguration` to component extension methods only where the component's own validator needs it at startup.

### `ScadaBridge:Node` → `NodeOptions`

| Key | Default | Description |
|-----|---------|-------------|
| `Role` | — | `"Central"` or `"Site"`. Validated by `StartupValidator`. |
| `NodeHostname` | — | Hostname or IP advertised to the Akka cluster and enriched on log entries. |
| `NodeName` | — | Free-form semantic name stamped as `SourceNode` on audit rows (e.g. `"central-a"`, `"node-b"`). Empty normalises to `null`. |
| `SiteId` | — | Site identifier; required for Site nodes; used to scope cluster singletons and enrich telemetry. |
| `RemotingPort` | `8081` | Akka.NET remoting TCP port. Must be in range 1–65535. |
| `GrpcPort` | `8083` | Kestrel HTTP/2 port for the site gRPC stream server (Site nodes only). Must differ from `RemotingPort`. |
| `MetricsPort` | `8084` | Kestrel HTTP/1+2 port for the Prometheus `/metrics` scrape endpoint (Site nodes only). Must differ from both `RemotingPort` and `GrpcPort`. |

### `ScadaBridge:Cluster` → `ClusterOptions`

| Key | Default | Description |
|-----|---------|-------------|
| `SeedNodes` | — | List of Akka seed-node URIs (`akka.tcp://scadabridge@host:port`). At least 2 required. Must reference remoting ports, not gRPC ports. |
| `SplitBrainResolverStrategy` | — | Active strategy name (e.g. `"keep-oldest"`). |
| `StableAfter` | `"00:00:15"` | Duration the cluster must be stable before the resolver acts. |
| `HeartbeatInterval` | `"00:00:02"` | Akka failure-detector heartbeat cadence. |
| `FailureDetectionThreshold` | `"00:00:10"` | Acceptable heartbeat pause before a node is considered unreachable. |
| `MinNrOfMembers` | `1` | Minimum number of cluster members required before the leader promotes joining nodes to `Up`. |
| `DownIfAlone` | `true` | When using `keep-oldest`, whether a lone surviving node downs itself. |

### `ScadaBridge:Database` → `DatabaseOptions`

| Key | Role | Description |
|-----|------|-------------|
| `ConfigurationDb` | Central | MS SQL connection string for the central `ScadaBridgeDbContext`. Required; validated by `StartupValidator`. |
| `SiteDbPath` | Site | Filesystem path to the site-local SQLite database. Required for Site nodes. |

### `ScadaBridge:Logging` → `LoggingOptions`

| Key | Default | Description |
|-----|---------|-------------|
| `MinimumLevel` | `"Information"` | Serilog minimum log level. Overrides any `Serilog:MinimumLevel` entry — a one-shot warning is emitted to `stderr` if both are present. Parsed case-insensitively; unrecognised values fall back to `Information` with a warning. |

Serilog sinks (console output template, file path, rolling interval) are configured under the standard `Serilog` JSON section and applied via `ReadFrom.Configuration`. Every log entry is enriched with `SiteId`, `NodeHostname`, and `NodeRole` properties from the resolved node configuration.

### `ScadaBridge:InboundApi:ApiKeyStore`

| Key | Default | Description |
|-----|---------|-------------|
| `SqlitePath` | `data/inbound-api-keys.sqlite` under content root | Path to the SQLite store for inbound API keys. |
| `TokenPrefix` | `"sbk"` | Prefix for issued API key tokens. Fixed; injected by the Host as in-memory config. |
| `PepperSecretName` | `"ScadaBridge:InboundApi:ApiKeyPepper"` | Configuration key holding the peppered-HMAC secret. The pepper itself must be ≥ 16 characters; validated by `StartupValidator`. |
| `RunMigrationsOnStartup` | `true` | Whether the hosted service creates the SQLite schema on first run. |

All other per-component configuration sections (`ScadaBridge:Communication`, `ScadaBridge:HealthMonitoring`, `ScadaBridge:Security`, `ScadaBridge:InboundApi`, `ScadaBridge:NotificationOutbox`, `ScadaBridge:Transport`, `ScadaBridge:DataConnection`, `ScadaBridge:StoreAndForward`, `ScadaBridge:SiteEventLog`, `ScadaBridge:SiteRuntime`, `ScadaBridge:Notification`) are bound by their respective component extension methods. The Host binds them at the shared `BindSharedOptions` call or at the role-specific `Configure` sites in `Program.cs` and `SiteServiceRegistration.Configure`.

## Dependencies & Interactions

- **All 19 component libraries** — the Host project-references every component to call its extension methods. The Host is the only project with this fan-out; component libraries do not reference each other except where documented.
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — the Host configures the underlying Akka.NET cluster (`AkkaHostedService.BuildHocon`); ClusterInfrastructure manages it at runtime.
- [Configuration Database (#17)](./ConfigurationDatabase.md) — the Host registers `ScadaBridgeDbContext` and calls `AddConfigurationDatabase` (Central only); the `StartupRetry`-wrapped migration step runs before traffic is accepted.
- [Central–Site Communication (#5)](./Communication.md) — the Host creates `CentralCommunicationActor` and `SiteCommunicationActor`, registers them with `ClusterClientReceptionist`, and wires the `ClusterClient` for site→central messaging; the gRPC server is mapped at `app.MapGrpcService()`.
- [Health Monitoring (#11)](./HealthMonitoring.md) — the Host registers health checks (`DatabaseHealthCheck`, `AkkaClusterHealthCheck`, `ActiveNodeHealthCheck`) and mounts them via `app.MapZbHealth()` on central; site nodes register `AddSiteHealthMonitoring` and `AkkaHealthReportTransport`.
- [Audit Log (#23)](./AuditLog.md) — the Host calls `AddAuditLog` on both roles, `AddAuditLogCentralMaintenance` on central, and `AddAuditLogHealthMetricsBridge` on site; it creates the `AuditLogIngestActor` singleton and registers `SiteAuditTelemetryActor` on the dedicated dispatcher.
- [Notification Outbox (#21)](./NotificationOutbox.md) — the Host creates the `NotificationOutboxActor` cluster singleton and hands its proxy to `CentralCommunicationActor`.
- [Site Call Audit (#22)](./SiteCallAudit.md) — the Host creates the `SiteCallAuditActor` cluster singleton with a graceful-stop drain task registered in the `cluster-leave` coordinated-shutdown phase.
- [Management Service (#18)](./ManagementService.md) — the Host creates `ManagementActor` and registers it with `ClusterClientReceptionist`; maps the Management and Audit HTTP APIs.
- [Traefik Proxy (#20)](./TraefikProxy.md) — Traefik polls `/health/active` to determine which central node to route traffic to; the Host implements the `ActiveNodeHealthCheck` and `ActiveNodeGate` that back this endpoint.
- Design spec: [Component-Host.md](../requirements/Component-Host.md).

## Troubleshooting

### Node fails to start with validation errors

`StartupValidator` throws before any DI or actor system setup. The exception message lists all failing keys and their expected constraints. Common causes: missing `ScadaBridge:Node:Role`, a `GrpcPort`/`RemotingPort` collision on a site node, a seed-node URI that accidentally points at the gRPC port rather than the remoting port, or a missing `ConfigurationDb` connection string on a central node.

### Central node loops on database migration

`StartupRetry` retries connection-class faults up to 8 times (roughly 2 minutes worst-case). If the loop exhausts without success, the process exits with a `Fatal` log entry. Permanent errors (schema-version mismatch detected by `MigrationHelper`) are not retried and exit on the first attempt. Check `SqlException` details in the log to distinguish a connectivity failure from a schema fault.

### Dead letters appearing at startup

A burst of dead letters during startup is normal: actors send messages before their targets finish `PreStart`. `DeadLetterMonitorActor` logs each at `Warning` and increments the health counter — these are observable on the site health report. Sustained dead letters after the cluster stabilises indicate a stale actor reference or a lifecycle race.

### Standby central node receives traffic

If Traefik is not yet polling `/health/active` or its health-check interval has not elapsed after a failover, traffic may briefly reach the standby. `ActiveNodeGate` returns `false` on the standby, causing the Inbound API endpoint filter to respond `503 Service Unavailable`. The response header `X-ScadaBridge-Active: false` is present so the condition is identifiable in access logs. No operator action is needed; Traefik will reroute on its next health-check cycle.

## Related Documentation

- [Host design specification](../requirements/Component-Host.md)
- [Cluster Infrastructure](./ClusterInfrastructure.md)
- [Central–Site Communication](./Communication.md)
- [Configuration Database](./ConfigurationDatabase.md)
- [Health Monitoring](./HealthMonitoring.md)
- [Audit Log](./AuditLog.md)
- [Notification Outbox](./NotificationOutbox.md)
- [Site Call Audit](./SiteCallAudit.md)
- [Management Service](./ManagementService.md)
- [Traefik Proxy](./TraefikProxy.md)