# Host
The Host is the single deployable binary for ScadaBridge. The same executable runs on every node — central and site alike — and selects its component set entirely from configuration, with no separate build targets or conditional compilation.
## Overview
Host (#15) is the composition root: it reads `ScadaBridge:Node:Role` from `appsettings.json` (layered with a role-specific override file selected by the `SCADABRIDGE_CONFIG` environment variable), runs pre-DI startup validation, wires every applicable component into the DI container and Akka.NET actor system, and then hands off to ASP.NET Core's `WebApplication` host.
The component code lives in `src/ZB.MOM.WW.ScadaBridge.Host/`, split across:
- `Program.cs` — the entry point: configuration loading, `StartupValidator`, role-branched DI registration, Kestrel setup, middleware pipeline, and endpoint mapping.
- `Actors/AkkaHostedService.cs` — owns the `ActorSystem` lifetime; builds HOCON from bound options; registers role-specific actors as cluster singletons or plain `ActorOf` calls.
- `Actors/DeadLetterMonitorActor.cs` — subscribes to the `DeadLetter` event stream and increments the health metric.
- `Health/ActiveNodeGate.cs` — production `IActiveNodeGate` backed by Akka cluster leadership; used by the Inbound API endpoint filter to gate traffic on standby nodes.
- `Health/AkkaClusterNodeProvider.cs` — feeds `IClusterNodeProvider` from live Akka cluster membership for health reporting.
- `SiteServiceRegistration.cs` — extracted site-role DI registrations reused by both `Program.cs` and integration test harnesses.
- `StartupValidator.cs` — pre-DI configuration preflight that fails fast before any actor system is created.
- `StartupRetry.cs` — bounded exponential-backoff helper for startup preconditions (database migrations).
- `LoggerConfigurationFactory.cs` — builds the Serilog `LoggerConfiguration` with node-identity enrichment.
## Key Concepts
### Role selection via `SCADABRIDGE_CONFIG`
The configuration builder layers `appsettings.json`, then `appsettings.{SCADABRIDGE_CONFIG}.json`, then environment variables and command-line arguments. The `SCADABRIDGE_CONFIG` environment variable selects the role-specific file (`Central` or `Site`); when it is absent, the Host falls back to `DOTNET_ENVIRONMENT` and finally to `Production`. Because role selection has its own variable, `DOTNET_ENVIRONMENT`/`ASPNETCORE_ENVIRONMENT` can remain `Development` for dev tooling (static assets, EF migrations) independently of which role is active.
```csharp
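// Resolve the role-specific file name: SCADABRIDGE_CONFIG, then DOTNET_ENVIRONMENT, then "Production".
var scadabridgeConfig = Environment.GetEnvironmentVariable("SCADABRIDGE_CONFIG")
    ?? Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT")
    ?? "Production";

// Later sources override earlier ones: base file, role file, environment variables, command line.
var configuration = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: false)
    .AddJsonFile($"appsettings.{scadabridgeConfig}.json", optional: true)
    .AddEnvironmentVariables()
    .AddCommandLine(args)
    .Build();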
```
The resolved `ScadaBridge:Node:Role` value then branches the entire DI and Akka bootstrap.
### Pre-DI startup validation
`StartupValidator.Validate` runs before any DI or actor system setup. It collects every error and then throws a single `InvalidOperationException` listing all of them. This avoids the confusing partial-startup failures that occur when validation is deferred until a service is first resolved. Site nodes additionally validate that `GrpcPort`, `MetricsPort`, and `RemotingPort` are all distinct and that no seed-node entry points at the gRPC port.
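The collect-then-throw pattern described above can be sketched as follows. This is a minimal illustration, not the actual `StartupValidator`: the `NodeSettings` record, the `StartupValidationSketch` class, the error wording, and the port-parsing helper are all hypothetical names introduced for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down stand-in for the settings the validator inspects.
public sealed record NodeSettings(
    string Role, int RemotingPort, int GrpcPort, int MetricsPort,
    IReadOnlyList<string> SeedNodes);

public static class StartupValidationSketch
{
    public static void Validate(NodeSettings node)
    {
        var errors = new List<string>();

        if (node.Role is not ("Central" or "Site"))
            errors.Add("ScadaBridge:Node:Role must be \"Central\" or \"Site\".");

        if (node.Role == "Site")
        {
            // All three listener ports must be pairwise distinct.
            var ports = new HashSet<int> { node.RemotingPort, node.GrpcPort, node.MetricsPort };
            if (ports.Count != 3)
                errors.Add("RemotingPort, GrpcPort, and MetricsPort must all be distinct.");

            // A seed node must reference a remoting port, never the gRPC port.
            foreach (var seed in node.SeedNodes)
            {
                if (TryGetPort(seed, out var port) && port == node.GrpcPort)
                    errors.Add($"Seed node '{seed}' points at the gRPC port {node.GrpcPort}.");
            }
        }

        // Report everything at once rather than failing on the first problem.
        if (errors.Count > 0)
            throw new InvalidOperationException(
                "Startup validation failed:" + Environment.NewLine
                + string.Join(Environment.NewLine, errors));
    }

    // Extracts the port from a URI such as akka.tcp://scadabridge@host:8081.
    private static bool TryGetPort(string uri, out int port)
    {
        port = 0;
        var colon = uri.LastIndexOf(':');
        return colon >= 0 && int.TryParse(uri.AsSpan(colon + 1), out port);
    }
}
```

Because the errors are accumulated before the throw, an operator who has misconfigured both a port and a seed node sees both problems in one exception message instead of fixing them one restart at a time.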
### Akka HOCON construction
`AkkaHostedService.BuildHocon` assembles the HOCON configuration document from strongly-typed options rather than inline strings. Every interpolated value passes through `QuoteHocon` (escapes backslashes and double-quotes) to prevent a hostname, seed-node URI, or split-brain strategy value from corrupting the document. Durations are rendered in milliseconds (`DurationHocon`) so sub-second timing values (e.g. a 750 ms heartbeat) are preserved exactly.
The actor system name is always `scadabridge`. Site nodes carry two cluster roles: the generic `"Site"` role and a per-site role (`"site-{SiteId}"`) used to scope cluster singletons to a specific site.
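The quoting, duration rendering, and site roles described above could look roughly like this. The helper names `QuoteHocon` and `DurationHocon` come from the text; their bodies, the `ms` suffix, the `SiteRoles` helper, and the `HoconSketch` class are assumptions made for illustration rather than the actual `AkkaHostedService` code.

```csharp
using System;
using System.Globalization;

public static class HoconSketch
{
    // Wraps a value in double quotes, escaping backslashes first and then quotes.
    public static string QuoteHocon(string value) =>
        "\"" + value.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";

    // Renders a duration as whole milliseconds (assumed format: 750ms).
    public static string DurationHocon(TimeSpan value) =>
        ((long)value.TotalMilliseconds).ToString(CultureInfo.InvariantCulture) + "ms";

    // Both cluster roles carried by a site node: the generic role and the per-site role.
    public static string SiteRoles(string siteId) =>
        "roles = [" + QuoteHocon("Site") + ", " + QuoteHocon("site-" + siteId) + "]";
}
```

Escaping backslashes before quotes matters: doing it in the opposite order would double the backslash that was just added in front of each quote.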
### `/health/ready` — readiness gating
Central nodes register `DatabaseHealthCheck<ScadaBridgeDbContext>` (tagged `Ready`) and `AkkaClusterHealthCheck` (tagged `Ready`). The `/health/ready` endpoint returns 200 only when both pass. Readiness is explicitly not tied to cluster leadership: a fully operational standby central node still reports ready because `ActiveNodeHealthCheck` carries only the `Active` tag, not `Ready`.
Load balancers and orchestrators should poll `/health/ready` to determine when a freshly started or failed-over node can receive traffic.
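To see why a standby still reports ready, the tag filtering can be modelled without any ASP.NET Core types. The sketch below is only a model: `TaggedCheck`, `HealthEndpointSketch`, the check names, and the 503 status for a failing endpoint are assumptions, and the real endpoints are mounted by the Host's own health mapping.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a registered health check: name, tags, and current result.
public sealed record TaggedCheck(string Name, string[] Tags, bool Healthy);

public static class HealthEndpointSketch
{
    // An endpoint returns 200 only when every check carrying its tag is healthy.
    // The 503 for the failing case is an assumption of this sketch.
    public static int StatusFor(string tag, IEnumerable<TaggedCheck> checks) =>
        checks.Where(c => c.Tags.Contains(tag)).All(c => c.Healthy) ? 200 : 503;

    // Checks as they might look on a fully operational standby central node.
    public static IReadOnlyList<TaggedCheck> StandbyNode() => new[]
    {
        new TaggedCheck("database", new[] { "Ready" }, Healthy: true),
        new TaggedCheck("akka-cluster", new[] { "Ready" }, Healthy: true),
        new TaggedCheck("active-node", new[] { "Active" }, Healthy: false), // not the leader
    };
}
```

In this model `StatusFor("Ready", ...)` ignores the failing `active-node` check because it does not carry the `Ready` tag, which is the separation the section describes.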
### `/health/active` — active-node routing for Traefik
`ActiveNodeHealthCheck` carries the `Active` tag and is served at `/health/active`. It returns 200 only on the cluster leader. Traefik polls this endpoint and routes inbound traffic — Central UI, Inbound API, Management API — exclusively to the node that answers 200. See [TraefikProxy](./TraefikProxy.md) for the upstream routing rules.
The same leadership check backs `ActiveNodeGate`, the `IActiveNodeGate` implementation the Inbound API endpoint filter consults before executing a method script. A standby node therefore refuses inbound API calls even if traffic somehow reaches it directly.
```csharp
public bool IsActiveNode
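{
    get
    {
        // No actor system yet (still starting up): never active.
        var system = _akkaService.ActorSystem;
        if (system == null)
            return false;

        // Only a member that has reached Up can be the active node.
        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up)
            return false;

        // Active means this node is the current cluster leader.
        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }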
}
```
## Architecture
### Central composition root
`Program.cs` (Central branch) calls `WebApplication.CreateBuilder`, registers shared and central-only components, builds the `WebApplication`, applies database migrations (retrying transient failures), and mounts the middleware pipeline and endpoints. The order is intentional: `UseAuthentication` and `UseAuthorization` run before `UseAuditWriteMiddleware` so `HttpContext.User` is populated when the audit row is written.
`AkkaHostedService.RegisterCentralActors` creates:
- `CentralCommunicationActor` — registered with `ClusterClientReceptionist` so site `ClusterClient`s can reach it.
- `ManagementActor` — also registered with `ClusterClientReceptionist`; the CLI connects via `ClusterClient` without joining the cluster.
- `NotificationOutboxActor` — cluster singleton (no role scope); a proxy is handed to `CentralCommunicationActor` so forwarded `NotificationSubmit` messages from sites are routed to it.
- `AuditLogIngestActor` — cluster singleton; proxy registered with both `CentralCommunicationActor` and (if present) the `SiteStreamGrpcServer`.
- `SiteCallAuditActor` — cluster singleton; a graceful-stop task is added to the `cluster-leave` coordinated-shutdown phase with a 10-second drain window.
- `DeadLetterMonitorActor` — plain `ActorOf`; subscribes to the `DeadLetter` event stream on `PreStart`.
### Site composition root
`Program.cs` (Site branch) calls `WebApplication.CreateBuilder` with a Kestrel configuration that binds two listeners: HTTP/2 only on `GrpcPort` (default 8083) for the gRPC server, and HTTP/1+2 on `MetricsPort` (default 8084) for the Prometheus `/metrics` scrape endpoint. The separation exists because a standard HTTP/1.1 Prometheus scraper cannot negotiate HTTP/2; the gRPC listener must stay pure HTTP/2.
`SiteServiceRegistration.Configure` registers the site-only components. `AkkaHostedService.RegisterSiteActorsAsync` creates:
- `DeploymentManagerActor` — cluster singleton scoped to `"site-{SiteId}"`.
- `SiteCommunicationActor` — registered with `ClusterClientReceptionist`; creates a `ClusterClient` to configured central contact points.
- `SiteReplicationActor` — one per node (not a singleton); handles best-effort S&F replication to the standby.
- `EventLogHandlerActor` — cluster singleton scoped to `"site-{SiteId}"`.
- `ParkedMessageHandlerActor` — bridges Akka to `StoreAndForwardService`.
- `SiteAuditTelemetryActor` — created on a dedicated `audit-telemetry-dispatcher` (2-thread `ForkJoinDispatcher`) so SQLite reads and gRPC pushes never contend with hot-path actors.
- `DataConnectionManagerActor` — if `IDataConnectionFactory` is registered.
Shutdown ordering for the site role is explicit: `IHostApplicationLifetime.ApplicationStopping` fires before `IHostedService.StopAsync`, so `SiteStreamGrpcServer.CancelAllStreams` is called first (clients observe a clean cancellation and reconnect), then `AkkaHostedService` runs `CoordinatedShutdown` and tears down actors.
```csharp
siteLifetime.ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams());
```
### Database migration retry
On central nodes, `StartupRetry.ExecuteWithRetryAsync` wraps the migration step in up to 8 attempts, using exponential backoff that starts at 2 seconds and is capped at 30 seconds. Only connection-class faults (`SocketException`, `SqlException`, `DbException`, `TimeoutException`) are retried; a schema-version mismatch surfaces as an `InvalidOperationException` and fails immediately. The `ApplicationStopping` token is threaded into both the migration call and the inter-attempt `Task.Delay` so a SIGTERM during the retry window tears down cleanly.
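A bounded retry helper with this behaviour might be sketched as below. It is not the actual `StartupRetry` source: the `StartupRetrySketch` class, the method signatures, and in particular the doubling factor are assumptions. `SqlException` is not listed explicitly because it derives from `DbException`.

```csharp
using System;
using System.Data.Common;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class StartupRetrySketch
{
    // Delay before the next attempt: initial * 2^(attempt - 1), capped.
    // Doubling is an assumption; the text only says "exponential".
    public static TimeSpan DelayAfter(int attempt, TimeSpan initial, TimeSpan cap)
    {
        var ms = initial.TotalMilliseconds * Math.Pow(2, attempt - 1);
        return TimeSpan.FromMilliseconds(Math.Min(ms, cap.TotalMilliseconds));
    }

    // Connection-class faults are retried; anything else propagates immediately.
    public static bool IsTransient(Exception ex) =>
        ex is SocketException or DbException or TimeoutException;

    public static async Task ExecuteWithRetryAsync(
        Func<CancellationToken, Task> action,
        int maxAttempts, TimeSpan initial, TimeSpan cap, CancellationToken ct)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await action(ct);
                return;
            }
            catch (Exception ex) when (IsTransient(ex) && attempt < maxAttempts)
            {
                // The token also cancels the wait, so shutdown interrupts the backoff.
                await Task.Delay(DelayAfter(attempt, initial, cap), ct);
            }
        }
    }
}
```

Under the doubling assumption, 8 attempts with a 2-second start and a 30-second cap wait 2 + 4 + 8 + 16 + 30 + 30 + 30 = 120 seconds in total between attempts, excluding the time each attempt itself takes.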
## Usage
The Host is not consumed as a library; it is the executable entry point. Other components expose themselves to the Host via the extension-method convention:
- `IServiceCollection.AddXxx()` — registers DI services.
- `AkkaHostedService.RegisterXxxActors()` / inline `ActorOf` calls in `AkkaHostedService` — registers actors.
- `WebApplication.MapXxx()` — maps web endpoints (Central UI, Inbound API, Management API, Audit API).
`Program.cs` calls these methods; the component libraries own the registration logic. This keeps the Host thin and each component self-contained.
### Component registration by role
| Component | Central | Site | `AddXxx` | Actors | `MapXxx` |
|---|:---:|:---:|:---:|:---:|:---:|
| ClusterInfrastructure | Yes | Yes | Yes | Yes | — |
| Communication | Yes | Yes | Yes | Yes | — |
| HealthMonitoring | Yes | Yes | Yes | Yes | — |
| ExternalSystemGateway | Yes | Yes | Yes | Yes | — |
| AuditLog | Yes | Yes | Yes | Yes | — |
| NotificationService | Yes | No | Yes | — | — |
| NotificationOutbox | Yes | No | Yes | Yes (singleton) | — |
| SiteCallAudit | Yes | No | Yes | Yes (singleton) | — |
| TemplateEngine | Yes | No | Yes | Yes | — |
| DeploymentManager | Yes | No | Yes | Yes | — |
| Security | Yes | No | Yes | — | — |
| CentralUI | Yes | No | Yes | — | Yes |
| InboundAPI | Yes | No | Yes | — | Yes |
| ManagementService | Yes | No | Yes | Yes | Yes |
| Transport | Yes | No | Yes | — | — |
| ConfigurationDatabase | Yes | No | Yes | — | — |
| SiteRuntime | No | Yes | Yes | Yes (singleton) | — |
| DataConnectionLayer | No | Yes | Yes | Yes | — |
| StoreAndForward | No | Yes | Yes | Yes | — |
| SiteEventLogging | No | Yes | Yes | Yes (singleton) | — |
`AuditLog` calls `AddAuditLog` on both roles; central additionally calls `AddAuditLogCentralMaintenance`. Site calls `AddAuditLogHealthMetricsBridge` to bridge write failures into the site health report.
## Configuration
Options are bound via the .NET Options pattern (`IOptions<T>`). Each component owns its options class; the Host binds each section and passes the `IConfiguration` to component extension methods only where the component's own validator needs it at startup.
### `ScadaBridge:Node` → `NodeOptions`
| Key | Default | Description |
|-----|---------|-------------|
| `Role` | — | `"Central"` or `"Site"`. Validated by `StartupValidator`. |
| `NodeHostname` | — | Hostname or IP advertised to the Akka cluster and enriched on log entries. |
| `NodeName` | — | Free-form semantic name stamped as `SourceNode` on audit rows (e.g. `"central-a"`, `"node-b"`). Empty normalises to `null`. |
| `SiteId` | — | Site identifier; required for Site nodes; used to scope cluster singletons and enrich telemetry. |
| `RemotingPort` | `8081` | Akka.NET remoting TCP port. Must be in range 1-65535. |
| `GrpcPort` | `8083` | Kestrel HTTP/2 port for the site gRPC stream server (Site nodes only). Must differ from `RemotingPort`. |
| `MetricsPort` | `8084` | Kestrel HTTP/1+2 port for the Prometheus `/metrics` scrape endpoint (Site nodes only). Must differ from both `RemotingPort` and `GrpcPort`. |
### `ScadaBridge:Cluster` → `ClusterOptions`
| Key | Default | Description |
|-----|---------|-------------|
| `SeedNodes` | — | List of Akka seed-node URIs (`akka.tcp://scadabridge@host:port`). At least 2 required. Must reference remoting ports, not gRPC ports. |
| `SplitBrainResolverStrategy` | — | Active strategy name (e.g. `"keep-oldest"`). |
| `StableAfter` | `"00:00:15"` | Duration the cluster must be stable before the resolver acts. |
| `HeartbeatInterval` | `"00:00:02"` | Akka failure-detector heartbeat cadence. |
| `FailureDetectionThreshold` | `"00:00:10"` | Acceptable heartbeat pause before a node is considered unreachable. |
| `MinNrOfMembers` | `1` | Minimum number of members that must have joined before the leader moves any member to `Up`. |
| `DownIfAlone` | `true` | When using `keep-oldest`, whether a lone surviving node downs itself. |
### `ScadaBridge:Database` → `DatabaseOptions`
| Key | Role | Description |
|-----|------|-------------|
| `ConfigurationDb` | Central | MS SQL connection string for the central `ScadaBridgeDbContext`. Required; validated by `StartupValidator`. |
| `SiteDbPath` | Site | Filesystem path to the site-local SQLite database. Required for Site nodes. |
### `ScadaBridge:Logging` → `LoggingOptions`
| Key | Default | Description |
|-----|---------|-------------|
| `MinimumLevel` | `"Information"` | Serilog minimum log level. Overrides any `Serilog:MinimumLevel` entry — a one-shot warning is emitted to `stderr` if both are present. Parsed case-insensitively; unrecognised values fall back to `Information` with a warning. |
Serilog sinks (console output template, file path, rolling interval) are configured under the standard `Serilog` JSON section and applied via `ReadFrom.Configuration`. Every log entry is enriched with `SiteId`, `NodeHostname`, and `NodeRole` properties from the resolved node configuration.
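The case-insensitive parsing with fallback could be sketched like this. The enum below only mirrors the level names of Serilog's `LogEventLevel` so the example needs no package reference; `SketchLogLevel`, `MinimumLevelSketch`, and the warning text are hypothetical.

```csharp
using System;

// Mirrors the level names of Serilog's LogEventLevel (assumption: names only, no package needed).
public enum SketchLogLevel { Verbose, Debug, Information, Warning, Error, Fatal }

public static class MinimumLevelSketch
{
    public static SketchLogLevel Parse(string configured, Action<string> warn)
    {
        // Nothing configured: use the documented default without a warning.
        if (string.IsNullOrWhiteSpace(configured))
            return SketchLogLevel.Information;

        // Enum.TryParse also accepts numeric strings, so IsDefined rejects values like "42".
        if (Enum.TryParse<SketchLogLevel>(configured, ignoreCase: true, out var level)
            && Enum.IsDefined(typeof(SketchLogLevel), level))
            return level;

        warn($"Unrecognised MinimumLevel '{configured}'; falling back to Information.");
        return SketchLogLevel.Information;
    }
}
```

The `Enum.IsDefined` guard is the part that is easy to miss: without it a value such as `"42"` parses successfully into an undefined level instead of triggering the fallback.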
### `ScadaBridge:InboundApi:ApiKeyStore`
| Key | Default | Description |
|-----|---------|-------------|
| `SqlitePath` | `data/inbound-api-keys.sqlite` under content root | Path to the SQLite store for inbound API keys. |
| `TokenPrefix` | `"sbk"` | Prefix for issued API key tokens. Fixed; injected by the Host as in-memory config. |
| `PepperSecretName` | `"ScadaBridge:InboundApi:ApiKeyPepper"` | Configuration key holding the peppered-HMAC secret. The pepper itself must be ≥ 16 characters; validated by `StartupValidator`. |
| `RunMigrationsOnStartup` | `true` | Whether the hosted service creates the SQLite schema on first run. |
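For readers unfamiliar with peppered HMAC, the idea is that the key store holds only a keyed hash of each token, and the key (the pepper) lives in configuration rather than in the store. The sketch below assumes HMAC-SHA256 over UTF-8 bytes; the actual algorithm, encoding, and the `ApiKeyHashSketch` class are not confirmed by this document.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ApiKeyHashSketch
{
    // Keyed hash of the presented token. HMAC-SHA256 and UTF-8 are assumptions of this sketch.
    public static byte[] Hash(string token, string pepper)
    {
        if (pepper is null || pepper.Length < 16)
            throw new InvalidOperationException("The API key pepper must be at least 16 characters.");

        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(pepper));
        return hmac.ComputeHash(Encoding.UTF8.GetBytes(token));
    }

    // Constant-time comparison of a presented token against a stored hash.
    public static bool Verify(string token, string pepper, byte[] storedHash) =>
        CryptographicOperations.FixedTimeEquals(Hash(token, pepper), storedHash);
}
```

With this shape, a copy of the SQLite store alone is not enough to check candidate tokens offline, because the pepper is held separately.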
All other per-component configuration sections (`ScadaBridge:Communication`, `ScadaBridge:HealthMonitoring`, `ScadaBridge:Security`, `ScadaBridge:InboundApi`, `ScadaBridge:NotificationOutbox`, `ScadaBridge:Transport`, `ScadaBridge:DataConnection`, `ScadaBridge:StoreAndForward`, `ScadaBridge:SiteEventLog`, `ScadaBridge:SiteRuntime`, `ScadaBridge:Notification`) are bound by their respective component extension methods. The Host binds them at the shared `BindSharedOptions` call or at the role-specific `Configure<T>` sites in `Program.cs` and `SiteServiceRegistration.Configure`.
## Dependencies & Interactions
- **All 19 component libraries** — the Host project-references every component to call its extension methods. The Host is the only project with this fan-out; component libraries do not reference each other except where documented.
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — the Host configures the underlying Akka.NET cluster (`AkkaHostedService.BuildHocon`); ClusterInfrastructure manages it at runtime.
- [Configuration Database (#17)](./ConfigurationDatabase.md) — the Host registers `ScadaBridgeDbContext` and calls `AddConfigurationDatabase` (Central only); the `StartupRetry`-wrapped migration step runs before traffic is accepted.
- [Central-Site Communication (#5)](./Communication.md) — the Host creates `CentralCommunicationActor` and `SiteCommunicationActor`, registers them with `ClusterClientReceptionist`, and wires the `ClusterClient` for site→central messaging; the gRPC server is mapped at `app.MapGrpcService<SiteStreamGrpcServer>()`.
- [Health Monitoring (#11)](./HealthMonitoring.md) — the Host registers health checks (`DatabaseHealthCheck`, `AkkaClusterHealthCheck`, `ActiveNodeHealthCheck`) and mounts them via `app.MapZbHealth()` on central; site nodes register `AddSiteHealthMonitoring` and `AkkaHealthReportTransport`.
- [Audit Log (#23)](./AuditLog.md) — the Host calls `AddAuditLog` on both roles, `AddAuditLogCentralMaintenance` on central, and `AddAuditLogHealthMetricsBridge` on site; it creates the `AuditLogIngestActor` singleton and registers `SiteAuditTelemetryActor` on the dedicated dispatcher.
- [Notification Outbox (#21)](./NotificationOutbox.md) — the Host creates the `NotificationOutboxActor` cluster singleton and hands its proxy to `CentralCommunicationActor`.
- [Site Call Audit (#22)](./SiteCallAudit.md) — the Host creates the `SiteCallAuditActor` cluster singleton with a graceful-stop drain task registered in the `cluster-leave` coordinated-shutdown phase.
- [Management Service (#18)](./ManagementService.md) — the Host creates `ManagementActor` and registers it with `ClusterClientReceptionist`; maps the Management and Audit HTTP APIs.
- [Traefik Proxy (#20)](./TraefikProxy.md) — Traefik polls `/health/active` to determine which central node to route traffic to; the Host implements the `ActiveNodeHealthCheck` and `ActiveNodeGate` that back this endpoint.
- Design spec: [Component-Host.md](../requirements/Component-Host.md).
## Troubleshooting
### Node fails to start with validation errors
`StartupValidator` throws before any DI or actor system setup. The exception message lists all failing keys and their expected constraints. Common causes: missing `ScadaBridge:Node:Role`, a `GrpcPort`/`RemotingPort` collision on a site node, a seed-node URI that accidentally points at the gRPC port rather than the remoting port, or a missing `ConfigurationDb` connection string on a central node.
### Central node loops on database migration
`StartupRetry` retries connection-class faults up to 8 times (roughly 2 minutes worst-case). If the loop exhausts without success, the process exits with a `Fatal` log entry. Permanent errors (schema-version mismatch detected by `MigrationHelper`) are not retried and exit on the first attempt. Check `SqlException` details in the log to distinguish a connectivity failure from a schema fault.
### Dead letters appearing at startup
A burst of dead letters during startup is normal: actors send messages before their targets finish `PreStart`. `DeadLetterMonitorActor` logs each at `Warning` and increments the health counter — these are observable on the site health report. Sustained dead letters after the cluster stabilises indicate a stale actor reference or a lifecycle race.
### Standby central node receives traffic
If Traefik is not yet polling `/health/active` or its health-check interval has not elapsed after a failover, traffic may briefly reach the standby. `ActiveNodeGate` returns `false` on the standby, causing the Inbound API endpoint filter to respond `503 Service Unavailable`. The response header `X-ScadaBridge-Active: false` is present so the condition is identifiable in access logs. No operator action is needed; Traefik will reroute on its next health-check cycle.
## Related Documentation
- [Host design specification](../requirements/Component-Host.md)
- [Cluster Infrastructure](./ClusterInfrastructure.md)
- [Central-Site Communication](./Communication.md)
- [Configuration Database](./ConfigurationDatabase.md)
- [Health Monitoring](./HealthMonitoring.md)
- [Audit Log](./AuditLog.md)
- [Notification Outbox](./NotificationOutbox.md)
- [Site Call Audit](./SiteCallAudit.md)
- [Management Service](./ManagementService.md)
- [Traefik Proxy](./TraefikProxy.md)