# Host

The Host is the single deployable binary for ScadaBridge. The same executable runs on every node — central and site alike — and selects its component set entirely from configuration, with no separate build targets or conditional compilation.

## Overview

Host (#15) is the composition root: it reads `ScadaBridge:Node:Role` from `appsettings.json` (layered with a role-specific override file selected by the `SCADABRIDGE_CONFIG` environment variable), runs pre-DI startup validation, wires every applicable component into the DI container and Akka.NET actor system, and then hands off to ASP.NET Core's `WebApplication` host.

The component code lives in `src/ZB.MOM.WW.ScadaBridge.Host/`, split across:

- `Program.cs` — the entry point: configuration loading, `StartupValidator`, role-branched DI registration, Kestrel setup, middleware pipeline, and endpoint mapping.
- `Actors/AkkaHostedService.cs` — owns the `ActorSystem` lifetime; builds HOCON from bound options; creates role-specific actors as cluster singletons or through plain `ActorOf` calls.
- `Actors/DeadLetterMonitorActor.cs` — subscribes to the `DeadLetter` event stream and increments the health metric.
- `Health/ActiveNodeGate.cs` — production `IActiveNodeGate` backed by Akka cluster leadership; used by the Inbound API endpoint filter to gate traffic on standby nodes.
- `Health/AkkaClusterNodeProvider.cs` — feeds `IClusterNodeProvider` from live Akka cluster membership for health reporting.
- `SiteServiceRegistration.cs` — extracted site-role DI registrations reused by both `Program.cs` and integration test harnesses.
- `StartupValidator.cs` — pre-DI configuration preflight that fails fast before any actor system is created.
- `StartupRetry.cs` — bounded exponential-backoff helper for startup preconditions (database migrations).
- `LoggerConfigurationFactory.cs` — builds the Serilog `LoggerConfiguration` with node-identity enrichment.

## Key Concepts

### Role selection via `SCADABRIDGE_CONFIG`

The configuration builder layers `appsettings.json`, then `appsettings.{SCADABRIDGE_CONFIG}.json`, then environment variables and command-line arguments. The `SCADABRIDGE_CONFIG` environment variable selects the role-specific file (`Central` or `Site`); when it is absent, the Host falls back to `DOTNET_ENVIRONMENT`, and then to `Production`. `DOTNET_ENVIRONMENT`/`ASPNETCORE_ENVIRONMENT` can therefore remain `Development` for dev tooling (static assets, EF migrations) independently of which role is active.

```csharp
// Role-specific file name: SCADABRIDGE_CONFIG, else DOTNET_ENVIRONMENT, else "Production".
var scadabridgeConfig = Environment.GetEnvironmentVariable("SCADABRIDGE_CONFIG")
    ?? Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT")
    ?? "Production";

// Later sources override earlier ones.
var configuration = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: false)
    .AddJsonFile($"appsettings.{scadabridgeConfig}.json", optional: true)
    .AddEnvironmentVariables()
    .AddCommandLine(args)
    .Build();
```

The resolved `ScadaBridge:Node:Role` value then branches the entire DI and Akka bootstrap.

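
The fallback chain can be pulled out into a small pure function, which makes it easy to unit test without touching the process environment. This is a minimal sketch, not the Host's actual code: `ConfigNameResolver`, `Resolve`, `OverrideFileName`, and the `getEnv` parameter are hypothetical names introduced for illustration.

```csharp
#nullable enable
using System;

// Hypothetical helper (not part of the Host): mirrors the fallback chain
// SCADABRIDGE_CONFIG -> DOTNET_ENVIRONMENT -> "Production".
public static class ConfigNameResolver
{
    // getEnv is injected so tests can supply a fake environment.
    public static string Resolve(Func<string, string?> getEnv) =>
        getEnv("SCADABRIDGE_CONFIG")
        ?? getEnv("DOTNET_ENVIRONMENT")
        ?? "Production";

    // Name of the optional role-specific file layered over appsettings.json.
    public static string OverrideFileName(Func<string, string?> getEnv) =>
        $"appsettings.{Resolve(getEnv)}.json";
}
```

In production the lookup would be `ConfigNameResolver.OverrideFileName(Environment.GetEnvironmentVariable)`. Like the snippet above, the sketch uses `??`, so a variable that is set to an empty string still counts as set.
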
### Pre-DI startup validation

`StartupValidator.Validate` runs before any DI or actor system setup. It assembles all errors, then throws a single `InvalidOperationException` listing every problem. This avoids the confusing partial-startup failures that occur when validation is deferred until a service is first resolved. Site nodes additionally validate that `GrpcPort`, `MetricsPort`, and `RemotingPort` are all distinct and that no seed-node entry points at the gRPC port.

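
The collect-then-throw pattern for the site port checks might look roughly like the following. This is a simplified sketch under stated assumptions: `SitePortPreflight` and its parameters are hypothetical, the error wording is invented, and the seed-node port is read naively as the text after the last `:` (which would not handle every URI form).

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the site-node port checks.
public static class SitePortPreflight
{
    public static void Validate(int remotingPort, int grpcPort, int metricsPort,
                                IReadOnlyList<string> seedNodes)
    {
        var errors = new List<string>();

        if (grpcPort == remotingPort)
            errors.Add("ScadaBridge:Node:GrpcPort must differ from RemotingPort.");
        if (metricsPort == remotingPort || metricsPort == grpcPort)
            errors.Add("ScadaBridge:Node:MetricsPort must differ from RemotingPort and GrpcPort.");

        foreach (var seed in seedNodes)
        {
            // Seed nodes look like akka.tcp://scadabridge@host:port.
            var portText = seed[(seed.LastIndexOf(':') + 1)..];
            if (int.TryParse(portText, out var port) && port == grpcPort)
                errors.Add($"Seed node '{seed}' points at the gRPC port {grpcPort}.");
        }

        // Report every problem at once instead of failing on the first one.
        if (errors.Count > 0)
            throw new InvalidOperationException(
                "Startup validation failed:" + Environment.NewLine +
                string.Join(Environment.NewLine, errors));
    }
}
```

Collecting into a list before throwing means an operator who has mistyped three keys sees all three in one failed start rather than in three consecutive restarts.
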
### Akka HOCON construction

`AkkaHostedService.BuildHocon` assembles the HOCON configuration document from strongly-typed options rather than inline strings. Every interpolated value passes through `QuoteHocon` (escapes backslashes and double-quotes) to prevent a hostname, seed-node URI, or split-brain strategy value from corrupting the document. Durations are rendered in milliseconds (`DurationHocon`) so sub-second timing values (e.g. a 750 ms heartbeat) are preserved exactly.

The actor system name is always `scadabridge`. Site nodes carry two cluster roles: the generic `"Site"` role and a per-site role (`"site-{SiteId}"`) used to scope cluster singletons to a specific site.

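
As an illustration of the two helpers described above, the sketch below shows one plausible shape for them. The method names come from this document, but the bodies, the `HoconText` container class, and the exact output format are assumptions rather than the Host's implementation.

```csharp
using System;
using System.Globalization;

// Hypothetical re-creation of the helpers described above.
public static class HoconText
{
    // Wraps a value in double quotes, escaping backslashes first and then quotes,
    // so the escape characters added for quotes are not escaped a second time.
    public static string QuoteHocon(string value) =>
        "\"" + value.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";

    // Renders a duration as whole milliseconds with the HOCON "ms" unit.
    public static string DurationHocon(TimeSpan duration) =>
        ((long)duration.TotalMilliseconds).ToString(CultureInfo.InvariantCulture) + "ms";
}
```

A builder would then interpolate only helper output, for example `$"hostname = {HoconText.QuoteHocon(host)}"`, so a stray quote in a configured value cannot terminate the string early and inject extra HOCON.
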
### `/health/ready` — readiness gating

Central nodes register `DatabaseHealthCheck<ScadaBridgeDbContext>` (tagged `Ready`) and `AkkaClusterHealthCheck` (tagged `Ready`). The `/health/ready` endpoint returns 200 only when both pass. Readiness is explicitly not tied to cluster leadership: a fully operational standby central node still reports ready because `ActiveNodeHealthCheck` carries only the `Active` tag, not `Ready`.

Load balancers and orchestrators should poll `/health/ready` to determine when a freshly started or failed-over node can receive traffic.

### `/health/active` — active-node routing for Traefik

`ActiveNodeHealthCheck` carries the `Active` tag and is served at `/health/active`. It returns 200 only on the cluster leader. Traefik polls this endpoint and routes inbound traffic — Central UI, Inbound API, Management API — exclusively to the node that answers 200. See [TraefikProxy](./TraefikProxy.md) for the upstream routing rules.

The same leadership check backs `ActiveNodeGate`, the `IActiveNodeGate` implementation the Inbound API endpoint filter consults before executing a method script. A standby node therefore refuses inbound API calls even if traffic somehow reaches it directly.

```csharp
public bool IsActiveNode
{
    get
    {
        // No actor system yet (still starting, or already stopped): not active.
        var system = _akkaService.ActorSystem;
        if (system == null)
            return false;

        // A node that has not reached Up cannot be serving as leader.
        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up)
            return false;

        // Active only when this node's address is the current cluster leader.
        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}
```

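
The way the `Ready` and `Active` tags separate readiness from leadership can be modelled in a few lines. The sketch below is a dependency-free model, not the real health-check types: `CheckResult`, `HealthEndpointModel`, and `HealthyStandby` are hypothetical, and the 503 status for a failing endpoint is an assumption based on ASP.NET Core's default mapping for unhealthy results.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of tag-filtered health endpoints.
public sealed record CheckResult(string Name, bool Healthy, string[] Tags);

public static class HealthEndpointModel
{
    // An endpoint answers 200 only if every check carrying its tag is healthy.
    public static int StatusFor(string tag, IEnumerable<CheckResult> results) =>
        results.Where(r => r.Tags.Contains(tag)).All(r => r.Healthy) ? 200 : 503;

    // A fully operational standby: database and cluster pass, leadership does not.
    public static IReadOnlyList<CheckResult> HealthyStandby() => new[]
    {
        new CheckResult("DatabaseHealthCheck", true, new[] { "Ready" }),
        new CheckResult("AkkaClusterHealthCheck", true, new[] { "Ready" }),
        new CheckResult("ActiveNodeHealthCheck", false, new[] { "Active" }),
    };
}
```

For the standby node modelled here, `StatusFor("Ready", ...)` is 200 while `StatusFor("Active", ...)` is 503, which is the combination that lets an orchestrator treat the node as started while Traefik keeps traffic away from it.
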
## Architecture

### Central composition root

`Program.cs` (Central branch) calls `WebApplication.CreateBuilder`, registers shared and central-only components, builds the `WebApplication`, applies or retries database migrations, and mounts the middleware pipeline and endpoints. The order is intentional: `UseAuthentication` and `UseAuthorization` run before `UseAuditWriteMiddleware` so `HttpContext.User` is populated when the audit row is written.

`AkkaHostedService.RegisterCentralActors` creates:

- `CentralCommunicationActor` — registered with `ClusterClientReceptionist` so site `ClusterClient`s can reach it.
- `ManagementActor` — also registered with `ClusterClientReceptionist`; the CLI connects via `ClusterClient` without joining the cluster.
- `NotificationOutboxActor` — cluster singleton (no role scope); a proxy is handed to `CentralCommunicationActor` so forwarded `NotificationSubmit` messages from sites are routed to it.
- `AuditLogIngestActor` — cluster singleton; proxy registered with both `CentralCommunicationActor` and (if present) the `SiteStreamGrpcServer`.
- `SiteCallAuditActor` — cluster singleton; a graceful-stop task is added to the `cluster-leave` coordinated-shutdown phase with a 10-second drain window.
- `DeadLetterMonitorActor` — plain `ActorOf`; subscribes to the `DeadLetter` event stream on `PreStart`.

### Site composition root

`Program.cs` (Site branch) calls `WebApplication.CreateBuilder` with a Kestrel configuration that binds two listeners: HTTP/2 only on `GrpcPort` (default 8083) for the gRPC server, and HTTP/1+2 on `MetricsPort` (default 8084) for the Prometheus `/metrics` scrape endpoint. The listeners are separate because a Prometheus scraper speaking HTTP/1.1 cannot talk to an HTTP/2-only endpoint, and the gRPC listener must stay pure HTTP/2.

`SiteServiceRegistration.Configure` registers the site-only components. `AkkaHostedService.RegisterSiteActorsAsync` creates:

- `DeploymentManagerActor` — cluster singleton scoped to `"site-{SiteId}"`.
- `SiteCommunicationActor` — registered with `ClusterClientReceptionist`; creates a `ClusterClient` to configured central contact points.
- `SiteReplicationActor` — one per node (not a singleton); handles best-effort S&F replication to the standby.
- `EventLogHandlerActor` — cluster singleton scoped to `"site-{SiteId}"`.
- `ParkedMessageHandlerActor` — bridges Akka to `StoreAndForwardService`.
- `SiteAuditTelemetryActor` — created on a dedicated `audit-telemetry-dispatcher` (2-thread `ForkJoinDispatcher`) so SQLite reads and gRPC pushes never contend with hot-path actors.
- `DataConnectionManagerActor` — if `IDataConnectionFactory` is registered.

Shutdown ordering for the site role is explicit: `IHostApplicationLifetime.ApplicationStopping` fires before `IHostedService.StopAsync`, so `SiteStreamGrpcServer.CancelAllStreams` is called first (clients observe a clean cancellation and reconnect), then `AkkaHostedService` runs `CoordinatedShutdown` and tears down actors.

```csharp
siteLifetime.ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams());
```

### Database migration retry

On central nodes, `StartupRetry.ExecuteWithRetryAsync` wraps the migration step with up to 8 attempts and exponential backoff that starts at 2 seconds and is capped at 30 seconds. Only connection-class faults (`SocketException`, `SqlException`, `DbException`, `TimeoutException`) are retried; a schema-version mismatch surfaces as an `InvalidOperationException` and fails immediately. The `ApplicationStopping` token is threaded into both the migration call and the inter-attempt `Task.Delay` so a SIGTERM during the retry window tears down cleanly.

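
A bounded retry loop of this kind can be sketched as follows. This is an illustration under stated assumptions, not the actual `StartupRetry` source: the `RetrySketch` class, the `isTransient` predicate, and the parameter list are hypothetical, and a doubling factor is assumed for the exponential backoff.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a bounded exponential-backoff helper.
public static class RetrySketch
{
    // Delay before retry number `retry` (1-based): 2 s, 4 s, 8 s, 16 s, then capped at 30 s.
    public static TimeSpan DelayFor(int retry) =>
        TimeSpan.FromSeconds(Math.Min(30, 2 * Math.Pow(2, retry - 1)));

    public static async Task ExecuteWithRetryAsync(
        Func<CancellationToken, Task> action,
        Func<Exception, bool> isTransient,
        int maxAttempts = 8,
        CancellationToken ct = default)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await action(ct);
                return;
            }
            // Non-transient faults and the final attempt fall through and propagate.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // The token cancels the wait as well as the action itself.
                await Task.Delay(DelayFor(attempt), ct);
            }
        }
    }
}
```

With these assumptions, 8 attempts involve 7 waits of 2 + 4 + 8 + 16 + 30 + 30 + 30 = 120 seconds in total, which is consistent with the "roughly 2 minutes" worst case quoted under Troubleshooting.
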
## Usage

The Host is not consumed as a library; it is the executable entry point. Other components expose themselves to the Host via the extension-method convention:

- `IServiceCollection.AddXxx()` — registers DI services.
- `AkkaHostedService.RegisterXxxActors()` / inline `ActorOf` calls in `AkkaHostedService` — registers actors.
- `WebApplication.MapXxx()` — maps web endpoints (Central UI, Inbound API, Management API, Audit API).

`Program.cs` calls these methods; the component libraries own the registration logic. This keeps the Host thin and each component self-contained.

### Component registration by role

| Component | Central | Site | `AddXxx` | Actors | `MapXxx` |
|---|:---:|:---:|:---:|:---:|:---:|
| ClusterInfrastructure | Yes | Yes | Yes | Yes | — |
| Communication | Yes | Yes | Yes | Yes | — |
| HealthMonitoring | Yes | Yes | Yes | Yes | — |
| ExternalSystemGateway | Yes | Yes | Yes | Yes | — |
| AuditLog | Yes | Yes | Yes | Yes | — |
| NotificationService | Yes | No | Yes | — | — |
| NotificationOutbox | Yes | No | Yes | Yes (singleton) | — |
| SiteCallAudit | Yes | No | Yes | Yes (singleton) | — |
| TemplateEngine | Yes | No | Yes | Yes | — |
| DeploymentManager | Yes | No | Yes | Yes | — |
| Security | Yes | No | Yes | — | — |
| CentralUI | Yes | No | Yes | — | Yes |
| InboundAPI | Yes | No | Yes | — | Yes |
| ManagementService | Yes | No | Yes | Yes | Yes |
| Transport | Yes | No | Yes | — | — |
| ConfigurationDatabase | Yes | No | Yes | — | — |
| SiteRuntime | No | Yes | Yes | Yes (singleton) | — |
| DataConnectionLayer | No | Yes | Yes | Yes | — |
| StoreAndForward | No | Yes | Yes | Yes | — |
| SiteEventLogging | No | Yes | Yes | Yes (singleton) | — |

`AuditLog` calls `AddAuditLog` on both roles; central additionally calls `AddAuditLogCentralMaintenance`. Site calls `AddAuditLogHealthMetricsBridge` to bridge write failures into the site health report.

## Configuration

Options are bound via the .NET Options pattern (`IOptions<T>`). Each component owns its options class; the Host binds each section and passes the `IConfiguration` to component extension methods only where the component's own validator needs it at startup.

### `ScadaBridge:Node` → `NodeOptions`

| Key | Default | Description |
|-----|---------|-------------|
| `Role` | — | `"Central"` or `"Site"`. Validated by `StartupValidator`. |
| `NodeHostname` | — | Hostname or IP advertised to the Akka cluster and enriched on log entries. |
| `NodeName` | — | Free-form semantic name stamped as `SourceNode` on audit rows (e.g. `"central-a"`, `"node-b"`). Empty normalises to `null`. |
| `SiteId` | — | Site identifier; required for Site nodes; used to scope cluster singletons and enrich telemetry. |
| `RemotingPort` | `8081` | Akka.NET remoting TCP port. Must be in range 1–65535. |
| `GrpcPort` | `8083` | Kestrel HTTP/2 port for the site gRPC stream server (Site nodes only). Must differ from `RemotingPort`. |
| `MetricsPort` | `8084` | Kestrel HTTP/1+2 port for the Prometheus `/metrics` scrape endpoint (Site nodes only). Must differ from both `RemotingPort` and `GrpcPort`. |

### `ScadaBridge:Cluster` → `ClusterOptions`

| Key | Default | Description |
|-----|---------|-------------|
| `SeedNodes` | — | List of Akka seed-node URIs (`akka.tcp://scadabridge@host:port`). At least 2 required. Must reference remoting ports, not gRPC ports. |
| `SplitBrainResolverStrategy` | — | Active strategy name (e.g. `"keep-oldest"`). |
| `StableAfter` | `"00:00:15"` | Duration the cluster must be stable before the resolver acts. |
| `HeartbeatInterval` | `"00:00:02"` | Akka failure-detector heartbeat cadence. |
| `FailureDetectionThreshold` | `"00:00:10"` | Acceptable heartbeat pause before a node is considered unreachable. |
| `MinNrOfMembers` | `1` | Minimum number of cluster members required before the leader moves joining nodes to `Up`. |
| `DownIfAlone` | `true` | When using `keep-oldest`, whether the oldest node downs itself when it is partitioned from all other nodes. |

### `ScadaBridge:Database` → `DatabaseOptions`

| Key | Role | Description |
|-----|------|-------------|
| `ConfigurationDb` | Central | MS SQL connection string for the central `ScadaBridgeDbContext`. Required; validated by `StartupValidator`. |
| `SiteDbPath` | Site | Filesystem path to the site-local SQLite database. Required for Site nodes. |

### `ScadaBridge:Logging` → `LoggingOptions`

| Key | Default | Description |
|-----|---------|-------------|
| `MinimumLevel` | `"Information"` | Serilog minimum log level. Overrides any `Serilog:MinimumLevel` entry — a one-shot warning is emitted to `stderr` if both are present. Parsed case-insensitively; unrecognised values fall back to `Information` with a warning. |

Serilog sinks (console output template, file path, rolling interval) are configured under the standard `Serilog` JSON section and applied via `ReadFrom.Configuration`. Every log entry is enriched with `SiteId`, `NodeHostname`, and `NodeRole` properties from the resolved node configuration.

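
The case-insensitive parse with fallback described for `MinimumLevel` could be implemented along these lines. This is a dependency-free sketch: `LevelSketch` is a local stand-in for Serilog's `LogEventLevel`, and `MinimumLevelParser` and the warning text are hypothetical.

```csharp
#nullable enable
using System;

// Stand-in for Serilog's LogEventLevel so the sketch needs no package reference.
public enum LevelSketch { Verbose, Debug, Information, Warning, Error, Fatal }

public static class MinimumLevelParser
{
    // Returns the parsed level; unrecognised input falls back to Information
    // and reports a warning message through the out parameter.
    public static LevelSketch Parse(string? raw, out string? warning)
    {
        warning = null;
        if (string.IsNullOrWhiteSpace(raw))
            return LevelSketch.Information;

        // IsDefined rejects out-of-range numeric strings that TryParse would accept.
        if (Enum.TryParse<LevelSketch>(raw, ignoreCase: true, out var level)
            && Enum.IsDefined(typeof(LevelSketch), level))
            return level;

        warning = $"Unrecognised ScadaBridge:Logging:MinimumLevel '{raw}'; using Information.";
        return LevelSketch.Information;
    }
}
```

Returning the warning instead of writing it directly keeps the parser free of side effects; the caller decides where the one-shot message goes.
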
### `ScadaBridge:InboundApi:ApiKeyStore`

| Key | Default | Description |
|-----|---------|-------------|
| `SqlitePath` | `data/inbound-api-keys.sqlite` under content root | Path to the SQLite store for inbound API keys. |
| `TokenPrefix` | `"sbk"` | Prefix for issued API key tokens. Fixed; injected by the Host as in-memory config. |
| `PepperSecretName` | `"ScadaBridge:InboundApi:ApiKeyPepper"` | Configuration key holding the peppered-HMAC secret. The pepper itself must be ≥ 16 characters; validated by `StartupValidator`. |
| `RunMigrationsOnStartup` | `true` | Whether the hosted service creates the SQLite schema on first run. |

All other per-component configuration sections (`ScadaBridge:Communication`, `ScadaBridge:HealthMonitoring`, `ScadaBridge:Security`, `ScadaBridge:InboundApi`, `ScadaBridge:NotificationOutbox`, `ScadaBridge:Transport`, `ScadaBridge:DataConnection`, `ScadaBridge:StoreAndForward`, `ScadaBridge:SiteEventLog`, `ScadaBridge:SiteRuntime`, `ScadaBridge:Notification`) are bound by their respective component extension methods. The Host binds them at the shared `BindSharedOptions` call or at the role-specific `Configure<T>` sites in `Program.cs` and `SiteServiceRegistration.Configure`.

## Dependencies & Interactions

- **All 19 component libraries** — the Host project-references every component to call its extension methods. The Host is the only project with this fan-out; component libraries do not reference each other except where documented.
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — the Host configures the underlying Akka.NET cluster (`AkkaHostedService.BuildHocon`); ClusterInfrastructure manages it at runtime.
- [Configuration Database (#17)](./ConfigurationDatabase.md) — the Host registers `ScadaBridgeDbContext` and calls `AddConfigurationDatabase` (Central only); the `StartupRetry`-wrapped migration step runs before traffic is accepted.
- [Central–Site Communication (#5)](./Communication.md) — the Host creates `CentralCommunicationActor` and `SiteCommunicationActor`, registers them with `ClusterClientReceptionist`, and wires the `ClusterClient` for site→central messaging; the gRPC server is mapped at `app.MapGrpcService<SiteStreamGrpcServer>()`.
- [Health Monitoring (#11)](./HealthMonitoring.md) — the Host registers health checks (`DatabaseHealthCheck`, `AkkaClusterHealthCheck`, `ActiveNodeHealthCheck`) and mounts them via `app.MapZbHealth()` on central; site nodes register `AddSiteHealthMonitoring` and `AkkaHealthReportTransport`.
- [Audit Log (#23)](./AuditLog.md) — the Host calls `AddAuditLog` on both roles, `AddAuditLogCentralMaintenance` on central, and `AddAuditLogHealthMetricsBridge` on site; it creates the `AuditLogIngestActor` singleton and registers `SiteAuditTelemetryActor` on the dedicated dispatcher.
- [Notification Outbox (#21)](./NotificationOutbox.md) — the Host creates the `NotificationOutboxActor` cluster singleton and hands its proxy to `CentralCommunicationActor`.
- [Site Call Audit (#22)](./SiteCallAudit.md) — the Host creates the `SiteCallAuditActor` cluster singleton with a graceful-stop drain task registered in the `cluster-leave` coordinated-shutdown phase.
- [Management Service (#18)](./ManagementService.md) — the Host creates `ManagementActor` and registers it with `ClusterClientReceptionist`; maps the Management and Audit HTTP APIs.
- [Traefik Proxy (#20)](./TraefikProxy.md) — Traefik polls `/health/active` to determine which central node to route traffic to; the Host implements the `ActiveNodeHealthCheck` and `ActiveNodeGate` that back this endpoint.
- Design spec: [Component-Host.md](../requirements/Component-Host.md).

## Troubleshooting

### Node fails to start with validation errors

`StartupValidator` throws before any DI or actor system setup. The exception message lists all failing keys and their expected constraints. Common causes: missing `ScadaBridge:Node:Role`, a `GrpcPort`/`RemotingPort` collision on a site node, a seed-node URI that accidentally points at the gRPC port rather than the remoting port, or a missing `ConfigurationDb` connection string on a central node.

### Central node loops on database migration

`StartupRetry` retries connection-class faults up to 8 times (roughly 2 minutes worst-case). If the loop exhausts without success, the process exits with a `Fatal` log entry. Permanent errors (schema-version mismatch detected by `MigrationHelper`) are not retried and exit on the first attempt. Check `SqlException` details in the log to distinguish a connectivity failure from a schema fault.

### Dead letters appearing at startup

A burst of dead letters during startup is normal: actors send messages before their targets finish `PreStart`. `DeadLetterMonitorActor` logs each at `Warning` and increments the health counter — these are observable on the site health report. Sustained dead letters after the cluster stabilises indicate a stale actor reference or a lifecycle race.

### Standby central node receives traffic

If Traefik is not yet polling `/health/active` or its health-check interval has not elapsed after a failover, traffic may briefly reach the standby. `ActiveNodeGate` returns `false` on the standby, causing the Inbound API endpoint filter to respond `503 Service Unavailable`. The response header `X-ScadaBridge-Active: false` is present so the condition is identifiable in access logs. No operator action is needed; Traefik will reroute on its next health-check cycle.

## Related Documentation

- [Host design specification](../requirements/Component-Host.md)
- [Cluster Infrastructure](./ClusterInfrastructure.md)
- [Central–Site Communication](./Communication.md)
- [Configuration Database](./ConfigurationDatabase.md)
- [Health Monitoring](./HealthMonitoring.md)
- [Audit Log](./AuditLog.md)
- [Notification Outbox](./NotificationOutbox.md)
- [Site Call Audit](./SiteCallAudit.md)
- [Management Service](./ManagementService.md)
- [Traefik Proxy](./TraefikProxy.md)