# Code Review — Host | Field | Value | |-------|-------| | Module | `src/ScadaLink.Host` | | Design doc | `docs/requirements/Component-Host.md` | | Status | Reviewed | | Last reviewed | 2026-05-17 | | Reviewer | claude-agent | | Commit reviewed | `39d737e` | | Open findings | 0 | ## Summary The Host module is the composition root for the entire ScadaLink system: a single binary whose behaviour (`Central` vs `Site`) is driven entirely by configuration. The implementation is generally faithful to `Component-Host.md` — startup validation, role-based registration, Serilog enrichment, Windows Service support, dead-letter monitoring, CoordinatedShutdown, and gRPC hosting on site nodes are all present and backed by a solid test suite (`tests/ScadaLink.Host.Tests`). The most significant problem is the readiness endpoint: `/health/ready` runs **all** registered health checks, including the leader-only `active-node` check, so a fully operational *standby* central node permanently reports `503` on `/health/ready` — directly contradicting REQ-HOST-4a, which defines readiness as cluster membership + DB connectivity (not leadership). Several other findings concern configuration that is validated-but-never-consumed (`MachineDataDb`), design-doc drift (Akka.Persistence is required by REQ-HOST-6 but the system uses no persistent actors), an incorrect seed-node entry in the shipped site config, blocking sync-over-async during startup, and unguarded string interpolation when building HOCON. None are crash/data-loss class, but the readiness bug is High because it breaks load-balancer behaviour with no safe workaround. #### Re-review 2026-05-17 (commit `39d737e`) All eleven findings from the first review (Host-001..011) are confirmed `Resolved` in the current tree — the `/health/ready` predicate, the externalised secrets, the seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded migration retry and the live `LoggingOptions.MinimumLevel` are all present as described in their Resolution notes. This re-review walked all ten checklist categories again over the full module and recorded four new findings, none of them crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes `down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so that documented, bound option is dead. The remaining three are Low: `:F0` rounding of cluster timing values silently degrades any sub-second configuration (Host-013), Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every exception type including permanent schema-validation failures (Host-015). ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| | 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). | | 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. | | 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. | | 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). | | 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. | | 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. | | 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). | | 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. | | 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). | | 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). | ## Findings ### Host-001 — `/health/ready` includes the leader-only `active-node` check | | | |--|--| | Severity | High | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.Host/Program.cs:135-145` | **Description** `/health/ready` is mapped with `MapHealthChecks("/health/ready", ...)` and **no `Predicate`**, so it executes every registered check: `database`, `akka-cluster` *and* `active-node`. `ActiveNodeHealthCheck` (`Health/ActiveNodeHealthCheck.cs:38`) returns `Unhealthy` on any node that is not the cluster leader. As a result a standby central node that is fully operational (cluster member `Up`, database reachable) still returns `503` on `/health/ready`. This contradicts REQ-HOST-4a, which defines readiness as cluster membership + DB connectivity + singletons — explicitly *not* leadership. `/health/active` is the endpoint intended to report leadership. A load balancer using `/health/ready` to decide whether a node may serve traffic will permanently treat the standby as unready, defeating failover readiness. No test covers this: `HealthCheckTests.HealthReady_Endpoint_ReturnsResponse` only asserts a response is returned, not the standby semantics. **Recommendation** Add a `Predicate` to the `/health/ready` mapping that excludes the `active-node` check, e.g. `Predicate = check => check.Name != "active-node"` (or tag the readiness checks and filter by tag). Add a regression test asserting a non-leader node returns `200` on `/health/ready`. **Resolution** Resolved 2026-05-16 (commit ``). Root cause confirmed against `Program.cs`: the `/health/ready` mapping had no `Predicate`, so it executed all three registered checks including the leader-only `active-node` check, while `ActiveNodeHealthCheck` returns `Unhealthy` on any non-leader node — making a fully operational standby central node permanently report `503`. Fix: added `Predicate = check => check.Name != "active-node"` to the `/health/ready` `HealthCheckOptions`, so readiness now reflects cluster membership + DB connectivity only (REQ-HOST-4a); leadership remains reported solely by `/health/active`. Regression test `HealthCheckTests.HealthReady_Endpoint_ExcludesActiveNodeCheck` asserts the `active-node` check name does not appear in the `/health/ready` response body; it failed before the fix and passes after. Full Host suite green (156 passed). ### Host-002 — Akka.Persistence required by REQ-HOST-6 is not configured and not used | | | |--|--| | Severity | Medium | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:70-108`, `docs/requirements/Component-Host.md` REQ-HOST-6 | **Description** REQ-HOST-6 states the Host "must configure the Akka.NET actor system using Akka.Hosting with ... **Persistence**: Configured with the appropriate journal and snapshot store (SQL for central, SQLite for site)." The HOCON built in `AkkaHostedService.StartAsync` contains no `akka.persistence` section, no journal and no snapshot-store plugin, and `ScadaLink.Host.csproj` references neither `Akka.Persistence.Hosting` nor any persistence plugin (the design doc Dependencies list `Akka.Persistence.Hosting`). A repo-wide search finds **no** `PersistentActor` / `ReceivePersistentActor` subclasses — the system deliberately uses custom SQLite storage services instead. The code is internally consistent, but the design document is stale: it mandates a subsystem that does not exist. This is a documented-vs-actual drift that will mislead future maintainers and any audit against REQ-HOST-6. **Recommendation** Update `Component-Host.md` REQ-HOST-6 and the Dependencies list to remove the Akka.Persistence requirement (or explicitly state persistence is provided by component-owned SQLite storage, not Akka.Persistence). If persistence *is* intended, add the plugin packages and HOCON. Either way, code and doc must agree. **Resolution** Resolved 2026-05-16 (commit ``). Confirmed accurate: a repo-wide search finds **no** `PersistentActor` / `ReceivePersistentActor` subclasses anywhere in `src/`, no `akka.persistence` section in the HOCON built by `AkkaHostedService.StartAsync`, and `ScadaLink.Host.csproj` references no persistence plugin packages. The Host code is correct and internally consistent; the defect was a stale design doc. Fix: updated `docs/requirements/Component-Host.md` — REQ-HOST-6 no longer lists Persistence as a configured subsystem and now carries an explicit Persistence note stating that durable state is owned by individual components and persisted through component-owned stores (SQLite at sites, MS SQL centrally), with no Akka journal/snapshot store and no `PersistentActor` subclasses by design. The Responsibilities line and the Dependencies entry (`Akka.Persistence.Hosting` removed) were corrected to match. Code and doc now agree; no code or test change required. ### Host-003 — Secrets committed in plaintext in `appsettings.Central.json` | | | |--|--| | Severity | Medium | | Category | Security | | Status | Resolved | | Location | `src/ScadaLink.Host/appsettings.Central.json:20-31` | **Description** `appsettings.Central.json` contains real-looking secrets in plaintext, checked into source control: SQL Server passwords in the `ConfigurationDb` / `MachineDataDb` connection strings (`Password=ScadaLink_Dev1#`), an LDAP service-account password (`LdapServiceAccountPassword: "password"`), and a JWT signing key (`JwtSigningKey: "scadalink-dev-jwt-signing-key-..."`). Even though these are intended as development defaults, shipping them in the default config invites them being reused verbatim in production, and a committed JWT signing key allows anyone with repo access to forge session tokens. `TrustServerCertificate=true` additionally disables TLS validation for the SQL connection. **Recommendation** Move all secrets out of committed `appsettings*.json` into environment variables, user-secrets, or a secret store. Keep only non-sensitive structural defaults in the file and document the required environment variables. At minimum add a clear comment that these values are dev-only and must be overridden, and rotate the JWT key per environment. **Resolution** Resolved 2026-05-16 (commit `pending`). Root cause confirmed against `appsettings.Central.json`: the committed file carried real-looking secrets in plaintext — SQL Server passwords (`Password=ScadaLink_Dev1#`) in both connection strings, an LDAP service-account password, and a JWT signing key. Fix: all four secrets were removed from the committed file and replaced with non-functional `${...}` placeholder references (`ConfigurationDb` / `MachineDataDb`, `LdapServiceAccountPassword`, `JwtSigningKey`). A new top-level `_secrets` note documents that the Host's configuration builder (`AddEnvironmentVariables()`) overlays the real values supplied via environment variables (`ScadaLink__Database__ConfigurationDb`, `ScadaLink__Database__MachineDataDb`, `ScadaLink__Security__LdapServiceAccountPassword`, `ScadaLink__Security__JwtSigningKey`); the placeholders are intentionally invalid so a misconfigured deployment fails loudly rather than silently using a committed key. Regression test class `ConfigSecretsTests` asserts the committed file carries no plaintext `Password=` value, no committed LDAP service-account password, and no committed JWT signing key; all three tests failed before the fix and pass after. Tests that drive the full `Program` startup pipeline against the real SQL provider (`HealthCheckTests`, `HostStartupTests.CentralRole_StartsWithoutError`) were adapted to supply the local dev connection strings themselves via the new `CentralDbTestEnvironment` test helper (environment variables) — they must no longer depend on committed secrets. Note: the `docker/central-node-*/appsettings.Central.json` files still contain the same dev secrets but lie outside this task's permitted edit scope; they should receive the same treatment in a follow-up. ### Host-004 — Site seed-node list points at the gRPC port, not a remoting port | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.Host/appsettings.Site.json:10-19` | **Description** The shipped site config sets `Node:RemotingPort = 8082` and `Node:GrpcPort = 8083`, but `Cluster:SeedNodes` is `["akka.tcp://scadalink@localhost:8082", "akka.tcp://scadalink@localhost:8083"]`. The second seed node targets `8083`, which is the Kestrel HTTP/2 gRPC port — not an Akka remoting endpoint. A node attempting to join via that seed will try to establish an Akka.Remote TCP association against the gRPC listener and fail. `StartupValidator` only checks that ≥2 seed nodes exist (`StartupValidator.cs:54-56`), so this misconfiguration passes validation silently. For the single-node dev site it is harmless (the first seed succeeds), but it is an incorrect example that will be copied into multi-node site configs. **Recommendation** Correct the site seed-node list to reference the two site nodes' *remoting* ports (e.g. `8082` and `8084`), never the gRPC port. Consider extending `StartupValidator` to reject a seed node whose port equals this node's `GrpcPort`. **Resolution** Resolved 2026-05-16 (commit `pending`). Root cause confirmed against `appsettings.Site.json`: with `Node:RemotingPort = 8082` and `Node:GrpcPort = 8083`, the second `Cluster:SeedNodes` entry was `akka.tcp://scadalink@localhost:8083` — the Kestrel HTTP/2 gRPC port, not an Akka.Remote endpoint. `StartupValidator` only checked seed-node *count* (≥2), so the misconfiguration passed silently. Fix, two parts: (1) the shipped `appsettings.Site.json` second seed entry was corrected to a remoting port (`localhost:8084`); (2) `StartupValidator.Validate` was extended — for `Site` nodes it now parses each seed node's TCP port (via a new `SeedNodePort` helper) and rejects any entry whose port equals the node's `GrpcPort`, using the resolved GrpcPort including the `8083` `NodeOptions` default when the key is absent. The seed-node-count check was hoisted above the Site block so the new rule can reuse the parsed list. Regression tests in `StartupValidatorTests`: `Site_SeedNodeOnGrpcPort_FailsValidation`, `Site_SeedNodeOnDefaultGrpcPort_FailsValidation` (default-8083 path), `Site_SeedNodesOnRemotingPort_PassesValidation`, and `Central_SeedNodeOnPort8083_PassesValidation` (rule is Site-only) — all failed appropriately before the fix and pass after. ### Host-005 — Blocking sync-over-async (`GetAwaiter().GetResult()`) inside `StartAsync` | | | |--|--| | Severity | Low | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:345` | **Description** `RegisterSiteActors` calls `storeAndForwardService.StartAsync().GetAwaiter().GetResult()` synchronously, blocking inside the `IHostedService.StartAsync` path. `StartAsync` is itself declared synchronous (returns `Task.CompletedTask`), so the work cannot be awaited cleanly. Blocking on async work risks thread-pool starvation during startup and, if the awaited operation captures a synchronization context, deadlock. It also hides exceptions behind an `AggregateException` wrapper. **Recommendation** Make `AkkaHostedService.StartAsync` genuinely `async` and `await storeAndForwardService.StartAsync(cancellationToken)`. Propagate the `CancellationToken` and let exceptions surface as the original type. **Resolution** Resolved 2026-05-16 (commit `pending`). Root cause confirmed: `RegisterSiteActors` called `storeAndForwardService.StartAsync().GetAwaiter().GetResult()` synchronously inside the `IHostedService.StartAsync` path, blocking a startup thread and wrapping any failure in `AggregateException`. Fix: `AkkaHostedService.StartAsync` is now genuinely `async Task`; the site path was renamed `RegisterSiteActorsAsync` and made `async`; the store-and-forward init is now `await storeAndForwardService.StartAsync()` (the service's own signature, in another module, takes no `CancellationToken`, so the token is honoured via a `cancellationToken.ThrowIfCancellationRequested()` checkpoint immediately before the await). No blocking, no sync-context deadlock risk, and the original exception type now surfaces. No dedicated regression test: the only observable change is non-blocking/exception-unwrapping behaviour on a path that requires the full site DI graph plus a real SQLite store; the existing `HostStartupTests.SiteRole_*` and `CompositionRootTests` continue to exercise the async startup pipeline and remain green (175 passed). ### Host-006 — HOCON assembled by unescaped string interpolation | | | |--|--| | Severity | Low | | Category | Akka.NET conventions | | Status | Resolved | | Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:70-108` | **Description** The Akka HOCON is built with an interpolated string that injects `_nodeOptions.NodeHostname`, `_clusterOptions.SeedNodes`, the computed roles, and `SplitBrainResolverStrategy` directly into the configuration text. Values are not escaped. A hostname or seed-node string containing a quote, backslash, brace, or comment sequence would corrupt the HOCON and produce a confusing parse error far from the real cause; `SplitBrainResolverStrategy` is interpolated without quoting, so a value with whitespace breaks the document. Building cluster configuration from raw string concatenation is also harder to maintain than the typed Akka.Hosting builder the design doc (REQ-HOST-6) actually calls for ("via Akka.Hosting"). **Recommendation** Prefer the `Akka.Hosting` `AddAkka(...)` builder with strongly-typed `WithRemoting`, `WithClustering`, and split-brain-resolver configuration instead of hand-built HOCON. If HOCON must be retained, validate/escape interpolated values (especially hostname and seed nodes) before substitution. **Resolution** Resolved 2026-05-16 (commit `pending`). Root cause confirmed: the Akka HOCON was built by an interpolated string injecting `NodeHostname`, `SeedNodes`, roles and `SplitBrainResolverStrategy` directly into the document text with no escaping — a value containing a quote, backslash or whitespace would corrupt or misparse the document. Re-triaged the recommendation: switching wholesale to the typed `Akka.Hosting` builder is a larger refactor than this Low finding warrants and would touch how the whole cluster is bootstrapped, so the HOCON form was retained but made safe. Fix: the HOCON construction was extracted into a testable static `AkkaHostedService.BuildHocon`, and every interpolated *string* value is now routed through a `QuoteHocon` helper that wraps the value in a double-quoted HOCON literal and escapes embedded backslashes and quotes — `NodeHostname`, every seed node, every role, and `SplitBrainResolverStrategy` (the last was previously unquoted entirely, so whitespace alone would break it). Regression tests in new `HoconBuilderTests`: `BuildHocon_HostnameWithQuote_DoesNotCorruptDocument` (a hostname with a `"` still parses and following keys stay intact), `BuildHocon_StrategyWithWhitespace_RemainsQuoted`, and `BuildHocon_PlainValues_ParsesAndPreservesValues`. With the old raw interpolation a quote-containing hostname produced `hostname = "evil"host"`, closing the literal early and corrupting following keys; the escaped form parses correctly. Full Host suite green (175 passed). ### Host-007 — REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced | | | |--|--| | Severity | Low | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.Host/StartupValidator.cs:43-47` | **Description** REQ-HOST-4 requires: "Site nodes must have `GrpcPort` in valid port range (1–65535) **and different from `RemotingPort`**." `StartupValidator` validates the GrpcPort range but never compares it to `RemotingPort`. A site config that sets both ports to the same value passes validation and then fails opaquely at runtime when Kestrel and Akka.Remote both try to bind the port. The GrpcPort range check is also skipped entirely when the key is absent (`grpcPortStr != null`), relying on the `NodeOptions` default of 8083 — acceptable, but the equality rule is the missing piece. **Recommendation** Add a check in the `role == "Site"` block: if `GrpcPort` (resolved, including the 8083 default) equals `RemotingPort`, add an error `"ScadaLink:Node:GrpcPort must differ from RemotingPort"`. **Resolution** Resolved 2026-05-16 (commit `pending`). Root cause confirmed against `StartupValidator.cs`: the `role == "Site"` block validated the GrpcPort range but never compared it to `RemotingPort`, so a site config setting both ports equal passed validation and then failed opaquely when Kestrel and Akka.Remote both bound the port. Fix: added `if (port == grpcPort) errors.Add("ScadaLink:Node:GrpcPort must differ from RemotingPort")` in the Site block, using the resolved GrpcPort (including the 8083 `NodeOptions` default when the key is absent) against the parsed RemotingPort. Regression tests in `StartupValidatorTests`: `Site_GrpcPortEqualsRemotingPort_FailsValidation`, `Site_DefaultGrpcPortEqualsRemotingPort_FailsValidation` (default-8083 path), and `Site_GrpcPortDiffersFromRemotingPort_PassesValidation` — all failed appropriately before the fix and pass after. Full Host suite green (175 passed). ### Host-008 — `MachineDataDb` is validated and declared but never consumed | | | |--|--| | Severity | Low | | Category | Code organization & conventions | | Status | Resolved | | Location | `src/ScadaLink.Host/StartupValidator.cs:33-34`, `src/ScadaLink.Host/DatabaseOptions.cs:6` | **Description** `StartupValidator` requires a non-empty `ScadaLink:Database:MachineDataDb` connection string for Central nodes, and `DatabaseOptions` exposes a `MachineDataDb` property, but a repo-wide search shows the value is never read anywhere outside the Host module — only `ConfigurationDb` is passed to `AddConfigurationDatabase` (`Program.cs:83-85`). The Host therefore fails startup if `MachineDataDb` is missing even though nothing uses it. This is either dead configuration that should be removed or a missing wiring (a machine-data DbContext that was never registered). **Recommendation** Determine whether a machine-data store is actually required. If yes, wire it into the relevant component's DI registration. If no, remove the `MachineDataDb` validation rule, the `DatabaseOptions` property, and the key from `appsettings.Central.json`. **Resolution** Resolved 2026-05-16 (commit `pending`). Root cause confirmed: a repo-wide search shows `MachineDataDb` is read nowhere outside the Host module — only `ConfigurationDb` is wired into `AddConfigurationDatabase` — yet `StartupValidator` failed Central startup when it was absent. No machine-data store exists or is registered, so this is dead configuration. Fix (the "if no" branch of the recommendation): removed the `MachineDataDb` validation rule from `StartupValidator`, the `MachineDataDb` property from `DatabaseOptions`, the `MachineDataDb` key from `appsettings.Central.json`, and the now-pointless `MachineDataDb` reference from that file's `_secrets` note and from the `CentralDbTestEnvironment` test helper. Regression test: the prior `StartupValidatorTests.Central_MissingMachineDataDb_FailsValidation` was inverted to `Central_MissingMachineDataDb_PassesValidation` (absence must now NOT fail startup), and `MachineDataDb` was removed from the shared `ValidCentralConfig` fixture so all Central tests prove the key is no longer required; it failed under the old code and passes after. Full Host suite green (175 passed). Note: `appsettings.Central.json` under `docker/central-node-*` is outside this task's edit scope and still carries the key harmlessly — it will simply be ignored; a follow-up may remove it for tidiness. ### Host-009 — `StartAsync` reports success before role actors are confirmed running | | | |--|--| | Severity | Low | | Category | Akka.NET conventions | | Status | Resolved | | Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:127-141` | **Description** `StartAsync` creates actors with `ActorOf` (a fire-and-forget operation — the actor's `PreStart` runs asynchronously on its own thread) and then returns `Task.CompletedTask`. For site nodes, `grpcServer.SetReady(_actorSystem)` is called synchronously at the end of `RegisterSiteActors`, marking the gRPC server ready even though `SiteCommunicationActor`, the deployment-manager singleton, and the `ClusterClient` may not yet have completed their `PreStart`/initial-contact handshake. REQ-HOST-7 requires "Actor system and SiteStreamManager ... initialized before gRPC begins accepting connections" — `SiteStreamManager.Initialize` is awaited-equivalent, but the broader actor graph is not. The window is small and the gRPC server still rejects streams until `SetReady`, so impact is limited, but readiness is being asserted optimistically. **Recommendation** If strict ordering matters, gate `SetReady` on confirmation that `SiteCommunicationActor` is fully initialized (e.g. an `Ask` round-trip or a readiness message), or document explicitly that gRPC readiness only guarantees the actor system exists, not that the cluster handshake has completed. **Resolution** Resolved 2026-05-16 (commit `pending`) — re-triaged, documentation fix. Root cause confirmed: `SetReady` is called optimistically after `ActorOf` registrations whose `PreStart`/initial-contact handshakes run asynchronously. Re-triaged the first recommended option (gating `SetReady` on a `SiteCommunicationActor` `Ask` round-trip): that option is **incorrect** because the meaningful asynchronous step is the `ClusterClient` initial-contact handshake with central — gating site gRPC readiness on it would prevent a site from coming up and streaming locally while central is briefly unreachable, contradicting the design's site-autonomy model. Confirmed `SiteStreamGrpcServer` already rejects any stream opened before `SetReady` with `StatusCode.Unavailable`, so the window is benign. Fix (the second, correct option): the `SetReady` call site now carries an explicit comment documenting the narrow readiness contract — by `SetReady` the actor system exists, `SiteStreamManager.Initialize` has run, and every role actor has been created with ordered synchronous `ActorOf` + registration `Tell`s; what is deliberately NOT guaranteed is `PreStart` completion or the central handshake. No regression test is meaningful for a documentation-only clarification of an intentional design contract; the existing `HostStartupTests.SiteRole_*` cover the surrounding startup path. Full Host suite green (175 passed). ### Host-010 — No retry/backoff around startup preconditions (DB migration, readiness) | | | |--|--| | Severity | Low | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.Host/Program.cs:112-125` | **Description** On Central startup the Host opens a DI scope and calls `MigrationHelper.ApplyOrValidateMigrationsAsync` directly. If the SQL Server is not yet reachable (common in container orchestration where the DB and app start together), the call throws, the top-level `catch` logs `Fatal`, and the process exits. There is no bounded retry/backoff to tolerate a database that is briefly unavailable at boot. The design intent (REQ-HOST-4a, readiness gating, `503` until ready) is about *serving traffic*, but the migration step happens before the host even runs and has no such tolerance. **Recommendation** Wrap the migration/validation step in a bounded retry with exponential backoff (e.g. Polly), or move schema apply behind the readiness gate so the process stays up and reports `503` until the database becomes reachable. **Resolution** Resolved 2026-05-16 (commit `pending`). Root cause confirmed: Central startup called `MigrationHelper.ApplyOrValidateMigrationsAsync` directly with no tolerance for a database that is briefly unreachable at boot — a first connection failure threw, the top-level `catch` logged `Fatal`, and the process exited. Fix: added a new `StartupRetry.ExecuteWithRetryAsync` helper (bounded retry with capped exponential backoff — no Polly dependency, keeping the Host's package set minimal) and wrapped the migration step in it (`maxAttempts: 8`, `initialDelay: 2s`, doubling, capped at 30s). Each retry opens a fresh DI scope so a transient connection fault does not leave a poisoned `DbContext`; once attempts are exhausted the last exception propagates and the existing fatal-log path still fires. Regression tests in new `StartupRetryTests`: `ExecuteWithRetry_SucceedsFirstTry_RunsOnce`, `ExecuteWithRetry_TransientFailures_RetriesUntilSuccess` (fails twice then succeeds), and `ExecuteWithRetry_ExhaustsAttempts_RethrowsLastException` (verifies the original exception type/message surfaces after exhaustion) — they fail to compile/pass against the pre-fix code (the helper did not exist) and pass after. Full Host suite green (175 passed). ### Host-011 — `LoggingOptions.MinimumLevel` is dead configuration | | | |--|--| | Severity | Low | | Category | Code organization & conventions | | Status | Resolved | | Location | `src/ScadaLink.Host/LoggingOptions.cs:5`, `src/ScadaLink.Host/Program.cs:42-50` | **Description** `LoggingOptions` exposes a `MinimumLevel` property bound from `ScadaLink:Logging` (`SiteServiceRegistration.BindSharedOptions`), and both `appsettings.Central.json` and `appsettings.Site.json` set `"Logging": { "MinimumLevel": "Information" }`. However Serilog is configured purely via `ReadFrom.Configuration(configuration)`, which reads the standard `Serilog` section — not `ScadaLink:Logging`. The `LoggingOptions.MinimumLevel` value is never read by any code, so changing it has no effect. This is misleading: an operator editing `ScadaLink:Logging:MinimumLevel` expecting a log-level change will see nothing happen. **Recommendation** Either consume `LoggingOptions.MinimumLevel` when configuring the Serilog `LoggerConfiguration` (e.g. set `MinimumLevel.Is(...)` from it), or remove the option class and the `ScadaLink:Logging` sections and rely solely on the `Serilog` configuration section. Keep one mechanism, not two. **Resolution** Resolved 2026-05-16 (commit `pending`). Root cause confirmed: `LoggingOptions.MinimumLevel` was bound from `ScadaLink:Logging` but Serilog was configured purely via `ReadFrom.Configuration` (standard `Serilog` section), so editing `ScadaLink:Logging:MinimumLevel` had no effect. Re-triaged the two options: the "remove the option class" branch was rejected because `Component-Host.md` REQ-HOST-3 explicitly lists `ScadaLink:Logging` → `LoggingOptions` and that design doc is outside this task's edit scope — removing the class would create a fresh code-vs-doc drift. Fix (the "consume it" branch): added a `LoggerConfigurationFactory.Build` that binds `LoggingOptions`, parses `MinimumLevel` into a Serilog `LogEventLevel` (falling back to `Information` for null/blank/unrecognised values), and pins it via `MinimumLevel.Is(...)` on top of `ReadFrom.Configuration`; `Program.cs` now builds the logger through this factory. `ScadaLink:Logging:MinimumLevel` is now live. Regression tests in new `LoggerConfigurationTests`: `MinimumLevel_Warning_SuppressesInformationLogs`, `MinimumLevel_Debug_AllowsDebugLogs`, and `MinimumLevel_Absent_DefaultsToInformation` — they assert the configured level actually filters log events; they pass against the new factory (and would fail against the old inline configuration, which ignored the key). Full Host suite green (175 passed). ### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read | | | |--|--| | Severity | Medium | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` | **Description** `AkkaHostedService.BuildHocon` emits the split-brain-resolver block with `keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions` (`src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a `DownIfAlone` property — bound from `ScadaLink:Cluster` via the Options pattern, documented as "the design-doc requirement", default `true` — but a repo-wide search shows it is referenced **nowhere outside its own declaration**. The Host therefore ignores the bound value entirely: setting `ScadaLink:Cluster:DownIfAlone` to `false` in `appsettings.json` has no effect, the resolver still runs with `down-if-alone = on`. This is the same class of defect as the resolved Host-011 (`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is declared, bound, and documented but never consumed, which silently misleads any operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the ClusterInfrastructure project owns only the configuration contract, per the `ClusterOptions` class comment). **Recommendation** Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed at `on` for ScadaLink's two-node clusters, remove the `DownIfAlone` property and its doc comment instead so code and configuration contract agree — but do not leave it declared-but-dead. **Resolution** Resolved 2026-05-17 (commit ``). Root cause confirmed against `AkkaHostedService.cs`: `BuildHocon` emitted `keep-oldest { down-if-alone = on }` as a literal, and `ClusterOptions.DownIfAlone` was referenced nowhere outside its own declaration, so the bound option was dead. Fix: `BuildHocon` now emits `down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}`, consuming the bound property. Regression tests in `HoconBuilderTests`: `BuildHocon_DownIfAloneTrue_EmitsOn` and `BuildHocon_DownIfAloneFalse_EmitsOff` parse the resulting HOCON and assert the `keep-oldest.down-if-alone` token tracks the property; the `false` case failed before the fix (token was still `on`) and passes after. Full Host suite green (182 passed). ### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration | | | |--|--| | Severity | Low | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` | **Description** `BuildHocon` renders every duration into HOCON via `TotalSeconds:F0` — `transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`, `HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The `F0` format specifier rounds to whole seconds, so any sub-second configuration is silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as `heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan` values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with no warning — or, worse, `0s`. Rounding configured timing values without surfacing the change is a correctness hazard for exactly the kind of failure-detection tuning these options exist for. **Recommendation** Render durations without precision loss — emit milliseconds (e.g. `heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or validate in `StartupValidator` that each cluster timing value is a positive whole number of seconds and fail fast otherwise. Either way, do not silently round. **Resolution** Resolved 2026-05-17 (commit ``). Root cause confirmed: `BuildHocon` rendered every duration via `TotalSeconds:F0`, rounding sub-second values — a 750ms heartbeat became `1s` and a 500ms value would collapse to a degenerate `0s`. Fix: durations are now rendered by a new `DurationHocon` helper that emits whole milliseconds with an `ms` suffix (Akka's HOCON parser accepts it), so sub-second configuration survives exactly; `BuildHocon` takes the transport heartbeat/failure as `TimeSpan` rather than pre-rounded `double` seconds. Regression test `HoconBuilderTests.BuildHocon_SubSecondTimings_PreservedWithoutRounding` configures 750ms/2600ms/500ms/2.5s/7.5s timings and asserts each round-trips through the parsed HOCON unchanged; it failed before the fix (750ms parsed as `00:00:01`) and passes after. Full Host suite green (182 passed). ### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8) | | | |--|--| | Severity | Low | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` | **Description** REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls `.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog` configuration section — but neither `appsettings.json` nor either role-specific file contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel` map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and `.WriteTo.File("logs/scadalink-.log", rollingInterval: Day)` calls. As a result the console output template, the file path, and the rolling interval cannot be changed without recompiling — an operator cannot redirect logs, change the file location, or add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration` call is effectively a no-op because the section it reads does not exist. **Recommendation** Move the console and file sink definitions into a `Serilog` section in `appsettings.json` (with `WriteTo` entries and the output template / file path / rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to state the sinks are intentionally code-defined — but the design doc currently says "configuration-driven", so code and doc must be reconciled. **Resolution** Resolved 2026-05-17 (commit ``). Root cause confirmed: `LoggerConfigurationFactory.Build` called `ReadFrom.Configuration` but neither `appsettings.json` nor either role-specific file contained a `Serilog` section, so the two sinks that actually ran were hard-coded `.WriteTo.Console(...)` / `.WriteTo.File(...)` calls in `Program.cs` — uneditable without recompiling, contrary to REQ-HOST-8. Fix: added a `Serilog` section to `appsettings.json` with `WriteTo` entries for the Console and File sinks (carrying the output template, file path `logs/scadalink-.log` and `Day` rolling interval as args) and removed the hard-coded `.WriteTo` calls from `Program.cs`; the factory's `ReadFrom.Configuration` (backed by the transitive `Serilog.Settings.Configuration` package) now drives the sinks. Regression tests in new `SerilogSinkConfigTests`: `ShippedAppSettings_HasSerilogSection_WithConsoleAndFileSinks` asserts the shipped `appsettings.json` declares both sinks in a `Serilog:WriteTo` array (fails against the pre-fix file, which had no `Serilog` section), and `LoggerConfigurationFactory_AppliesConfiguredFileSink` proves a configuration-supplied File sink reaches the built logger and writes to the configured path. Full Host suite green (182 passed). ### Host-015 — `StartupRetry` retries on every exception type, including permanent failures | | | |--|--| | Severity | Low | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` | **Description** `StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard `when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which on a Central node in production *validates* the schema version and throws if the deployed schema does not match the expected migration set. A schema-version mismatch is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it 8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes) before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault that is fatal on the first attempt. The retry helper is meant for *transient* connection faults (the XML doc says exactly that: "the database may be briefly unreachable"), but it cannot distinguish transient from permanent failures. **Recommendation** Restrict retries to transient faults — e.g. accept an `Func isTransient` predicate and, for the migration call site, treat only connection-class exceptions (`SqlException` with a connection/transport error number, `TimeoutException`, socket errors) as retryable while letting schema-validation / `InvalidOperationException` failures propagate immediately. Add a `StartupRetryTests` case asserting a non-transient exception is rethrown after a single attempt. **Resolution** Resolved 2026-05-17 (commit ``). Root cause confirmed: `ExecuteWithRetryAsync` caught `Exception` with only the `when (attempt < maxAttempts)` guard, so it retried any exception — including a permanent schema-version mismatch raised by the migration call site, delaying the inevitable fatal exit by ~2 minutes. Fix: added an optional `Func isTransient` parameter (default treats all exceptions as transient, preserving prior behaviour for unclassified callers); the catch guard is now `when (attempt < maxAttempts && isTransient(ex))`, so a non-transient fault propagates after a single attempt. Added a `StartupRetry.IsTransientDatabaseFault` classifier that returns `true` only for connection-class faults (`TimeoutException`, `SocketException`, `SqlException` / `DbException` by type-name match) and `false` for everything else — notably schema-validation `InvalidOperationException`s — and the `Program.cs` migration call site now passes it. Regression tests in `StartupRetryTests`: `ExecuteWithRetry_NonTransientFailure_RethrowsAfterSingleAttempt` (runs exactly once when `isTransient` returns false) and `ExecuteWithRetry_TransientThenPermanent_StopsAtPermanent` (retries a `TimeoutException` then stops at a permanent `InvalidOperationException`). Full Host suite green (182 passed).