21 new findings: Host-012..015, InboundAPI-014..017, ManagementService-014..017, NotificationService-014..018, Security-012..015.
731 lines
38 KiB
Markdown
731 lines
38 KiB
Markdown
# Code Review — Host
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ScadaLink.Host` |
|
||
| Design doc | `docs/requirements/Component-Host.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-05-17 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `39d737e` |
|
||
| Open findings | 4 |
|
||
|
||
## Summary
|
||
|
||
The Host module is the composition root for the entire ScadaLink system: a single
|
||
binary whose behaviour (`Central` vs `Site`) is driven entirely by configuration. The
|
||
implementation is generally faithful to `Component-Host.md` — startup validation,
|
||
role-based registration, Serilog enrichment, Windows Service support, dead-letter
|
||
monitoring, CoordinatedShutdown, and gRPC hosting on site nodes are all present and
|
||
backed by a solid test suite (`tests/ScadaLink.Host.Tests`).
|
||
|
||
The most significant problem is the readiness endpoint: `/health/ready` runs **all**
|
||
registered health checks, including the leader-only `active-node` check, so a fully
|
||
operational *standby* central node permanently reports `503` on `/health/ready` —
|
||
directly contradicting REQ-HOST-4a, which defines readiness as cluster membership +
|
||
DB connectivity (not leadership). Several other findings concern configuration that
|
||
is validated-but-never-consumed (`MachineDataDb`), design-doc drift (Akka.Persistence
|
||
is required by REQ-HOST-6 but the system uses no persistent actors), an incorrect
|
||
seed-node entry in the shipped site config, blocking sync-over-async during startup,
|
||
and unguarded string interpolation when building HOCON. None are crash/data-loss
|
||
class, but the readiness bug is High because it breaks load-balancer behaviour with
|
||
no safe workaround.
|
||
|
||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||
|
||
All eleven findings from the first review (Host-001..011) are confirmed `Resolved`
|
||
in the current tree — the `/health/ready` predicate, the externalised secrets, the
|
||
seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
|
||
migration retry and the live `LoggingOptions.MinimumLevel` are all present as
|
||
described in their Resolution notes. This re-review walked all ten checklist
|
||
categories again over the full module and recorded four new findings, none of them
|
||
crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
|
||
note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes
|
||
`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so
|
||
that documented, bound option is dead. The remaining three are Low: `:F0` rounding
|
||
of cluster timing values silently degrades any sub-second configuration (Host-013),
|
||
Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as
|
||
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
|
||
exception type including permanent schema-validation failures (Host-015).
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). |
|
||
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
|
||
| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. |
|
||
| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
|
||
| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
|
||
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
|
||
| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
|
||
| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
|
||
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
|
||
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
|
||
|
||
## Findings
|
||
|
||
### Host-001 — `/health/ready` includes the leader-only `active-node` check
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/Program.cs:135-145` |
|
||
|
||
**Description**
|
||
|
||
`/health/ready` is mapped with `MapHealthChecks("/health/ready", ...)` and **no
|
||
`Predicate`**, so it executes every registered check: `database`, `akka-cluster`
|
||
*and* `active-node`. `ActiveNodeHealthCheck` (`Health/ActiveNodeHealthCheck.cs:38`)
|
||
returns `Unhealthy` on any node that is not the cluster leader. As a result a
|
||
standby central node that is fully operational (cluster member `Up`, database
|
||
reachable) still returns `503` on `/health/ready`. This contradicts REQ-HOST-4a,
|
||
which defines readiness as cluster membership + DB connectivity + singletons —
|
||
explicitly *not* leadership. `/health/active` is the endpoint intended to report
|
||
leadership. A load balancer using `/health/ready` to decide whether a node may
|
||
serve traffic will permanently treat the standby as unready, defeating failover
|
||
readiness. No test covers this: `HealthCheckTests.HealthReady_Endpoint_ReturnsResponse`
|
||
only asserts a response is returned, not the standby semantics.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `Predicate` to the `/health/ready` mapping that excludes the `active-node`
|
||
check, e.g. `Predicate = check => check.Name != "active-node"` (or tag the readiness
|
||
checks and filter by tag). Add a regression test asserting a non-leader node returns
|
||
`200` on `/health/ready`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against
|
||
`Program.cs`: the `/health/ready` mapping had no `Predicate`, so it executed all
|
||
three registered checks including the leader-only `active-node` check, while
|
||
`ActiveNodeHealthCheck` returns `Unhealthy` on any non-leader node — making a fully
|
||
operational standby central node permanently report `503`. Fix: added
|
||
`Predicate = check => check.Name != "active-node"` to the `/health/ready`
|
||
`HealthCheckOptions`, so readiness now reflects cluster membership + DB connectivity
|
||
only (REQ-HOST-4a); leadership remains reported solely by `/health/active`.
|
||
Regression test `HealthCheckTests.HealthReady_Endpoint_ExcludesActiveNodeCheck`
|
||
asserts the `active-node` check name does not appear in the `/health/ready`
|
||
response body; it failed before the fix and passes after. Full Host suite green
|
||
(156 passed).
|
||
|
||
### Host-002 — Akka.Persistence required by REQ-HOST-6 is not configured and not used
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:70-108`, `docs/requirements/Component-Host.md` REQ-HOST-6 |
|
||
|
||
**Description**
|
||
|
||
REQ-HOST-6 states the Host "must configure the Akka.NET actor system using
|
||
Akka.Hosting with ... **Persistence**: Configured with the appropriate journal and
|
||
snapshot store (SQL for central, SQLite for site)." The HOCON built in
|
||
`AkkaHostedService.StartAsync` contains no `akka.persistence` section, no journal and
|
||
no snapshot-store plugin, and `ScadaLink.Host.csproj` references neither
|
||
`Akka.Persistence.Hosting` nor any persistence plugin (the design doc Dependencies
|
||
list `Akka.Persistence.Hosting`). A repo-wide search finds **no** `PersistentActor` /
|
||
`ReceivePersistentActor` subclasses — the system deliberately uses custom SQLite
|
||
storage services instead. The code is internally consistent, but the design document
|
||
is stale: it mandates a subsystem that does not exist. This is a documented-vs-actual
|
||
drift that will mislead future maintainers and any audit against REQ-HOST-6.
|
||
|
||
**Recommendation**
|
||
|
||
Update `Component-Host.md` REQ-HOST-6 and the Dependencies list to remove the
|
||
Akka.Persistence requirement (or explicitly state persistence is provided by
|
||
component-owned SQLite storage, not Akka.Persistence). If persistence *is* intended,
|
||
add the plugin packages and HOCON. Either way, code and doc must agree.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). Confirmed accurate: a repo-wide search
|
||
finds **no** `PersistentActor` / `ReceivePersistentActor` subclasses anywhere in
|
||
`src/`, no `akka.persistence` section in the HOCON built by
|
||
`AkkaHostedService.StartAsync`, and `ScadaLink.Host.csproj` references no
|
||
persistence plugin packages. The Host code is correct and internally consistent;
|
||
the defect was a stale design doc. Fix: updated `docs/requirements/Component-Host.md`
|
||
— REQ-HOST-6 no longer lists Persistence as a configured subsystem and now carries
|
||
an explicit Persistence note stating that durable state is owned by individual
|
||
components and persisted through component-owned stores (SQLite at sites, MS SQL
|
||
centrally), with no Akka journal/snapshot store and no `PersistentActor` subclasses
|
||
by design. The Responsibilities line and the Dependencies entry
|
||
(`Akka.Persistence.Hosting` removed) were corrected to match. Code and doc now
|
||
agree; no code or test change required.
|
||
|
||
### Host-003 — Secrets committed in plaintext in `appsettings.Central.json`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/appsettings.Central.json:20-31` |
|
||
|
||
**Description**
|
||
|
||
`appsettings.Central.json` contains real-looking secrets in plaintext, checked into
|
||
source control: SQL Server passwords in the `ConfigurationDb` / `MachineDataDb`
|
||
connection strings (`Password=ScadaLink_Dev1#`), an LDAP service-account password
|
||
(`LdapServiceAccountPassword: "password"`), and a JWT signing key
|
||
(`JwtSigningKey: "scadalink-dev-jwt-signing-key-..."`). Even though these are
|
||
intended as development defaults, shipping them in the default config invites them
|
||
being reused verbatim in production, and a committed JWT signing key allows anyone
|
||
with repo access to forge session tokens. `TrustServerCertificate=true` additionally
|
||
disables TLS validation for the SQL connection.
|
||
|
||
**Recommendation**
|
||
|
||
Move all secrets out of committed `appsettings*.json` into environment variables,
|
||
user-secrets, or a secret store. Keep only non-sensitive structural defaults in the
|
||
file and document the required environment variables. At minimum add a clear comment
|
||
that these values are dev-only and must be overridden, and rotate the JWT key per
|
||
environment.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed against
|
||
`appsettings.Central.json`: the committed file carried real-looking secrets in
|
||
plaintext — SQL Server passwords (`Password=ScadaLink_Dev1#`) in both connection
|
||
strings, an LDAP service-account password, and a JWT signing key. Fix: all four
|
||
secrets were removed from the committed file and replaced with non-functional
|
||
`${...}` placeholder references (`ConfigurationDb` / `MachineDataDb`,
|
||
`LdapServiceAccountPassword`, `JwtSigningKey`). A new top-level `_secrets` note
|
||
documents that the Host's configuration builder (`AddEnvironmentVariables()`)
|
||
overlays the real values supplied via environment variables
|
||
(`ScadaLink__Database__ConfigurationDb`, `ScadaLink__Database__MachineDataDb`,
|
||
`ScadaLink__Security__LdapServiceAccountPassword`,
|
||
`ScadaLink__Security__JwtSigningKey`); the placeholders are intentionally invalid so
|
||
a misconfigured deployment fails loudly rather than silently using a committed key.
|
||
Regression test class `ConfigSecretsTests` asserts the committed file carries no
|
||
plaintext `Password=` value, no committed LDAP service-account password, and no
|
||
committed JWT signing key; all three tests failed before the fix and pass after.
|
||
Tests that drive the full `Program` startup pipeline against the real SQL provider
|
||
(`HealthCheckTests`, `HostStartupTests.CentralRole_StartsWithoutError`) were adapted
|
||
to supply the local dev connection strings themselves via the new
|
||
`CentralDbTestEnvironment` test helper (environment variables) — they must no longer
|
||
depend on committed secrets. Note: the `docker/central-node-*/appsettings.Central.json`
|
||
files still contain the same dev secrets but lie outside this task's permitted edit
|
||
scope; they should receive the same treatment in a follow-up.
|
||
|
||
### Host-004 — Site seed-node list points at the gRPC port, not a remoting port
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/appsettings.Site.json:10-19` |
|
||
|
||
**Description**
|
||
|
||
The shipped site config sets `Node:RemotingPort = 8082` and `Node:GrpcPort = 8083`,
|
||
but `Cluster:SeedNodes` is `["akka.tcp://scadalink@localhost:8082",
|
||
"akka.tcp://scadalink@localhost:8083"]`. The second seed node targets `8083`, which
|
||
is the Kestrel HTTP/2 gRPC port — not an Akka remoting endpoint. A node attempting to
|
||
join via that seed will try to establish an Akka.Remote TCP association against the
|
||
gRPC listener and fail. `StartupValidator` only checks that ≥2 seed nodes exist
|
||
(`StartupValidator.cs:54-56`), so this misconfiguration passes validation silently.
|
||
For the single-node dev site it is harmless (the first seed succeeds), but it is an
|
||
incorrect example that will be copied into multi-node site configs.
|
||
|
||
**Recommendation**
|
||
|
||
Correct the site seed-node list to reference the two site nodes' *remoting* ports
|
||
(e.g. `8082` and `8084`), never the gRPC port. Consider extending `StartupValidator`
|
||
to reject a seed node whose port equals this node's `GrpcPort`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed against
|
||
`appsettings.Site.json`: with `Node:RemotingPort = 8082` and `Node:GrpcPort = 8083`,
|
||
the second `Cluster:SeedNodes` entry was `akka.tcp://scadalink@localhost:8083` — the
|
||
Kestrel HTTP/2 gRPC port, not an Akka.Remote endpoint. `StartupValidator` only
|
||
checked seed-node *count* (≥2), so the misconfiguration passed silently. Fix, two
|
||
parts: (1) the shipped `appsettings.Site.json` second seed entry was corrected to a
|
||
remoting port (`localhost:8084`); (2) `StartupValidator.Validate` was extended — for
|
||
`Site` nodes it now parses each seed node's TCP port (via a new `SeedNodePort`
|
||
helper) and rejects any entry whose port equals the node's `GrpcPort`, using the
|
||
resolved GrpcPort including the `8083` `NodeOptions` default when the key is absent.
|
||
The seed-node-count check was hoisted above the Site block so the new rule can reuse
|
||
the parsed list. Regression tests in `StartupValidatorTests`:
|
||
`Site_SeedNodeOnGrpcPort_FailsValidation`,
|
||
`Site_SeedNodeOnDefaultGrpcPort_FailsValidation` (default-8083 path),
|
||
`Site_SeedNodesOnRemotingPort_PassesValidation`, and
|
||
`Central_SeedNodeOnPort8083_PassesValidation` (rule is Site-only) — all failed
|
||
appropriately before the fix and pass after.
|
||
|
||
### Host-005 — Blocking sync-over-async (`GetAwaiter().GetResult()`) inside `StartAsync`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:345` |
|
||
|
||
**Description**
|
||
|
||
`RegisterSiteActors` calls `storeAndForwardService.StartAsync().GetAwaiter().GetResult()`
|
||
synchronously, blocking inside the `IHostedService.StartAsync` path. `StartAsync` is
|
||
itself declared synchronous (returns `Task.CompletedTask`), so the work cannot be
|
||
awaited cleanly. Blocking on async work risks thread-pool starvation during startup
|
||
and, if the awaited operation captures a synchronization context, deadlock. It also
|
||
hides exceptions behind an `AggregateException` wrapper.
|
||
|
||
**Recommendation**
|
||
|
||
Make `AkkaHostedService.StartAsync` genuinely `async` and `await
|
||
storeAndForwardService.StartAsync(cancellationToken)`. Propagate the
|
||
`CancellationToken` and let exceptions surface as the original type.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: `RegisterSiteActors`
|
||
called `storeAndForwardService.StartAsync().GetAwaiter().GetResult()` synchronously
|
||
inside the `IHostedService.StartAsync` path, blocking a startup thread and wrapping
|
||
any failure in `AggregateException`. Fix: `AkkaHostedService.StartAsync` is now
|
||
genuinely `async Task`; the site path was renamed `RegisterSiteActorsAsync` and made
|
||
`async`; the store-and-forward init is now `await storeAndForwardService.StartAsync()`
|
||
(the service's own signature, in another module, takes no `CancellationToken`, so the
|
||
token is honoured via a `cancellationToken.ThrowIfCancellationRequested()` checkpoint
|
||
immediately before the await). No blocking, no sync-context deadlock risk, and the
|
||
original exception type now surfaces. No dedicated regression test: the only
|
||
observable change is non-blocking/exception-unwrapping behaviour on a path that
|
||
requires the full site DI graph plus a real SQLite store; the existing
|
||
`HostStartupTests.SiteRole_*` and `CompositionRootTests` continue to exercise the
|
||
async startup pipeline and remain green (175 passed).
|
||
|
||
### Host-006 — HOCON assembled by unescaped string interpolation
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:70-108` |
|
||
|
||
**Description**
|
||
|
||
The Akka HOCON is built with an interpolated string that injects
|
||
`_nodeOptions.NodeHostname`, `_clusterOptions.SeedNodes`, the computed roles, and
|
||
`SplitBrainResolverStrategy` directly into the configuration text. Values are not
|
||
escaped. A hostname or seed-node string containing a quote, backslash, brace, or
|
||
comment sequence would corrupt the HOCON and produce a confusing parse error far from
|
||
the real cause; `SplitBrainResolverStrategy` is interpolated without quoting, so a
|
||
value with whitespace breaks the document. Building cluster configuration from raw
|
||
string concatenation is also harder to maintain than the typed Akka.Hosting builder
|
||
the design doc (REQ-HOST-6) actually calls for ("via Akka.Hosting").
|
||
|
||
**Recommendation**
|
||
|
||
Prefer the `Akka.Hosting` `AddAkka(...)` builder with strongly-typed `WithRemoting`,
|
||
`WithClustering`, and split-brain-resolver configuration instead of hand-built HOCON.
|
||
If HOCON must be retained, validate/escape interpolated values (especially hostname
|
||
and seed nodes) before substitution.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: the Akka HOCON was
|
||
built by an interpolated string injecting `NodeHostname`, `SeedNodes`, roles and
|
||
`SplitBrainResolverStrategy` directly into the document text with no escaping — a
|
||
value containing a quote, backslash or whitespace would corrupt or misparse the
|
||
document. Re-triaged the recommendation: switching wholesale to the typed
|
||
`Akka.Hosting` builder is a larger refactor than this Low finding warrants and would
|
||
touch how the whole cluster is bootstrapped, so the HOCON form was retained but made
|
||
safe. Fix: the HOCON construction was extracted into a testable static
|
||
`AkkaHostedService.BuildHocon`, and every interpolated *string* value is now routed
|
||
through a `QuoteHocon` helper that wraps the value in a double-quoted HOCON literal
|
||
and escapes embedded backslashes and quotes — `NodeHostname`, every seed node, every
|
||
role, and `SplitBrainResolverStrategy` (the last was previously unquoted entirely, so
|
||
whitespace alone would break it). Regression tests in new `HoconBuilderTests`:
|
||
`BuildHocon_HostnameWithQuote_DoesNotCorruptDocument` (a hostname with a `"` still
|
||
parses and following keys stay intact), `BuildHocon_StrategyWithWhitespace_RemainsQuoted`,
|
||
and `BuildHocon_PlainValues_ParsesAndPreservesValues`. With the old raw interpolation
|
||
a quote-containing hostname produced `hostname = "evil"host"`, closing the literal
|
||
early and corrupting following keys; the escaped form parses correctly. Full Host
|
||
suite green (175 passed).
|
||
|
||
### Host-007 — REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/StartupValidator.cs:43-47` |
|
||
|
||
**Description**
|
||
|
||
REQ-HOST-4 requires: "Site nodes must have `GrpcPort` in valid port range (1–65535)
|
||
**and different from `RemotingPort`**." `StartupValidator` validates the GrpcPort
|
||
range but never compares it to `RemotingPort`. A site config that sets both ports to
|
||
the same value passes validation and then fails opaquely at runtime when Kestrel and
|
||
Akka.Remote both try to bind the port. The GrpcPort range check is also skipped
|
||
entirely when the key is absent (`grpcPortStr != null`), relying on the
|
||
`NodeOptions` default of 8083 — acceptable, but the equality rule is the missing
|
||
piece.
|
||
|
||
**Recommendation**
|
||
|
||
Add a check in the `role == "Site"` block: if `GrpcPort` (resolved, including the
|
||
8083 default) equals `RemotingPort`, add an error
|
||
`"ScadaLink:Node:GrpcPort must differ from RemotingPort"`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed against
|
||
`StartupValidator.cs`: the `role == "Site"` block validated the GrpcPort range but
|
||
never compared it to `RemotingPort`, so a site config setting both ports equal
|
||
passed validation and then failed opaquely when Kestrel and Akka.Remote both bound
|
||
the port. Fix: added `if (port == grpcPort) errors.Add("ScadaLink:Node:GrpcPort must
|
||
differ from RemotingPort")` in the Site block, using the resolved GrpcPort (including
|
||
the 8083 `NodeOptions` default when the key is absent) against the parsed
|
||
RemotingPort. Regression tests in `StartupValidatorTests`:
|
||
`Site_GrpcPortEqualsRemotingPort_FailsValidation`,
|
||
`Site_DefaultGrpcPortEqualsRemotingPort_FailsValidation` (default-8083 path), and
|
||
`Site_GrpcPortDiffersFromRemotingPort_PassesValidation` — all failed appropriately
|
||
before the fix and pass after. Full Host suite green (175 passed).
|
||
|
||
### Host-008 — `MachineDataDb` is validated and declared but never consumed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/StartupValidator.cs:33-34`, `src/ScadaLink.Host/DatabaseOptions.cs:6` |
|
||
|
||
**Description**
|
||
|
||
`StartupValidator` requires a non-empty `ScadaLink:Database:MachineDataDb` connection
|
||
string for Central nodes, and `DatabaseOptions` exposes a `MachineDataDb` property,
|
||
but a repo-wide search shows the value is never read anywhere outside the Host module
|
||
— only `ConfigurationDb` is passed to `AddConfigurationDatabase`
|
||
(`Program.cs:83-85`). The Host therefore fails startup if `MachineDataDb` is missing
|
||
even though nothing uses it. This is either dead configuration that should be removed
|
||
or a missing wiring (a machine-data DbContext that was never registered).
|
||
|
||
**Recommendation**
|
||
|
||
Determine whether a machine-data store is actually required. If yes, wire it into the
|
||
relevant component's DI registration. If no, remove the `MachineDataDb` validation
|
||
rule, the `DatabaseOptions` property, and the key from `appsettings.Central.json`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: a repo-wide search
|
||
shows `MachineDataDb` is read nowhere outside the Host module — only `ConfigurationDb`
|
||
is wired into `AddConfigurationDatabase` — yet `StartupValidator` failed Central
|
||
startup when it was absent. No machine-data store exists or is registered, so this is
|
||
dead configuration. Fix (the "if no" branch of the recommendation): removed the
|
||
`MachineDataDb` validation rule from `StartupValidator`, the `MachineDataDb` property
|
||
from `DatabaseOptions`, the `MachineDataDb` key from `appsettings.Central.json`, and
|
||
the now-pointless `MachineDataDb` reference from that file's `_secrets` note and from
|
||
the `CentralDbTestEnvironment` test helper. Regression test: the prior
|
||
`StartupValidatorTests.Central_MissingMachineDataDb_FailsValidation` was inverted to
|
||
`Central_MissingMachineDataDb_PassesValidation` (absence must now NOT fail startup),
|
||
and `MachineDataDb` was removed from the shared `ValidCentralConfig` fixture so all
|
||
Central tests prove the key is no longer required; it failed under the old code and
|
||
passes after. Full Host suite green (175 passed). Note: `appsettings.Central.json`
|
||
under `docker/central-node-*` is outside this task's edit scope and still carries the
|
||
key harmlessly — it will simply be ignored; a follow-up may remove it for tidiness.
|
||
|
||
### Host-009 — `StartAsync` reports success before role actors are confirmed running
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:127-141` |
|
||
|
||
**Description**
|
||
|
||
`StartAsync` creates actors with `ActorOf` (a fire-and-forget operation — the actor's
|
||
`PreStart` runs asynchronously on its own thread) and then returns
|
||
`Task.CompletedTask`. For site nodes, `grpcServer.SetReady(_actorSystem)` is called
|
||
synchronously at the end of `RegisterSiteActors`, marking the gRPC server ready even
|
||
though `SiteCommunicationActor`, the deployment-manager singleton, and the
|
||
`ClusterClient` may not yet have completed their `PreStart`/initial-contact handshake.
|
||
REQ-HOST-7 requires "Actor system and SiteStreamManager ... initialized before gRPC
|
||
begins accepting connections" — `SiteStreamManager.Initialize` is awaited-equivalent,
|
||
but the broader actor graph is not. The window is small and the gRPC server still
|
||
rejects streams until `SetReady`, so impact is limited, but readiness is being
|
||
asserted optimistically.
|
||
|
||
**Recommendation**
|
||
|
||
If strict ordering matters, gate `SetReady` on confirmation that
|
||
`SiteCommunicationActor` is fully initialized (e.g. an `Ask` round-trip or a
|
||
readiness message), or document explicitly that gRPC readiness only guarantees the
|
||
actor system exists, not that the cluster handshake has completed.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`) — re-triaged, documentation fix. Root cause
|
||
confirmed: `SetReady` is called optimistically after `ActorOf` registrations whose
|
||
`PreStart`/initial-contact handshakes run asynchronously. Re-triaged the first
|
||
recommended option (gating `SetReady` on a `SiteCommunicationActor` `Ask`
|
||
round-trip): that option is **incorrect** because the meaningful asynchronous step is
|
||
the `ClusterClient` initial-contact handshake with central — gating site gRPC
|
||
readiness on it would prevent a site from coming up and streaming locally while
|
||
central is briefly unreachable, contradicting the design's site-autonomy model.
|
||
Confirmed `SiteStreamGrpcServer` already rejects any stream opened before `SetReady`
|
||
with `StatusCode.Unavailable`, so the window is benign. Fix (the second, correct
|
||
option): the `SetReady` call site now carries an explicit comment documenting the
|
||
narrow readiness contract — by `SetReady` the actor system exists,
|
||
`SiteStreamManager.Initialize` has run, and every role actor has been created with
|
||
ordered synchronous `ActorOf` + registration `Tell`s; what is deliberately NOT
|
||
guaranteed is `PreStart` completion or the central handshake. No regression test is
|
||
meaningful for a documentation-only clarification of an intentional design contract;
|
||
the existing `HostStartupTests.SiteRole_*` cover the surrounding startup path. Full
|
||
Host suite green (175 passed).
|
||
|
||
### Host-010 — No retry/backoff around startup preconditions (DB migration, readiness)
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/Program.cs:112-125` |
|
||
|
||
**Description**
|
||
|
||
On Central startup the Host opens a DI scope and calls
|
||
`MigrationHelper.ApplyOrValidateMigrationsAsync` directly. If the SQL Server is not
|
||
yet reachable (common in container orchestration where the DB and app start
|
||
together), the call throws, the top-level `catch` logs `Fatal`, and the process
|
||
exits. There is no bounded retry/backoff to tolerate a database that is briefly
|
||
unavailable at boot. The design intent (REQ-HOST-4a, readiness gating, `503` until
|
||
ready) is about *serving traffic*, but the migration step happens before the host
|
||
even runs and has no such tolerance.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap the migration/validation step in a bounded retry with exponential backoff (e.g.
|
||
Polly), or move schema apply behind the readiness gate so the process stays up and
|
||
reports `503` until the database becomes reachable.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: Central startup called
|
||
`MigrationHelper.ApplyOrValidateMigrationsAsync` directly with no tolerance for a
|
||
database that is briefly unreachable at boot — a first connection failure threw, the
|
||
top-level `catch` logged `Fatal`, and the process exited. Fix: added a new
|
||
`StartupRetry.ExecuteWithRetryAsync` helper (bounded retry with capped exponential
|
||
backoff — no Polly dependency, keeping the Host's package set minimal) and wrapped
|
||
the migration step in it (`maxAttempts: 8`, `initialDelay: 2s`, doubling, capped at
|
||
30s). Each retry opens a fresh DI scope so a transient connection fault does not
|
||
leave a poisoned `DbContext`; once attempts are exhausted the last exception
|
||
propagates and the existing fatal-log path still fires. Regression tests in new
|
||
`StartupRetryTests`: `ExecuteWithRetry_SucceedsFirstTry_RunsOnce`,
|
||
`ExecuteWithRetry_TransientFailures_RetriesUntilSuccess` (fails twice then succeeds),
|
||
and `ExecuteWithRetry_ExhaustsAttempts_RethrowsLastException` (verifies the original
|
||
exception type/message surfaces after exhaustion) — they fail to compile/pass against
|
||
the pre-fix code (the helper did not exist) and pass after. Full Host suite green
|
||
(175 passed).
|
||
|
||
### Host-011 — `LoggingOptions.MinimumLevel` is dead configuration
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.Host/LoggingOptions.cs:5`, `src/ScadaLink.Host/Program.cs:42-50` |
|
||
|
||
**Description**
|
||
|
||
`LoggingOptions` exposes a `MinimumLevel` property bound from `ScadaLink:Logging`
|
||
(`SiteServiceRegistration.BindSharedOptions`), and both `appsettings.Central.json`
|
||
and `appsettings.Site.json` set `"Logging": { "MinimumLevel": "Information" }`.
|
||
However Serilog is configured purely via `ReadFrom.Configuration(configuration)`,
|
||
which reads the standard `Serilog` section — not `ScadaLink:Logging`. The
|
||
`LoggingOptions.MinimumLevel` value is never read by any code, so changing it has no
|
||
effect. This is misleading: an operator editing `ScadaLink:Logging:MinimumLevel`
|
||
expecting a log-level change will see nothing happen.
|
||
|
||
**Recommendation**
|
||
|
||
Either consume `LoggingOptions.MinimumLevel` when configuring the Serilog
|
||
`LoggerConfiguration` (e.g. set `MinimumLevel.Is(...)` from it), or remove the option
|
||
class and the `ScadaLink:Logging` sections and rely solely on the `Serilog`
|
||
configuration section. Keep one mechanism, not two.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: `LoggingOptions.MinimumLevel`
|
||
was bound from `ScadaLink:Logging` but Serilog was configured purely via
|
||
`ReadFrom.Configuration` (standard `Serilog` section), so editing
|
||
`ScadaLink:Logging:MinimumLevel` had no effect. Re-triaged the two options: the
|
||
"remove the option class" branch was rejected because `Component-Host.md` REQ-HOST-3
|
||
explicitly lists `ScadaLink:Logging` → `LoggingOptions` and that design doc is outside
|
||
this task's edit scope — removing the class would create a fresh code-vs-doc drift.
|
||
Fix (the "consume it" branch): added a `LoggerConfigurationFactory.Build` that binds
|
||
`LoggingOptions`, parses `MinimumLevel` into a Serilog `LogEventLevel` (falling back
|
||
to `Information` for null/blank/unrecognised values), and pins it via
|
||
`MinimumLevel.Is(...)` on top of `ReadFrom.Configuration`; `Program.cs` now builds the
|
||
logger through this factory. `ScadaLink:Logging:MinimumLevel` is now live.
|
||
Regression tests in new `LoggerConfigurationTests`:
|
||
`MinimumLevel_Warning_SuppressesInformationLogs`,
|
||
`MinimumLevel_Debug_AllowsDebugLogs`, and `MinimumLevel_Absent_DefaultsToInformation`
|
||
— they assert the configured level actually filters log events; they pass against the
|
||
new factory (and would fail against the old inline configuration, which ignored the
|
||
key). Full Host suite green (175 passed).
|
||
|
||
### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` |
|
||
|
||
**Description**
|
||
|
||
`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with
|
||
`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions`
|
||
(`src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a
|
||
`DownIfAlone` property — bound from `ScadaLink:Cluster` via the Options pattern,
|
||
documented as "the design-doc requirement", default `true` — but a repo-wide search
|
||
shows it is referenced **nowhere outside its own declaration**. The Host therefore
|
||
ignores the bound value entirely: setting `ScadaLink:Cluster:DownIfAlone` to `false`
|
||
in `appsettings.json` has no effect, the resolver still runs with `down-if-alone =
|
||
on`. This is the same class of defect as the resolved Host-011
|
||
(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is
|
||
declared, bound, and documented but never consumed, which silently misleads any
|
||
operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
|
||
hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the
|
||
ClusterInfrastructure project owns only the configuration contract, per the
|
||
`ClusterOptions` class comment).
|
||
|
||
**Recommendation**
|
||
|
||
Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone =
|
||
{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping
|
||
is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce
|
||
the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed
|
||
at `on` for ScadaLink's two-node clusters, remove the `DownIfAlone` property and its
|
||
doc comment instead so code and configuration contract agree — but do not leave it
|
||
declared-but-dead.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |
|
||
|
||
**Description**
|
||
|
||
`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0` —
|
||
`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`,
|
||
`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The
|
||
`F0` format specifier rounds to whole seconds, so any sub-second configuration is
|
||
silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as
|
||
`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or
|
||
treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole
|
||
seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan`
|
||
values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with
|
||
no warning — or, worse, `0s`. Rounding configured timing values without surfacing
|
||
the change is a correctness hazard for exactly the kind of failure-detection tuning
|
||
these options exist for.
|
||
|
||
**Recommendation**
|
||
|
||
Render durations without precision loss — emit milliseconds (e.g.
|
||
`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or
|
||
validate in `StartupValidator` that each cluster timing value is a positive whole
|
||
number of seconds and fail fast otherwise. Either way, do not silently round.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` |
|
||
|
||
**Description**
|
||
|
||
REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
|
||
setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls
|
||
`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog`
|
||
configuration section — but neither `appsettings.json` nor either role-specific file
|
||
contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel`
|
||
map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that
|
||
actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and
|
||
`.WriteTo.File("logs/scadalink-.log", rollingInterval: Day)` calls. As a result the
|
||
console output template, the file path, and the rolling interval cannot be changed
|
||
without recompiling — an operator cannot redirect logs, change the file location, or
|
||
add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration`
|
||
call is effectively a no-op because the section it reads does not exist.
|
||
|
||
**Recommendation**
|
||
|
||
Move the console and file sink definitions into a `Serilog` section in
|
||
`appsettings.json` (with `WriteTo` entries and the output template / file path /
|
||
rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the
|
||
hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to
|
||
state the sinks are intentionally code-defined — but the design doc currently says
|
||
"configuration-driven", so code and doc must be reconciled.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### Host-015 — `StartupRetry` retries on every exception type, including permanent failures
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Error handling & resilience |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` |
|
||
|
||
**Description**
|
||
|
||
`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard
|
||
`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller
|
||
wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which
|
||
on a Central node in production *validates* the schema version and throws if the
|
||
deployed schema does not match the expected migration set. A schema-version mismatch
|
||
is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it
|
||
8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes)
|
||
before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault
|
||
that is fatal on the first attempt. The retry helper is meant for *transient*
|
||
connection faults (the XML doc says exactly that: "the database may be briefly
|
||
unreachable"), but it cannot distinguish transient from permanent failures.
|
||
|
||
**Recommendation**
|
||
|
||
Restrict retries to transient faults — e.g. accept an `Func<Exception, bool>
|
||
isTransient` predicate and, for the migration call site, treat only
|
||
connection-class exceptions (`SqlException` with a connection/transport error
|
||
number, `TimeoutException`, socket errors) as retryable while letting
|
||
schema-validation / `InvalidOperationException` failures propagate immediately. Add
|
||
a `StartupRetryTests` case asserting a non-transient exception is rethrown after a
|
||
single attempt.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|