Files
ScadaBridge/code-reviews/Host/findings.md
T
Joseph Doherty 7b0b9c7365 refactor: rename ScadaLink → ZB.MOM.WW.ScadaBridge (code + projects + namespaces)
Solution + 23 src projects + 26 test projects renamed; folders, csproj,
namespaces, and ScadaLinkDbContext/ScadaBridgeDbContext class updated.
ActorSystem "scadalink" → "scadabridge", Akka seed-node URLs migrated.
SQL roles/logins, LDAP domains, CLI command name, and CLI config dir
(~/.scadalink → ~/.scadabridge) also renamed.

Build green; 5 Host.Tests fail awaiting SQL login rename in next commit.
Pre-existing StaleTagMonitor timing flakes unchanged.

Rename script committed at tools/rename-to-scadabridge.sh.
2026-05-28 09:37:45 -04:00

1121 lines
66 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — Host
| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.Host` |
| Design doc | `docs/requirements/Component-Host.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 0 |
## Summary
The Host module is the composition root for the entire ScadaBridge system: a single
binary whose behaviour (`Central` vs `Site`) is driven entirely by configuration. The
implementation is generally faithful to `Component-Host.md` — startup validation,
role-based registration, Serilog enrichment, Windows Service support, dead-letter
monitoring, CoordinatedShutdown, and gRPC hosting on site nodes are all present and
backed by a solid test suite (`tests/ZB.MOM.WW.ScadaBridge.Host.Tests`).
The most significant problem is the readiness endpoint: `/health/ready` runs **all**
registered health checks, including the leader-only `active-node` check, so a fully
operational *standby* central node permanently reports `503` on `/health/ready`
directly contradicting REQ-HOST-4a, which defines readiness as cluster membership +
DB connectivity (not leadership). Several other findings concern configuration that
is validated-but-never-consumed (`MachineDataDb`), design-doc drift (Akka.Persistence
is required by REQ-HOST-6 but the system uses no persistent actors), an incorrect
seed-node entry in the shipped site config, blocking sync-over-async during startup,
and unguarded string interpolation when building HOCON. None are crash/data-loss
class, but the readiness bug is High because it breaks load-balancer behaviour with
no safe workaround.
#### Re-review 2026-05-17 (commit `39d737e`)
All eleven findings from the first review (Host-001..011) are confirmed `Resolved`
in the current tree — the `/health/ready` predicate, the externalised secrets, the
seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
migration retry and the live `LoggingOptions.MinimumLevel` are all present as
described in their Resolution notes. This re-review walked all ten checklist
categories again over the full module and recorded four new findings, none of them
crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes
`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so
that documented, bound option is dead. The remaining three are Low: `:F0` rounding
of cluster timing values silently degrades any sub-second configuration (Host-013),
Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
exception type including permanent schema-validation failures (Host-015).
#### Re-review 2026-05-28 (commit `1eb6e97`)
All fifteen prior findings (Host-001..015) remain `Resolved` in the current tree
and the regressions introduced for them — Host-001's predicate, the externalised
secrets, the Site GrpcPort/RemotingPort/seed-port validation rules, the escaped
HOCON builder with `DownIfAlone` and millisecond-precision durations, the
configuration-driven Serilog sinks, the transient-only `StartupRetry`
classifier — are all still in place. This re-review walked the ten checklist
categories over the full module again and recorded seven new findings, none of
them crash/data-loss class. Host-016 (Medium) mirrors the resolved Host-004
shipped-config bug on the **Communication** side: `appsettings.Site.json`'s
second `CentralContactPoints` entry points at the site's own remoting port
(`localhost:8082`) instead of central, an incorrect dev example that copies
into multi-central deployments. Host-017 (Medium) flags a partial REQ-HOST-7
implementation — the documented site-shutdown ordering (stop accepting streams
first, cancel active streams via `IHostApplicationLifetime.ApplicationStopping`,
then tear down actors) is not wired: the site path registers no
`ApplicationStopping` handler that signals `SiteStreamGrpcServer`, and the gRPC
server exposes no cancel-all-streams entry point. The remaining five are Low:
`NodeOptions.NodeName` (the operator-configured value stamped on
`AuditLog.SourceNode`) is absent from both shipped per-role configs even though
the docker per-node configs set it (Host-018); the migration `StartupRetry`
call passes `default` for `CancellationToken`, so a SIGTERM during the
bounded-retry window is ignored for up to ~2 minutes (Host-019);
`LoggerConfigurationFactory` layers `MinimumLevel.Is` over
`ReadFrom.Configuration`, so any `Serilog:MinimumLevel` an operator sets is
silently overridden by `ScadaBridge:Logging:MinimumLevel` (Host-020); the
shipped `appsettings.json` carries a Microsoft `Logging:LogLevel` block but
Serilog is the only logger provider and the section is dead config (Host-021);
and `ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g.
a typo) and falls back to `Information` with no warning (Host-022).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Re-review: `appsettings.Site.json` second `CentralContactPoints` entry targets the site's own remoting port instead of central (Host-016) — same defect class as the resolved Host-004 seed-list bug. |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping, role-scoped site singletons, ClusterClient initial-contact wiring all reviewed; no new issues. |
| 3 | Concurrency & thread safety | ☑ | `_trackedDisposables` is locked on both sides of the lifecycle; `_actorSystem` publication is safe via the IHost startup `await` boundary. New Low: `StartupRetry` migration call passes `default` `CancellationToken`, so SIGTERM during the retry window is ignored (Host-019). |
| 4 | Error handling & resilience | ☑ | `IsTransientDatabaseFault` correctly classifies socket / timeout / SqlException; the retry helper itself remains sound. Host-019 is the resilience gap. |
| 5 | Security | ☑ | Secrets stay externalised; the `_secrets` placeholder comment is intact. No new issues. |
| 6 | Performance & resource management | ☑ | No new undisposed resources; gRPC stream lifetime cap remains correct. No new issues. |
| 7 | Design-document adherence | ☑ | Re-review: REQ-HOST-7 site-shutdown ordering — stop accepting new streams, cancel active streams via `ApplicationStopping`, then tear down actors — is not wired in `Program.cs` (Host-017). |
| 8 | Code organization & conventions | ☑ | Re-review: `NodeOptions.NodeName` is absent from the shipped per-role configs even though it stamps `AuditLog.SourceNode` (Host-018); the appsettings `Logging:LogLevel` Microsoft section is dead config under Serilog (Host-021). |
| 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the unused `CancellationToken` parameter (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). |
| 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaBridge:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). |
## Findings
### Host-001 — `/health/ready` includes the leader-only `active-node` check
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:135-145` |
**Description**
`/health/ready` is mapped with `MapHealthChecks("/health/ready", ...)` and **no
`Predicate`**, so it executes every registered check: `database`, `akka-cluster`
*and* `active-node`. `ActiveNodeHealthCheck` (`Health/ActiveNodeHealthCheck.cs:38`)
returns `Unhealthy` on any node that is not the cluster leader. As a result a
standby central node that is fully operational (cluster member `Up`, database
reachable) still returns `503` on `/health/ready`. This contradicts REQ-HOST-4a,
which defines readiness as cluster membership + DB connectivity + singletons —
explicitly *not* leadership. `/health/active` is the endpoint intended to report
leadership. A load balancer using `/health/ready` to decide whether a node may
serve traffic will permanently treat the standby as unready, defeating failover
readiness. No test covers this: `HealthCheckTests.HealthReady_Endpoint_ReturnsResponse`
only asserts a response is returned, not the standby semantics.
**Recommendation**
Add a `Predicate` to the `/health/ready` mapping that excludes the `active-node`
check, e.g. `Predicate = check => check.Name != "active-node"` (or tag the readiness
checks and filter by tag). Add a regression test asserting a non-leader node returns
`200` on `/health/ready`.
**Resolution**
Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against
`Program.cs`: the `/health/ready` mapping had no `Predicate`, so it executed all
three registered checks including the leader-only `active-node` check, while
`ActiveNodeHealthCheck` returns `Unhealthy` on any non-leader node — making a fully
operational standby central node permanently report `503`. Fix: added
`Predicate = check => check.Name != "active-node"` to the `/health/ready`
`HealthCheckOptions`, so readiness now reflects cluster membership + DB connectivity
only (REQ-HOST-4a); leadership remains reported solely by `/health/active`.
Regression test `HealthCheckTests.HealthReady_Endpoint_ExcludesActiveNodeCheck`
asserts the `active-node` check name does not appear in the `/health/ready`
response body; it failed before the fix and passes after. Full Host suite green
(156 passed).
### Host-002 — Akka.Persistence required by REQ-HOST-6 is not configured and not used
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:70-108`, `docs/requirements/Component-Host.md` REQ-HOST-6 |
**Description**
REQ-HOST-6 states the Host "must configure the Akka.NET actor system using
Akka.Hosting with ... **Persistence**: Configured with the appropriate journal and
snapshot store (SQL for central, SQLite for site)." The HOCON built in
`AkkaHostedService.StartAsync` contains no `akka.persistence` section, no journal and
no snapshot-store plugin, and `ZB.MOM.WW.ScadaBridge.Host.csproj` references neither
`Akka.Persistence.Hosting` nor any persistence plugin (the design doc Dependencies
list `Akka.Persistence.Hosting`). A repo-wide search finds **no** `PersistentActor` /
`ReceivePersistentActor` subclasses — the system deliberately uses custom SQLite
storage services instead. The code is internally consistent, but the design document
is stale: it mandates a subsystem that does not exist. This is a documented-vs-actual
drift that will mislead future maintainers and any audit against REQ-HOST-6.
**Recommendation**
Update `Component-Host.md` REQ-HOST-6 and the Dependencies list to remove the
Akka.Persistence requirement (or explicitly state persistence is provided by
component-owned SQLite storage, not Akka.Persistence). If persistence *is* intended,
add the plugin packages and HOCON. Either way, code and doc must agree.
**Resolution**
Resolved 2026-05-16 (commit `<pending>`). Confirmed accurate: a repo-wide search
finds **no** `PersistentActor` / `ReceivePersistentActor` subclasses anywhere in
`src/`, no `akka.persistence` section in the HOCON built by
`AkkaHostedService.StartAsync`, and `ZB.MOM.WW.ScadaBridge.Host.csproj` references no
persistence plugin packages. The Host code is correct and internally consistent;
the defect was a stale design doc. Fix: updated `docs/requirements/Component-Host.md`
— REQ-HOST-6 no longer lists Persistence as a configured subsystem and now carries
an explicit Persistence note stating that durable state is owned by individual
components and persisted through component-owned stores (SQLite at sites, MS SQL
centrally), with no Akka journal/snapshot store and no `PersistentActor` subclasses
by design. The Responsibilities line and the Dependencies entry
(`Akka.Persistence.Hosting` removed) were corrected to match. Code and doc now
agree; no code or test change required.
### Host-003 — Secrets committed in plaintext in `appsettings.Central.json`
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json:20-31` |
**Description**
`appsettings.Central.json` contains real-looking secrets in plaintext, checked into
source control: SQL Server passwords in the `ConfigurationDb` / `MachineDataDb`
connection strings (`Password=ScadaBridge_Dev1#`), an LDAP service-account password
(`LdapServiceAccountPassword: "password"`), and a JWT signing key
(`JwtSigningKey: "scadabridge-dev-jwt-signing-key-..."`). Even though these are
intended as development defaults, shipping them in the default config invites them
being reused verbatim in production, and a committed JWT signing key allows anyone
with repo access to forge session tokens. `TrustServerCertificate=true` additionally
disables TLS validation for the SQL connection.
**Recommendation**
Move all secrets out of committed `appsettings*.json` into environment variables,
user-secrets, or a secret store. Keep only non-sensitive structural defaults in the
file and document the required environment variables. At minimum add a clear comment
that these values are dev-only and must be overridden, and rotate the JWT key per
environment.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Root cause confirmed against
`appsettings.Central.json`: the committed file carried real-looking secrets in
plaintext — SQL Server passwords (`Password=ScadaBridge_Dev1#`) in both connection
strings, an LDAP service-account password, and a JWT signing key. Fix: all four
secrets were removed from the committed file and replaced with non-functional
`${...}` placeholder references (`ConfigurationDb` / `MachineDataDb`,
`LdapServiceAccountPassword`, `JwtSigningKey`). A new top-level `_secrets` note
documents that the Host's configuration builder (`AddEnvironmentVariables()`)
overlays the real values supplied via environment variables
(`ScadaBridge__Database__ConfigurationDb`, `ScadaBridge__Database__MachineDataDb`,
`ScadaBridge__Security__LdapServiceAccountPassword`,
`ScadaBridge__Security__JwtSigningKey`); the placeholders are intentionally invalid so
a misconfigured deployment fails loudly rather than silently using a committed key.
Regression test class `ConfigSecretsTests` asserts the committed file carries no
plaintext `Password=` value, no committed LDAP service-account password, and no
committed JWT signing key; all three tests failed before the fix and pass after.
Tests that drive the full `Program` startup pipeline against the real SQL provider
(`HealthCheckTests`, `HostStartupTests.CentralRole_StartsWithoutError`) were adapted
to supply the local dev connection strings themselves via the new
`CentralDbTestEnvironment` test helper (environment variables) — they must no longer
depend on committed secrets. Note: the `docker/central-node-*/appsettings.Central.json`
files still contain the same dev secrets but lie outside this task's permitted edit
scope; they should receive the same treatment in a follow-up.
### Host-004 — Site seed-node list points at the gRPC port, not a remoting port
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:10-19` |
**Description**
The shipped site config sets `Node:RemotingPort = 8082` and `Node:GrpcPort = 8083`,
but `Cluster:SeedNodes` is `["akka.tcp://scadabridge@localhost:8082",
"akka.tcp://scadabridge@localhost:8083"]`. The second seed node targets `8083`, which
is the Kestrel HTTP/2 gRPC port — not an Akka remoting endpoint. A node attempting to
join via that seed will try to establish an Akka.Remote TCP association against the
gRPC listener and fail. `StartupValidator` only checks that ≥2 seed nodes exist
(`StartupValidator.cs:54-56`), so this misconfiguration passes validation silently.
For the single-node dev site it is harmless (the first seed succeeds), but it is an
incorrect example that will be copied into multi-node site configs.
**Recommendation**
Correct the site seed-node list to reference the two site nodes' *remoting* ports
(e.g. `8082` and `8084`), never the gRPC port. Consider extending `StartupValidator`
to reject a seed node whose port equals this node's `GrpcPort`.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Root cause confirmed against
`appsettings.Site.json`: with `Node:RemotingPort = 8082` and `Node:GrpcPort = 8083`,
the second `Cluster:SeedNodes` entry was `akka.tcp://scadabridge@localhost:8083` — the
Kestrel HTTP/2 gRPC port, not an Akka.Remote endpoint. `StartupValidator` only
checked seed-node *count* (≥2), so the misconfiguration passed silently. Fix, two
parts: (1) the shipped `appsettings.Site.json` second seed entry was corrected to a
remoting port (`localhost:8084`); (2) `StartupValidator.Validate` was extended — for
`Site` nodes it now parses each seed node's TCP port (via a new `SeedNodePort`
helper) and rejects any entry whose port equals the node's `GrpcPort`, using the
resolved GrpcPort including the `8083` `NodeOptions` default when the key is absent.
The seed-node-count check was hoisted above the Site block so the new rule can reuse
the parsed list. Regression tests in `StartupValidatorTests`:
`Site_SeedNodeOnGrpcPort_FailsValidation`,
`Site_SeedNodeOnDefaultGrpcPort_FailsValidation` (default-8083 path),
`Site_SeedNodesOnRemotingPort_PassesValidation`, and
`Central_SeedNodeOnPort8083_PassesValidation` (rule is Site-only) — all failed
appropriately before the fix and pass after.
### Host-005 — Blocking sync-over-async (`GetAwaiter().GetResult()`) inside `StartAsync`
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:345` |
**Description**
`RegisterSiteActors` calls `storeAndForwardService.StartAsync().GetAwaiter().GetResult()`
synchronously, blocking inside the `IHostedService.StartAsync` path. `StartAsync` is
itself declared synchronous (returns `Task.CompletedTask`), so the work cannot be
awaited cleanly. Blocking on async work risks thread-pool starvation during startup
and, if the awaited operation captures a synchronization context, deadlock. It also
hides exceptions behind an `AggregateException` wrapper.
**Recommendation**
Make `AkkaHostedService.StartAsync` genuinely `async` and `await
storeAndForwardService.StartAsync(cancellationToken)`. Propagate the
`CancellationToken` and let exceptions surface as the original type.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: `RegisterSiteActors`
called `storeAndForwardService.StartAsync().GetAwaiter().GetResult()` synchronously
inside the `IHostedService.StartAsync` path, blocking a startup thread and wrapping
any failure in `AggregateException`. Fix: `AkkaHostedService.StartAsync` is now
genuinely `async Task`; the site path was renamed `RegisterSiteActorsAsync` and made
`async`; the store-and-forward init is now `await storeAndForwardService.StartAsync()`
(the service's own signature, in another module, takes no `CancellationToken`, so the
token is honoured via a `cancellationToken.ThrowIfCancellationRequested()` checkpoint
immediately before the await). No blocking, no sync-context deadlock risk, and the
original exception type now surfaces. No dedicated regression test: the only
observable change is non-blocking/exception-unwrapping behaviour on a path that
requires the full site DI graph plus a real SQLite store; the existing
`HostStartupTests.SiteRole_*` and `CompositionRootTests` continue to exercise the
async startup pipeline and remain green (175 passed).
### Host-006 — HOCON assembled by unescaped string interpolation
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:70-108` |
**Description**
The Akka HOCON is built with an interpolated string that injects
`_nodeOptions.NodeHostname`, `_clusterOptions.SeedNodes`, the computed roles, and
`SplitBrainResolverStrategy` directly into the configuration text. Values are not
escaped. A hostname or seed-node string containing a quote, backslash, brace, or
comment sequence would corrupt the HOCON and produce a confusing parse error far from
the real cause; `SplitBrainResolverStrategy` is interpolated without quoting, so a
value with whitespace breaks the document. Building cluster configuration from raw
string concatenation is also harder to maintain than the typed Akka.Hosting builder
the design doc (REQ-HOST-6) actually calls for ("via Akka.Hosting").
**Recommendation**
Prefer the `Akka.Hosting` `AddAkka(...)` builder with strongly-typed `WithRemoting`,
`WithClustering`, and split-brain-resolver configuration instead of hand-built HOCON.
If HOCON must be retained, validate/escape interpolated values (especially hostname
and seed nodes) before substitution.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: the Akka HOCON was
built by an interpolated string injecting `NodeHostname`, `SeedNodes`, roles and
`SplitBrainResolverStrategy` directly into the document text with no escaping — a
value containing a quote, backslash or whitespace would corrupt or misparse the
document. Re-triaged the recommendation: switching wholesale to the typed
`Akka.Hosting` builder is a larger refactor than this Low finding warrants and would
touch how the whole cluster is bootstrapped, so the HOCON form was retained but made
safe. Fix: the HOCON construction was extracted into a testable static
`AkkaHostedService.BuildHocon`, and every interpolated *string* value is now routed
through a `QuoteHocon` helper that wraps the value in a double-quoted HOCON literal
and escapes embedded backslashes and quotes — `NodeHostname`, every seed node, every
role, and `SplitBrainResolverStrategy` (the last was previously unquoted entirely, so
whitespace alone would break it). Regression tests in new `HoconBuilderTests`:
`BuildHocon_HostnameWithQuote_DoesNotCorruptDocument` (a hostname with a `"` still
parses and following keys stay intact), `BuildHocon_StrategyWithWhitespace_RemainsQuoted`,
and `BuildHocon_PlainValues_ParsesAndPreservesValues`. With the old raw interpolation
a quote-containing hostname produced `hostname = "evil"host"`, closing the literal
early and corrupting following keys; the escaped form parses correctly. Full Host
suite green (175 passed).
### Host-007 — REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:43-47` |
**Description**
REQ-HOST-4 requires: "Site nodes must have `GrpcPort` in valid port range (165535)
**and different from `RemotingPort`**." `StartupValidator` validates the GrpcPort
range but never compares it to `RemotingPort`. A site config that sets both ports to
the same value passes validation and then fails opaquely at runtime when Kestrel and
Akka.Remote both try to bind the port. The GrpcPort range check is also skipped
entirely when the key is absent (`grpcPortStr != null`), relying on the
`NodeOptions` default of 8083 — acceptable, but the equality rule is the missing
piece.
**Recommendation**
Add a check in the `role == "Site"` block: if `GrpcPort` (resolved, including the
8083 default) equals `RemotingPort`, add an error
`"ScadaBridge:Node:GrpcPort must differ from RemotingPort"`.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Root cause confirmed against
`StartupValidator.cs`: the `role == "Site"` block validated the GrpcPort range but
never compared it to `RemotingPort`, so a site config setting both ports equal
passed validation and then failed opaquely when Kestrel and Akka.Remote both bound
the port. Fix: added `if (port == grpcPort) errors.Add("ScadaBridge:Node:GrpcPort must
differ from RemotingPort")` in the Site block, using the resolved GrpcPort (including
the 8083 `NodeOptions` default when the key is absent) against the parsed
RemotingPort. Regression tests in `StartupValidatorTests`:
`Site_GrpcPortEqualsRemotingPort_FailsValidation`,
`Site_DefaultGrpcPortEqualsRemotingPort_FailsValidation` (default-8083 path), and
`Site_GrpcPortDiffersFromRemotingPort_PassesValidation` — all failed appropriately
before the fix and pass after. Full Host suite green (175 passed).
### Host-008 — `MachineDataDb` is validated and declared but never consumed
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:33-34`, `src/ZB.MOM.WW.ScadaBridge.Host/DatabaseOptions.cs:6` |
**Description**
`StartupValidator` requires a non-empty `ScadaBridge:Database:MachineDataDb` connection
string for Central nodes, and `DatabaseOptions` exposes a `MachineDataDb` property,
but a repo-wide search shows the value is never read anywhere outside the Host module
— only `ConfigurationDb` is passed to `AddConfigurationDatabase`
(`Program.cs:83-85`). The Host therefore fails startup if `MachineDataDb` is missing
even though nothing uses it. This is either dead configuration that should be removed
or a missing wiring (a machine-data DbContext that was never registered).
**Recommendation**
Determine whether a machine-data store is actually required. If yes, wire it into the
relevant component's DI registration. If no, remove the `MachineDataDb` validation
rule, the `DatabaseOptions` property, and the key from `appsettings.Central.json`.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: a repo-wide search
shows `MachineDataDb` is read nowhere outside the Host module — only `ConfigurationDb`
is wired into `AddConfigurationDatabase` — yet `StartupValidator` failed Central
startup when it was absent. No machine-data store exists or is registered, so this is
dead configuration. Fix (the "if no" branch of the recommendation): removed the
`MachineDataDb` validation rule from `StartupValidator`, the `MachineDataDb` property
from `DatabaseOptions`, the `MachineDataDb` key from `appsettings.Central.json`, and
the now-pointless `MachineDataDb` reference from that file's `_secrets` note and from
the `CentralDbTestEnvironment` test helper. Regression test: the prior
`StartupValidatorTests.Central_MissingMachineDataDb_FailsValidation` was inverted to
`Central_MissingMachineDataDb_PassesValidation` (absence must now NOT fail startup),
and `MachineDataDb` was removed from the shared `ValidCentralConfig` fixture so all
Central tests prove the key is no longer required; it failed under the old code and
passes after. Full Host suite green (175 passed). Note: `appsettings.Central.json`
under `docker/central-node-*` is outside this task's edit scope and still carries the
key harmlessly — it will simply be ignored; a follow-up may remove it for tidiness.
### Host-009 — `StartAsync` reports success before role actors are confirmed running
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:127-141` |
**Description**
`StartAsync` creates actors with `ActorOf` (a fire-and-forget operation — the actor's
`PreStart` runs asynchronously on its own thread) and then returns
`Task.CompletedTask`. For site nodes, `grpcServer.SetReady(_actorSystem)` is called
synchronously at the end of `RegisterSiteActors`, marking the gRPC server ready even
though `SiteCommunicationActor`, the deployment-manager singleton, and the
`ClusterClient` may not yet have completed their `PreStart`/initial-contact handshake.
REQ-HOST-7 requires "Actor system and SiteStreamManager ... initialized before gRPC
begins accepting connections" — `SiteStreamManager.Initialize` is awaited-equivalent,
but the broader actor graph is not. The window is small and the gRPC server still
rejects streams until `SetReady`, so impact is limited, but readiness is being
asserted optimistically.
**Recommendation**
If strict ordering matters, gate `SetReady` on confirmation that
`SiteCommunicationActor` is fully initialized (e.g. an `Ask` round-trip or a
readiness message), or document explicitly that gRPC readiness only guarantees the
actor system exists, not that the cluster handshake has completed.
**Resolution**
Resolved 2026-05-16 (commit `pending`) — re-triaged, documentation fix. Root cause
confirmed: `SetReady` is called optimistically after `ActorOf` registrations whose
`PreStart`/initial-contact handshakes run asynchronously. Re-triaged the first
recommended option (gating `SetReady` on a `SiteCommunicationActor` `Ask`
round-trip): that option is **incorrect** because the meaningful asynchronous step is
the `ClusterClient` initial-contact handshake with central — gating site gRPC
readiness on it would prevent a site from coming up and streaming locally while
central is briefly unreachable, contradicting the design's site-autonomy model.
Confirmed `SiteStreamGrpcServer` already rejects any stream opened before `SetReady`
with `StatusCode.Unavailable`, so the window is benign. Fix (the second, correct
option): the `SetReady` call site now carries an explicit comment documenting the
narrow readiness contract — by `SetReady` the actor system exists,
`SiteStreamManager.Initialize` has run, and every role actor has been created with
ordered synchronous `ActorOf` + registration `Tell`s; what is deliberately NOT
guaranteed is `PreStart` completion or the central handshake. No regression test is
meaningful for a documentation-only clarification of an intentional design contract;
the existing `HostStartupTests.SiteRole_*` cover the surrounding startup path. Full
Host suite green (175 passed).
### Host-010 — No retry/backoff around startup preconditions (DB migration, readiness)
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:112-125` |
**Description**
On Central startup the Host opens a DI scope and calls
`MigrationHelper.ApplyOrValidateMigrationsAsync` directly. If the SQL Server is not
yet reachable (common in container orchestration where the DB and app start
together), the call throws, the top-level `catch` logs `Fatal`, and the process
exits. There is no bounded retry/backoff to tolerate a database that is briefly
unavailable at boot. The design intent (REQ-HOST-4a, readiness gating, `503` until
ready) is about *serving traffic*, but the migration step happens before the host
even runs and has no such tolerance.
**Recommendation**
Wrap the migration/validation step in a bounded retry with exponential backoff (e.g.
Polly), or move schema apply behind the readiness gate so the process stays up and
reports `503` until the database becomes reachable.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: Central startup called
`MigrationHelper.ApplyOrValidateMigrationsAsync` directly with no tolerance for a
database that is briefly unreachable at boot — a first connection failure threw, the
top-level `catch` logged `Fatal`, and the process exited. Fix: added a new
`StartupRetry.ExecuteWithRetryAsync` helper (bounded retry with capped exponential
backoff — no Polly dependency, keeping the Host's package set minimal) and wrapped
the migration step in it (`maxAttempts: 8`, `initialDelay: 2s`, doubling, capped at
30s). Each retry opens a fresh DI scope so a transient connection fault does not
leave a poisoned `DbContext`; once attempts are exhausted the last exception
propagates and the existing fatal-log path still fires. Regression tests in new
`StartupRetryTests`: `ExecuteWithRetry_SucceedsFirstTry_RunsOnce`,
`ExecuteWithRetry_TransientFailures_RetriesUntilSuccess` (fails twice then succeeds),
and `ExecuteWithRetry_ExhaustsAttempts_RethrowsLastException` (verifies the original
exception type/message surfaces after exhaustion) — they fail to compile/pass against
the pre-fix code (the helper did not exist) and pass after. Full Host suite green
(175 passed).
### Host-011 — `LoggingOptions.MinimumLevel` is dead configuration
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/LoggingOptions.cs:5`, `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:42-50` |
**Description**
`LoggingOptions` exposes a `MinimumLevel` property bound from `ScadaBridge:Logging`
(`SiteServiceRegistration.BindSharedOptions`), and both `appsettings.Central.json`
and `appsettings.Site.json` set `"Logging": { "MinimumLevel": "Information" }`.
However Serilog is configured purely via `ReadFrom.Configuration(configuration)`,
which reads the standard `Serilog` section — not `ScadaBridge:Logging`. The
`LoggingOptions.MinimumLevel` value is never read by any code, so changing it has no
effect. This is misleading: an operator editing `ScadaBridge:Logging:MinimumLevel`
expecting a log-level change will see nothing happen.
**Recommendation**
Either consume `LoggingOptions.MinimumLevel` when configuring the Serilog
`LoggerConfiguration` (e.g. set `MinimumLevel.Is(...)` from it), or remove the option
class and the `ScadaBridge:Logging` sections and rely solely on the `Serilog`
configuration section. Keep one mechanism, not two.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: `LoggingOptions.MinimumLevel`
was bound from `ScadaBridge:Logging` but Serilog was configured purely via
`ReadFrom.Configuration` (standard `Serilog` section), so editing
`ScadaBridge:Logging:MinimumLevel` had no effect. Re-triaged the two options: the
"remove the option class" branch was rejected because `Component-Host.md` REQ-HOST-3
explicitly lists `ScadaBridge:Logging``LoggingOptions` and that design doc is outside
this task's edit scope — removing the class would create a fresh code-vs-doc drift.
Fix (the "consume it" branch): added a `LoggerConfigurationFactory.Build` that binds
`LoggingOptions`, parses `MinimumLevel` into a Serilog `LogEventLevel` (falling back
to `Information` for null/blank/unrecognised values), and pins it via
`MinimumLevel.Is(...)` on top of `ReadFrom.Configuration`; `Program.cs` now builds the
logger through this factory. `ScadaBridge:Logging:MinimumLevel` is now live.
Regression tests in new `LoggerConfigurationTests`:
`MinimumLevel_Warning_SuppressesInformationLogs`,
`MinimumLevel_Debug_AllowsDebugLogs`, and `MinimumLevel_Absent_DefaultsToInformation`
— they assert the configured level actually filters log events; they pass against the
new factory (and would fail against the old inline configuration, which ignored the
key). Full Host suite green (175 passed).
### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:146-148` |
**Description**
`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with
`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions`
(`src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a
`DownIfAlone` property — bound from `ScadaBridge:Cluster` via the Options pattern,
documented as "the design-doc requirement", default `true` — but a repo-wide search
shows it is referenced **nowhere outside its own declaration**. The Host therefore
ignores the bound value entirely: setting `ScadaBridge:Cluster:DownIfAlone` to `false`
in `appsettings.json` has no effect, the resolver still runs with `down-if-alone =
on`. This is the same class of defect as the resolved Host-011
(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is
declared, bound, and documented but never consumed, which silently misleads any
operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the
ClusterInfrastructure project owns only the configuration contract, per the
`ClusterOptions` class comment).
**Recommendation**
Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone =
{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping
is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce
the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed
at `on` for ScadaBridge's two-node clusters, remove the `DownIfAlone` property and its
doc comment instead so code and configuration contract agree — but do not leave it
declared-but-dead.
**Resolution**
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed against
`AkkaHostedService.cs`: `BuildHocon` emitted `keep-oldest { down-if-alone = on }` as
a literal, and `ClusterOptions.DownIfAlone` was referenced nowhere outside its own
declaration, so the bound option was dead. Fix: `BuildHocon` now emits
`down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}`, consuming the bound
property. Regression tests in `HoconBuilderTests`: `BuildHocon_DownIfAloneTrue_EmitsOn`
and `BuildHocon_DownIfAloneFalse_EmitsOff` parse the resulting HOCON and assert the
`keep-oldest.down-if-alone` token tracks the property; the `false` case failed before
the fix (token was still `on`) and passes after. Full Host suite green (182 passed).
### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |
**Description**
`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0`
`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`,
`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The
`F0` format specifier rounds to whole seconds, so any sub-second configuration is
silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as
`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or
treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole
seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan`
values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with
no warning — or, worse, `0s`. Rounding configured timing values without surfacing
the change is a correctness hazard for exactly the kind of failure-detection tuning
these options exist for.
**Recommendation**
Render durations without precision loss — emit milliseconds (e.g.
`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or
validate in `StartupValidator` that each cluster timing value is a positive whole
number of seconds and fail fast otherwise. Either way, do not silently round.
**Resolution**
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed: `BuildHocon`
rendered every duration via `TotalSeconds:F0`, rounding sub-second values — a 750ms
heartbeat became `1s` and a 500ms value would collapse to a degenerate `0s`. Fix:
durations are now rendered by a new `DurationHocon` helper that emits whole
milliseconds with an `ms` suffix (Akka's HOCON parser accepts it), so sub-second
configuration survives exactly; `BuildHocon` takes the transport heartbeat/failure
as `TimeSpan` rather than pre-rounded `double` seconds. Regression test
`HoconBuilderTests.BuildHocon_SubSecondTimings_PreservedWithoutRounding` configures
750ms/2600ms/500ms/2.5s/7.5s timings and asserts each round-trips through the parsed
HOCON unchanged; it failed before the fix (750ms parsed as `00:00:01`) and passes
after. Full Host suite green (182 passed).
### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:43-48`, `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.json:1-7` |
**Description**
REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls
`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog`
configuration section — but neither `appsettings.json` nor either role-specific file
contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel`
map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that
actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and
`.WriteTo.File("logs/scadabridge-.log", rollingInterval: Day)` calls. As a result the
console output template, the file path, and the rolling interval cannot be changed
without recompiling — an operator cannot redirect logs, change the file location, or
add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration`
call is effectively a no-op because the section it reads does not exist.
**Recommendation**
Move the console and file sink definitions into a `Serilog` section in
`appsettings.json` (with `WriteTo` entries and the output template / file path /
rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the
hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to
state the sinks are intentionally code-defined — but the design doc currently says
"configuration-driven", so code and doc must be reconciled.
**Resolution**
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed: `LoggerConfigurationFactory.Build`
called `ReadFrom.Configuration` but neither `appsettings.json` nor either
role-specific file contained a `Serilog` section, so the two sinks that actually ran
were hard-coded `.WriteTo.Console(...)` / `.WriteTo.File(...)` calls in `Program.cs`
— uneditable without recompiling, contrary to REQ-HOST-8. Fix: added a `Serilog`
section to `appsettings.json` with `WriteTo` entries for the Console and File sinks
(carrying the output template, file path `logs/scadabridge-.log` and `Day` rolling
interval as args) and removed the hard-coded `.WriteTo` calls from `Program.cs`; the
factory's `ReadFrom.Configuration` (backed by the transitive
`Serilog.Settings.Configuration` package) now drives the sinks. Regression tests in
new `SerilogSinkConfigTests`: `ShippedAppSettings_HasSerilogSection_WithConsoleAndFileSinks`
asserts the shipped `appsettings.json` declares both sinks in a `Serilog:WriteTo`
array (fails against the pre-fix file, which had no `Serilog` section), and
`LoggerConfigurationFactory_AppliesConfiguredFileSink` proves a configuration-supplied
File sink reaches the built logger and writes to the configured path. Full Host suite
green (182 passed).
### Host-015 — `StartupRetry` retries on every exception type, including permanent failures
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/StartupRetry.cs:36-45` |
**Description**
`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard
`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller
wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which
on a Central node in production *validates* the schema version and throws if the
deployed schema does not match the expected migration set. A schema-version mismatch
is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it
8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes)
before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault
that is fatal on the first attempt. The retry helper is meant for *transient*
connection faults (the XML doc says exactly that: "the database may be briefly
unreachable"), but it cannot distinguish transient from permanent failures.
**Recommendation**
Restrict retries to transient faults — e.g. accept an `Func<Exception, bool>
isTransient` predicate and, for the migration call site, treat only
connection-class exceptions (`SqlException` with a connection/transport error
number, `TimeoutException`, socket errors) as retryable while letting
schema-validation / `InvalidOperationException` failures propagate immediately. Add
a `StartupRetryTests` case asserting a non-transient exception is rethrown after a
single attempt.
**Resolution**
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed: `ExecuteWithRetryAsync`
caught `Exception` with only the `when (attempt < maxAttempts)` guard, so it retried
any exception — including a permanent schema-version mismatch raised by the
migration call site, delaying the inevitable fatal exit by ~2 minutes. Fix: added an
optional `Func<Exception, bool> isTransient` parameter (default treats all
exceptions as transient, preserving prior behaviour for unclassified callers); the
catch guard is now `when (attempt < maxAttempts && isTransient(ex))`, so a
non-transient fault propagates after a single attempt. Added a
`StartupRetry.IsTransientDatabaseFault` classifier that returns `true` only for
connection-class faults (`TimeoutException`, `SocketException`, `SqlException` /
`DbException` by type-name match) and `false` for everything else — notably
schema-validation `InvalidOperationException`s — and the `Program.cs` migration call
site now passes it. Regression tests in `StartupRetryTests`:
`ExecuteWithRetry_NonTransientFailure_RethrowsAfterSingleAttempt` (runs exactly once
when `isTransient` returns false) and `ExecuteWithRetry_TransientThenPermanent_StopsAtPermanent`
(retries a `TimeoutException` then stops at a permanent `InvalidOperationException`).
Full Host suite green (182 passed).
### Host-016 — Site `CentralContactPoints` second entry targets the site's own remoting port
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:33-37` |
**Description**
The shipped site config sets `Node:RemotingPort = 8082` and lists
`Communication:CentralContactPoints` as
`["akka.tcp://scadabridge@localhost:8081", "akka.tcp://scadabridge@localhost:8082"]`.
The second contact point — port `8082` — is the **site's own** remoting endpoint,
not a central node. `SiteCommunicationActor` / `ClusterClient` uses these
addresses as initial contacts when discovering the central
`ClusterClientReceptionist`; a contact pointing at the site itself can never
reach the central receptionist and will be a permanent failure in the
initial-contact rotation. For the single-node dev loopback layout the first
contact (`8081`, central) succeeds and the bug is masked, but this is exactly
the kind of dev-config "example" that gets duplicated into multi-central
deployments — the same failure mode the resolved Host-004 finding called out
for the seed-node list. `StartupValidator` validates seed nodes against the
gRPC port (Host-004) but does not cross-check `CentralContactPoints` against
the site's own `RemotingPort`, so the misconfiguration passes silently.
**Recommendation**
Correct the shipped site example to list two central remoting endpoints (e.g.
`localhost:8081` for `central-a` and a distinct port for `central-b` in a
multi-node layout). Consider extending `StartupValidator` to reject any
`Communication:CentralContactPoints` entry whose host+port matches this site
node's `NodeHostname`+`RemotingPort`. Add a regression test in
`StartupValidatorTests` mirroring `Site_SeedNodeOnGrpcPort_FailsValidation`.
**Resolution (2026-05-28):** The shipped `appsettings.Site.json` `CentralContactPoints` entry that pointed at the site's own remoting port (`localhost:8082`) was removed — the dev-loopback default now lists only the single central node (`akka.tcp://scadabridge@localhost:8081`), which is the actually-reachable target in the single-node dev layout. A `_centralContactPoints` doc-key comment was added immediately above the array calling out the per-entry rule (each entry MUST be a central node's remoting endpoint, not the site's own remoting port) and explaining how to extend the list with a second central node (`akka.tcp://scadabridge@central-b-host:8081`) in a multi-central deployment so ClusterClient can fail over. The dangerous example pattern that would have been copied into multi-central configs no longer exists in the template. `StartupValidator` cross-check is left as a follow-up — the documented rule plus the corrected template removes the immediate misconfiguration risk.
### Host-017 — Site-shutdown ordering from REQ-HOST-7 is not wired
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:229-265`, `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcServer.cs` |
**Description**
REQ-HOST-7 documents an explicit four-step shutdown sequence for site nodes:
"(1) On `CoordinatedShutdown`, stop accepting new gRPC streams first.
(2) Cancel all active gRPC streams (triggering client-side reconnect).
(3) Tear down actors.
(4) Use `IHostApplicationLifetime.ApplicationStopping` to signal the gRPC
server." The site path in `Program.cs` (the `role == "Site"` branch) registers
no `IHostApplicationLifetime.ApplicationStopping` callback, and
`SiteStreamGrpcServer` exposes no "stop accepting" / "cancel all streams"
entry point — it has `SetReady` but no corresponding `SetUnavailable` or
`CancelAllStreams`. In practice, on `SIGTERM` Kestrel closes its listener
naturally and `AkkaHostedService.StopAsync` runs Akka `CoordinatedShutdown`,
but there is no explicit, ordered handoff that meets the documented contract:
in-flight streams are not actively cancelled before actors begin tearing down,
so clients see a stream that goes silent (and only times out via gRPC
keepalive) rather than a clean `Cancelled` they can reconnect on. This is a
contract-vs-code drift — either the design doc is overstating what is
implemented, or the implementation is incomplete.
**Recommendation**
Add a `SiteStreamGrpcServer.CancelAllStreams()` method that flips a "shutting
down" flag (so `SubscribeSite` immediately fails new streams with
`StatusCode.Unavailable`) and cancels every entry's `Cts` in the `_streams`
map. In `Program.cs` site branch, resolve `IHostApplicationLifetime` and
register a callback on `ApplicationStopping` that calls `CancelAllStreams()`
before the Akka hosted service runs `CoordinatedShutdown` (or order via
`AkkaHostedService.StopAsync` itself — `IHostedService.StopAsync` runs in
reverse-registration order, so the gRPC server's lifetime can be sequenced
before Akka shutdown). Alternatively, reconcile REQ-HOST-7 with the actual
implementation if the explicit ordering is no longer intended. Add an
integration test under `tests/ZB.MOM.WW.ScadaBridge.Host.Tests` that starts a site host,
opens a stream, triggers shutdown, and asserts the stream completes with
`Cancelled` before the actor system tears down.
**Resolution (2026-05-28):**
REQ-HOST-7 steps (1)+(2) wired. `SiteStreamGrpcServer` gained:
- a monotonic `_shuttingDown` flag,
- `CancelAllStreams()` — flips the flag, cancels every `_activeStreams[*].Cts`
(with `ObjectDisposedException` swallow for entries cleaning themselves
up concurrently), idempotent on repeat calls,
- a `SubscribeInstance` guard that returns `Unavailable "Server shutting
down"` for new subscriptions arriving after the flag flips.
`Program.cs` site branch now resolves `IHostApplicationLifetime` and the
`SiteStreamGrpcServer` singleton, then registers
`ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams())`.
`ApplicationStopping` fires before any `IHostedService.StopAsync`, so the
gRPC server begins refusing new streams and tears down in-flight ones
BEFORE `AkkaHostedService` runs `CoordinatedShutdown` — matching REQ-HOST-7's
ordering. Clients observe a clean `Cancelled` and reconnect rather than a
silent stream that times out via keepalive (~25 s).
Two unit regression tests added to
`tests/ZB.MOM.WW.ScadaBridge.Communication.Tests/Grpc/SiteStreamGrpcServerTests.cs`:
`Host017_CancelAllStreams_CancelsActiveStreamsAndRefusesNewOnes` (active
streams complete, new ones rejected) and `Host017_CancelAllStreams_IsIdempotent`
(double-call safe). A full site-host integration test was deferred — the
unit suite covers both server-side invariants and the wiring is a single
`Register` line in `Program.cs`.
### Host-018 — Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json`, `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json`, `src/ZB.MOM.WW.ScadaBridge.Host/NodeOptions.cs:10-16` |
**Description**
`NodeOptions.NodeName` is documented as "the operator-configured semantic node
name used to stamp the SourceNode column on audit rows", with conventional
values `node-a`/`node-b` for site nodes and `central-a`/`central-b` for
central nodes. The CLAUDE.md "Centralized Audit Log" key-decision section
calls this out: `SourceNode` is meant to be carried verbatim through audit
telemetry and reconciliation, and is indexed via
`IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc)`. The docker per-node
configs (`docker/central-node-a/appsettings.Central.json`,
`docker/site-a-node-a/appsettings.Site.json`, etc.) all set
`ScadaBridge:Node:NodeName`. The **shipped, default** per-role files in
`src/ZB.MOM.WW.ScadaBridge.Host/` — the templates a developer running the binary
directly will use — do not. `NodeIdentityProvider` normalises an empty
`NodeName` to `null`, so dev audit rows carry a null `SourceNode` and the
indexed lookup never narrows. The dev examples should match the docker
examples; at minimum the field should appear in the shipped templates with a
placeholder explaining the convention.
**Recommendation**
Add `"NodeName": "central-a"` (or a placeholder like `"${NODE_NAME}"`) to
`appsettings.Central.json` and `"NodeName": "node-a"` to
`appsettings.Site.json`, with a short comment that the value must be set
per-node in multi-node deployments. Consider validating in `StartupValidator`
that `NodeName` is non-empty, or accept the null and document explicitly that
single-node dev deployments leave `SourceNode` null.
**Resolution (2026-05-28):** The shipped per-role templates now set `ScadaBridge:Node:NodeName` — `central-a` in `appsettings.Central.json` and `node-a` in `appsettings.Site.json` — so dev audit rows are stamped with a real `SourceNode` value (instead of `NodeIdentityProvider` normalising the missing key to `null`) and the indexed `IX_AuditLog_Node_Occurred` lookup actually narrows. A `_nodeName` doc-key comment was added beside each `Node` section explaining the convention (`central-a`/`central-b` for central, `node-a`/`node-b` for site), pointing at the docker per-node configs (which already overrode the field), and noting that the value must be overridden per-node in multi-node deployments and that an empty value still normalises to a `NULL` SourceNode. The shipped dev templates now match the per-node docker examples — a developer running the binary directly no longer sees a null `SourceNode`.
### Host-019 — Migration `StartupRetry` call drops the host `CancellationToken`
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:154-165` |
**Resolution (2026-05-28):** Added a `Func<CancellationToken, Task>` overload of `StartupRetry.ExecuteWithRetryAsync` that forwards the retry-loop token into the operation, and the migration call site in `Program.cs` now passes `app.Lifetime.ApplicationStopping` as both the operation token (threaded to `MigrationHelper.ApplyOrValidateMigrationsAsync`) and the loop's `cancellationToken` (already honoured by the inter-attempt `Task.Delay`). A SIGTERM during the bounded retry window now tears down cleanly instead of waiting up to ~2 minutes for the loop to exhaust. The original `Func<Task>` overload still exists and delegates, so existing callers/tests are unchanged.
**Description**
`StartupRetry.ExecuteWithRetryAsync` accepts an optional
`CancellationToken cancellationToken = default` and observes it both at the
top of each attempt and inside the `Task.Delay` between retries. The migration
call site in `Program.cs` passes no token, so the helper runs with
`CancellationToken.None`. With `maxAttempts: 8`, `initialDelay: 2s`, and the
30s cap, a database that stays unreachable can keep the retry loop alive for
~2 minutes before the host process responds to `SIGTERM` / `Ctrl+C` /
Windows-Service stop. The `Program.cs` startup pipeline does not yet have a
host-lifetime token to forward at this point (the `app` is built but not
yet running), but `app.Lifetime.ApplicationStopping` is available the moment
`builder.Build()` returns. Threading it into the retry call honours the host
lifecycle and matches the helper's documented contract.
**Recommendation**
Pass `app.Lifetime.ApplicationStopping` (or `CancellationToken.None`
explicitly with a comment if intentional) into
`StartupRetry.ExecuteWithRetryAsync`. Add a `StartupRetryTests` case
exercising token-cancellation mid-backoff.
**Resolution**
_Open._
### Host-020 — `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs:36-43` |
**Description**
`LoggerConfigurationFactory.Build` reads the `Serilog` configuration section
via `ReadFrom.Configuration(configuration)` (which can include a
`MinimumLevel` block — the standard Serilog way to set the floor) and **then**
calls `.MinimumLevel.Is(minimumLevel)` derived from
`ScadaBridge:Logging:MinimumLevel`. Serilog's fluent builder applies the later
call, so any `Serilog:MinimumLevel:Default` an operator sets is silently
overridden by `ScadaBridge:Logging:MinimumLevel` (or by its
`Information` fallback when the ScadaBridge key is absent). There are now two
documented configuration paths for the same setting with non-obvious
precedence, and the override direction is the opposite of what most Serilog
users would expect (the more-specific `Serilog` section being the authority).
The XML doc on `Build` says "the explicit `MinimumLevel.Is` pins the floor"
but does not warn that the floor *overrides* the Serilog section's own
`MinimumLevel`.
**Recommendation**
Pick one mechanism: either (a) drop the `MinimumLevel.Is` call and let
`ReadFrom.Configuration` consume `Serilog:MinimumLevel`, migrating any docs/
deployments that reference `ScadaBridge:Logging:MinimumLevel`; or (b) keep the
current "ScadaBridge:Logging" path and reject `Serilog:MinimumLevel` if present
(throw at startup so the operator sees the conflict). At minimum, expand the
XML doc + REQ-HOST-8 to spell out the precedence explicitly.
**Resolution (2026-05-28):** `ScadaBridge:Logging:MinimumLevel` is now the documented single source of truth for the Serilog floor (Host-011's `LoggingOptions` binding), and the precedence is made visible — `LoggerConfigurationFactory.Build` writes a one-shot warning to `Console.Error` when both `ScadaBridge:Logging:MinimumLevel` and `Serilog:MinimumLevel` (or `Serilog:MinimumLevel:Default`) are present, naming both values and pointing the operator at the documented key. Order of operations is unchanged — `MinimumLevel.Is(...)` deliberately runs after `ReadFrom.Configuration(...)` so the ScadaBridge value wins — but the silent-override behaviour is now loud. The class XML doc gained a Host-020 paragraph explicitly spelling out the precedence. A test-visible `Build(..., TextWriter warningWriter)` overload mirrors the `ParseLevel` Host-022 pattern so the warning can be asserted in unit tests; the production four-arg overload delegates with `Console.Error`.
### Host-021 — Microsoft `Logging:LogLevel` section in `appsettings.json` is dead config under Serilog
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.json:2-6` |
**Description**
`appsettings.json` carries a Microsoft `Logging:LogLevel:Default = Information`
block. The `Logging:LogLevel` map is consumed by
`Microsoft.Extensions.Logging.ConfigurationConsoleLoggerOptions` and similar
provider configurations bound from the standard `Logging` section. The Host
calls `builder.Host.UseSerilog()`, which replaces the default
`ILoggerFactory` setup with Serilog as the **only** logger provider; Serilog
reads from `configuration.ReadFrom.Configuration(...)` which consumes the
`Serilog` section, **not** `Logging:LogLevel`. The result is that an operator
editing `Logging:LogLevel:Default` (a very natural thing to try, since it is
the .NET convention) sees no behaviour change — the section is dead config.
**Recommendation**
Either remove the `Logging:LogLevel` block from `appsettings.json` (Serilog
owns logging configuration in this Host), or replace it with a brief comment
explaining it is intentionally retained for non-Serilog tooling. Document the
authoritative location (`Serilog` + `ScadaBridge:Logging`) in
`Component-Host.md` REQ-HOST-8 if not already explicit.
**Resolution (2026-05-28):** Confirmed by repository-wide grep that no code reads `Logging:LogLevel` (the Host calls `builder.Host.UseSerilog()` which replaces the default `ILoggerFactory` setup with Serilog as the only provider), so the block was pure dead config. Removed the `Logging:LogLevel:Default = Information` block from `appsettings.json` and replaced it with a `_logging` doc-key comment explaining the rationale (Serilog is the sole provider) and pointing operators at the two authoritative keys: `ScadaBridge:Logging:MinimumLevel` for the floor (bound to `LoggingOptions` per Host-011) and the `Serilog` section for sinks (Host-014's `ReadFrom.Configuration`). The Host-014 regression test (`SerilogSinkConfigTests.ShippedAppSettings_HasSerilogSection_WithConsoleAndFileSinks`) still asserts the surviving `Serilog` section's shape, so removing the Microsoft block did not break the existing pinning.
### Host-022 — `ParseLevel` silently coerces unrecognised `MinimumLevel` to `Information`
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs:50-55` |
**Resolution (2026-05-28):** `ParseLevel` now writes a one-shot warning to `Console.Error` (the logger isn't built yet at this point) when a non-null/non-blank `MinimumLevel` value fails to parse, naming the offending value and the `Information` fallback. Null/blank values continue to default silently (treated as "unset"). The helper gained a test-visible `TextWriter` overload so unit tests can capture the warning; the production path delegates to it with `Console.Error`. Tests `ParseLevel_UnrecognisedValue_FallsBackAndWarns`, `ParseLevel_NullOrBlank_FallsBackSilently`, and `ParseLevel_RecognisedValue_NoWarning` pin the behaviour.
**Description**
`LoggerConfigurationFactory.ParseLevel` uses
`Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed)` and
returns `LogEventLevel.Information` when parsing fails — without logging the
fallback. An operator who sets
`ScadaBridge:Logging:MinimumLevel = "Informaiton"` (a common typo) or
`"Verbose,Debug"` or any unrecognised value gets the default level silently;
there is no warning, no log line, no startup error. Combined with Host-020
(this is the only mechanism that pins the floor), a misspelt value is
invisible until someone wonders why the level change "didn't take". The
helper is small and could either fail-fast in `StartupValidator` or emit a
console warning before the logger is configured.
**Recommendation**
In `LoggerConfigurationFactory.Build`, when `loggingOptions.MinimumLevel` is
non-null/non-blank but does not parse to a valid `LogEventLevel`, write a
`Console.Error.WriteLine` warning (the logger is not yet built) and proceed
with `Information`. Alternatively, validate the value in `StartupValidator`
and fail fast — that matches the pattern used for other ScadaBridge
configuration keys. Add a `LoggerConfigurationTests` case asserting the
behaviour you choose.
**Resolution**
_Open._