7b0b9c7365
Solution + 23 src projects + 26 test projects renamed; folders, csproj, namespaces, and ScadaLinkDbContext/ScadaBridgeDbContext class updated. ActorSystem "scadalink" → "scadabridge", Akka seed-node URLs migrated. SQL roles/logins, LDAP domains, CLI command name, and CLI config dir (~/.scadalink → ~/.scadabridge) also renamed. Build green; 5 Host.Tests fail awaiting SQL login rename in next commit. Pre-existing StaleTagMonitor timing flakes unchanged. Rename script committed at tools/rename-to-scadabridge.sh.
1121 lines
66 KiB
Markdown
1121 lines
66 KiB
Markdown
# Code Review — Host
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ZB.MOM.WW.ScadaBridge.Host` |
|
||
| Design doc | `docs/requirements/Component-Host.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-05-28 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `1eb6e97` |
|
||
| Open findings | 0 |
|
||
|
||
## Summary
|
||
|
||
The Host module is the composition root for the entire ScadaBridge system: a single
|
||
binary whose behaviour (`Central` vs `Site`) is driven entirely by configuration. The
|
||
implementation is generally faithful to `Component-Host.md` — startup validation,
|
||
role-based registration, Serilog enrichment, Windows Service support, dead-letter
|
||
monitoring, CoordinatedShutdown, and gRPC hosting on site nodes are all present and
|
||
backed by a solid test suite (`tests/ZB.MOM.WW.ScadaBridge.Host.Tests`).
|
||
|
||
The most significant problem is the readiness endpoint: `/health/ready` runs **all**
|
||
registered health checks, including the leader-only `active-node` check, so a fully
|
||
operational *standby* central node permanently reports `503` on `/health/ready` —
|
||
directly contradicting REQ-HOST-4a, which defines readiness as cluster membership +
|
||
DB connectivity (not leadership). Several other findings concern configuration that
|
||
is validated-but-never-consumed (`MachineDataDb`), design-doc drift (Akka.Persistence
|
||
is required by REQ-HOST-6 but the system uses no persistent actors), an incorrect
|
||
seed-node entry in the shipped site config, blocking sync-over-async during startup,
|
||
and unguarded string interpolation when building HOCON. None are crash/data-loss
|
||
class, but the readiness bug is High because it breaks load-balancer behaviour with
|
||
no safe workaround.
|
||
|
||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||
|
||
All eleven findings from the first review (Host-001..011) are confirmed `Resolved`
|
||
in the current tree — the `/health/ready` predicate, the externalised secrets, the
|
||
seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
|
||
migration retry and the live `LoggingOptions.MinimumLevel` are all present as
|
||
described in their Resolution notes. This re-review walked all ten checklist
|
||
categories again over the full module and recorded four new findings, none of them
|
||
crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
|
||
note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes
|
||
`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so
|
||
that documented, bound option is dead. The remaining three are Low: `:F0` rounding
|
||
of cluster timing values silently degrades any sub-second configuration (Host-013),
|
||
Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as
|
||
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
|
||
exception type including permanent schema-validation failures (Host-015).
|
||
|
||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||
|
||
All fifteen prior findings (Host-001..015) remain `Resolved` in the current tree
|
||
and the regressions introduced for them — Host-001's predicate, the externalised
|
||
secrets, the Site GrpcPort/RemotingPort/seed-port validation rules, the escaped
|
||
HOCON builder with `DownIfAlone` and millisecond-precision durations, the
|
||
configuration-driven Serilog sinks, the transient-only `StartupRetry`
|
||
classifier — are all still in place. This re-review walked the ten checklist
|
||
categories over the full module again and recorded seven new findings, none of
|
||
them crash/data-loss class. Host-016 (Medium) mirrors the resolved Host-004
|
||
shipped-config bug on the **Communication** side: `appsettings.Site.json`'s
|
||
second `CentralContactPoints` entry points at the site's own remoting port
|
||
(`localhost:8082`) instead of central, an incorrect dev example that copies
|
||
into multi-central deployments. Host-017 (Medium) flags a partial REQ-HOST-7
|
||
implementation — the documented site-shutdown ordering (stop accepting streams
|
||
first, cancel active streams via `IHostApplicationLifetime.ApplicationStopping`,
|
||
then tear down actors) is not wired: the site path registers no
|
||
`ApplicationStopping` handler that signals `SiteStreamGrpcServer`, and the gRPC
|
||
server exposes no cancel-all-streams entry point. The remaining five are Low:
|
||
`NodeOptions.NodeName` (the operator-configured value stamped on
|
||
`AuditLog.SourceNode`) is absent from both shipped per-role configs even though
|
||
the docker per-node configs set it (Host-018); the migration `StartupRetry`
|
||
call passes `default` for `CancellationToken`, so a SIGTERM during the
|
||
bounded-retry window is ignored for up to ~2 minutes (Host-019);
|
||
`LoggerConfigurationFactory` layers `MinimumLevel.Is` over
|
||
`ReadFrom.Configuration`, so any `Serilog:MinimumLevel` an operator sets is
|
||
silently overridden by `ScadaBridge:Logging:MinimumLevel` (Host-020); the
|
||
shipped `appsettings.json` carries a Microsoft `Logging:LogLevel` block but
|
||
Serilog is the only logger provider and the section is dead config (Host-021);
|
||
and `ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g.
|
||
a typo) and falls back to `Information` with no warning (Host-022).
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). |
|
||
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
|
||
| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. |
|
||
| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
|
||
| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
|
||
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
|
||
| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
|
||
| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
|
||
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
|
||
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
|
||
|
||
_Re-review (2026-05-28, `1eb6e97`):_
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | Re-review: `appsettings.Site.json` second `CentralContactPoints` entry targets the site's own remoting port instead of central (Host-016) — same defect class as the resolved Host-004 seed-list bug. |
|
||
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping, role-scoped site singletons, ClusterClient initial-contact wiring all reviewed; no new issues. |
|
||
| 3 | Concurrency & thread safety | ☑ | `_trackedDisposables` is locked on both sides of the lifecycle; `_actorSystem` publication is safe via the IHost startup `await` boundary. New Low: `StartupRetry` migration call passes `default` `CancellationToken`, so SIGTERM during the retry window is ignored (Host-019). |
|
||
| 4 | Error handling & resilience | ☑ | `IsTransientDatabaseFault` correctly classifies socket / timeout / SqlException; the retry helper itself remains sound. Host-019 is the resilience gap. |
|
||
| 5 | Security | ☑ | Secrets stay externalised; the `_secrets` placeholder comment is intact. No new issues. |
|
||
| 6 | Performance & resource management | ☑ | No new undisposed resources; gRPC stream lifetime cap remains correct. No new issues. |
|
||
| 7 | Design-document adherence | ☑ | Re-review: REQ-HOST-7 site-shutdown ordering — stop accepting new streams, cancel active streams via `ApplicationStopping`, then tear down actors — is not wired in `Program.cs` (Host-017). |
|
||
| 8 | Code organization & conventions | ☑ | Re-review: `NodeOptions.NodeName` is absent from the shipped per-role configs even though it stamps `AuditLog.SourceNode` (Host-018); the appsettings `Logging:LogLevel` Microsoft section is dead config under Serilog (Host-021). |
|
||
| 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the unused `CancellationToken` parameter (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). |
|
||
| 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaBridge:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). |
|
||
|
||
## Findings
|
||
|
||
### Host-001 — `/health/ready` includes the leader-only `active-node` check
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:135-145` |
|
||
|
||
**Description**
|
||
|
||
`/health/ready` is mapped with `MapHealthChecks("/health/ready", ...)` and **no
|
||
`Predicate`**, so it executes every registered check: `database`, `akka-cluster`
|
||
*and* `active-node`. `ActiveNodeHealthCheck` (`Health/ActiveNodeHealthCheck.cs:38`)
|
||
returns `Unhealthy` on any node that is not the cluster leader. As a result a
|
||
standby central node that is fully operational (cluster member `Up`, database
|
||
reachable) still returns `503` on `/health/ready`. This contradicts REQ-HOST-4a,
|
||
which defines readiness as cluster membership + DB connectivity + singletons —
|
||
explicitly *not* leadership. `/health/active` is the endpoint intended to report
|
||
leadership. A load balancer using `/health/ready` to decide whether a node may
|
||
serve traffic will permanently treat the standby as unready, defeating failover
|
||
readiness. No test covers this: `HealthCheckTests.HealthReady_Endpoint_ReturnsResponse`
|
||
only asserts a response is returned, not the standby semantics.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `Predicate` to the `/health/ready` mapping that excludes the `active-node`
|
||
check, e.g. `Predicate = check => check.Name != "active-node"` (or tag the readiness
|
||
checks and filter by tag). Add a regression test asserting a non-leader node returns
|
||
`200` on `/health/ready`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against
|
||
`Program.cs`: the `/health/ready` mapping had no `Predicate`, so it executed all
|
||
three registered checks including the leader-only `active-node` check, while
|
||
`ActiveNodeHealthCheck` returns `Unhealthy` on any non-leader node — making a fully
|
||
operational standby central node permanently report `503`. Fix: added
|
||
`Predicate = check => check.Name != "active-node"` to the `/health/ready`
|
||
`HealthCheckOptions`, so readiness now reflects cluster membership + DB connectivity
|
||
only (REQ-HOST-4a); leadership remains reported solely by `/health/active`.
|
||
Regression test `HealthCheckTests.HealthReady_Endpoint_ExcludesActiveNodeCheck`
|
||
asserts the `active-node` check name does not appear in the `/health/ready`
|
||
response body; it failed before the fix and passes after. Full Host suite green
|
||
(156 passed).
|
||
|
||
### Host-002 — Akka.Persistence required by REQ-HOST-6 is not configured and not used
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:70-108`, `docs/requirements/Component-Host.md` REQ-HOST-6 |
|
||
|
||
**Description**
|
||
|
||
REQ-HOST-6 states the Host "must configure the Akka.NET actor system using
|
||
Akka.Hosting with ... **Persistence**: Configured with the appropriate journal and
|
||
snapshot store (SQL for central, SQLite for site)." The HOCON built in
|
||
`AkkaHostedService.StartAsync` contains no `akka.persistence` section, no journal and
|
||
no snapshot-store plugin, and `ZB.MOM.WW.ScadaBridge.Host.csproj` references neither
|
||
`Akka.Persistence.Hosting` nor any persistence plugin (the design doc Dependencies
|
||
list `Akka.Persistence.Hosting`). A repo-wide search finds **no** `PersistentActor` /
|
||
`ReceivePersistentActor` subclasses — the system deliberately uses custom SQLite
|
||
storage services instead. The code is internally consistent, but the design document
|
||
is stale: it mandates a subsystem that does not exist. This is a documented-vs-actual
|
||
drift that will mislead future maintainers and any audit against REQ-HOST-6.
|
||
|
||
**Recommendation**
|
||
|
||
Update `Component-Host.md` REQ-HOST-6 and the Dependencies list to remove the
|
||
Akka.Persistence requirement (or explicitly state persistence is provided by
|
||
component-owned SQLite storage, not Akka.Persistence). If persistence *is* intended,
|
||
add the plugin packages and HOCON. Either way, code and doc must agree.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). Confirmed accurate: a repo-wide search
|
||
finds **no** `PersistentActor` / `ReceivePersistentActor` subclasses anywhere in
|
||
`src/`, no `akka.persistence` section in the HOCON built by
|
||
`AkkaHostedService.StartAsync`, and `ZB.MOM.WW.ScadaBridge.Host.csproj` references no
|
||
persistence plugin packages. The Host code is correct and internally consistent;
|
||
the defect was a stale design doc. Fix: updated `docs/requirements/Component-Host.md`
|
||
— REQ-HOST-6 no longer lists Persistence as a configured subsystem and now carries
|
||
an explicit Persistence note stating that durable state is owned by individual
|
||
components and persisted through component-owned stores (SQLite at sites, MS SQL
|
||
centrally), with no Akka journal/snapshot store and no `PersistentActor` subclasses
|
||
by design. The Responsibilities line and the Dependencies entry
|
||
(`Akka.Persistence.Hosting` removed) were corrected to match. Code and doc now
|
||
agree; no code or test change required.
|
||
|
||
### Host-003 — Secrets committed in plaintext in `appsettings.Central.json`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json:20-31` |
|
||
|
||
**Description**
|
||
|
||
`appsettings.Central.json` contains real-looking secrets in plaintext, checked into
|
||
source control: SQL Server passwords in the `ConfigurationDb` / `MachineDataDb`
|
||
connection strings (`Password=ScadaBridge_Dev1#`), an LDAP service-account password
|
||
(`LdapServiceAccountPassword: "password"`), and a JWT signing key
|
||
(`JwtSigningKey: "scadabridge-dev-jwt-signing-key-..."`). Even though these are
|
||
intended as development defaults, shipping them in the default config invites them
|
||
being reused verbatim in production, and a committed JWT signing key allows anyone
|
||
with repo access to forge session tokens. `TrustServerCertificate=true` additionally
|
||
disables TLS validation for the SQL connection.
|
||
|
||
**Recommendation**
|
||
|
||
Move all secrets out of committed `appsettings*.json` into environment variables,
|
||
user-secrets, or a secret store. Keep only non-sensitive structural defaults in the
|
||
file and document the required environment variables. At minimum add a clear comment
|
||
that these values are dev-only and must be overridden, and rotate the JWT key per
|
||
environment.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed against
|
||
`appsettings.Central.json`: the committed file carried real-looking secrets in
|
||
plaintext — SQL Server passwords (`Password=ScadaBridge_Dev1#`) in both connection
|
||
strings, an LDAP service-account password, and a JWT signing key. Fix: all four
|
||
secrets were removed from the committed file and replaced with non-functional
|
||
`${...}` placeholder references (`ConfigurationDb` / `MachineDataDb`,
|
||
`LdapServiceAccountPassword`, `JwtSigningKey`). A new top-level `_secrets` note
|
||
documents that the Host's configuration builder (`AddEnvironmentVariables()`)
|
||
overlays the real values supplied via environment variables
|
||
(`ScadaBridge__Database__ConfigurationDb`, `ScadaBridge__Database__MachineDataDb`,
|
||
`ScadaBridge__Security__LdapServiceAccountPassword`,
|
||
`ScadaBridge__Security__JwtSigningKey`); the placeholders are intentionally invalid so
|
||
a misconfigured deployment fails loudly rather than silently using a committed key.
|
||
Regression test class `ConfigSecretsTests` asserts the committed file carries no
|
||
plaintext `Password=` value, no committed LDAP service-account password, and no
|
||
committed JWT signing key; all three tests failed before the fix and pass after.
|
||
Tests that drive the full `Program` startup pipeline against the real SQL provider
|
||
(`HealthCheckTests`, `HostStartupTests.CentralRole_StartsWithoutError`) were adapted
|
||
to supply the local dev connection strings themselves via the new
|
||
`CentralDbTestEnvironment` test helper (environment variables) — they must no longer
|
||
depend on committed secrets. Note: the `docker/central-node-*/appsettings.Central.json`
|
||
files still contain the same dev secrets but lie outside this task's permitted edit
|
||
scope; they should receive the same treatment in a follow-up.
|
||
|
||
### Host-004 — Site seed-node list points at the gRPC port, not a remoting port
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:10-19` |
|
||
|
||
**Description**
|
||
|
||
The shipped site config sets `Node:RemotingPort = 8082` and `Node:GrpcPort = 8083`,
|
||
but `Cluster:SeedNodes` is `["akka.tcp://scadabridge@localhost:8082",
|
||
"akka.tcp://scadabridge@localhost:8083"]`. The second seed node targets `8083`, which
|
||
is the Kestrel HTTP/2 gRPC port — not an Akka remoting endpoint. A node attempting to
|
||
join via that seed will try to establish an Akka.Remote TCP association against the
|
||
gRPC listener and fail. `StartupValidator` only checks that ≥2 seed nodes exist
|
||
(`StartupValidator.cs:54-56`), so this misconfiguration passes validation silently.
|
||
For the single-node dev site it is harmless (the first seed succeeds), but it is an
|
||
incorrect example that will be copied into multi-node site configs.
|
||
|
||
**Recommendation**
|
||
|
||
Correct the site seed-node list to reference the two site nodes' *remoting* ports
|
||
(e.g. `8082` and `8084`), never the gRPC port. Consider extending `StartupValidator`
|
||
to reject a seed node whose port equals this node's `GrpcPort`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed against
|
||
`appsettings.Site.json`: with `Node:RemotingPort = 8082` and `Node:GrpcPort = 8083`,
|
||
the second `Cluster:SeedNodes` entry was `akka.tcp://scadabridge@localhost:8083` — the
|
||
Kestrel HTTP/2 gRPC port, not an Akka.Remote endpoint. `StartupValidator` only
|
||
checked seed-node *count* (≥2), so the misconfiguration passed silently. Fix, two
|
||
parts: (1) the shipped `appsettings.Site.json` second seed entry was corrected to a
|
||
remoting port (`localhost:8084`); (2) `StartupValidator.Validate` was extended — for
|
||
`Site` nodes it now parses each seed node's TCP port (via a new `SeedNodePort`
|
||
helper) and rejects any entry whose port equals the node's `GrpcPort`, using the
|
||
resolved GrpcPort including the `8083` `NodeOptions` default when the key is absent.
|
||
The seed-node-count check was hoisted above the Site block so the new rule can reuse
|
||
the parsed list. Regression tests in `StartupValidatorTests`:
|
||
`Site_SeedNodeOnGrpcPort_FailsValidation`,
|
||
`Site_SeedNodeOnDefaultGrpcPort_FailsValidation` (default-8083 path),
|
||
`Site_SeedNodesOnRemotingPort_PassesValidation`, and
|
||
`Central_SeedNodeOnPort8083_PassesValidation` (rule is Site-only) — all failed
|
||
appropriately before the fix and pass after.
|
||
|
||
### Host-005 — Blocking sync-over-async (`GetAwaiter().GetResult()`) inside `StartAsync`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:345` |
|
||
|
||
**Description**
|
||
|
||
`RegisterSiteActors` calls `storeAndForwardService.StartAsync().GetAwaiter().GetResult()`
|
||
synchronously, blocking inside the `IHostedService.StartAsync` path. `StartAsync` is
|
||
itself declared synchronous (returns `Task.CompletedTask`), so the work cannot be
|
||
awaited cleanly. Blocking on async work risks thread-pool starvation during startup
|
||
and, if the awaited operation captures a synchronization context, deadlock. It also
|
||
hides exceptions behind an `AggregateException` wrapper.
|
||
|
||
**Recommendation**
|
||
|
||
Make `AkkaHostedService.StartAsync` genuinely `async` and `await
|
||
storeAndForwardService.StartAsync(cancellationToken)`. Propagate the
|
||
`CancellationToken` and let exceptions surface as the original type.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: `RegisterSiteActors`
|
||
called `storeAndForwardService.StartAsync().GetAwaiter().GetResult()` synchronously
|
||
inside the `IHostedService.StartAsync` path, blocking a startup thread and wrapping
|
||
any failure in `AggregateException`. Fix: `AkkaHostedService.StartAsync` is now
|
||
genuinely `async Task`; the site path was renamed `RegisterSiteActorsAsync` and made
|
||
`async`; the store-and-forward init is now `await storeAndForwardService.StartAsync()`
|
||
(the service's own signature, in another module, takes no `CancellationToken`, so the
|
||
token is honoured via a `cancellationToken.ThrowIfCancellationRequested()` checkpoint
|
||
immediately before the await). No blocking, no sync-context deadlock risk, and the
|
||
original exception type now surfaces. No dedicated regression test: the only
|
||
observable change is non-blocking/exception-unwrapping behaviour on a path that
|
||
requires the full site DI graph plus a real SQLite store; the existing
|
||
`HostStartupTests.SiteRole_*` and `CompositionRootTests` continue to exercise the
|
||
async startup pipeline and remain green (175 passed).
|
||
|
||
### Host-006 — HOCON assembled by unescaped string interpolation
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:70-108` |
|
||
|
||
**Description**
|
||
|
||
The Akka HOCON is built with an interpolated string that injects
|
||
`_nodeOptions.NodeHostname`, `_clusterOptions.SeedNodes`, the computed roles, and
|
||
`SplitBrainResolverStrategy` directly into the configuration text. Values are not
|
||
escaped. A hostname or seed-node string containing a quote, backslash, brace, or
|
||
comment sequence would corrupt the HOCON and produce a confusing parse error far from
|
||
the real cause; `SplitBrainResolverStrategy` is interpolated without quoting, so a
|
||
value with whitespace breaks the document. Building cluster configuration from raw
|
||
string concatenation is also harder to maintain than the typed Akka.Hosting builder
|
||
the design doc (REQ-HOST-6) actually calls for ("via Akka.Hosting").
|
||
|
||
**Recommendation**
|
||
|
||
Prefer the `Akka.Hosting` `AddAkka(...)` builder with strongly-typed `WithRemoting`,
|
||
`WithClustering`, and split-brain-resolver configuration instead of hand-built HOCON.
|
||
If HOCON must be retained, validate/escape interpolated values (especially hostname
|
||
and seed nodes) before substitution.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: the Akka HOCON was
|
||
built by an interpolated string injecting `NodeHostname`, `SeedNodes`, roles and
|
||
`SplitBrainResolverStrategy` directly into the document text with no escaping — a
|
||
value containing a quote, backslash or whitespace would corrupt or misparse the
|
||
document. Re-triaged the recommendation: switching wholesale to the typed
|
||
`Akka.Hosting` builder is a larger refactor than this Low finding warrants and would
|
||
touch how the whole cluster is bootstrapped, so the HOCON form was retained but made
|
||
safe. Fix: the HOCON construction was extracted into a testable static
|
||
`AkkaHostedService.BuildHocon`, and every interpolated *string* value is now routed
|
||
through a `QuoteHocon` helper that wraps the value in a double-quoted HOCON literal
|
||
and escapes embedded backslashes and quotes — `NodeHostname`, every seed node, every
|
||
role, and `SplitBrainResolverStrategy` (the last was previously unquoted entirely, so
|
||
whitespace alone would break it). Regression tests in new `HoconBuilderTests`:
|
||
`BuildHocon_HostnameWithQuote_DoesNotCorruptDocument` (a hostname with a `"` still
|
||
parses and following keys stay intact), `BuildHocon_StrategyWithWhitespace_RemainsQuoted`,
|
||
and `BuildHocon_PlainValues_ParsesAndPreservesValues`. With the old raw interpolation
|
||
a quote-containing hostname produced `hostname = "evil"host"`, closing the literal
|
||
early and corrupting following keys; the escaped form parses correctly. Full Host
|
||
suite green (175 passed).
|
||
|
||
### Host-007 — REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:43-47` |
|
||
|
||
**Description**
|
||
|
||
REQ-HOST-4 requires: "Site nodes must have `GrpcPort` in valid port range (1–65535)
|
||
**and different from `RemotingPort`**." `StartupValidator` validates the GrpcPort
|
||
range but never compares it to `RemotingPort`. A site config that sets both ports to
|
||
the same value passes validation and then fails opaquely at runtime when Kestrel and
|
||
Akka.Remote both try to bind the port. The GrpcPort range check is also skipped
|
||
entirely when the key is absent (`grpcPortStr != null`), relying on the
|
||
`NodeOptions` default of 8083 — acceptable, but the equality rule is the missing
|
||
piece.
|
||
|
||
**Recommendation**
|
||
|
||
Add a check in the `role == "Site"` block: if `GrpcPort` (resolved, including the
|
||
8083 default) equals `RemotingPort`, add an error
|
||
`"ScadaBridge:Node:GrpcPort must differ from RemotingPort"`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed against
|
||
`StartupValidator.cs`: the `role == "Site"` block validated the GrpcPort range but
|
||
never compared it to `RemotingPort`, so a site config setting both ports equal
|
||
passed validation and then failed opaquely when Kestrel and Akka.Remote both bound
|
||
the port. Fix: added `if (port == grpcPort) errors.Add("ScadaBridge:Node:GrpcPort must
|
||
differ from RemotingPort")` in the Site block, using the resolved GrpcPort (including
|
||
the 8083 `NodeOptions` default when the key is absent) against the parsed
|
||
RemotingPort. Regression tests in `StartupValidatorTests`:
|
||
`Site_GrpcPortEqualsRemotingPort_FailsValidation`,
|
||
`Site_DefaultGrpcPortEqualsRemotingPort_FailsValidation` (default-8083 path), and
|
||
`Site_GrpcPortDiffersFromRemotingPort_PassesValidation` — all failed appropriately
|
||
before the fix and pass after. Full Host suite green (175 passed).
|
||
|
||
### Host-008 — `MachineDataDb` is validated and declared but never consumed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:33-34`, `src/ZB.MOM.WW.ScadaBridge.Host/DatabaseOptions.cs:6` |
|
||
|
||
**Description**
|
||
|
||
`StartupValidator` requires a non-empty `ScadaBridge:Database:MachineDataDb` connection
|
||
string for Central nodes, and `DatabaseOptions` exposes a `MachineDataDb` property,
|
||
but a repo-wide search shows the value is never read anywhere outside the Host module
|
||
— only `ConfigurationDb` is passed to `AddConfigurationDatabase`
|
||
(`Program.cs:83-85`). The Host therefore fails startup if `MachineDataDb` is missing
|
||
even though nothing uses it. This is either dead configuration that should be removed
|
||
or a missing wiring (a machine-data DbContext that was never registered).
|
||
|
||
**Recommendation**
|
||
|
||
Determine whether a machine-data store is actually required. If yes, wire it into the
|
||
relevant component's DI registration. If no, remove the `MachineDataDb` validation
|
||
rule, the `DatabaseOptions` property, and the key from `appsettings.Central.json`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: a repo-wide search
|
||
shows `MachineDataDb` is read nowhere outside the Host module — only `ConfigurationDb`
|
||
is wired into `AddConfigurationDatabase` — yet `StartupValidator` failed Central
|
||
startup when it was absent. No machine-data store exists or is registered, so this is
|
||
dead configuration. Fix (the "if no" branch of the recommendation): removed the
|
||
`MachineDataDb` validation rule from `StartupValidator`, the `MachineDataDb` property
|
||
from `DatabaseOptions`, the `MachineDataDb` key from `appsettings.Central.json`, and
|
||
the now-pointless `MachineDataDb` reference from that file's `_secrets` note and from
|
||
the `CentralDbTestEnvironment` test helper. Regression test: the prior
|
||
`StartupValidatorTests.Central_MissingMachineDataDb_FailsValidation` was inverted to
|
||
`Central_MissingMachineDataDb_PassesValidation` (absence must now NOT fail startup),
|
||
and `MachineDataDb` was removed from the shared `ValidCentralConfig` fixture so all
|
||
Central tests prove the key is no longer required; it failed under the old code and
|
||
passes after. Full Host suite green (175 passed). Note: `appsettings.Central.json`
|
||
under `docker/central-node-*` is outside this task's edit scope and still carries the
|
||
key harmlessly — it will simply be ignored; a follow-up may remove it for tidiness.
|
||
|
||
### Host-009 — `StartAsync` reports success before role actors are confirmed running
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:127-141` |
|
||
|
||
**Description**
|
||
|
||
`StartAsync` creates actors with `ActorOf` (a fire-and-forget operation — the actor's
|
||
`PreStart` runs asynchronously on its own thread) and then returns
|
||
`Task.CompletedTask`. For site nodes, `grpcServer.SetReady(_actorSystem)` is called
|
||
synchronously at the end of `RegisterSiteActors`, marking the gRPC server ready even
|
||
though `SiteCommunicationActor`, the deployment-manager singleton, and the
|
||
`ClusterClient` may not yet have completed their `PreStart`/initial-contact handshake.
|
||
REQ-HOST-7 requires "Actor system and SiteStreamManager ... initialized before gRPC
|
||
begins accepting connections" — `SiteStreamManager.Initialize` is awaited-equivalent,
|
||
but the broader actor graph is not. The window is small and the gRPC server still
|
||
rejects streams until `SetReady`, so impact is limited, but readiness is being
|
||
asserted optimistically.
|
||
|
||
**Recommendation**
|
||
|
||
If strict ordering matters, gate `SetReady` on confirmation that
|
||
`SiteCommunicationActor` is fully initialized (e.g. an `Ask` round-trip or a
|
||
readiness message), or document explicitly that gRPC readiness only guarantees the
|
||
actor system exists, not that the cluster handshake has completed.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`) — re-triaged, documentation fix. Root cause
|
||
confirmed: `SetReady` is called optimistically after `ActorOf` registrations whose
|
||
`PreStart`/initial-contact handshakes run asynchronously. Re-triaged the first
|
||
recommended option (gating `SetReady` on a `SiteCommunicationActor` `Ask`
|
||
round-trip): that option is **incorrect** because the meaningful asynchronous step is
|
||
the `ClusterClient` initial-contact handshake with central — gating site gRPC
|
||
readiness on it would prevent a site from coming up and streaming locally while
|
||
central is briefly unreachable, contradicting the design's site-autonomy model.
|
||
Confirmed `SiteStreamGrpcServer` already rejects any stream opened before `SetReady`
|
||
with `StatusCode.Unavailable`, so the window is benign. Fix (the second, correct
|
||
option): the `SetReady` call site now carries an explicit comment documenting the
|
||
narrow readiness contract — by `SetReady` the actor system exists,
|
||
`SiteStreamManager.Initialize` has run, and every role actor has been created with
|
||
ordered synchronous `ActorOf` + registration `Tell`s; what is deliberately NOT
|
||
guaranteed is `PreStart` completion or the central handshake. No regression test is
|
||
meaningful for a documentation-only clarification of an intentional design contract;
|
||
the existing `HostStartupTests.SiteRole_*` cover the surrounding startup path. Full
|
||
Host suite green (175 passed).
|
||
|
||
### Host-010 — No retry/backoff around startup preconditions (DB migration, readiness)
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:112-125` |
|
||
|
||
**Description**
|
||
|
||
On Central startup the Host opens a DI scope and calls
|
||
`MigrationHelper.ApplyOrValidateMigrationsAsync` directly. If the SQL Server is not
|
||
yet reachable (common in container orchestration where the DB and app start
|
||
together), the call throws, the top-level `catch` logs `Fatal`, and the process
|
||
exits. There is no bounded retry/backoff to tolerate a database that is briefly
|
||
unavailable at boot. The design intent (REQ-HOST-4a, readiness gating, `503` until
|
||
ready) is about *serving traffic*, but the migration step happens before the host
|
||
even runs and has no such tolerance.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap the migration/validation step in a bounded retry with exponential backoff (e.g.
|
||
Polly), or move schema apply behind the readiness gate so the process stays up and
|
||
reports `503` until the database becomes reachable.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: Central startup called
|
||
`MigrationHelper.ApplyOrValidateMigrationsAsync` directly with no tolerance for a
|
||
database that is briefly unreachable at boot — a first connection failure threw, the
|
||
top-level `catch` logged `Fatal`, and the process exited. Fix: added a new
|
||
`StartupRetry.ExecuteWithRetryAsync` helper (bounded retry with capped exponential
|
||
backoff — no Polly dependency, keeping the Host's package set minimal) and wrapped
|
||
the migration step in it (`maxAttempts: 8`, `initialDelay: 2s`, doubling, capped at
|
||
30s). Each retry opens a fresh DI scope so a transient connection fault does not
|
||
leave a poisoned `DbContext`; once attempts are exhausted the last exception
|
||
propagates and the existing fatal-log path still fires. Regression tests in new
|
||
`StartupRetryTests`: `ExecuteWithRetry_SucceedsFirstTry_RunsOnce`,
|
||
`ExecuteWithRetry_TransientFailures_RetriesUntilSuccess` (fails twice then succeeds),
|
||
and `ExecuteWithRetry_ExhaustsAttempts_RethrowsLastException` (verifies the original
|
||
exception type/message surfaces after exhaustion) — they fail to compile/pass against
|
||
the pre-fix code (the helper did not exist) and pass after. Full Host suite green
|
||
(175 passed).
|
||
|
||
### Host-011 — `LoggingOptions.MinimumLevel` is dead configuration
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/LoggingOptions.cs:5`, `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:42-50` |
|
||
|
||
**Description**
|
||
|
||
`LoggingOptions` exposes a `MinimumLevel` property bound from `ScadaBridge:Logging`
|
||
(`SiteServiceRegistration.BindSharedOptions`), and both `appsettings.Central.json`
|
||
and `appsettings.Site.json` set `"Logging": { "MinimumLevel": "Information" }`.
|
||
However Serilog is configured purely via `ReadFrom.Configuration(configuration)`,
|
||
which reads the standard `Serilog` section — not `ScadaBridge:Logging`. The
|
||
`LoggingOptions.MinimumLevel` value is never read by any code, so changing it has no
|
||
effect. This is misleading: an operator editing `ScadaBridge:Logging:MinimumLevel`
|
||
expecting a log-level change will see nothing happen.
|
||
|
||
**Recommendation**
|
||
|
||
Either consume `LoggingOptions.MinimumLevel` when configuring the Serilog
|
||
`LoggerConfiguration` (e.g. set `MinimumLevel.Is(...)` from it), or remove the option
|
||
class and the `ScadaBridge:Logging` sections and rely solely on the `Serilog`
|
||
configuration section. Keep one mechanism, not two.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: `LoggingOptions.MinimumLevel`
|
||
was bound from `ScadaBridge:Logging` but Serilog was configured purely via
|
||
`ReadFrom.Configuration` (standard `Serilog` section), so editing
|
||
`ScadaBridge:Logging:MinimumLevel` had no effect. Re-triaged the two options: the
|
||
"remove the option class" branch was rejected because `Component-Host.md` REQ-HOST-3
|
||
explicitly lists `ScadaBridge:Logging` → `LoggingOptions` and that design doc is outside
|
||
this task's edit scope — removing the class would create a fresh code-vs-doc drift.
|
||
Fix (the "consume it" branch): added a `LoggerConfigurationFactory.Build` that binds
|
||
`LoggingOptions`, parses `MinimumLevel` into a Serilog `LogEventLevel` (falling back
|
||
to `Information` for null/blank/unrecognised values), and pins it via
|
||
`MinimumLevel.Is(...)` on top of `ReadFrom.Configuration`; `Program.cs` now builds the
|
||
logger through this factory. `ScadaBridge:Logging:MinimumLevel` is now live.
|
||
Regression tests in new `LoggerConfigurationTests`:
|
||
`MinimumLevel_Warning_SuppressesInformationLogs`,
|
||
`MinimumLevel_Debug_AllowsDebugLogs`, and `MinimumLevel_Absent_DefaultsToInformation`
|
||
— they assert the configured level actually filters log events; they pass against the
|
||
new factory (and would fail against the old inline configuration, which ignored the
|
||
key). Full Host suite green (175 passed).
|
||
|
||
### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:146-148` |
|
||
|
||
**Description**
|
||
|
||
`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with
|
||
`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions`
|
||
(`src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a
|
||
`DownIfAlone` property — bound from `ScadaBridge:Cluster` via the Options pattern,
|
||
documented as "the design-doc requirement", default `true` — but a repo-wide search
|
||
shows it is referenced **nowhere outside its own declaration**. The Host therefore
|
||
ignores the bound value entirely: setting `ScadaBridge:Cluster:DownIfAlone` to `false`
|
||
in `appsettings.json` has no effect, the resolver still runs with `down-if-alone =
|
||
on`. This is the same class of defect as the resolved Host-011
|
||
(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is
|
||
declared, bound, and documented but never consumed, which silently misleads any
|
||
operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
|
||
hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the
|
||
ClusterInfrastructure project owns only the configuration contract, per the
|
||
`ClusterOptions` class comment).
|
||
|
||
**Recommendation**
|
||
|
||
Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone =
|
||
{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping
|
||
is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce
|
||
the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed
|
||
at `on` for ScadaBridge's two-node clusters, remove the `DownIfAlone` property and its
|
||
doc comment instead so code and configuration contract agree — but do not leave it
|
||
declared-but-dead.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed against
|
||
`AkkaHostedService.cs`: `BuildHocon` emitted `keep-oldest { down-if-alone = on }` as
|
||
a literal, and `ClusterOptions.DownIfAlone` was referenced nowhere outside its own
|
||
declaration, so the bound option was dead. Fix: `BuildHocon` now emits
|
||
`down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}`, consuming the bound
|
||
property. Regression tests in `HoconBuilderTests`: `BuildHocon_DownIfAloneTrue_EmitsOn`
|
||
and `BuildHocon_DownIfAloneFalse_EmitsOff` parse the resulting HOCON and assert the
|
||
`keep-oldest.down-if-alone` token tracks the property; the `false` case failed before
|
||
the fix (token was still `on`) and passes after. Full Host suite green (182 passed).
|
||
|
||
### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |
|
||
|
||
**Description**
|
||
|
||
`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0` —
|
||
`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`,
|
||
`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The
|
||
`F0` format specifier rounds to whole seconds, so any sub-second configuration is
|
||
silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as
|
||
`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or
|
||
treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole
|
||
seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan`
|
||
values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with
|
||
no warning — or, worse, `0s`. Rounding configured timing values without surfacing
|
||
the change is a correctness hazard for exactly the kind of failure-detection tuning
|
||
these options exist for.
|
||
|
||
**Recommendation**
|
||
|
||
Render durations without precision loss — emit milliseconds (e.g.
|
||
`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or
|
||
validate in `StartupValidator` that each cluster timing value is a positive whole
|
||
number of seconds and fail fast otherwise. Either way, do not silently round.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed: `BuildHocon`
|
||
rendered every duration via `TotalSeconds:F0`, rounding sub-second values — a 750ms
|
||
heartbeat became `1s` and a 500ms value would collapse to a degenerate `0s`. Fix:
|
||
durations are now rendered by a new `DurationHocon` helper that emits whole
|
||
milliseconds with an `ms` suffix (Akka's HOCON parser accepts it), so sub-second
|
||
configuration survives exactly; `BuildHocon` takes the transport heartbeat/failure
|
||
as `TimeSpan` rather than pre-rounded `double` seconds. Regression test
|
||
`HoconBuilderTests.BuildHocon_SubSecondTimings_PreservedWithoutRounding` configures
|
||
750ms/2600ms/500ms/2.5s/7.5s timings and asserts each round-trips through the parsed
|
||
HOCON unchanged; it failed before the fix (750ms parsed as `00:00:01`) and passes
|
||
after. Full Host suite green (182 passed).
|
||
|
||
### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:43-48`, `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.json:1-7` |
|
||
|
||
**Description**
|
||
|
||
REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
|
||
setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls
|
||
`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog`
|
||
configuration section — but neither `appsettings.json` nor either role-specific file
|
||
contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel`
|
||
map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that
|
||
actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and
|
||
`.WriteTo.File("logs/scadabridge-.log", rollingInterval: Day)` calls. As a result the
|
||
console output template, the file path, and the rolling interval cannot be changed
|
||
without recompiling — an operator cannot redirect logs, change the file location, or
|
||
add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration`
|
||
call is effectively a no-op because the section it reads does not exist.
|
||
|
||
**Recommendation**
|
||
|
||
Move the console and file sink definitions into a `Serilog` section in
|
||
`appsettings.json` (with `WriteTo` entries and the output template / file path /
|
||
rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the
|
||
hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to
|
||
state the sinks are intentionally code-defined — but the design doc currently says
|
||
"configuration-driven", so code and doc must be reconciled.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed: `LoggerConfigurationFactory.Build`
|
||
called `ReadFrom.Configuration` but neither `appsettings.json` nor either
|
||
role-specific file contained a `Serilog` section, so the two sinks that actually ran
|
||
were hard-coded `.WriteTo.Console(...)` / `.WriteTo.File(...)` calls in `Program.cs`
|
||
— uneditable without recompiling, contrary to REQ-HOST-8. Fix: added a `Serilog`
|
||
section to `appsettings.json` with `WriteTo` entries for the Console and File sinks
|
||
(carrying the output template, file path `logs/scadabridge-.log` and `Day` rolling
|
||
interval as args) and removed the hard-coded `.WriteTo` calls from `Program.cs`; the
|
||
factory's `ReadFrom.Configuration` (backed by the transitive
|
||
`Serilog.Settings.Configuration` package) now drives the sinks. Regression tests in
|
||
new `SerilogSinkConfigTests`: `ShippedAppSettings_HasSerilogSection_WithConsoleAndFileSinks`
|
||
asserts the shipped `appsettings.json` declares both sinks in a `Serilog:WriteTo`
|
||
array (fails against the pre-fix file, which had no `Serilog` section), and
|
||
`LoggerConfigurationFactory_AppliesConfiguredFileSink` proves a configuration-supplied
|
||
File sink reaches the built logger and writes to the configured path. Full Host suite
|
||
green (182 passed).
|
||
|
||
### Host-015 — `StartupRetry` retries on every exception type, including permanent failures
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/StartupRetry.cs:36-45` |
|
||
|
||
**Description**
|
||
|
||
`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard
|
||
`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller
|
||
wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which
|
||
on a Central node in production *validates* the schema version and throws if the
|
||
deployed schema does not match the expected migration set. A schema-version mismatch
|
||
is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it
|
||
8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes)
|
||
before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault
|
||
that is fatal on the first attempt. The retry helper is meant for *transient*
|
||
connection faults (the XML doc says exactly that: "the database may be briefly
|
||
unreachable"), but it cannot distinguish transient from permanent failures.
|
||
|
||
**Recommendation**
|
||
|
||
Restrict retries to transient faults — e.g. accept an `Func<Exception, bool>
|
||
isTransient` predicate and, for the migration call site, treat only
|
||
connection-class exceptions (`SqlException` with a connection/transport error
|
||
number, `TimeoutException`, socket errors) as retryable while letting
|
||
schema-validation / `InvalidOperationException` failures propagate immediately. Add
|
||
a `StartupRetryTests` case asserting a non-transient exception is rethrown after a
|
||
single attempt.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed: `ExecuteWithRetryAsync`
|
||
caught `Exception` with only the `when (attempt < maxAttempts)` guard, so it retried
|
||
any exception — including a permanent schema-version mismatch raised by the
|
||
migration call site, delaying the inevitable fatal exit by ~2 minutes. Fix: added an
|
||
optional `Func<Exception, bool> isTransient` parameter (default treats all
|
||
exceptions as transient, preserving prior behaviour for unclassified callers); the
|
||
catch guard is now `when (attempt < maxAttempts && isTransient(ex))`, so a
|
||
non-transient fault propagates after a single attempt. Added a
|
||
`StartupRetry.IsTransientDatabaseFault` classifier that returns `true` only for
|
||
connection-class faults (`TimeoutException`, `SocketException`, `SqlException` /
|
||
`DbException` by type-name match) and `false` for everything else — notably
|
||
schema-validation `InvalidOperationException`s — and the `Program.cs` migration call
|
||
site now passes it. Regression tests in `StartupRetryTests`:
|
||
`ExecuteWithRetry_NonTransientFailure_RethrowsAfterSingleAttempt` (runs exactly once
|
||
when `isTransient` returns false) and `ExecuteWithRetry_TransientThenPermanent_StopsAtPermanent`
|
||
(retries a `TimeoutException` then stops at a permanent `InvalidOperationException`).
|
||
Full Host suite green (182 passed).
|
||
|
||
### Host-016 — Site `CentralContactPoints` second entry targets the site's own remoting port
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:33-37` |
|
||
|
||
**Description**
|
||
|
||
The shipped site config sets `Node:RemotingPort = 8082` and lists
|
||
`Communication:CentralContactPoints` as
|
||
`["akka.tcp://scadabridge@localhost:8081", "akka.tcp://scadabridge@localhost:8082"]`.
|
||
The second contact point — port `8082` — is the **site's own** remoting endpoint,
|
||
not a central node. `SiteCommunicationActor` / `ClusterClient` uses these
|
||
addresses as initial contacts when discovering the central
|
||
`ClusterClientReceptionist`; a contact pointing at the site itself can never
|
||
reach the central receptionist and will be a permanent failure in the
|
||
initial-contact rotation. For the single-node dev loopback layout the first
|
||
contact (`8081`, central) succeeds and the bug is masked, but this is exactly
|
||
the kind of dev-config "example" that gets duplicated into multi-central
|
||
deployments — the same failure mode the resolved Host-004 finding called out
|
||
for the seed-node list. `StartupValidator` validates seed nodes against the
|
||
gRPC port (Host-004) but does not cross-check `CentralContactPoints` against
|
||
the site's own `RemotingPort`, so the misconfiguration passes silently.
|
||
|
||
**Recommendation**
|
||
|
||
Correct the shipped site example to list two central remoting endpoints (e.g.
|
||
`localhost:8081` for `central-a` and a distinct port for `central-b` in a
|
||
multi-node layout). Consider extending `StartupValidator` to reject any
|
||
`Communication:CentralContactPoints` entry whose host+port matches this site
|
||
node's `NodeHostname`+`RemotingPort`. Add a regression test in
|
||
`StartupValidatorTests` mirroring `Site_SeedNodeOnGrpcPort_FailsValidation`.
|
||
|
||
**Resolution (2026-05-28):** The shipped `appsettings.Site.json` `CentralContactPoints` entry that pointed at the site's own remoting port (`localhost:8082`) was removed — the dev-loopback default now lists only the single central node (`akka.tcp://scadabridge@localhost:8081`), which is the actually-reachable target in the single-node dev layout. A `_centralContactPoints` doc-key comment was added immediately above the array calling out the per-entry rule (each entry MUST be a central node's remoting endpoint, not the site's own remoting port) and explaining how to extend the list with a second central node (`akka.tcp://scadabridge@central-b-host:8081`) in a multi-central deployment so ClusterClient can fail over. The dangerous example pattern that would have been copied into multi-central configs no longer exists in the template. `StartupValidator` cross-check is left as a follow-up — the documented rule plus the corrected template removes the immediate misconfiguration risk.
|
||
|
||
### Host-017 — Site-shutdown ordering from REQ-HOST-7 is not wired
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:229-265`, `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcServer.cs` |
|
||
|
||
**Description**
|
||
|
||
REQ-HOST-7 documents an explicit four-step shutdown sequence for site nodes:
|
||
"(1) On `CoordinatedShutdown`, stop accepting new gRPC streams first.
|
||
(2) Cancel all active gRPC streams (triggering client-side reconnect).
|
||
(3) Tear down actors.
|
||
(4) Use `IHostApplicationLifetime.ApplicationStopping` to signal the gRPC
|
||
server." The site path in `Program.cs` (the `role == "Site"` branch) registers
|
||
no `IHostApplicationLifetime.ApplicationStopping` callback, and
|
||
`SiteStreamGrpcServer` exposes no "stop accepting" / "cancel all streams"
|
||
entry point — it has `SetReady` but no corresponding `SetUnavailable` or
|
||
`CancelAllStreams`. In practice, on `SIGTERM` Kestrel closes its listener
|
||
naturally and `AkkaHostedService.StopAsync` runs Akka `CoordinatedShutdown`,
|
||
but there is no explicit, ordered handoff that meets the documented contract:
|
||
in-flight streams are not actively cancelled before actors begin tearing down,
|
||
so clients see a stream that goes silent (and only times out via gRPC
|
||
keepalive) rather than a clean `Cancelled` they can reconnect on. This is a
|
||
contract-vs-code drift — either the design doc is overstating what is
|
||
implemented, or the implementation is incomplete.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `SiteStreamGrpcServer.CancelAllStreams()` method that flips a "shutting
|
||
down" flag (so `SubscribeSite` immediately fails new streams with
|
||
`StatusCode.Unavailable`) and cancels every entry's `Cts` in the `_streams`
|
||
map. In `Program.cs` site branch, resolve `IHostApplicationLifetime` and
|
||
register a callback on `ApplicationStopping` that calls `CancelAllStreams()`
|
||
before the Akka hosted service runs `CoordinatedShutdown` (or order via
|
||
`AkkaHostedService.StopAsync` itself — `IHostedService.StopAsync` runs in
|
||
reverse-registration order, so the gRPC server's lifetime can be sequenced
|
||
before Akka shutdown). Alternatively, reconcile REQ-HOST-7 with the actual
|
||
implementation if the explicit ordering is no longer intended. Add an
|
||
integration test under `tests/ZB.MOM.WW.ScadaBridge.Host.Tests` that starts a site host,
|
||
opens a stream, triggers shutdown, and asserts the stream completes with
|
||
`Cancelled` before the actor system tears down.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
REQ-HOST-7 steps (1)+(2) wired. `SiteStreamGrpcServer` gained:
|
||
- a monotonic `_shuttingDown` flag,
|
||
- `CancelAllStreams()` — flips the flag, cancels every `_activeStreams[*].Cts`
|
||
(with `ObjectDisposedException` swallow for entries cleaning themselves
|
||
up concurrently), idempotent on repeat calls,
|
||
- a `SubscribeInstance` guard that returns `Unavailable "Server shutting
|
||
down"` for new subscriptions arriving after the flag flips.
|
||
|
||
`Program.cs` site branch now resolves `IHostApplicationLifetime` and the
|
||
`SiteStreamGrpcServer` singleton, then registers
|
||
`ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams())`.
|
||
`ApplicationStopping` fires before any `IHostedService.StopAsync`, so the
|
||
gRPC server begins refusing new streams and tears down in-flight ones
|
||
BEFORE `AkkaHostedService` runs `CoordinatedShutdown` — matching REQ-HOST-7's
|
||
ordering. Clients observe a clean `Cancelled` and reconnect rather than a
|
||
silent stream that times out via keepalive (~25 s).
|
||
|
||
Two unit regression tests added to
|
||
`tests/ZB.MOM.WW.ScadaBridge.Communication.Tests/Grpc/SiteStreamGrpcServerTests.cs`:
|
||
`Host017_CancelAllStreams_CancelsActiveStreamsAndRefusesNewOnes` (active
|
||
streams complete, new ones rejected) and `Host017_CancelAllStreams_IsIdempotent`
|
||
(double-call safe). A full site-host integration test was deferred — the
|
||
unit suite covers both server-side invariants and the wiring is a single
|
||
`Register` line in `Program.cs`.
|
||
|
||
### Host-018 — Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json`, `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json`, `src/ZB.MOM.WW.ScadaBridge.Host/NodeOptions.cs:10-16` |
|
||
|
||
**Description**
|
||
|
||
`NodeOptions.NodeName` is documented as "the operator-configured semantic node
|
||
name used to stamp the SourceNode column on audit rows", with conventional
|
||
values `node-a`/`node-b` for site nodes and `central-a`/`central-b` for
|
||
central nodes. The CLAUDE.md "Centralized Audit Log" key-decision section
|
||
calls this out: `SourceNode` is meant to be carried verbatim through audit
|
||
telemetry and reconciliation, and is indexed via
|
||
`IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc)`. The docker per-node
|
||
configs (`docker/central-node-a/appsettings.Central.json`,
|
||
`docker/site-a-node-a/appsettings.Site.json`, etc.) all set
|
||
`ScadaBridge:Node:NodeName`. The **shipped, default** per-role files in
|
||
`src/ZB.MOM.WW.ScadaBridge.Host/` — the templates a developer running the binary
|
||
directly will use — do not. `NodeIdentityProvider` normalises an empty
|
||
`NodeName` to `null`, so dev audit rows carry a null `SourceNode` and the
|
||
indexed lookup never narrows. The dev examples should match the docker
|
||
examples; at minimum the field should appear in the shipped templates with a
|
||
placeholder explaining the convention.
|
||
|
||
**Recommendation**
|
||
|
||
Add `"NodeName": "central-a"` (or a placeholder like `"${NODE_NAME}"`) to
|
||
`appsettings.Central.json` and `"NodeName": "node-a"` to
|
||
`appsettings.Site.json`, with a short comment that the value must be set
|
||
per-node in multi-node deployments. Consider validating in `StartupValidator`
|
||
that `NodeName` is non-empty, or accept the null and document explicitly that
|
||
single-node dev deployments leave `SourceNode` null.
|
||
|
||
**Resolution (2026-05-28):** The shipped per-role templates now set `ScadaBridge:Node:NodeName` — `central-a` in `appsettings.Central.json` and `node-a` in `appsettings.Site.json` — so dev audit rows are stamped with a real `SourceNode` value (instead of `NodeIdentityProvider` normalising the missing key to `null`) and the indexed `IX_AuditLog_Node_Occurred` lookup actually narrows. A `_nodeName` doc-key comment was added beside each `Node` section explaining the convention (`central-a`/`central-b` for central, `node-a`/`node-b` for site), pointing at the docker per-node configs (which already overrode the field), and noting that the value must be overridden per-node in multi-node deployments and that an empty value still normalises to a `NULL` SourceNode. The shipped dev templates now match the per-node docker examples — a developer running the binary directly no longer sees a null `SourceNode`.
|
||
|
||
### Host-019 — Migration `StartupRetry` call drops the host `CancellationToken`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:154-165` |
|
||
|
||
**Resolution (2026-05-28):** Added a `Func<CancellationToken, Task>` overload of `StartupRetry.ExecuteWithRetryAsync` that forwards the retry-loop token into the operation, and the migration call site in `Program.cs` now passes `app.Lifetime.ApplicationStopping` as both the operation token (threaded to `MigrationHelper.ApplyOrValidateMigrationsAsync`) and the loop's `cancellationToken` (already honoured by the inter-attempt `Task.Delay`). A SIGTERM during the bounded retry window now tears down cleanly instead of waiting up to ~2 minutes for the loop to exhaust. The original `Func<Task>` overload still exists and delegates, so existing callers/tests are unchanged.
|
||
|
||
**Description**
|
||
|
||
`StartupRetry.ExecuteWithRetryAsync` accepts an optional
|
||
`CancellationToken cancellationToken = default` and observes it both at the
|
||
top of each attempt and inside the `Task.Delay` between retries. The migration
|
||
call site in `Program.cs` passes no token, so the helper runs with
|
||
`CancellationToken.None`. With `maxAttempts: 8`, `initialDelay: 2s`, and the
|
||
30s cap, a database that stays unreachable can keep the retry loop alive for
|
||
~2 minutes before the host process responds to `SIGTERM` / `Ctrl+C` /
|
||
Windows-Service stop. The `Program.cs` startup pipeline does not yet have a
|
||
host-lifetime token to forward at this point (the `app` is built but not
|
||
yet running), but `app.Lifetime.ApplicationStopping` is available the moment
|
||
`builder.Build()` returns. Threading it into the retry call honours the host
|
||
lifecycle and matches the helper's documented contract.
|
||
|
||
**Recommendation**
|
||
|
||
Pass `app.Lifetime.ApplicationStopping` (or `CancellationToken.None`
|
||
explicitly with a comment if intentional) into
|
||
`StartupRetry.ExecuteWithRetryAsync`. Add a `StartupRetryTests` case
|
||
exercising token-cancellation mid-backoff.
|
||
|
||
**Resolution**
|
||
|
||
_Open._
|
||
|
||
### Host-020 — `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs:36-43` |
|
||
|
||
**Description**
|
||
|
||
`LoggerConfigurationFactory.Build` reads the `Serilog` configuration section
|
||
via `ReadFrom.Configuration(configuration)` (which can include a
|
||
`MinimumLevel` block — the standard Serilog way to set the floor) and **then**
|
||
calls `.MinimumLevel.Is(minimumLevel)` derived from
|
||
`ScadaBridge:Logging:MinimumLevel`. Serilog's fluent builder applies the later
|
||
call, so any `Serilog:MinimumLevel:Default` an operator sets is silently
|
||
overridden by `ScadaBridge:Logging:MinimumLevel` (or by its
|
||
`Information` fallback when the ScadaBridge key is absent). There are now two
|
||
documented configuration paths for the same setting with non-obvious
|
||
precedence, and the override direction is the opposite of what most Serilog
|
||
users would expect (the more-specific `Serilog` section being the authority).
|
||
The XML doc on `Build` says "the explicit `MinimumLevel.Is` pins the floor"
|
||
but does not warn that the floor *overrides* the Serilog section's own
|
||
`MinimumLevel`.
|
||
|
||
**Recommendation**
|
||
|
||
Pick one mechanism: either (a) drop the `MinimumLevel.Is` call and let
|
||
`ReadFrom.Configuration` consume `Serilog:MinimumLevel`, migrating any docs/
|
||
deployments that reference `ScadaBridge:Logging:MinimumLevel`; or (b) keep the
|
||
current "ScadaBridge:Logging" path and reject `Serilog:MinimumLevel` if present
|
||
(throw at startup so the operator sees the conflict). At minimum, expand the
|
||
XML doc + REQ-HOST-8 to spell out the precedence explicitly.
|
||
|
||
**Resolution (2026-05-28):** `ScadaBridge:Logging:MinimumLevel` is now the documented single source of truth for the Serilog floor (Host-011's `LoggingOptions` binding), and the precedence is made visible — `LoggerConfigurationFactory.Build` writes a one-shot warning to `Console.Error` when both `ScadaBridge:Logging:MinimumLevel` and `Serilog:MinimumLevel` (or `Serilog:MinimumLevel:Default`) are present, naming both values and pointing the operator at the documented key. Order of operations is unchanged — `MinimumLevel.Is(...)` deliberately runs after `ReadFrom.Configuration(...)` so the ScadaBridge value wins — but the silent-override behaviour is now loud. The class XML doc gained a Host-020 paragraph explicitly spelling out the precedence. A test-visible `Build(..., TextWriter warningWriter)` overload mirrors the `ParseLevel` Host-022 pattern so the warning can be asserted in unit tests; the production four-arg overload delegates with `Console.Error`.
|
||
|
||
### Host-021 — Microsoft `Logging:LogLevel` section in `appsettings.json` is dead config under Serilog
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.json:2-6` |
|
||
|
||
**Description**
|
||
|
||
`appsettings.json` carries a Microsoft `Logging:LogLevel:Default = Information`
|
||
block. The `Logging:LogLevel` map is consumed by
|
||
`Microsoft.Extensions.Logging.ConfigurationConsoleLoggerOptions` and similar
|
||
provider configurations bound from the standard `Logging` section. The Host
|
||
calls `builder.Host.UseSerilog()`, which replaces the default
|
||
`ILoggerFactory` setup with Serilog as the **only** logger provider; Serilog
|
||
reads from `configuration.ReadFrom.Configuration(...)` which consumes the
|
||
`Serilog` section, **not** `Logging:LogLevel`. The result is that an operator
|
||
editing `Logging:LogLevel:Default` (a very natural thing to try, since it is
|
||
the .NET convention) sees no behaviour change — the section is dead config.
|
||
|
||
**Recommendation**
|
||
|
||
Either remove the `Logging:LogLevel` block from `appsettings.json` (Serilog
|
||
owns logging configuration in this Host), or replace it with a brief comment
|
||
explaining it is intentionally retained for non-Serilog tooling. Document the
|
||
authoritative location (`Serilog` + `ScadaBridge:Logging`) in
|
||
`Component-Host.md` REQ-HOST-8 if not already explicit.
|
||
|
||
**Resolution (2026-05-28):** Confirmed by repository-wide grep that no code reads `Logging:LogLevel` (the Host calls `builder.Host.UseSerilog()` which replaces the default `ILoggerFactory` setup with Serilog as the only provider), so the block was pure dead config. Removed the `Logging:LogLevel:Default = Information` block from `appsettings.json` and replaced it with a `_logging` doc-key comment explaining the rationale (Serilog is the sole provider) and pointing operators at the two authoritative keys: `ScadaBridge:Logging:MinimumLevel` for the floor (bound to `LoggingOptions` per Host-011) and the `Serilog` section for sinks (Host-014's `ReadFrom.Configuration`). The Host-014 regression test (`SerilogSinkConfigTests.ShippedAppSettings_HasSerilogSection_WithConsoleAndFileSinks`) still asserts the surviving `Serilog` section's shape, so removing the Microsoft block did not break the existing pinning.
|
||
|
||
### Host-022 — `ParseLevel` silently coerces unrecognised `MinimumLevel` to `Information`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs:50-55` |
|
||
|
||
**Resolution (2026-05-28):** `ParseLevel` now writes a one-shot warning to `Console.Error` (the logger isn't built yet at this point) when a non-null/non-blank `MinimumLevel` value fails to parse, naming the offending value and the `Information` fallback. Null/blank values continue to default silently (treated as "unset"). The helper gained a test-visible `TextWriter` overload so unit tests can capture the warning; the production path delegates to it with `Console.Error`. Tests `ParseLevel_UnrecognisedValue_FallsBackAndWarns`, `ParseLevel_NullOrBlank_FallsBackSilently`, and `ParseLevel_RecognisedValue_NoWarning` pin the behaviour.
|
||
|
||
**Description**
|
||
|
||
`LoggerConfigurationFactory.ParseLevel` uses
|
||
`Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed)` and
|
||
returns `LogEventLevel.Information` when parsing fails — without logging the
|
||
fallback. An operator who sets
|
||
`ScadaBridge:Logging:MinimumLevel = "Informaiton"` (a common typo) or
|
||
`"Verbose,Debug"` or any unrecognised value gets the default level silently;
|
||
there is no warning, no log line, no startup error. Combined with Host-020
|
||
(this is the only mechanism that pins the floor), a misspelt value is
|
||
invisible until someone wonders why the level change "didn't take". The
|
||
helper is small and could either fail-fast in `StartupValidator` or emit a
|
||
console warning before the logger is configured.
|
||
|
||
**Recommendation**
|
||
|
||
In `LoggerConfigurationFactory.Build`, when `loggingOptions.MinimumLevel` is
|
||
non-null/non-blank but does not parse to a valid `LogEventLevel`, write a
|
||
`Console.Error.WriteLine` warning (the logger is not yet built) and proceed
|
||
with `Information`. Alternatively, validate the value in `StartupValidator`
|
||
and fail fast — that matches the pattern used for other ScadaBridge
|
||
configuration keys. Add a `LoggerConfigurationTests` case asserting the
|
||
behaviour you choose.
|
||
|
||
**Resolution**
|
||
|
||
_Open._
|