Files
Joseph Doherty 7b0b9c7365 refactor: rename ScadaLink → ZB.MOM.WW.ScadaBridge (code + projects + namespaces)
Solution + 23 src projects + 26 test projects renamed; folders, csproj,
namespaces, and ScadaLinkDbContext/ScadaBridgeDbContext class updated.
ActorSystem "scadalink" → "scadabridge", Akka seed-node URLs migrated.
SQL roles/logins, LDAP domains, CLI command name, and CLI config dir
(~/.scadalink → ~/.scadabridge) also renamed.

Build green; 5 Host.Tests fail awaiting SQL login rename in next commit.
Pre-existing StaleTagMonitor timing flakes unchanged.

Rename script committed at tools/rename-to-scadabridge.sh.
2026-05-28 09:37:45 -04:00

66 KiB
Raw Permalink Blame History

Code Review — Host

Field Value
Module src/ZB.MOM.WW.ScadaBridge.Host
Design doc docs/requirements/Component-Host.md
Status Reviewed
Last reviewed 2026-05-28
Reviewer claude-agent
Commit reviewed 1eb6e97
Open findings 0

Summary

The Host module is the composition root for the entire ScadaBridge system: a single binary whose behaviour (Central vs Site) is driven entirely by configuration. The implementation is generally faithful to Component-Host.md — startup validation, role-based registration, Serilog enrichment, Windows Service support, dead-letter monitoring, CoordinatedShutdown, and gRPC hosting on site nodes are all present and backed by a solid test suite (tests/ZB.MOM.WW.ScadaBridge.Host.Tests).

The most significant problem is the readiness endpoint: /health/ready runs all registered health checks, including the leader-only active-node check, so a fully operational standby central node permanently reports 503 on /health/ready — directly contradicting REQ-HOST-4a, which defines readiness as cluster membership + DB connectivity (not leadership). Several other findings concern configuration that is validated-but-never-consumed (MachineDataDb), design-doc drift (Akka.Persistence is required by REQ-HOST-6 but the system uses no persistent actors), an incorrect seed-node entry in the shipped site config, blocking sync-over-async during startup, and unguarded string interpolation when building HOCON. None are crash/data-loss class, but the readiness bug is High because it breaks load-balancer behaviour with no safe workaround.

Re-review 2026-05-17 (commit 39d737e)

All eleven findings from the first review (Host-001..011) are confirmed Resolved in the current tree — the /health/ready predicate, the externalised secrets, the seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded migration retry and the live LoggingOptions.MinimumLevel are all present as described in their Resolution notes. This re-review walked all ten checklist categories again over the full module and recorded four new findings, none of them crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review note, Host-012 (Medium) confirms AkkaHostedService.BuildHocon hard-codes down-if-alone = on and never reads the ClusterOptions.DownIfAlone property, so that documented, bound option is dead. The remaining three are Low: :F0 rounding of cluster timing values silently degrades any sub-second configuration (Host-013), Serilog sink setup is hard-coded in Program.cs rather than configuration-driven as REQ-HOST-8 requires (Host-014), and StartupRetry retries indiscriminately on every exception type including permanent schema-validation failures (Host-015).

Re-review 2026-05-28 (commit 1eb6e97)

All fifteen prior findings (Host-001..015) remain Resolved in the current tree and the regressions introduced for them — Host-001's predicate, the externalised secrets, the Site GrpcPort/RemotingPort/seed-port validation rules, the escaped HOCON builder with DownIfAlone and millisecond-precision durations, the configuration-driven Serilog sinks, the transient-only StartupRetry classifier — are all still in place. This re-review walked the ten checklist categories over the full module again and recorded seven new findings, none of them crash/data-loss class. Host-016 (Medium) mirrors the resolved Host-004 shipped-config bug on the Communication side: appsettings.Site.json's second CentralContactPoints entry points at the site's own remoting port (localhost:8082) instead of central, an incorrect dev example that copies into multi-central deployments. Host-017 (Medium) flags a partial REQ-HOST-7 implementation — the documented site-shutdown ordering (stop accepting streams first, cancel active streams via IHostApplicationLifetime.ApplicationStopping, then tear down actors) is not wired: the site path registers no ApplicationStopping handler that signals SiteStreamGrpcServer, and the gRPC server exposes no cancel-all-streams entry point. The remaining five are Low: NodeOptions.NodeName (the operator-configured value stamped on AuditLog.SourceNode) is absent from both shipped per-role configs even though the docker per-node configs set it (Host-018); the migration StartupRetry call passes default for CancellationToken, so a SIGTERM during the bounded-retry window is ignored for up to ~2 minutes (Host-019); LoggerConfigurationFactory layers MinimumLevel.Is over ReadFrom.Configuration, so any Serilog:MinimumLevel an operator sets is silently overridden by ScadaBridge:Logging:MinimumLevel (Host-020); the shipped appsettings.json carries a Microsoft Logging:LogLevel block but Serilog is the only logger provider and the section is dead config (Host-021); and ParseLevel silently swallows an unrecognised MinimumLevel value (e.g. a typo) and falls back to Information with no warning (Host-022).

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs Host-001/004 resolved. Re-review: :F0 rounding of cluster timing values silently distorts sub-second durations (Host-013).
2 Akka.NET conventions CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues.
3 Concurrency & thread safety Host-005 resolved (StartAsync now genuinely async). DeadLetterMonitorActor state is actor-confined — no issues.
4 Error handling & resilience Host-010 resolved. Re-review: StartupRetry retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015).
5 Security Host-003 resolved — secrets externalised to env vars. No new issues.
6 Performance & resource management No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues.
7 Design-document adherence Host-002/007 resolved. Re-review: ClusterOptions.DownIfAlone bound/documented but never consumed — HOCON hard-codes on (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014).
8 Code organization & conventions Host-008/011 resolved. No new issues.
9 Testing coverage Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new down-if-alone, sub-second-duration, or non-transient-retry paths (Host-012/013/015).
10 Documentation & comments REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014).

Re-review (2026-05-28, 1eb6e97):

# Category Examined Notes
1 Correctness & logic bugs Re-review: appsettings.Site.json second CentralContactPoints entry targets the site's own remoting port instead of central (Host-016) — same defect class as the resolved Host-004 seed-list bug.
2 Akka.NET conventions CoordinatedShutdown, receptionist registration, singleton scoping, role-scoped site singletons, ClusterClient initial-contact wiring all reviewed; no new issues.
3 Concurrency & thread safety _trackedDisposables is locked on both sides of the lifecycle; _actorSystem publication is safe via the IHost startup await boundary. New Low: StartupRetry migration call passes default CancellationToken, so SIGTERM during the retry window is ignored (Host-019).
4 Error handling & resilience IsTransientDatabaseFault correctly classifies socket / timeout / SqlException; the retry helper itself remains sound. Host-019 is the resilience gap.
5 Security Secrets stay externalised; the _secrets placeholder comment is intact. No new issues.
6 Performance & resource management No new undisposed resources; gRPC stream lifetime cap remains correct. No new issues.
7 Design-document adherence Re-review: REQ-HOST-7 site-shutdown ordering — stop accepting new streams, cancel active streams via ApplicationStopping, then tear down actors — is not wired in Program.cs (Host-017).
8 Code organization & conventions Re-review: NodeOptions.NodeName is absent from the shipped per-role configs even though it stamps AuditLog.SourceNode (Host-018); the appsettings Logging:LogLevel Microsoft section is dead config under Serilog (Host-021).
9 Testing coverage Strong existing suite. No coverage for the Site CentralContactPoints second-entry rule (Host-016), the site-shutdown ordering (Host-017), the NodeName-absent shipped config (Host-018), the unused CancellationToken parameter (Host-019), the MinimumLevel.Is override semantics (Host-020) or the ParseLevel silent fallback (Host-022).
10 Documentation & comments Re-review: layered MinimumLevel.Is / ReadFrom.Configuration semantics are not surfaced — an operator-set Serilog:MinimumLevel is silently overridden by ScadaBridge:Logging:MinimumLevel (Host-020); ParseLevel silently coerces a misspelled level to Information with no warning (Host-022).

Findings

Host-001 — /health/ready includes the leader-only active-node check

Severity High
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:135-145

Description

/health/ready is mapped with MapHealthChecks("/health/ready", ...) and no Predicate, so it executes every registered check: database, akka-cluster and active-node. ActiveNodeHealthCheck (Health/ActiveNodeHealthCheck.cs:38) returns Unhealthy on any node that is not the cluster leader. As a result a standby central node that is fully operational (cluster member Up, database reachable) still returns 503 on /health/ready. This contradicts REQ-HOST-4a, which defines readiness as cluster membership + DB connectivity + singletons — explicitly not leadership. /health/active is the endpoint intended to report leadership. A load balancer using /health/ready to decide whether a node may serve traffic will permanently treat the standby as unready, defeating failover readiness. No test covers this: HealthCheckTests.HealthReady_Endpoint_ReturnsResponse only asserts a response is returned, not the standby semantics.

Recommendation

Add a Predicate to the /health/ready mapping that excludes the active-node check, e.g. Predicate = check => check.Name != "active-node" (or tag the readiness checks and filter by tag). Add a regression test asserting a non-leader node returns 200 on /health/ready.

Resolution

Resolved 2026-05-16 (commit <pending>). Root cause confirmed against Program.cs: the /health/ready mapping had no Predicate, so it executed all three registered checks including the leader-only active-node check, while ActiveNodeHealthCheck returns Unhealthy on any non-leader node — making a fully operational standby central node permanently report 503. Fix: added Predicate = check => check.Name != "active-node" to the /health/ready HealthCheckOptions, so readiness now reflects cluster membership + DB connectivity only (REQ-HOST-4a); leadership remains reported solely by /health/active. Regression test HealthCheckTests.HealthReady_Endpoint_ExcludesActiveNodeCheck asserts the active-node check name does not appear in the /health/ready response body; it failed before the fix and passes after. Full Host suite green (156 passed).

Host-002 — Akka.Persistence required by REQ-HOST-6 is not configured and not used

Severity Medium
Category Design-document adherence
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:70-108, docs/requirements/Component-Host.md REQ-HOST-6

Description

REQ-HOST-6 states the Host "must configure the Akka.NET actor system using Akka.Hosting with ... Persistence: Configured with the appropriate journal and snapshot store (SQL for central, SQLite for site)." The HOCON built in AkkaHostedService.StartAsync contains no akka.persistence section, no journal and no snapshot-store plugin, and ZB.MOM.WW.ScadaBridge.Host.csproj references neither Akka.Persistence.Hosting nor any persistence plugin (the design doc Dependencies list Akka.Persistence.Hosting). A repo-wide search finds no PersistentActor / ReceivePersistentActor subclasses — the system deliberately uses custom SQLite storage services instead. The code is internally consistent, but the design document is stale: it mandates a subsystem that does not exist. This is a documented-vs-actual drift that will mislead future maintainers and any audit against REQ-HOST-6.

Recommendation

Update Component-Host.md REQ-HOST-6 and the Dependencies list to remove the Akka.Persistence requirement (or explicitly state persistence is provided by component-owned SQLite storage, not Akka.Persistence). If persistence is intended, add the plugin packages and HOCON. Either way, code and doc must agree.

Resolution

Resolved 2026-05-16 (commit <pending>). Confirmed accurate: a repo-wide search finds no PersistentActor / ReceivePersistentActor subclasses anywhere in src/, no akka.persistence section in the HOCON built by AkkaHostedService.StartAsync, and ZB.MOM.WW.ScadaBridge.Host.csproj references no persistence plugin packages. The Host code is correct and internally consistent; the defect was a stale design doc. Fix: updated docs/requirements/Component-Host.md — REQ-HOST-6 no longer lists Persistence as a configured subsystem and now carries an explicit Persistence note stating that durable state is owned by individual components and persisted through component-owned stores (SQLite at sites, MS SQL centrally), with no Akka journal/snapshot store and no PersistentActor subclasses by design. The Responsibilities line and the Dependencies entry (Akka.Persistence.Hosting removed) were corrected to match. Code and doc now agree; no code or test change required.

Host-003 — Secrets committed in plaintext in appsettings.Central.json

Severity Medium
Category Security
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json:20-31

Description

appsettings.Central.json contains real-looking secrets in plaintext, checked into source control: SQL Server passwords in the ConfigurationDb / MachineDataDb connection strings (Password=ScadaBridge_Dev1#), an LDAP service-account password (LdapServiceAccountPassword: "password"), and a JWT signing key (JwtSigningKey: "scadabridge-dev-jwt-signing-key-..."). Even though these are intended as development defaults, shipping them in the default config invites them being reused verbatim in production, and a committed JWT signing key allows anyone with repo access to forge session tokens. TrustServerCertificate=true additionally disables TLS validation for the SQL connection.

Recommendation

Move all secrets out of committed appsettings*.json into environment variables, user-secrets, or a secret store. Keep only non-sensitive structural defaults in the file and document the required environment variables. At minimum add a clear comment that these values are dev-only and must be overridden, and rotate the JWT key per environment.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against appsettings.Central.json: the committed file carried real-looking secrets in plaintext — SQL Server passwords (Password=ScadaBridge_Dev1#) in both connection strings, an LDAP service-account password, and a JWT signing key. Fix: all four secrets were removed from the committed file and replaced with non-functional ${...} placeholder references (ConfigurationDb / MachineDataDb, LdapServiceAccountPassword, JwtSigningKey). A new top-level _secrets note documents that the Host's configuration builder (AddEnvironmentVariables()) overlays the real values supplied via environment variables (ScadaBridge__Database__ConfigurationDb, ScadaBridge__Database__MachineDataDb, ScadaBridge__Security__LdapServiceAccountPassword, ScadaBridge__Security__JwtSigningKey); the placeholders are intentionally invalid so a misconfigured deployment fails loudly rather than silently using a committed key. Regression test class ConfigSecretsTests asserts the committed file carries no plaintext Password= value, no committed LDAP service-account password, and no committed JWT signing key; all three tests failed before the fix and pass after. Tests that drive the full Program startup pipeline against the real SQL provider (HealthCheckTests, HostStartupTests.CentralRole_StartsWithoutError) were adapted to supply the local dev connection strings themselves via the new CentralDbTestEnvironment test helper (environment variables) — they must no longer depend on committed secrets. Note: the docker/central-node-*/appsettings.Central.json files still contain the same dev secrets but lie outside this task's permitted edit scope; they should receive the same treatment in a follow-up.

Host-004 — Site seed-node list points at the gRPC port, not a remoting port

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:10-19

Description

The shipped site config sets Node:RemotingPort = 8082 and Node:GrpcPort = 8083, but Cluster:SeedNodes is ["akka.tcp://scadabridge@localhost:8082", "akka.tcp://scadabridge@localhost:8083"]. The second seed node targets 8083, which is the Kestrel HTTP/2 gRPC port — not an Akka remoting endpoint. A node attempting to join via that seed will try to establish an Akka.Remote TCP association against the gRPC listener and fail. StartupValidator only checks that ≥2 seed nodes exist (StartupValidator.cs:54-56), so this misconfiguration passes validation silently. For the single-node dev site it is harmless (the first seed succeeds), but it is an incorrect example that will be copied into multi-node site configs.

Recommendation

Correct the site seed-node list to reference the two site nodes' remoting ports (e.g. 8082 and 8084), never the gRPC port. Consider extending StartupValidator to reject a seed node whose port equals this node's GrpcPort.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against appsettings.Site.json: with Node:RemotingPort = 8082 and Node:GrpcPort = 8083, the second Cluster:SeedNodes entry was akka.tcp://scadabridge@localhost:8083 — the Kestrel HTTP/2 gRPC port, not an Akka.Remote endpoint. StartupValidator only checked seed-node count (≥2), so the misconfiguration passed silently. Fix, two parts: (1) the shipped appsettings.Site.json second seed entry was corrected to a remoting port (localhost:8084); (2) StartupValidator.Validate was extended — for Site nodes it now parses each seed node's TCP port (via a new SeedNodePort helper) and rejects any entry whose port equals the node's GrpcPort, using the resolved GrpcPort including the 8083 NodeOptions default when the key is absent. The seed-node-count check was hoisted above the Site block so the new rule can reuse the parsed list. Regression tests in StartupValidatorTests: Site_SeedNodeOnGrpcPort_FailsValidation, Site_SeedNodeOnDefaultGrpcPort_FailsValidation (default-8083 path), Site_SeedNodesOnRemotingPort_PassesValidation, and Central_SeedNodeOnPort8083_PassesValidation (rule is Site-only) — all failed appropriately before the fix and pass after.

Host-005 — Blocking sync-over-async (GetAwaiter().GetResult()) inside StartAsync

Severity Low
Category Concurrency & thread safety
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:345

Description

RegisterSiteActors calls storeAndForwardService.StartAsync().GetAwaiter().GetResult() synchronously, blocking inside the IHostedService.StartAsync path. StartAsync is itself declared synchronous (returns Task.CompletedTask), so the work cannot be awaited cleanly. Blocking on async work risks thread-pool starvation during startup and, if the awaited operation captures a synchronization context, deadlock. It also hides exceptions behind an AggregateException wrapper.

Recommendation

Make AkkaHostedService.StartAsync genuinely async and await storeAndForwardService.StartAsync(cancellationToken). Propagate the CancellationToken and let exceptions surface as the original type.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: RegisterSiteActors called storeAndForwardService.StartAsync().GetAwaiter().GetResult() synchronously inside the IHostedService.StartAsync path, blocking a startup thread and wrapping any failure in AggregateException. Fix: AkkaHostedService.StartAsync is now genuinely async Task; the site path was renamed RegisterSiteActorsAsync and made async; the store-and-forward init is now await storeAndForwardService.StartAsync() (the service's own signature, in another module, takes no CancellationToken, so the token is honoured via a cancellationToken.ThrowIfCancellationRequested() checkpoint immediately before the await). No blocking, no sync-context deadlock risk, and the original exception type now surfaces. No dedicated regression test: the only observable change is non-blocking/exception-unwrapping behaviour on a path that requires the full site DI graph plus a real SQLite store; the existing HostStartupTests.SiteRole_* and CompositionRootTests continue to exercise the async startup pipeline and remain green (175 passed).

Host-006 — HOCON assembled by unescaped string interpolation

Severity Low
Category Akka.NET conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:70-108

Description

The Akka HOCON is built with an interpolated string that injects _nodeOptions.NodeHostname, _clusterOptions.SeedNodes, the computed roles, and SplitBrainResolverStrategy directly into the configuration text. Values are not escaped. A hostname or seed-node string containing a quote, backslash, brace, or comment sequence would corrupt the HOCON and produce a confusing parse error far from the real cause; SplitBrainResolverStrategy is interpolated without quoting, so a value with whitespace breaks the document. Building cluster configuration from raw string concatenation is also harder to maintain than the typed Akka.Hosting builder the design doc (REQ-HOST-6) actually calls for ("via Akka.Hosting").

Recommendation

Prefer the Akka.Hosting AddAkka(...) builder with strongly-typed WithRemoting, WithClustering, and split-brain-resolver configuration instead of hand-built HOCON. If HOCON must be retained, validate/escape interpolated values (especially hostname and seed nodes) before substitution.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: the Akka HOCON was built by an interpolated string injecting NodeHostname, SeedNodes, roles and SplitBrainResolverStrategy directly into the document text with no escaping — a value containing a quote, backslash or whitespace would corrupt or misparse the document. Re-triaged the recommendation: switching wholesale to the typed Akka.Hosting builder is a larger refactor than this Low finding warrants and would touch how the whole cluster is bootstrapped, so the HOCON form was retained but made safe. Fix: the HOCON construction was extracted into a testable static AkkaHostedService.BuildHocon, and every interpolated string value is now routed through a QuoteHocon helper that wraps the value in a double-quoted HOCON literal and escapes embedded backslashes and quotes — NodeHostname, every seed node, every role, and SplitBrainResolverStrategy (the last was previously unquoted entirely, so whitespace alone would break it). Regression tests in new HoconBuilderTests: BuildHocon_HostnameWithQuote_DoesNotCorruptDocument (a hostname with a " still parses and following keys stay intact), BuildHocon_StrategyWithWhitespace_RemainsQuoted, and BuildHocon_PlainValues_ParsesAndPreservesValues. With the old raw interpolation a quote-containing hostname produced hostname = "evil"host", closing the literal early and corrupting following keys; the escaped form parses correctly. Full Host suite green (175 passed).

Host-007 — REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced

Severity Low
Category Design-document adherence
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:43-47

Description

REQ-HOST-4 requires: "Site nodes must have GrpcPort in valid port range (165535) and different from RemotingPort." StartupValidator validates the GrpcPort range but never compares it to RemotingPort. A site config that sets both ports to the same value passes validation and then fails opaquely at runtime when Kestrel and Akka.Remote both try to bind the port. The GrpcPort range check is also skipped entirely when the key is absent (grpcPortStr != null), relying on the NodeOptions default of 8083 — acceptable, but the equality rule is the missing piece.

Recommendation

Add a check in the role == "Site" block: if GrpcPort (resolved, including the 8083 default) equals RemotingPort, add an error "ScadaBridge:Node:GrpcPort must differ from RemotingPort".

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against StartupValidator.cs: the role == "Site" block validated the GrpcPort range but never compared it to RemotingPort, so a site config setting both ports equal passed validation and then failed opaquely when Kestrel and Akka.Remote both bound the port. Fix: added if (port == grpcPort) errors.Add("ScadaBridge:Node:GrpcPort must differ from RemotingPort") in the Site block, using the resolved GrpcPort (including the 8083 NodeOptions default when the key is absent) against the parsed RemotingPort. Regression tests in StartupValidatorTests: Site_GrpcPortEqualsRemotingPort_FailsValidation, Site_DefaultGrpcPortEqualsRemotingPort_FailsValidation (default-8083 path), and Site_GrpcPortDiffersFromRemotingPort_PassesValidation — all failed appropriately before the fix and pass after. Full Host suite green (175 passed).

Host-008 — MachineDataDb is validated and declared but never consumed

Severity Low
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:33-34, src/ZB.MOM.WW.ScadaBridge.Host/DatabaseOptions.cs:6

Description

StartupValidator requires a non-empty ScadaBridge:Database:MachineDataDb connection string for Central nodes, and DatabaseOptions exposes a MachineDataDb property, but a repo-wide search shows the value is never read anywhere outside the Host module — only ConfigurationDb is passed to AddConfigurationDatabase (Program.cs:83-85). The Host therefore fails startup if MachineDataDb is missing even though nothing uses it. This is either dead configuration that should be removed or a missing wiring (a machine-data DbContext that was never registered).

Recommendation

Determine whether a machine-data store is actually required. If yes, wire it into the relevant component's DI registration. If no, remove the MachineDataDb validation rule, the DatabaseOptions property, and the key from appsettings.Central.json.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: a repo-wide search shows MachineDataDb is read nowhere outside the Host module — only ConfigurationDb is wired into AddConfigurationDatabase — yet StartupValidator failed Central startup when it was absent. No machine-data store exists or is registered, so this is dead configuration. Fix (the "if no" branch of the recommendation): removed the MachineDataDb validation rule from StartupValidator, the MachineDataDb property from DatabaseOptions, the MachineDataDb key from appsettings.Central.json, and the now-pointless MachineDataDb reference from that file's _secrets note and from the CentralDbTestEnvironment test helper. Regression test: the prior StartupValidatorTests.Central_MissingMachineDataDb_FailsValidation was inverted to Central_MissingMachineDataDb_PassesValidation (absence must now NOT fail startup), and MachineDataDb was removed from the shared ValidCentralConfig fixture so all Central tests prove the key is no longer required; it failed under the old code and passes after. Full Host suite green (175 passed). Note: appsettings.Central.json under docker/central-node-* is outside this task's edit scope and still carries the key harmlessly — it will simply be ignored; a follow-up may remove it for tidiness.

Host-009 — StartAsync reports success before role actors are confirmed running

Severity Low
Category Akka.NET conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:127-141

Description

StartAsync creates actors with ActorOf (a fire-and-forget operation — the actor's PreStart runs asynchronously on its own thread) and then returns Task.CompletedTask. For site nodes, grpcServer.SetReady(_actorSystem) is called synchronously at the end of RegisterSiteActors, marking the gRPC server ready even though SiteCommunicationActor, the deployment-manager singleton, and the ClusterClient may not yet have completed their PreStart/initial-contact handshake. REQ-HOST-7 requires "Actor system and SiteStreamManager ... initialized before gRPC begins accepting connections" — SiteStreamManager.Initialize is awaited-equivalent, but the broader actor graph is not. The window is small and the gRPC server still rejects streams until SetReady, so impact is limited, but readiness is being asserted optimistically.

Recommendation

If strict ordering matters, gate SetReady on confirmation that SiteCommunicationActor is fully initialized (e.g. an Ask round-trip or a readiness message), or document explicitly that gRPC readiness only guarantees the actor system exists, not that the cluster handshake has completed.

Resolution

Resolved 2026-05-16 (commit pending) — re-triaged, documentation fix. Root cause confirmed: SetReady is called optimistically after ActorOf registrations whose PreStart/initial-contact handshakes run asynchronously. Re-triaged the first recommended option (gating SetReady on a SiteCommunicationActor Ask round-trip): that option is incorrect because the meaningful asynchronous step is the ClusterClient initial-contact handshake with central — gating site gRPC readiness on it would prevent a site from coming up and streaming locally while central is briefly unreachable, contradicting the design's site-autonomy model. Confirmed SiteStreamGrpcServer already rejects any stream opened before SetReady with StatusCode.Unavailable, so the window is benign. Fix (the second, correct option): the SetReady call site now carries an explicit comment documenting the narrow readiness contract — by SetReady the actor system exists, SiteStreamManager.Initialize has run, and every role actor has been created with ordered synchronous ActorOf + registration Tells; what is deliberately NOT guaranteed is PreStart completion or the central handshake. No regression test is meaningful for a documentation-only clarification of an intentional design contract; the existing HostStartupTests.SiteRole_* cover the surrounding startup path. Full Host suite green (175 passed).

Host-010 — No retry/backoff around startup preconditions (DB migration, readiness)

Severity Low
Category Error handling & resilience
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:112-125

Description

On Central startup the Host opens a DI scope and calls MigrationHelper.ApplyOrValidateMigrationsAsync directly. If the SQL Server is not yet reachable (common in container orchestration where the DB and app start together), the call throws, the top-level catch logs Fatal, and the process exits. There is no bounded retry/backoff to tolerate a database that is briefly unavailable at boot. The design intent (REQ-HOST-4a, readiness gating, 503 until ready) is about serving traffic, but the migration step happens before the host even runs and has no such tolerance.

Recommendation

Wrap the migration/validation step in a bounded retry with exponential backoff (e.g. Polly), or move schema apply behind the readiness gate so the process stays up and reports 503 until the database becomes reachable.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: Central startup called MigrationHelper.ApplyOrValidateMigrationsAsync directly with no tolerance for a database that is briefly unreachable at boot — a first connection failure threw, the top-level catch logged Fatal, and the process exited. Fix: added a new StartupRetry.ExecuteWithRetryAsync helper (bounded retry with capped exponential backoff — no Polly dependency, keeping the Host's package set minimal) and wrapped the migration step in it (maxAttempts: 8, initialDelay: 2s, doubling, capped at 30s). Each retry opens a fresh DI scope so a transient connection fault does not leave a poisoned DbContext; once attempts are exhausted the last exception propagates and the existing fatal-log path still fires. Regression tests in new StartupRetryTests: ExecuteWithRetry_SucceedsFirstTry_RunsOnce, ExecuteWithRetry_TransientFailures_RetriesUntilSuccess (fails twice then succeeds), and ExecuteWithRetry_ExhaustsAttempts_RethrowsLastException (verifies the original exception type/message surfaces after exhaustion) — they fail to compile/pass against the pre-fix code (the helper did not exist) and pass after. Full Host suite green (175 passed).

Host-011 — LoggingOptions.MinimumLevel is dead configuration

Severity Low
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/LoggingOptions.cs:5, src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:42-50

Description

LoggingOptions exposes a MinimumLevel property bound from ScadaBridge:Logging (SiteServiceRegistration.BindSharedOptions), and both appsettings.Central.json and appsettings.Site.json set "Logging": { "MinimumLevel": "Information" }. However Serilog is configured purely via ReadFrom.Configuration(configuration), which reads the standard Serilog section — not ScadaBridge:Logging. The LoggingOptions.MinimumLevel value is never read by any code, so changing it has no effect. This is misleading: an operator editing ScadaBridge:Logging:MinimumLevel expecting a log-level change will see nothing happen.

Recommendation

Either consume LoggingOptions.MinimumLevel when configuring the Serilog LoggerConfiguration (e.g. set MinimumLevel.Is(...) from it), or remove the option class and the ScadaBridge:Logging sections and rely solely on the Serilog configuration section. Keep one mechanism, not two.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: LoggingOptions.MinimumLevel was bound from ScadaBridge:Logging but Serilog was configured purely via ReadFrom.Configuration (standard Serilog section), so editing ScadaBridge:Logging:MinimumLevel had no effect. Re-triaged the two options: the "remove the option class" branch was rejected because Component-Host.md REQ-HOST-3 explicitly lists ScadaBridge:LoggingLoggingOptions and that design doc is outside this task's edit scope — removing the class would create a fresh code-vs-doc drift. Fix (the "consume it" branch): added a LoggerConfigurationFactory.Build that binds LoggingOptions, parses MinimumLevel into a Serilog LogEventLevel (falling back to Information for null/blank/unrecognised values), and pins it via MinimumLevel.Is(...) on top of ReadFrom.Configuration; Program.cs now builds the logger through this factory. ScadaBridge:Logging:MinimumLevel is now live. Regression tests in new LoggerConfigurationTests: MinimumLevel_Warning_SuppressesInformationLogs, MinimumLevel_Debug_AllowsDebugLogs, and MinimumLevel_Absent_DefaultsToInformation — they assert the configured level actually filters log events; they pass against the new factory (and would fail against the old inline configuration, which ignored the key). Full Host suite green (175 passed).

Host-012 — down-if-alone hard-coded in HOCON; ClusterOptions.DownIfAlone is never read

Severity Medium
Category Design-document adherence
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:146-148

Description

AkkaHostedService.BuildHocon emits the split-brain-resolver block with keep-oldest { down-if-alone = on } as a literal constant. ClusterOptions (src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/ClusterOptions.cs:74) exposes a DownIfAlone property — bound from ScadaBridge:Cluster via the Options pattern, documented as "the design-doc requirement", default true — but a repo-wide search shows it is referenced nowhere outside its own declaration. The Host therefore ignores the bound value entirely: setting ScadaBridge:Cluster:DownIfAlone to false in appsettings.json has no effect, the resolver still runs with down-if-alone = on. This is the same class of defect as the resolved Host-011 (LoggingOptions.MinimumLevel was dead config) — a configuration option that is declared, bound, and documented but never consumed, which silently misleads any operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same hard-coding; it is recorded here because BuildHocon is Host-owned code (the ClusterInfrastructure project owns only the configuration contract, per the ClusterOptions class comment).

Recommendation

Make BuildHocon consume clusterOptions.DownIfAlone — emit down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")} (the value is a bool, so no escaping is needed). Add a HoconBuilderTests case asserting both true and false produce the corresponding down-if-alone token. If the flag is genuinely meant to be fixed at on for ScadaBridge's two-node clusters, remove the DownIfAlone property and its doc comment instead so code and configuration contract agree — but do not leave it declared-but-dead.

Resolution

Resolved 2026-05-17 (commit <pending>). Root cause confirmed against AkkaHostedService.cs: BuildHocon emitted keep-oldest { down-if-alone = on } as a literal, and ClusterOptions.DownIfAlone was referenced nowhere outside its own declaration, so the bound option was dead. Fix: BuildHocon now emits down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}, consuming the bound property. Regression tests in HoconBuilderTests: BuildHocon_DownIfAloneTrue_EmitsOn and BuildHocon_DownIfAloneFalse_EmitsOff parse the resulting HOCON and assert the keep-oldest.down-if-alone token tracks the property; the false case failed before the fix (token was still on) and passes after. Full Host suite green (182 passed).

Host-013 — :F0 rounding of cluster timing values silently degrades sub-second configuration

Severity Low
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:135-136,145,151-152

Description

BuildHocon renders every duration into HOCON via TotalSeconds:F0transportHeartbeatSec:F0, transportFailureSec:F0, StableAfter.TotalSeconds:F0, HeartbeatInterval.TotalSeconds:F0, FailureDetectionThreshold.TotalSeconds:F0. The F0 format specifier rounds to whole seconds, so any sub-second configuration is silently distorted: a HeartbeatInterval of 00:00:00.5 renders as heartbeat-interval = 0s (a degenerate / invalid value that Akka will reject or treat as zero), and 00:00:02.6 becomes 3s. The shipped defaults are all whole seconds so this is latent, but the configuration model accepts arbitrary TimeSpan values and an operator tuning a heartbeat to e.g. 750ms would get a 1s value with no warning — or, worse, 0s. Rounding configured timing values without surfacing the change is a correctness hazard for exactly the kind of failure-detection tuning these options exist for.

Recommendation

Render durations without precision loss — emit milliseconds (e.g. heartbeat-interval = {ts.TotalMilliseconds:F0}ms) so sub-second values survive, or validate in StartupValidator that each cluster timing value is a positive whole number of seconds and fail fast otherwise. Either way, do not silently round.

Resolution

Resolved 2026-05-17 (commit <pending>). Root cause confirmed: BuildHocon rendered every duration via TotalSeconds:F0, rounding sub-second values — a 750ms heartbeat became 1s and a 500ms value would collapse to a degenerate 0s. Fix: durations are now rendered by a new DurationHocon helper that emits whole milliseconds with an ms suffix (Akka's HOCON parser accepts it), so sub-second configuration survives exactly; BuildHocon takes the transport heartbeat/failure as TimeSpan rather than pre-rounded double seconds. Regression test HoconBuilderTests.BuildHocon_SubSecondTimings_PreservedWithoutRounding configures 750ms/2600ms/500ms/2.5s/7.5s timings and asserts each round-trips through the parsed HOCON unchanged; it failed before the fix (750ms parsed as 00:00:01) and passes after. Full Host suite green (182 passed).

Host-014 — Serilog sinks are hard-coded in Program.cs, not configuration-driven (REQ-HOST-8)

Severity Low
Category Design-document adherence
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:43-48, src/ZB.MOM.WW.ScadaBridge.Host/appsettings.json:1-7

Description

REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink setup (console and file sinks at minimum)". LoggerConfigurationFactory.Build calls .ReadFrom.Configuration(configuration), which reads the standard Serilog configuration section — but neither appsettings.json nor either role-specific file contains a Serilog section (only a Microsoft Logging section with a LogLevel map, which Serilog's ReadFrom.Configuration does not consume). The two sinks that actually run are appended in Program.cs as hard-coded .WriteTo.Console(...) and .WriteTo.File("logs/scadabridge-.log", rollingInterval: Day) calls. As a result the console output template, the file path, and the rolling interval cannot be changed without recompiling — an operator cannot redirect logs, change the file location, or add a sink via configuration, contrary to REQ-HOST-8. The ReadFrom.Configuration call is effectively a no-op because the section it reads does not exist.

Recommendation

Move the console and file sink definitions into a Serilog section in appsettings.json (with WriteTo entries and the output template / file path / rolling interval as arguments) so ReadFrom.Configuration drives them, and drop the hard-coded .WriteTo calls from Program.cs. Alternatively, update REQ-HOST-8 to state the sinks are intentionally code-defined — but the design doc currently says "configuration-driven", so code and doc must be reconciled.

Resolution

Resolved 2026-05-17 (commit <pending>). Root cause confirmed: LoggerConfigurationFactory.Build called ReadFrom.Configuration but neither appsettings.json nor either role-specific file contained a Serilog section, so the two sinks that actually ran were hard-coded .WriteTo.Console(...) / .WriteTo.File(...) calls in Program.cs — uneditable without recompiling, contrary to REQ-HOST-8. Fix: added a Serilog section to appsettings.json with WriteTo entries for the Console and File sinks (carrying the output template, file path logs/scadabridge-.log and Day rolling interval as args) and removed the hard-coded .WriteTo calls from Program.cs; the factory's ReadFrom.Configuration (backed by the transitive Serilog.Settings.Configuration package) now drives the sinks. Regression tests in new SerilogSinkConfigTests: ShippedAppSettings_HasSerilogSection_WithConsoleAndFileSinks asserts the shipped appsettings.json declares both sinks in a Serilog:WriteTo array (fails against the pre-fix file, which had no Serilog section), and LoggerConfigurationFactory_AppliesConfiguredFileSink proves a configuration-supplied File sink reaches the built logger and writes to the configured path. Full Host suite green (182 passed).

Host-015 — StartupRetry retries on every exception type, including permanent failures

Severity Low
Category Error handling & resilience
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/StartupRetry.cs:36-45

Description

StartupRetry.ExecuteWithRetryAsync catches Exception with only the guard when (attempt < maxAttempts) — it retries on any exception type. Its sole caller wraps MigrationHelper.ApplyOrValidateMigrationsAsync (Program.cs:124-134), which on a Central node in production validates the schema version and throws if the deployed schema does not match the expected migration set. A schema-version mismatch is a permanent error — retrying it cannot succeed — yet StartupRetry will retry it 8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes) before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault that is fatal on the first attempt. The retry helper is meant for transient connection faults (the XML doc says exactly that: "the database may be briefly unreachable"), but it cannot distinguish transient from permanent failures.

Recommendation

Restrict retries to transient faults — e.g. accept an Func<Exception, bool> isTransient predicate and, for the migration call site, treat only connection-class exceptions (SqlException with a connection/transport error number, TimeoutException, socket errors) as retryable while letting schema-validation / InvalidOperationException failures propagate immediately. Add a StartupRetryTests case asserting a non-transient exception is rethrown after a single attempt.

Resolution

Resolved 2026-05-17 (commit <pending>). Root cause confirmed: ExecuteWithRetryAsync caught Exception with only the when (attempt < maxAttempts) guard, so it retried any exception — including a permanent schema-version mismatch raised by the migration call site, delaying the inevitable fatal exit by ~2 minutes. Fix: added an optional Func<Exception, bool> isTransient parameter (default treats all exceptions as transient, preserving prior behaviour for unclassified callers); the catch guard is now when (attempt < maxAttempts && isTransient(ex)), so a non-transient fault propagates after a single attempt. Added a StartupRetry.IsTransientDatabaseFault classifier that returns true only for connection-class faults (TimeoutException, SocketException, SqlException / DbException by type-name match) and false for everything else — notably schema-validation InvalidOperationExceptions — and the Program.cs migration call site now passes it. Regression tests in StartupRetryTests: ExecuteWithRetry_NonTransientFailure_RethrowsAfterSingleAttempt (runs exactly once when isTransient returns false) and ExecuteWithRetry_TransientThenPermanent_StopsAtPermanent (retries a TimeoutException then stops at a permanent InvalidOperationException). Full Host suite green (182 passed).

Host-016 — Site CentralContactPoints second entry targets the site's own remoting port

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:33-37

Description

The shipped site config sets Node:RemotingPort = 8082 and lists Communication:CentralContactPoints as ["akka.tcp://scadabridge@localhost:8081", "akka.tcp://scadabridge@localhost:8082"]. The second contact point — port 8082 — is the site's own remoting endpoint, not a central node. SiteCommunicationActor / ClusterClient uses these addresses as initial contacts when discovering the central ClusterClientReceptionist; a contact pointing at the site itself can never reach the central receptionist and will be a permanent failure in the initial-contact rotation. For the single-node dev loopback layout the first contact (8081, central) succeeds and the bug is masked, but this is exactly the kind of dev-config "example" that gets duplicated into multi-central deployments — the same failure mode the resolved Host-004 finding called out for the seed-node list. StartupValidator validates seed nodes against the gRPC port (Host-004) but does not cross-check CentralContactPoints against the site's own RemotingPort, so the misconfiguration passes silently.

Recommendation

Correct the shipped site example to list two central remoting endpoints (e.g. localhost:8081 for central-a and a distinct port for central-b in a multi-node layout). Consider extending StartupValidator to reject any Communication:CentralContactPoints entry whose host+port matches this site node's NodeHostname+RemotingPort. Add a regression test in StartupValidatorTests mirroring Site_SeedNodeOnGrpcPort_FailsValidation.

Resolution (2026-05-28): The shipped appsettings.Site.json CentralContactPoints entry that pointed at the site's own remoting port (localhost:8082) was removed — the dev-loopback default now lists only the single central node (akka.tcp://scadabridge@localhost:8081), which is the actually-reachable target in the single-node dev layout. A _centralContactPoints doc-key comment was added immediately above the array calling out the per-entry rule (each entry MUST be a central node's remoting endpoint, not the site's own remoting port) and explaining how to extend the list with a second central node (akka.tcp://scadabridge@central-b-host:8081) in a multi-central deployment so ClusterClient can fail over. The dangerous example pattern that would have been copied into multi-central configs no longer exists in the template. StartupValidator cross-check is left as a follow-up — the documented rule plus the corrected template removes the immediate misconfiguration risk.

Host-017 — Site-shutdown ordering from REQ-HOST-7 is not wired

Severity Medium
Category Design-document adherence
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:229-265, src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcServer.cs

Description

REQ-HOST-7 documents an explicit four-step shutdown sequence for site nodes: "(1) On CoordinatedShutdown, stop accepting new gRPC streams first. (2) Cancel all active gRPC streams (triggering client-side reconnect). (3) Tear down actors. (4) Use IHostApplicationLifetime.ApplicationStopping to signal the gRPC server." The site path in Program.cs (the role == "Site" branch) registers no IHostApplicationLifetime.ApplicationStopping callback, and SiteStreamGrpcServer exposes no "stop accepting" / "cancel all streams" entry point — it has SetReady but no corresponding SetUnavailable or CancelAllStreams. In practice, on SIGTERM Kestrel closes its listener naturally and AkkaHostedService.StopAsync runs Akka CoordinatedShutdown, but there is no explicit, ordered handoff that meets the documented contract: in-flight streams are not actively cancelled before actors begin tearing down, so clients see a stream that goes silent (and only times out via gRPC keepalive) rather than a clean Cancelled they can reconnect on. This is a contract-vs-code drift — either the design doc is overstating what is implemented, or the implementation is incomplete.

Recommendation

Add a SiteStreamGrpcServer.CancelAllStreams() method that flips a "shutting down" flag (so SubscribeSite immediately fails new streams with StatusCode.Unavailable) and cancels every entry's Cts in the _streams map. In Program.cs site branch, resolve IHostApplicationLifetime and register a callback on ApplicationStopping that calls CancelAllStreams() before the Akka hosted service runs CoordinatedShutdown (or order via AkkaHostedService.StopAsync itself — IHostedService.StopAsync runs in reverse-registration order, so the gRPC server's lifetime can be sequenced before Akka shutdown). Alternatively, reconcile REQ-HOST-7 with the actual implementation if the explicit ordering is no longer intended. Add an integration test under tests/ZB.MOM.WW.ScadaBridge.Host.Tests that starts a site host, opens a stream, triggers shutdown, and asserts the stream completes with Cancelled before the actor system tears down.

Resolution (2026-05-28):

REQ-HOST-7 steps (1)+(2) wired. SiteStreamGrpcServer gained:

  • a monotonic _shuttingDown flag,
  • CancelAllStreams() — flips the flag, cancels every _activeStreams[*].Cts (with ObjectDisposedException swallow for entries cleaning themselves up concurrently), idempotent on repeat calls,
  • a SubscribeInstance guard that returns Unavailable "Server shutting down" for new subscriptions arriving after the flag flips.

Program.cs site branch now resolves IHostApplicationLifetime and the SiteStreamGrpcServer singleton, then registers ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams()). ApplicationStopping fires before any IHostedService.StopAsync, so the gRPC server begins refusing new streams and tears down in-flight ones BEFORE AkkaHostedService runs CoordinatedShutdown — matching REQ-HOST-7's ordering. Clients observe a clean Cancelled and reconnect rather than a silent stream that times out via keepalive (~25 s).

Two unit regression tests added to tests/ZB.MOM.WW.ScadaBridge.Communication.Tests/Grpc/SiteStreamGrpcServerTests.cs: Host017_CancelAllStreams_CancelsActiveStreamsAndRefusesNewOnes (active streams complete, new ones rejected) and Host017_CancelAllStreams_IsIdempotent (double-call safe). A full site-host integration test was deferred — the unit suite covers both server-side invariants and the wiring is a single Register line in Program.cs.

Host-018 — Shipped per-role configs omit NodeOptions.NodeName, leaving SourceNode null

Severity Low
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json, src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json, src/ZB.MOM.WW.ScadaBridge.Host/NodeOptions.cs:10-16

Description

NodeOptions.NodeName is documented as "the operator-configured semantic node name used to stamp the SourceNode column on audit rows", with conventional values node-a/node-b for site nodes and central-a/central-b for central nodes. The CLAUDE.md "Centralized Audit Log" key-decision section calls this out: SourceNode is meant to be carried verbatim through audit telemetry and reconciliation, and is indexed via IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc). The docker per-node configs (docker/central-node-a/appsettings.Central.json, docker/site-a-node-a/appsettings.Site.json, etc.) all set ScadaBridge:Node:NodeName. The shipped, default per-role files in src/ZB.MOM.WW.ScadaBridge.Host/ — the templates a developer running the binary directly will use — do not. NodeIdentityProvider normalises an empty NodeName to null, so dev audit rows carry a null SourceNode and the indexed lookup never narrows. The dev examples should match the docker examples; at minimum the field should appear in the shipped templates with a placeholder explaining the convention.

Recommendation

Add "NodeName": "central-a" (or a placeholder like "${NODE_NAME}") to appsettings.Central.json and "NodeName": "node-a" to appsettings.Site.json, with a short comment that the value must be set per-node in multi-node deployments. Consider validating in StartupValidator that NodeName is non-empty, or accept the null and document explicitly that single-node dev deployments leave SourceNode null.

Resolution (2026-05-28): The shipped per-role templates now set ScadaBridge:Node:NodeNamecentral-a in appsettings.Central.json and node-a in appsettings.Site.json — so dev audit rows are stamped with a real SourceNode value (instead of NodeIdentityProvider normalising the missing key to null) and the indexed IX_AuditLog_Node_Occurred lookup actually narrows. A _nodeName doc-key comment was added beside each Node section explaining the convention (central-a/central-b for central, node-a/node-b for site), pointing at the docker per-node configs (which already overrode the field), and noting that the value must be overridden per-node in multi-node deployments and that an empty value still normalises to a NULL SourceNode. The shipped dev templates now match the per-node docker examples — a developer running the binary directly no longer sees a null SourceNode.

Host-019 — Migration StartupRetry call drops the host CancellationToken

Severity Low
Category Concurrency & thread safety
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:154-165

Resolution (2026-05-28): Added a Func<CancellationToken, Task> overload of StartupRetry.ExecuteWithRetryAsync that forwards the retry-loop token into the operation, and the migration call site in Program.cs now passes app.Lifetime.ApplicationStopping as both the operation token (threaded to MigrationHelper.ApplyOrValidateMigrationsAsync) and the loop's cancellationToken (already honoured by the inter-attempt Task.Delay). A SIGTERM during the bounded retry window now tears down cleanly instead of waiting up to ~2 minutes for the loop to exhaust. The original Func<Task> overload still exists and delegates, so existing callers/tests are unchanged.

Description

StartupRetry.ExecuteWithRetryAsync accepts an optional CancellationToken cancellationToken = default and observes it both at the top of each attempt and inside the Task.Delay between retries. The migration call site in Program.cs passes no token, so the helper runs with CancellationToken.None. With maxAttempts: 8, initialDelay: 2s, and the 30s cap, a database that stays unreachable can keep the retry loop alive for ~2 minutes before the host process responds to SIGTERM / Ctrl+C / Windows-Service stop. The Program.cs startup pipeline does not yet have a host-lifetime token to forward at this point (the app is built but not yet running), but app.Lifetime.ApplicationStopping is available the moment builder.Build() returns. Threading it into the retry call honours the host lifecycle and matches the helper's documented contract.

Recommendation

Pass app.Lifetime.ApplicationStopping (or CancellationToken.None explicitly with a comment if intentional) into StartupRetry.ExecuteWithRetryAsync. Add a StartupRetryTests case exercising token-cancellation mid-backoff.

Resolution

Open.

Host-020 — MinimumLevel.Is silently overrides any operator-set Serilog:MinimumLevel

Severity Low
Category Documentation & comments
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs:36-43

Description

LoggerConfigurationFactory.Build reads the Serilog configuration section via ReadFrom.Configuration(configuration) (which can include a MinimumLevel block — the standard Serilog way to set the floor) and then calls .MinimumLevel.Is(minimumLevel) derived from ScadaBridge:Logging:MinimumLevel. Serilog's fluent builder applies the later call, so any Serilog:MinimumLevel:Default an operator sets is silently overridden by ScadaBridge:Logging:MinimumLevel (or by its Information fallback when the ScadaBridge key is absent). There are now two documented configuration paths for the same setting with non-obvious precedence, and the override direction is the opposite of what most Serilog users would expect (the more-specific Serilog section being the authority). The XML doc on Build says "the explicit MinimumLevel.Is pins the floor" but does not warn that the floor overrides the Serilog section's own MinimumLevel.

Recommendation

Pick one mechanism: either (a) drop the MinimumLevel.Is call and let ReadFrom.Configuration consume Serilog:MinimumLevel, migrating any docs/ deployments that reference ScadaBridge:Logging:MinimumLevel; or (b) keep the current "ScadaBridge:Logging" path and reject Serilog:MinimumLevel if present (throw at startup so the operator sees the conflict). At minimum, expand the XML doc + REQ-HOST-8 to spell out the precedence explicitly.

Resolution (2026-05-28): ScadaBridge:Logging:MinimumLevel is now the documented single source of truth for the Serilog floor (Host-011's LoggingOptions binding), and the precedence is made visible — LoggerConfigurationFactory.Build writes a one-shot warning to Console.Error when both ScadaBridge:Logging:MinimumLevel and Serilog:MinimumLevel (or Serilog:MinimumLevel:Default) are present, naming both values and pointing the operator at the documented key. Order of operations is unchanged — MinimumLevel.Is(...) deliberately runs after ReadFrom.Configuration(...) so the ScadaBridge value wins — but the silent-override behaviour is now loud. The class XML doc gained a Host-020 paragraph explicitly spelling out the precedence. A test-visible Build(..., TextWriter warningWriter) overload mirrors the ParseLevel Host-022 pattern so the warning can be asserted in unit tests; the production four-arg overload delegates with Console.Error.

Host-021 — Microsoft Logging:LogLevel section in appsettings.json is dead config under Serilog

Severity Low
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/appsettings.json:2-6

Description

appsettings.json carries a Microsoft Logging:LogLevel:Default = Information block. The Logging:LogLevel map is consumed by Microsoft.Extensions.Logging.ConfigurationConsoleLoggerOptions and similar provider configurations bound from the standard Logging section. The Host calls builder.Host.UseSerilog(), which replaces the default ILoggerFactory setup with Serilog as the only logger provider; Serilog reads from configuration.ReadFrom.Configuration(...) which consumes the Serilog section, not Logging:LogLevel. The result is that an operator editing Logging:LogLevel:Default (a very natural thing to try, since it is the .NET convention) sees no behaviour change — the section is dead config.

Recommendation

Either remove the Logging:LogLevel block from appsettings.json (Serilog owns logging configuration in this Host), or replace it with a brief comment explaining it is intentionally retained for non-Serilog tooling. Document the authoritative location (Serilog + ScadaBridge:Logging) in Component-Host.md REQ-HOST-8 if not already explicit.

Resolution (2026-05-28): Confirmed by repository-wide grep that no code reads Logging:LogLevel (the Host calls builder.Host.UseSerilog() which replaces the default ILoggerFactory setup with Serilog as the only provider), so the block was pure dead config. Removed the Logging:LogLevel:Default = Information block from appsettings.json and replaced it with a _logging doc-key comment explaining the rationale (Serilog is the sole provider) and pointing operators at the two authoritative keys: ScadaBridge:Logging:MinimumLevel for the floor (bound to LoggingOptions per Host-011) and the Serilog section for sinks (Host-014's ReadFrom.Configuration). The Host-014 regression test (SerilogSinkConfigTests.ShippedAppSettings_HasSerilogSection_WithConsoleAndFileSinks) still asserts the surviving Serilog section's shape, so removing the Microsoft block did not break the existing pinning.

Host-022 — ParseLevel silently coerces unrecognised MinimumLevel to Information

Severity Low
Category Error handling & resilience
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs:50-55

Resolution (2026-05-28): ParseLevel now writes a one-shot warning to Console.Error (the logger isn't built yet at this point) when a non-null/non-blank MinimumLevel value fails to parse, naming the offending value and the Information fallback. Null/blank values continue to default silently (treated as "unset"). The helper gained a test-visible TextWriter overload so unit tests can capture the warning; the production path delegates to it with Console.Error. Tests ParseLevel_UnrecognisedValue_FallsBackAndWarns, ParseLevel_NullOrBlank_FallsBackSilently, and ParseLevel_RecognisedValue_NoWarning pin the behaviour.

Description

LoggerConfigurationFactory.ParseLevel uses Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed) and returns LogEventLevel.Information when parsing fails — without logging the fallback. An operator who sets ScadaBridge:Logging:MinimumLevel = "Informaiton" (a common typo) or "Verbose,Debug" or any unrecognised value gets the default level silently; there is no warning, no log line, no startup error. Combined with Host-020 (this is the only mechanism that pins the floor), a misspelt value is invisible until someone wonders why the level change "didn't take". The helper is small and could either fail-fast in StartupValidator or emit a console warning before the logger is configured.

Recommendation

In LoggerConfigurationFactory.Build, when loggingOptions.MinimumLevel is non-null/non-blank but does not parse to a valid LogEventLevel, write a Console.Error.WriteLine warning (the logger is not yet built) and proceed with Information. Alternatively, validate the value in StartupValidator and fail fast — that matches the pattern used for other ScadaBridge configuration keys. Add a LoggerConfigurationTests case asserting the behaviour you choose.

Resolution

Open.