Solution + 23 src projects + 26 test projects renamed; folders, csproj, namespaces, and ScadaLinkDbContext/ScadaBridgeDbContext class updated. ActorSystem "scadalink" → "scadabridge", Akka seed-node URLs migrated. SQL roles/logins, LDAP domains, CLI command name, and CLI config dir (~/.scadalink → ~/.scadabridge) also renamed. Build green; 5 Host.Tests fail awaiting SQL login rename in next commit. Pre-existing StaleTagMonitor timing flakes unchanged. Rename script committed at tools/rename-to-scadabridge.sh.
66 KiB
Code Review — Host
| Field | Value |
|---|---|
| Module | src/ZB.MOM.WW.ScadaBridge.Host |
| Design doc | docs/requirements/Component-Host.md |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | 1eb6e97 |
| Open findings | 0 |
Summary
The Host module is the composition root for the entire ScadaBridge system: a single
binary whose behaviour (Central vs Site) is driven entirely by configuration. The
implementation is generally faithful to Component-Host.md — startup validation,
role-based registration, Serilog enrichment, Windows Service support, dead-letter
monitoring, CoordinatedShutdown, and gRPC hosting on site nodes are all present and
backed by a solid test suite (tests/ZB.MOM.WW.ScadaBridge.Host.Tests).
The most significant problem is the readiness endpoint: /health/ready runs all
registered health checks, including the leader-only active-node check, so a fully
operational standby central node permanently reports 503 on /health/ready —
directly contradicting REQ-HOST-4a, which defines readiness as cluster membership +
DB connectivity (not leadership). Several other findings concern configuration that
is validated-but-never-consumed (MachineDataDb), design-doc drift (Akka.Persistence
is required by REQ-HOST-6 but the system uses no persistent actors), an incorrect
seed-node entry in the shipped site config, blocking sync-over-async during startup,
and unguarded string interpolation when building HOCON. None are crash/data-loss
class, but the readiness bug is High because it breaks load-balancer behaviour with
no safe workaround.
Re-review 2026-05-17 (commit 39d737e)
All eleven findings from the first review (Host-001..011) are confirmed Resolved
in the current tree — the /health/ready predicate, the externalised secrets, the
seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
migration retry and the live LoggingOptions.MinimumLevel are all present as
described in their Resolution notes. This re-review walked all ten checklist
categories again over the full module and recorded four new findings, none of them
crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
note, Host-012 (Medium) confirms AkkaHostedService.BuildHocon hard-codes
down-if-alone = on and never reads the ClusterOptions.DownIfAlone property, so
that documented, bound option is dead. The remaining three are Low: :F0 rounding
of cluster timing values silently degrades any sub-second configuration (Host-013),
Serilog sink setup is hard-coded in Program.cs rather than configuration-driven as
REQ-HOST-8 requires (Host-014), and StartupRetry retries indiscriminately on every
exception type including permanent schema-validation failures (Host-015).
Re-review 2026-05-28 (commit 1eb6e97)
All fifteen prior findings (Host-001..015) remain Resolved in the current tree
and the regressions introduced for them — Host-001's predicate, the externalised
secrets, the Site GrpcPort/RemotingPort/seed-port validation rules, the escaped
HOCON builder with DownIfAlone and millisecond-precision durations, the
configuration-driven Serilog sinks, the transient-only StartupRetry
classifier — are all still in place. This re-review walked the ten checklist
categories over the full module again and recorded seven new findings, none of
them crash/data-loss class. Host-016 (Medium) mirrors the resolved Host-004
shipped-config bug on the Communication side: appsettings.Site.json's
second CentralContactPoints entry points at the site's own remoting port
(localhost:8082) instead of central, an incorrect dev example that copies
into multi-central deployments. Host-017 (Medium) flags a partial REQ-HOST-7
implementation — the documented site-shutdown ordering (stop accepting streams
first, cancel active streams via IHostApplicationLifetime.ApplicationStopping,
then tear down actors) is not wired: the site path registers no
ApplicationStopping handler that signals SiteStreamGrpcServer, and the gRPC
server exposes no cancel-all-streams entry point. The remaining five are Low:
NodeOptions.NodeName (the operator-configured value stamped on
AuditLog.SourceNode) is absent from both shipped per-role configs even though
the docker per-node configs set it (Host-018); the migration StartupRetry
call passes default for CancellationToken, so a SIGTERM during the
bounded-retry window is ignored for up to ~2 minutes (Host-019);
LoggerConfigurationFactory layers MinimumLevel.Is over
ReadFrom.Configuration, so any Serilog:MinimumLevel an operator sets is
silently overridden by ScadaBridge:Logging:MinimumLevel (Host-020); the
shipped appsettings.json carries a Microsoft Logging:LogLevel block but
Serilog is the only logger provider and the section is dead config (Host-021);
and ParseLevel silently swallows an unrecognised MinimumLevel value (e.g.
a typo) and falls back to Information with no warning (Host-022).
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: :F0 rounding of cluster timing values silently distorts sub-second durations (Host-013). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (StartAsync now genuinely async). DeadLetterMonitorActor state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: StartupRetry retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: ClusterOptions.DownIfAlone bound/documented but never consumed — HOCON hard-codes on (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new down-if-alone, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
Re-review (2026-05-28, 1eb6e97):
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ☑ | Re-review: appsettings.Site.json second CentralContactPoints entry targets the site's own remoting port instead of central (Host-016) — same defect class as the resolved Host-004 seed-list bug. |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping, role-scoped site singletons, ClusterClient initial-contact wiring all reviewed; no new issues. |
| 3 | Concurrency & thread safety | ☑ | _trackedDisposables is locked on both sides of the lifecycle; _actorSystem publication is safe via the IHost startup await boundary. New Low: StartupRetry migration call passes default CancellationToken, so SIGTERM during the retry window is ignored (Host-019). |
| 4 | Error handling & resilience | ☑ | IsTransientDatabaseFault correctly classifies socket / timeout / SqlException; the retry helper itself remains sound. Host-019 is the resilience gap. |
| 5 | Security | ☑ | Secrets stay externalised; the _secrets placeholder comment is intact. No new issues. |
| 6 | Performance & resource management | ☑ | No new undisposed resources; gRPC stream lifetime cap remains correct. No new issues. |
| 7 | Design-document adherence | ☑ | Re-review: REQ-HOST-7 site-shutdown ordering — stop accepting new streams, cancel active streams via ApplicationStopping, then tear down actors — is not wired in Program.cs (Host-017). |
| 8 | Code organization & conventions | ☑ | Re-review: NodeOptions.NodeName is absent from the shipped per-role configs even though it stamps AuditLog.SourceNode (Host-018); the appsettings Logging:LogLevel Microsoft section is dead config under Serilog (Host-021). |
| 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site CentralContactPoints second-entry rule (Host-016), the site-shutdown ordering (Host-017), the NodeName-absent shipped config (Host-018), the unused CancellationToken parameter (Host-019), the MinimumLevel.Is override semantics (Host-020) or the ParseLevel silent fallback (Host-022). |
| 10 | Documentation & comments | ☑ | Re-review: layered MinimumLevel.Is / ReadFrom.Configuration semantics are not surfaced — an operator-set Serilog:MinimumLevel is silently overridden by ScadaBridge:Logging:MinimumLevel (Host-020); ParseLevel silently coerces a misspelled level to Information with no warning (Host-022). |
Findings
Host-001 — /health/ready includes the leader-only active-node check
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:135-145 |
Description
/health/ready is mapped with MapHealthChecks("/health/ready", ...) and no
Predicate, so it executes every registered check: database, akka-cluster
and active-node. ActiveNodeHealthCheck (Health/ActiveNodeHealthCheck.cs:38)
returns Unhealthy on any node that is not the cluster leader. As a result a
standby central node that is fully operational (cluster member Up, database
reachable) still returns 503 on /health/ready. This contradicts REQ-HOST-4a,
which defines readiness as cluster membership + DB connectivity + singletons —
explicitly not leadership. /health/active is the endpoint intended to report
leadership. A load balancer using /health/ready to decide whether a node may
serve traffic will permanently treat the standby as unready, defeating failover
readiness. No test covers this: HealthCheckTests.HealthReady_Endpoint_ReturnsResponse
only asserts a response is returned, not the standby semantics.
Recommendation
Add a Predicate to the /health/ready mapping that excludes the active-node
check, e.g. Predicate = check => check.Name != "active-node" (or tag the readiness
checks and filter by tag). Add a regression test asserting a non-leader node returns
200 on /health/ready.
Resolution
Resolved 2026-05-16 (commit <pending>). Root cause confirmed against
Program.cs: the /health/ready mapping had no Predicate, so it executed all
three registered checks including the leader-only active-node check, while
ActiveNodeHealthCheck returns Unhealthy on any non-leader node — making a fully
operational standby central node permanently report 503. Fix: added
Predicate = check => check.Name != "active-node" to the /health/ready
HealthCheckOptions, so readiness now reflects cluster membership + DB connectivity
only (REQ-HOST-4a); leadership remains reported solely by /health/active.
Regression test HealthCheckTests.HealthReady_Endpoint_ExcludesActiveNodeCheck
asserts the active-node check name does not appear in the /health/ready
response body; it failed before the fix and passes after. Full Host suite green
(156 passed).
Host-002 — Akka.Persistence required by REQ-HOST-6 is not configured and not used
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:70-108, docs/requirements/Component-Host.md REQ-HOST-6 |
Description
REQ-HOST-6 states the Host "must configure the Akka.NET actor system using
Akka.Hosting with ... Persistence: Configured with the appropriate journal and
snapshot store (SQL for central, SQLite for site)." The HOCON built in
AkkaHostedService.StartAsync contains no akka.persistence section, no journal and
no snapshot-store plugin, and ZB.MOM.WW.ScadaBridge.Host.csproj references neither
Akka.Persistence.Hosting nor any persistence plugin (the design doc Dependencies
list Akka.Persistence.Hosting). A repo-wide search finds no PersistentActor /
ReceivePersistentActor subclasses — the system deliberately uses custom SQLite
storage services instead. The code is internally consistent, but the design document
is stale: it mandates a subsystem that does not exist. This is a documented-vs-actual
drift that will mislead future maintainers and any audit against REQ-HOST-6.
Recommendation
Update Component-Host.md REQ-HOST-6 and the Dependencies list to remove the
Akka.Persistence requirement (or explicitly state persistence is provided by
component-owned SQLite storage, not Akka.Persistence). If persistence is intended,
add the plugin packages and HOCON. Either way, code and doc must agree.
Resolution
Resolved 2026-05-16 (commit <pending>). Confirmed accurate: a repo-wide search
finds no PersistentActor / ReceivePersistentActor subclasses anywhere in
src/, no akka.persistence section in the HOCON built by
AkkaHostedService.StartAsync, and ZB.MOM.WW.ScadaBridge.Host.csproj references no
persistence plugin packages. The Host code is correct and internally consistent;
the defect was a stale design doc. Fix: updated docs/requirements/Component-Host.md
— REQ-HOST-6 no longer lists Persistence as a configured subsystem and now carries
an explicit Persistence note stating that durable state is owned by individual
components and persisted through component-owned stores (SQLite at sites, MS SQL
centrally), with no Akka journal/snapshot store and no PersistentActor subclasses
by design. The Responsibilities line and the Dependencies entry
(Akka.Persistence.Hosting removed) were corrected to match. Code and doc now
agree; no code or test change required.
Host-003 — Secrets committed in plaintext in appsettings.Central.json
| Severity | Medium |
| Category | Security |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json:20-31 |
Description
appsettings.Central.json contains real-looking secrets in plaintext, checked into
source control: SQL Server passwords in the ConfigurationDb / MachineDataDb
connection strings (Password=ScadaBridge_Dev1#), an LDAP service-account password
(LdapServiceAccountPassword: "password"), and a JWT signing key
(JwtSigningKey: "scadabridge-dev-jwt-signing-key-..."). Even though these are
intended as development defaults, shipping them in the default config invites them
being reused verbatim in production, and a committed JWT signing key allows anyone
with repo access to forge session tokens. TrustServerCertificate=true additionally
disables TLS validation for the SQL connection.
Recommendation
Move all secrets out of committed appsettings*.json into environment variables,
user-secrets, or a secret store. Keep only non-sensitive structural defaults in the
file and document the required environment variables. At minimum add a clear comment
that these values are dev-only and must be overridden, and rotate the JWT key per
environment.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed against
appsettings.Central.json: the committed file carried real-looking secrets in
plaintext — SQL Server passwords (Password=ScadaBridge_Dev1#) in both connection
strings, an LDAP service-account password, and a JWT signing key. Fix: all four
secrets were removed from the committed file and replaced with non-functional
${...} placeholder references (ConfigurationDb / MachineDataDb,
LdapServiceAccountPassword, JwtSigningKey). A new top-level _secrets note
documents that the Host's configuration builder (AddEnvironmentVariables())
overlays the real values supplied via environment variables
(ScadaBridge__Database__ConfigurationDb, ScadaBridge__Database__MachineDataDb,
ScadaBridge__Security__LdapServiceAccountPassword,
ScadaBridge__Security__JwtSigningKey); the placeholders are intentionally invalid so
a misconfigured deployment fails loudly rather than silently using a committed key.
Regression test class ConfigSecretsTests asserts the committed file carries no
plaintext Password= value, no committed LDAP service-account password, and no
committed JWT signing key; all three tests failed before the fix and pass after.
Tests that drive the full Program startup pipeline against the real SQL provider
(HealthCheckTests, HostStartupTests.CentralRole_StartsWithoutError) were adapted
to supply the local dev connection strings themselves via the new
CentralDbTestEnvironment test helper (environment variables) — they must no longer
depend on committed secrets. Note: the docker/central-node-*/appsettings.Central.json
files still contain the same dev secrets but lie outside this task's permitted edit
scope; they should receive the same treatment in a follow-up.
Host-004 — Site seed-node list points at the gRPC port, not a remoting port
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:10-19 |
Description
The shipped site config sets Node:RemotingPort = 8082 and Node:GrpcPort = 8083,
but Cluster:SeedNodes is ["akka.tcp://scadabridge@localhost:8082", "akka.tcp://scadabridge@localhost:8083"]. The second seed node targets 8083, which
is the Kestrel HTTP/2 gRPC port — not an Akka remoting endpoint. A node attempting to
join via that seed will try to establish an Akka.Remote TCP association against the
gRPC listener and fail. StartupValidator only checks that ≥2 seed nodes exist
(StartupValidator.cs:54-56), so this misconfiguration passes validation silently.
For the single-node dev site it is harmless (the first seed succeeds), but it is an
incorrect example that will be copied into multi-node site configs.
Recommendation
Correct the site seed-node list to reference the two site nodes' remoting ports
(e.g. 8082 and 8084), never the gRPC port. Consider extending StartupValidator
to reject a seed node whose port equals this node's GrpcPort.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed against
appsettings.Site.json: with Node:RemotingPort = 8082 and Node:GrpcPort = 8083,
the second Cluster:SeedNodes entry was akka.tcp://scadabridge@localhost:8083 — the
Kestrel HTTP/2 gRPC port, not an Akka.Remote endpoint. StartupValidator only
checked seed-node count (≥2), so the misconfiguration passed silently. Fix, two
parts: (1) the shipped appsettings.Site.json second seed entry was corrected to a
remoting port (localhost:8084); (2) StartupValidator.Validate was extended — for
Site nodes it now parses each seed node's TCP port (via a new SeedNodePort
helper) and rejects any entry whose port equals the node's GrpcPort, using the
resolved GrpcPort including the 8083 NodeOptions default when the key is absent.
The seed-node-count check was hoisted above the Site block so the new rule can reuse
the parsed list. Regression tests in StartupValidatorTests:
Site_SeedNodeOnGrpcPort_FailsValidation,
Site_SeedNodeOnDefaultGrpcPort_FailsValidation (default-8083 path),
Site_SeedNodesOnRemotingPort_PassesValidation, and
Central_SeedNodeOnPort8083_PassesValidation (rule is Site-only) — all failed
appropriately before the fix and pass after.
Host-005 — Blocking sync-over-async (GetAwaiter().GetResult()) inside StartAsync
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:345 |
Description
RegisterSiteActors calls storeAndForwardService.StartAsync().GetAwaiter().GetResult()
synchronously, blocking inside the IHostedService.StartAsync path. StartAsync is
itself declared synchronous (returns Task.CompletedTask), so the work cannot be
awaited cleanly. Blocking on async work risks thread-pool starvation during startup
and, if the awaited operation captures a synchronization context, deadlock. It also
hides exceptions behind an AggregateException wrapper.
Recommendation
Make AkkaHostedService.StartAsync genuinely async and await storeAndForwardService.StartAsync(cancellationToken). Propagate the
CancellationToken and let exceptions surface as the original type.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: RegisterSiteActors
called storeAndForwardService.StartAsync().GetAwaiter().GetResult() synchronously
inside the IHostedService.StartAsync path, blocking a startup thread and wrapping
any failure in AggregateException. Fix: AkkaHostedService.StartAsync is now
genuinely async Task; the site path was renamed RegisterSiteActorsAsync and made
async; the store-and-forward init is now await storeAndForwardService.StartAsync()
(the service's own signature, in another module, takes no CancellationToken, so the
token is honoured via a cancellationToken.ThrowIfCancellationRequested() checkpoint
immediately before the await). No blocking, no sync-context deadlock risk, and the
original exception type now surfaces. No dedicated regression test: the only
observable change is non-blocking/exception-unwrapping behaviour on a path that
requires the full site DI graph plus a real SQLite store; the existing
HostStartupTests.SiteRole_* and CompositionRootTests continue to exercise the
async startup pipeline and remain green (175 passed).
Host-006 — HOCON assembled by unescaped string interpolation
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:70-108 |
Description
The Akka HOCON is built with an interpolated string that injects
_nodeOptions.NodeHostname, _clusterOptions.SeedNodes, the computed roles, and
SplitBrainResolverStrategy directly into the configuration text. Values are not
escaped. A hostname or seed-node string containing a quote, backslash, brace, or
comment sequence would corrupt the HOCON and produce a confusing parse error far from
the real cause; SplitBrainResolverStrategy is interpolated without quoting, so a
value with whitespace breaks the document. Building cluster configuration from raw
string concatenation is also harder to maintain than the typed Akka.Hosting builder
the design doc (REQ-HOST-6) actually calls for ("via Akka.Hosting").
Recommendation
Prefer the Akka.Hosting AddAkka(...) builder with strongly-typed WithRemoting,
WithClustering, and split-brain-resolver configuration instead of hand-built HOCON.
If HOCON must be retained, validate/escape interpolated values (especially hostname
and seed nodes) before substitution.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: the Akka HOCON was
built by an interpolated string injecting NodeHostname, SeedNodes, roles and
SplitBrainResolverStrategy directly into the document text with no escaping — a
value containing a quote, backslash or whitespace would corrupt or misparse the
document. Re-triaged the recommendation: switching wholesale to the typed
Akka.Hosting builder is a larger refactor than this Low finding warrants and would
touch how the whole cluster is bootstrapped, so the HOCON form was retained but made
safe. Fix: the HOCON construction was extracted into a testable static
AkkaHostedService.BuildHocon, and every interpolated string value is now routed
through a QuoteHocon helper that wraps the value in a double-quoted HOCON literal
and escapes embedded backslashes and quotes — NodeHostname, every seed node, every
role, and SplitBrainResolverStrategy (the last was previously unquoted entirely, so
whitespace alone would break it). Regression tests in new HoconBuilderTests:
BuildHocon_HostnameWithQuote_DoesNotCorruptDocument (a hostname with a " still
parses and following keys stay intact), BuildHocon_StrategyWithWhitespace_RemainsQuoted,
and BuildHocon_PlainValues_ParsesAndPreservesValues. With the old raw interpolation
a quote-containing hostname produced hostname = "evil"host", closing the literal
early and corrupting following keys; the escaped form parses correctly. Full Host
suite green (175 passed).
Host-007 — REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:43-47 |
Description
REQ-HOST-4 requires: "Site nodes must have GrpcPort in valid port range (1–65535)
and different from RemotingPort." StartupValidator validates the GrpcPort
range but never compares it to RemotingPort. A site config that sets both ports to
the same value passes validation and then fails opaquely at runtime when Kestrel and
Akka.Remote both try to bind the port. The GrpcPort range check is also skipped
entirely when the key is absent (grpcPortStr != null), relying on the
NodeOptions default of 8083 — acceptable, but the equality rule is the missing
piece.
Recommendation
Add a check in the role == "Site" block: if GrpcPort (resolved, including the
8083 default) equals RemotingPort, add an error
"ScadaBridge:Node:GrpcPort must differ from RemotingPort".
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed against
StartupValidator.cs: the role == "Site" block validated the GrpcPort range but
never compared it to RemotingPort, so a site config setting both ports equal
passed validation and then failed opaquely when Kestrel and Akka.Remote both bound
the port. Fix: added if (port == grpcPort) errors.Add("ScadaBridge:Node:GrpcPort must differ from RemotingPort") in the Site block, using the resolved GrpcPort (including
the 8083 NodeOptions default when the key is absent) against the parsed
RemotingPort. Regression tests in StartupValidatorTests:
Site_GrpcPortEqualsRemotingPort_FailsValidation,
Site_DefaultGrpcPortEqualsRemotingPort_FailsValidation (default-8083 path), and
Site_GrpcPortDiffersFromRemotingPort_PassesValidation — all failed appropriately
before the fix and pass after. Full Host suite green (175 passed).
Host-008 — MachineDataDb is validated and declared but never consumed
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:33-34, src/ZB.MOM.WW.ScadaBridge.Host/DatabaseOptions.cs:6 |
Description
StartupValidator requires a non-empty ScadaBridge:Database:MachineDataDb connection
string for Central nodes, and DatabaseOptions exposes a MachineDataDb property,
but a repo-wide search shows the value is never read anywhere outside the Host module
— only ConfigurationDb is passed to AddConfigurationDatabase
(Program.cs:83-85). The Host therefore fails startup if MachineDataDb is missing
even though nothing uses it. This is either dead configuration that should be removed
or a missing wiring (a machine-data DbContext that was never registered).
Recommendation
Determine whether a machine-data store is actually required. If yes, wire it into the
relevant component's DI registration. If no, remove the MachineDataDb validation
rule, the DatabaseOptions property, and the key from appsettings.Central.json.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: a repo-wide search
shows MachineDataDb is read nowhere outside the Host module — only ConfigurationDb
is wired into AddConfigurationDatabase — yet StartupValidator failed Central
startup when it was absent. No machine-data store exists or is registered, so this is
dead configuration. Fix (the "if no" branch of the recommendation): removed the
MachineDataDb validation rule from StartupValidator, the MachineDataDb property
from DatabaseOptions, the MachineDataDb key from appsettings.Central.json, and
the now-pointless MachineDataDb reference from that file's _secrets note and from
the CentralDbTestEnvironment test helper. Regression test: the prior
StartupValidatorTests.Central_MissingMachineDataDb_FailsValidation was inverted to
Central_MissingMachineDataDb_PassesValidation (absence must now NOT fail startup),
and MachineDataDb was removed from the shared ValidCentralConfig fixture so all
Central tests prove the key is no longer required; it failed under the old code and
passes after. Full Host suite green (175 passed). Note: appsettings.Central.json
under docker/central-node-* is outside this task's edit scope and still carries the
key harmlessly — it will simply be ignored; a follow-up may remove it for tidiness.
Host-009 — StartAsync reports success before role actors are confirmed running
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:127-141 |
Description
StartAsync creates actors with ActorOf (a fire-and-forget operation — the actor's
PreStart runs asynchronously on its own thread) and then returns
Task.CompletedTask. For site nodes, grpcServer.SetReady(_actorSystem) is called
synchronously at the end of RegisterSiteActors, marking the gRPC server ready even
though SiteCommunicationActor, the deployment-manager singleton, and the
ClusterClient may not yet have completed their PreStart/initial-contact handshake.
REQ-HOST-7 requires "Actor system and SiteStreamManager ... initialized before gRPC
begins accepting connections" — SiteStreamManager.Initialize is awaited-equivalent,
but the broader actor graph is not. The window is small and the gRPC server still
rejects streams until SetReady, so impact is limited, but readiness is being
asserted optimistically.
Recommendation
If strict ordering matters, gate SetReady on confirmation that
SiteCommunicationActor is fully initialized (e.g. an Ask round-trip or a
readiness message), or document explicitly that gRPC readiness only guarantees the
actor system exists, not that the cluster handshake has completed.
Resolution
Resolved 2026-05-16 (commit pending) — re-triaged, documentation fix. Root cause
confirmed: SetReady is called optimistically after ActorOf registrations whose
PreStart/initial-contact handshakes run asynchronously. Re-triaged the first
recommended option (gating SetReady on a SiteCommunicationActor Ask
round-trip): that option is incorrect because the meaningful asynchronous step is
the ClusterClient initial-contact handshake with central — gating site gRPC
readiness on it would prevent a site from coming up and streaming locally while
central is briefly unreachable, contradicting the design's site-autonomy model.
Confirmed SiteStreamGrpcServer already rejects any stream opened before SetReady
with StatusCode.Unavailable, so the window is benign. Fix (the second, correct
option): the SetReady call site now carries an explicit comment documenting the
narrow readiness contract — by SetReady the actor system exists,
SiteStreamManager.Initialize has run, and every role actor has been created with
ordered synchronous ActorOf + registration Tells; what is deliberately NOT
guaranteed is PreStart completion or the central handshake. No regression test is
meaningful for a documentation-only clarification of an intentional design contract;
the existing HostStartupTests.SiteRole_* cover the surrounding startup path. Full
Host suite green (175 passed).
Host-010 — No retry/backoff around startup preconditions (DB migration, readiness)
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:112-125 |
Description
On Central startup the Host opens a DI scope and calls
MigrationHelper.ApplyOrValidateMigrationsAsync directly. If the SQL Server is not
yet reachable (common in container orchestration where the DB and app start
together), the call throws, the top-level catch logs Fatal, and the process
exits. There is no bounded retry/backoff to tolerate a database that is briefly
unavailable at boot. The design intent (REQ-HOST-4a, readiness gating, 503 until
ready) is about serving traffic, but the migration step happens before the host
even runs and has no such tolerance.
Recommendation
Wrap the migration/validation step in a bounded retry with exponential backoff (e.g.
Polly), or move schema apply behind the readiness gate so the process stays up and
reports 503 until the database becomes reachable.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: Central startup called
MigrationHelper.ApplyOrValidateMigrationsAsync directly with no tolerance for a
database that is briefly unreachable at boot — a first connection failure threw, the
top-level catch logged Fatal, and the process exited. Fix: added a new
StartupRetry.ExecuteWithRetryAsync helper (bounded retry with capped exponential
backoff — no Polly dependency, keeping the Host's package set minimal) and wrapped
the migration step in it (maxAttempts: 8, initialDelay: 2s, doubling, capped at
30s). Each retry opens a fresh DI scope so a transient connection fault does not
leave a poisoned DbContext; once attempts are exhausted the last exception
propagates and the existing fatal-log path still fires. Regression tests in new
StartupRetryTests: ExecuteWithRetry_SucceedsFirstTry_RunsOnce,
ExecuteWithRetry_TransientFailures_RetriesUntilSuccess (fails twice then succeeds),
and ExecuteWithRetry_ExhaustsAttempts_RethrowsLastException (verifies the original
exception type/message surfaces after exhaustion) — they fail to compile/pass against
the pre-fix code (the helper did not exist) and pass after. Full Host suite green
(175 passed).
Host-011 — LoggingOptions.MinimumLevel is dead configuration
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/LoggingOptions.cs:5, src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:42-50 |
Description
LoggingOptions exposes a MinimumLevel property bound from ScadaBridge:Logging
(SiteServiceRegistration.BindSharedOptions), and both appsettings.Central.json
and appsettings.Site.json set "Logging": { "MinimumLevel": "Information" }.
However Serilog is configured purely via ReadFrom.Configuration(configuration),
which reads the standard Serilog section — not ScadaBridge:Logging. The
LoggingOptions.MinimumLevel value is never read by any code, so changing it has no
effect. This is misleading: an operator editing ScadaBridge:Logging:MinimumLevel
expecting a log-level change will see nothing happen.
Recommendation
Either consume LoggingOptions.MinimumLevel when configuring the Serilog
LoggerConfiguration (e.g. set MinimumLevel.Is(...) from it), or remove the option
class and the ScadaBridge:Logging sections and rely solely on the Serilog
configuration section. Keep one mechanism, not two.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: LoggingOptions.MinimumLevel
was bound from ScadaBridge:Logging but Serilog was configured purely via
ReadFrom.Configuration (standard Serilog section), so editing
ScadaBridge:Logging:MinimumLevel had no effect. Re-triaged the two options: the
"remove the option class" branch was rejected because Component-Host.md REQ-HOST-3
explicitly lists ScadaBridge:Logging → LoggingOptions and that design doc is outside
this task's edit scope — removing the class would create a fresh code-vs-doc drift.
Fix (the "consume it" branch): added a LoggerConfigurationFactory.Build that binds
LoggingOptions, parses MinimumLevel into a Serilog LogEventLevel (falling back
to Information for null/blank/unrecognised values), and pins it via
MinimumLevel.Is(...) on top of ReadFrom.Configuration; Program.cs now builds the
logger through this factory. ScadaBridge:Logging:MinimumLevel is now live.
Regression tests in new LoggerConfigurationTests:
MinimumLevel_Warning_SuppressesInformationLogs,
MinimumLevel_Debug_AllowsDebugLogs, and MinimumLevel_Absent_DefaultsToInformation
— they assert the configured level actually filters log events; they pass against the
new factory (and would fail against the old inline configuration, which ignored the
key). Full Host suite green (175 passed).
Host-012 — down-if-alone hard-coded in HOCON; ClusterOptions.DownIfAlone is never read
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:146-148 |
Description
AkkaHostedService.BuildHocon emits the split-brain-resolver block with
keep-oldest { down-if-alone = on } as a literal constant. ClusterOptions
(src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/ClusterOptions.cs:74) exposes a
DownIfAlone property — bound from ScadaBridge:Cluster via the Options pattern,
documented as "the design-doc requirement", default true — but a repo-wide search
shows it is referenced nowhere outside its own declaration. The Host therefore
ignores the bound value entirely: setting ScadaBridge:Cluster:DownIfAlone to false
in appsettings.json has no effect, the resolver still runs with down-if-alone = on. This is the same class of defect as the resolved Host-011
(LoggingOptions.MinimumLevel was dead config) — a configuration option that is
declared, bound, and documented but never consumed, which silently misleads any
operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
hard-coding; it is recorded here because BuildHocon is Host-owned code (the
ClusterInfrastructure project owns only the configuration contract, per the
ClusterOptions class comment).
Recommendation
Make BuildHocon consume clusterOptions.DownIfAlone — emit down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")} (the value is a bool, so no escaping
is needed). Add a HoconBuilderTests case asserting both true and false produce
the corresponding down-if-alone token. If the flag is genuinely meant to be fixed
at on for ScadaBridge's two-node clusters, remove the DownIfAlone property and its
doc comment instead so code and configuration contract agree — but do not leave it
declared-but-dead.
Resolution
Resolved 2026-05-17 (commit <pending>). Root cause confirmed against
AkkaHostedService.cs: BuildHocon emitted keep-oldest { down-if-alone = on } as
a literal, and ClusterOptions.DownIfAlone was referenced nowhere outside its own
declaration, so the bound option was dead. Fix: BuildHocon now emits
down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}, consuming the bound
property. Regression tests in HoconBuilderTests: BuildHocon_DownIfAloneTrue_EmitsOn
and BuildHocon_DownIfAloneFalse_EmitsOff parse the resulting HOCON and assert the
keep-oldest.down-if-alone token tracks the property; the false case failed before
the fix (token was still on) and passes after. Full Host suite green (182 passed).
Host-013 — :F0 rounding of cluster timing values silently degrades sub-second configuration
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:135-136,145,151-152 |
Description
BuildHocon renders every duration into HOCON via TotalSeconds:F0 —
transportHeartbeatSec:F0, transportFailureSec:F0, StableAfter.TotalSeconds:F0,
HeartbeatInterval.TotalSeconds:F0, FailureDetectionThreshold.TotalSeconds:F0. The
F0 format specifier rounds to whole seconds, so any sub-second configuration is
silently distorted: a HeartbeatInterval of 00:00:00.5 renders as
heartbeat-interval = 0s (a degenerate / invalid value that Akka will reject or
treat as zero), and 00:00:02.6 becomes 3s. The shipped defaults are all whole
seconds so this is latent, but the configuration model accepts arbitrary TimeSpan
values and an operator tuning a heartbeat to e.g. 750ms would get a 1s value with
no warning — or, worse, 0s. Rounding configured timing values without surfacing
the change is a correctness hazard for exactly the kind of failure-detection tuning
these options exist for.
Recommendation
Render durations without precision loss — emit milliseconds (e.g.
heartbeat-interval = {ts.TotalMilliseconds:F0}ms) so sub-second values survive, or
validate in StartupValidator that each cluster timing value is a positive whole
number of seconds and fail fast otherwise. Either way, do not silently round.
Resolution
Resolved 2026-05-17 (commit <pending>). Root cause confirmed: BuildHocon
rendered every duration via TotalSeconds:F0, rounding sub-second values — a 750ms
heartbeat became 1s and a 500ms value would collapse to a degenerate 0s. Fix:
durations are now rendered by a new DurationHocon helper that emits whole
milliseconds with an ms suffix (Akka's HOCON parser accepts it), so sub-second
configuration survives exactly; BuildHocon takes the transport heartbeat/failure
as TimeSpan rather than pre-rounded double seconds. Regression test
HoconBuilderTests.BuildHocon_SubSecondTimings_PreservedWithoutRounding configures
750ms/2600ms/500ms/2.5s/7.5s timings and asserts each round-trips through the parsed
HOCON unchanged; it failed before the fix (750ms parsed as 00:00:01) and passes
after. Full Host suite green (182 passed).
Host-014 — Serilog sinks are hard-coded in Program.cs, not configuration-driven (REQ-HOST-8)
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:43-48, src/ZB.MOM.WW.ScadaBridge.Host/appsettings.json:1-7 |
Description
REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
setup (console and file sinks at minimum)". LoggerConfigurationFactory.Build calls
.ReadFrom.Configuration(configuration), which reads the standard Serilog
configuration section — but neither appsettings.json nor either role-specific file
contains a Serilog section (only a Microsoft Logging section with a LogLevel
map, which Serilog's ReadFrom.Configuration does not consume). The two sinks that
actually run are appended in Program.cs as hard-coded .WriteTo.Console(...) and
.WriteTo.File("logs/scadabridge-.log", rollingInterval: Day) calls. As a result the
console output template, the file path, and the rolling interval cannot be changed
without recompiling — an operator cannot redirect logs, change the file location, or
add a sink via configuration, contrary to REQ-HOST-8. The ReadFrom.Configuration
call is effectively a no-op because the section it reads does not exist.
Recommendation
Move the console and file sink definitions into a Serilog section in
appsettings.json (with WriteTo entries and the output template / file path /
rolling interval as arguments) so ReadFrom.Configuration drives them, and drop the
hard-coded .WriteTo calls from Program.cs. Alternatively, update REQ-HOST-8 to
state the sinks are intentionally code-defined — but the design doc currently says
"configuration-driven", so code and doc must be reconciled.
Resolution
Resolved 2026-05-17 (commit <pending>). Root cause confirmed: LoggerConfigurationFactory.Build
called ReadFrom.Configuration but neither appsettings.json nor either
role-specific file contained a Serilog section, so the two sinks that actually ran
were hard-coded .WriteTo.Console(...) / .WriteTo.File(...) calls in Program.cs
— uneditable without recompiling, contrary to REQ-HOST-8. Fix: added a Serilog
section to appsettings.json with WriteTo entries for the Console and File sinks
(carrying the output template, file path logs/scadabridge-.log and Day rolling
interval as args) and removed the hard-coded .WriteTo calls from Program.cs; the
factory's ReadFrom.Configuration (backed by the transitive
Serilog.Settings.Configuration package) now drives the sinks. Regression tests in
new SerilogSinkConfigTests: ShippedAppSettings_HasSerilogSection_WithConsoleAndFileSinks
asserts the shipped appsettings.json declares both sinks in a Serilog:WriteTo
array (fails against the pre-fix file, which had no Serilog section), and
LoggerConfigurationFactory_AppliesConfiguredFileSink proves a configuration-supplied
File sink reaches the built logger and writes to the configured path. Full Host suite
green (182 passed).
Host-015 — StartupRetry retries on every exception type, including permanent failures
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/StartupRetry.cs:36-45 |
Description
StartupRetry.ExecuteWithRetryAsync catches Exception with only the guard
when (attempt < maxAttempts) — it retries on any exception type. Its sole caller
wraps MigrationHelper.ApplyOrValidateMigrationsAsync (Program.cs:124-134), which
on a Central node in production validates the schema version and throws if the
deployed schema does not match the expected migration set. A schema-version mismatch
is a permanent error — retrying it cannot succeed — yet StartupRetry will retry it
8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes)
before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault
that is fatal on the first attempt. The retry helper is meant for transient
connection faults (the XML doc says exactly that: "the database may be briefly
unreachable"), but it cannot distinguish transient from permanent failures.
Recommendation
Restrict retries to transient faults — e.g. accept an Func<Exception, bool> isTransient predicate and, for the migration call site, treat only
connection-class exceptions (SqlException with a connection/transport error
number, TimeoutException, socket errors) as retryable while letting
schema-validation / InvalidOperationException failures propagate immediately. Add
a StartupRetryTests case asserting a non-transient exception is rethrown after a
single attempt.
Resolution
Resolved 2026-05-17 (commit <pending>). Root cause confirmed: ExecuteWithRetryAsync
caught Exception with only the when (attempt < maxAttempts) guard, so it retried
any exception — including a permanent schema-version mismatch raised by the
migration call site, delaying the inevitable fatal exit by ~2 minutes. Fix: added an
optional Func<Exception, bool> isTransient parameter (default treats all
exceptions as transient, preserving prior behaviour for unclassified callers); the
catch guard is now when (attempt < maxAttempts && isTransient(ex)), so a
non-transient fault propagates after a single attempt. Added a
StartupRetry.IsTransientDatabaseFault classifier that returns true only for
connection-class faults (TimeoutException, SocketException, SqlException /
DbException by type-name match) and false for everything else — notably
schema-validation InvalidOperationExceptions — and the Program.cs migration call
site now passes it. Regression tests in StartupRetryTests:
ExecuteWithRetry_NonTransientFailure_RethrowsAfterSingleAttempt (runs exactly once
when isTransient returns false) and ExecuteWithRetry_TransientThenPermanent_StopsAtPermanent
(retries a TimeoutException then stops at a permanent InvalidOperationException).
Full Host suite green (182 passed).
Host-016 — Site CentralContactPoints second entry targets the site's own remoting port
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:33-37 |
Description
The shipped site config sets Node:RemotingPort = 8082 and lists
Communication:CentralContactPoints as
["akka.tcp://scadabridge@localhost:8081", "akka.tcp://scadabridge@localhost:8082"].
The second contact point — port 8082 — is the site's own remoting endpoint,
not a central node. SiteCommunicationActor / ClusterClient uses these
addresses as initial contacts when discovering the central
ClusterClientReceptionist; a contact pointing at the site itself can never
reach the central receptionist and will be a permanent failure in the
initial-contact rotation. For the single-node dev loopback layout the first
contact (8081, central) succeeds and the bug is masked, but this is exactly
the kind of dev-config "example" that gets duplicated into multi-central
deployments — the same failure mode the resolved Host-004 finding called out
for the seed-node list. StartupValidator validates seed nodes against the
gRPC port (Host-004) but does not cross-check CentralContactPoints against
the site's own RemotingPort, so the misconfiguration passes silently.
Recommendation
Correct the shipped site example to list two central remoting endpoints (e.g.
localhost:8081 for central-a and a distinct port for central-b in a
multi-node layout). Consider extending StartupValidator to reject any
Communication:CentralContactPoints entry whose host+port matches this site
node's NodeHostname+RemotingPort. Add a regression test in
StartupValidatorTests mirroring Site_SeedNodeOnGrpcPort_FailsValidation.
Resolution (2026-05-28): The shipped appsettings.Site.json CentralContactPoints entry that pointed at the site's own remoting port (localhost:8082) was removed — the dev-loopback default now lists only the single central node (akka.tcp://scadabridge@localhost:8081), which is the actually-reachable target in the single-node dev layout. A _centralContactPoints doc-key comment was added immediately above the array calling out the per-entry rule (each entry MUST be a central node's remoting endpoint, not the site's own remoting port) and explaining how to extend the list with a second central node (akka.tcp://scadabridge@central-b-host:8081) in a multi-central deployment so ClusterClient can fail over. The dangerous example pattern that would have been copied into multi-central configs no longer exists in the template. StartupValidator cross-check is left as a follow-up — the documented rule plus the corrected template removes the immediate misconfiguration risk.
Host-017 — Site-shutdown ordering from REQ-HOST-7 is not wired
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:229-265, src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcServer.cs |
Description
REQ-HOST-7 documents an explicit four-step shutdown sequence for site nodes:
"(1) On CoordinatedShutdown, stop accepting new gRPC streams first.
(2) Cancel all active gRPC streams (triggering client-side reconnect).
(3) Tear down actors.
(4) Use IHostApplicationLifetime.ApplicationStopping to signal the gRPC
server." The site path in Program.cs (the role == "Site" branch) registers
no IHostApplicationLifetime.ApplicationStopping callback, and
SiteStreamGrpcServer exposes no "stop accepting" / "cancel all streams"
entry point — it has SetReady but no corresponding SetUnavailable or
CancelAllStreams. In practice, on SIGTERM Kestrel closes its listener
naturally and AkkaHostedService.StopAsync runs Akka CoordinatedShutdown,
but there is no explicit, ordered handoff that meets the documented contract:
in-flight streams are not actively cancelled before actors begin tearing down,
so clients see a stream that goes silent (and only times out via gRPC
keepalive) rather than a clean Cancelled they can reconnect on. This is a
contract-vs-code drift — either the design doc is overstating what is
implemented, or the implementation is incomplete.
Recommendation
Add a SiteStreamGrpcServer.CancelAllStreams() method that flips a "shutting
down" flag (so SubscribeSite immediately fails new streams with
StatusCode.Unavailable) and cancels every entry's Cts in the _streams
map. In Program.cs site branch, resolve IHostApplicationLifetime and
register a callback on ApplicationStopping that calls CancelAllStreams()
before the Akka hosted service runs CoordinatedShutdown (or order via
AkkaHostedService.StopAsync itself — IHostedService.StopAsync runs in
reverse-registration order, so the gRPC server's lifetime can be sequenced
before Akka shutdown). Alternatively, reconcile REQ-HOST-7 with the actual
implementation if the explicit ordering is no longer intended. Add an
integration test under tests/ZB.MOM.WW.ScadaBridge.Host.Tests that starts a site host,
opens a stream, triggers shutdown, and asserts the stream completes with
Cancelled before the actor system tears down.
Resolution (2026-05-28):
REQ-HOST-7 steps (1)+(2) wired. SiteStreamGrpcServer gained:
- a monotonic
_shuttingDownflag, CancelAllStreams()— flips the flag, cancels every_activeStreams[*].Cts(withObjectDisposedExceptionswallow for entries cleaning themselves up concurrently), idempotent on repeat calls,- a
SubscribeInstanceguard that returnsUnavailable "Server shutting down"for new subscriptions arriving after the flag flips.
Program.cs site branch now resolves IHostApplicationLifetime and the
SiteStreamGrpcServer singleton, then registers
ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams()).
ApplicationStopping fires before any IHostedService.StopAsync, so the
gRPC server begins refusing new streams and tears down in-flight ones
BEFORE AkkaHostedService runs CoordinatedShutdown — matching REQ-HOST-7's
ordering. Clients observe a clean Cancelled and reconnect rather than a
silent stream that times out via keepalive (~25 s).
Two unit regression tests added to
tests/ZB.MOM.WW.ScadaBridge.Communication.Tests/Grpc/SiteStreamGrpcServerTests.cs:
Host017_CancelAllStreams_CancelsActiveStreamsAndRefusesNewOnes (active
streams complete, new ones rejected) and Host017_CancelAllStreams_IsIdempotent
(double-call safe). A full site-host integration test was deferred — the
unit suite covers both server-side invariants and the wiring is a single
Register line in Program.cs.
Host-018 — Shipped per-role configs omit NodeOptions.NodeName, leaving SourceNode null
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json, src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json, src/ZB.MOM.WW.ScadaBridge.Host/NodeOptions.cs:10-16 |
Description
NodeOptions.NodeName is documented as "the operator-configured semantic node
name used to stamp the SourceNode column on audit rows", with conventional
values node-a/node-b for site nodes and central-a/central-b for
central nodes. The CLAUDE.md "Centralized Audit Log" key-decision section
calls this out: SourceNode is meant to be carried verbatim through audit
telemetry and reconciliation, and is indexed via
IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc). The docker per-node
configs (docker/central-node-a/appsettings.Central.json,
docker/site-a-node-a/appsettings.Site.json, etc.) all set
ScadaBridge:Node:NodeName. The shipped, default per-role files in
src/ZB.MOM.WW.ScadaBridge.Host/ — the templates a developer running the binary
directly will use — do not. NodeIdentityProvider normalises an empty
NodeName to null, so dev audit rows carry a null SourceNode and the
indexed lookup never narrows. The dev examples should match the docker
examples; at minimum the field should appear in the shipped templates with a
placeholder explaining the convention.
Recommendation
Add "NodeName": "central-a" (or a placeholder like "${NODE_NAME}") to
appsettings.Central.json and "NodeName": "node-a" to
appsettings.Site.json, with a short comment that the value must be set
per-node in multi-node deployments. Consider validating in StartupValidator
that NodeName is non-empty, or accept the null and document explicitly that
single-node dev deployments leave SourceNode null.
Resolution (2026-05-28): The shipped per-role templates now set ScadaBridge:Node:NodeName — central-a in appsettings.Central.json and node-a in appsettings.Site.json — so dev audit rows are stamped with a real SourceNode value (instead of NodeIdentityProvider normalising the missing key to null) and the indexed IX_AuditLog_Node_Occurred lookup actually narrows. A _nodeName doc-key comment was added beside each Node section explaining the convention (central-a/central-b for central, node-a/node-b for site), pointing at the docker per-node configs (which already overrode the field), and noting that the value must be overridden per-node in multi-node deployments and that an empty value still normalises to a NULL SourceNode. The shipped dev templates now match the per-node docker examples — a developer running the binary directly no longer sees a null SourceNode.
Host-019 — Migration StartupRetry call drops the host CancellationToken
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:154-165 |
Resolution (2026-05-28): Added a Func<CancellationToken, Task> overload of StartupRetry.ExecuteWithRetryAsync that forwards the retry-loop token into the operation, and the migration call site in Program.cs now passes app.Lifetime.ApplicationStopping as both the operation token (threaded to MigrationHelper.ApplyOrValidateMigrationsAsync) and the loop's cancellationToken (already honoured by the inter-attempt Task.Delay). A SIGTERM during the bounded retry window now tears down cleanly instead of waiting up to ~2 minutes for the loop to exhaust. The original Func<Task> overload still exists and delegates, so existing callers/tests are unchanged.
Description
StartupRetry.ExecuteWithRetryAsync accepts an optional
CancellationToken cancellationToken = default and observes it both at the
top of each attempt and inside the Task.Delay between retries. The migration
call site in Program.cs passes no token, so the helper runs with
CancellationToken.None. With maxAttempts: 8, initialDelay: 2s, and the
30s cap, a database that stays unreachable can keep the retry loop alive for
~2 minutes before the host process responds to SIGTERM / Ctrl+C /
Windows-Service stop. The Program.cs startup pipeline does not yet have a
host-lifetime token to forward at this point (the app is built but not
yet running), but app.Lifetime.ApplicationStopping is available the moment
builder.Build() returns. Threading it into the retry call honours the host
lifecycle and matches the helper's documented contract.
Recommendation
Pass app.Lifetime.ApplicationStopping (or CancellationToken.None
explicitly with a comment if intentional) into
StartupRetry.ExecuteWithRetryAsync. Add a StartupRetryTests case
exercising token-cancellation mid-backoff.
Resolution
Open.
Host-020 — MinimumLevel.Is silently overrides any operator-set Serilog:MinimumLevel
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs:36-43 |
Description
LoggerConfigurationFactory.Build reads the Serilog configuration section
via ReadFrom.Configuration(configuration) (which can include a
MinimumLevel block — the standard Serilog way to set the floor) and then
calls .MinimumLevel.Is(minimumLevel) derived from
ScadaBridge:Logging:MinimumLevel. Serilog's fluent builder applies the later
call, so any Serilog:MinimumLevel:Default an operator sets is silently
overridden by ScadaBridge:Logging:MinimumLevel (or by its
Information fallback when the ScadaBridge key is absent). There are now two
documented configuration paths for the same setting with non-obvious
precedence, and the override direction is the opposite of what most Serilog
users would expect (the more-specific Serilog section being the authority).
The XML doc on Build says "the explicit MinimumLevel.Is pins the floor"
but does not warn that the floor overrides the Serilog section's own
MinimumLevel.
Recommendation
Pick one mechanism: either (a) drop the MinimumLevel.Is call and let
ReadFrom.Configuration consume Serilog:MinimumLevel, migrating any docs/
deployments that reference ScadaBridge:Logging:MinimumLevel; or (b) keep the
current "ScadaBridge:Logging" path and reject Serilog:MinimumLevel if present
(throw at startup so the operator sees the conflict). At minimum, expand the
XML doc + REQ-HOST-8 to spell out the precedence explicitly.
Resolution (2026-05-28): ScadaBridge:Logging:MinimumLevel is now the documented single source of truth for the Serilog floor (Host-011's LoggingOptions binding), and the precedence is made visible — LoggerConfigurationFactory.Build writes a one-shot warning to Console.Error when both ScadaBridge:Logging:MinimumLevel and Serilog:MinimumLevel (or Serilog:MinimumLevel:Default) are present, naming both values and pointing the operator at the documented key. Order of operations is unchanged — MinimumLevel.Is(...) deliberately runs after ReadFrom.Configuration(...) so the ScadaBridge value wins — but the silent-override behaviour is now loud. The class XML doc gained a Host-020 paragraph explicitly spelling out the precedence. A test-visible Build(..., TextWriter warningWriter) overload mirrors the ParseLevel Host-022 pattern so the warning can be asserted in unit tests; the production four-arg overload delegates with Console.Error.
Host-021 — Microsoft Logging:LogLevel section in appsettings.json is dead config under Serilog
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/appsettings.json:2-6 |
Description
appsettings.json carries a Microsoft Logging:LogLevel:Default = Information
block. The Logging:LogLevel map is consumed by
Microsoft.Extensions.Logging.ConfigurationConsoleLoggerOptions and similar
provider configurations bound from the standard Logging section. The Host
calls builder.Host.UseSerilog(), which replaces the default
ILoggerFactory setup with Serilog as the only logger provider; Serilog
reads from configuration.ReadFrom.Configuration(...) which consumes the
Serilog section, not Logging:LogLevel. The result is that an operator
editing Logging:LogLevel:Default (a very natural thing to try, since it is
the .NET convention) sees no behaviour change — the section is dead config.
Recommendation
Either remove the Logging:LogLevel block from appsettings.json (Serilog
owns logging configuration in this Host), or replace it with a brief comment
explaining it is intentionally retained for non-Serilog tooling. Document the
authoritative location (Serilog + ScadaBridge:Logging) in
Component-Host.md REQ-HOST-8 if not already explicit.
Resolution (2026-05-28): Confirmed by repository-wide grep that no code reads Logging:LogLevel (the Host calls builder.Host.UseSerilog() which replaces the default ILoggerFactory setup with Serilog as the only provider), so the block was pure dead config. Removed the Logging:LogLevel:Default = Information block from appsettings.json and replaced it with a _logging doc-key comment explaining the rationale (Serilog is the sole provider) and pointing operators at the two authoritative keys: ScadaBridge:Logging:MinimumLevel for the floor (bound to LoggingOptions per Host-011) and the Serilog section for sinks (Host-014's ReadFrom.Configuration). The Host-014 regression test (SerilogSinkConfigTests.ShippedAppSettings_HasSerilogSection_WithConsoleAndFileSinks) still asserts the surviving Serilog section's shape, so removing the Microsoft block did not break the existing pinning.
Host-022 — ParseLevel silently coerces unrecognised MinimumLevel to Information
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs:50-55 |
Resolution (2026-05-28): ParseLevel now writes a one-shot warning to Console.Error (the logger isn't built yet at this point) when a non-null/non-blank MinimumLevel value fails to parse, naming the offending value and the Information fallback. Null/blank values continue to default silently (treated as "unset"). The helper gained a test-visible TextWriter overload so unit tests can capture the warning; the production path delegates to it with Console.Error. Tests ParseLevel_UnrecognisedValue_FallsBackAndWarns, ParseLevel_NullOrBlank_FallsBackSilently, and ParseLevel_RecognisedValue_NoWarning pin the behaviour.
Description
LoggerConfigurationFactory.ParseLevel uses
Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed) and
returns LogEventLevel.Information when parsing fails — without logging the
fallback. An operator who sets
ScadaBridge:Logging:MinimumLevel = "Informaiton" (a common typo) or
"Verbose,Debug" or any unrecognised value gets the default level silently;
there is no warning, no log line, no startup error. Combined with Host-020
(this is the only mechanism that pins the floor), a misspelt value is
invisible until someone wonders why the level change "didn't take". The
helper is small and could either fail-fast in StartupValidator or emit a
console warning before the logger is configured.
Recommendation
In LoggerConfigurationFactory.Build, when loggingOptions.MinimumLevel is
non-null/non-blank but does not parse to a valid LogEventLevel, write a
Console.Error.WriteLine warning (the logger is not yet built) and proceed
with Information. Alternatively, validate the value in StartupValidator
and fail fast — that matches the pattern used for other ScadaBridge
configuration keys. Add a LoggerConfigurationTests case asserting the
behaviour you choose.
Resolution
Open.