code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
Joseph Doherty
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
+325 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Host` |
| Design doc | `docs/requirements/Component-Host.md` |
| Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |
## Summary
@@ -48,6 +48,38 @@ Serilog sink setup is hard-coded in `Program.cs` rather than configuration-drive
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
exception type including permanent schema-validation failures (Host-015).
#### Re-review 2026-05-28 (commit `1eb6e97`)
All fifteen prior findings (Host-001..015) remain `Resolved` in the current tree,
and the fixes introduced for them (Host-001's predicate, the externalised
secrets, the Site GrpcPort/RemotingPort/seed-port validation rules, the escaped
HOCON builder with `DownIfAlone` and millisecond-precision durations, the
configuration-driven Serilog sinks, and the transient-only `StartupRetry`
classifier) are all still in place. This re-review walked the ten checklist
categories over the full module again and recorded seven new findings, none of
them in the crash or data-loss class. Host-016 (Medium) mirrors the resolved
Host-004 shipped-config bug on the **Communication** side: the second
`CentralContactPoints` entry in `appsettings.Site.json` points at the site's own
remoting port (`localhost:8082`) instead of at central. This is an incorrect dev
example that is easily copied into multi-central deployments. Host-017 (Medium)
flags a partial REQ-HOST-7 implementation: the documented site-shutdown ordering
(stop accepting new streams first, cancel active streams via
`IHostApplicationLifetime.ApplicationStopping`, then tear down actors) is not
wired. The site path registers no `ApplicationStopping` handler that signals
`SiteStreamGrpcServer`, and the gRPC server exposes no cancel-all-streams entry
point. The remaining five are Low. `NodeOptions.NodeName` (the
operator-configured value stamped on `AuditLog.SourceNode`) is absent from both
shipped per-role configs even though the docker per-node configs set it
(Host-018). The migration `StartupRetry` call omits its `CancellationToken`, so
the helper runs with `CancellationToken.None` and a SIGTERM during the
bounded-retry window is not observed for up to ~2 minutes (Host-019).
`LoggerConfigurationFactory` layers `MinimumLevel.Is` over
`ReadFrom.Configuration`, so any `Serilog:MinimumLevel` an operator sets is
silently overridden by `ScadaLink:Logging:MinimumLevel` (Host-020). The shipped
`appsettings.json` carries a Microsoft `Logging:LogLevel` block, but Serilog is
the only logger provider, so the section is dead config (Host-021). Finally,
`ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g. a typo)
and falls back to `Information` with no warning (Host-022).
## Checklist coverage
| # | Category | Examined | Notes |
@@ -63,6 +95,21 @@ exception type including permanent schema-validation failures (Host-015).
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Re-review: `appsettings.Site.json` second `CentralContactPoints` entry targets the site's own remoting port instead of central (Host-016) — same defect class as the resolved Host-004 seed-list bug. |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping, role-scoped site singletons, ClusterClient initial-contact wiring all reviewed; no new issues. |
| 3 | Concurrency & thread safety | ☑ | `_trackedDisposables` is locked on both sides of the lifecycle; `_actorSystem` publication is safe via the IHost startup `await` boundary. New Low: the migration `StartupRetry` call omits its `CancellationToken`, so a SIGTERM during the retry window is not observed (Host-019). |
| 4 | Error handling & resilience | ☑ | `IsTransientDatabaseFault` correctly classifies socket / timeout / SqlException; the retry helper itself remains sound. Host-019 is the resilience gap. |
| 5 | Security | ☑ | Secrets stay externalised; the `_secrets` placeholder comment is intact. No new issues. |
| 6 | Performance & resource management | ☑ | No new undisposed resources; gRPC stream lifetime cap remains correct. No new issues. |
| 7 | Design-document adherence | ☑ | Re-review: REQ-HOST-7 site-shutdown ordering — stop accepting new streams, cancel active streams via `ApplicationStopping`, then tear down actors — is not wired in `Program.cs` (Host-017). |
| 8 | Code organization & conventions | ☑ | Re-review: `NodeOptions.NodeName` is absent from the shipped per-role configs even though it stamps `AuditLog.SourceNode` (Host-018); the appsettings `Logging:LogLevel` Microsoft section is dead config under Serilog (Host-021). |
| 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the dropped `CancellationToken` at the migration call site (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). |
| 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaLink:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). |
## Findings
### Host-001 — `/health/ready` includes the leader-only `active-node` check
@@ -777,3 +824,278 @@ site now passes it. Regression tests in `StartupRetryTests`:
when `isTransient` returns false) and `ExecuteWithRetry_TransientThenPermanent_StopsAtPermanent`
(retries a `TimeoutException` then stops at a permanent `InvalidOperationException`).
Full Host suite green (182 passed).
### Host-016 — Site `CentralContactPoints` second entry targets the site's own remoting port
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/appsettings.Site.json:33-37` |
**Description**
The shipped site config sets `Node:RemotingPort = 8082` and lists
`Communication:CentralContactPoints` as
`["akka.tcp://scadalink@localhost:8081", "akka.tcp://scadalink@localhost:8082"]`.
The second contact point — port `8082` — is the **site's own** remoting endpoint,
not a central node. `SiteCommunicationActor` / `ClusterClient` uses these
addresses as initial contacts when discovering the central
`ClusterClientReceptionist`; a contact pointing at the site itself can never
reach the central receptionist and will be a permanent failure in the
initial-contact rotation. For the single-node dev loopback layout the first
contact (`8081`, central) succeeds and the bug is masked, but this is exactly
the kind of dev-config "example" that gets duplicated into multi-central
deployments — the same failure mode the resolved Host-004 finding called out
for the seed-node list. `StartupValidator` validates seed nodes against the
gRPC port (Host-004) but does not cross-check `CentralContactPoints` against
the site's own `RemotingPort`, so the misconfiguration passes silently.
**Recommendation**
Correct the shipped site example to list two central remoting endpoints (e.g.
`localhost:8081` for `central-a` and a distinct port for `central-b` in a
multi-node layout). Consider extending `StartupValidator` to reject any
`Communication:CentralContactPoints` entry whose host+port matches this site
node's `NodeHostname`+`RemotingPort`. Add a regression test in
`StartupValidatorTests` mirroring `Site_SeedNodeOnGrpcPort_FailsValidation`.
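A minimal sketch of the proposed `StartupValidator` rule follows. It assumes contact points use the `akka.tcp://system@host:port` form shown above, and that `System.Uri` can split them into host and port, as it does for any absolute URI with an authority. `ContactPointRules` and `FindSelfContacts` are hypothetical names, not existing Host APIs. The host comparison is a plain string match, so `localhost` and `127.0.0.1` are treated as different hosts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical StartupValidator-style rule: reject any central contact point
// whose host + port equals this site node's own host + remoting port.
public static class ContactPointRules
{
    public static List<string> FindSelfContacts(
        IEnumerable<string> centralContactPoints, string nodeHostname, int remotingPort)
    {
        var errors = new List<string>();
        foreach (var contact in centralContactPoints)
        {
            // System.Uri splits "akka.tcp://scadalink@localhost:8082" into
            // user info (actor system name), host and port.
            if (!Uri.TryCreate(contact, UriKind.Absolute, out var uri) || uri.Port <= 0)
            {
                errors.Add($"CentralContactPoints entry '{contact}' is not a host:port address.");
                continue;
            }

            // Case-insensitive host match with no DNS resolution.
            bool sameHost = string.Equals(uri.Host, nodeHostname, StringComparison.OrdinalIgnoreCase);
            if (sameHost && uri.Port == remotingPort)
            {
                errors.Add($"CentralContactPoints entry '{contact}' targets this node's own remoting port {remotingPort}.");
            }
        }
        return errors;
    }
}
```

Given the shipped site values (`localhost`, `RemotingPort = 8082`), this rule would report the second entry and accept the first.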
**Resolution**
_Open._
### Host-017 — Site-shutdown ordering from REQ-HOST-7 is not wired
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:229-265`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs` |
**Description**
REQ-HOST-7 documents an explicit four-step shutdown sequence for site nodes:
"(1) On `CoordinatedShutdown`, stop accepting new gRPC streams first.
(2) Cancel all active gRPC streams (triggering client-side reconnect).
(3) Tear down actors.
(4) Use `IHostApplicationLifetime.ApplicationStopping` to signal the gRPC
server." The site path in `Program.cs` (the `role == "Site"` branch) registers
no `IHostApplicationLifetime.ApplicationStopping` callback, and
`SiteStreamGrpcServer` exposes no "stop accepting" / "cancel all streams"
entry point — it has `SetReady` but no corresponding `SetUnavailable` or
`CancelAllStreams`. In practice, on `SIGTERM` Kestrel closes its listener
naturally and `AkkaHostedService.StopAsync` runs Akka `CoordinatedShutdown`,
but there is no explicit, ordered handoff that meets the documented contract:
in-flight streams are not actively cancelled before actors begin tearing down,
so clients see a stream that goes silent (and only times out via gRPC
keepalive) rather than a clean `Cancelled` they can reconnect on. This is a
contract-vs-code drift — either the design doc is overstating what is
implemented, or the implementation is incomplete.
**Recommendation**
Add a `SiteStreamGrpcServer.CancelAllStreams()` method that flips a "shutting
down" flag (so `SubscribeSite` immediately fails new streams with
`StatusCode.Unavailable`) and cancels every entry's `Cts` in the `_streams`
map. In `Program.cs` site branch, resolve `IHostApplicationLifetime` and
register a callback on `ApplicationStopping` that calls `CancelAllStreams()`
before the Akka hosted service runs `CoordinatedShutdown` (or order via
`AkkaHostedService.StopAsync` itself — `IHostedService.StopAsync` runs in
reverse-registration order, so the gRPC server's lifetime can be sequenced
before Akka shutdown). Alternatively, reconcile REQ-HOST-7 with the actual
implementation if the explicit ordering is no longer intended. Add an
integration test under `tests/ScadaLink.Host.Tests` that starts a site host,
opens a stream, triggers shutdown, and asserts the stream completes with
`Cancelled` before the actor system tears down.
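The flag-plus-cancel behaviour recommended above can be sketched as a small registry. `StreamRegistry`, `TryRegister`, `Unregister` and `CancelAllStreams` are hypothetical stand-ins for the `_streams` map and the proposed method on `SiteStreamGrpcServer`. The gRPC plumbing is omitted so the sketch compiles on the BCL alone. A single lock makes "check the flag" and "publish the stream" atomic with respect to `CancelAllStreams`, so no stream can register after shutdown has begun.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical model of the proposed SiteStreamGrpcServer shutdown hooks.
public sealed class StreamRegistry
{
    private readonly object _gate = new();
    private readonly Dictionary<Guid, CancellationTokenSource> _streams = new();
    private bool _shuttingDown;

    // Called at the start of a SubscribeSite call. Returns null once shutdown has
    // begun; the real handler would then fail the call with StatusCode.Unavailable.
    public CancellationTokenSource? TryRegister(Guid streamId)
    {
        lock (_gate)
        {
            if (_shuttingDown) return null;
            var cts = new CancellationTokenSource();
            _streams[streamId] = cts;
            return cts;
        }
    }

    // Called when a stream ends normally.
    public void Unregister(Guid streamId)
    {
        CancellationTokenSource? cts;
        lock (_gate)
        {
            if (!_streams.Remove(streamId, out cts)) return;
        }
        cts.Dispose();
    }

    // REQ-HOST-7 steps (1) and (2): refuse new streams, then cancel live ones.
    public void CancelAllStreams()
    {
        List<CancellationTokenSource> snapshot;
        lock (_gate)
        {
            _shuttingDown = true;
            snapshot = new List<CancellationTokenSource>(_streams.Values);
        }

        // Cancel outside the lock, because Cancel() runs registered callbacks synchronously.
        foreach (var cts in snapshot)
        {
            try { cts.Cancel(); }
            catch (ObjectDisposedException) { /* stream ended and was unregistered concurrently */ }
        }
    }
}
```

In the site branch of `Program.cs`, the wiring would be one line: `app.Lifetime.ApplicationStopping.Register(() => registry.CancelAllStreams())`. The Generic Host signals `ApplicationStopping` before it calls `StopAsync` on hosted services, so this callback would run ahead of the Akka hosted service's `CoordinatedShutdown`.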
**Resolution**
_Open._
### Host-018 — Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Host/appsettings.Central.json`, `src/ScadaLink.Host/appsettings.Site.json`, `src/ScadaLink.Host/NodeOptions.cs:10-16` |
**Description**
`NodeOptions.NodeName` is documented as "the operator-configured semantic node
name used to stamp the SourceNode column on audit rows", with conventional
values `node-a`/`node-b` for site nodes and `central-a`/`central-b` for
central nodes. The CLAUDE.md "Centralized Audit Log" key-decision section
calls this out: `SourceNode` is meant to be carried verbatim through audit
telemetry and reconciliation, and is indexed via
`IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc)`. The docker per-node
configs (`docker/central-node-a/appsettings.Central.json`,
`docker/site-a-node-a/appsettings.Site.json`, etc.) all set
`ScadaLink:Node:NodeName`. The **shipped, default** per-role files in
`src/ScadaLink.Host/` do not. These are the templates a developer running the
binary directly will use. `NodeIdentityProvider` normalises an empty `NodeName`
to `null`, so dev audit rows carry a null `SourceNode`, and a per-node query
through `IX_AuditLog_Node_Occurred` cannot tell the nodes apart. The dev
examples should match the docker examples. At minimum, the field should appear
in the shipped templates with an example value and a comment explaining the
convention.
**Recommendation**
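Add `"NodeName": "central-a"` to `appsettings.Central.json` and
`"NodeName": "node-a"` to `appsettings.Site.json`. Include a short comment (the
.NET JSON configuration provider tolerates comments) stating that the value must
be set per node in multi-node deployments. Avoid a literal placeholder such as
`"${NODE_NAME}"`: .NET configuration does not expand `${...}` syntax, so the
string would be stamped on audit rows verbatim. Per-node values belong in the
per-node files or in an environment override such as `ScadaLink__Node__NodeName`.
Consider validating in `StartupValidator` that `NodeName` is non-empty, or
accept the null and document explicitly that single-node dev deployments leave
`SourceNode` null.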
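The fail-fast option can be sketched as follows. The sketch assumes, as described above, that blank values normalise to `null`. Treating whitespace-only values as blank is an additional assumption. `NodeNameRules` is a hypothetical helper and not part of `StartupValidator`'s current API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical fail-fast check for ScadaLink:Node:NodeName.
public static class NodeNameRules
{
    // Empty or whitespace-only is treated as "not configured", mirroring the
    // described NodeIdentityProvider behaviour of mapping empty to null.
    public static string? Normalise(string? nodeName) =>
        string.IsNullOrWhiteSpace(nodeName) ? null : nodeName;

    public static List<string> Validate(string? nodeName)
    {
        var errors = new List<string>();
        if (Normalise(nodeName) is null)
        {
            errors.Add("ScadaLink:Node:NodeName is not set; audit rows would carry a null " +
                       "SourceNode. Use e.g. central-a/central-b or node-a/node-b per node.");
        }
        return errors;
    }
}
```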
**Resolution**
_Open._
### Host-019 — Migration `StartupRetry` call drops the host `CancellationToken`
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:154-165` |
**Description**
`StartupRetry.ExecuteWithRetryAsync` accepts an optional
`CancellationToken cancellationToken = default` and observes it both at the
top of each attempt and inside the `Task.Delay` between retries. The migration
call site in `Program.cs` passes no token, so the helper runs with
`CancellationToken.None`. With `maxAttempts: 8`, `initialDelay: 2s`, and the
30s cap, a database that stays unreachable can keep the retry loop alive for
~2 minutes before the host process responds to `SIGTERM` / `Ctrl+C` /
Windows-Service stop. At this point in `Program.cs`, the `app` has been built
but is not yet running, so no run-time token is in hand. However,
`app.Lifetime.ApplicationStopping` is available the moment `builder.Build()`
returns. Threading a lifetime token into the retry call honours the host
lifecycle and matches the helper's documented contract.
**Recommendation**
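Pass `app.Lifetime.ApplicationStopping` into
`StartupRetry.ExecuteWithRetryAsync`. If the current behaviour is intentional,
pass `CancellationToken.None` explicitly instead, with a comment. Caveat: the
Generic Host's `ConsoleLifetime` registers its SIGINT/SIGTERM handlers only when
the host starts. If the migration runs before `app.Run()`, `ApplicationStopping`
will not fire for a signal that arrives during the retry window. In that case, a
dedicated `CancellationTokenSource` cancelled from `PosixSignalRegistration` (or
`Console.CancelKeyPress`) is the token that reflects an operator stop. Add a
`StartupRetryTests` case exercising token cancellation mid-backoff.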
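The following sketch shows both the arithmetic behind the ~2-minute figure and the behaviour the recommended test should pin down. `RetrySketch` is a hypothetical stand-in for `StartupRetry`, and the real helper's backoff and signature may differ. The sketch assumes the delay doubles from `initialDelay` and is capped at 30 s. For `maxAttempts: 8`, that gives seven inter-attempt delays of 2 + 4 + 8 + 16 + 30 + 30 + 30 = 120 s. The token is checked at the top of each attempt and passed to `Task.Delay`, so cancellation ends a backoff immediately.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for StartupRetry; assumes doubling backoff with a cap.
public static class RetrySketch
{
    // Delay slept before the given 1-based attempt (attempt >= 2).
    public static TimeSpan DelayBefore(int attempt, TimeSpan initialDelay, TimeSpan maxDelay)
    {
        double ms = initialDelay.TotalMilliseconds * Math.Pow(2, attempt - 2);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    // Total sleep time if every attempt fails transiently.
    public static TimeSpan WorstCaseDelay(int maxAttempts, TimeSpan initialDelay, TimeSpan maxDelay)
    {
        var total = TimeSpan.Zero;
        for (int attempt = 2; attempt <= maxAttempts; attempt++)
        {
            total += DelayBefore(attempt, initialDelay, maxDelay);
        }
        return total;
    }

    public static async Task ExecuteWithRetryAsync(
        Func<CancellationToken, Task> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan initialDelay,
        TimeSpan maxDelay,
        CancellationToken cancellationToken = default)
    {
        for (int attempt = 1; ; attempt++)
        {
            // Observed at the top of each attempt...
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                await operation(cancellationToken).ConfigureAwait(false);
                return;
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex)
                                       && !cancellationToken.IsCancellationRequested)
            {
                // Transient failure with attempts left: fall through to the backoff.
            }

            // ...and inside the backoff, so a cancelled token ends the wait at once.
            await Task.Delay(DelayBefore(attempt + 1, initialDelay, maxDelay), cancellationToken)
                .ConfigureAwait(false);
        }
    }
}
```

With `CancellationToken.None`, as at the current call site, only the 120 s of backoff plus the attempts themselves bound the loop.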
**Resolution**
_Open._
### Host-020 — `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.Host/LoggerConfigurationFactory.cs:36-43` |
**Description**
`LoggerConfigurationFactory.Build` reads the `Serilog` configuration section
via `ReadFrom.Configuration(configuration)` (which can include a
`MinimumLevel` block — the standard Serilog way to set the floor) and **then**
calls `.MinimumLevel.Is(minimumLevel)` derived from
`ScadaLink:Logging:MinimumLevel`. In Serilog's fluent builder the last
minimum-level call wins, so any `Serilog:MinimumLevel:Default` an operator sets is silently
overridden by `ScadaLink:Logging:MinimumLevel` (or by its
`Information` fallback when the ScadaLink key is absent). There are now two
documented configuration paths for the same setting with non-obvious
precedence, and the override direction is the opposite of what most Serilog
users would expect (the more-specific `Serilog` section being the authority).
The XML doc on `Build` says "the explicit `MinimumLevel.Is` pins the floor"
but does not warn that the floor *overrides* the Serilog section's own
`MinimumLevel`.
**Recommendation**
Pick one mechanism: either (a) drop the `MinimumLevel.Is` call and let
`ReadFrom.Configuration` consume `Serilog:MinimumLevel`, migrating any docs/
deployments that reference `ScadaLink:Logging:MinimumLevel`; or (b) keep the
current "ScadaLink:Logging" path and reject `Serilog:MinimumLevel` if present
(throw at startup so the operator sees the conflict). At minimum, expand the
XML doc + REQ-HOST-8 to spell out the precedence explicitly.
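Option (b) can be sketched as a startup check over a flattened key/value view, such as the one `IConfiguration.AsEnumerable()` produces. `LogLevelConflictCheck` is hypothetical. The check flags only the default-level forms: `Serilog:MinimumLevel` as a plain string, or `Serilog:MinimumLevel:Default`. This rests on the assumption that per-source `Override` entries survive the later `MinimumLevel.Is` call and so are not in conflict.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical fail-fast check for option (b): a Serilog default level is
// always overridden by ScadaLink:Logging:MinimumLevel (or its fallback).
public static class LogLevelConflictCheck
{
    // Returns an error message, or null when Serilog sets no default level.
    public static string? FindConflict(IEnumerable<KeyValuePair<string, string?>> flattened)
    {
        foreach (var (key, value) in flattened)
        {
            // Section nodes such as "Serilog:MinimumLevel" with children carry no value.
            if (string.IsNullOrWhiteSpace(value)) continue;

            // Configuration keys are case-insensitive, so compare the same way.
            bool setsDefault =
                key.Equals("Serilog:MinimumLevel", StringComparison.OrdinalIgnoreCase) ||
                key.Equals("Serilog:MinimumLevel:Default", StringComparison.OrdinalIgnoreCase);
            if (setsDefault)
            {
                return $"'{key}' = '{value}' would be overridden by ScadaLink:Logging:MinimumLevel; " +
                       "set the level there instead.";
            }
        }
        return null;
    }
}
```

In `Program.cs`, a non-null result would be thrown as a startup error, so the operator sees the conflict instead of a silently ignored setting.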
**Resolution**
_Open._
### Host-021 — Microsoft `Logging:LogLevel` section in `appsettings.json` is dead config under Serilog
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Host/appsettings.json:2-6` |
**Description**
`appsettings.json` carries a Microsoft `Logging:LogLevel:Default = Information`
block. That map is the standard `Microsoft.Extensions.Logging` filter
configuration. It is bound from the `Logging` section into `LoggerFilterOptions`
rules, which the built-in `LoggerFactory` applies to its providers. The Host
calls `builder.Host.UseSerilog()`, which registers Serilog's `ILoggerFactory` in
place of the default one and makes Serilog the **only** logger provider. Serilog
itself is configured through `ReadFrom.Configuration(...)`, which consumes the
`Serilog` section, **not** `Logging:LogLevel`. The result is that an operator
editing `Logging:LogLevel:Default` (a very natural thing to try, since it is
the .NET convention) sees no behaviour change — the section is dead config.
**Recommendation**
Either remove the `Logging:LogLevel` block from `appsettings.json` (Serilog
owns logging configuration in this Host), or keep it with a brief comment
explaining that it is intentionally retained for non-Serilog tooling. Document the
authoritative location (`Serilog` + `ScadaLink:Logging`) in
`Component-Host.md` REQ-HOST-8 if not already explicit.
**Resolution**
_Open._
### Host-022 — `ParseLevel` silently coerces unrecognised `MinimumLevel` to `Information`
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Host/LoggerConfigurationFactory.cs:50-55` |
**Description**
`LoggerConfigurationFactory.ParseLevel` uses
`Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed)` and
returns `LogEventLevel.Information` when parsing fails — without logging the
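fallback. An operator who sets
`ScadaLink:Logging:MinimumLevel = "Informaiton"` (a common typo), or any other
unrecognised name, gets the default level silently. There is no warning, no log
line and no startup error. `Enum.TryParse` is also more permissive than a name
lookup. Numeric strings parse to the corresponding value even when it is
undefined, so `"42"` yields a level above `Fatal` that would filter out every
event. A comma-separated value such as `"Verbose,Debug"` parses as the bitwise
OR of the named members rather than failing. Combined with Host-020 (this is the
only mechanism that pins the floor), a misspelt value is invisible until someone
wonders why the level change "didn't take". The helper is small and could
either fail fast in `StartupValidator` or emit a console warning before the
logger is configured.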
**Recommendation**
In `LoggerConfigurationFactory.Build`, when `loggingOptions.MinimumLevel` is
non-null/non-blank but does not parse to a valid `LogEventLevel`, write a
`Console.Error.WriteLine` warning (the logger is not yet built) and proceed
with `Information`. Alternatively, validate the value in `StartupValidator`
and fail fast — that matches the pattern used for other ScadaLink
configuration keys. Add a `LoggerConfigurationTests` case asserting the
behaviour you choose.
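The warn-and-fall-back option can be sketched as follows. The helper matches the configured value against the enum's member names instead of calling `Enum.TryParse`, so numeric and comma-separated inputs are rejected as well as typos. `LevelParsing` and `LevelStandIn` are hypothetical. `LevelStandIn` mirrors the member names of Serilog's `LogEventLevel` so the sketch compiles without the Serilog package.

```csharp
using System;
using System.IO;

// Same member names as Serilog.Events.LogEventLevel (stand-in for the sketch).
public enum LevelStandIn { Verbose, Debug, Information, Warning, Error, Fatal }

public static class LevelParsing
{
    public static TLevel ParseLevel<TLevel>(string? configured, TLevel fallback, TextWriter warnings)
        where TLevel : struct, Enum
    {
        // Not configured at all: the default is expected, so stay quiet.
        if (string.IsNullOrWhiteSpace(configured)) return fallback;

        // Accept only a single defined member name (case-insensitive);
        // Enum.TryParse would also accept "42" or "Verbose,Debug".
        var candidate = configured.Trim();
        foreach (var name in Enum.GetNames<TLevel>())
        {
            if (string.Equals(name, candidate, StringComparison.OrdinalIgnoreCase))
            {
                return Enum.Parse<TLevel>(name);
            }
        }

        // The logger does not exist yet, so report on the supplied writer.
        warnings.WriteLine(
            $"warning: unrecognised ScadaLink:Logging:MinimumLevel '{configured}'; using {fallback}.");
        return fallback;
    }
}
```

In `LoggerConfigurationFactory.Build`, the writer would be `Console.Error`, since the Serilog logger has not been built at that point.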
**Resolution**
_Open._