fix(host): resolve Host-012..015 — consume DownIfAlone in HOCON, sub-second timing precision, config-driven Serilog sinks, transient-only startup retry

This commit is contained in:
Joseph Doherty
2026-05-17 03:18:33 -04:00
parent eae4077414
commit aca65e85bb
9 changed files with 395 additions and 33 deletions

View File

@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
| Open findings | 0 |
## Summary
@@ -585,7 +585,7 @@ key). Full Host suite green (175 passed).
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` |
**Description**
@@ -618,7 +618,15 @@ declared-but-dead.
**Resolution**
_Unresolved._
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed against
`AkkaHostedService.cs`: `BuildHocon` emitted `keep-oldest { down-if-alone = on }` as
a literal, and `ClusterOptions.DownIfAlone` was referenced nowhere outside its own
declaration, so the bound option was dead. Fix: `BuildHocon` now emits
`down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}`, consuming the bound
property. Regression tests in `HoconBuilderTests`: `BuildHocon_DownIfAloneTrue_EmitsOn`
and `BuildHocon_DownIfAloneFalse_EmitsOff` parse the resulting HOCON and assert the
`keep-oldest.down-if-alone` token tracks the property; the `false` case failed before
the fix (token was still `on`) and passes after. Full Host suite green (182 passed).
### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration
@@ -626,7 +634,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |
**Description**
@@ -653,7 +661,17 @@ number of seconds and fail fast otherwise. Either way, do not silently round.
**Resolution**
_Unresolved._
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed: `BuildHocon`
rendered every duration via `TotalSeconds:F0`, rounding sub-second values — a 750ms
heartbeat became `1s` and a 500ms value would collapse to a degenerate `0s`. Fix:
durations are now rendered by a new `DurationHocon` helper that emits whole
milliseconds with an `ms` suffix (Akka's HOCON parser accepts it), so sub-second
configuration survives exactly; `BuildHocon` takes the transport heartbeat/failure
as `TimeSpan` rather than pre-rounded `double` seconds. Regression test
`HoconBuilderTests.BuildHocon_SubSecondTimings_PreservedWithoutRounding` configures
750ms/2600ms/500ms/2.5s/7.5s timings and asserts each round-trips through the parsed
HOCON unchanged; it failed before the fix (750ms parsed as `00:00:01`) and passes
after. Full Host suite green (182 passed).
### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)
@@ -661,7 +679,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` |
**Description**
@@ -690,7 +708,22 @@ state the sinks are intentionally code-defined — but the design doc currently
**Resolution**
_Unresolved._
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed: `LoggerConfigurationFactory.Build`
called `ReadFrom.Configuration` but neither `appsettings.json` nor either
role-specific file contained a `Serilog` section, so the two sinks that actually ran
were hard-coded `.WriteTo.Console(...)` / `.WriteTo.File(...)` calls in `Program.cs`
— uneditable without recompiling, contrary to REQ-HOST-8. Fix: added a `Serilog`
section to `appsettings.json` with `WriteTo` entries for the Console and File sinks
(carrying the output template, file path `logs/scadalink-.log` and `Day` rolling
interval as args) and removed the hard-coded `.WriteTo` calls from `Program.cs`; the
factory's `ReadFrom.Configuration` (backed by the transitive
`Serilog.Settings.Configuration` package) now drives the sinks. Regression tests in
new `SerilogSinkConfigTests`: `ShippedAppSettings_HasSerilogSection_WithConsoleAndFileSinks`
asserts the shipped `appsettings.json` declares both sinks in a `Serilog:WriteTo`
array (fails against the pre-fix file, which had no `Serilog` section), and
`LoggerConfigurationFactory_AppliesConfiguredFileSink` proves a configuration-supplied
File sink reaches the built logger and writes to the configured path. Full Host suite
green (182 passed).
### Host-015 — `StartupRetry` retries on every exception type, including permanent failures
@@ -698,7 +731,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` |
**Description**
@@ -727,4 +760,20 @@ single attempt.
**Resolution**
_Unresolved._
Resolved 2026-05-17 (commit `<pending>`). Root cause confirmed: `ExecuteWithRetryAsync`
caught `Exception` with only the `when (attempt < maxAttempts)` guard, so it retried
any exception — including a permanent schema-version mismatch raised by the
migration call site, delaying the inevitable fatal exit by ~2 minutes. Fix: added an
optional `Func<Exception, bool> isTransient` parameter (default treats all
exceptions as transient, preserving prior behaviour for unclassified callers); the
catch guard is now `when (attempt < maxAttempts && isTransient(ex))`, so a
non-transient fault propagates after a single attempt. Added a
`StartupRetry.IsTransientDatabaseFault` classifier that returns `true` only for
connection-class faults (`TimeoutException`, `SocketException`, `SqlException` /
`DbException` by type-name match) and `false` for everything else — notably
schema-validation `InvalidOperationException`s — and the `Program.cs` migration call
site now passes it. Regression tests in `StartupRetryTests`:
`ExecuteWithRetry_NonTransientFailure_RethrowsAfterSingleAttempt` (runs exactly once
when `isTransient` returns false) and `ExecuteWithRetry_TransientThenPermanent_StopsAtPermanent`
(retries a `TimeoutException` then stops at a permanent `InvalidOperationException`).
Full Host suite green (182 passed).