fix(api-surface): close Theme 9 — 27 naming / dead-code / config / hygiene findings

The largest themed batch — small mechanical fixes across 11 modules.

API / message hygiene:
- Comm-020: SiteAddressCacheLoaded now carries IReadOnlyDictionary /
  IReadOnlyList — Akka messages must be immutable.
- Commons-016: BundleSession.MaxUnlockAttempts named constant replaces
  magic 3.
- Commons-018: IOperationTrackingStore + IPartitionMaintenance moved from
  Interfaces/ root to Interfaces/Services/ (namespace preserved — 9
  consumers exceeded the in-prompt move threshold).
- Commons-023: TrackingStatusSnapshot.SourceNode now consistent with the
  trailing-optional-with-default pattern used elsewhere.
- SR-022: AuditingDbCommand.DbConnection.set no longer uses reflection —
  exposes AuditingDbConnection.Inner via internal API surface.

Dead code / config cleanup:
- ClusterInfra-011: decorative SectionName constant deleted.
- ClusterInfra-014: dead AddClusterInfrastructureActors method + its
  "throws-when-called" test deleted.
- Host-021: Microsoft Logging:LogLevel block deleted from appsettings.json
  (dead under Serilog).

Fail-loud over fail-silent:
- DM-021: ResolveSiteIdentifierAsync throws on missing site (was silently
  substituting a DB id).
- DM-022: dropped transient Pending write — record now lands directly in
  InProgress (no UI flicker, one fewer DB write).
- Host-020: LoggerConfigurationFactory emits a Console.Error warning when
  both Serilog:MinimumLevel and ScadaLink:Logging:MinimumLevel are set
  (ScadaLink remains truth per Host-011).
- SnF-022: NotifyCachedCallObserverAsync logs Warning on unparseable
  TrackedOperationId (was silently dropping).
- SnF-023: empty siteId default replaced with $unknown-site sentinel
  + constructor normalisation.

Correctness:
- SCA-001: SupervisorStrategy XML rewritten to match actual
  DefaultDecider/Restart semantics (was claiming Resume).
- SCA-003: OnUpsertAsync now restamps IngestedAtUtc on every upsert.
- SR-021: HandleDeployArtifacts now dispatches an internal
  ApplyArtifactDataConnectionsToDcl message after the SQLite write so
  system-wide artifact-deploy data-connection changes go live
  immediately (was requiring a site restart).
- SnF-020: RetryParkedMessageAsync captures the parked row BEFORE the
  local write so a concurrent delete can't skip standby replication.

Sentinels / naming collisions:
- HM-021: CentralSiteId changed from "central" to "$central"
  (uncollideable — leading $ is forbidden in real SiteIdentifiers).

Doc / surface cleanups:
- SEL-018: FailedWriteCount promoted to ISiteEventLogger; XML softened
  to "Available for future Health Monitoring integration".
- SnF-019: VERIFY outcome — documented parking-after-DefaultMaxRetries
  in Component-StoreAndForward.md + DefaultMaxRetries XML (uniform
  cap; maxRetries:0 is the unbounded escape hatch).
- SnF-021: Component-StoreAndForward.md no longer claims the tracking
  table lives in SnF — it's in SiteRuntime, the interface is in Commons.
- CLI-020: bundle export response parse guarded with try/catch on
  JsonException / KeyNotFoundException / FormatException — emits a
  clean INVALID_RESPONSE exit instead of a stack trace.

Config:
- ClusterInfra-013: intent comment added to "catastrophic config" test.
- Host-016: appsettings.Site.json second CentralContactPoints entry
  removed (was pointing at the SITE's own port); doc-key explains how
  to extend.
- Host-018: NodeName added to both shipped per-role configs (was
  causing SourceNode to be null on audit rows).

UI:
- CentralUI-029: replaced JS.InvokeAsync<int>("eval", …) with an ES
  module import (new wwwroot/js/browser-time.js).
- CentralUI-032: AuditResultsGrid gains a Previous button backed by a
  cursor stack.

10+ new regression tests across the affected projects. Build clean;
all suites green. README regenerated: 6 open (was 33).

Session-to-date: 130 of 136 originally-open Theme findings closed.
Joseph Doherty
2026-05-28 08:39:01 -04:00
parent d190345ef0
commit 77cb0ad0e2
46 changed files with 966 additions and 278 deletions
+63 -21
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 5 (3 Deferred: 002, 011, 012; 5 new Open from Re-review 2026-05-28) |
| Open findings | 0 (3 Deferred: 002, 011, 012; all 5 Open from Re-review 2026-05-28 resolved 2026-05-28) |
## Summary
@@ -1067,7 +1067,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:229`, `:407`–`:437`; `src/ScadaLink.StoreAndForward/StoreAndForwardOptions.cs:18`; `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:1773`–`:1778`; `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:149`–`:156` |
**Description**
@@ -1121,9 +1121,21 @@ the field value) so the invariant is enforced at the single chokepoint rather th
relying on every caller to pass the right value — this also fixes the legacy
`NotificationDeliveryService` path without editing the consumer.
**Resolution**
_Unresolved._
**Resolution (2026-05-28):**
VERIFY outcome — the design doc's "Notifications do not park" wording (lines 47, 59)
was the *operational intent* for the happy path, not an absolute invariant: the engine
has always enforced `DefaultMaxRetries` uniformly across every category, and every
sibling system (ESG, CachedDbWrite) bounds retries and then parks for the same disk-pressure
and operator-visibility reasons. Removing the cap for notifications would let a single
unreachable central exhaust local disk via an unbounded buffer — worse than the
documented "park after retry budget" behaviour. Resolution is therefore the brief's
**default**: document the parking behaviour. Updated
`Component-StoreAndForward.md` lines 46/58 to clarify that the `DefaultMaxRetries` cap
applies uniformly (including to notifications) and that `maxRetries: 0` is the explicit
escape hatch for callers that need unbounded retry. Added a `StoreAndForward-019` block
to `StoreAndForwardOptions.DefaultMaxRetries`'s XML doc explaining the same invariant.
No behavioural code change — existing tests (104 in
`ScadaLink.StoreAndForward.Tests`) continue to pass.
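The documented rule can be illustrated with a minimal sketch. It is not the engine's code: `RetryPolicySketch`, `ShouldPark`, the value `5`, and the `>=` comparison are all assumptions made for illustration. Only the two behaviours described above are taken from the resolution: a uniform cap, and `maxRetries: 0` meaning unbounded retry.

```csharp
using System;

// Hypothetical helper; the type and method names are illustrative only.
public static class RetryPolicySketch
{
    // Assumed value for the sketch; the real default is defined on StoreAndForwardOptions.
    public const int DefaultMaxRetries = 5;

    // Returns true when a message has used up its retry budget and should be parked.
    // A null maxRetries falls back to the uniform default; 0 disables the cap.
    public static bool ShouldPark(int retryCount, int? maxRetries = null)
    {
        int cap = maxRetries ?? DefaultMaxRetries;
        if (cap == 0)
        {
            return false; // escape hatch: retry without bound, never park
        }
        return retryCount >= cap; // bounded: park once the budget is spent (assumed comparison)
    }
}
```

Under these assumptions a notification is treated exactly like any other category: it parks once the budget is spent unless the caller explicitly passes `0`.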
### StoreAndForward-020 — `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load
@@ -1131,7 +1143,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:599`–`:616` |
**Description**
@@ -1209,9 +1221,16 @@ Add a regression test in `StoreAndForwardReplicationTests` that simulates the
delete-between-update-and-reload race and asserts the `Requeue` replication
operation is still emitted with the correct category.
**Resolution**
_Unresolved._
**Resolution (2026-05-28):**
Applied the brief's primary recommendation — `RetryParkedMessageAsync` now captures
the parked row up front via `GetMessageByIdAsync` (and rejects the call early if the
row is missing or no longer `Parked`), then performs the local `RetryParkedMessageAsync`
storage write, and finally reconstructs the post-requeue state on the captured POCO
(`Status = Pending, RetryCount = 0, LastError = null, LastAttemptAt = null`) and
replicates it. A concurrent `RemoveMessageAsync` or `DiscardParkedMessageAsync` running
between the local write and the original re-load can no longer skip replication — the
row is in hand. The category-fallback mislabelling on the racy path is gone because
the activity log uses the captured `Category` directly.
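The capture-before-write ordering can be sketched with an in-memory stand-in. Every type below (`SketchMessage`, `RequeueSketch`, the `AfterLocalWrite` hook) is hypothetical and synchronous for brevity; the real method is asynchronous and goes through the storage layer. The sketch only shows why replicating from the captured copy survives a delete that lands after the local write.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins, not the ScadaLink types.
public enum SketchStatus { Pending, Parked }

public sealed class SketchMessage
{
    public string Id { get; set; } = "";
    public string Category { get; set; } = "";
    public SketchStatus Status { get; set; }
    public int RetryCount { get; set; }
    public string? LastError { get; set; }
}

public sealed class RequeueSketch
{
    public Dictionary<string, SketchMessage> Store { get; } = new();
    public List<SketchMessage> Replicated { get; } = new();

    // Hook used to simulate a concurrent delete landing right after the local write.
    public Action? AfterLocalWrite { get; set; }

    public bool RetryParked(string id)
    {
        // 1. Capture a copy of the row up front; reject if it is missing or not parked.
        if (!Store.TryGetValue(id, out var row) || row.Status != SketchStatus.Parked)
        {
            return false;
        }
        var captured = new SketchMessage
        {
            Id = row.Id,
            Category = row.Category,
            Status = row.Status,
            RetryCount = row.RetryCount,
            LastError = row.LastError,
        };

        // 2. Perform the local write.
        row.Status = SketchStatus.Pending;
        row.RetryCount = 0;
        row.LastError = null;
        AfterLocalWrite?.Invoke();

        // 3. Rebuild the post-requeue state on the captured copy and replicate it.
        //    The store is not read again, so a concurrent delete cannot skip this step.
        captured.Status = SketchStatus.Pending;
        captured.RetryCount = 0;
        captured.LastError = null;
        Replicated.Add(captured);
        return true;
    }
}
```

Setting `AfterLocalWrite` to remove the row from `Store` reproduces the race in miniature: the replicated copy still carries the original `Category`.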
### StoreAndForward-021 — Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime
@@ -1219,7 +1238,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Status | Resolved |
| Location | `docs/requirements/Component-StoreAndForward.md:21`, `:49`–`:51`, `:77`–`:87`, `:108`, `:114`; `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:37`; `src/ScadaLink.StoreAndForward/` (whole module) |
**Description**
@@ -1274,9 +1293,18 @@ several refactors out of date. The hierarchical map should be:
- `Component-SiteCallAudit.md` / `Component-AuditLog.md` → telemetry emission +
central-side mirror.
**Resolution**
_Unresolved._
**Resolution (2026-05-28):**
Doc-side fix applied (per the brief, the simpler of the two options). Updated
`Component-StoreAndForward.md`: (1) removed the "Maintain a site-local operation
tracking table" line from Responsibilities and reworded the cached-call telemetry
responsibility to point at the `ICachedCallLifecycleObserver` hook; (2) renamed the
"Operation Tracking Table" section to "Operation Tracking Table (lives in Site
Runtime, not here)" with an explicit `StoreAndForward-021` callout cross-linking to
`Component-SiteRuntime.md` and the `IOperationTrackingStore` interface in
Commons. The rest of the section is retained for cross-component context (the
buffered cached-call rows carry `TrackedOperationId` so the link to the tracking row
must still be documented somewhere) but is reworded to make clear the table itself is
not owned here.
### StoreAndForward-022 — `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId`
@@ -1284,7 +1312,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:484`–`:515` |
**Description**
@@ -1333,9 +1361,14 @@ contract — the existing
the fix is "log + skip", that test should be updated to also assert the log emission;
if the fix is "emit anyway", the test should be replaced.
**Resolution**
_Unresolved._
**Resolution (2026-05-28):**
Applied the brief's "cheap fix" — the non-GUID skip path now logs a Warning naming
the offending `MessageId`, `Category` and `Outcome` before returning, so a
misconfigured caller is observable instead of silently bypassing the audit pipeline.
S&F retry bookkeeping remains untouched (the observer is still best-effort, the skip
still returns without throwing). The existing
`Attempt_MessageIdNotAGuid_NoObserverNotification` test still passes — its assertion
is on `_observer.Notifications` being empty, which is unchanged.
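A minimal sketch of the "log + skip" shape, under stated assumptions: `ObserverGuardSketch` and `TryGetTrackedId` are hypothetical names, and an `Action<string>` stands in for the logger so the example stays self-contained. The actual method presumably logs through the service's own logger, and the message wording here is invented.

```csharp
using System;

// Hypothetical guard; names and message wording are illustrative only.
public static class ObserverGuardSketch
{
    // Returns true with the parsed id when the observer should be notified.
    // Otherwise emits one warning naming the offending values and returns false,
    // so the caller can skip the notification without throwing.
    public static bool TryGetTrackedId(
        string messageId,
        string category,
        string outcome,
        Action<string> logWarning,
        out Guid trackedId)
    {
        if (Guid.TryParse(messageId, out trackedId))
        {
            return true;
        }

        logWarning(
            $"Skipping cached-call observer: MessageId '{messageId}' is not a GUID " +
            $"(Category={category}, Outcome={outcome}).");
        return false;
    }
}
```

The skip stays best-effort: retry bookkeeping never depends on the return value, and the only new side effect is the warning.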
### StoreAndForward-023 — `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation
@@ -1343,7 +1376,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/ServiceCollectionExtensions.cs:43`–`:53`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:99`, `:524` |
**Description**
@@ -1383,9 +1416,18 @@ absent (no `AddAuditLog`), keep the empty-string default since `_siteId` is unus
Alternatively, change `siteId` from a parameter to a `Func<string>` resolved lazily
from the service provider so a late-registered context still takes effect.
**Resolution**
_Unresolved._
**Resolution (2026-05-28):**
Applied the brief's sentinel option (less invasive than throwing — preserves the
existing test wiring that constructs `StoreAndForwardService` without a site context).
Introduced `StoreAndForwardService.UnknownSiteSentinel = "$unknown-site"` (leading
`$` chosen so it cannot collide with a real site id) and the constructor now
normalises any null/empty/whitespace `siteId` argument to that sentinel. The empty
string can no longer reach `CachedCallAttemptContext.SourceSite`; a misconfigured
host without an `IStoreAndForwardSiteContext` produces audit rows tagged with the
sentinel — recognisably bad in the central audit log instead of silently merging
into the empty bucket. All 104 existing tests pass; the only test that asserts a
literal `SourceSite` (`CachedCallAttemptEmissionTests`) supplies `"site-77"` so the
normalisation is a no-op there.
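The constructor normalisation can be sketched as follows. `SiteIdHolderSketch` is a hypothetical stand-in for the service; only the sentinel value `"$unknown-site"` and the null/empty/whitespace rule come from the resolution above.

```csharp
using System;

// Hypothetical stand-in for the service; only the sentinel value is taken from the resolution.
public sealed class SiteIdHolderSketch
{
    // The leading '$' keeps the sentinel from colliding with a real site id.
    public const string UnknownSiteSentinel = "$unknown-site";

    public string SiteId { get; }

    public SiteIdHolderSketch(string? siteId)
    {
        // Null, empty, and whitespace-only values all collapse to the sentinel;
        // any other value passes through unchanged.
        SiteId = string.IsNullOrWhiteSpace(siteId) ? UnknownSiteSentinel : siteId;
    }
}
```

Normalising once in the constructor means every downstream consumer sees either a real id or the sentinel, never the empty string.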
### StoreAndForward-024 — `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown