docs(code-reviews): re-review batch 2 at 39d737e — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring

17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.
2026-05-17 00:45:10 -04:00
parent e49846603e
commit 89636e2bbf
6 changed files with 895 additions and 64 deletions
--- a/code-reviews/ConfigurationDatabase/findings.md
+++ b/code-reviews/ConfigurationDatabase/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.ConfigurationDatabase` |
 | Design doc | `docs/requirements/Component-ConfigurationDatabase.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
+| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Open findings | 3 |
 ## Summary
@@ -37,6 +37,28 @@ repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
 `ExternalSystemRepository`, `InboundApiRepository`, `NotificationRepository`,
 `SiteRepository`, `InstanceLocator`) have little or no direct coverage.
 #### Re-review 2026-05-17 (commit `39d737e`)
 Re-reviewed the module at commit `39d737e`. All eleven findings from the initial
 review (`9c60592`) remain `Resolved` — the secret-column encryption
 (`EncryptedStringConverter` + `ApplySecretColumnEncryption`), the fail-fast no-arg
 DI overload, the `AsSplitQuery` conversions, the audit cycle-safe serializer, and the
 added repository test coverage all verified present and consistent with their
 resolutions. Three new findings were recorded. The most material is that the
 encryption work done for CD-004 left one bearer credential out of scope:
 `ApiKey.KeyValue` — the inbound-API authentication secret — is still persisted in
 plaintext (`ConfigurationDatabase-012`); it cannot use the same non-deterministic
 Data Protection converter because authentication looks the key up *by value*, so it
 needs a hash-based scheme instead. The second is a resilience gap in the encryption
 plumbing itself: `ApplySecretColumnEncryption` silently substitutes a throwaway
 `EphemeralDataProtectionProvider` whenever no provider is supplied, so any context
 constructed via the single-argument constructor on a write path would encrypt
 secrets with a key discarded at process exit, yielding permanently undecryptable
 ciphertext with no error (`ConfigurationDatabase-013`). The third is a minor
 inconsistency — a redundant cast on one of the three `HasConversion` calls
 (`ConfigurationDatabase-014`). The module is otherwise healthy and the prior fixes
 hold up well.
 ## Checklist coverage
 | # | Category | Examined | Notes |
@@ -44,11 +66,11 @@ repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
 | 1 | Correctness & logic bugs | ✓ | `GetTemplateWithChildrenAsync` discards loaded children (CD-001); `GetApprovedKeysForMethodAsync` CSV parsing is brittle (CD-008). |
 | 2 | Akka.NET conventions | ✓ | No actors in this module; data-access layer only. No issues found. |
 | 3 | Concurrency & thread safety | ✓ | DbContext correctly scoped; optimistic concurrency on `DeploymentRecord` correct. Repositories hold no shared mutable state. No issues found. |
-| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003); audit JSON serialization failure handling (CD-007). |
+| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003, resolved); audit JSON serialization failure handling (CD-007, resolved). Re-review: ephemeral Data Protection fallback can silently produce undecryptable ciphertext (CD-013). |
-| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002); SMTP/DB-connection/auth secrets stored unencrypted (CD-004). |
+| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002, resolved); SMTP/DB-connection/auth secrets unencrypted (CD-004, resolved). Re-review: `ApiKey.KeyValue` bearer credential still stored in plaintext (CD-012). |
-| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009). No N+1 in audited paths. |
+| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009, resolved). No N+1 in audited paths. Re-review: no new issues. |
-| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005); seed data uses `HasData` consistent with design. |
+| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005, resolved); seed data uses `HasData` consistent with design. Re-review: no new issues. |
-| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006); inconsistent null-guard on injected context (CD-011). |
+| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006, resolved); inconsistent null-guard on injected context (CD-011, resolved). Re-review: redundant/inconsistent cast on one `HasConversion` call (CD-014). |
 | 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). |
 | 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. |
@@ -570,3 +592,128 @@ every data-access type behaves uniformly and a hand-constructed instance fails w
 informative exception at construction rather than a later `NullReferenceException`.
 Regression: `Constructor_NullContext_Throws` tests were added for all four affected types
 (`InboundApiRepositoryTests.cs`, `RepositoryCoverageTests.cs`).
 ### ConfigurationDatabase-012 — Inbound-API `ApiKey.KeyValue` bearer credential stored in plaintext
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Security |
 | Status | Open |
 | Location | `src/ScadaLink.ConfigurationDatabase/Configurations/InboundApiConfiguration.cs:17-19` |
 **Description**
 `ApiKey.KeyValue` is the bearer credential presented in the `X-API-Key` header to
 authenticate Inbound API requests (HighLevelReqs §7.2–7.3). It is mapped as an
 ordinary `nvarchar(500)` column with a unique index and persisted verbatim. Anyone
 with read access to the configuration database — or to any `AuditLogEntry.AfterStateJson`
 into which an `ApiKey` entity is serialized — obtains live API credentials in cleartext.
 `ConfigurationDatabase-004` introduced encryption-at-rest for the other secret-bearing
 columns (SMTP credentials, external-system auth config, database connection strings)
 but explicitly scoped `ApiKey.KeyValue` out. The omission is genuine: the
 `EncryptedStringConverter` built for CD-004 is backed by ASP.NET Data Protection, which
 is **non-deterministic** — the same plaintext encrypts to different ciphertext each
 time — so it cannot be applied here, because `GetApprovedKeysForMethodAsync` and the
 authentication path resolve a key by its value (`GetApiKeyByValueAsync` does
 `FirstOrDefaultAsync(k => k.KeyValue == keyValue)`). A non-deterministic converter would
 break that equality lookup. The result is that the one credential most exposed to
 external callers is the one credential left unprotected.
 **Recommendation**
 Store a salted cryptographic hash of the key value instead of the plaintext (or
 ciphertext): hash on create, and authenticate by hashing the presented key and
 comparing. This keeps the equality lookup working (the hash is deterministic) while
 ensuring the database never holds a usable credential. The plaintext key would then
 only ever be shown once, at creation time, to the Admin who created it. This requires
 a coordinated change with the Inbound API / Security components and the `ApiKey`
 entity in Commons; record the chosen scheme in
 `docs/requirements/Component-ConfigurationDatabase.md` and the Inbound API design doc.
 At minimum, ensure `ApiKey` entities are never passed to `IAuditService` without the
 key value redacted.
 **Resolution**
 _Unresolved._
 ### ConfigurationDatabase-013 — Secret-column encryption silently falls back to an ephemeral (throwaway) key
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Error handling & resilience |
 | Status | Open |
 | Location | `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs:107-124` |
 **Description**
 `ApplySecretColumnEncryption` resolves the Data Protection provider as
 `_dataProtectionProvider ?? new EphemeralDataProtectionProvider()`. The `??` fallback
 is reached whenever the context is constructed via the single-argument
 `ScadaLinkDbContext(DbContextOptions)` constructor — i.e. whenever no provider was
 injected. An `EphemeralDataProtectionProvider` generates a key ring that lives only in
 process memory and is discarded at process exit.
 For design-time `dotnet ef` tooling this is harmless (the XML remark correctly notes
 it only emits schema). The risk is on a *runtime write path*. The runtime currently
 gets the provider-bearing context only because `AddConfigurationDatabase` adds an
 `AddScoped` factory registration that overrides EF's activator-based registration.
 That override is the single thing standing between correct behaviour and silent data
 corruption: any future change that resolves a `ScadaLinkDbContext` through a path the
 override does not cover — an `AddPooledDbContextFactory`/`IDbContextFactory<ScadaLinkDbContext>`
 registration, a second `AddDbContext` call, a hand-constructed context in server code —
 would construct the context with the single-arg constructor, encrypt secret columns
 with a throwaway key, and persist ciphertext that becomes **permanently undecryptable
 the moment the process restarts**. There is no exception, no warning; the failure only
 surfaces later as `CryptographicException` on read (mis-attributed by
 `EncryptedStringConverter` to "the stored value was not written by this system").
 **Recommendation**
 Do not silently substitute an ephemeral provider for write-capable contexts. Either:
 (a) require the provider unconditionally and have design-time tooling pass an explicit
 ephemeral provider so the intent is visible at the call site; or (b) keep the
 single-arg constructor but mark contexts built without a real provider as
 schema-only — e.g. record a flag and have the encrypting converter throw a clear
 `InvalidOperationException` ("secret columns cannot be written without a configured
 Data Protection key ring") on the first `Protect`, instead of producing throwaway
 ciphertext. Also harden the DI wiring so a `ScadaLinkDbContext` cannot be resolved
 through the EF-activator registration at all (e.g. register only the factory, or use
 `AddDbContextFactory` with the explicit constructor).
 **Resolution**
 _Unresolved._
 ### ConfigurationDatabase-014 — Redundant, inconsistent cast on one `HasConversion` call
 | | |
 |--|--|
 | Severity | Low |
 | Category | Code organization & conventions |
 | Status | Open |
 | Location | `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs:121-123` |
 **Description**
 `ApplySecretColumnEncryption` calls `HasConversion(converter)` three times. The first
 two (`SmtpConfiguration.Credentials`, `ExternalSystemDefinition.AuthConfiguration`)
 pass `converter` directly; the third (`DatabaseConnectionDefinition.ConnectionString`)
 casts it to `(Microsoft.EntityFrameworkCore.Storage.ValueConversion.ValueConverter)`.
 `EncryptedStringConverter` already derives from `ValueConverter<string?, string?>`
 (itself a `ValueConverter`), and the first two call sites compile fine without the
 cast, so the cast is redundant. The inconsistency makes the code look as though the
 third call needs special handling when it does not, and the fully-qualified type name
 inline adds noise.
 **Recommendation**
 Remove the cast so all three calls read identically as `HasConversion(converter)`.
 If a `ValueConverter`-typed reference is genuinely wanted, give it a local variable of
 that type once and use it for all three.
 **Resolution**
 _Unresolved._
--- a/code-reviews/DataConnectionLayer/findings.md
+++ b/code-reviews/DataConnectionLayer/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.DataConnectionLayer` |
 | Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
+| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Open findings | 4 |
 ## Summary
@@ -30,20 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
 heuristic. Test coverage is adequate for the happy paths and failover but absent for
 tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
 #### Re-review 2026-05-17 (commit `39d737e`)
 All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
 verified in place against the current source (`PipeTo(Self)` subscribe pattern,
 `Resume` supervision, `ConcurrentDictionary` callback maps, atomic disconnect guards,
 bounded write timeout, etc.). The re-review walked all 10 checklist categories again
 and found **4 new findings**: one **High** — the DCL-012 security warning is never
 seen in production because `RealOpcUaClientFactory.Create()` constructs
 `RealOpcUaClient` with no logger, so the warning sinks into `NullLogger`; one
 **Medium** — initial-connect failures in the `Connecting` state never count toward
 failover, so a connection whose primary endpoint is unreachable at startup retries the
 primary forever and never tries the configured backup; one **Medium** —
 `HandleSubscribeCompleted` always replies `SubscribeTagsResponse(success: true)` even
 when a connection-level subscribe failure is driving the actor into `Reconnecting`,
 telling the Instance Actor the subscribe succeeded when it did not; and one **Low** —
 `WriteBatchAsync` does not catch the `InvalidOperationException` from `EnsureConnected`,
 so a mid-batch disconnect aborts the whole write batch (the same class of defect
 DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
 `DataConnectionLayer-014`.
 ## Checklist coverage
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
+| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — `SubscribeTagsResponse` reports success on a connection-level subscribe failure. |
-| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
+| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (`PipeTo(Self)` subscribe, `Resume` supervision). Re-review: no new issues. |
-| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
+| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (`ConcurrentDictionary`, atomic disconnect guards). Re-review: no new issues. |
-| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
+| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — `WriteBatchAsync` aborts on mid-batch disconnect. |
-| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
+| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (`RealOpcUaClient` built without a logger). |
-| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
+| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
-| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
+| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
 | 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
-| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
+| 9 | Testing coverage | x | DCL001–013 regression tests present. Re-review: gaps remain for finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
-| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
+| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |
 ## Findings
@@ -661,3 +681,179 @@ fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost`
 simultaneously, and asserts `Disconnected` fires exactly once per round; against a
 non-atomic check-then-set it double-fires (verified by temporarily reverting the
 guard), and it passes against the atomic fix.
 ### DataConnectionLayer-014 — DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger
 | | |
 |--|--|
 | Severity | High |
 | Category | Security |
 | Status | Open |
 | Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83` |
 **Description**
 Finding DataConnectionLayer-012 was resolved in part by adding a prominent
 `ILogger` warning in `RealOpcUaClient.ConnectAsync` whenever the auto-accept
 certificate validator is installed (`RealOpcUaClient.cs:79-83`). The
 `ILogger<RealOpcUaClient>` constructor parameter was made optional, defaulting to
 `NullLogger<RealOpcUaClient>.Instance` (`RealOpcUaClient.cs:35-39`).
 However, the only production code path that constructs a `RealOpcUaClient` is
 `RealOpcUaClientFactory.Create()` (`RealOpcUaClient.cs:325`), which calls
 `new RealOpcUaClient(_globalOptions)` and passes **no logger**. The factory itself
 holds only an `OpcUaGlobalOptions` and has no `ILoggerFactory`/`ILogger` available.
 As a result the `_logger` field is always `NullLogger` for every real OPC UA
 connection, and the man-in-the-middle warning the DCL-012 fix added is silently
 discarded. An operator who deploys a connection with `AutoAcceptUntrustedCerts`
 enabled — accepting any server certificate on an industrial control link — gets no
 visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
 therefore not actually effective in production; only the unit test
 (`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse`, which only
 checks the default value) passes.
 **Recommendation**
 Thread a real logger through to `RealOpcUaClient`. `DataConnectionFactory` already
 holds an `ILoggerFactory` and constructs `RealOpcUaClientFactory(globalOptions)` —
 give `RealOpcUaClientFactory` an `ILoggerFactory` (or an `ILogger<RealOpcUaClient>`)
 constructor parameter and pass `_loggerFactory.CreateLogger<RealOpcUaClient>()` into
 each `new RealOpcUaClient(...)`. Add a test that asserts the warning is emitted on a
 real connect with auto-accept enabled (e.g. via a captured `ILogger`), not just that
 the default is `false`.
 **Resolution**
 _Unresolved._
 ### DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Error handling & resilience |
 | Status | Open |
 | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493` |
 **Description**
 Failover between the primary and backup endpoints is implemented in two places, both
 reachable only after a connection has already been `Connected` at least once:
 `HandleReconnectResult` (Reconnecting state) counts `_consecutiveFailures` and switches
 endpoint, and `BecomeReconnecting` counts `_consecutiveUnstableDisconnects`.
 `HandleConnectResult` — the handler for the *initial* connection attempt in the
 `Connecting` state (`DataConnectionActor.cs:404-417`) — does neither. On failure it
 only logs and re-arms the reconnect timer with `AttemptConnect`; it never increments
 `_consecutiveFailures`, never consults `_backupConfig`, and never switches endpoint.
 Consequence: if the primary endpoint is unreachable when the connection actor first
 starts — which is the common case after a fresh artifact deployment, a site restart,
 or a primary that is simply down at that moment — the actor retries the *primary*
 endpoint indefinitely at `ReconnectInterval` and **never** attempts the configured
 backup. The design doc's endpoint-redundancy promise ("automatic failover when the
 active endpoint becomes unreachable") is silently not honoured for the
 never-connected-yet case, and an operator sees a connection stuck `Connecting` forever
 despite a healthy backup being configured.
 **Recommendation**
 Make `HandleConnectResult` participate in the failover counter the same way
 `HandleReconnectResult` does: increment `_consecutiveFailures` on failure and, when
 `_backupConfig != null && _consecutiveFailures >= _failoverRetryCount`, perform the
 endpoint switch (dispose adapter, create the other adapter, bump `_adapterGeneration`,
 log the failover event) before re-arming the timer. Alternatively, fold the initial
 connect into the same reconnect path so there is a single failover decision point. Add
 a regression test for "primary down at startup, backup configured → fails over to
 backup".
 **Resolution**
 _Unresolved._
 ### DataConnectionLayer-016 — `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
 | Status | Open |
 | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240` |
 **Description**
 `HandleSubscribeCompleted` computes `connectionLevelFailure` (line 606) and returns it
 so the `Connected`-state handler can drive the actor into `Reconnecting`
 (`DataConnectionActor.cs:232-240`). But before returning, it unconditionally replies
 to the caller with `new SubscribeTagsResponse(..., true, null, ...)` (lines 666-667) —
 `Success: true`, `Error: null` — regardless of whether any tag failed at connection
 level.
 So when a subscribe arrives while the adapter is silently down, the Instance Actor is
 told the subscribe **succeeded**, while the connection actor simultaneously transitions
 to `Reconnecting`. The tags were never actually subscribed at the adapter (the catch
 block recorded `Success: false`); they are recovered later by `ReSubscribeAll` only if
 and when reconnection succeeds. The caller has no way to distinguish "subscribed and
 healthy" from "accepted, but the connection is currently down" — a misleading
 success signal on a request that did not do what the response claims.
 (Genuine tag-resolution failures are arguably also reported as overall `true`, but
 that is defensible: those tags are tracked in `_unresolvedTags` and the design models
 unresolved tags as a runtime quality concern, with a `Bad`-quality `TagValueUpdate`
 already pushed. The connection-level case is the clear defect because the actor itself
 treats it as a failure worth a state transition.)
 **Recommendation**
 When `connectionLevelFailure` is true, reply with
 `SubscribeTagsResponse(..., success: false, error: "connection unavailable — will
 re-subscribe on reconnect", ...)` (or an equivalent), so the caller's response matches
 the actor's own assessment. Optionally carry per-tag outcomes in the response so the
 Instance Actor can reflect partial success. Add a test asserting the response is not
 `Success: true` when a connection-level subscribe failure drives `Reconnecting`.
 **Resolution**
 _Unresolved._
 ### DataConnectionLayer-017 — `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect
 | | |
 |--|--|
 | Severity | Low |
 | Category | Error handling & resilience |
 | Status | Open |
 | Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227` |
 **Description**
 `WriteBatchAsync` loops calling `WriteAsync` per tag (`OpcUaDataConnection.cs:229-237`).
 `WriteAsync` returns a `WriteResult` for OPC-UA-level write rejections (good — a bad
 status does not abort the batch), but it first calls `EnsureConnected()`
 (`OpcUaDataConnection.cs:220`), which throws `InvalidOperationException` when the
 client is disconnected. `WriteBatchAsync` does not catch that exception, so if the
 connection drops partway through a batch the whole `WriteBatchAsync` throws and the
 caller gets no result map — losing the per-tag outcomes for the tags that already
 wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
 `ReadBatchAsync` (which now records a failed `ReadResult` per failing tag and only
 propagates `OperationCanceledException`). `WriteBatchAsync` feeds
 `WriteBatchAndWaitAsync` (line 246), so a disconnect during a flag-and-wait write
 sequence surfaces as an unhandled exception rather than a clean `false`/per-tag result.
 Severity is Low because device writes are real-time control operations with no
 store-and-forward, the batch write paths are not on the primary `HandleWrite` hot path
 (`HandleWrite` calls single-tag `WriteAsync`), and a disconnect mid-batch is itself an
 error condition — but the inconsistent error shape (exception vs. per-tag result) is a
 maintainability and correctness wart.
 **Recommendation**
 Mirror the DCL-007 fix: wrap the per-tag `WriteAsync` call in `WriteBatchAsync` in a
 try/catch that records a failed `WriteResult(false, ex.Message)` for the failing tag
 and continues, while still propagating `OperationCanceledException` to abort a
 cancelled batch as a whole. This gives callers (including `WriteBatchAndWaitAsync`) a
 complete, consistent result map.
 **Resolution**
 _Unresolved._
--- a/code-reviews/DeploymentManager/findings.md
+++ b/code-reviews/DeploymentManager/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.DeploymentManager` |
 | Design doc | `docs/requirements/Component-DeploymentManager.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
+| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Open findings | 3 |
 ## Summary
@@ -30,20 +30,43 @@ detail. Configuration is not bound to `appsettings.json`, leaving one option
 entirely dead. Test coverage stops at the communication boundary and never
 exercises a successful deployment or the lifecycle success paths.
 #### Re-review 2026-05-17 (commit `39d737e`)
 Re-reviewed at commit `39d737e` after the batch of fixes for
 DeploymentManager-001..014. All fourteen prior findings remain `Resolved` and
 verified against source — the broadened catch, non-cancellable cleanup writes,
 ref-counted `OperationLockManager`, query-before-redeploy reconciliation,
 structured diff, options binding, and the expanded TestKit-actor test suite are
 all present and correct. The module is in markedly better shape than the
 first review: error paths are now defensively handled and test coverage is
 broad (successful deploy/lifecycle, lock serialization, reconciliation
 matrix, artifact per-site matrix).
 This re-review found **3 new findings**, all clustered on the
 DeploymentManager-006 reconciliation path added since the last review. The
 reconciliation shortcut (`TryReconcileWithSiteAsync`) marks a stale prior
 record `Success` when the site already has the target revision, but it does
 **not** perform the side effects the normal success path does — it never
 updates the instance `State`, never refreshes the `DeployedConfigSnapshot`,
 and never corrects the prior record's own `RevisionHash` (DeploymentManager-015,
 DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
 it still describes the query-before-redeploy behaviour that actually moved into
 `TryReconcileWithSiteAsync` (DeploymentManager-017).
 ## Checklist coverage
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ✓ | Stuck `InProgress` record on unexpected exception; cancelled-token failure write. |
+| 1 | Correctness & logic bugs | ✓ | Re-review 2026-05-17: reconciliation skips instance-state/snapshot updates (DeploymentManager-015) and keeps a stale `RevisionHash` (DeploymentManager-016). Prior: stuck `InProgress` / cancelled-token write (resolved). |
 | 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
-| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` is sound but leaks semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. |
+| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counts and reclaims semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. No issues at re-review. |
-| 4 | Error handling & resilience | ✓ | Several gaps — see DeploymentManager-001/002/003/004. |
+| 4 | Error handling & resilience | ✓ | Prior gaps DeploymentManager-001/002/003/004 resolved and verified. No new issues. |
-| 5 | Security | ✓ | SMTP credentials are serialized and broadcast to sites — see DeploymentManager-013. No injection vectors; no authz here (enforced upstream). |
+| 5 | Security | ✓ | SMTP credential handling documented as an accepted design decision (DeploymentManager-013). No injection vectors; no authz here (enforced upstream). No new issues. |
-| 6 | Performance & resource management | ✓ | Semaphore leak (DeploymentManager-005); artifact rebuild does N+1 method queries per external system. |
+| 6 | Performance & resource management | ✓ | Semaphore leak resolved (DeploymentManager-005). No new issues. |
-| 7 | Design-document adherence | ✓ | Missing query-before-redeploy (DeploymentManager-006); Diff View not implemented (DeploymentManager-007). |
+| 7 | Design-document adherence | ✓ | Query-before-redeploy and Diff View implemented (DeploymentManager-006/007). Re-review: reconciliation path breaks the deployed-snapshot/instance-state invariants — see DeploymentManager-015. |
-| 8 | Code organization & conventions | ✓ | Options class not bound to configuration — DeploymentManager-008. POCO/repo placement correct. |
+| 8 | Code organization & conventions | ✓ | Options binding resolved (DeploymentManager-008). POCO/repo placement correct. No new issues. |
-| 9 | Testing coverage | ✓ | No successful-deploy test, no lifecycle success test — DeploymentManager-011; dead `CreateCommand` helper — DeploymentManager-014. |
+| 9 | Testing coverage | ✓ | Broad coverage added (success, lifecycle, lock serialization, reconciliation, artifact matrix). Re-review: reconciled-success path's missing side effects (DeploymentManager-015) are untested. |
-| 10 | Documentation & comments | ✓ | Misleading timeout comment — DeploymentManager-009; stale option XML doc — DeploymentManager-012. |
+| 10 | Documentation & comments | ✓ | Prior comment findings resolved. Re-review: `GetDeploymentStatusAsync` XML doc is now stale — DeploymentManager-017. |
 ## Findings
@@ -710,3 +733,126 @@ the communication boundary. New tests:
 `DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId` (also
 covers DeploymentManager-010), `DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix`
 (per-site success/failure matrix), `RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits`.
 ### DeploymentManager-015 — Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates
 | | |
 |--|--|
 | Severity | High |
 | Category | Correctness & logic bugs |
 | Status | Open |
 | Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:631-655` |
 **Description**
 `TryReconcileWithSiteAsync` (the DeploymentManager-006 query-before-redeploy
 path) handles the case where a prior `InProgress`/timeout-`Failed` record exists
 and the site reports it already has the target revision hash. In that case it
 marks the prior `DeploymentRecord` `Success`, audit-logs `DeployReconciled`, and
 returns it — the caller then returns `Result.Success` and **never enters the
 normal deploy body**.
 The normal success path (`DeployInstanceAsync.cs:215-223`) does three things on
 a successful site response: writes the deployment record terminal status, sets
 `instance.State = InstanceState.Enabled` + `UpdateInstanceAsync`, and calls
 `StoreDeployedSnapshotAsync`. The reconciliation shortcut performs only the
 first. Consequently, after a reconciled deployment:
 - The instance `State` is left at whatever it was (e.g. `NotDeployed` for a
  first-time deploy that timed out, or `Disabled`) even though the site is
  actually running the configuration — the central state machine and the site
  diverge, and a subsequent `DisableInstanceAsync`/`EnableInstanceAsync` will be
  rejected or allowed incorrectly by `StateTransitionValidator`.
 - No `DeployedConfigSnapshot` is created or refreshed. A first-time deploy that
  is resolved purely by reconciliation leaves `GetDeploymentComparisonAsync`
  permanently returning `"No deployed snapshot found for this instance."`, and a
  redeploy reconciliation leaves the stored snapshot showing the *old* config
  even though the deployment record claims `Success` for the new revision.
 The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the
 deployed snapshot and instance state to reflect the last successful deployment;
 the reconciliation path silently breaks both invariants.
 **Recommendation**
 In the reconciled-success branch of `TryReconcileWithSiteAsync`, perform the
 same post-success side effects as the normal path: set `instance.State =
 InstanceState.Enabled` (+ `UpdateInstanceAsync`) and call
 `StoreDeployedSnapshotAsync` with the target deployment ID / revision hash /
 config JSON. Factor the shared post-success logic into one helper so the normal
 and reconciliation paths cannot drift. Add a regression test asserting that a
 reconciled deployment leaves the instance `Enabled` and a snapshot stored.
 **Resolution**
 _Unresolved._
 ### DeploymentManager-016 — Reconciled prior record keeps its stale `RevisionHash`
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
 | Status | Open |
 | Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:639-651` |
 **Description**
 When `TryReconcileWithSiteAsync` reconciles a prior record, it mutates
 `prior.Status`, `prior.ErrorMessage`, and `prior.CompletedAt`, but **not**
 `prior.RevisionHash`. The reconciliation condition only compares the *site's*
 `AppliedRevisionHash` against the *freshly-flattened* `targetRevisionHash` — it
 does not require `prior.RevisionHash` to equal either of them.
 The prior record can legitimately carry a different revision hash than the
 current target: e.g. a deploy timed out at revision `R1`, the template was then
 edited so the current flatten yields `R2`, and meanwhile the site actually
 applied `R2` through some other path (or `R1` and `R2` are equal-by-content but
 the prior record predates a hash recompute). After reconciliation the record's
 `Status` is `Success` but its `RevisionHash` still says `R1`, so staleness
 checks and any UI that reads `DeploymentRecord.RevisionHash` will report the
 instance as deployed at the wrong revision. The audit `DeployReconciled` entry
 records `RevisionHash = targetRevisionHash`, contradicting the persisted record.
 **Recommendation**
 In the reconciled-success branch, also set `prior.RevisionHash =
 targetRevisionHash` so the persisted record, the audit entry, and the site's
 actual applied revision all agree. Alternatively, only reconcile when
 `prior.RevisionHash == targetRevisionHash` and otherwise fall through to a
 normal deploy.
 **Resolution**
 _Unresolved._
 ### DeploymentManager-017 — `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement
 | | |
 |--|--|
 | Severity | Low |
 | Category | Documentation & comments |
 | Status | Open |
 | Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:562-570` |
 **Description**
 The XML summary on `GetDeploymentStatusAsync` reads: *"WP-2: After
 failover/timeout, query site for current deployment state before
 re-deploying."* The method body does no such thing — it is a one-line
 pass-through to `_repository.GetDeploymentByDeploymentIdAsync`, a pure local DB
 read. The query-the-site-before-redeploy behaviour the comment describes was
 implemented separately in `TryReconcileWithSiteAsync` (DeploymentManager-006).
 The stale comment is a leftover of the original design intent and misleads a
 reader into thinking this method contacts the site.
 **Recommendation**
 Reword the summary to describe what the method actually does — "returns the
 current persisted `DeploymentRecord` for the given deployment ID from the
 configuration database" — and, if useful, cross-reference
 `TryReconcileWithSiteAsync` as the place the site-query reconciliation lives.
 **Resolution**
 _Unresolved._
--- a/code-reviews/ExternalSystemGateway/findings.md
+++ b/code-reviews/ExternalSystemGateway/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.ExternalSystemGateway` |
 | Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
+| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Open findings | 3 |
 ## Summary
@@ -30,20 +30,41 @@ Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGa
 are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
 requirements (timeout, retry settings) that are declared but not implemented.
 #### Re-review 2026-05-17 (commit `39d737e`)
 All fourteen prior findings remain `Resolved`; the resolutions for findings 001–014
 were spot-checked against the current source and hold. The re-review walked the full
 10-category checklist again and surfaced **three new findings**. The most serious
 (`ExternalSystemGateway-015`, High) is a regression *introduced by* the
 `ExternalSystemGateway-004` resolution: `CachedCall`/`CachedWrite` now pass a
 per-system/per-connection `MaxRetries` of `0` through verbatim, but
 `StoreAndForwardService.RetryMessageAsync` interprets a stored `MaxRetries` of `0` as
 **retry forever**, not "never retry" — so the very `0` the ESG-004 fix claims to
 "honour as never retry" actually produces an unbounded retry loop, and two ESG tests
 assert the broken behaviour. `ExternalSystemGateway-016` (Medium) is that the
 `ExternalSystemGateway-013` resolution used `ConfigureHttpClientDefaults`, which is a
 **process-global** registration — it forces a `SocketsHttpHandler` (capped at the ESG
 option) onto every `HttpClient` in the host, including the Notification Service's
 OAuth2 token client, not just the gateway's per-system clients. `ExternalSystemGateway-017`
 (Low) is a trailing-`?` URL nit when a GET method's parameters are all null. Theme:
 both substantive findings are second-order defects in earlier fixes — the earlier
 resolutions did not verify the downstream contract of the S&F engine they integrate
 with.
 ## Checklist coverage
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ☑ | URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009. |
+| 1 | Correctness & logic bugs | ☑ | Prior: URL edge cases, dropped S&F result, classification — 003, 006, 009. Re-review: `MaxRetries == 0` retry-forever vs never-retry contradiction (015); trailing-`?` URL nit (017). |
-| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module. |
+| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found. |
-| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; `ExternalCallResult.Response` lazy-parse is not thread-safe but instances are single-use. No findings raised. |
+| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. No findings raised. |
-| 4 | Error handling & resilience | ☑ | S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008. |
+| 4 | Error handling & resilience | ☑ | Prior: handler registration, double-dispatch, timeout, cancellation — 001, 002, 003, 008. Re-review: the unbounded-retry consequence of finding 015 is also a resilience defect (recorded under category 1). |
-| 5 | Security | ☑ | Auth secrets logged-safe, but error bodies echoed verbatim — finding 007. |
+| 5 | Security | ☑ | Error bodies now truncated (007). No new issues — auth secrets not logged, body capped. |
-| 6 | Performance & resource management | ☑ | `HttpRequestMessage`/`HttpResponseMessage` and failed `SqlConnection` not disposed; full repository scan per call — findings 005, 010, 011. |
+| 6 | Performance & resource management | ☑ | Prior: disposal and repository-scan findings 005, 010, 011 — all resolved and verified. No new issues. |
-| 7 | Design-document adherence | ☑ | Timeout, retry settings, audit logging gaps — findings 002, 004, 012. |
+| 7 | Design-document adherence | ☑ | Prior: timeout, retry settings, logging — 002, 004, 012. Re-review: finding 015 is also a design-adherence gap (S&F retry contract); recorded under category 1. |
-| 8 | Code organization & conventions | ☑ | Options class correctly owned by module; `MaxConcurrentConnectionsPerSystem` unused — finding 013. |
+| 8 | Code organization & conventions | ☑ | Prior: `MaxConcurrentConnectionsPerSystem` wiring — 013. Re-review: that wiring uses process-global `ConfigureHttpClientDefaults`, leaking the ESG cap onto every host `HttpClient` — finding 016. |
-| 9 | Testing coverage | ☑ | CachedCall buffering and DatabaseGateway untested — finding 014. |
+| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
-| 10 | Documentation & comments | ☑ | XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012. |
+| 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |
 ## Findings
@@ -760,3 +781,144 @@ headers and body so URL/auth/body construction is now verified, not just status
 These are new-coverage tests against already-correct behaviour, so they pass on the
 current source; the `BuildUrl` and `ApplyAuth` paths they exercise are now protected
 against regression.
 ### ExternalSystemGateway-015 — `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim
 | | |
 |--|--|
 | Severity | High |
 | Category | Correctness & logic bugs |
 | Status | Open |
 | Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:120-127`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:102-108` |
 **Description**
 The `ExternalSystemGateway-004` fix removed the `MaxRetries > 0 ? ... : null` guard so
 that `CachedCallAsync` and `CachedWriteAsync` now pass the definition's `MaxRetries`
 to `StoreAndForwardService.EnqueueAsync` verbatim. The stated rationale (and the inline
 comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101`, plus the
 tests `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` and
 `CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`) is that a configured
 `MaxRetries` of `0` means **"never retry"**.
 That is the opposite of what the Store-and-Forward engine actually does with the
 value. `EnqueueAsync` stores the passed `maxRetries` directly into
 `StoreAndForwardMessage.MaxRetries` (`StoreAndForwardService.cs:139`), whose own XML
 doc states **"Maximum retry-sweep attempts before parking (0 = no limit)"**
 (`StoreAndForwardMessage.cs:30`). The retry sweep enforces it as
 `if (message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries)`
 (`StoreAndForwardService.cs:285`) — when `MaxRetries == 0` that guard is false on
 every sweep, so the message is **never parked and is retried forever**.
 Consequences for a `CachedCall`/`CachedWrite` against a system or connection
 configured with `MaxRetries = 0`:
 1. A transiently-failing message that the operator intended never to retry is instead
   retried on every sweep indefinitely, accumulating in the buffer and repeatedly
   re-dispatching the request — the exact unbounded-retry / duplicate-delivery hazard
   the idempotency note in the design doc warns about.
 2. The two ESG regression tests cited above assert `max_retries == 0` is *stored* and
   describe that as "honoured" — they verify the persisted column, never the resulting
   sweep behaviour, so they lock in the broken semantics.
 3. Because the SiteRuntime repository still always supplies `MaxRetries == 0` (the
   open companion gap noted in `ExternalSystemGateway-004`), **every** cached call and
   cached write at every site currently buffers as retry-forever. Before the ESG-004
   fix the `> 0` guard sent `null`, so the S&F `DefaultMaxRetries` (a bounded value)
   applied — i.e. the ESG-004 fix turned a bounded retry into an unbounded one for the
   common path.
 **Recommendation**
 Reconcile the ESG and S&F interpretations of `MaxRetries == 0` — they must agree.
 Either: (a) treat the entity's `MaxRetries == 0` as "unset" and pass `null` so the
 bounded S&F default applies (reverting to the pre-ESG-004 behaviour, and accepting
 that "never retry" then needs a different representation such as a nullable field or a
 `MaxRetries == 1` convention); or (b) if `0` genuinely must mean "never retry", add an
 explicit no-retry path — e.g. do not enqueue at all on transient failure when
 `MaxRetries == 0`, or introduce a distinct sentinel — and fix the
 `StoreAndForwardMessage.MaxRetries` doc and `RetryMessageAsync` guard so `0` no longer
 means "no limit". Update the two `ZeroMaxRetries...` tests to assert the *sweep*
 outcome (parked / not retried), not just the stored column value.
 **Resolution**
 _Unresolved._
 ### ExternalSystemGateway-016 — `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Code organization & conventions |
 | Status | Open |
 | Location | `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:21-29` |
 **Description**
 The `ExternalSystemGateway-013` fix wires `MaxConcurrentConnectionsPerSystem` by
 calling `services.ConfigureHttpClientDefaults(builder => builder.ConfigurePrimaryHttpMessageHandler(...))`.
 The inline comment claims this "applies to the dynamically-named clients created by
 `ExternalSystemClient`" — but `ConfigureHttpClientDefaults` is **process-global**: it
 adds the configuration to *every* `HttpClient`/`IHttpClientFactory` client created
 anywhere in the host, regardless of name.
 The Host registers the External System Gateway alongside other components that also
 use `IHttpClientFactory` — notably `ScadaLink.NotificationService` (`OAuth2TokenService`
 and its `ServiceCollectionExtensions` call `AddHttpClient`). With the ESG registration
 present, the OAuth2 token client (and any future `HttpClient` consumer in the host)
 has its **primary handler replaced** by a `SocketsHttpHandler` whose
 `MaxConnectionsPerServer` is the ESG's `MaxConcurrentConnectionsPerSystem`. That:
 1. Silently caps unrelated clients at a value owned by, and named for, a different
   component — an operator tuning the ESG option unknowingly throttles Microsoft 365
   OAuth2 token acquisition.
 2. Overrides/discards any primary-handler configuration those other components add for
   their own clients (e.g. a custom handler, proxy, or certificate settings).
 This is a leaky, surprising side effect for what the option claims to be a per-ESG
 setting; `ConfigureHttpClientDefaults` should not be used to express a single
 component's policy.
 **Recommendation**
 Scope the handler configuration to the gateway's own clients only. The ESG already
 creates per-system clients with a deterministic name pattern
 (`ExternalSystem_{system.Name}`); register a typed/named `HttpClient` (or a small
 factory abstraction) for that pattern and call `ConfigurePrimaryHttpMessageHandler`
 on that registration instead of on the global defaults. If a name-pattern registration
 is impractical, document the global effect explicitly and rename the option, but the
 preferred fix is to stop using `ConfigureHttpClientDefaults`.
 **Resolution**
 _Unresolved._
 ### ExternalSystemGateway-017 — `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null
 | | |
 |--|--|
 | Severity | Low |
 | Category | Correctness & logic bugs |
 | Status | Open |
 | Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:324-333` |
 **Description**
 In `BuildUrl`, the GET/DELETE query-string branch is entered when
 `parameters != null && parameters.Count > 0`, but the projection then filters out
 null-valued entries (`parameters.Where(p => p.Value != null)`). When a GET/DELETE
 method is invoked with a non-empty parameter dictionary whose values are *all* null,
 `queryString` is the empty string and the code still executes `url += "?" + queryString`,
 producing a URL ending in a bare `?` (e.g. `https://host/api/resource?`). Most servers
 tolerate a trailing `?`, but it is an unintended artifact, can defeat response caching
 on some endpoints, and makes captured request URLs harder to read in logs.
 **Recommendation**
 Only append `"?" + queryString` when `queryString` is non-empty (compute the joined
 string first and check it), so a method whose effective parameter set is empty
 produces a clean URL identical to the no-parameters case.
 **Resolution**
 _Unresolved._
--- a/code-reviews/HealthMonitoring/findings.md
+++ b/code-reviews/HealthMonitoring/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.HealthMonitoring` |
 | Design doc | `docs/requirements/Component-HealthMonitoring.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
+| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Open findings | 4 |
 ## Summary
@@ -32,20 +32,38 @@ heartbeat path, and most collector setters. None of the findings are crash-class
 but the concurrency issues are Medium/High and the missing S&F metric is a real
 design-adherence gap.
 #### Re-review 2026-05-17 (commit `39d737e`)
 All twelve prior findings (HealthMonitoring-001..012) are confirmed `Resolved` —
 `SiteHealthState` is now an immutable `sealed record` mutated only via atomic
 compare-and-swap, the store-and-forward buffer-depth metric is populated, the
 central-site offline grace and the unknown-site heartbeat registration are in
 place, and the test suite has grown to cover the report loop, heartbeat path, and
 collector setters. This re-review found **4 new findings, all Low/Medium, none
 crash-class**. They are residual polish items rather than behaviour regressions:
 an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated
 `HealthMonitoringOptions` intervals that crash the hosted service on
 misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left
 with a year-0001 `LastReportReceivedAt` that the UI's staleness display must
 special-case (HealthMonitoring-015), and `CollectReport` reading
 `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected
 `TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
 broadly faithful to the design intent.
 ## Checklist coverage
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
+| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 `LastReportReceivedAt` (HealthMonitoring-015). |
 | 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
 | 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
-| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
+| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: `HealthMonitoringOptions` intervals unvalidated — a zero/negative value crashes the hosted service at `PeriodicTimer` construction (HealthMonitoring-014). |
 | 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
 | 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
 | 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
-| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
+| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). |
 | 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
-| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |
+| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |
 ## Findings
@@ -560,3 +578,148 @@ has not yet sent a report"). A codebase-wide search confirms no `null!` suppress
 remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change
 HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so
 the contract is now honest and no further code change was required.
 ### HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout`
 | | |
 |--|--|
 | Severity | Low |
 | Category | Documentation & comments |
 | Status | Open |
 | Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
 **Description**
 `ExecuteAsync` derives the `PeriodicTimer` cadence with the comment "Check at half
 the (shorter) offline timeout interval for timely detection", but the code only
 reads `_options.OfflineTimeout`:
 ```csharp
 var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
 ```
 `CentralOfflineTimeout` (HealthMonitoring-005's fix) is never considered. With the
 default options (`OfflineTimeout` 60s, `CentralOfflineTimeout` 3m) `OfflineTimeout`
 genuinely is the shorter of the two, so the parenthetical happens to hold. But the
 comment states an invariant the code does not enforce: if an operator configures
 `CentralOfflineTimeout` *smaller* than `OfflineTimeout`, the check cadence stays
 tied to `OfflineTimeout`, and central offline detection is delayed by up to a full
 `OfflineTimeout / 2` beyond the intended `CentralOfflineTimeout` window. The comment
 misleads a reader into believing the cadence already adapts to whichever timeout is
 shorter.
 **Recommendation**
 Either compute `checkInterval` from `Math.Min(OfflineTimeout, CentralOfflineTimeout)`
 so the code matches the comment, or drop the "(shorter)" wording and state plainly
 that the cadence is derived from `OfflineTimeout` only (acceptable while the default
 `CentralOfflineTimeout` is the larger value).
 **Resolution**
 _Unresolved._
 ### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
 | | |
 |--|--|
 | Severity | Low |
 | Category | Error handling & resilience |
 | Status | Open |
 | Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |
 **Description**
 `HealthMonitoringOptions` is bound from the `ScadaLink:HealthMonitoring` config
 section (`SiteServiceRegistration.BindSharedOptions`) with no validation —
 no `IValidateOptions<HealthMonitoringOptions>`, no `ValidateDataAnnotations`, no
 `ValidateOnStart`. `ReportInterval`, `OfflineTimeout`, and `CentralOfflineTimeout`
 are all fed straight into `new PeriodicTimer(...)` (and `OfflineTimeout` into a
 division for the check interval). `PeriodicTimer`'s constructor throws
 `ArgumentOutOfRangeException` for a zero or negative period. A misconfigured
 `appsettings.json` (e.g. `"ReportInterval": "00:00:00"`, an empty/garbled value
 that binds to `TimeSpan.Zero`, or a negative span) therefore crashes the
 `HealthReportSender` / `CentralHealthReportLoop` / `CentralHealthAggregator`
 hosted service at startup with an opaque exception that does not name the
 offending config key, rather than failing fast with a clear validation message.
 **Recommendation**
 Add an options validator (DataAnnotations `[Range]`-style on the spans, or an
 `IValidateOptions<HealthMonitoringOptions>`) that rejects non-positive
 `ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and ideally requires
 `CentralOfflineTimeout >= OfflineTimeout`, and call `.ValidateOnStart()` so a bad
 configuration fails fast with a message naming the section and key.
 **Resolution**
 _Unresolved._
 ### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
 | Status | Open |
 | Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27` |
 **Description**
 When `MarkHeartbeat` registers a previously-unknown site (HealthMonitoring-007's
 fix), it sets `LastReportReceivedAt = default` — i.e. `DateTimeOffset.MinValue`
 (`0001-01-01`). The XML doc on `SiteHealthState.LastReportReceivedAt` states the
 field is "Used by the UI to surface report staleness during failover." A
 heartbeat-only site therefore has `LatestReport == null` **and**
 `LastReportReceivedAt == DateTimeOffset.MinValue`. Any UI code that computes
 "last report N ago" as `now - LastReportReceivedAt` without first checking
 `LatestReport != null` will render a nonsensical staleness of roughly two
 thousand years for a site that is, in fact, freshly reachable. The two
 "no report yet" signals (`LatestReport == null`, `LastReportReceivedAt == default`)
 are independent and both must be special-cased; the sentinel value is an easy trap.
 **Recommendation**
 Make `LastReportReceivedAt` nullable (`DateTimeOffset?`) so "no report received
 yet" is an explicit, unmissable state rather than a magic sentinel — consistent
 with how `LatestReport` was already made nullable for the same case — and have UI
 consumers render staleness only when it has a value. Alternatively, document the
 `default` sentinel prominently on the field and audit every UI reader, but the
 nullable option is safer and matches the existing `LatestReport` treatment.
 **Resolution**
 _Unresolved._
 ### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
 | | |
 |--|--|
 | Severity | Low |
 | Category | Code organization & conventions |
 | Status | Open |
 | Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |
 **Description**
 `CollectReport` stamps each report with `ReportTimestamp: DateTimeOffset.UtcNow`,
 read directly from the system clock. Every other time-dependent class in the
 module — `CentralHealthAggregator`, `HealthReportSender`, `CentralHealthReportLoop`
 — was deliberately refactored (HealthMonitoring-006) to take an injectable
 `TimeProvider` so the behaviour is deterministically testable and the clock
 dependency is explicit. `SiteHealthCollector` is the lone holdout: the report
 timestamp cannot be controlled in a unit test, which is why
 `SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp` can only assert the
 timestamp falls in a `before`/`after` wall-clock window rather than equalling a
 known instant. This is a minor consistency/testability gap, not a behaviour bug.
 **Recommendation**
 Add an optional `TimeProvider` constructor parameter to `SiteHealthCollector`
 (defaulting to `TimeProvider.System`, mirroring the other classes) and derive
 `ReportTimestamp` from `GetUtcNow()`, so the report timestamp is deterministically
 testable and the module is consistent.
 **Resolution**
 _Unresolved._
--- a/code-reviews/README.md
+++ b/code-reviews/README.md
@@ -40,10 +40,10 @@ module file and counted in **Total**.
 | Severity | Open findings |
 |----------|---------------|
 | Critical | 0 |
-| High | 2 |
+| High | 5 |
-| Medium | 5 |
+| Medium | 12 |
-| Low | 10 |
+| Low | 17 |
-| **Total** | **17** |
+| **Total** | **34** |
 ## Module Status
@@ -54,11 +54,11 @@ module file and counted in **Total**.
 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/1 | 2 | 10 |
 | [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/2 | 2 | 14 |
 | [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 15 |
-| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
+| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/1 | 3 | 14 |
-| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
+| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/1 | 4 | 17 |
-| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
+| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
-| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
+| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
-| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 |
+| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 16 |
 | [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
 | [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
 | [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
@@ -80,14 +80,17 @@ description, location, recommendation — lives in the module's `findings.md`.
 _None open._
-### High (2)
+### High (5)
 | ID | Module | Title |
 |----|--------|-------|
 | CentralUI-020 | [CentralUI](CentralUI/findings.md) | Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot |
 | Communication-012 | [Communication](Communication/findings.md) | gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover |
 | DataConnectionLayer-014 | [DataConnectionLayer](DataConnectionLayer/findings.md) | DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger |
 | DeploymentManager-015 | [DeploymentManager](DeploymentManager/findings.md) | Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates |
 | ExternalSystemGateway-015 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim |
-### Medium (5)
+### Medium (12)
 | ID | Module | Title |
 |----|--------|-------|
@@ -96,8 +99,15 @@ _None open._
 | CentralUI-022 | [CentralUI](CentralUI/findings.md) | `Deployments` push handler fires `InvokeAsync` with no disposal guard |
 | ClusterInfrastructure-009 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder |
 | Communication-013 | [Communication](Communication/findings.md) | Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller |
 | ConfigurationDatabase-012 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Inbound-API `ApiKey.KeyValue` bearer credential stored in plaintext |
 | ConfigurationDatabase-013 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Secret-column encryption silently falls back to an ephemeral (throwaway) key |
 | DataConnectionLayer-015 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup |
 | DataConnectionLayer-016 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure |
 | DeploymentManager-016 | [DeploymentManager](DeploymentManager/findings.md) | Reconciled prior record keeps its stale `RevisionHash` |
 | ExternalSystemGateway-016 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process |
 | HealthMonitoring-015 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` |
-### Low (10)
+### Low (17)
 | ID | Module | Title |
 |----|--------|-------|
@@ -111,3 +121,10 @@ _None open._
 | Commons-014 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy` |
 | Communication-014 | [Communication](Communication/findings.md) | Untrusted gRPC `correlation_id` flows directly into an Akka actor name |
 | Communication-015 | [Communication](Communication/findings.md) | No test exercises the real gRPC client factory across a node flip |
 | ConfigurationDatabase-014 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Redundant, inconsistent cast on one `HasConversion` call |
 | DataConnectionLayer-017 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect |
 | DeploymentManager-017 | [DeploymentManager](DeploymentManager/findings.md) | `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement |
 | ExternalSystemGateway-017 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null |
 | HealthMonitoring-013 | [HealthMonitoring](HealthMonitoring/findings.md) | Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout` |
 | HealthMonitoring-014 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service |
 | HealthMonitoring-016 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` |