docs(code-reviews): re-review batch 4 at 39d737e — SiteEventLogging, SiteRuntime, StoreAndForward, TemplateEngine

11 new findings: SiteEventLogging-012..014, SiteRuntime-017..019, StoreAndForward-015..017, TemplateEngine-015..016.
docs(code-reviews): re-review batch 3 at 39d737e — Host, InboundAPI, ManagementService, NotificationService, Security
2026-05-17 00:51:58 -04:00 · 2026-05-17 00:48:25 -04:00 · 2026-05-17 00:45:10 -04:00 · 2026-05-17 00:41:21 -04:00
20 changed files with 3204 additions and 165 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.CLI` |
 | Design doc | `docs/requirements/Component-CLI.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 3 |

 ## Summary

@@ -31,8 +31,26 @@ ID-keyed, flag-based surface. Test coverage exercises `OutputFormatter`, `CliCon
 `CommandHelpers.HandleResponse`, but the HTTP client, the `debug stream` path, the JSON
 argument parsing, and the command-tree wiring are untested.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All 13 prior findings are confirmed resolved — the resilience gaps, the dead format
+configuration, the credential-handling weakness, and the test-coverage holes have all
+been closed, and the test suite has grown substantially (`CommandTreeTests`,
+`ManagementHttpClientTests`, `DebugStreamTests`, etc.). The CLI's runtime behaviour is now
+solid. This re-review walked all 14 command groups against the full checklist and found
+three new issues, all rooted in **update-command and design-document drift** rather than
+runtime defects: every `update` command requires the entity's "core" fields (`--name`,
+`--script`) even though the design doc presents them as optional, so a partial update is
+impossible (CLI-014); the design doc's command surface has drifted again in two specific
+places — `template composition delete` and the `data-connection` config flags (CLI-015);
+and `WriteAsTable` derives table columns from only the first array element, silently
+dropping columns for any later element with a different shape (CLI-016). No
+Critical/High issues; the module remains healthy.
+
 ## Checklist coverage

+_Original review (2026-05-16, `9c60592`):_
+
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ☑ | Format precedence is broken (CLI-001); empty/non-JSON success bodies crash table rendering (CLI-002, CLI-003). |
@@ -46,6 +64,21 @@ argument parsing, and the command-tree wiring are untested.
 | 9 | Testing coverage | ☑ | No tests for `ManagementHttpClient`, `DebugCommands`, command-tree wiring, or JSON argument parsing (CLI-013). |
 | 10 | Documentation & comments | ☑ | `Component-CLI.md` mismatch (CLI-007); the in-repo `README.md` is reasonably accurate. Minor exit-code doc mismatch (CLI-009). |

+_Re-review (2026-05-17, `39d737e`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | Update commands require "core" fields, blocking partial updates (CLI-014); `WriteAsTable` headers derived from first array element only (CLI-016). |
+| 2 | Akka.NET conventions | ☑ | Not applicable — pure HTTP/SignalR client. No issues. |
+| 3 | Concurrency & thread safety | ☑ | `debug stream` concurrency now resolved via `DebugStreamHelpers` (CLI-011/012). No new issues. |
+| 4 | Error handling & resilience | ☑ | Malformed-URL / malformed-JSON / connect-cancellation paths all hardened (CLI-004/005/010). No new issues. |
+| 5 | Security | ☑ | Env-var credential fallback in place (CLI-006). Basic Auth over HTTP is by design. No new issues. |
+| 6 | Performance & resource management | ☑ | `CancellationTokenSource` now `using`-scoped (CLI-011). No new issues. |
+| 7 | Design-document adherence | ☑ | Two residual command-surface drifts: `template composition delete` and `data-connection --primary-config` (CLI-015). |
+| 8 | Code organization & conventions | ☑ | Consistent and clean; option construction centralised in `CliOptions`. No new issues. |
+| 9 | Testing coverage | ☑ | Substantially expanded (`CommandTreeTests`, `ManagementHttpClientTests`, `DebugStreamTests`). No new gaps. |
+| 10 | Documentation & comments | ☑ | XML docs accurate. `Component-CLI.md` drift folded into CLI-015. |
+
 ## Findings

 ### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken
@@ -537,3 +570,121 @@ Resolved 2026-05-16 (commit pending). Coverage gaps confirmed and closed:
  `InstanceCommands.TryParseBindings`/`TryParseOverrides` and covered by
  `InstanceArgumentParsingTests` under CLI-005.
 The CLI test suite went from 42 to 77 passing tests.
+
+### CLI-014 — `update` commands require "core" fields, making partial updates impossible
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.CLI/Commands/TemplateCommands.cs:77`, `src/ScadaLink.CLI/Commands/SiteCommands.cs:86`, `src/ScadaLink.CLI/Commands/ExternalSystemCommands.cs:40-42`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:39-40`, `src/ScadaLink.CLI/Commands/NotificationCommands.cs:40-41`, `src/ScadaLink.CLI/Commands/ApiMethodCommands.cs:79` |
+
+**Description**
+
+The design doc presents `update` commands with all non-`--id` fields as optional, e.g.
+`template update --id <id> [--name <name>] [--description <desc>] [--parent-id <id>]`
+(`Component-CLI.md:62`) and `api-method update --id <id> [--script <code>] ...`
+(`Component-CLI.md:224`). The implementation contradicts this: the update commands mark
+the entity's "core" fields as `Required = true`, so the user must always re-supply them:
+
+- `template update` — `--name` is `Required = true` (`TemplateCommands.cs:77`).
+- `site update` — `--name` is `Required = true` (`SiteCommands.cs:86`).
+- `external-system update` — `--name`, `--endpoint-url`, `--auth-type` are all
+  `Required = true` (`ExternalSystemCommands.cs:40-42`).
+- `data-connection update` — `--name`, `--protocol` are `Required = true`
+  (`DataConnectionCommands.cs:39-40`).
+- `notification update` — `--name`, `--emails` are `Required = true`
+  (`NotificationCommands.cs:40-41`).
+- `api-method update` — `--script` is `Required = true` (`ApiMethodCommands.cs:79`).
+- The same pattern applies to `template attribute/alarm/script update`
+  (`TemplateCommands.cs:164-165`, `246-248`, `332-334`) and `role-mapping update`
+  (`SecurityCommands.cs:110-111`).
+
+Because the corresponding `Update*Command` records are whole-replace (they carry the full
+field set, not a sparse patch), a user who wants to change only one field — e.g. flip an
+API method's timeout, or rename a template — must look up and re-pass every other field's
+current value. Omitting any required flag is a hard parse error. This makes scripted,
+single-field updates (a core CLI/CI use case) awkward and error-prone, and it does not
+match the documented optional-flag surface.
+
+**Recommendation**
+
+Decide on one model and align doc + code. Either (a) make the update flags genuinely
+optional and have the server/`Update*Command` treat a null field as "leave unchanged"
+(sparse patch), or (b) if whole-replace is intentional, update `Component-CLI.md` to show
+these flags as required (no `[...]`) and document that an update replaces the whole
+entity. Option (a) matches the documented surface and the typical CLI expectation.
+
+**Resolution**
+
+_Unresolved._
+
+### CLI-015 — `Component-CLI.md` command surface has drifted again in two places
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `docs/requirements/Component-CLI.md:75`, `docs/requirements/Component-CLI.md:125-126` (vs. `src/ScadaLink.CLI/Commands/TemplateCommands.cs:404-413`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:41`, `:86`) |
+
+**Description**
+
+CLI-007 regenerated the doc's "Command Structure" section, but two specific drifts remain
+or were introduced:
+
+- `Component-CLI.md:75` documents `template composition delete --template-id <id>
+  --instance-name <name>`, but the implementation (`TemplateCommands.cs:404-413`) deletes
+  a composition by its own integer ID via a single `--id` option
+  (`DeleteTemplateCompositionCommand(id)`). The doc's two-flag form does not exist.
+- `data-connection create` and `update` accept a `--primary-config` option (aliased
+  `--configuration`) for the primary configuration JSON (`DataConnectionCommands.cs:86`,
+  `:41`), but `Component-CLI.md:125-126` lists only `--backup-config` and
+  `--failover-retry-count` — the primary-config flag is absent from the doc.
+
+A reader following the doc would use a non-existent `template composition delete` form and
+would not discover the `--primary-config` flag.
+
+**Recommendation**
+
+Correct `Component-CLI.md:75` to `template composition delete --id <id>`, and add
+`[--primary-config <json>]` to the documented `data-connection create`/`update` signatures
+(`Component-CLI.md:125-126`). Also note the `--configuration` alias if aliases are
+documented elsewhere.
+
+**Resolution**
+
+_Unresolved._
+
+### CLI-016 — `WriteAsTable` derives columns from the first array element only
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:184-200` |
+
+**Description**
+
+When rendering a JSON array as a table, `WriteAsTable` builds the header set from
+`items[0].EnumerateObject()` only (`CommandHelpers.cs:184-186`) and then projects every
+row against that fixed header list (`:188-198`). If a later element of the array has a
+different shape — additional properties, or properties the first element lacks — those
+extra columns are silently dropped from the table and a row missing a header property
+renders an empty cell. The user sees a table that appears complete but has omitted data,
+with no indication that columns were discarded. (The JSON output path is unaffected; this
+only affects `--format table`.) Management API list responses are generally homogeneous,
+so the practical impact is low, but a heterogeneous array — e.g. a diff or a mixed-status
+list — would be rendered incorrectly with no warning.
+
+**Recommendation**
+
+Compute the header set as the union of property names across all array elements (iterate
+all items, collect distinct property names preserving first-seen order) before projecting
+rows, so no element's data is silently dropped.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.CentralUI` |
 | Design doc | `docs/requirements/Component-CentralUI.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 6 |

 ## Summary

@@ -32,6 +32,24 @@ Testing coverage is thin for a module this large: only the script analyzer,
 TreeView, schema model, and a few data-connection pages have unit tests; most
 pages and the auth bridge are untested.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All 19 findings from the 2026-05-16 review are confirmed closed. The resolution
+batch (`a9bd7ee`..`34588ae`) substantially rewrote the auth bridge, the script
+sandbox, several Deployment/Monitoring pages, and the shared component disposal
+paths, so this re-review re-examined the post-fix state across all 10 checklist
+categories. Six new findings (CentralUI-020 .. 025) were recorded. The most
+important is **CentralUI-020**: the two prior fixes interact destructively — the
+CentralUI-004 fix made `CookieAuthenticationStateProvider` return a frozen,
+constructor-time auth-state snapshot, while the CentralUI-005 fix rewrote
+`SessionExpiry.razor` to *poll* that same provider to detect a lapsed session.
+Because the snapshot never changes for the life of the circuit, the idle-timeout
+redirect can never fire, so the documented idle-logout behaviour is silently
+defeated. The remaining new findings are a cross-thread `Dictionary` mutation in
+`DebugView`, an unguarded `InvokeAsync` in the new `Deployments` push handler,
+and three Low-severity items (residual bare `catch`, magic-string claim
+lookups, and the untested `SessionExpiry` polling path).
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -47,6 +65,14 @@ pages and the auth bridge are untested.
 | 9 | Testing coverage | ☑ | Auth, sandbox-run, DebugView, Health, ParkedMessages, most pages untested. |
 | 10 | Documentation & comments | ☑ | Comments are accurate and helpful; a few stale claims noted. |

+Re-review 2026-05-17 (`39d737e`): all 10 categories re-examined against the
+post-fix source. New findings — category 3 (CentralUI-020 the auth-snapshot vs
+session-poll interaction is also a design-adherence regression; CentralUI-021
+cross-thread `Dictionary`; CentralUI-022 unguarded `InvokeAsync`), category 4
+(CentralUI-023 residual bare `catch`), category 8 (CentralUI-024 magic-string
+claims), category 9 (CentralUI-025 untested `SessionExpiry` poll). Categories
+1, 2, 5, 6, 7, 10 produced no new findings.
+
 ## Findings

 ### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement
@@ -961,3 +987,232 @@ it was failing on the baseline because `DeploymentService` had gained a
 `DiffService` constructor dependency from a DeploymentManager contract change
 that the test fixture had not been updated for; `DiffService` is now registered
 in the fixture.)
+
+### CentralUI-020 — Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor:39-62`; `src/ScadaLink.CentralUI/Auth/CookieAuthenticationStateProvider.cs:29-43` |
+
+**Description**
+
+The CentralUI-004 fix and the CentralUI-005 fix interact destructively.
+CentralUI-004 made `CookieAuthenticationStateProvider` snapshot the principal
+**once** in its constructor into a cached `Task<AuthenticationState>` and serve
+that exact task for the entire life of the SignalR circuit — it never re-reads
+`HttpContext`, never calls `SetAuthenticationState`, and never raises
+`NotifyAuthenticationStateChanged`. CentralUI-005 then rewrote
+`SessionExpiry.razor` to *poll* `AuthStateProvider.GetAuthenticationStateAsync()`
+once a minute and redirect to `/login` "once the sliding cookie has actually
+lapsed server-side." But `GetAuthenticationStateAsync()` returns the same frozen
+constructor-time snapshot on every call — `auth.User.Identity.IsAuthenticated`
+is permanently `true` for the life of the circuit regardless of whether the
+server-side cookie has expired. The poll loop therefore never observes an
+expired session and the redirect never fires. An idle user whose cookie has
+lapsed server-side keeps an authenticated-looking page open indefinitely; the
+documented "30-minute idle timeout" is silently defeated for any user who
+leaves a circuit open. (The cookie middleware would still reject the *next*
+full HTTP request / new circuit, so this is a stale-UI / missed-logout exposure
+rather than a full auth bypass — but the page continues to render
+authenticated content and a SignalR circuit can stay alive for a long time.)
+This is also a design-document-adherence regression against CLAUDE.md
+"Security & Auth" (30-minute idle timeout) — recorded under Concurrency because
+the root cause is the lifetime/staleness mismatch between the two components.
+
+**Recommendation**
+
+`SessionExpiry` must consult something that actually reflects the live
+server-side session, not the circuit's frozen principal. Options: (a) have
+`SessionExpiry` poll a lightweight authenticated server endpoint (e.g. a
+`/auth/ping` minimal API that returns 401 once the cookie has lapsed) and
+redirect on 401; or (b) give `CookieAuthenticationStateProvider` a refresh path
+that re-validates the cookie and calls `SetAuthenticationState` /
+`NotifyAuthenticationStateChanged` so the polled state can actually change.
+Whichever is chosen, add a test that exercises the redirect path with an
+expired session (see CentralUI-025).
+
+**Resolution**
+
+_Unresolved._
+
+### CentralUI-021 — `DebugView` stream callback mutates `Dictionary` off the render thread
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/DebugView.razor:404-419,511-519,275-289` |
+
+**Description**
+
+The `onEvent` callback passed to `DebugStreamService.StartStreamAsync` runs on
+an Akka/gRPC thread (as the design doc and the CentralUI-009 comments state). It
+calls `UpsertWithCap(_attributeValues, …)` / `UpsertWithCap(_alarmStates, …)`
+**directly on that thread** — the mutation is not marshalled through
+`InvokeAsync`; only the subsequent `StateHasChanged` is. Meanwhile the render
+thread evaluates `FilteredAttributeValues` / `FilteredAlarmStates`, which
+enumerate `_attributeValues.Values` / `_alarmStates.Values` and call
+`OrderBy(...).ToList()`. `Dictionary<TKey,TValue>` is not thread-safe: a write
+on the Akka thread concurrent with an enumeration on the render thread can throw
+`InvalidOperationException` ("Collection was modified; enumeration operation may
+not execute") or corrupt the dictionary's internal buckets. The CentralUI-009
+fix added a `_disposed` guard but did not address this data race — the guard
+only prevents touching a *disposed* component, not concurrent access to a live
+one. Under a busy debug stream this will intermittently fault the page.
+
+**Recommendation**
+
+Marshal the dictionary mutation onto the render thread too — move the
+`UpsertWithCap` call inside the `SafeInvokeAsync`/`InvokeAsync` body so all
+access to `_attributeValues`/`_alarmStates` happens on the renderer's
+dispatcher. Alternatively guard both the writes and the `Filtered*` reads with a
+lock, or use a concurrent collection. The cap-trim loop must be inside the same
+critical section as the upsert.
+
+**Resolution**
+
+_Unresolved._
+
+### CentralUI-022 — `Deployments` push handler fires `InvokeAsync` with no disposal guard
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/Deployments.razor:221-229,317-322` |
+
+**Description**
+
+`OnDeploymentStatusChanged` is invoked by `IDeploymentStatusNotifier`, a process
+singleton, on the DeploymentManager service thread. The handler does
+`_ = InvokeAsync(async () => { await LoadDataAsync(); StateHasChanged(); })`,
+discarding the returned task. `Dispose()` unsubscribes the handler, but there is
+a race window: the notifier can read the subscriber list and begin invoking
+`OnDeploymentStatusChanged` *just before* the component is disposed, so
+`InvokeAsync` then runs against a disposed component and throws
+`ObjectDisposedException` on the DeploymentManager thread — an unobserved task
+exception (the task is fire-and-forget). The same hazard was explicitly fixed
+for `DebugView` (CentralUI-009, `SafeInvokeAsync` + `_disposed` flag) and
+`ToastNotification` (CentralUI-010), but the new push-based `Deployments`
+handler introduced by the CentralUI-006 fix did not adopt the same guard.
+Separately, every push event triggers two full repository reloads
+(`GetAllInstancesAsync` + `GetAllDeploymentRecordsAsync`) for every open
+circuit, so a burst of status changes amplifies into N×2 round-trips per tick.
+
+**Recommendation**
+
+Add a `volatile bool _disposed` set first in `Dispose()`, have
+`OnDeploymentStatusChanged` no-op when set, and wrap the `InvokeAsync` dispatch
+to swallow `ObjectDisposedException` (mirror `DebugView.SafeInvokeAsync`).
+Optionally coalesce bursts (debounce) and/or reload only the changed record
+rather than the whole table on each event.
+
+**Resolution**
+
+_Unresolved._
+
+### CentralUI-023 — Residual bare `catch {}` blocks swallow JS interop errors
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Pages/Monitoring/ParkedMessages.razor:690-698`; `src/ScadaLink.CentralUI/Components/Shared/DiffDialog.razor:107-116,118-130,104` |
+
+**Description**
+
+CentralUI-018 narrowed the bare `catch {}` blocks in `MonacoEditor`,
+`TreeView`, and `Sites.razor`, but the same pattern survives elsewhere.
+`ParkedMessages.CopyAsync` wraps `navigator.clipboard.writeText` in
+`catch { _toast.ShowError("Copy failed."); }` — a real `JSException`
+(clipboard permission denied) and an expected `JSDisconnectedException` are
+treated identically and neither is logged. `DiffDialog.TryLockBodyAsync` /
+`TryUnlockBodyAsync` each have a bare outer `catch` whose handler does another
+JS call wrapped in a second bare `catch { /* swallow */ }`, and
+`OnAfterRenderAsync`'s `_modalRef.FocusAsync()` is wrapped in a bare
+`catch { /* prerender or detached: ignore */ }`. Genuine interop failures in
+these paths are invisible in production logs.
+
+**Recommendation**
+
+Catch `JSDisconnectedException` silently and `JSException` (and
+`InvalidOperationException` for the prerender focus case) with an `ILogger`
+call, consistent with the CentralUI-018 fixes in the same module.
+
+**Resolution**
+
+_Unresolved._
+
+### CentralUI-024 — Claim lookups use magic strings instead of `JwtTokenService` constants
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Layout/NavMenu.razor:102`; `src/ScadaLink.CentralUI/Components/Pages/Dashboard.razor:14`; `GetCurrentUserAsync` in `Templates.razor`, `TemplateEdit.razor`, `TemplateCreate.razor`, `SharedScripts.razor`, `SharedScriptForm.razor`, `Sites.razor`, `Topology.razor`, `InstanceCreate.razor`, `InstanceConfigure.razor` |
+
+**Description**
+
+`ScadaLink.Security.JwtTokenService` exposes the canonical claim-type constants
+(`UsernameClaimType = "Username"`, `DisplayNameClaimType = "DisplayName"`,
+`RoleClaimType`, `SiteIdClaimType`). `SiteScopeService` correctly uses
+`JwtTokenService.SiteIdClaimType`, but every `GetCurrentUserAsync` helper across
+ten pages does `authState.User.FindFirst("Username")?.Value`, and `NavMenu` /
+`Dashboard` do `context.User.FindFirst("DisplayName")`. The literals happen to
+match the constants today, so there is no live bug — but if the claim type is
+ever renamed in `JwtTokenService` (the single source of truth) every one of
+these call sites silently breaks, falling back to `"unknown"` for the audit
+user and a blank display name. The duplicated `GetCurrentUserAsync` helper is
+also copy-pasted verbatim into ten components.
+
+**Recommendation**
+
+Replace the string literals with `JwtTokenService.UsernameClaimType` /
+`DisplayNameClaimType`. Consider extracting the repeated `GetCurrentUserAsync`
+into a single shared helper (e.g. an extension on `AuthenticationStateProvider`
+or a small scoped service) so the claim lookup lives in exactly one place.
+
+**Resolution**
+
+_Unresolved._
+
+### CentralUI-025 — `SessionExpiry` polling/redirect path has no test coverage
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.CentralUI.Tests/Auth/SessionExpiryPolicyTests.cs`; `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor` |
+
+**Description**
+
+`SessionExpiryPolicyTests` covers only `AuthEndpoints.BuildSignInProperties()`
+(the sign-in properties shape). The actual runtime behaviour of
+`SessionExpiry.razor` — that an expired session triggers a redirect to
+`/login`, that an authenticated session does not, and that the component does
+not poll/redirect on the `/login` page itself — is untested. Had a behavioural
+test exercised the redirect with an expired/anonymous auth state against the
+real `CookieAuthenticationStateProvider`, the CentralUI-020 defect (the frozen
+snapshot never reporting an expired session) would have been caught. The
+component is the system's only client-side idle-logout mechanism, so the gap is
+material.
+
+**Recommendation**
+
+Add bUnit tests for `SessionExpiry`: (a) with an unauthenticated auth state the
+component navigates to `/login`; (b) with an authenticated state it does not;
+(c) on the `/login` route it neither polls nor redirects. The provider used in
+the test must be one whose state can actually transition to expired — which
+also forces the CentralUI-020 fix.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.ClusterInfrastructure` |
 | Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 2 |

 ## Summary

@@ -29,20 +29,39 @@ every other component runs on, yet it presently delivers nothing the design requ
 The single options class is clean and its test covers defaults and setters
 adequately for what exists.

+**Re-review 2026-05-17 (commit `39d737e`).** All eight prior findings (CI-001..008)
+were resolved by the batch of work between `9c60592` and `39d737e`: `ClusterOptions`
+gained XML docs, the `SectionName` constant, and the `DownIfAlone` property;
+`ClusterOptionsValidator` was added; `ServiceCollectionExtensions` now registers the
+validator and throws from the dead actor-registration method; and the test project
+grew to 16 cases across three test classes. The module is in good shape — the
+`ClusterOptions` contract, its validator, and the DI registration are all sound,
+well-documented, and well-tested. This re-review examined all three source files and
+all three test files against the full 10-category checklist and found **two new
+issues**, both stemming from work the prior review explicitly deferred to a "Host
+review" that has not happened: the `DownIfAlone` property is exposed and validated as
+part of the configuration contract but is never consumed — `ScadaLink.Host`'s
+`BuildHocon` still hard-codes `down-if-alone = on` (CI-009, Medium) — and the validator
+does not enforce the design doc's requirement that `down-if-alone` be `on` for the
+keep-oldest resolver, so `DownIfAlone = false` is silently accepted (CI-010, Low).
+
 ## Checklist coverage

+Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
+`39d737e`) are appended in each row.
+
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). |
-| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. |
-| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. |
-| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. |
-| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. |
-| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. |
-| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). |
-| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. |
-| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). |
-| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). |
+| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). **Re-review:** CI-002 resolved. New — `DownIfAlone` is a settable property that controls nothing because the HOCON builder hard-codes the value (CI-009). |
+| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. **Re-review:** confirmed the Akka bootstrap legitimately lives in `ScadaLink.Host` (CI-001 resolution); still nothing actor-related in this module. No issues. |
+| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. **Re-review:** validator and DI extensions are stateless; no issues. |
+| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. **Re-review:** the validator now fails fast on misconfiguration. New — it does not enforce the design doc's `down-if-alone = on` requirement (CI-010). |
+| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. **Re-review:** still no authn/authz surface, no secret handling. No issues. |
+| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. **Re-review:** no resources held; the validator allocates a small failure list per call only. No issues. |
+| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). **Re-review:** CI-001/CI-003 resolved (ownership split documented; `DownIfAlone` added). New — `DownIfAlone` was added to the contract but never wired into the HOCON (CI-009). |
+| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. **Re-review:** CI-004/CI-005 resolved; `SectionName` constant present and options/validator placement correct. No issues. |
+| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). **Re-review:** CI-006 resolved — 16 tests across three classes covering options, validator, and DI registration. No `DownIfAlone`-wiring test exists, but that wiring lives in the Host (CI-009). No new issue here. |
+| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). **Re-review:** CI-007/CI-008 resolved — full XML docs on all members; skeleton comments gone. Note: the `DownIfAlone` XML doc calls `true` "the design-doc requirement" yet the value is inert (CI-009) and unenforced (CI-010). |

 ## Findings

@@ -490,3 +509,97 @@ section, and the component table reflects the true placement. This is a
 documentation-only finding, so no runtime regression test is meaningful; verified by
 inspection of `ServiceCollectionExtensions.cs` and
 `docs/requirements/Component-ClusterInfrastructure.md:21-39`.
+
+### ClusterInfrastructure-009 — `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74` |
+
+**Description**
+
+The `DownIfAlone` property was added to `ClusterOptions` by CI-003's resolution as
+part of "the split-brain configuration contract". It is public, defaults to `true`,
+carries an XML doc presenting it as "the design-doc requirement", and is exercised by
+`ClusterOptionsTests.DownIfAlone_CanBeSet`. However, nothing in the system reads it.
+The Akka.NET HOCON is generated by `ScadaLink.Host.Actors.AkkaHostedService.BuildHocon`,
+which **hard-codes** the resolver setting:
+
+```
+split-brain-resolver {
+    active-strategy = ...
+    stable-after = ...
+    keep-oldest {
+        down-if-alone = on
+    }
+}
+```
+
+`BuildHocon` receives the full `ClusterOptions` instance and consumes every other
+field (`SeedNodes`, `MinNrOfMembers`, `SplitBrainResolverStrategy`, `StableAfter`,
+`HeartbeatInterval`, `FailureDetectionThreshold`) but ignores `DownIfAlone` entirely.
+The result is a configuration property that an operator can set in `appsettings.json`,
+that passes validation, and that has **zero runtime effect** — setting
+`DownIfAlone: false` does not turn the flag off. CI-003's resolution explicitly
+acknowledged this gap ("wiring it to read `DownIfAlone` is a one-line `ScadaLink.Host`
+change ... noted for the Host's review") but the wiring was never done and no tracked
+finding carried it, so the gap has silently persisted to commit `39d737e`. An inert,
+misleadingly-documented configuration knob is a correctness and design-adherence
+defect: it gives operators a false sense of control over a safety-critical resolver
+setting.
+
+**Recommendation**
+
+Either (a) wire `DownIfAlone` into `BuildHocon` — emit `down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}`
+— so the property does what its XML doc claims (a Host-side change, to be tracked in
+the Host module's review since `BuildHocon` lives there), or (b) if the flag is
+intentionally fixed at `on` and must never be operator-configurable, remove the
+`DownIfAlone` property from `ClusterOptions` and document the hard-coded `on` value as
+a non-negotiable invariant. Do not leave a public, settable, validated property that
+controls nothing.
+
+**Resolution**
+
+_Unresolved._
+
+### ClusterInfrastructure-010 — Validator does not enforce `DownIfAlone = true` despite the design doc requiring it
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:21-71` |
+
+**Description**
+
+`Component-ClusterInfrastructure.md` (Split-Brain Resolution) states the keep-oldest
+resolver must be configured with `down-if-alone = on`, and the XML doc on
+`ClusterOptions.DownIfAlone` calls `true` "the design-doc requirement" — the rationale
+being that without it the oldest node can run as an isolated single-node cluster
+during a partition while the younger node forms its own. `ClusterOptionsValidator`
+guards every other safety-critical setting (`MinNrOfMembers == 1`, `keep-oldest`-only
+strategy, positive timings, heartbeat below the failure threshold) but performs no
+check on `DownIfAlone`. A configuration of `DownIfAlone: false` therefore passes
+validation cleanly. This is currently latent because CI-009 shows the property is not
+consumed at all, but the moment CI-009 is fixed by wiring the property into the HOCON
+(option (a)), `DownIfAlone: false` would silently produce the unsafe single-node
+behaviour the design doc explicitly forbids — with no fail-fast guard. The validator
+is the right place to enforce the invariant, consistent with how it already rejects
+quorum split-brain strategies.
+
+**Recommendation**
+
+If CI-009 is resolved by keeping `DownIfAlone` configurable, add a check to
+`ClusterOptionsValidator.Validate` that fails when `DownIfAlone` is `false` (or, if
+some future deployment legitimately needs it off, fails only in combination with the
+`keep-oldest` strategy), with a message explaining the isolated-single-node-cluster
+hazard. If CI-009 is resolved by removing the property, this finding is moot and
+should be closed as resolved alongside it.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.Commons` |
 | Design doc | `docs/requirements/Component-Commons.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 2 |

 ## Summary

@@ -32,14 +32,28 @@ kind of edge-case logic that warrants them. Entity and message contracts otherwi
 clean and additive-evolution-friendly, with the exception of one `ValueTuple` use in a
 wire command.

+**Re-review 2026-05-17 (commit `39d737e`).** All twelve prior findings (Commons-001
+through Commons-012) are confirmed `Resolved` — the fixes are sound, well-targeted, and
+backed by focused regression tests (`StaleTagMonitorRaceTests`, `DynamicJsonElementTests`,
+`ScriptParametersTests`, `ManagementCommandRegistryTests`, `OpcUaEndpointConfigSerializerTests`,
+`ResultTests`, `ValueFormatterTests`, `ConnectionBindingSerializationTests`,
+`FlatteningAndScriptScopeTests`). The new files introduced since `9c60592`
+(`TemplateAlarm` lock/inherit fields, `IExternalSystemRepository` name-keyed lookups,
+`DeploymentStateQueryRequest`/`Response`, `ParameterDefinition`) follow the established
+POCO / record / additive-evolution conventions and carry round-trip compatibility tests.
+Two new Low-severity findings were recorded this pass: a `DynamicJsonElement` array
+indexer that rejects `long` indices (Commons-013) and an `OpcUaEndpointConfigSerializer`
+legacy-fallback path that can mislabel a corrupt new-shape row as `Legacy` (Commons-014).
+No Critical, High, or Medium issues were found.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. |
+| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. Re-review: `DynamicJsonElement.TryGetIndex` rejects non-`int` indices (Commons-013). |
 | 2 | Akka.NET conventions | ✓ | Commons has no actors (correct). Message contracts are records and immutable. One wire message uses `ValueTuple` (Commons-008). Correlation IDs present on request/response messages. |
 | 3 | Concurrency & thread safety | ✓ | `StaleTagMonitor` has a check-then-act race between the timer callback and `OnValueReceived` (Commons-001). |
-| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). |
+| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). Re-review: corrupt typed OPC UA rows can fall through to the legacy path and be mislabelled `Legacy` (Commons-014). |
 | 5 | Security | ✓ | No auth logic here. `SmtpConfiguration.Credentials` / OPC UA passwords are plain-string fields (storage/encryption is a consumer concern) — noted, not a finding. No script-trust violations: Commons defines no forbidden-API surface. |
 | 6 | Performance & resource management | ✓ | `StaleTagMonitor` disposes its `Timer` correctly. `DynamicJsonElement` references a `JsonElement` whose backing document lifetime is not owned (Commons-002). |
 | 7 | Design-document adherence | ✓ | Several behavior-bearing helper/validator/serializer classes push against REQ-COM-6 "no business logic" (Commons-007). Folder layout matches REQ-COM-5b. |
@@ -566,3 +580,75 @@ the parameterless `ToString()`). The XML doc gained a remarks block stating the
 culture-invariant contract and why. Regression tests added in `ValueFormatterTests`
 (`FormatDisplayValue_Double_UsesInvariantCulture_*`, `_DateTime_*`, `_CollectionOfDoubles_*`,
 each pinned under `de-DE`).
+
+### Commons-013 — `DynamicJsonElement.TryGetIndex` rejects non-`int` index values
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Commons/Types/DynamicJsonElement.cs:40-54` |
+
+**Description**
+
+`TryGetIndex` accepts an index only when `indexes[0] is int index`. `DynamicJsonElement`
+is designed for dynamic access from scripts (`obj.items[0]`). In a `dynamic` expression the
+index operand's runtime type follows the script's variable type — a script that computes
+an index in a loop counter or reads it from another `DynamicJsonElement` (whose numbers
+are unwrapped as `long` by `Wrap`, see `:105`) will pass a `long`, not an `int`. The
+pattern match then fails, `TryGetIndex` returns `false`, and the dynamic binder throws a
+`RuntimeBinderException` for what is a perfectly valid in-range index. Because the wrapper
+itself surfaces JSON numbers as `long`, `obj.items[obj.count - 1]` — count being a wrapped
+JSON number — is the exact failing case. The `int`-only guard also silently rejects
+`byte`/`short` indices that would widen to a valid array position.
+
+**Recommendation**
+
+Accept any integral index by converting through `Convert.ToInt64` (guarded for
+`OverflowException`) or by matching `int`, `long`, `short`, `byte` and normalizing to a
+single integer before the bounds check. Add a regression test indexing with a `long`.
+
+**Resolution**
+
+_Unresolved._
+
+### Commons-014 — `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Commons/Serialization/OpcUaEndpointConfigSerializer.cs:107-131` |
+
+**Description**
+
+`Deserialize` tries the typed path first: it parses the document, checks for an
+`endpointUrl` property, then calls `JsonSerializer.Deserialize<OpcUaEndpointConfig>`.
+The whole block is wrapped in `catch (JsonException) { /* fall through to legacy */ }`.
+If a row *is* the current typed shape (it has `endpointUrl`) but is corrupt in a way that
+makes `JsonSerializer.Deserialize` throw a `JsonException` — e.g. an enum-valued field
+holding an unrecognised string, or a numeric field holding a non-numeric token — the
+exception is swallowed and control falls through to `LoadLegacy`. `LoadLegacy` only
+requires the root to be a JSON object, so it will usually succeed against the same input
+and the result is reported as `OpcUaConfigParseStatus.Legacy`. The Commons-005 fix added
+the `Malformed` status precisely so a caller can tell a recoverable legacy row from
+unparseable data; this path re-introduces a softer version of the same confusion — a
+genuinely broken current-shape row is presented to the user as a benign "please re-save"
+legacy row, and the offending field is silently dropped by `FromFlatDict` (which ignores
+keys it cannot parse) rather than surfaced. The XML doc describes the legacy fallback as
+being for "pre-refactor rows" only and does not mention this branch.
+
+**Recommendation**
+
+Only fall through to `LoadLegacy` when the typed shape is genuinely *not present* — i.e.
+the `endpointUrl` property is absent. When `endpointUrl` *is* present but typed
+deserialization throws, classify the outcome as `Malformed` (or a distinct status) so the
+caller can surface a real error instead of an empty/partial config. Tighten the XML doc
+to describe this branch, and add a regression test for a typed row with an invalid enum
+field.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.Communication` |
 | Design doc | `docs/requirements/Component-Communication.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 4 |

 ## Summary

@@ -25,20 +25,37 @@ CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is othe
 good. Test coverage is broad for happy paths but has gaps around failover, cache
 mutation races, and the snapshot-timeout cleanup path.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All prior findings (Communication-001..011) are confirmed `Resolved` in this commit
+and the fixes hold up against the source. The re-review walked all 10 checklist
+categories again and uncovered a previously-missed defect at the centre of the gRPC
+node-failover path: **`SiteStreamGrpcClientFactory.GetOrCreate` caches one client per
+site identifier and silently ignores the `grpcEndpoint` argument on a cache hit**. The
+`DebugStreamBridgeActor` reconnect logic flips `_useNodeA` and passes the *other*
+node's endpoint, but the factory hands back the original NodeA-bound client every
+time — so the documented "try the other site node endpoint" failover never actually
+moves to NodeB (Communication-012). The same caching defect means a site address
+change is never picked up because `RemoveSiteAsync` has no production caller
+(Communication-013). Two Low findings round out the re-review: an untrusted
+gRPC-supplied `correlation_id` flows straight into an Akka actor name
+(Communication-014), and the factory's endpoint-reuse defect is masked by the test
+mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling `CleanupGrpc`, subscription-map races. |
-| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; `Sender` captured in async-launched closure path. |
-| 3 | Concurrency & thread safety | ✓ | `SiteStreamGrpcClient._subscriptions` overwrite/remove race; `_siteClients` field reassignment unused but non-readonly. |
-| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; `LoadSiteAddressesFromDb` swallows DB failures silently. |
-| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). |
-| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; `SiteStreamGrpcClientFactory.Dispose` blocks on async. |
-| 7 | Design-document adherence | ✓ | `GrpcMaxStreamLifetime` / keepalive options defined but never applied; hard-coded values used instead. |
-| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. |
-| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. |
-| 10 | Documentation & comments | ✓ | XML comment on `DebugStreamBridgeActor` says "Persistent actor" — it is not an Akka.Persistence actor. |
+| 1 | Correctness & logic bugs | ✓ | Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved. |
+| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies now present and verified. No new issues. |
+| 3 | Concurrency & thread safety | ✓ | Subscription-map register/remove now ownership-checked. `_siteClients` readonly. No new issues. |
+| 4 | Error handling & resilience | ✓ | `Status.Failure` handler added; reconnect unsubscribes prior stream. No new issues. |
+| 5 | Security | ✓ | Re-review: public gRPC `correlation_id` flows unvalidated into an Akka actor name (Communication-014). |
+| 6 | Performance & resource management | ✓ | Synchronous `Dispose` paths fixed; CTS leaks resolved. No new issues. |
+| 7 | Design-document adherence | ✓ | Re-review: site gRPC address-change disposal not wired — `RemoveSiteAsync` is dead code (Communication-013). gRPC options now applied. |
+| 8 | Code organization & conventions | ✓ | Options pattern correct; public records still declared in actor files (acceptable). No structural issues. |
+| 9 | Testing coverage | ✓ | Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015). |
+| 10 | Documentation & comments | ✓ | `DebugStreamBridgeActor` summary corrected. No new issues. |

 ## Findings

@@ -519,3 +536,151 @@ passes after):
  (added with this finding's resolution).
 The full module suite (`dotnet test tests/ScadaLink.Communication.Tests`) is green at
 111 passing tests.
+
+### Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166` |
+
+**Description**
+
+`SiteStreamGrpcClientFactory.GetOrCreate` is `_clients.GetOrAdd(siteIdentifier, _ =>
+CreateClient(grpcEndpoint))` — it keys the cache by **site identifier only** and the
+`grpcEndpoint` argument is used *exclusively* for the first-ever creation. Every
+subsequent call for that site returns the originally-cached `SiteStreamGrpcClient`,
+which is permanently bound to the `GrpcChannel` of whatever endpoint was passed first.
+
+`DebugStreamBridgeActor` relies on the opposite behaviour. On a gRPC stream error,
+`HandleGrpcError` flips `_useNodeA` and `OpenGrpcStream` recomputes
+`endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress`, then calls
+`_grpcFactory.GetOrCreate(_siteIdentifier, endpoint)` expecting a client connected to
+the *other* node. Because the factory ignores the new endpoint, the bridge actor
+reconnects to the **same failed NodeA endpoint** on every retry. The design doc's
+core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if
+NodeA failed, or vice versa") is therefore inoperative — when a site node goes down,
+the debug stream cannot move to the surviving node and simply exhausts `MaxRetries`
+against the dead endpoint and terminates. The `_useNodeA` flip, the `previousEndpoint`
+computation in `HandleGrpcError`, and the `CleanupGrpc` endpoint selection are all
+dead logic. (Communication-002's `Unsubscribe`-before-reconnect fix still functions,
+but it unsubscribes and re-subscribes on the *same* client/node rather than the
+intended other node.)
+
+**Recommendation**
+
+Make the per-site client aware of both endpoints, or key the cache by
+`(siteIdentifier, endpoint)`, or have `GetOrCreate` detect an endpoint change and
+dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if
+NodeA connection fails"), the cleanest fix is to give `SiteStreamGrpcClient` (or a
+per-site holder) both NodeA/NodeB addresses and let it switch channels internally,
+removing the endpoint argument from `GetOrCreate` entirely. Add a test that drives a
+real `SiteStreamGrpcClientFactory` through a node flip and asserts the second client
+targets the other endpoint.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58` |
+
+**Description**
+
+The design doc states that `SiteStreamGrpcClientFactory` "Disposes clients on site
+removal or address change." `RemoveSiteAsync` implements the disposal mechanism, but
+a repo-wide search finds **no production caller** — only tests invoke it. Combined
+with the cache-by-site-identifier behaviour (Communication-012), the consequence is
+that once a site's `SiteStreamGrpcClient` is created, a later edit to that site's
+`GrpcNodeAAddress` / `GrpcNodeBAddress` (via the Central UI or CLI) is never reflected
+in the cached client — it keeps using the stale channel for the life of the process.
+`CentralCommunicationActor` already refreshes the *Akka* address cache every 60s and
+recreates ClusterClients on change, but there is no equivalent invalidation path
+wired into the gRPC client factory. A site whose gRPC endpoints are corrected after
+an initial misconfiguration will never have working debug streaming until the central
+node is restarted.
+
+**Recommendation**
+
+Wire a site-removal / address-change signal into `SiteStreamGrpcClientFactory` —
+e.g. have `CentralCommunicationActor` (which already detects address changes in
+`HandleSiteAddressCacheLoaded`) call `RemoveSiteAsync` for sites whose gRPC addresses
+changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If
+the on-the-fly address-change requirement is intentionally dropped, remove
+`RemoveSiteAsync` and correct the design doc.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124` |
+
+**Description**
+
+`SubscribeInstance` is a public gRPC endpoint hosted on each site node. It creates the
+relay actor with `$"stream-relay-{request.CorrelationId}-{actorSeq}"` as the actor
+name, where `request.CorrelationId` comes straight off the wire. Akka actor names have
+a restricted character set; a `correlation_id` containing `/`, whitespace, or other
+disallowed characters makes `ActorSystem.ActorOf` throw `InvalidActorNameException`.
+That exception is not caught inside `SubscribeInstance`, so it escapes as an unhandled
+RPC fault (and after the `_streamSubscriber.Subscribe` / `_activeStreams` entry has
+already been set up for the duration, though the `finally` does not run because the
+throw is before the `try`). In practice central always supplies a GUID, so impact is
+low, but the server is trusting client-supplied input to be actor-name-safe.
+
+**Recommendation**
+
+Validate `request.CorrelationId` on entry (non-empty, matches an expected GUID/safe
+pattern) and reject with `StatusCode.InvalidArgument` otherwise; or derive the actor
+name solely from the internal `_actorCounter` and keep the correlation ID only as
+actor state / dictionary key.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-015 — No test exercises the real gRPC client factory across a node flip
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs` |
+
+**Description**
+
+`DebugStreamBridgeActorTests` exercises the reconnect/failover paths through
+`MockSiteStreamGrpcClientFactory`, which returns one fixed mock client regardless of
+the `grpcEndpoint` argument. This is exactly the behaviour the *real*
+`SiteStreamGrpcClientFactory` exhibits incorrectly (Communication-012), so the mock
+masks the defect: `On_GrpcError_Reconnects_To_Other_Node` passes even though the real
+factory never reaches the other node. `SiteStreamGrpcClientFactoryTests` only asserts
+`GetOrCreate` returns the same client for the same site — it never checks what happens
+when the same site is requested with a *different* endpoint.
+
+**Recommendation**
+
+Add a `SiteStreamGrpcClientFactoryTests` case that calls `GetOrCreate(site, endpointA)`
+then `GetOrCreate(site, endpointB)` and asserts the second call targets `endpointB`
+(it should fail today and pass after Communication-012 is fixed). Have the bridge-actor
+test's mock factory track the endpoint per call so node-flip coverage is meaningful.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.ConfigurationDatabase` |
 | Design doc | `docs/requirements/Component-ConfigurationDatabase.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 3 |

 ## Summary

@@ -37,6 +37,28 @@ repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
 `ExternalSystemRepository`, `InboundApiRepository`, `NotificationRepository`,
 `SiteRepository`, `InstanceLocator`) have little or no direct coverage.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+Re-reviewed the module at commit `39d737e`. All eleven findings from the initial
+review (`9c60592`) remain `Resolved` — the secret-column encryption
+(`EncryptedStringConverter` + `ApplySecretColumnEncryption`), the fail-fast no-arg
+DI overload, the `AsSplitQuery` conversions, the audit cycle-safe serializer, and the
+added repository test coverage all verified present and consistent with their
+resolutions. Three new findings were recorded. The most material is that the
+encryption work done for CD-004 left one bearer credential out of scope:
+`ApiKey.KeyValue` — the inbound-API authentication secret — is still persisted in
+plaintext (`ConfigurationDatabase-012`); it cannot use the same non-deterministic
+Data Protection converter because authentication looks the key up *by value*, so it
+needs a hash-based scheme instead. The second is a resilience gap in the encryption
+plumbing itself: `ApplySecretColumnEncryption` silently substitutes a throwaway
+`EphemeralDataProtectionProvider` whenever no provider is supplied, so any context
+constructed via the single-argument constructor on a write path would encrypt
+secrets with a key discarded at process exit, yielding permanently undecryptable
+ciphertext with no error (`ConfigurationDatabase-013`). The third is a minor
+inconsistency — a redundant cast on one of the three `HasConversion` calls
+(`ConfigurationDatabase-014`). The module is otherwise healthy and the prior fixes
+hold up well.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -44,11 +66,11 @@ repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
 | 1 | Correctness & logic bugs | ✓ | `GetTemplateWithChildrenAsync` discards loaded children (CD-001); `GetApprovedKeysForMethodAsync` CSV parsing is brittle (CD-008). |
 | 2 | Akka.NET conventions | ✓ | No actors in this module; data-access layer only. No issues found. |
 | 3 | Concurrency & thread safety | ✓ | DbContext correctly scoped; optimistic concurrency on `DeploymentRecord` correct. Repositories hold no shared mutable state. No issues found. |
-| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003); audit JSON serialization failure handling (CD-007). |
-| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002); SMTP/DB-connection/auth secrets stored unencrypted (CD-004). |
-| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009). No N+1 in audited paths. |
-| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005); seed data uses `HasData` consistent with design. |
-| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006); inconsistent null-guard on injected context (CD-011). |
+| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003, resolved); audit JSON serialization failure handling (CD-007, resolved). Re-review: ephemeral Data Protection fallback can silently produce undecryptable ciphertext (CD-013). |
+| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002, resolved); SMTP/DB-connection/auth secrets unencrypted (CD-004, resolved). Re-review: `ApiKey.KeyValue` bearer credential still stored in plaintext (CD-012). |
+| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009, resolved). No N+1 in audited paths. Re-review: no new issues. |
+| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005, resolved); seed data uses `HasData` consistent with design. Re-review: no new issues. |
+| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006, resolved); inconsistent null-guard on injected context (CD-011, resolved). Re-review: redundant/inconsistent cast on one `HasConversion` call (CD-014). |
 | 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). |
 | 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. |

@@ -570,3 +592,128 @@ every data-access type behaves uniformly and a hand-constructed instance fails w
 informative exception at construction rather than a later `NullReferenceException`.
 Regression: `Constructor_NullContext_Throws` tests were added for all four affected types
 (`InboundApiRepositoryTests.cs`, `RepositoryCoverageTests.cs`).
+
+### ConfigurationDatabase-012 — Inbound-API `ApiKey.KeyValue` bearer credential stored in plaintext
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/InboundApiConfiguration.cs:17-19` |
+
+**Description**
+
+`ApiKey.KeyValue` is the bearer credential presented in the `X-API-Key` header to
+authenticate Inbound API requests (HighLevelReqs §7.2–7.3). It is mapped as an
+ordinary `nvarchar(500)` column with a unique index and persisted verbatim. Anyone
+with read access to the configuration database — or to any `AuditLogEntry.AfterStateJson`
+into which an `ApiKey` entity is serialized — obtains live API credentials in cleartext.
+
+`ConfigurationDatabase-004` introduced encryption-at-rest for the other secret-bearing
+columns (SMTP credentials, external-system auth config, database connection strings)
+but explicitly scoped `ApiKey.KeyValue` out. The omission is genuine: the
+`EncryptedStringConverter` built for CD-004 is backed by ASP.NET Data Protection, which
+is **non-deterministic** — the same plaintext encrypts to different ciphertext each
+time — so it cannot be applied here, because `GetApprovedKeysForMethodAsync` and the
+authentication path resolve a key by its value (`GetApiKeyByValueAsync` does
+`FirstOrDefaultAsync(k => k.KeyValue == keyValue)`). A non-deterministic converter would
+break that equality lookup. The result is that the one credential most exposed to
+external callers is the one credential left unprotected.
+
+**Recommendation**
+
+Store a salted cryptographic hash of the key value instead of the plaintext (or
+ciphertext): hash on create, and authenticate by hashing the presented key and
+comparing. This keeps the equality lookup working (the hash is deterministic) while
+ensuring the database never holds a usable credential. The plaintext key would then
+only ever be shown once, at creation time, to the Admin who created it. This requires
+a coordinated change with the Inbound API / Security components and the `ApiKey`
+entity in Commons; record the chosen scheme in
+`docs/requirements/Component-ConfigurationDatabase.md` and the Inbound API design doc.
+At minimum, ensure `ApiKey` entities are never passed to `IAuditService` without the
+key value redacted.
+
+**Resolution**
+
+_Unresolved._
+
+### ConfigurationDatabase-013 — Secret-column encryption silently falls back to an ephemeral (throwaway) key
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs:107-124` |
+
+**Description**
+
+`ApplySecretColumnEncryption` resolves the Data Protection provider as
+`_dataProtectionProvider ?? new EphemeralDataProtectionProvider()`. The `??` fallback
+is reached whenever the context is constructed via the single-argument
+`ScadaLinkDbContext(DbContextOptions)` constructor — i.e. whenever no provider was
+injected. An `EphemeralDataProtectionProvider` generates a key ring that lives only in
+process memory and is discarded at process exit.
+
+For design-time `dotnet ef` tooling this is harmless (the XML remark correctly notes
+it only emits schema). The risk is on a *runtime write path*. The runtime currently
+gets the provider-bearing context only because `AddConfigurationDatabase` adds an
+`AddScoped` factory registration that overrides EF's activator-based registration.
+That override is the single thing standing between correct behaviour and silent data
+corruption: any future change that resolves a `ScadaLinkDbContext` through a path the
+override does not cover — an `AddPooledDbContextFactory`/`IDbContextFactory<ScadaLinkDbContext>`
+registration, a second `AddDbContext` call, a hand-constructed context in server code —
+would construct the context with the single-arg constructor, encrypt secret columns
+with a throwaway key, and persist ciphertext that becomes **permanently undecryptable
+the moment the process restarts**. There is no exception, no warning; the failure only
+surfaces later as `CryptographicException` on read (mis-attributed by
+`EncryptedStringConverter` to "the stored value was not written by this system").
+
+**Recommendation**
+
+Do not silently substitute an ephemeral provider for write-capable contexts. Either:
+(a) require the provider unconditionally and have design-time tooling pass an explicit
+ephemeral provider so the intent is visible at the call site; or (b) keep the
+single-arg constructor but mark contexts built without a real provider as
+schema-only — e.g. record a flag and have the encrypting converter throw a clear
+`InvalidOperationException` ("secret columns cannot be written without a configured
+Data Protection key ring") on the first `Protect`, instead of producing throwaway
+ciphertext. Also harden the DI wiring so a `ScadaLinkDbContext` cannot be resolved
+through the EF-activator registration at all (e.g. register only the factory, or use
+`AddDbContextFactory` with the explicit constructor).
+
+**Resolution**
+
+_Unresolved._
+
+### ConfigurationDatabase-014 — Redundant, inconsistent cast on one `HasConversion` call
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs:121-123` |
+
+**Description**
+
+`ApplySecretColumnEncryption` calls `HasConversion(converter)` three times. The first
+two (`SmtpConfiguration.Credentials`, `ExternalSystemDefinition.AuthConfiguration`)
+pass `converter` directly; the third (`DatabaseConnectionDefinition.ConnectionString`)
+casts it to `(Microsoft.EntityFrameworkCore.Storage.ValueConversion.ValueConverter)`.
+`EncryptedStringConverter` already derives from `ValueConverter<string?, string?>`
+(itself a `ValueConverter`), and the first two call sites compile fine without the
+cast, so the cast is redundant. The inconsistency makes the code look as though the
+third call needs special handling when it does not, and the fully-qualified type name
+inline adds noise.
+
+**Recommendation**
+
+Remove the cast so all three calls read identically as `HasConversion(converter)`.
+If a `ValueConverter`-typed reference is genuinely wanted, give it a local variable of
+that type once and use it for all three.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.DataConnectionLayer` |
 | Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 4 |

 ## Summary

@@ -30,20 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
 heuristic. Test coverage is adequate for the happy paths and failover but absent for
 tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
+verified in place against the current source (`PipeTo(Self)` subscribe pattern,
+`Resume` supervision, `ConcurrentDictionary` callback maps, atomic disconnect guards,
+bounded write timeout, etc.). The re-review walked all 10 checklist categories again
+and found **4 new findings**: one **High** — the DCL-012 security warning is never
+seen in production because `RealOpcUaClientFactory.Create()` constructs
+`RealOpcUaClient` with no logger, so the warning sinks into `NullLogger`; one
+**Medium** — initial-connect failures in the `Connecting` state never count toward
+failover, so a connection whose primary endpoint is unreachable at startup retries the
+primary forever and never tries the configured backup; one **Medium** —
+`HandleSubscribeCompleted` always replies `SubscribeTagsResponse(success: true)` even
+when a connection-level subscribe failure is driving the actor into `Reconnecting`,
+telling the Instance Actor the subscribe succeeded when it did not; and one **Low** —
+`WriteBatchAsync` does not catch the `InvalidOperationException` from `EnsureConnected`,
+so a mid-batch disconnect aborts the whole write batch (the same class of defect
+DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
+`DataConnectionLayer-014`.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
-| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
-| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
-| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
-| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
-| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
-| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
+| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — `SubscribeTagsResponse` reports success on a connection-level subscribe failure. |
+| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (`PipeTo(Self)` subscribe, `Resume` supervision). Re-review: no new issues. |
+| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (`ConcurrentDictionary`, atomic disconnect guards). Re-review: no new issues. |
+| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — `WriteBatchAsync` aborts on mid-batch disconnect. |
+| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (`RealOpcUaClient` built without a logger). |
+| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
+| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
 | 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
-| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
-| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
+| 9 | Testing coverage | x | DCL001–013 regression tests present. Re-review: gaps remain for finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
+| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |

 ## Findings

@@ -661,3 +681,179 @@ fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost`
 simultaneously, and asserts `Disconnected` fires exactly once per round; against a
 non-atomic check-then-set it double-fires (verified by temporarily reverting the
 guard), and it passes against the atomic fix.
+
+### DataConnectionLayer-014 — DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83` |
+
+**Description**
+
+Finding DataConnectionLayer-012 was resolved in part by adding a prominent
+`ILogger` warning in `RealOpcUaClient.ConnectAsync` whenever the auto-accept
+certificate validator is installed (`RealOpcUaClient.cs:79-83`). The
+`ILogger<RealOpcUaClient>` constructor parameter was made optional, defaulting to
+`NullLogger<RealOpcUaClient>.Instance` (`RealOpcUaClient.cs:35-39`).
+
+However, the only production code path that constructs a `RealOpcUaClient` is
+`RealOpcUaClientFactory.Create()` (`RealOpcUaClient.cs:325`), which calls
+`new RealOpcUaClient(_globalOptions)` and passes **no logger**. The factory itself
+holds only an `OpcUaGlobalOptions` and has no `ILoggerFactory`/`ILogger` available.
+As a result the `_logger` field is always `NullLogger` for every real OPC UA
+connection, and the man-in-the-middle warning the DCL-012 fix added is silently
+discarded. An operator who deploys a connection with `AutoAcceptUntrustedCerts`
+enabled — accepting any server certificate on an industrial control link — gets no
+visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
+therefore not actually effective in production; only the unit test
+(`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse`, which only
+checks the default value) passes.
+
+**Recommendation**
+
+Thread a real logger through to `RealOpcUaClient`. `DataConnectionFactory` already
+holds an `ILoggerFactory` and constructs `RealOpcUaClientFactory(globalOptions)` —
+give `RealOpcUaClientFactory` an `ILoggerFactory` (or an `ILogger<RealOpcUaClient>`)
+constructor parameter and pass `_loggerFactory.CreateLogger<RealOpcUaClient>()` into
+each `new RealOpcUaClient(...)`. Add a test that asserts the warning is emitted on a
+real connect with auto-accept enabled (e.g. via a captured `ILogger`), not just that
+the default is `false`.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493` |
+
+**Description**
+
+Failover between the primary and backup endpoints is implemented in two places, both
+reachable only after a connection has already been `Connected` at least once:
+`HandleReconnectResult` (Reconnecting state) counts `_consecutiveFailures` and switches
+endpoint, and `BecomeReconnecting` counts `_consecutiveUnstableDisconnects`.
+
+`HandleConnectResult` — the handler for the *initial* connection attempt in the
+`Connecting` state (`DataConnectionActor.cs:404-417`) — does neither. On failure it
+only logs and re-arms the reconnect timer with `AttemptConnect`; it never increments
+`_consecutiveFailures`, never consults `_backupConfig`, and never switches endpoint.
+
+Consequence: if the primary endpoint is unreachable when the connection actor first
+starts — which is the common case after a fresh artifact deployment, a site restart,
+or a primary that is simply down at that moment — the actor retries the *primary*
+endpoint indefinitely at `ReconnectInterval` and **never** attempts the configured
+backup. The design doc's endpoint-redundancy promise ("automatic failover when the
+active endpoint becomes unreachable") is silently not honoured for the
+never-connected-yet case, and an operator sees a connection stuck `Connecting` forever
+despite a healthy backup being configured.
+
+**Recommendation**
+
+Make `HandleConnectResult` participate in the failover counter the same way
+`HandleReconnectResult` does: increment `_consecutiveFailures` on failure and, when
+`_backupConfig != null && _consecutiveFailures >= _failoverRetryCount`, perform the
+endpoint switch (dispose adapter, create the other adapter, bump `_adapterGeneration`,
+log the failover event) before re-arming the timer. Alternatively, fold the initial
+connect into the same reconnect path so there is a single failover decision point. Add
+a regression test for "primary down at startup, backup configured → fails over to
+backup".
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-016 — `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240` |
+
+**Description**
+
+`HandleSubscribeCompleted` computes `connectionLevelFailure` (line 606) and returns it
+so the `Connected`-state handler can drive the actor into `Reconnecting`
+(`DataConnectionActor.cs:232-240`). But before returning, it unconditionally replies
+to the caller with `new SubscribeTagsResponse(..., true, null, ...)` (lines 666-667) —
+`Success: true`, `Error: null` — regardless of whether any tag failed at connection
+level.
+
+So when a subscribe arrives while the adapter is silently down, the Instance Actor is
+told the subscribe **succeeded**, while the connection actor simultaneously transitions
+to `Reconnecting`. The tags were never actually subscribed at the adapter (the catch
+block recorded `Success: false`); they are recovered later by `ReSubscribeAll` only if
+and when reconnection succeeds. The caller has no way to distinguish "subscribed and
+healthy" from "accepted, but the connection is currently down" — a misleading
+success signal on a request that did not do what the response claims.
+
+(Genuine tag-resolution failures are arguably also reported as overall `true`, but
+that is defensible: those tags are tracked in `_unresolvedTags` and the design models
+unresolved tags as a runtime quality concern, with a `Bad`-quality `TagValueUpdate`
+already pushed. The connection-level case is the clear defect because the actor itself
+treats it as a failure worth a state transition.)
+
+**Recommendation**
+
+When `connectionLevelFailure` is true, reply with
+`SubscribeTagsResponse(..., success: false, error: "connection unavailable — will
+re-subscribe on reconnect", ...)` (or an equivalent), so the caller's response matches
+the actor's own assessment. Optionally carry per-tag outcomes in the response so the
+Instance Actor can reflect partial success. Add a test asserting the response is not
+`Success: true` when a connection-level subscribe failure drives `Reconnecting`.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-017 — `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227` |
+
+**Description**
+
+`WriteBatchAsync` loops calling `WriteAsync` per tag (`OpcUaDataConnection.cs:229-237`).
+`WriteAsync` returns a `WriteResult` for OPC-UA-level write rejections (good — a bad
+status does not abort the batch), but it first calls `EnsureConnected()`
+(`OpcUaDataConnection.cs:220`), which throws `InvalidOperationException` when the
+client is disconnected. `WriteBatchAsync` does not catch that exception, so if the
+connection drops partway through a batch the whole `WriteBatchAsync` throws and the
+caller gets no result map — losing the per-tag outcomes for the tags that already
+wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
+`ReadBatchAsync` (which now records a failed `ReadResult` per failing tag and only
+propagates `OperationCanceledException`). `WriteBatchAsync` feeds
+`WriteBatchAndWaitAsync` (line 246), so a disconnect during a flag-and-wait write
+sequence surfaces as an unhandled exception rather than a clean `false`/per-tag result.
+
+Severity is Low because device writes are real-time control operations with no
+store-and-forward, the batch write paths are not on the primary `HandleWrite` hot path
+(`HandleWrite` calls single-tag `WriteAsync`), and a disconnect mid-batch is itself an
+error condition — but the inconsistent error shape (exception vs. per-tag result) is a
+maintainability and correctness wart.
+
+**Recommendation**
+
+Mirror the DCL-007 fix: wrap the per-tag `WriteAsync` call in `WriteBatchAsync` in a
+try/catch that records a failed `WriteResult(false, ex.Message)` for the failing tag
+and continues, while still propagating `OperationCanceledException` to abort a
+cancelled batch as a whole. This gives callers (including `WriteBatchAndWaitAsync`) a
+complete, consistent result map.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.DeploymentManager` |
 | Design doc | `docs/requirements/Component-DeploymentManager.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 3 |

 ## Summary

@@ -30,20 +30,43 @@ detail. Configuration is not bound to `appsettings.json`, leaving one option
 entirely dead. Test coverage stops at the communication boundary and never
 exercises a successful deployment or the lifecycle success paths.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+Re-reviewed at commit `39d737e` after the batch of fixes for
+DeploymentManager-001..014. All fourteen prior findings remain `Resolved` and
+verified against source — the broadened catch, non-cancellable cleanup writes,
+ref-counted `OperationLockManager`, query-before-redeploy reconciliation,
+structured diff, options binding, and the expanded TestKit-actor test suite are
+all present and correct. The module is in markedly better shape than the
+first review: error paths are now defensively handled and test coverage is
+broad (successful deploy/lifecycle, lock serialization, reconciliation
+matrix, artifact per-site matrix).
+
+This re-review found **3 new findings**, all clustered on the
+DeploymentManager-006 reconciliation path added since the last review. The
+reconciliation shortcut (`TryReconcileWithSiteAsync`) marks a stale prior
+record `Success` when the site already has the target revision, but it does
+**not** perform the side effects the normal success path does — it never
+updates the instance `State`, never refreshes the `DeployedConfigSnapshot`,
+and never corrects the prior record's own `RevisionHash` (DeploymentManager-015,
+DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
+it still describes the query-before-redeploy behaviour that actually moved into
+`TryReconcileWithSiteAsync` (DeploymentManager-017).
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ✓ | Stuck `InProgress` record on unexpected exception; cancelled-token failure write. |
+| 1 | Correctness & logic bugs | ✓ | Re-review 2026-05-17: reconciliation skips instance-state/snapshot updates (DeploymentManager-015) and keeps a stale `RevisionHash` (DeploymentManager-016). Prior: stuck `InProgress` / cancelled-token write (resolved). |
 | 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
-| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` is sound but leaks semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. |
-| 4 | Error handling & resilience | ✓ | Several gaps — see DeploymentManager-001/002/003/004. |
-| 5 | Security | ✓ | SMTP credentials are serialized and broadcast to sites — see DeploymentManager-013. No injection vectors; no authz here (enforced upstream). |
-| 6 | Performance & resource management | ✓ | Semaphore leak (DeploymentManager-005); artifact rebuild does N+1 method queries per external system. |
-| 7 | Design-document adherence | ✓ | Missing query-before-redeploy (DeploymentManager-006); Diff View not implemented (DeploymentManager-007). |
-| 8 | Code organization & conventions | ✓ | Options class not bound to configuration — DeploymentManager-008. POCO/repo placement correct. |
-| 9 | Testing coverage | ✓ | No successful-deploy test, no lifecycle success test — DeploymentManager-011; dead `CreateCommand` helper — DeploymentManager-014. |
-| 10 | Documentation & comments | ✓ | Misleading timeout comment — DeploymentManager-009; stale option XML doc — DeploymentManager-012. |
+| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counts and reclaims semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. No issues at re-review. |
+| 4 | Error handling & resilience | ✓ | Prior gaps DeploymentManager-001/002/003/004 resolved and verified. No new issues. |
+| 5 | Security | ✓ | SMTP credential handling documented as an accepted design decision (DeploymentManager-013). No injection vectors; no authz here (enforced upstream). No new issues. |
+| 6 | Performance & resource management | ✓ | Semaphore leak resolved (DeploymentManager-005). No new issues. |
+| 7 | Design-document adherence | ✓ | Query-before-redeploy and Diff View implemented (DeploymentManager-006/007). Re-review: reconciliation path breaks the deployed-snapshot/instance-state invariants — see DeploymentManager-015. |
+| 8 | Code organization & conventions | ✓ | Options binding resolved (DeploymentManager-008). POCO/repo placement correct. No new issues. |
+| 9 | Testing coverage | ✓ | Broad coverage added (success, lifecycle, lock serialization, reconciliation, artifact matrix). Re-review: reconciled-success path's missing side effects (DeploymentManager-015) are untested. |
+| 10 | Documentation & comments | ✓ | Prior comment findings resolved. Re-review: `GetDeploymentStatusAsync` XML doc is now stale — DeploymentManager-017. |

 ## Findings

@@ -710,3 +733,126 @@ the communication boundary. New tests:
 `DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId` (also
 covers DeploymentManager-010), `DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix`
 (per-site success/failure matrix), `RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits`.
+
+### DeploymentManager-015 — Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:631-655` |
+
+**Description**
+
+`TryReconcileWithSiteAsync` (the DeploymentManager-006 query-before-redeploy
+path) handles the case where a prior `InProgress`/timeout-`Failed` record exists
+and the site reports it already has the target revision hash. In that case it
+marks the prior `DeploymentRecord` `Success`, audit-logs `DeployReconciled`, and
+returns it — the caller then returns `Result.Success` and **never enters the
+normal deploy body**.
+
+The normal success path (`DeployInstanceAsync.cs:215-223`) does three things on
+a successful site response: writes the deployment record terminal status, sets
+`instance.State = InstanceState.Enabled` + `UpdateInstanceAsync`, and calls
+`StoreDeployedSnapshotAsync`. The reconciliation shortcut performs only the
+first. Consequently, after a reconciled deployment:
+
+- The instance `State` is left at whatever it was (e.g. `NotDeployed` for a
+  first-time deploy that timed out, or `Disabled`) even though the site is
+  actually running the configuration — the central state machine and the site
+  diverge, and a subsequent `DisableInstanceAsync`/`EnableInstanceAsync` will be
+  rejected or allowed incorrectly by `StateTransitionValidator`.
+- No `DeployedConfigSnapshot` is created or refreshed. A first-time deploy that
+  is resolved purely by reconciliation leaves `GetDeploymentComparisonAsync`
+  permanently returning `"No deployed snapshot found for this instance."`, and a
+  redeploy reconciliation leaves the stored snapshot showing the *old* config
+  even though the deployment record claims `Success` for the new revision.
+
+The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the
+deployed snapshot and instance state to reflect the last successful deployment;
+the reconciliation path silently breaks both invariants.
+
+**Recommendation**
+
+In the reconciled-success branch of `TryReconcileWithSiteAsync`, perform the
+same post-success side effects as the normal path: set `instance.State =
+InstanceState.Enabled` (+ `UpdateInstanceAsync`) and call
+`StoreDeployedSnapshotAsync` with the target deployment ID / revision hash /
+config JSON. Factor the shared post-success logic into one helper so the normal
+and reconciliation paths cannot drift. Add a regression test asserting that a
+reconciled deployment leaves the instance `Enabled` and a snapshot stored.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-016 — Reconciled prior record keeps its stale `RevisionHash`
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:639-651` |
+
+**Description**
+
+When `TryReconcileWithSiteAsync` reconciles a prior record, it mutates
+`prior.Status`, `prior.ErrorMessage`, and `prior.CompletedAt`, but **not**
+`prior.RevisionHash`. The reconciliation condition only compares the *site's*
+`AppliedRevisionHash` against the *freshly-flattened* `targetRevisionHash` — it
+does not require `prior.RevisionHash` to equal either of them.
+
+The prior record can legitimately carry a different revision hash than the
+current target: e.g. a deploy timed out at revision `R1`, the template was then
+edited so the current flatten yields `R2`, and meanwhile the site actually
+applied `R2` through some other path (or `R1` and `R2` are equal-by-content but
+the prior record predates a hash recompute). After reconciliation the record's
+`Status` is `Success` but its `RevisionHash` still says `R1`, so staleness
+checks and any UI that reads `DeploymentRecord.RevisionHash` will report the
+instance as deployed at the wrong revision. The audit `DeployReconciled` entry
+records `RevisionHash = targetRevisionHash`, contradicting the persisted record.
+
+**Recommendation**
+
+In the reconciled-success branch, also set `prior.RevisionHash =
+targetRevisionHash` so the persisted record, the audit entry, and the site's
+actual applied revision all agree. Alternatively, only reconcile when
+`prior.RevisionHash == targetRevisionHash` and otherwise fall through to a
+normal deploy.
+
+**Resolution**
+
+_Unresolved._
+
+### DeploymentManager-017 — `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:562-570` |
+
+**Description**
+
+The XML summary on `GetDeploymentStatusAsync` reads: *"WP-2: After
+failover/timeout, query site for current deployment state before
+re-deploying."* The method body does no such thing — it is a one-line
+pass-through to `_repository.GetDeploymentByDeploymentIdAsync`, a pure local DB
+read. The query-the-site-before-redeploy behaviour the comment describes was
+implemented separately in `TryReconcileWithSiteAsync` (DeploymentManager-006).
+The stale comment is a leftover of the original design intent and misleads a
+reader into thinking this method contacts the site.
+
+**Recommendation**
+
+Reword the summary to describe what the method actually does — "returns the
+current persisted `DeploymentRecord` for the given deployment ID from the
+configuration database" — and, if useful, cross-reference
+`TryReconcileWithSiteAsync` as the place the site-query reconciliation lives.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.ExternalSystemGateway` |
 | Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 3 |

 ## Summary

@@ -30,20 +30,41 @@ Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGa
 are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
 requirements (timeout, retry settings) that are declared but not implemented.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All fourteen prior findings remain `Resolved`; the resolutions for findings 001–014
+were spot-checked against the current source and hold. The re-review walked the full
+10-category checklist again and surfaced **three new findings**. The most serious
+(`ExternalSystemGateway-015`, High) is a regression *introduced by* the
+`ExternalSystemGateway-004` resolution: `CachedCall`/`CachedWrite` now pass a
+per-system/per-connection `MaxRetries` of `0` through verbatim, but
+`StoreAndForwardService.RetryMessageAsync` interprets a stored `MaxRetries` of `0` as
+**retry forever**, not "never retry" — so the very `0` the ESG-004 fix claims to
+"honour as never retry" actually produces an unbounded retry loop, and two ESG tests
+assert the broken behaviour. `ExternalSystemGateway-016` (Medium) is that the
+`ExternalSystemGateway-013` resolution used `ConfigureHttpClientDefaults`, which is a
+**process-global** registration — it forces a `SocketsHttpHandler` (capped at the ESG
+option) onto every `HttpClient` in the host, including the Notification Service's
+OAuth2 token client, not just the gateway's per-system clients. `ExternalSystemGateway-017`
+(Low) is a trailing-`?` URL nit when a GET method's parameters are all null. Theme:
+both substantive findings are second-order defects in earlier fixes — the earlier
+resolutions did not verify the downstream contract of the S&F engine they integrate
+with.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ☑ | URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009. |
-| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module. |
-| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; `ExternalCallResult.Response` lazy-parse is not thread-safe but instances are single-use. No findings raised. |
-| 4 | Error handling & resilience | ☑ | S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008. |
-| 5 | Security | ☑ | Auth secrets logged-safe, but error bodies echoed verbatim — finding 007. |
-| 6 | Performance & resource management | ☑ | `HttpRequestMessage`/`HttpResponseMessage` and failed `SqlConnection` not disposed; full repository scan per call — findings 005, 010, 011. |
-| 7 | Design-document adherence | ☑ | Timeout, retry settings, audit logging gaps — findings 002, 004, 012. |
-| 8 | Code organization & conventions | ☑ | Options class correctly owned by module; `MaxConcurrentConnectionsPerSystem` unused — finding 013. |
-| 9 | Testing coverage | ☑ | CachedCall buffering and DatabaseGateway untested — finding 014. |
-| 10 | Documentation & comments | ☑ | XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012. |
+| 1 | Correctness & logic bugs | ☑ | Prior: URL edge cases, dropped S&F result, classification — 003, 006, 009. Re-review: `MaxRetries == 0` retry-forever vs never-retry contradiction (015); trailing-`?` URL nit (017). |
+| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found. |
+| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. No findings raised. |
+| 4 | Error handling & resilience | ☑ | Prior: handler registration, double-dispatch, timeout, cancellation — 001, 002, 003, 008. Re-review: the unbounded-retry consequence of finding 015 is also a resilience defect (recorded under category 1). |
+| 5 | Security | ☑ | Error bodies now truncated (007). No new issues — auth secrets not logged, body capped. |
+| 6 | Performance & resource management | ☑ | Prior: disposal and repository-scan findings 005, 010, 011 — all resolved and verified. No new issues. |
+| 7 | Design-document adherence | ☑ | Prior: timeout, retry settings, logging — 002, 004, 012. Re-review: finding 015 is also a design-adherence gap (S&F retry contract); recorded under category 1. |
+| 8 | Code organization & conventions | ☑ | Prior: `MaxConcurrentConnectionsPerSystem` wiring — 013. Re-review: that wiring uses process-global `ConfigureHttpClientDefaults`, leaking the ESG cap onto every host `HttpClient` — finding 016. |
+| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
+| 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |

 ## Findings

@@ -760,3 +781,144 @@ headers and body so URL/auth/body construction is now verified, not just status
 These are new-coverage tests against already-correct behaviour, so they pass on the
 current source; the `BuildUrl` and `ApplyAuth` paths they exercise are now protected
 against regression.
+
+### ExternalSystemGateway-015 — `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:120-127`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:102-108` |
+
+**Description**
+
+The `ExternalSystemGateway-004` fix removed the `MaxRetries > 0 ? ... : null` guard so
+that `CachedCallAsync` and `CachedWriteAsync` now pass the definition's `MaxRetries`
+to `StoreAndForwardService.EnqueueAsync` verbatim. The stated rationale (and the inline
+comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101`, plus the
+tests `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` and
+`CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`) is that a configured
+`MaxRetries` of `0` means **"never retry"**.
+
+That is the opposite of what the Store-and-Forward engine actually does with the
+value. `EnqueueAsync` stores the passed `maxRetries` directly into
+`StoreAndForwardMessage.MaxRetries` (`StoreAndForwardService.cs:139`), whose own XML
+doc states **"Maximum retry-sweep attempts before parking (0 = no limit)"**
+(`StoreAndForwardMessage.cs:30`). The retry sweep enforces it as
+`if (message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries)`
+(`StoreAndForwardService.cs:285`) — when `MaxRetries == 0` that guard is false on
+every sweep, so the message is **never parked and is retried forever**.
+
+Consequences for a `CachedCall`/`CachedWrite` against a system or connection
+configured with `MaxRetries = 0`:
+
+1. A transiently-failing message that the operator intended never to retry is instead
+   retried on every sweep indefinitely, accumulating in the buffer and repeatedly
+   re-dispatching the request — the exact unbounded-retry / duplicate-delivery hazard
+   the idempotency note in the design doc warns about.
+2. The two ESG regression tests cited above assert `max_retries == 0` is *stored* and
+   describe that as "honoured" — they verify the persisted column, never the resulting
+   sweep behaviour, so they lock in the broken semantics.
+3. Because the SiteRuntime repository still always supplies `MaxRetries == 0` (the
+   open companion gap noted in `ExternalSystemGateway-004`), **every** cached call and
+   cached write at every site currently buffers as retry-forever. Before the ESG-004
+   fix the `> 0` guard sent `null`, so the S&F `DefaultMaxRetries` (a bounded value)
+   applied — i.e. the ESG-004 fix turned a bounded retry into an unbounded one for the
+   common path.
+
+**Recommendation**
+
+Reconcile the ESG and S&F interpretations of `MaxRetries == 0` — they must agree.
+Either: (a) treat the entity's `MaxRetries == 0` as "unset" and pass `null` so the
+bounded S&F default applies (reverting to the pre-ESG-004 behaviour, and accepting
+that "never retry" then needs a different representation such as a nullable field or a
+`MaxRetries == 1` convention); or (b) if `0` genuinely must mean "never retry", add an
+explicit no-retry path — e.g. do not enqueue at all on transient failure when
+`MaxRetries == 0`, or introduce a distinct sentinel — and fix the
+`StoreAndForwardMessage.MaxRetries` doc and `RetryMessageAsync` guard so `0` no longer
+means "no limit". Update the two `ZeroMaxRetries...` tests to assert the *sweep*
+outcome (parked / not retried), not just the stored column value.
+
+**Resolution**
+
+_Unresolved._
+
+### ExternalSystemGateway-016 — `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:21-29` |
+
+**Description**
+
+The `ExternalSystemGateway-013` fix wires `MaxConcurrentConnectionsPerSystem` by
+calling `services.ConfigureHttpClientDefaults(builder => builder.ConfigurePrimaryHttpMessageHandler(...))`.
+The inline comment claims this "applies to the dynamically-named clients created by
+`ExternalSystemClient`" — but `ConfigureHttpClientDefaults` is **process-global**: it
+adds the configuration to *every* `HttpClient`/`IHttpClientFactory` client created
+anywhere in the host, regardless of name.
+
+The Host registers the External System Gateway alongside other components that also
+use `IHttpClientFactory` — notably `ScadaLink.NotificationService` (`OAuth2TokenService`
+and its `ServiceCollectionExtensions` call `AddHttpClient`). With the ESG registration
+present, the OAuth2 token client (and any future `HttpClient` consumer in the host)
+has its **primary handler replaced** by a `SocketsHttpHandler` whose
+`MaxConnectionsPerServer` is the ESG's `MaxConcurrentConnectionsPerSystem`. That:
+
+1. Silently caps unrelated clients at a value owned by, and named for, a different
+   component — an operator tuning the ESG option unknowingly throttles Microsoft 365
+   OAuth2 token acquisition.
+2. Overrides/discards any primary-handler configuration those other components add for
+   their own clients (e.g. a custom handler, proxy, or certificate settings).
+
+This is a leaky, surprising side effect for what the option claims to be a per-ESG
+setting; `ConfigureHttpClientDefaults` should not be used to express a single
+component's policy.
+
+**Recommendation**
+
+Scope the handler configuration to the gateway's own clients only. The ESG already
+creates per-system clients with a deterministic name pattern
+(`ExternalSystem_{system.Name}`); register a typed/named `HttpClient` (or a small
+factory abstraction) for that pattern and call `ConfigurePrimaryHttpMessageHandler`
+on that registration instead of on the global defaults. If a name-pattern registration
+is impractical, document the global effect explicitly and rename the option, but the
+preferred fix is to stop using `ConfigureHttpClientDefaults`.
+
+**Resolution**
+
+_Unresolved._
+
+### ExternalSystemGateway-017 — `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:324-333` |
+
+**Description**
+
+In `BuildUrl`, the GET/DELETE query-string branch is entered when
+`parameters != null && parameters.Count > 0`, but the projection then filters out
+null-valued entries (`parameters.Where(p => p.Value != null)`). When a GET/DELETE
+method is invoked with a non-empty parameter dictionary whose values are *all* null,
+`queryString` is the empty string and the code still executes `url += "?" + queryString`,
+producing a URL ending in a bare `?` (e.g. `https://host/api/resource?`). Most servers
+tolerate a trailing `?`, but it is an unintended artifact, can defeat response caching
+on some endpoints, and makes captured request URLs harder to read in logs.
+
+**Recommendation**
+
+Only append `"?" + queryString` when `queryString` is non-empty (compute the joined
+string first and check it), so a method whose effective parameter set is empty
+produces a clean URL identical to the no-parameters case.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.HealthMonitoring` |
 | Design doc | `docs/requirements/Component-HealthMonitoring.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 4 |

 ## Summary

@@ -32,20 +32,38 @@ heartbeat path, and most collector setters. None of the findings are crash-class
 but the concurrency issues are Medium/High and the missing S&F metric is a real
 design-adherence gap.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All twelve prior findings (HealthMonitoring-001..012) are confirmed `Resolved` —
+`SiteHealthState` is now an immutable `sealed record` mutated only via atomic
+compare-and-swap, the store-and-forward buffer-depth metric is populated, the
+central-site offline grace and the unknown-site heartbeat registration are in
+place, and the test suite has grown to cover the report loop, heartbeat path, and
+collector setters. This re-review found **4 new findings, all Low/Medium, none
+crash-class**. They are residual polish items rather than behaviour regressions:
+an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated
+`HealthMonitoringOptions` intervals that crash the hosted service on
+misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left
+with a year-0001 `LastReportReceivedAt` that the UI's staleness display must
+special-case (HealthMonitoring-015), and `CollectReport` reading
+`DateTimeOffset.UtcNow` directly instead of the module's now-standard injected
+`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
+broadly faithful to the design intent.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
+| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 `LastReportReceivedAt` (HealthMonitoring-015). |
 | 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
 | 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
-| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
+| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: `HealthMonitoringOptions` intervals unvalidated — a zero/negative value crashes the hosted service at `PeriodicTimer` construction (HealthMonitoring-014). |
 | 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
 | 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
 | 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
-| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
+| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). |
 | 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
-| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |
+| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |

 ## Findings

@@ -560,3 +578,148 @@ has not yet sent a report"). A codebase-wide search confirms no `null!` suppress
 remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change
 HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so
 the contract is now honest and no further code change was required.
+
+### HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
+
+**Description**
+
+`ExecuteAsync` derives the `PeriodicTimer` cadence with the comment "Check at half
+the (shorter) offline timeout interval for timely detection", but the code only
+reads `_options.OfflineTimeout`:
+
+```csharp
+var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
+```
+
+`CentralOfflineTimeout` (HealthMonitoring-005's fix) is never considered. With the
+default options (`OfflineTimeout` 60s, `CentralOfflineTimeout` 3m) `OfflineTimeout`
+genuinely is the shorter of the two, so the parenthetical happens to hold. But the
+comment states an invariant the code does not enforce: if an operator configures
+`CentralOfflineTimeout` *smaller* than `OfflineTimeout`, the check cadence stays
+tied to `OfflineTimeout`, and central offline detection is delayed by up to a full
+`OfflineTimeout / 2` beyond the intended `CentralOfflineTimeout` window. The comment
+misleads a reader into believing the cadence already adapts to whichever timeout is
+shorter.
+
+**Recommendation**
+
+Either compute `checkInterval` from `Math.Min(OfflineTimeout, CentralOfflineTimeout)`
+so the code matches the comment, or drop the "(shorter)" wording and state plainly
+that the cadence is derived from `OfflineTimeout` only (acceptable while the default
+`CentralOfflineTimeout` is the larger value).
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |
+
+**Description**
+
+`HealthMonitoringOptions` is bound from the `ScadaLink:HealthMonitoring` config
+section (`SiteServiceRegistration.BindSharedOptions`) with no validation —
+no `IValidateOptions<HealthMonitoringOptions>`, no `ValidateDataAnnotations`, no
+`ValidateOnStart`. `ReportInterval`, `OfflineTimeout`, and `CentralOfflineTimeout`
+are all fed straight into `new PeriodicTimer(...)` (and `OfflineTimeout` into a
+division for the check interval). `PeriodicTimer`'s constructor throws
+`ArgumentOutOfRangeException` for a zero or negative period. A misconfigured
+`appsettings.json` (e.g. `"ReportInterval": "00:00:00"`, an empty/garbled value
+that binds to `TimeSpan.Zero`, or a negative span) therefore crashes the
+`HealthReportSender` / `CentralHealthReportLoop` / `CentralHealthAggregator`
+hosted service at startup with an opaque exception that does not name the
+offending config key, rather than failing fast with a clear validation message.
+
+**Recommendation**
+
+Add an options validator (DataAnnotations `[Range]`-style on the spans, or an
+`IValidateOptions<HealthMonitoringOptions>`) that rejects non-positive
+`ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and ideally requires
+`CentralOfflineTimeout >= OfflineTimeout`, and call `.ValidateOnStart()` so a bad
+configuration fails fast with a message naming the section and key.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27` |
+
+**Description**
+
+When `MarkHeartbeat` registers a previously-unknown site (HealthMonitoring-007's
+fix), it sets `LastReportReceivedAt = default` — i.e. `DateTimeOffset.MinValue`
+(`0001-01-01`). The XML doc on `SiteHealthState.LastReportReceivedAt` states the
+field is "Used by the UI to surface report staleness during failover." A
+heartbeat-only site therefore has `LatestReport == null` **and**
+`LastReportReceivedAt == DateTimeOffset.MinValue`. Any UI code that computes
+"last report N ago" as `now - LastReportReceivedAt` without first checking
+`LatestReport != null` will render a nonsensical staleness of roughly two
+thousand years for a site that is, in fact, freshly reachable. The two
+"no report yet" signals (`LatestReport == null`, `LastReportReceivedAt == default`)
+are independent and both must be special-cased; the sentinel value is an easy trap.
+
+**Recommendation**
+
+Make `LastReportReceivedAt` nullable (`DateTimeOffset?`) so "no report received
+yet" is an explicit, unmissable state rather than a magic sentinel — consistent
+with how `LatestReport` was already made nullable for the same case — and have UI
+consumers render staleness only when it has a value. Alternatively, document the
+`default` sentinel prominently on the field and audit every UI reader, but the
+nullable option is safer and matches the existing `LatestReport` treatment.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |
+
+**Description**
+
+`CollectReport` stamps each report with `ReportTimestamp: DateTimeOffset.UtcNow`,
+read directly from the system clock. Every other time-dependent class in the
+module — `CentralHealthAggregator`, `HealthReportSender`, `CentralHealthReportLoop`
+— was deliberately refactored (HealthMonitoring-006) to take an injectable
+`TimeProvider` so the behaviour is deterministically testable and the clock
+dependency is explicit. `SiteHealthCollector` is the lone holdout: the report
+timestamp cannot be controlled in a unit test, which is why
+`SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp` can only assert the
+timestamp falls in a `before`/`after` wall-clock window rather than equalling a
+known instant. This is a minor consistency/testability gap, not a behaviour bug.
+
+**Recommendation**
+
+Add an optional `TimeProvider` constructor parameter to `SiteHealthCollector`
+(defaulting to `TimeProvider.System`, mirroring the other classes) and derive
+`ReportTimestamp` from `GetUtcNow()`, so the report timestamp is deterministically
+testable and the module is consistent.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.Host` |
 | Design doc | `docs/requirements/Component-Host.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 4 |

 ## Summary

@@ -31,20 +31,37 @@ and unguarded string interpolation when building HOCON. None are crash/data-loss
 class, but the readiness bug is High because it breaks load-balancer behaviour with
 no safe workaround.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All eleven findings from the first review (Host-001..011) are confirmed `Resolved`
+in the current tree — the `/health/ready` predicate, the externalised secrets, the
+seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
+migration retry and the live `LoggingOptions.MinimumLevel` are all present as
+described in their Resolution notes. This re-review walked all ten checklist
+categories again over the full module and recorded four new findings, none of them
+crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
+note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes
+`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so
+that documented, bound option is dead. The remaining three are Low: `:F0` rounding
+of cluster timing values silently degrades any sub-second configuration (Host-013),
+Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as
+REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
+exception type including permanent schema-validation failures (Host-015).
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ☑ | `/health/ready` includes the leader-only check (Host-001); site seed-node config points at the gRPC port (Host-004). |
-| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. HOCON built by raw string interpolation (Host-006); `StartAsync` returns before actors are confirmed running (Host-009). |
-| 3 | Concurrency & thread safety | ☑ | Blocking `GetAwaiter().GetResult()` on a hosted-service startup thread (Host-005). `DeadLetterMonitorActor` state is actor-confined — no issues. |
-| 4 | Error handling & resilience | ☑ | Top-level try/catch logs fatal and rethrows. No retry around DB migration / readiness preconditions (Host-010). |
-| 5 | Security | ☑ | Plaintext DB password, LDAP service-account password and dev JWT key checked into `appsettings.Central.json` (Host-003). |
-| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. |
-| 7 | Design-document adherence | ☑ | REQ-HOST-6 mandates Akka.Persistence config but none exists and no persistent actors exist — doc is stale (Host-002). REQ-HOST-4 GrpcPort-≠-RemotingPort rule not enforced (Host-007). |
-| 8 | Code organization & conventions | ☑ | `MachineDataDb` validated/declared but never consumed (Host-008). `LoggingOptions.MinimumLevel` is dead (Host-011). |
-| 9 | Testing coverage | ☑ | Strong suite; no test asserts `/health/ready` excludes `active-node`, which is why Host-001 slipped through (noted in Host-001). |
-| 10 | Documentation & comments | ☑ | Comments are accurate. REQ-HOST-6 in the design doc is the main stale-doc item (Host-002). |
+| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). |
+| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
+| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. |
+| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
+| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
+| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
+| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
+| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
+| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
+| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |

 ## Findings

@@ -561,3 +578,153 @@ Regression tests in new `LoggerConfigurationTests`:
 — they assert the configured level actually filters log events; they pass against the
 new factory (and would fail against the old inline configuration, which ignored the
 key). Full Host suite green (175 passed).
+
+### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` |
+
+**Description**
+
+`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with
+`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions`
+(`src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a
+`DownIfAlone` property — bound from `ScadaLink:Cluster` via the Options pattern,
+documented as "the design-doc requirement", default `true` — but a repo-wide search
+shows it is referenced **nowhere outside its own declaration**. The Host therefore
+ignores the bound value entirely: setting `ScadaLink:Cluster:DownIfAlone` to `false`
+in `appsettings.json` has no effect, the resolver still runs with `down-if-alone =
+on`. This is the same class of defect as the resolved Host-011
+(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is
+declared, bound, and documented but never consumed, which silently misleads any
+operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
+hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the
+ClusterInfrastructure project owns only the configuration contract, per the
+`ClusterOptions` class comment).
+
+**Recommendation**
+
+Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone =
+{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping
+is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce
+the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed
+at `on` for ScadaLink's two-node clusters, remove the `DownIfAlone` property and its
+doc comment instead so code and configuration contract agree — but do not leave it
+declared-but-dead.
+
+**Resolution**
+
+_Unresolved._
+
+### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |
+
+**Description**
+
+`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0` —
+`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`,
+`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The
+`F0` format specifier rounds to whole seconds, so any sub-second configuration is
+silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as
+`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or
+treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole
+seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan`
+values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with
+no warning — or, worse, `0s`. Rounding configured timing values without surfacing
+the change is a correctness hazard for exactly the kind of failure-detection tuning
+these options exist for.
+
+**Recommendation**
+
+Render durations without precision loss — emit milliseconds (e.g.
+`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or
+validate in `StartupValidator` that each cluster timing value is a positive whole
+number of seconds and fail fast otherwise. Either way, do not silently round.
+
+**Resolution**
+
+_Unresolved._
+
+### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` |
+
+**Description**
+
+REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
+setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls
+`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog`
+configuration section — but neither `appsettings.json` nor either role-specific file
+contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel`
+map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that
+actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and
+`.WriteTo.File("logs/scadalink-.log", rollingInterval: Day)` calls. As a result the
+console output template, the file path, and the rolling interval cannot be changed
+without recompiling — an operator cannot redirect logs, change the file location, or
+add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration`
+call is effectively a no-op because the section it reads does not exist.
+
+**Recommendation**
+
+Move the console and file sink definitions into a `Serilog` section in
+`appsettings.json` (with `WriteTo` entries and the output template / file path /
+rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the
+hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to
+state the sinks are intentionally code-defined — but the design doc currently says
+"configuration-driven", so code and doc must be reconciled.
+
+**Resolution**
+
+_Unresolved._
+
+### Host-015 — `StartupRetry` retries on every exception type, including permanent failures
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` |
+
+**Description**
+
+`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard
+`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller
+wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which
+on a Central node in production *validates* the schema version and throws if the
+deployed schema does not match the expected migration set. A schema-version mismatch
+is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it
+8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes)
+before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault
+that is fatal on the first attempt. The retry helper is meant for *transient*
+connection faults (the XML doc says exactly that: "the database may be briefly
+unreachable"), but it cannot distinguish transient from permanent failures.
+
+**Recommendation**
+
+Restrict retries to transient faults — e.g. accept an `Func<Exception, bool>
+isTransient` predicate and, for the migration call site, treat only
+connection-class exceptions (`SqlException` with a connection/transport error
+number, `TimeoutException`, socket errors) as retryable while letting
+schema-validation / `InvalidOperationException` failures propagate immediately. Add
+a `StartupRetryTests` case asserting a non-transient exception is rethrown after a
+single attempt.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.InboundAPI` |
 | Design doc | `docs/requirements/Component-InboundAPI.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 4 |

 ## Summary

@@ -30,6 +30,25 @@ well but there is no coverage of the HTTP endpoint, concurrency, or recompilatio
 None of the findings are data-loss-class, but the concurrency and trust-model issues
 are High severity and should be addressed before production use.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All 13 findings from the initial review remain `Resolved`; the module source under
+`src/ScadaLink.InboundAPI` is unchanged since the last InboundAPI fix commit
+(`8dd7412`), which precedes `39d737e`. This re-review re-walked all 10 checklist
+categories against the resolved code and surfaced **4 new findings** — none touching
+the previously-fixed concurrency/trust-model code, but all in areas the first pass
+did not probe deeply: (1) the `ReturnDefinition` column is loaded onto `ApiMethod`
+but is never consulted — script return values are serialized verbatim with no shaping
+or validation against the declared return structure (InboundAPI-014); (2) the new
+`ForbiddenApiChecker` is a purely textual syntax walker and can be bypassed by
+reaching forbidden functionality through member access that never spells a forbidden
+namespace, e.g. `typeof(x).Assembly.GetType("System.IO.File")` (InboundAPI-015);
+(3) routed `Route.To().Call()` invocations are not bound by the method timeout unless
+the script explicitly threads `Parameters`-side cancellation, contradicting the design
+statement that the timeout covers routed calls (InboundAPI-016); and (4) `RouteHelper`
+/ `RouteTarget` — the entire WP-4 cross-site routing surface — has no test coverage
+(InboundAPI-017). New findings are one Medium-trio plus one Low; no Critical or High.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -38,11 +57,11 @@ are High severity and should be addressed before production use.
 | 2 | Akka.NET conventions | ☑ | Module is ASP.NET-hosted, no actors of its own; routes to actors via `CommunicationService`. No correlation-ID issues — IDs are set in `RouteHelper`. |
 | 3 | Concurrency & thread safety | ☑ | Singleton `InboundScriptExecutor` mutates a non-thread-safe `Dictionary` from concurrent request threads — see InboundAPI-001/002. |
 | 4 | Error handling & resilience | ☑ | Catch-all conflates client cancellation with timeout (InboundAPI-004); compilation-failure path repeats work on every request (InboundAPI-009). |
-| 5 | Security | ☑ | Non-constant-time key comparison, no trust-model enforcement, no body-size limit, missing-method enumeration oracle — see InboundAPI-003/005/006/011. |
+| 5 | Security | ☑ | Prior items resolved. Re-review: `ForbiddenApiChecker` is a textual deny-list bypassable via reflection without a forbidden namespace token (InboundAPI-015). |
 | 6 | Performance & resource management | ☑ | Up to 3 separate DB round-trips per request in `ApiKeyValidator`; uncapped lazy recompilation. |
-| 7 | Design-document adherence | ☑ | `Database.Connection()` script API missing; central-only hosting not enforced; lazy-compile diverges from "compiled at startup". |
-| 8 | Code organization & conventions | ☑ | `ParameterDefinition` is an API-shaped POCO declared in the component project rather than Commons; otherwise conventions followed. |
-| 9 | Testing coverage | ☑ | Good unit coverage of the two validators; no endpoint, concurrency, recompilation, or timeout-vs-cancel tests. |
+| 7 | Design-document adherence | ☑ | Re-review: `ReturnDefinition` loaded but never used (InboundAPI-014); routed-call timeout not enforced (InboundAPI-016). Prior `Database.Connection()`/central-only items resolved. |
+| 8 | Code organization & conventions | ☑ | `ParameterDefinition` moved to Commons (InboundAPI-012 resolved); no new issues. |
+| 9 | Testing coverage | ☑ | Re-review: `RouteHelper`/`RouteTarget` (WP-4 routing) entirely untested (InboundAPI-017); validators/executor/filter well covered. |
 | 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). |

 ## Findings
@@ -580,3 +599,181 @@ the new method-not-found status, and removing dead code cannot regress. Doc-owne
 follow-up: `Component-InboundAPI.md`'s Error Handling section still does not list a
 "method not found" status; it should note that it is reported as 403 (indistinguishable
 from "key not approved"), but that doc edit is outside this module's editable scope.
+
+### InboundAPI-014 — `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:201-205`, `src/ScadaLink.Commons/Entities/InboundApi/ApiMethod.cs:10` |
+
+**Description**
+
+`Component-InboundAPI.md` ("API Method Definition → Return Value Definition" and the
+"Response Format" section) specifies that each method has a declared return structure
+— "Field names and data types … Supports returning lists of objects" — and that the
+success response body is "the method's return value as JSON, with fields matching the
+return value definition". The `ApiMethod` entity carries a `ReturnDefinition` column
+to hold exactly this. However, nothing in the module ever reads `ReturnDefinition`:
+`ExecuteAsync` takes whatever object the script happens to return and does a blind
+`JsonSerializer.Serialize(result)`. There is no validation that the script's return
+value matches the declared shape, no coercion to the declared field types, and no
+error when a method returns a structure inconsistent with its definition. A method
+whose script returns the wrong shape (or `null` where a structure is required) will
+silently emit a malformed 200 response, and the documented return-definition contract
+is effectively unenforced. This is the response-side mirror of the parameter
+validation that `ParameterValidator` does perform, leaving the two halves of the
+method contract asymmetric.
+
+**Recommendation**
+
+Either (a) implement return-value validation/shaping: parse `ReturnDefinition` with
+the same extended-type machinery used for parameters and validate/coerce the script
+result before serializing, returning a 500 (or logging) when the script result does
+not match; or (b) if return shaping is deliberately out of scope, remove the "Return
+Value Definition" / "fields matching the return value definition" language from
+`Component-InboundAPI.md` and document that the response is the script's raw return
+value serialized as-is. Code and design doc must be reconciled.
+
+**Resolution**
+
+_Unresolved._
+
+### InboundAPI-015 — `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/ForbiddenApiChecker.cs:63-119`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:109-126` |
+
+**Description**
+
+`ForbiddenApiChecker` walks the script syntax tree and rejects any `using` directive,
+`QualifiedNameSyntax`, or `MemberAccessExpressionSyntax` whose textual dotted name
+starts with a forbidden namespace prefix (`System.IO`, `System.Diagnostics`,
+`System.Reflection`, `System.Net`, etc.). This is a textual match, not a semantic
+one, and the trust model it enforces (per InboundAPI-005) is explicitly meant to keep
+*untrusted* Design-role scripts away from host APIs. The check can be bypassed because
+forbidden functionality is reachable through member access that never spells a
+forbidden namespace:
+
+- `typeof(string).Assembly.GetType("System.IO.File")` — `typeof(string)` is permitted,
+  `.Assembly` is a `System.Type` property, `.GetType(string)` is a `System.Reflection.Assembly`
+  method. The string literal `"System.IO.File"` is a string, not a `QualifiedNameSyntax`
+  or `MemberAccessExpressionSyntax`, so `IsForbidden` never sees it. The script obtains
+  a `System.IO.File` `Type` and can `InvokeMember`/`GetMethod(...).Invoke(...)` on it —
+  all via members of permitted types — with no forbidden namespace ever appearing in
+  the source. `CompileAndRegister` references `typeof(object).Assembly`
+  (System.Private.CoreLib) in `ScriptOptions`, so every framework type is loadable at
+  runtime.
+- The executor also references the `Microsoft.CSharp.RuntimeBinder` assembly
+  (`InboundScriptExecutor.cs:116`), enabling the `dynamic` keyword, which further
+  widens late-bound member access that the static walker cannot see through.
+
+Because the inbound API script runs on the central node with the host process's
+privileges and is authored by the (less-trusted-than-Admin) Design role, a static
+textual deny-list gives a false sense of containment.
+
+**Recommendation**
+
+Treat the syntax walker as defence-in-depth, not the boundary. Strengthen it where
+cheap (flag `Assembly.GetType`, `Type.GetType`, `Activator.CreateInstance`,
+`InvokeMember`, and `dynamic` usage), but for real enforcement run compiled scripts
+under a genuine boundary — a restricted `AssemblyLoadContext`/AppDomain-equivalent, a
+curated reference set that does not expose reflection-to-arbitrary-type, or an
+out-of-process sandbox — consistent with however the Site Runtime ultimately enforces
+its instance-script trust model. At minimum, document in `Component-InboundAPI.md`
+that the current check is best-effort and does not stop a determined script.
+
+**Resolution**
+
+_Unresolved._
+
+### InboundAPI-016 — Routed `Route.To().Call()` invocations are not bound by the method timeout
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:59-152`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:177`, `:199` |
+
+**Description**
+
+`Component-InboundAPI.md` states the per-method timeout "defines the maximum time the
+method is allowed to execute (**including any routed calls to sites**)", and the
+Routing Behavior section says a routed call "blocks until the site responds or the
+**method-level timeout** is reached". The executor builds a linked
+`CancellationTokenSource` (`cts`) combining the request-abort token and a dedicated
+timeout CTS, and exposes `cts.Token` to the script as `InboundScriptContext.CancellationToken`.
+However, every `RouteTarget` method (`Call`, `GetAttribute(s)`, `SetAttribute(s)`)
+takes `CancellationToken cancellationToken = default` and the script must *explicitly*
+pass the context token for the routed call to honour the timeout. A natural script —
+`Route.To("inst").Call("doWork", parameters)` — invokes the routed call with
+`CancellationToken.None`. That request flows into `CommunicationService.RouteToCallAsync`
+with no cancellation, so the routed call is not bounded by the method timeout at all.
+The only timeout guard left is `handler(context).WaitAsync(cts.Token)` in
+`ExecuteAsync`: when the method timeout fires, `WaitAsync` returns a cancellation to
+the caller, but the underlying script `Task` — and the in-flight `RouteToCallAsync`
+awaiting a remote site — keeps running orphaned with no cancellation, holding the
+correlation/communication resources until the site eventually responds or its own
+transport timeout (if any) fires. The design's guarantee that the method timeout
+covers routed calls is therefore not met, and a slow/hung site can leak background
+work past the timeout the caller was told bounds the request.
+
+**Recommendation**
+
+Make routed calls inherit the method deadline without relying on script discipline:
+have `RouteHelper`/`RouteTarget` carry the executing method's `CancellationToken`
+(injected by `InboundScriptExecutor` when it constructs the context, e.g. a
+`RouteHelper` bound to `cts.Token`) and pass it into every `CommunicationService`
+call by default, so `Route.To("x").Call("s", p)` is timeout-bounded with no token
+argument. Keep the explicit-token overload for callers that want a tighter bound.
+Verify `RouteToCallAsync` and the attribute-routing calls actually observe the token
+and abandon the in-flight request when it fires.
+
+**Resolution**
+
+_Unresolved._
+
+### InboundAPI-017 — `RouteHelper` / `RouteTarget` has no test coverage
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:1-165`, `tests/ScadaLink.InboundAPI.Tests/` |
+
+**Description**
+
+`RouteHelper`/`RouteTarget` is the entire WP-4 cross-site routing surface — the
+`Route.To().Call()/GetAttribute(s)/SetAttribute(s)` API that inbound API scripts use
+to reach instances at any site. It has zero tests: the `ScadaLink.InboundAPI.Tests`
+project covers `ApiKeyValidator`, `ParameterValidator`, `InboundScriptExecutor`, and
+`InboundApiEndpointFilter`, but no test file exercises `RouteHelper`. Untested
+behaviours include site resolution via `IInstanceLocator` (including the
+"instance not found / no assigned site" `InvalidOperationException` path at
+`RouteHelper.cs:154-164`), the `!response.Success` → `InvalidOperationException`
+translation in each routed method, `GetAttribute` delegating to the batch
+`GetAttributes` and returning `null` for an absent key, correlation-ID generation,
+and `SetAttribute` delegating to `SetAttributes`. These are non-trivial branches
+whose failure modes (a thrown exception inside a script) surface to the caller as a
+500, so regressions would be silent.
+
+**Recommendation**
+
+Add a `RouteHelperTests` suite using substituted `IInstanceLocator` and
+`CommunicationService` (the executor tests already substitute `CommunicationService`):
+cover the happy path of each routed method, the unresolved-instance throw, the
+`!Success` → `InvalidOperationException` mapping, and `GetAttribute` returning `null`
+for a missing key. This also gives InboundAPI-016 a regression home if the timeout
+wiring is added.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.ManagementService` |
 | Design doc | `docs/requirements/Component-ManagementService.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 (1 Deferred — see ManagementService-012) |
+| Commit reviewed | `39d737e` |
+| Open findings | 4 (1 Deferred — see ManagementService-012) |

 ## Summary

@@ -28,6 +28,24 @@ instances never disposed) and dead/unused configuration. None of the findings ar
 crash-class, but the site-scope gaps are High severity because they are a real
 authorization bypass with no workaround.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All thirteen prior findings remain correctly closed; the source under
+`src/ScadaLink.ManagementService` is byte-identical between the previously reviewed state
+and `39d737e` (the resolution commits of findings 001–013 are folded into the history at or
+before `39d737e`). ManagementService-012 was re-checked and its **Deferred** status still
+holds: `ManagementEnvelope.Command` is still typed `object`, and the marker-interface fix
+still belongs in the Commons module, outside this module's edit scope — nothing has changed
+to make it actionable here. This re-review re-ran the full 10-category checklist against the
+current sources and surfaced **four new findings**. The dominant theme is the same
+site-scope authorization gap that findings 001/002 closed: `HandleQueryDeployments`
+(`QueryDeploymentsCommand`) was overlooked by that sweep and still performs no site-scope
+enforcement, letting a site-scoped Deployment user read deployment history for any site
+(014, High). The remaining three are lower severity: a non-atomic multi-override mutation
+that can leave an instance partially modified after an error (015, Medium), raw exception
+messages from unexpected faults being returned verbatim to HTTP callers (016, Low), and
+`QueryDeploymentsCommand` having no test coverage at all (017, Low).
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -551,3 +569,144 @@ infrastructure, so the testable command-parsing/dispatch logic was extracted int
 `ParseCommand` helper and covered instead. Tests: `ParseCommand_*` (5),
 `SerializeResult_*` (2), `UnknownCommandType_FaultMappedToManagementError`, plus the
 pre-existing site-scope and DebugStreamHub suites. `dotnet test` -> 48 passed.
+
+### ManagementService-014 — HandleQueryDeployments bypasses site-scope enforcement
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:306`, `:1174` |
+
+**Description**
+
+`QueryDeploymentsCommand` is gated to the `Deployment` role by `GetRequiredRole`
+(`ManagementActor.cs:170`–`:177`), and the design document's Authorization section states
+"Site scoping is enforced for site-scoped Deployment users" and explicitly lists deployments
+among the Deployment-role operations. `HandleQueryDeployments` makes no call to
+`EnforceSiteScope` / `EnforceSiteScopeForInstance` / `EnforceSiteScopeForIdentifier`: with no
+`InstanceId` it returns `IDeploymentManagerRepository.GetAllDeploymentRecordsAsync()` (every
+deployment record across all sites), and with an `InstanceId` it returns that instance's
+deployment history with no check that the instance's site is within the caller's permitted
+set. A site-scoped Deployment user scoped to site A can therefore enumerate deployment
+records for instances at site B — instance IDs, `DeployedBy` (operator usernames),
+timestamps, deployment status, and `ErrorMessage` content — by issuing `QueryDeployments`
+with or without an out-of-scope `InstanceId`. This is the same authorization-bypass class as
+the resolved findings 001/002, on a handler that sweep did not cover; it is `DispatchCommand`'s
+only `Deployment`-role handler with no scope enforcement.
+
+**Recommendation**
+
+Thread `AuthenticatedUser` into `HandleQueryDeployments` (the dispatch case at line 306
+already has `user` in scope). When `cmd.InstanceId` is supplied, call
+`EnforceSiteScopeForInstance` before querying. When it is not supplied, filter the returned
+`DeploymentRecord` list to the caller's permitted sites — resolve each record's instance to
+its `SiteId` (or join through a site-aware repository query) and drop records for sites
+outside `PermittedSiteIds`, mirroring the `HandleListInstances` / `HandleListSites` filter
+pattern. Add a regression test for a site-scoped user against in-scope and out-of-scope
+instances.
+
+**Resolution**
+
+_Unresolved._
+
+### ManagementService-015 — HandleSetInstanceOverrides applies overrides non-atomically
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:647`–`:659` |
+
+**Description**
+
+`HandleSetInstanceOverrides` iterates `cmd.Overrides` and calls
+`InstanceService.SetAttributeOverrideAsync` once per attribute, throwing
+`InvalidOperationException` on the first `result.IsSuccess == false`. Each
+`SetAttributeOverrideAsync` call persists independently, so if the command supplies five
+overrides and the third fails (e.g. an unknown attribute name or a validation error), the
+first two overrides are already committed to the configuration database while the caller
+receives a `ManagementError`. The instance is left partially mutated in a state the operator
+neither sees nor requested, and the per-instance operation lock referenced in the design's
+deployment decisions does not protect against this because the partial writes are committed
+before the throw. A retry of the same command then re-applies the already-applied overrides.
+
+**Recommendation**
+
+Make the multi-override mutation all-or-nothing: either validate every requested override up
+front before applying any, or apply all overrides within a single transaction / unit-of-work
+so a mid-batch failure rolls back the earlier writes. If `InstanceService` cannot offer a
+batch method, at minimum document the partial-application behaviour on `SetInstanceOverridesCommand`
+and have the handler report which overrides were applied before the failure so the caller
+can reconcile.
+
+**Resolution**
+
+_Unresolved._
+
+### ManagementService-016 — Unexpected exception messages returned verbatim to HTTP callers
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:121`–`:131` |
+
+**Description**
+
+`MapFault` maps any non-`SiteScopeViolationException` fault to
+`new ManagementError(correlationId, cause.Message, "COMMAND_FAILED")`, and
+`ManagementEndpoints.HandleRequest` returns that `Error` string directly in the HTTP 400
+body. For handler-thrown `InvalidOperationException`s carrying a curated `result.Error`
+message this is intended and safe. But the same path also surfaces the raw `.Message` of
+unanticipated exceptions — a `SqlException` (which can include server/database names and
+constraint details), a `DbUpdateException`, an `ArgumentException` from `Enum.Parse` on a
+malformed `DataType`/`TriggerType` value, or a `NullReferenceException` — straight to the
+external CLI/HTTP client. This is a minor internal-detail disclosure surface: the exception
+text is already logged server-side with full context, so the client does not need the raw
+message.
+
+**Recommendation**
+
+Distinguish handler-curated failures from unexpected faults. Have handlers throw a dedicated
+exception type (e.g. `ManagementCommandException`) for messages that are safe to surface, and
+in `MapFault` return that message for the known type while returning a generic
+"An internal error occurred (CorrelationId=...)" string for everything else — the operator
+can still correlate to the server log via the correlation ID.
+
+**Resolution**
+
+_Unresolved._
+
+### ManagementService-017 — QueryDeploymentsCommand has no test coverage
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1` |
+
+**Description**
+
+`QueryDeploymentsCommand` / `HandleQueryDeployments` is exercised by no test in
+`ManagementActorTests`. There is no test that it requires the `Deployment` role, no test of
+the `InstanceId`-filtered versus unfiltered branches, and — because the handler performs no
+site-scope enforcement at all — no test that would have caught finding 014. The deployment
+query is one of the operations the design's Authorization section calls out for site
+scoping, yet it is the only `Deployment`-role command with neither an authorization test nor
+a site-scope test.
+
+**Recommendation**
+
+Add tests for `QueryDeploymentsCommand`: a role test (Design/no-role caller →
+`ManagementUnauthorized`), branch coverage for the `InstanceId`-filtered and unfiltered
+repository calls, and — once finding 014 is fixed — site-scope tests for a site-scoped
+Deployment user against in-scope and out-of-scope deployment records.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.NotificationService` |
 | Design doc | `docs/requirements/Component-NotificationService.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 5 |

 ## Summary

@@ -30,6 +30,31 @@ plaintext `string` values. Test coverage exercises the happy path and the main
 error branches but misses the OAuth2 delivery path, the permanent-classification
 fallback in `DeliverAsync`, and concurrency on the token cache.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+Re-reviewed at commit `39d737e` after the NS-001..013 fixes. All twelve prior
+findings remain closed (eleven Resolved, NS-013 Deferred) and the fixes hold up
+against the current source — error classification is now typed, the SMTP client is
+created once and always disconnected/disposed, the S&F `Notification` delivery
+handler is registered in `AkkaHostedService`, and the OAuth2 token cache is keyed
+per credential. The re-review surfaced **five new findings**. The recurring theme is
+**unclassified-exception handling**: the typed `ClassifySmtpError` helper introduced
+by NS-002/003 returns `Unknown` for anything that is not a recognised SMTP/socket
+failure, but neither `SendAsync` nor `DeliverBufferedAsync` has a catch-all for the
+`Unknown` bucket. As a result an OAuth2 token-fetch failure (`HttpRequestException`
+from `EnsureSuccessStatusCode`, or `InvalidOperationException` from a malformed
+credential string) escapes `SendAsync` as a raw exception to the calling script
+(NS-015) and escapes `DeliverBufferedAsync` out of the S&F handler, where the engine
+treats any thrown exception as transient and retries a permanently-broken config
+forever (NS-014) — the same defect class NS-008 fixed for address parsing, but the
+OAuth2 path was never covered. Secondary findings: `AuthenticateAsync` silently
+proceeds unauthenticated for an unknown `authType` or empty credentials (NS-016);
+`NotificationOptions` is bound from configuration in two places but never read by
+any code (NS-017, dead config — NS-007 sourced the timeout/limit from
+`SmtpConfiguration` instead); and the lazily-created concurrency limiter is read
+outside its lock, is sized once and never resized on redeployment, and is never
+disposed (NS-018).
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -465,3 +490,108 @@ filed) and the genuinely-missing tests were added:
   `AkkaHostedService` handler registration), so no further test is required here.

 Module test suite is green at 56 tests.
+
+### NotificationService-014 — OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:214-228`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312`, `src/ScadaLink.NotificationService/OAuth2TokenService.cs:56-84` |
+
+**Description**
+
+`DeliverBufferedAsync` is the registered Store-and-Forward delivery handler for `StoreAndForwardCategory.Notification`. Per the S&F handler contract (`StoreAndForwardService.cs:40-53`), a handler must return `true` for success, `false` for permanent failure (park, no further retries), and **throw only for a transient failure** (retry). `DeliverBufferedAsync`'s `try` block (`:214-228`) catches only `SmtpPermanentException`; the comment at `:227` asserts "Transient SMTP errors propagate out of `DeliverAsync`". But `DeliverAsync` can also throw exceptions that `ClassifySmtpError` buckets as `Unknown` — most importantly the OAuth2 path at `:308-312`, where `OAuth2TokenService.GetTokenAsync` throws `HttpRequestException` (from `EnsureSuccessStatusCode` at `OAuth2TokenService.cs:78`) or `InvalidOperationException` (malformed credential triple, or missing `access_token`). None of these is an `SmtpCommandException`/`SmtpProtocolException`/`SocketException`/`TimeoutException`, so they are neither caught nor classified — they propagate straight out of `DeliverBufferedAsync`. The S&F engine treats *any* thrown exception as a transient failure, so a notification whose SMTP config has an invalid client secret (a 401 → `HttpRequestException`, a genuinely **permanent** condition) is retried on every sweep until `MaxRetries` is reached — burning OAuth2 token-endpoint calls and never parking promptly. A malformed credential string (`InvalidOperationException`, never fixable) is likewise retried instead of parked immediately.
+
+**Recommendation**
+
+Add a catch-all to `DeliverBufferedAsync` for exceptions that `ClassifySmtpError` classifies as `Unknown`: log and return `false` (park) for non-retryable causes such as `InvalidOperationException` from credential parsing, and decide deliberately whether an OAuth2 `HttpRequestException` should park or retry (a 5xx token endpoint is transient; a 401 is permanent — inspect `HttpRequestException.StatusCode`). Do not let an unclassified exception leave the handler and be silently reinterpreted as transient.
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationService-015 — Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:96-148`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312` |
+
+**Description**
+
+`SendAsync`'s `try`/`catch` around `DeliverAsync` has exactly three catch clauses: `SmtpPermanentException` (`:101`), `OperationCanceledException` when cancellation was requested (`:111`), and `Exception` filtered by `IsTransientSmtpError` (`:116`). There is no catch-all for an exception that is none of those. `ClassifySmtpError` returns `SmtpErrorClass.Unknown` for anything it does not recognise, and `IsTransientSmtpError` only matches `Transient`, so an `Unknown`-classified exception falls through all three clauses and escapes `SendAsync` unhandled. Concretely, the OAuth2 branch at `:308-312` calls `OAuth2TokenService.GetTokenAsync`, which throws `HttpRequestException` (token endpoint unreachable / 4xx / 5xx) or `InvalidOperationException` (malformed `tenant:client:secret` triple, or no `access_token` in the response). Both reach the calling script (an instance/alarm/shared script) as a raw exception of a type the `INotificationDeliveryService` contract never advertises — the design doc says delivery returns either success, a buffered transient result, or a permanent `NotificationResult` error. `Notify.To().Send()` against an OAuth2 config with a bad secret therefore crashes the script instead of returning a clean `NotificationResult(false, ...)`. This is the same defect class NS-008 resolved for malformed addresses, but NS-008's `ValidateAddresses` fix never covered the OAuth2 token path. (A non-cancellation `OperationCanceledException` — e.g. an internal MailKit timeout surfaced as OCE while the caller's token is *not* cancelled — would escape the same way.)
+
+**Recommendation**
+
+Add a final `catch (Exception ex)` to `SendAsync` that converts any otherwise-unhandled exception into a permanent `NotificationResult(false, ...)` (credential-scrubbed, consistent with NS-009), and log it. Treating an unclassified failure as permanent is the safe default — it returns control to the script rather than crashing it; an OAuth2 transient (token endpoint 5xx) being reported as permanent is an acceptable trade against an unhandled crash, and can be refined by inspecting `HttpRequestException.StatusCode` if transient buffering of token failures is wanted.
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationService-016 — `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:46-67` |
+
+**Description**
+
+`MailKitSmtpClientWrapper.AuthenticateAsync` returns without authenticating in two silent paths. (1) `if (string.IsNullOrEmpty(credentials)) return;` (`:48-49`) — if an SMTP config has `AuthType` "basic" or "oauth2" but a null/empty `Credentials` value (a misconfigured or partially-deployed row), the method returns and `DeliverAsync` proceeds straight to `SendAsync` on an *unauthenticated* connection. (2) The `switch (authType.ToLowerInvariant())` (`:51-66`) has cases only for `"basic"` and `"oauth2"` and **no `default`** — any other `AuthType` value (a typo, a future "ntlm", an empty string) falls through with no authentication attempted. Additionally, in the `"basic"` case, `credentials.Split(':', 2)` that yields fewer than 2 parts (a credential string with no colon) is silently skipped (`:55-58`). In every case the connection then attempts to send mail unauthenticated: against a relay that does not require auth this masks a misconfiguration; against a server that does require auth it fails later with a less obvious error, and at worst an unauthenticated send succeeds where it should not. Authentication being skipped should never be silent.
+
+**Recommendation**
+
+Make missing/unparseable credentials and an unrecognised `AuthType` hard errors: throw a typed exception (e.g. a permanent configuration exception so `SendAsync`/`DeliverBufferedAsync` surface it as a clean failure / park) rather than returning. Add a `default:` arm to the `switch` that throws `ArgumentException` naming the unsupported auth type, and reject a "basic" credential string that does not split into exactly two parts.
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationService-017 — `NotificationOptions` is bound from configuration but never read (dead config)
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationService/NotificationOptions.cs:1-15`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:10-11`, `src/ScadaLink.Host/SiteServiceRegistration.cs:70` |
+
+**Description**
+
+`NotificationOptions` (with `ConnectionTimeoutSeconds` and `MaxConcurrentConnections`) is bound from the `ScadaLink:Notification` configuration section in two places — `ServiceCollectionExtensions.AddNotificationService` (`AddOptions<NotificationOptions>().BindConfiguration(...)`) and again in `Host/SiteServiceRegistration.cs:70` (`services.Configure<NotificationOptions>(...)`). However, a repo-wide search shows no code ever injects `IOptions<NotificationOptions>` or otherwise reads either property. When NS-007 enforced the connection timeout and concurrency limit, it sourced both values from the per-site `SmtpConfiguration` entity, not from `NotificationOptions` — so this options class is now entirely dead configuration. Its XML doc still claims it "provides fallback defaults and operational limits", which is misleading: nothing falls back to it. The double binding is also redundant. Dead, falsely-documented configuration invites an operator to set `ScadaLink:Notification:ConnectionTimeoutSeconds` and expect it to take effect, when it has no effect at all.
+
+**Recommendation**
+
+Either delete `NotificationOptions` and both of its registrations, or genuinely wire it in as the documented fallback (e.g. `DeliverAsync` uses `NotificationOptions` values when the `SmtpConfiguration` field is non-positive). If kept, remove the duplicate `Configure` call and correct the XML doc. The NS-007 resolution note already states the `NotificationOptions` fields "remain as operational fallback defaults" — that intent should be implemented or the class removed.
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationService-018 — Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:237-255` |
+
+**Description**
+
+Three minor issues with the NS-007 concurrency limiter. (1) `GetConcurrencyLimiter` reads the plain (non-`volatile`) field `_concurrencyLimiter` at `:242` *outside* `_limiterLock` as a fast path. The field is only written inside the lock (`:252`), so under the .NET memory model a concurrent thread is not guaranteed to observe the fully-constructed `SemaphoreSlim` via the lock-free read. In practice on x86/x64 this is benign and the double-checked write inside the lock is correct, but the unsynchronized read of a reference set under a lock is the classic lazy-init memory-model pitfall — use `Volatile.Read`/`LazyInitializer`/`Lazy<T>`. (2) The limiter is sized once from the *first* `SmtpConfiguration` ever seen and the `??=` at `:252` means it is never re-created. The design doc says one SMTP config is deployed per site, but that config can be **redeployed** with a different `MaxConcurrentConnections`; the new value is silently ignored for the lifetime of the (scoped, but potentially long-lived per request) service — and more importantly the limit is captured per `NotificationDeliveryService` instance, so behaviour depends on which scoped instance happens to create it first. (3) `SemaphoreSlim` implements `IDisposable`; `_concurrencyLimiter` is never disposed and `NotificationDeliveryService` does not implement `IDisposable`. A scoped service leaking one undisposed `SemaphoreSlim` per scope is a slow handle leak.
+
+**Recommendation**
+
+Replace the hand-rolled double-checked init with `Lazy<SemaphoreSlim>` or `LazyInitializer.EnsureInitialized`, or at minimum make the field access `Volatile.Read`/`Volatile.Write`. Consider hoisting the limiter to a singleton keyed by site (the limit is a per-site invariant, not a per-scope one) so redeployment with a changed limit and disposal are handled in one place; if it stays on the scoped service, implement `IDisposable` to dispose it.
+
+**Resolution**
+
+_Unresolved._
@@ -40,34 +40,34 @@ module file and counted in **Total**.
 | Severity | Open findings |
 |----------|---------------|
 | Critical | 0 |
-| High | 0 |
-| Medium | 0 |
-| Low | 0 |
-| **Total** | **0** |
+| High | 8 |
+| Medium | 26 |
+| Low | 32 |
+| **Total** | **66** |

 ## Module Status

 | Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
 |--------|---------------|--------|----------------|------|-------|
-| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
-| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 19 |
-| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 8 |
-| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 |
-| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
-| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
-| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
-| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
-| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
-| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 |
-| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
-| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
-| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
-| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
-| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
-| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
-| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
-| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
-| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
+| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 16 |
+| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/3 | 6 | 25 |
+| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/1 | 2 | 10 |
+| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/2 | 2 | 14 |
+| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 15 |
+| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/1 | 3 | 14 |
+| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/1 | 4 | 17 |
+| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
+| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
+| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 16 |
+| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 15 |
+| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/3/1 | 4 | 17 |
+| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 17 |
+| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/2/1/2 | 5 | 18 |
+| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/2 | 4 | 15 |
+| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 14 |
+| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 19 |
+| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/1 | 3 | 17 |
+| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/0 | 2 | 16 |

 ## Pending Findings

@@ -80,14 +80,83 @@ description, location, recommendation — lives in the module's `findings.md`.

 _None open._

-### High (0)
+### High (8)

-_None open._
+| ID | Module | Title |
+|----|--------|-------|
+| CentralUI-020 | [CentralUI](CentralUI/findings.md) | Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot |
+| Communication-012 | [Communication](Communication/findings.md) | gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover |
+| DataConnectionLayer-014 | [DataConnectionLayer](DataConnectionLayer/findings.md) | DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger |
+| DeploymentManager-015 | [DeploymentManager](DeploymentManager/findings.md) | Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates |
+| ExternalSystemGateway-015 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim |
+| ManagementService-014 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments bypasses site-scope enforcement |
+| NotificationService-014 | [NotificationService](NotificationService/findings.md) | OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever |
+| NotificationService-015 | [NotificationService](NotificationService/findings.md) | Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script |

-### Medium (0)
+### Medium (26)

-_None open._
+| ID | Module | Title |
+|----|--------|-------|
+| CLI-014 | [CLI](CLI/findings.md) | `update` commands require "core" fields, making partial updates impossible |
+| CentralUI-021 | [CentralUI](CentralUI/findings.md) | `DebugView` stream callback mutates `Dictionary` off the render thread |
+| CentralUI-022 | [CentralUI](CentralUI/findings.md) | `Deployments` push handler fires `InvokeAsync` with no disposal guard |
+| ClusterInfrastructure-009 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder |
+| Communication-013 | [Communication](Communication/findings.md) | Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller |
+| ConfigurationDatabase-012 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Inbound-API `ApiKey.KeyValue` bearer credential stored in plaintext |
+| ConfigurationDatabase-013 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Secret-column encryption silently falls back to an ephemeral (throwaway) key |
+| DataConnectionLayer-015 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup |
+| DataConnectionLayer-016 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure |
+| DeploymentManager-016 | [DeploymentManager](DeploymentManager/findings.md) | Reconciled prior record keeps its stale `RevisionHash` |
+| ExternalSystemGateway-016 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process |
+| HealthMonitoring-015 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` |
+| Host-012 | [Host](Host/findings.md) | `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read |
+| InboundAPI-014 | [InboundAPI](InboundAPI/findings.md) | `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated |
+| InboundAPI-015 | [InboundAPI](InboundAPI/findings.md) | `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token |
+| InboundAPI-016 | [InboundAPI](InboundAPI/findings.md) | Routed `Route.To().Call()` invocations are not bound by the method timeout |
+| ManagementService-015 | [ManagementService](ManagementService/findings.md) | HandleSetInstanceOverrides applies overrides non-atomically |
+| NotificationService-016 | [NotificationService](NotificationService/findings.md) | `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials |
+| Security-012 | [Security](Security/findings.md) | Partial LDAP failure during login yields a roleless authenticated session |
+| Security-014 | [Security](Security/findings.md) | `RefreshToken` re-issues a token without checking the idle timeout |
+| SiteEventLogging-012 | [SiteEventLogging](SiteEventLogging/findings.md) | Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted |
+| SiteRuntime-017 | [SiteRuntime](SiteRuntime/findings.md) | Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors |
+| StoreAndForward-015 | [StoreAndForward](StoreAndForward/findings.md) | `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever" |
+| StoreAndForward-016 | [StoreAndForward](StoreAndForward/findings.md) | Operator-initiated parked-message retry and discard are not replicated to the standby |
+| TemplateEngine-015 | [TemplateEngine](TemplateEngine/findings.md) | `RenameCompositionAsync` does not cascade-rename nested derived templates |
+| TemplateEngine-016 | [TemplateEngine](TemplateEngine/findings.md) | Composed-script `ScriptScope.ParentPath` is always empty, breaking `Parent.X` resolution for nested modules |

-### Low (0)
+### Low (32)

-_None open._
+| ID | Module | Title |
+|----|--------|-------|
+| CLI-015 | [CLI](CLI/findings.md) | `Component-CLI.md` command surface has drifted again in two places |
+| CLI-016 | [CLI](CLI/findings.md) | `WriteAsTable` derives columns from the first array element only |
+| CentralUI-023 | [CentralUI](CentralUI/findings.md) | Residual bare `catch {}` blocks swallow JS interop errors |
+| CentralUI-024 | [CentralUI](CentralUI/findings.md) | Claim lookups use magic strings instead of `JwtTokenService` constants |
+| CentralUI-025 | [CentralUI](CentralUI/findings.md) | `SessionExpiry` polling/redirect path has no test coverage |
+| ClusterInfrastructure-010 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Validator does not enforce `DownIfAlone = true` despite the design doc requiring it |
+| Commons-013 | [Commons](Commons/findings.md) | `DynamicJsonElement.TryGetIndex` rejects non-`int` index values |
+| Commons-014 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy` |
+| Communication-014 | [Communication](Communication/findings.md) | Untrusted gRPC `correlation_id` flows directly into an Akka actor name |
+| Communication-015 | [Communication](Communication/findings.md) | No test exercises the real gRPC client factory across a node flip |
+| ConfigurationDatabase-014 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Redundant, inconsistent cast on one `HasConversion` call |
+| DataConnectionLayer-017 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect |
+| DeploymentManager-017 | [DeploymentManager](DeploymentManager/findings.md) | `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement |
+| ExternalSystemGateway-017 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null |
+| HealthMonitoring-013 | [HealthMonitoring](HealthMonitoring/findings.md) | Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout` |
+| HealthMonitoring-014 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service |
+| HealthMonitoring-016 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` |
+| Host-013 | [Host](Host/findings.md) | `:F0` rounding of cluster timing values silently degrades sub-second configuration |
+| Host-014 | [Host](Host/findings.md) | Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8) |
+| Host-015 | [Host](Host/findings.md) | `StartupRetry` retries on every exception type, including permanent failures |
+| InboundAPI-017 | [InboundAPI](InboundAPI/findings.md) | `RouteHelper` / `RouteTarget` has no test coverage |
+| ManagementService-016 | [ManagementService](ManagementService/findings.md) | Unexpected exception messages returned verbatim to HTTP callers |
+| ManagementService-017 | [ManagementService](ManagementService/findings.md) | QueryDeploymentsCommand has no test coverage |
+| NotificationService-017 | [NotificationService](NotificationService/findings.md) | `NotificationOptions` is bound from configuration but never read (dead config) |
+| NotificationService-018 | [NotificationService](NotificationService/findings.md) | Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed |
+| Security-013 | [Security](Security/findings.md) | `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas |
+| Security-015 | [Security](Security/findings.md) | Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims |
+| SiteEventLogging-013 | [SiteEventLogging](SiteEventLogging/findings.md) | Keyword search does not escape SQL `LIKE` wildcards in user input |
+| SiteEventLogging-014 | [SiteEventLogging](SiteEventLogging/findings.md) | Initial purge runs synchronously on the host startup thread |
+| SiteRuntime-018 | [SiteRuntime](SiteRuntime/findings.md) | `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher" |
+| SiteRuntime-019 | [SiteRuntime](SiteRuntime/findings.md) | Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor` |
+| StoreAndForward-017 | [StoreAndForward](StoreAndForward/findings.md) | Retry/Discard activity-log entries hard-code the `ExternalSystem` category |
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.Security` |
 | Design doc | `docs/requirements/Component-Security.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 (1 deferred — Security-008) |
+| Commit reviewed | `39d737e` |
+| Open findings | 4 (1 deferred — Security-008) |

 ## Summary

@@ -28,6 +28,26 @@ on every refresh, weakening the documented 30-minute idle policy. None of these
 crash/data-loss bugs, but the TLS, cookie, and key-validation items are security
 defects that should be fixed before any production deployment.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+Re-reviewed the module after the batch-resolution of Security-001..007, 009, 010, 011.
+Those fixes were verified in place and remain Resolved: the `LdapTransport` enum makes
+StartTLS reachable, the auth cookie is `SecurePolicy.Always` with a sliding window, the
+JWT signing key is length-validated at construction, issuer/audience are bound and
+checked, the DN-injection fallback is RFC 4514-escaped, and the idle-timeout anchor is
+carried across refreshes. Security-008 (N+1 scope-rule query) remains correctly
+**Deferred** — `ISecurityRepository` still exposes no per-set scope-rule query, so the
+N+1 cannot be removed from within `RoleMapper.cs` alone; the deferral condition is
+unchanged. This pass surfaced **4 new findings** (Security-012..015): two Medium and
+two Low. The most notable is that a *partial* LDAP outage during login (bind succeeds,
+group search fails) silently returns an authenticated session with **zero roles**
+instead of failing the login as the design's LDAP-failure rule requires
+(Security-012); and that `RefreshToken` re-issues a token without ever checking
+`IsIdleTimedOut`, so an idle-expired session can be renewed indefinitely if the caller
+omits the separate idle check (Security-014). The two Low findings concern fragile
+DN parsing of group names containing escaped commas and an un-trimmed username flowing
+into the LDAP filter, fallback DN, and JWT claims.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -456,3 +476,141 @@ suite. Remaining gaps closed here: renamed the test file `UnitTest1.cs` →
 LDAP-timeout coverage (Security-009) plus extra no-service-account / DN-path edge cases
 (`BuildFallbackUserDn_NoSearchBase_ReturnsBareRdn`, `EscapeLdapDn_LeadingHash_IsEscaped`,
 `EscapeLdapDn_NullOrEmpty_ReturnedUnchanged`). Full module suite: 54 tests, all green.
+
+### Security-012 — Partial LDAP failure during login yields a roleless authenticated session
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Security/LdapAuthService.cs:78-118` |
+
+**Description**
+
+In `AuthenticateAsync`, after the user bind succeeds, the group/attribute `Search` is
+wrapped in a `try`/`catch (LdapException)` that logs a warning and continues — the
+comment states "Auth succeeded even if attribute lookup failed". The method then returns
+`new LdapAuthResult(true, displayName, username, groups, null)` with an **empty
+`groups`** list. Because `RoleMapper.MapGroupsToRolesAsync` derives all roles from that
+group list, a login during a *partial* LDAP outage (bind reachable, search/group server
+unavailable, or a transient `LdapException` mid-search) produces a fully authenticated
+session with **zero roles** — the user is logged in but silently stripped of all
+permissions. The design's LDAP Connection Failure rule states new logins must **fail**
+when LDAP is unavailable; this path instead admits the user. The caller cannot
+distinguish "user genuinely belongs to no mapped groups" from "the group query failed",
+so it cannot make the documented fail-the-login decision. The `while` loop's inner
+`catch (LdapException) { break; }` (line 107-111) compounds this: a real error mid-enumeration
+is indistinguishable from end-of-results and silently truncates the group list.
+
+**Recommendation**
+
+Treat a failed group/attribute lookup on the *initial login* as an authentication
+failure: return `Success = false` with a "directory unavailable, try again" message, or
+add an explicit flag on `LdapAuthResult` (e.g. `GroupLookupSucceeded`) so the caller can
+apply the design's fail-the-login rule. The inner `while`-loop `catch` should not treat
+an arbitrary `LdapException` as a benign end-of-results sentinel — only the specific
+"no more results" condition should break the loop; any other exception should propagate
+or be surfaced.
+
+**Resolution**
+
+_Unresolved._
+
+### Security-013 — `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Security/LdapAuthService.cs:258-269` |
+
+**Description**
+
+`ExtractFirstRdnValue` extracts a group name from a `memberOf` DN by taking the substring
+between the first `=` and the first `,`. RFC 4514 permits a `,` inside an RDN value when
+it is backslash-escaped — e.g. `cn=Acme\, Inc Operators,ou=groups,dc=example,dc=com`. The
+method splits on the *escaped* comma and returns `Acme\` as the group name. `RoleMapper`
+then matches that mangled name verbatim against configured `LdapGroupMapping.LdapGroupName`
+values, so a group whose CN legitimately contains a comma silently fails to map to its
+role — the user loses that role with no error. The same naive parsing also does not strip
+the trailing backslash escape.
+
+**Recommendation**
+
+Parse the first RDN with an escape-aware scan: ignore a `,` preceded by an unescaped
+`\`, and unescape RFC 4514 sequences in the extracted value. Alternatively use the LDAP
+library's DN-parsing API (`LdapDN` / RDN accessors) rather than hand-rolled `IndexOf`.
+
+**Resolution**
+
+_Unresolved._
+
+### Security-014 — `RefreshToken` re-issues a token without checking the idle timeout
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.Security/JwtTokenService.cs:156-169` |
+
+**Description**
+
+`RefreshToken` carries the existing `LastActivity` claim forward (correct, per the
+Security-007 fix) and unconditionally issues a fresh token with a new 15-minute expiry.
+It never calls `IsIdleTimedOut`. The documented 30-minute idle policy therefore depends
+entirely on the CentralUI request pipeline calling `IsIdleTimedOut` *before* calling
+`RefreshToken` — two independent, order-dependent checks with no enforcement that they
+are used together. If a caller refreshes without first checking idle state (an easy
+mistake, and there is no compiler or API signal preventing it), an idle-expired session
+is silently renewed: each refresh resets the JWT expiry while `LastActivity` keeps
+ageing, so the session can be kept alive indefinitely by background refreshes even
+though the user has been idle for hours. `RefreshToken` is the natural place to enforce
+the invariant but provides no safety net.
+
+**Recommendation**
+
+Have `RefreshToken` itself reject a principal that is already past the idle timeout —
+return `null` (the same "cannot refresh" signal it already uses for missing claims) when
+`IsIdleTimedOut(currentPrincipal)` is true — so the idle policy holds regardless of
+caller discipline. Document that an idle-expired principal cannot be refreshed and add a
+regression test covering refresh of a 31-minute-idle token.
+
+**Resolution**
+
+_Unresolved._
+
+### Security-015 — Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Security/LdapAuthService.cs:20-21`, `:80`, `:122`, `:169`, `:193` |
+
+**Description**
+
+`AuthenticateAsync` rejects null/whitespace-only usernames via `IsNullOrWhiteSpace`, but
+an otherwise-valid username with leading or trailing whitespace (`" alice"`, copy-paste
+artefacts, mobile keyboards) is passed through verbatim. It flows into the LDAP search
+filter `({attr}={EscapeLdapFilter(username)})`, into the fallback bind DN via
+`BuildFallbackUserDn`, and — most consequentially — into the returned `LdapAuthResult`
+and from there into the JWT `Username` claim. Most LDAP directories ignore surrounding
+whitespace on a search term, so `" alice"` and `"alice"` may both authenticate but yield
+JWT `Username` claims that differ by whitespace. Any downstream identity comparison
+(audit-log `who`, site-scope lookups keyed by username, session de-duplication) then
+sees two distinct identities for the same person.
+
+**Recommendation**
+
+Trim the username once at the top of `AuthenticateAsync` (after the `IsNullOrWhiteSpace`
+guard) and use the trimmed value consistently for the filter, the DN, and the
+`LdapAuthResult`. This normalises the identity that ends up in the JWT claim and audit
+trail.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.SiteEventLogging` |
 | Design doc | `docs/requirements/Component-SiteEventLogging.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 3 |

 ## Summary

@@ -28,16 +28,33 @@ cluster-singleton placement of the handler actor (which can pin to the standby
 node), missing indexes for common query filters, retention/cap purge not enforcing
 the requirement strictly, and several documentation/maintainability issues.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed
+(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions
+hold up under inspection — the background writer, lock-guarded `WithConnection`,
+`auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and
+the concrete-recorder DI wiring are all present and correct at this commit. The
+module source is byte-identical between `39d737e` and current `HEAD`, so this review
+reflects the live code. Three new findings were recorded, all low-to-medium and none
+regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness
+gap left by the SiteEventLogging-005 background-writer rework: when an event cannot
+be persisted because the logger has been disposed, the returned `Task` is completed
+*successfully* rather than faulted, so an `await`-ing caller is told a dropped audit
+event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the
+keyword-search filter (SiteEventLogging-013) and the initial purge running
+synchronously on the host startup thread (SiteEventLogging-014).
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). |
+| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). |
 | 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
 | 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
 | 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. |
 | 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
-| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). |
+| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: initial purge blocks host startup thread (-014). |
 | 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
 | 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
 | 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
@@ -529,3 +546,122 @@ explanatory note added to `AddSiteEventLogging` pointing readers to where the ac
 is actually registered. Documentation/dead-code change only; no regression test was
 added — the change is a method removal verified by the compiler (no callers) and the
 full module suite still passing.
+
+### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` |
+
+**Description**
+
+`LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under
+SiteEventLogging-009), "completes once the event is durably persisted and faults if
+the write fails, so callers that `await` it observe success or failure." Two paths
+break that contract by signalling **success** for an event that was never written:
+
+1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel
+   has been completed because the logger was disposed), the code calls
+   `pending.Completion.TrySetResult()` — completing the `Task` successfully — even
+   though the comment immediately above acknowledges "there is nowhere to persist the
+   event."
+2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has
+   been disposed mid-drain. The code does not inspect the returned `written` flag and
+   unconditionally calls `pending.Completion.TrySetResult()`, again reporting success
+   for an event the comment admits "simply cannot be persisted."
+
+The event log is the site's diagnostic audit trail. A caller that `await`s
+`LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was
+recorded will observe a *successful* completion for an event that was silently
+dropped. This is the same class of defect SiteEventLogging-008 fixed for write
+*errors* — but the disposed-drop path was left reporting false success. The window
+is the disposal/shutdown interval, during which shutdown-related events (graceful
+singleton handover, instance disable) are exactly the events most likely to be
+enqueued and lost.
+
+**Recommendation**
+
+For both paths, fault the `Task` (or complete it with a sentinel failure) instead of
+`TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))`
+— so an `await`-ing caller can distinguish a dropped event from a persisted one.
+Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync`
+and only call `TrySetResult()` when `written` is `true`. Update the XML doc if a
+deliberate "drop silently on shutdown" semantics is chosen instead.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` |
+
+**Description**
+
+The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"`
+and binds it as a parameter. Parameterisation correctly prevents SQL injection, but
+it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the
+user-supplied keyword. A search for a literal `_` (common in event sources and
+identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as
+"match any single character", and a `%` matches any run of characters. The design
+calls keyword search "free-text search on message and source fields ... Useful for
+finding events by script name, alarm name, or error message" — users will reasonably
+expect a literal substring match, so a query for `store_and_forward` silently returns
+events containing `storeXandYforward` and similar false positives. There is no way
+for the caller to search for a literal underscore or percent.
+
+**Recommendation**
+
+Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before
+wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression
+(e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field
+accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the
+design implies.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` |
+
+**Description**
+
+`EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method
+that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await`
+(`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is
+invoked from `StartAsync`, and the host's startup pipeline does not proceed past a
+`BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because
+`RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge
+that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a
+near-1 GB database is back under the cap — executes inline on the startup thread,
+blocking host startup (and therefore the `/health/ready` gate) for as long as the
+purge takes. On a site that has accumulated a large log this can be a multi-second
+stall during every node start/failover. The class doc states the service "runs on a
+background thread and does not block event recording" — the startup-thread block is
+inconsistent with that intent.
+
+**Recommendation**
+
+Yield before the initial purge so it runs on the background scheduler rather than the
+startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`,
+or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync`
+(accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.SiteRuntime` |
 | Design doc | `docs/requirements/Component-SiteRuntime.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 3 |

 ## Summary

@@ -28,6 +28,24 @@ in a comment but ships it anyway). Test coverage exists for the coordinator acto
 persistence and scripting, but the short-lived execution actors, the replication
 actor, and the repositories are untested.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+The module was re-reviewed at commit `39d737e`. No source under
+`src/ScadaLink.SiteRuntime` has changed since the previous review at `9c60592`
+(the only intervening commits are code-review documentation updates), so all of
+SiteRuntime-001..013, 015, 016 remain Resolved and SiteRuntime-014 remains
+Deferred — its Deferred justification (a trigger-evaluation concurrency design
+decision is required before either recommended fix can land in-module) still
+holds verbatim against the unchanged `ScriptActor`/`AlarmActor` source. The
+re-review nonetheless worked through all 10 checklist categories afresh and
+surfaced three new findings that the prior pass did not record: a cross-thread
+`Dictionary` enumeration race when the Instance Actor's live `_attributes`
+dictionary is handed by reference into child `ScriptActor`/`AlarmActor`
+constructors (SiteRuntime-017, Medium); a stale `ScriptExecutionActor` XML doc
+that still claims a "dedicated blocking I/O dispatcher" (SiteRuntime-018, Low);
+and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager
+never routes to (SiteRuntime-019, Low). Open findings: 3.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -733,3 +751,126 @@ harness is a larger test-infrastructure task tracked separately and out of scope
 Low-severity coverage finding; the highest-value untested paths the finding called out
 (script timeout/failure/reply/self-stop) are now covered. Full module suite: 192 tests
 green.
+
+### SiteRuntime-017 — Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:625`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:675`, `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:83`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:93` |
+
+**Description**
+
+`InstanceActor.CreateChildActors` passes the Instance Actor's own mutable
+`_attributes` field (a plain `Dictionary<string, object?>`) by reference into the
+`Props.Create(...)` factory for every `ScriptActor` and `AlarmActor` (as the
+`initialAttributes` constructor argument). Each child constructor then iterates
+that dictionary to seed its `_attributeSnapshot`:
+
+```csharp
+if (initialAttributes != null)
+    foreach (var kvp in initialAttributes)
+        _attributeSnapshot[kvp.Key] = kvp.Value;
+```
+
+`Context.ActorOf` returns immediately; the child actor's constructor runs later on
+the *child's* mailbox thread. Meanwhile the Instance Actor's `PreStart` returns and
+the Instance Actor begins processing its mailbox — `HandleTagValueUpdate` and
+`HandleAttributeValueChanged` both mutate `_attributes` (`_attributes[...] = ...`).
+A DCL tag update that arrives before a child has finished its constructor copy
+therefore mutates the dictionary on the Instance Actor thread while the child
+thread is enumerating it. `Dictionary<,>` is explicitly not safe for concurrent
+read/write: the enumeration can throw `InvalidOperationException` ("collection was
+modified") — which surfaces as an `ActorInitializationException` and, under the
+Instance Actor's `SupervisorStrategy`, **stops** the child (the strategy returns
+`Stop` for `ActorInitializationException`). The script or alarm is then silently
+absent for the life of the instance. A torn read of an entry is also possible. The
+window is small but deterministically reachable on a busy site at startup/failover
+— exactly the staggered-startup scenario the design is most concerned about.
+
+**Recommendation**
+
+Do not share the live dictionary. Snapshot it on the Instance Actor thread before
+constructing the child — e.g. pass `new Dictionary<string, object?>(_attributes)`
+(or an immutable copy) into each `Props.Create`. The copy is made on the Instance
+Actor thread inside `CreateChildActors`, so it is race-free, and each child gets a
+private dictionary to seed from.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteRuntime-018 — `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher"
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs:17` |
+
+**Description**
+
+The class-level XML summary on `ScriptExecutionActor` states "Runs on a dedicated
+blocking I/O dispatcher." That is not what the code does. SiteRuntime-009 was
+resolved by introducing `ScriptExecutionScheduler` (a bounded dedicated
+`TaskScheduler`); the *actor itself and its mailbox* run on the **default** Akka
+dispatcher, and only the script body runs on the scheduler's threads via
+`Task.Factory.StartNew(..., scheduler)`. The resolution of SiteRuntime-009
+explicitly chose the `TaskScheduler` route *instead of* a HOCON dispatcher and
+even removed the "in production, configure a dedicated dispatcher" comments
+elsewhere — but this stale summary line was missed. A reader is told the actor is
+on a dedicated dispatcher when it is not, which is misleading when reasoning about
+mailbox throughput and thread-pool pressure. (`AlarmExecutionActor` does not carry
+the equivalent claim — its summary only says "Same pattern as ScriptExecutionActor.")
+
+**Recommendation**
+
+Correct the summary to describe the actual model: the actor runs on the default
+dispatcher and the script body is dispatched onto the dedicated
+`ScriptExecutionScheduler` (SiteRuntime-009). Align the wording with the accurate
+comment already present at `ScriptExecutionActor.cs:71-73`.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteRuntime-019 — Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:106`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:113` |
+
+**Description**
+
+`InstanceActor`'s constructor registers `Receive<DisableInstanceCommand>` and
+`Receive<EnableInstanceCommand>` handlers that log and reply with a successful
+`InstanceLifecycleResponse`. These handlers are unreachable. The Deployment Manager
+is the only sender of those commands, and `DeploymentManagerActor.HandleDisable` /
+`HandleEnable` handle the lifecycle entirely themselves — they call
+`Context.Stop(actor)` (disable) or `CreateInstanceActor(...)` (enable) directly and
+reply to the original sender from the Deployment Manager. Neither command is ever
+`Forward`-ed or `Tell`-ed to the Instance Actor. The handlers are dead code, and
+they are actively misleading: a maintainer reading `InstanceActor` would reasonably
+believe disable/enable is partly an Instance-Actor responsibility, and the no-op
+"true" reply implies an instance-side acknowledgement contract that does not exist.
+If a future change *did* route these commands here, the disable handler would do
+nothing useful (it does not stop children or tear down state — Akka does that when
+the parent stops the actor).
+
+**Recommendation**
+
+Remove the two `Receive<...>` registrations and their handler bodies from
+`InstanceActor`, since the Deployment Manager owns the disable/enable lifecycle.
+If the intent is to keep them for a future instance-side hook, add an XML comment
+stating that the Deployment Manager currently handles these and the handlers are a
+reserved placeholder — but removal is preferred.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.StoreAndForward` |
 | Design doc | `docs/requirements/Component-StoreAndForward.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
+| Commit reviewed | `39d737e` |
+| Open findings | 3 (3 Deferred: 002, 011, 012 — see notes) |

 ## Summary

@@ -30,20 +30,45 @@ status set, and untested critical paths (retry-due timing, replication-from-acti
 the actor bridge). None of the findings are blockers for compilation, but the
 replication and retry-count issues are functional defects against the design.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+Re-reviewed at commit `39d737e` after the batch-3 fixes. All of findings 001 and
+003–010, plus 013–014, are confirmed `Resolved` against the current source: the
+replication wiring (`BufferAsync`/`ReplicateRemove`/`ReplicatePark`), the corrected
+retry-count semantics, the conditional `UpdateMessageIfStatusAsync` writes, the
+transactioned parked-message reads, the `PipeTo` refactor, the `RaiseActivity`
+hardening, the `RetryParkedMessageAsync` `last_attempt_at` reset and the database
+directory creation are all present as described. Findings 002, 011 and 012 were
+re-verified and remain validly `Deferred` — their preconditions are unchanged (002's
+residual no-handler gap, 011's Commons-owned enum, 012's Commons-owned entity placement).
+
+This pass surfaced **three new findings**. StoreAndForward-015 records the
+StoreAndForward side of the cross-module `MaxRetries == 0` ambiguity flagged by
+ExternalSystemGateway-015: `EnqueueAsync`'s public contract documents `maxRetries` only
+as "parked once `MaxRetries` is reached" and never states the `0 = no limit / retry
+forever` special case that `RetryMessageAsync` actually enforces, so an ESG caller
+passing `0` to mean "never retry" gets the opposite behaviour with no warning from the
+S&F API surface. StoreAndForward-016 records that operator-initiated parked-message
+retry and discard are not replicated to the standby — only the add/remove/park sweep
+paths are — so a failover diverges the standby buffer from the active one.
+StoreAndForward-017 records that the Retry/Discard activity-log entries hard-code the
+`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
+the site event log.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). |
+| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010); Retry/Discard activity log hard-codes the category (017). |
 | 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. |
 | 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). |
-| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). |
+| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002); operator retry/discard not replicated to standby (016). |
 | 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
 | 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
 | 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. |
 | 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). |
 | 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). |
-| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004). |
+| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004); `EnqueueAsync` never documents the `maxRetries == 0` = "retry forever" special case (015). |

 ## Findings

@@ -703,3 +728,158 @@ and bare filenames are skipped). Regression test
 `InitializeAsync_FileInMissingDirectory_CreatesDirectory` fails against the pre-fix code;
 all six `SiteActorPathTests` now pass. Fixed by the commit whose message references
 `StoreAndForward-014`.
+
+### StoreAndForward-015 — `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever"
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:114`–`:130`, `:285` |
+
+**Description**
+
+The re-review brief asks for the StoreAndForward side of the cross-module ambiguity
+recorded as `ExternalSystemGateway-015`. The semantics are split across this module and
+its callers, and the StoreAndForward side carries a genuine documentation/API-contract
+fault:
+
+- `RetryMessageAsync` parks a message only when `message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
+  (`StoreAndForwardService.cs:285`). When `MaxRetries == 0` the guard is false on every
+  sweep, so a `0` value means **"no limit — retry forever"**. The
+  `StoreAndForwardMessage.MaxRetries` XML doc (`StoreAndForwardMessage.cs:31`) does state
+  `"0 = no limit"`, so the persistence model is internally consistent.
+- But `EnqueueAsync` — the *only* public entry point into the engine — exposes a
+  `maxRetries` parameter (`StoreAndForwardService.cs:128`) with **no parameter
+  documentation at all**, and its method summary (lines 114–122) describes the lifecycle
+  only as "On max retries → park" / "parked once `MaxRetries` is reached" (see also the
+  `_deliveryHandlers` field doc, line 50–51). Nothing on the public surface tells a
+  caller that passing `0` flips the meaning from "park immediately / never retry" to
+  "retry forever". A caller reading only `EnqueueAsync` would reasonably assume `0`
+  retries means zero retries.
+- This is exactly the trap ESG fell into: `ExternalSystemClient.CachedCallAsync` /
+  `DatabaseGateway.CachedWriteAsync` pass the source entity's `MaxRetries` verbatim,
+  intending `0` to mean "never retry", and instead get unbounded retry — the
+  duplicate-delivery / unbounded-buffer-growth hazard the design doc's idempotency note
+  warns against. The fault is not solely ESG's: the S&F public API silently overloads
+  `0` with the opposite of its natural reading and does not document it.
+
+The defect is in this module's API contract and documentation, so it is recorded here
+in addition to `ExternalSystemGateway-015`. (Whether `0` *should* mean "no limit" or
+"no retry" is the cross-module design decision tracked by ESG-015; this finding is
+specifically that the StoreAndForward public surface fails to document whichever
+meaning is chosen.)
+
+**Recommendation**
+
+Document the `maxRetries` parameter on `EnqueueAsync` explicitly with a `<param>` tag
+that states the `0` special case in the same words as `StoreAndForwardMessage.MaxRetries`
+(`"0 = no limit — retried on every sweep until delivered, never parked"`), and add the
+`0` case to the method summary's lifecycle description. Better still — and consistent
+with the resolution of ESG-015 — make the engine reject the ambiguity at the API: accept
+a nullable/`enum` retry policy, or treat `0` as an explicit "no retry" (do not buffer, or
+park on the first sweep) so the natural reading and the behaviour agree. Either way the
+public `EnqueueAsync` contract must state the chosen meaning; today it states nothing.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-016 — Operator-initiated parked-message retry and discard are not replicated to the standby
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:339`–`:362`; `src/ScadaLink.StoreAndForward/ReplicationService.cs:131`–`:136` |
+
+**Description**
+
+`StoreAndForward-001`'s fix wired replication into the active *delivery* paths:
+`BufferAsync` replicates an `Add`, a successful retry replicates a `Remove`, and a park
+replicates a `Park`. But the two *operator* paths — `RetryParkedMessageAsync` (line 339)
+and `DiscardParkedMessageAsync` (line 353) — change buffer state and never touch
+`_replication`:
+
+- `RetryParkedMessageAsync` flips a row from `Parked` back to `Pending` (and clears
+  `retry_count` / `last_attempt_at`) in the local SQLite only. The standby's copy stays
+  `Parked`.
+- `DiscardParkedMessageAsync` `DELETE`s the row from the local SQLite only. The standby's
+  copy is left in place, still `Parked`.
+
+The Component design doc ("Persistence") requires the active node to forward "each
+buffer operation (add, remove, park)" so that on failover "the new active node has a
+near-complete copy of the buffer." An operator retrying a parked message is a buffer
+state change; an operator discarding one is a removal. After a failover that follows an
+operator action:
+
+1. A **discarded** message reappears on the new active node — it is still `Parked`
+   there, so it resurfaces in the central UI's parked-message list and an operator must
+   discard it a second time. For a message deliberately removed (e.g. a known-bad
+   payload) this is a correctness regression of the operator's intent.
+2. A **retried** message is still `Parked` on the new active node, so the operator's
+   "move it back to the queue" action is silently lost across the failover and the
+   message is not re-attempted.
+
+`ReplicationOperationType` only models `Add`/`Remove`/`Park` — there is no operation for
+"un-park / move back to pending", so even a minimal fix needs either a new operation
+type or a re-use of `Add` to overwrite the standby row. This is the same class of defect
+as the now-resolved `StoreAndForward-001`, for the operator paths rather than the sweep
+paths.
+
+**Recommendation**
+
+Replicate both operator actions. `DiscardParkedMessageAsync` should call
+`_replication?.ReplicateRemove(messageId)` after a successful local delete (the existing
+`Remove` op already deletes on the standby). For `RetryParkedMessageAsync`, add a
+`Requeue`/`Unpark` `ReplicationOperationType` whose `ApplyReplicatedOperationAsync` case
+resets the standby row to `Pending` with `retry_count = 0`, or have the method re-load
+the updated message and replicate it as an `Add`-style upsert. Add replication tests for
+both operator paths (the existing `StoreAndForwardReplicationTests` only cover the sweep
+paths).
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-017 — Retry/Discard activity-log entries hard-code the `ExternalSystem` category
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:344`, `:358` |
+
+**Description**
+
+`RetryParkedMessageAsync` and `DiscardParkedMessageAsync` raise an S&F activity
+notification (consumed by Site Event Logging — WP-14) but pass a hard-coded
+`StoreAndForwardCategory.ExternalSystem` as the category argument:
+
+```csharp
+RaiseActivity("Retry",   StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} moved back to queue");
+RaiseActivity("Discard", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} discarded");
+```
+
+Both methods take only a `messageId` and never load the message, so they have no access
+to its real category. When an operator retries or discards a parked **Notification** or
+**CachedDbWrite** message, the site event log records the activity under
+`ExternalSystem`. Every other `RaiseActivity` call in the service passes the message's
+true `Category` (`EnqueueAsync`, `RetryMessageAsync`), so the operator paths are
+inconsistent and produce mislabelled audit entries — misleading when an operator later
+filters or reviews S&F activity by category.
+
+**Recommendation**
+
+Load the message (or have `StoreAndForwardStorage.RetryParkedMessageAsync` /
+`DiscardParkedMessageAsync` return the affected row's category) and pass the real
+`Category` to `RaiseActivity`. If loading the row is considered too costly on these
+infrequent operator paths, change the `OnActivity` event / `RaiseActivity` signature to
+allow a nullable category for management actions rather than asserting a false one.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.TemplateEngine` |
 | Design doc | `docs/requirements/Component-TemplateEngine.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 2 |

 ## Summary

@@ -29,11 +29,30 @@ create, optimistic concurrency on instance state) are claimed but not implemente
 Themes: validation that is weaker than the design promises, and asymmetric handling
 of attributes vs. alarms vs. scripts throughout the resolve/flatten/derive paths.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+Re-reviewed the whole module against all ten checklist categories at commit
+`39d737e`. All fourteen prior findings remain closed — the batch-4 fixes
+(`bc88a36`/`804697f` and predecessors) hold up: the recursive composition walk,
+the per-slot alarm override mechanism, the code-region-aware delimiter scanner,
+and the single-source deletion-constraint logic are all correctly in place. Two
+new Medium findings surfaced, both in the composition-cascade path and both
+affecting **nested** (depth ≥ 2) compositions specifically — the same blind spot
+that produced TemplateEngine-001. **TemplateEngine-015**: `RenameCompositionAsync`
+renames only the directly slot-owned derived template, leaving cascaded inner
+derived templates with a stale dotted-path name. **TemplateEngine-016**:
+`FlatteningService` hard-codes `ScriptScope.ParentPath` to the empty string for
+every composed script regardless of nesting depth, so a script two or more
+levels deep cannot resolve `Parent.X` references to its real parent module.
+Both are limited-impact (nested compositions are the less common case and there
+is design-time visibility) but represent genuine drift from the recursive-nesting
+design promise.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ✓ | Multiple real bugs: deep composed-member loss, derived alarms omitted, granularity bypass, no-op create-time collision block. |
+| 1 | Correctness & logic bugs | ✓ | Prior bugs (001–005, 013) all resolved and verified. Re-review 2026-05-17 found two new nested-composition defects: rename does not cascade (TemplateEngine-015), composed-script `ParentPath` always empty (TemplateEngine-016). |
 | 2 | Akka.NET conventions | ✓ | No actors in this module (`AddTemplateEngineActors` is an empty placeholder). Nothing to assess. |
 | 3 | Concurrency & thread safety | ✓ | Services are stateless, scoped per request; static helpers hold no mutable state. Design says template editing is last-write-wins; that is honoured. See TemplateEngine-010 re: a doc claim of optimistic concurrency that is not implemented. |
 | 4 | Error handling & resilience | ✓ | `Result<T>` used consistently; repository nulls guarded. `FlatteningService` wraps in try/catch. No store-and-forward or failover surface in this module. |
@@ -648,3 +667,102 @@ reports all blocking reasons and uses `TemplateDeletionService`'s phrasing — t
 affected `TemplateServiceTests` delete tests were updated to the unified messages,
 and a regression test `DeleteTemplate_MultipleConstraints_ReportsAllNotJustFirst`
 verifies all three constraint categories are surfaced together.
+
+### TemplateEngine-015 — `RenameCompositionAsync` does not cascade-rename nested derived templates
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:680` |
+
+**Description**
+
+`AddCompositionAsync` builds a cascade of derived templates whose names follow a
+dotted path: composing `$Sensor` (which itself composes `$Probe` as `Probe1`)
+into `$Pump` as `TempSensor` produces `$Pump.TempSensor` **and** the nested
+`$Pump.TempSensor.Probe1` (see `CreateCascadedCompositionAsync` and the
+`AddComposition_CascadesChildCompositions` test). `RenameCompositionAsync`,
+however, renames only the **directly** slot-owned derived template:
+
+```csharp
+var derived = await _repository.GetTemplateByIdAsync(composition.ComposedTemplateId, ...);
+if (derived != null && derived.IsDerived && derived.OwnerCompositionId == compositionId)
+{
+    var newDerivedName = $"{owner.Name}.{newInstanceName}";
+    ...
+    derived.Name = newDerivedName;
+    await _repository.UpdateTemplateAsync(derived, ...);
+}
+```
+
+There is no recursion into `derived.Compositions`. After renaming the `TempSensor`
+slot to `MainSensor`, the parent derived becomes `$Pump.MainSensor` but the
+cascaded child stays `$Pump.TempSensor.Probe1` — its name no longer reflects the
+slot path it lives under, breaking the dotted-path naming invariant the cascade
+otherwise maintains. `DeleteCompositionAsync` correctly recurses
+(`CascadeDeleteDerivedAsync`), so rename is the asymmetric outlier. The
+`RenameComposition_RenamesSlotAndDerivedTemplate` test only exercises a
+single-level derived, so the gap is untested. The stale name also breaks the
+`AddComposition_DerivedNameCollision_Fails` / cascade-name pre-check on any
+subsequent compose that walks the now-inconsistent name tree.
+
+**Recommendation**
+
+Recurse over `derived.Compositions` (mirroring `CascadeDeleteDerivedAsync`),
+re-deriving each cascaded child's name from the renamed parent
+(`$"{parentDerivedName}.{childComposition.InstanceName}"`), and run the
+existing same-name collision pre-check across every name the cascade will
+produce — not just the top-level one. Add a regression test covering a
+two-level cascade rename.
+
+**Resolution**
+
+_Unresolved._
+
+### TemplateEngine-016 — Composed-script `ScriptScope.ParentPath` is always empty, breaking `Parent.X` resolution for nested modules
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:750` |
+
+**Description**
+
+`ResolveComposedScriptsRecursive` assigns each composed script a `ScriptScope`:
+
+```csharp
+Scope = new Commons.Types.Scripts.ScriptScope(SelfPath: prefix, ParentPath: "")
+```
+
+`prefix` is the accumulated path-qualified module path (`Outer` at depth 1,
+`Outer.Inner` at depth 2, etc.), so `SelfPath` is correct. `ParentPath`, however,
+is hard-coded to the empty string at every depth. Per `ScriptScope`'s own XML
+doc, `ParentPath` is "computed at flattening time and seeded into the script's
+globals … so `Attributes["X"]` / `Parent.X` can prepend the right path-prefix."
+For a script directly composed at depth 1 the parent is the root and `""` is
+correct, but for a script in a nested module (`Outer.Inner.Foo`) the parent
+module is `Outer` — yet `ParentPath` is still `""`. A nested composed script
+that references `Parent.X` will therefore resolve the reference against the root
+flat namespace instead of its actual parent module, reading the wrong attribute
+(or failing to find one). This is the same depth-≥2 nesting blind spot as
+TemplateEngine-001; the recursive walk was added there but the `Scope`
+construction was not updated to carry the parent path. `ResolveComposedScripts`
+for direct (root-template) scripts leaves `Scope` at the default `ScriptScope.Root`,
+which is correct.
+
+**Recommendation**
+
+Thread the parent module path through `ResolveComposedScriptsRecursive` (the
+caller already knows it — it is the `prefix` of the enclosing recursion frame,
+or `""` for a depth-1 composition) and set
+`ParentPath` to that value, so `SelfPath = "Outer.Inner"` pairs with
+`ParentPath = "Outer"`. Add a flattening test asserting the `Scope` of a
+two-level composed script.
+
+**Resolution**
+
+_Unresolved._
Author	SHA1	Message	Date
Joseph Doherty	0ba4e49e11	docs(code-reviews): re-review batch 4 at `39d737e` — SiteEventLogging, SiteRuntime, StoreAndForward, TemplateEngine 11 new findings: SiteEventLogging-012..014, SiteRuntime-017..019, StoreAndForward-015..017, TemplateEngine-015..016.	2026-05-17 00:51:58 -04:00
Joseph Doherty	3b3760f026	docs(code-reviews): re-review batch 3 at `39d737e` — Host, InboundAPI, ManagementService, NotificationService, Security 21 new findings: Host-012..015, InboundAPI-014..017, ManagementService-014..017, NotificationService-014..018, Security-012..015.	2026-05-17 00:48:25 -04:00
Joseph Doherty	89636e2bbf	docs(code-reviews): re-review batch 2 at `39d737e` — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring 17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.	2026-05-17 00:45:10 -04:00
Joseph Doherty	e49846603e	docs(code-reviews): re-review batch 1 at `39d737e` — CentralUI, CLI, ClusterInfrastructure, Commons, Communication 17 new findings: CentralUI-020..025, CLI-014..016, ClusterInfrastructure-009..010, Commons-013..014, Communication-012..015.	2026-05-17 00:41:21 -04:00