From d39089f4eda6a6c1d499355f3d78e64baadac377 Mon Sep 17 00:00:00 2001
From: Joseph Doherty <dohejw01@gmail.com>
Date: Sat, 20 Jun 2026 18:02:32 -0400
Subject: [PATCH] =?UTF-8?q?docs(code-review):=20full=20review=20at=204307c?=
 =?UTF-8?q?381=20=E2=80=94=2018=20modules,=2067=20findings=20recorded=20+?=
 =?UTF-8?q?=20remediation=20tracked?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
---
 code-reviews/AuditLog/findings.md             | 229 +++++++++-
 .../ClusterInfrastructure/findings.md         | 101 ++++-
 code-reviews/Communication/findings.md        | 140 +++++-
 code-reviews/DataConnectionLayer/findings.md  | 243 ++++++++++-
 code-reviews/DeploymentManager/findings.md    | 208 ++++++++-
 .../ExternalSystemGateway/findings.md         | 199 ++++++++-
 code-reviews/HealthMonitoring/findings.md     | 152 ++++++-
 code-reviews/Host/findings.md                 | 206 ++++++++-
 code-reviews/InboundAPI/findings.md           | 232 +++++++++-
 code-reviews/KpiHistory/findings.md           | 307 +++++++++++++
 code-reviews/NotificationService/findings.md  | 152 ++++++-
 code-reviews/README.md                        |  61 +--
 code-reviews/ScriptAnalysis/findings.md       | 407 ++++++++++++++++++
 code-reviews/Security/findings.md             | 225 +++++++++-
 code-reviews/SiteCallAudit/findings.md        | 241 ++++++++++-
 code-reviews/SiteEventLogging/findings.md     | 298 ++++++++++++-
 code-reviews/SiteRuntime/findings.md          | 274 +++++++++++-
 code-reviews/StoreAndForward/findings.md      | 222 +++++++++-
 code-reviews/TemplateEngine/findings.md       | 203 ++++++++-
 19 files changed, 4031 insertions(+), 69 deletions(-)
 create mode 100644 code-reviews/KpiHistory/findings.md
 create mode 100644 code-reviews/ScriptAnalysis/findings.md

diff --git a/code-reviews/AuditLog/findings.md b/code-reviews/AuditLog/findings.md
index a871b435..8a86b5f0 100644
--- a/code-reviews/AuditLog/findings.md
+++ b/code-reviews/AuditLog/findings.md
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.AuditLog` |
 | Design doc | `docs/requirements/Component-AuditLog.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |
 
 ## Summary
@@ -59,6 +59,231 @@ chain doesn't reject a central composition root that mistakenly calls the site b
 
 ## Findings
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Since `1eb6e97` the module was renamed (`ScadaLink → ZB.MOM.WW.ScadaBridge`) and substantially
+re-architected: the site SQLite store is now a two-table canonical design (`audit_event`
+append-only + `audit_forward_state` sidecar with a precomputed `IsCachedKind`); the payload
+filter became the `IAuditRedactor` seam (`ScadaBridgeAuditRedactor` + `SafeDefaultAuditRedactor`
++ stateless `AuditRedactionPrimitives`); and milestone work landed (M5.3 T7 inbound response
+capture, M5.5 T3 per-channel retention, K8 KPI source, central health snapshot). The module
+remains one of the best-engineered in the tree: the AuditLog-001..011 fixes from the last pass
+are all present and intact (dual read connection, async scopes, per-EventId reconciliation
+retry escape valve, thread-pool-hop `Dispose`, lifecycle CTS), the best-effort "never abort the
+action" contract is honoured throughout, and redaction over-redacts on fault. This pass found
+**no Critical/High** issues; two design-adherence drifts where documented per-target/purge
+config knobs are silently inert, one error-handling asymmetry that lets a transient DI fault
+restart the ingest singleton, and two Low items. Five new Open findings (AuditLog-012..016).
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | Yes | `IsErrorRow` case-sensitive parse is correct (Status is `enum.ToString()` PascalCase). `IsCachedKind` parse falls back to `InboundRequest` (non-cached) on bad JSON — safe. `OnIngestAsync` scope-resolution throw is uncaught (AuditLog-014). |
+| 2 | Akka.NET conventions | Yes | Sender/EventStream captured before first await across all three central actors; Tell for hot path, Ask only at the gRPC/ClusterClient boundary; supervisor-strategy comments now correctly credit per-row catch (AuditLog-002 fix intact). No findings. |
+| 3 | Concurrency & thread safety | Yes | WAL dual-connection + `_readLock`/`_writeLock` split intact; `Interlocked` counters; `_drainGate` serialises ring drain. `WriteAsync` fast path ignores `ct` when `TryWrite` succeeds (AuditLog-015). |
+| 4 | Error handling & resilience | Yes | Best-effort contract honoured everywhere; gRPC pull collapses tolerable faults to empty; reconciliation retry escape valve (AuditLog-004) intact. `OnIngestAsync` lacks the scope-creation try/catch its sibling `OnCachedTelemetryAsync` has (AuditLog-014). |
+| 5 | Security | Yes | Header/body/SQL-param redaction + over-redact-on-fault safety net solid; `SafeDefaultAuditRedactor` fallback wired at all three writer sites (AuditLog-008 fix intact); SQL built with bound parameters only. No findings. |
+| 6 | Performance & resource management | Yes | Hot path batched + back-pressured; backlog scan off the write lock; partition switch metadata-only; per-channel DELETE batched + clamped; gRPC channel cache race-safe. No findings. |
+| 7 | Design-document adherence | Yes | Per-target `CapBytes` override is documented + has a config property but is read NOWHERE (AuditLog-012); purge config section/key drift from the doc's `AuditLogPurge`/`ChannelPurgeBatchSize` (AuditLog-013). Combined-telemetry transport (AuditLog-001) now wired. `SkipBodyCapture` honoured in InboundAPI middleware (out of module). |
+| 8 | Code organization & conventions | Yes | Composition root well-segmented with "safe from any root" invariant; `INodeIdentityProvider` standardised on `GetRequiredService` (AuditLog-007 fix intact); idempotency sentinel on the health bridge (AuditLog-011 fix intact). No new findings. |
+| 9 | Testing coverage | Yes | Broad (~13k lines). Gaps: no test asserts per-target `CapBytes` is applied (because it isn't — AuditLog-012), and no test binds `AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc (would catch AuditLog-013). |
+| 10 | Documentation & comments | Yes | Mostly excellent and accurate. `AuditLogIngestActor` class XML has a duplicated "the central-side" phrase and several stale "Bundle X" milestone tags that mislead (AuditLog-016). |
+
+### AuditLog-012 — Per-target `CapBytes` override is documented and bindable but never read by the redactor
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Redaction/ScadaBridgeAuditRedactor.cs:185-196`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/PerTargetRedactionOverride.cs:13` |
+
+**Description**
+
+The design (Component-AuditLog.md §"Payload Capture Policy" / §"Configuration") documents a
+per-target payload-cap override and gives it as a worked example:
+`"PerTargetOverrides": { "Weather/GetForecast": { "CapBytes": 4096 } }`. The
+`PerTargetRedactionOverride.CapBytes` property exists and its XML doc promises "Optional payload
+cap override (bytes); null inherits the global cap." But `ScadaBridgeAuditRedactor.SelectCap`
+chooses among `InboundMaxBytes` / `ErrorCapBytes` / `DefaultCapBytes` only — it never consults
+`opts.PerTargetOverrides[target].CapBytes`. A repo-wide grep confirms `CapBytes` (the per-target
+property) is read in zero production code paths. An operator who sets a per-target cap to bound a
+chatty target's payloads gets the global `DefaultCapBytes`/`ErrorCapBytes` instead, with no error
+and no log — the knob is silently inert. This is a code-vs-spec drift: either the feature was
+dropped in the redactor re-architecture (C2/C3) without updating the doc + override XML, or it
+was never implemented. The `AdditionalBodyRedactors` and `RedactSqlParamsMatching` per-target
+fields ARE honoured, which makes the silent `CapBytes` omission especially easy to miss.
+
+**Recommendation**
+
+Either (a) implement it: in `SelectCap`, when `category != ApiInbound` and the resolved
+`PerTargetOverrides` entry for `rawEvent.Target` has a non-null `CapBytes`, use
+`Math.Min`/`Max(over.CapBytes.Value, …)` per the intended semantics (the doc says it *overrides*
+the global cap — clarify whether error rows still get the larger of the two), and add a
+`ScadaBridgeAuditRedactorTests` case asserting a per-target cap shortens a payload the global cap
+would not; or (b) if the feature is intentionally dropped, delete `PerTargetRedactionOverride.CapBytes`,
+remove the `CapBytes` example from the Component-AuditLog.md Configuration block, and note the
+removal in the design doc so the override surface stops advertising a no-op.
+
+**Resolution**
+
+Deferred 2026-06-20: whether to implement per-target `CapBytes` in `SelectCap` or delete the unused property + doc example is a design-owner decision (was it intentionally dropped in the redactor re-architecture, or lost?). Recorded; no change this pass.
+
+### AuditLog-013 — Purge config section + key drift: documented `AuditLogPurge`/`ChannelPurgeBatchSize` does not bind
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/ServiceCollectionExtensions.cs:56,370-371`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeOptions.cs:40,47` |
+
+**Description**
+
+Component-AuditLog.md §"Retention & Purge" / §"Configuration" documents the purge tuning as a
+top-level section named `AuditLogPurge` with keys `IntervalHours` and `ChannelPurgeBatchSize`:
+
+```jsonc
+"AuditLogPurge": { "IntervalHours": 24, "ChannelPurgeBatchSize": 5000 }
+```
+
+The code does not match on either axis. (1) `ServiceCollectionExtensions.PurgeSectionName` is
+`"AuditLog:Purge"` — a nested subsection, not the documented top-level `AuditLogPurge`. (2) The
+batch-size property is `AuditLogPurgeOptions.ChannelPurgeBatchSizeConfigured`, so its bind key is
+`ChannelPurgeBatchSizeConfigured`, not the documented `ChannelPurgeBatchSize` (which is a
+computed read-only property the binder ignores). There is no `[ConfigurationKeyName]` attribute
+to reconcile the names. The net effect: an operator who follows the design doc and writes
+`"AuditLogPurge": { "ChannelPurgeBatchSize": 1000 }` has BOTH values silently ignored — the
+wrong section path means `IntervalHours` falls back to the 24 h default, and the property-name
+mismatch means the channel batch size falls back to 5000. The actor still functions on defaults,
+so nothing crashes, but the documented operator-facing tuning surface is inert. No test binds
+`AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc, so the drift is unguarded.
+(The §407 doc text says `AuditLogPurge:ChannelPurgeBatchSize`, agreeing with the JSON block and
+disagreeing with the code on both section and key.)
+
+**Recommendation**
+
+Pick one canonical shape and make code, doc, and a binding test agree. Recommended: keep the
+nested `AuditLog:Purge` section (consistent with `AuditLog:SiteWriter` / `AuditLog:SiteTelemetry`
+/ `AuditLog:Reconciliation`), but then fix the doc's JSON block + §407 to read
+`"AuditLog": { "Purge": { "IntervalHours": …, "ChannelPurgeBatchSize": … } }`, AND add
+`[ConfigurationKeyName("ChannelPurgeBatchSize")]` to `ChannelPurgeBatchSizeConfigured` (or rename
+it and move the clamp into the `Interval`-style computed getter pattern) so the documented key
+binds. Add an `AuditLogOptionsBindingTests`-style test that binds the purge section from an
+in-memory config and asserts both `IntervalHours` and the channel batch size take effect.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added `[ConfigurationKeyName("ChannelPurgeBatchSize")]` so the documented key binds, reconciled the design doc to the real nested `AuditLog:Purge` section, and added a binding test proving the purge options bind from the documented keys.
+
+### AuditLog-014 — `AuditLogIngestActor.OnIngestAsync` does not guard scope/repository resolution, so a transient DI fault restarts the singleton and drops the reply
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:135-148` |
+
+**Description**
+
+`OnIngestAsync` opens `await using var scope = _serviceProvider!.CreateAsyncScope();` and
+`scope.ServiceProvider.GetRequiredService<IAuditLogRepository>()` with NO surrounding try/catch.
+The per-row try/catch lives inside `IngestWithRepositoryAsync`, so it only protects the insert
+loop — a throw from scope creation or from `GetRequiredService` (a `DbContext` factory throw on
+transient SQL-connection exhaustion, a pooled-context init fault, a DI resolution race during
+host churn) propagates straight out of the message handler. When a `ReceiveActor`'s async handler
+faults, Akka applies the parent's supervision and restarts the singleton; the captured
+`replyTo.Tell(...)` at line 147 never runs, so the site's `Ask<IngestAuditEventsReply>` times out
+and (correctly) retries — but the singleton has been bounced over a transient fault. The class
+XML even claims "that per-row catch is what keeps this actor alive across handler throws" — which
+is true for insert throws but false for the scope-resolution throw this handler leaves uncaught.
+The sibling `OnCachedTelemetryAsync` (line 216) wraps its entire scope-creation-plus-loop body in
+a try/catch and logs + replies-with-partial precisely to avoid this; `OnIngestAsync` is the
+asymmetric one. `AuditLogPurgeActor.OnTickAsync` and `SiteAuditReconciliationActor.OnTickAsync`
+both guard their `GetRequiredService` with try/catch too — `OnIngestAsync` is the outlier.
+
+**Recommendation**
+
+Wrap the scope-creation + repository-resolution in `OnIngestAsync` in a try/catch that mirrors
+`OnCachedTelemetryAsync`: on failure, log the resolution fault and `replyTo.Tell(new
+IngestAuditEventsReply(accepted))` (accepted will be whatever was processed before the throw,
+typically empty) so the site keeps its rows `Pending` and retries on the next drain without
+bouncing the singleton. Optionally bump the `ICentralAuditWriteFailureCounter` on that path too,
+so a sustained DI/connection fault is visible on the dashboard.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `OnIngestAsync` now wraps the scope-creation + repo resolution + ingest in a try/catch that logs, best-effort bumps the write-failure counter, and still `replyTo.Tell(IngestAuditEventsReply(accepted))` — a transient DI/DbContext fault no longer bounces the central ingest singleton or drops the reply (mirrors `OnCachedTelemetryAsync`).
+
+### AuditLog-015 — `SqliteAuditWriter.WriteAsync` ignores the cancellation token on the fast (`TryWrite`) path
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:254-275` |
+| Status | Deferred |
+
+**Description**
+
+`WriteAsync(AuditEvent, CancellationToken ct)` accepts a `ct` but, on the common fast path where
+`_writeQueue.Writer.TryWrite(pending)` succeeds, it returns `pending.Completion.Task` without ever
+observing `ct`. The token is only honoured on the slow path (`WriteSlowPathAsync`, which passes it
+to `WriteAsync`). So a caller that passes an already-cancelled (or soon-cancelled) token still
+enqueues the row and then awaits an uncancellable completion. In practice the hot-path callers
+(`FallbackAuditWriter`, `CachedCallTelemetryForwarder`) thread a real `ct` through, and the audit
+contract is best-effort durable-in-microseconds, so the impact is minor — the row simply gets
+written instead of cancelled. But the signature advertises cancellation it does not deliver on the
+dominant path, which is a latent surprise for any future caller that relies on it (e.g. to bound a
+shutdown drain). The completion `Task` returned is not linked to `ct`, so an awaiter cannot abandon
+the wait either.
+
+**Recommendation**
+
+Either (a) honour the token on the fast path — `ct.ThrowIfCancellationRequested()` before
+`TryWrite`, and return a `ct`-linked wait (e.g. `pending.Completion.Task.WaitAsync(ct)`) so an
+awaiter can abandon a stuck completion; or (b) if cancellation is intentionally a no-op for this
+best-effort hot path, drop the `ct` parameter (or document on the method that it is observed only
+on the back-pressured slow path) so the surface stops advertising behaviour it does not provide.
+
+**Resolution**
+
+Deferred 2026-06-20: whether cancellation should be honoured on the best-effort `TryWrite` hot path (vs. intentionally a no-op) is a design decision. Recorded; no change this pass.
+
+### AuditLog-016 — `AuditLogIngestActor` class XML has a duplicated phrase and stale "Bundle"/`IngestedAtUtc`-column wording
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:13-46` |
+
+**Description**
+
+The `AuditLogIngestActor` class summary reads "Each row is stamped with the central-side **the
+central-side IngestedAtUtc**" — the phrase "the central-side" is duplicated, a copy/paste slip
+that survived the rename. The same summary describes `IngestedAtUtc` as if it were a promoted
+column, but post-C3 it is a `DetailsJson` field stamped via `AuditRowProjection.WithIngestedAtUtc`
+(as the handler bodies and their own inline comments correctly note) — the class-level prose is
+now stale relative to the canonical-record shim. The remarks also still narrate the design in
+"Bundle A / Bundle D / Bundle E" milestone terms (and the two-constructor rationale references
+"Bundle D's tests" / "Bundle E's host wiring"), which no longer map to any current artifact and
+read as archaeology to a new maintainer. None of this is wrong behaviourally, but the duplicated
+phrase plus the column-vs-DetailsJson mismatch is exactly the kind of doc rot that misleads.
+
+**Recommendation**
+
+Fix the duplicated "the central-side the central-side" to a single phrase; reword the
+`IngestedAtUtc` sentence to say it is stamped into `DetailsJson` (matching the handler comments);
+and either drop the "Bundle X" tags in favour of the feature/milestone names used elsewhere in
+the module, or add a one-line legend. A quick sweep of the other central actors' class XML for
+the same "Bundle" archaeology would be worthwhile while in there.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): class XML doc deduped ('the central-side' duplicate removed), `IngestedAtUtc` reworded to 'stamped into DetailsJson (no promoted column)', and the stale Bundle milestone tags replaced with role-neutral wording.
+
 ### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production
 
 | | |
diff --git a/code-reviews/ClusterInfrastructure/findings.md b/code-reviews/ClusterInfrastructure/findings.md
index ee22058a..e1f343d0 100644
--- a/code-reviews/ClusterInfrastructure/findings.md
+++ b/code-reviews/ClusterInfrastructure/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure` |
 | Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 1 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
 
 ## Summary
 
@@ -82,6 +82,32 @@ either did not surface or that have aged into the file:
   but the method now offers nothing — keeping it is API-surface noise that an
   IDE will still suggest via auto-complete.
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+The module remains small and clean — `ClusterOptions`, `ClusterOptionsValidator`,
+and `ServiceCollectionExtensions`, all well-documented and well-tested. The
+split-brain-resolver, seed-node, and failure-detector configuration contract
+matches the design doc, and all fourteen prior findings (CI-001..014) still hold
+as resolved: `DownIfAlone` is now consumed by the Host's HOCON builder, the dead
+`AddClusterInfrastructureActors` surface is gone, and the validator enforces every
+catastrophic value. The only new observation is one Low code-organization item — the
+validator hand-rolls every rule via raw `RequireThat` rather than the base's
+`MinCount`/`OneOf`/`PositiveTimeSpan` primitives (CI-015), a deliberate trade-off
+flagged as a judgment call rather than a clear defect.
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | Validator rules and DI registration are correct. No new defects. |
+| 2 | Akka.NET conventions | ✓ | No actors in this module (legitimate, per CI-001 resolution). Nothing actor-shaped to evaluate. |
+| 3 | Concurrency & thread safety | ✓ | Validator and DI extension remain stateless. No issues. |
+| 4 | Error handling & resilience | ✓ | Validator rejects every catastrophic value the design doc enumerates, now including `DownIfAlone = false` (CI-010) and `SeedNodes.Count < 2` (CI-012). No issues. |
+| 5 | Security | ✓ | No authn/authz surface, no secret handling, no remoting transport configured here. No issues. |
+| 6 | Performance & resource management | ✓ | No resources held; validator allocates a small failure list per call only. No issues. |
+| 7 | Design-document adherence | ✓ | `ClusterOptions` contract complete; SBR/seed/failure-detector config matches the design doc. No new drift. |
+| 8 | Code organization & conventions | ✓ | Options/validator placement and Options pattern correct; CI-014 dead surface removed. New — validator hand-rolls every rule via raw `RequireThat` instead of the `OptionsValidatorBase` primitives `MinCount`/`OneOf`/`PositiveTimeSpan` (CI-015, a judgment call). |
+| 9 | Testing coverage | ✓ | 19 tests across three classes covering options, validator (incl. single-seed and `DownIfAlone=false`), and DI registration. No gaps. |
+| 10 | Documentation & comments | ✓ | XML docs accurate across all source files; CI-013 intent comment present. No issues. |
+
 ## Checklist coverage
 
 Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
@@ -857,3 +883,72 @@ explicitly stating that this project exposes no actor-registration extension
 (actor wiring lives in `ZB.MOM.WW.ScadaBridge.Host`). If the user prefers to keep the
 "fail-fast" trap, mark the method `[Obsolete(true, error: true)]` so the compiler —
 not the runtime — rejects the call.
+
+### ClusterInfrastructure-015 — Validator hand-rolls every rule via raw `RequireThat` instead of the base primitives
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/ClusterOptionsValidator.cs:29-57` |
+
+**Description**
+
+`ClusterOptionsValidator` derives from `OptionsValidatorBase<ClusterOptions>`
+(`ZB.MOM.WW.Configuration`), whose `ValidationBuilder` exposes a set of typed
+validation primitives — `MinCount<T>` ("requires a collection with at least N
+items"), `OneOf` ("requires the value to be one of `allowed`, case-insensitive"),
+`PositiveTimeSpan` ("requires a strictly positive duration"), plus `Required`,
+`Port`, and `HostPort` — alongside the lower-level `RequireThat(bool, message)`
+escape hatch for "custom or cross-field rules". `ClusterOptionsValidator.Validate`
+expresses *every* one of its rules through raw `RequireThat`, even where a
+first-class primitive exists for the exact check:
+
+- the `SeedNodes.Count >= 2` rule (29-33) is what `MinCount<string>(options.SeedNodes, 2, …)` does;
+- the `keep-oldest`-only strategy rule (35-39) is what `OneOf(options.SplitBrainResolverStrategy, AllowedStrategies, …)` does;
+- the three `> TimeSpan.Zero` timing rules (45-52) are each what `PositiveTimeSpan(…)` does.
+
+Only the `MinNrOfMembers == 1` equality check, the cross-field
+`HeartbeatInterval < FailureDetectionThreshold` comparison, and the `DownIfAlone`
+boolean (59-63) genuinely require `RequireThat` (no primitive covers them). Routing
+the three primitive-eligible rules through `RequireThat` makes this validator
+inconsistent with the *intent* of the shared base (the primitives exist precisely so
+common rules read uniformly across the codebase's validators) and risks subtle
+wording drift between modules for identical checks.
+
+This is **explicitly a deliberate trade-off, not a clear win**, and is filed as a
+judgment call rather than a defect. Adopting the primitives would **lose** what makes
+the current messages valuable: each hand-rolled message cites the governing design-doc
+section (`Component-ClusterInfrastructure.md → Node Configuration` / `→ Split-Brain
+Resolution`) and spells out the catastrophic consequence of the misconfiguration
+("would risk a total cluster shutdown on a partition", "blocks the cluster singleton
+after failover and halts all data collection", "the oldest node can run as an isolated
+single-node cluster"). The shared primitives emit terse, generic messages
+("requires a collection with at least 2 items") that cannot carry that
+safety-critical context. The peer `AuditLogOptionsValidator` likewise prefers raw
+`RequireThat` for the same reason, so the current style is at least internally
+consistent.
+
+**Recommendation**
+
+Treat this as a conscious choice between two competing goods rather than assuming a
+fix is warranted:
+
+- **Adopt the primitives** (`MinCount`/`OneOf`/`PositiveTimeSpan` for the three
+  eligible rules; keep `RequireThat` for the equality, cross-field, and boolean
+  checks) if codebase-wide *wording consistency* across validators is the priority —
+  accepting that the design-doc citations and consequence explanations move out of the
+  failure message (e.g. into XML docs or an appended suffix).
+- **Keep as-is** if the *richer, safety-critical failure messages* are the priority
+  for a module whose misconfiguration is cluster-fatal — accepting the minor
+  inconsistency with the base's primitive surface.
+
+Do not assume either direction; decide deliberately. If kept as-is, a one-line code
+comment recording the choice (richer messages over primitive uniformity) would stop a
+future reader from "tidying" the validator into the primitives and silently dropping
+the consequence text.
+
+**Resolution**
+
+Deferred 2026-06-20: this is a deliberate trade-off, not a clear win — the hand-rolled `RequireThat` messages carry design-doc citations and cluster-fatal consequence text that the shared base primitives (`MinCount`/`OneOf`/`PositiveTimeSpan`) cannot, and the peer `AuditLogOptionsValidator` follows the same style. Adopting the primitives would prioritize cross-validator wording consistency over those richer safety-critical messages — a deliberate decision left to the owner. No change this pass.
diff --git a/code-reviews/Communication/findings.md b/code-reviews/Communication/findings.md
index 361cbd1b..3b866bc3 100644
--- a/code-reviews/Communication/findings.md
+++ b/code-reviews/Communication/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.Communication` |
 | Design doc | `docs/requirements/Component-Communication.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 2 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
 
 ## Summary
 
@@ -68,6 +68,41 @@ keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
 (Communication-022). Seven new findings, all Open: one High, one Medium, five
 Low.
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+All prior findings (Communication-001..022) remain closed (Resolved, or Resolved-by-Comm-016).
+This module was renamed `ScadaLink → ZB.MOM.WW.ScadaBridge` since `1eb6e97`, so the
+`git diff 1eb6e97 HEAD` shows the whole module as added; the review re-examined the
+full current state across all 10 categories, with focus on the surface added since the
+last review — the four new gRPC RPCs on `SiteStreamService` (`IngestAuditEvents`,
+`IngestCachedTelemetry`, `PullAuditEvents`, `PullSiteCalls`) and the M7 command/control
+forwards (`SearchAddressSpaceCommand`, `WriteTagRequest`, `VerifyEndpointCommand`,
+cert-trust commands). The module is healthy: actor state stays on the actor thread,
+`PipeTo`/`Forward`/Sender-capture are clean, gRPC channel/stream/CTS lifecycles are
+disciplined, the duplicate-stream and Subscribe-throws cleanup paths are correct, and
+the gRPC ingest/pull handlers degrade best-effort (empty ack / empty response, rows stay
+Pending) on fault. The only new findings are **design-doc drift**: the doc still claims
+`SiteStreamService` has "a single RPC `SubscribeInstance`" and that audit/telemetry
+reconciliation rides ClusterClient, when in fact four RPCs now exist and the
+reconciliation *pull* runs over gRPC (Communication-023); and the AlarmStateUpdate field
+table stops at field 21 while the proto/code carry fields 22–23 (Communication-024). Two
+new findings, both Open: zero Critical/High, zero Medium-code, one Medium-doc, one Low.
+
+## Checklist coverage 2026-06-20
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | `SubscribeInstance` duplicate-removal + Subscribe-throws cleanup (Comm-021) correct; telemetry open/close gauge is balanced (counted only after Subscribe succeeds, decremented in finally). `DebugStreamBridgeActor` stream-first buffer/dedup sound. No new issues. |
+| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies present on both actors; `Forward` preserves Sender on every site relay; `SiteAddressCacheLoaded` now read-only-typed (Comm-020). gRPC ingest handlers capture Sender before `PipeTo`. No new issues. |
+| 3 | Concurrency & thread safety | ✓ | `_activeStreams`/`_subscriptions` use ownership-preserving `TryRemove(KeyValuePair)`; `GetOrCreate` uses `AddOrUpdate` atomically; bridge state mutated only on the actor thread via `Self.Tell`. No new issues. |
+| 4 | Error handling & resilience | ✓ | gRPC `Ingest*`/`Pull*` degrade to empty ack/response on fault (rows stay Pending) — divergent from the ClusterClient path's `Status.Failure` pipe but behaviourally equivalent (both → Pending+retry), and explicitly documented. `LoadSiteAddressesFromDb` lifecycle-CT bounded (Comm-019). No new issues. |
+| 5 | Security | ✓ | `correlation_id` validated against `ActorPath.IsValidPathElement` before actor naming (Comm-014). `Pull*` reject `batch_size <= 0` with `InvalidArgument`. Secrets ride encrypted blocks, not this layer. No new issues. |
+| 6 | Performance & resource management | ✓ | gRPC channel disposed synchronously (Comm-007); per-stream CTS `CancelAfter(GrpcMaxStreamLifetime)`; bounded `Channel` (1000, DropOldest). `RemoveSiteAsync` still test-only but Comm-013 documents that endpoint change is handled by `GetOrCreate`. No new issues. |
+| 7 | Design-document adherence | ✓ | Doc claims a *single* RPC and ClusterClient-only reconciliation; reality is 4 RPCs + gRPC reconciliation pulls (Communication-023). AlarmStateUpdate field table stops at 21; proto/code carry 22–23 (Communication-024). |
+| 8 | Code organization & conventions | ✓ | DTO mappers correctly hosted in Communication (owns generated DTOs); proto evolution is additive (field numbers 22–23 appended, never reused); Options pattern intact. No new issues. |
+| 9 | Testing coverage | ✓ | `ProtoContractTests` guards oneof variants; `StreamRelayActorTests` + `ConvertToDomainEvent` cover enriched fields 8–23 end-to-end; dedicated proto tests cover the audit/telemetry/pull DTOs. `ProtoRoundtripTests.AlarmStateUpdate_RoundTrip` only asserts fields 1–5, but the gap is covered elsewhere — no separate finding. |
+| 10 | Documentation & comments | ✓ | In-code XML docs are thorough and accurate (the prior-finding comments are precise). The drift is in the design doc only (Communication-023/024), not the code comments. |
+
 ## Checklist coverage 2026-05-28
 
 | # | Category | Examined | Notes |
@@ -1120,3 +1155,102 @@ The `_debugSubscriptions` dictionary, `TrackMessageForCleanup` helper, and the
 `HandleConnectionStateChanged` handler that consumed them were all deleted as
 part of Comm-016's dead-code purge. There is no longer any caller-supplied
 correlation-id keyed map to overwrite — the orphan-on-reuse hazard is gone.
+
+---
+
+### Communication-023 — Design doc describes `SiteStreamService` as a single RPC and reconciliation as ClusterClient-only; both are now stale
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `docs/requirements/Component-Communication.md:85`, `docs/requirements/Component-Communication.md:160-166`, `src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto:8-14` |
+
+**Description**
+
+The design doc's "gRPC Proto Definition" section states: *"**Service**: `SiteStreamService`
+with **a single RPC** `SubscribeInstance(InstanceStreamRequest) returns (stream
+SiteStreamEvent)`."* (line 85). The proto and the generated service now expose **five**
+RPCs — `SubscribeInstance`, `IngestAuditEvents`, `IngestCachedTelemetry`,
+`PullAuditEvents`, and `PullSiteCalls` (`sitestream.proto:8-14`,
+`SiteStreamGrpcServer.cs:211/330/386/436/519`). The four new RPCs are not mentioned
+anywhere in the component doc — neither the proto section, the responsibilities, the
+topology diagram, nor the Dependencies (the server now also depends on `ISiteAuditQueue`
+and `IOperationTrackingStore`).
+
+This drift also contradicts two transport statements elsewhere in the same doc. The
+"Cached Call Telemetry" section (line 166) and the "Notification Submission" section
+(line 158) both assert **"Transport: ClusterClient"**, and the reconciliation-pull
+paragraph (line 164) describes `CachedCallReconcileRequest`/`CachedCallReconcileResponse`.
+In the shipped code the reconciliation **pull** runs over **gRPC**, not ClusterClient:
+`GrpcPullAuditEventsClient` / `GrpcPullSiteCallsClient`
+(`src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/`) call the `PullAuditEvents` / `PullSiteCalls`
+RPCs against `SiteStreamGrpcServer`. So the gRPC channel — documented as carrying
+**real-time data only** ("`SiteCommunicationActor` and `CentralCommunicationActor` have no
+role in streaming event delivery", "gRPC server-streaming for real-time data") — now also
+carries site→central audit/telemetry **ingest** and central→site **reconciliation pulls**.
+A reader relying on the doc to understand the two-transport split (the central CLAUDE.md
+"Data & Communication" invariant) would be materially misled about what flows over which
+transport.
+
+**Recommendation**
+
+Update `Component-Communication.md`: (1) replace "a single RPC" with the full RPC list
+and document each new RPC's direction, payload, and idempotency; (2) reconcile the
+"Transport: ClusterClient" / `CachedCallReconcile*` wording in §10 (and the Site Call
+Audit / Audit Log interaction notes) with the gRPC `PullAuditEvents` / `PullSiteCalls`
+reality; (3) add `ISiteAuditQueue` / `IOperationTrackingStore` to the SiteStreamGrpcServer
+description and Dependencies. If the doc's "gRPC = real-time data only" framing is meant
+to be a hard invariant, state explicitly that audit/telemetry ingest and reconciliation
+are the documented exceptions.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the design doc (Component-Communication.md) now
+enumerates all five `SiteStreamService` RPCs (the one server-streaming `SubscribeInstance`
+plus the unary `IngestAuditEvents`/`IngestCachedTelemetry`/`PullAuditEvents`/`PullSiteCalls`),
+reconciles the §10 reconciliation wording with the gRPC-pull reality (push telemetry stays
+ClusterClient; reconciliation pulls are gRPC request/response), and adds `ISiteAuditQueue`/`IOperationTrackingStore`
+to Dependencies. Doc-only.
+
+---
+
+### Communication-024 — AlarmStateUpdate proto carries fields 22–23 but the design-doc field table stops at 21
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `docs/requirements/Component-Communication.md:92-114`, `src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto:87-88` |
+
+**Description**
+
+The "Enriched AlarmStateUpdate" section states the message "was extended **additively**:
+existing fields 1–7 are unchanged, and **fields 8–21** carry the enriched native-alarm
+state" (line 92), and the field table enumerates fields 8–21 (lines 94-109). The proto
+now also defines **field 22 `native_source_canonical_name`** and **field 23
+`is_configured_placeholder`** (`sitestream.proto:87-88`), both populated on the server
+(`StreamRelayActor.HandleAlarmStateChanged`, lines 97-98) and read back on the client
+(`SiteStreamGrpcClient.ConvertToDomainEvent`, lines 257-258). The doc's field table and
+its "fields 8–21" / "repopulated from fields 8–21" prose (lines 92, 112) are stale — they
+omit fields 22–23. `is_configured_placeholder` in particular is load-bearing: it is the
+discriminator `StreamRelayActor` uses to drop snapshot-only placeholder rows before
+packing the live stream (lines 63-66), a behaviour the doc does not capture.
+
+The proto evolution itself is correct (additive, field numbers appended and never reused),
+so this is purely doc staleness, not a contract regression.
+
+**Recommendation**
+
+Extend the AlarmStateUpdate field table to fields 22–23, update the "fields 8–21" prose to
+"8–23" in both the intro and the client-mapping bullet, and add a one-line note that
+`is_configured_placeholder` rows are dropped at the relay (never relayed to the live gRPC
+stream).
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the `AlarmStateUpdate` field table was extended to
+fields 22–23 (`native_source_canonical_name`, `is_configured_placeholder`) with a note that
+placeholder rows are dropped at the relay. Doc-only.
diff --git a/code-reviews/DataConnectionLayer/findings.md b/code-reviews/DataConnectionLayer/findings.md
index 60785634..9eddae7f 100644
--- a/code-reviews/DataConnectionLayer/findings.md
+++ b/code-reviews/DataConnectionLayer/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer` |
 | Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 5 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
 
 ## Summary
 
@@ -116,6 +116,37 @@ DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
 
 ## Findings
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+The 2026-06-20 full re-review walked all 10 checklist categories against the current
+source, focused on the M7 native-alarm subscribe path, the OPC UA node-browser
+(BrowseNext/search/type-info), and the verify-endpoint + site-local cert-trust surface.
+The M7 surface is well-built overall — the verify-endpoint probe correctly *captures
+but never trusts* an untrusted server certificate, the per-node `CertStoreActor`
+broadcast keeps both site nodes' PKI stores consistent across failover, and the
+node-browser paging/search/type-info additions are clean. All 22 prior findings remain
+`Resolved` and their fixes were verified in place. The review found **4 new findings**,
+all clustering on the native-alarm subscribe path, which did **not** inherit the
+guards the tag-subscribe path earned across DCL-018 / DCL-021 / DCL-022: the alarm
+path leaks its adapter feed on a mid-flight unsubscribe (no DCL-021-style obsolete-
+completion guard), has no alarm-resolution retry to match the tag path's
+`tag-resolution-retry` timer, leaks `_alarmCts` on `DisposeAsync` (only
+`DisconnectAsync` tears it down), and carries a stale "first subscriber wins" comment
+on a last-wins filter assignment.
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | x | No new issues; DCL-016/020/021 counter/response fixes verified. Alarm-path snapshot atomic-swap + per-source prefix routing are correct. |
+| 2 | Akka.NET conventions | x | No new issues; DCL-022 `IsTimerActive` gate verified. Alarm subscribe uses `ContinueWith(...).PipeTo(Self)` with `Self`/generation captured (no `Sender`/`this` in closures). |
+| 3 | Concurrency & thread safety | x | Finding 023 — native-alarm mid-flight unsubscribe leaks the adapter alarm feed (the tag path's DCL-021 obsolete-completion guard was not mirrored on the alarm path). |
+| 4 | Error handling & resilience | x | Finding 024 — no alarm-resolution retry: a failed initial alarm subscribe strands the subscriber with no feed until the next full reconnect (the tag path retries on `tag-resolution-retry`). |
+| 5 | Security | x | No new issues. The M7 verify-endpoint probe builds a temporary `RealOpcUaClient`, captures the untrusted server cert, and NEVER trusts it; cert trust is gated through `CertStoreActor` broadcast to both nodes. DCL-012/014 auto-accept warning + Commons default remain out-of-scope follow-ups. |
+| 6 | Performance & resource management | x | Finding 025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts` (alarm task + CTS leak on failover/stop); `DisconnectAsync` already cancels+disposes it under `_alarmLock`. |
+| 7 | Design-document adherence | x | No new issues; M7 native-alarm read-only mirror, `AlarmKind` discriminator, and per-source `IAlarmSubscribableConnection` feed match the design. DCL-009 doc action still open at doc level (out of scope). |
+| 8 | Code organization & conventions | x | No issues — alarm messages/POCOs in Commons, `IAlarmSubscribableConnection` capability seam in the module, options class owned by component. |
+| 9 | Testing coverage | x | DCL001–022 regression tests present. Gaps for findings 023 (alarm unsubscribe mid-flight), 024 (failed alarm subscribe → no retry), 025 (`DisposeAsync` alarm-CTS leak). |
+| 10 | Documentation & comments | x | Finding 028 — the `_alarmSourceFilter` "first subscriber wins" XML comment contradicts the last-wins assignment in `HandleSubscribeAlarms`. |
+
 ### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread
 
 | | |
@@ -1267,3 +1298,209 @@ calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply
 same gate at both call sites. Add a regression test that fires 5 subscribes with
 unresolved tags within one retry interval and asserts the retry fires at most one
 interval after the first failure (not after the fifth subscribe).
+
+### DataConnectionLayer-023 — Native-alarm mid-flight unsubscribe leaks the adapter alarm feed
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1773,1790-1809,1864-1882` |
+
+**Description**
+
+The native-alarm subscribe path mirrors the tag-subscribe path's
+dispatch-I/O-then-`PipeTo(Self)` shape but never inherited the obsolete-completion
+guard the tag path earned in DCL-021. `HandleSubscribeAlarms` adds the source to
+`_alarmSubscribesInFlight` (line 1773) and dispatches
+`alarmable.SubscribeAlarmsAsync(...)` whose `ContinueWith` pipes an
+`AlarmSubscribeCompleted` back to `Self`. If an `UnsubscribeAlarmsRequest` for the
+last (or only) subscriber is processed on the actor thread between that dispatch and
+the completion, `HandleUnsubscribeAlarms` (lines 1864-1882) removes the subscriber,
+empties `_alarmSourceSubscribers`, and removes the filter entries — but it tries to
+tear down the adapter feed via `_alarmSubscriptionIds.Remove(...)`, which **fails**
+because the subscription id has not been stored yet (it is still in flight). It also
+leaves the stale `_alarmSubscribesInFlight` entry in place. The late
+`AlarmSubscribeCompleted` then reaches `HandleAlarmSubscribeCompleted` (lines
+1790-1809), which **unconditionally** stores `_alarmSubscriptionIds[msg.SourceReference]
+= msg.SubscriptionId` (line 1796) without re-checking whether any subscriber still
+exists in `_alarmSourceSubscribers`.
+
+The net result is an orphaned adapter alarm feed: the OPC UA monitored-item set / the
+MxGateway gateway-wide alarm stream stays alive for a source with **zero** subscribers,
+streaming transitions into `HandleAlarmTransitionReceived` that match no subscriber set
+and are silently dropped — the feed is never torn down for the lifetime of the adapter
+(only a full `ReSubscribeAllAlarms` on reconnect clears `_alarmSubscriptionIds`, and
+even that re-subscribes from `_alarmSourceSubscribers`, which no longer contains the
+orphaned source, so the stale device-side feed is simply abandoned, not closed). The
+tag path closes exactly this race in `HandleSubscribeCompleted` (lines 834-867) by
+releasing the just-created adapter handle and clearing `_subscribesInFlight` when the
+instance entry has gone; the alarm path does neither. This matters because native-alarm
+sources are created/destroyed on every deploy/undeploy and instance stop, so each
+mid-flight unsubscribe permanently leaks one device-side alarm subscription and its
+publish traffic.
+
+**Recommendation**
+
+Mirror the closed DCL-021 fix on the alarm path. In `HandleAlarmSubscribeCompleted`,
+before storing the id, guard on the live subscriber set: `if
+(!_alarmSourceSubscribers.ContainsKey(msg.SourceReference)) {
+if (msg.Success && msg.SubscriptionId != null && _adapter is
+IAlarmSubscribableConnection alarmable) _ =
+alarmable.UnsubscribeAlarmsAsync(msg.SubscriptionId); return; }` so a feed that
+completed after its last subscriber left is released at the adapter rather than stored.
+Additionally clear the `_alarmSubscribesInFlight` entry in `HandleUnsubscribeAlarms`
+when the last subscriber leaves, so the in-flight marker does not linger. Add a
+regression test that subscribes a source, sends `UnsubscribeAlarmsRequest` while the
+alarm subscribe I/O is in flight, completes the subscribe, and asserts
+`UnsubscribeAlarmsAsync` is called and `_alarmSubscriptionIds` /
+`_alarmSubscribesInFlight` are clean.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `HandleAlarmSubscribeCompleted` now guards against an orphaned completion (if no live subscriber remains for the source, it tears down the just-created adapter subscription and returns) and `HandleUnsubscribeAlarms` clears the in-flight marker when the last subscriber leaves — mirroring the DCL-021 tag-path fix. Regression test added (verified failing pre-fix).
+
+### DataConnectionLayer-024 — No alarm-resolution retry: a failed initial alarm subscribe strands the subscriber until the next full reconnect
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1752-1757,1790-1809,1885-1908` |
+
+**Description**
+
+When an initial native-alarm subscribe fails, the subscriber is left registered but
+with no feed and no path to recovery short of a full connection reconnect.
+`HandleSubscribeAlarms` registers the subscriber in `_alarmSourceSubscribers` (lines
+1752-1757) **before** issuing the adapter subscribe, so a transition arriving
+mid-subscribe is still routed. If `alarmable.SubscribeAlarmsAsync` fails, the piped
+`AlarmSubscribeCompleted` reaches `HandleAlarmSubscribeCompleted` (lines 1790-1809),
+which on the failure branch only logs a warning (line 1801) and replies a
+`SubscribeAlarmsResponse(success: false, ...)`. It does **not** re-arm any retry, does
+**not** push `NativeAlarmSourceUnavailable` to the subscriber, and does **not** drop
+the source from `_alarmSourceSubscribers`. The subscriber therefore stays in the
+routing map with the adapter feed never established.
+
+Recovery only happens via `ReSubscribeAllAlarms` (lines 1885-1908), which is invoked
+solely from `BecomeConnected` after a full connection reconnect (line 583). So a
+transient device-side failure on the *initial* alarm subscribe — the alarm server
+still booting, a momentary fault — leaves the source dark until the entire connection
+happens to cycle through `Reconnecting`. This diverges from the tag path, which arms a
+periodic `tag-resolution-retry` timer (lines 989-993, 1678-1682; fired by
+`HandleRetryTagResolution` at line 1491) so a tag that fails to resolve at subscribe
+time is retried every `TagResolutionRetryInterval` (10 s default) without needing a
+reconnect. The native-alarm source has no analogous self-healing.
+
+This is partly a **design judgment call** rather than a clear-cut defect: the M7 design
+models native alarms as a read-only mirror, and "drop the source and signal the
+subscriber" is an equally valid contract to "retry the alarm subscribe periodically."
+The defect is that the code currently does **neither** — it silently leaves a
+registered subscriber with a permanently-dark feed and no signal — which is the worst
+of both.
+
+**Recommendation**
+
+Pick one of the two valid contracts and implement it (this is a judgment call — do not
+assume which is wanted):
+
+- **Retry:** add an `alarm-resolution-retry` periodic timer (mirroring
+  `tag-resolution-retry`): track sources whose subscribe failed, re-issue
+  `SubscribeAlarmsAsync` on the timer, and gate the timer start with
+  `IsTimerActive` (per the DCL-022 pattern) so a burst of failed subscribes does not
+  reset it.
+- **Fail fast:** on the failure branch of `HandleAlarmSubscribeCompleted`, push
+  `NativeAlarmSourceUnavailable` to the subscriber(s) for that source and remove the
+  source from `_alarmSourceSubscribers` / `_alarmSubscribesInFlight`, so the
+  `NativeAlarmActor` learns the feed is unavailable rather than waiting forever.
+
+Either way, add a regression test for "initial alarm subscribe fails → subscriber is
+not silently left dark."
+
+**Resolution**
+
+Deferred 2026-06-20: the fix (add an alarm-resolution retry timer vs. push `NativeAlarmSourceUnavailable` and drop the source on initial-subscribe failure) is a design-owner decision; the current code does neither. Awaiting that decision before implementing.
+
+### DataConnectionLayer-025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts`, leaking the alarm task on failover/stop
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/MxGatewayDataConnection.cs:277-283,121-134` |
+
+**Description**
+
+`MxGatewayDataConnection.DisposeAsync` (lines 277-283) cancels `_eventLoopCts` and
+disposes the underlying `_client`, but never touches `_alarmCts`. The alarm feed is
+established in `SubscribeAlarmsAsync` (lines 153-175): under `_alarmLock` it lazily
+creates `_alarmCts = new CancellationTokenSource()` and launches a long-running
+`Task.Run(() => client.RunAlarmStreamAsync(null, ..., token))`. That background alarm
+stream is bound only to `_alarmCts.Token`. `DisconnectAsync` (lines 121-134) already
+tears it down correctly — under `_alarmLock` it cancels and disposes `_alarmCts`,
+nulls it, and resets `_alarmSubCount` — but `DisposeAsync` does not, so the
+`CancellationTokenSource` and the alarm-stream `Task` are both leaked whenever the
+adapter is disposed without a prior `DisconnectAsync`.
+
+The `DataConnectionActor` disposes adapters fire-and-forget on failover (and on
+actor/connection stop) via `_adapter.DisposeAsync()` without necessarily calling
+`DisconnectAsync` first, so every MxGateway failover or connection teardown that goes
+through `DisposeAsync` leaks one running alarm-stream task plus its CTS. Severity is
+Low because the leaked task is cancellation-bound to a CTS that will be GC-reclaimed
+eventually and the gRPC stream it holds will fault when `_client` is disposed, but a
+still-running alarm-stream loop racing a disposed client is precisely the class of
+dangling background work the lock-guarded `DisconnectAsync` block was written to avoid.
+
+**Recommendation**
+
+In `DisposeAsync`, cancel and dispose `_alarmCts` under `_alarmLock` before disposing
+the client — copy the block already present in `DisconnectAsync` (lines 124-130):
+`lock (_alarmLock) { _alarmCts?.Cancel(); _alarmCts?.Dispose(); _alarmCts = null;
+_alarmSubCount = 0; }`. This guarantees the alarm-stream task is cancelled
+deterministically on every teardown path, not just the `DisconnectAsync` one.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `MxGatewayDataConnection.DisposeAsync` now cancels and disposes `_alarmCts` under `_alarmLock`, matching `DisconnectAsync` — no more alarm task/CTS leak on failover/stop.
+
+### DataConnectionLayer-028 — `_alarmSourceFilter` "first subscriber wins" comment contradicts the last-wins code
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:102-103,1758-1761` |
+
+**Description**
+
+The XML doc on the `_alarmSourceFilter` field (lines 102-103) reads "sourceReference →
+raw condition filter string passed to the adapter (**first subscriber wins**)." The
+code does the opposite: `HandleSubscribeAlarms` **unconditionally** overwrites the
+filter on every subscribe — `_alarmSourceFilter[request.SourceReference] =
+request.ConditionFilter;` (line 1758) and likewise re-parses
+`_alarmSourceFilterPredicate[request.SourceReference] =
+AlarmConditionFilter.Parse(request.ConditionFilter)` (line 1761) — with no
+"already present" guard. So a **second** subscriber to the same source reference
+re-filters the shared feed with its own condition filter, i.e. **last subscriber
+wins**, not first. The comment will mislead a maintainer reasoning about which filter
+governs a shared alarm feed when two instances subscribe to the same source with
+different condition filters, and it understates a real behavioural subtlety (the second
+subscriber silently changes the gate applied to the first subscriber's transitions in
+`HandleAlarmTransitionReceived`).
+
+**Recommendation**
+
+Correct the comment to "last subscriber wins" (and note that the shared feed carries a
+single filter, so co-subscribers to one source reference must agree on the condition
+filter). If per-subscriber filtering is actually intended, make the predicate
+per-subscriber rather than per-source so each subscriber's own filter governs only its
+own deliveries — but that is a behaviour change, not a comment fix, and should be
+decided explicitly.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the `_alarmSourceFilter` comment corrected from 'first subscriber wins' to 'last subscriber wins' (and notes co-subscribers share the single filter).
diff --git a/code-reviews/DeploymentManager/findings.md b/code-reviews/DeploymentManager/findings.md
index 8b4e5c96..f2ce0738 100644
--- a/code-reviews/DeploymentManager/findings.md
+++ b/code-reviews/DeploymentManager/findings.md
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager` |
 | Design doc | `docs/requirements/Component-DeploymentManager.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |
 
 ## Summary
@@ -83,8 +83,51 @@ The remaining six findings are medium/low: lifecycle-timeout audit gap
 (DeploymentManager-023), and shared static state across `*ProbeActor` tests
 (DeploymentManager-024).
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Re-reviewed the whole current module at HEAD after the rename, the
+cert-broadcast / Transport `IStaleInstanceProbe` work, and milestone changes.
+DeploymentManager-001..024 all remain `Resolved` and verified against source —
+the ref-counted `OperationLockManager`, the broadened/non-cancellable failure
+writes, the `ApplyPostSuccessSideEffectsAsync` shared helper with
+`forceEnabledState` (Disabled-preservation), the lifecycle-timeout audit helper,
+the structured diff with List-value normalization, the hoisted global artifact
+fetch, and the instance-state-aware reconciliation are all present and correct.
+The two flagged cross-module/architectural seams the prompt called out — the
+`TrustServerCert`/`RemoveServerCert` broadcast-to-both-nodes and the
+`DeploymentManagerActor` deploy-state query handler — live in **SiteRuntime /
+Communication / CentralUI**, not this module, so they are out of scope here.
+This review found **3 new findings**. The material one is DeploymentManager-025:
+the system-wide artifact path still fetches and broadcasts **notification lists
+and SMTP configurations (including SMTP credentials)** to every site, in direct
+contradiction of the now-explicit design decision that these are central-only
+and "no SMTP credential is ever distributed to sites" (Component-DeploymentManager.md
+lines 142-146; CLAUDE.md notification-central-only decision). This supersedes
+the earlier accepted-deployable-artifact framing of the closed
+DeploymentManager-013. DeploymentManager-026 (deployment records are insert-only
+— a new row per deploy accumulates per instance, contradicting "only current
+status stored, no history table", and the same-tick `OrderByDescending(DeployedAt)`
+read has no tiebreaker) and DeploymentManager-027 (artifact tests *assert* the
+forbidden notif/SMTP shipping, cementing the DeploymentManager-025 violation)
+are the remaining two.
+
 ## Checklist coverage
 
+#### Re-review 2026-06-20 (commit `4307c381`)
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | New: deployment records are insert-only — `DeployInstanceAsync` Adds a new row per deploy; reconciliation's `GetCurrentDeploymentStatusAsync` orders by `DeployedAt` with no tiebreaker (DeploymentManager-026). |
+| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. The deploy-state-query/cert-broadcast actors live in SiteRuntime, out of scope. No issues. |
+| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting + gate re-verified; `DeployToAllSitesAsync` prebuilds per-site commands before the parallel phase (no shared DbContext under `Task.WhenAll`). No issues. |
+| 4 | Error handling & resilience | ✓ | Failure-status writes use `CancellationToken.None`; lifecycle timeouts now audit; delete-orphan path surfaced. No new issues. |
+| 5 | Security | ✓ | New: SMTP credentials are still serialized into the per-site artifact command and broadcast to every site, which the current design forbids outright (DeploymentManager-025). |
+| 6 | Performance & resource management | ✓ | Global artifact queries hoisted (DM-023 resolved). Deployment-record row growth is unbounded per instance (part of DeploymentManager-026). |
+| 7 | Design-document adherence | ✓ | New: notification lists + SMTP configs are still treated as deployable artifacts, contradicting the "central-only, never distributed to sites" design (DeploymentManager-025). |
+| 8 | Code organization & conventions | ✓ | Options bound via Host; `OptionsSection` constant correct. No new issues. |
+| 9 | Testing coverage | ✓ | Broad and current. New: artifact tests assert the forbidden notif/SMTP shipping (DeploymentManager-027). |
+| 10 | Documentation & comments | ✓ | `ArtifactDeploymentService` class XML doc still lists notification lists + SMTP as broadcast artifacts (stale vs design — folded into DeploymentManager-025). |
+
 #### Re-review 2026-05-28 (commit `1eb6e97`)
 
 | # | Category | Examined | Notes |
@@ -1234,3 +1277,164 @@ recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg`
 in each test. Where the simpler counter shape is preferred, pass a
 shared-state object into the actor's constructor so each test owns its own
 instance — never reach for `static` mutable test state.
+
+### DeploymentManager-025 — Notification lists and SMTP configurations (with credentials) are still broadcast to every site, contradicting the central-only design
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:13-22,128,130,150-151,181-188` |
+
+**Description**
+
+The design now states explicitly, in two authoritative places, that notification
+lists and SMTP configuration are **not** deployable artifacts:
+
+- `docs/requirements/Component-DeploymentManager.md:142-146`: *"Notification lists
+  and SMTP configuration are **not** deployable artifacts — they are central-only
+  definitions managed by the Notification Service ... Notification delivery happens
+  on the central cluster, so **no notification artifact or SMTP credential is ever
+  distributed to sites**."*
+- CLAUDE.md "External Integrations" decision: *"Notification delivery is
+  central-only ... Notification lists and SMTP config are no longer deployed to
+  sites; recipient resolution happens at central, at delivery time."*
+
+`ArtifactDeploymentService` still does the opposite. `FetchGlobalArtifactsAsync`
+queries `_notificationRepo.GetAllNotificationListsAsync` and
+`GetAllSmtpConfigurationsAsync` (lines 150-151), maps them into
+`NotificationListArtifact`s and `SmtpConfigurationArtifact`s — the SMTP artifact
+carrying `smtp.Credentials` verbatim (line 188) — and `BuildDeployArtifactsCommandAsync`
+places both into the `DeployArtifactsCommand` sent to **every site** (lines 128,
+130). The site side persists them: `SiteReplicationActor` (lines 192-201) and
+`DeploymentManagerActor` (lines 1383-1419) loop over `command.NotificationLists`
+and `command.SmtpConfigurations` and write them to site SQLite via
+`SiteNotificationRepository`.
+
+This is the precise scenario the design says must never happen: SMTP credentials
+travel across the inter-cluster transport and land on every site's SQLite. It
+supersedes the framing of the now-closed DeploymentManager-013, which accepted
+SMTP-as-deployable-artifact as a documented design decision — the design has since
+flipped to forbid distribution entirely, so this is a fresh divergence, not the
+same finding. The class-level XML doc (lines 13-22) is correspondingly stale: it
+still advertises "notification lists ... and SMTP configurations" as artifacts the
+service "broadcasts ... to all sites."
+
+Secondary defect in the same mapping: `NotificationListArtifact` is built from
+`nl.Recipients.Where(r => r.EmailAddress is not null)` (line 182), which silently
+drops every SMS-only recipient (`PhoneNumber` set, `EmailAddress` null) of a
+`NotificationType.Sms` list. Even if list distribution were intended, the SMS
+recipient set would be lost — but since lists must not be distributed at all, this
+is subsumed by the primary fix.
+
+**Recommendation**
+
+Stop fetching, mapping, and shipping notification lists and SMTP configurations
+from the artifact deployment path. Drop the `_notificationRepo` queries from
+`FetchGlobalArtifactsAsync`, pass `null` (or empty) for the `NotificationLists`
+and `SmtpConfigurations` fields of `DeployArtifactsCommand`, and update the class
+XML doc to remove both from the artifact list. The message fields can remain on
+`DeployArtifactsCommand` for additive compatibility but must never be populated
+from central. Coordinate removal of the consuming code in SiteRuntime
+(`SiteReplicationActor`, `DeploymentManagerActor`, `SiteNotificationRepository`)
+in the same session per the project's "design + code + tests travel together"
+rule. Update the contradicting tests (see DeploymentManager-027).
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): central `FetchGlobalArtifactsAsync` no longer queries or ships notification lists / SMTP configs (passes null; `DeployArtifactsCommand` fields kept for contract compatibility), and the site purges any already-persisted notification_lists / smtp_configurations rows (clearing the plaintext SMTP password) on both apply paths — enforcing the central-only delivery design. Verified no site runtime/delivery path reads this config.
+
+### DeploymentManager-026 — `DeployInstanceAsync` inserts a new deployment record every deploy; per-instance rows accumulate and the "current status" read has no tiebreaker
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:215-225`, `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:55-61` |
+
+**Description**
+
+`DeployInstanceAsync` always creates a brand-new `DeploymentRecord` and calls
+`_repository.AddDeploymentRecordAsync(record, …)` (line 223) — it never reuses or
+updates the instance's existing record. There is no update-in-place or
+delete-prior path on the deploy flow, so every successful or failed deployment of
+an instance leaves its predecessor row behind. Over the life of a central process
+— amplified by the bulk "deploy all out-of-date instances" workflow and by repeated
+redeploys after timeouts — the `DeploymentRecords` table grows without bound, one
+row per deploy attempt per instance.
+
+This contradicts the design's "Deployment Status Persistence" section
+(`Component-DeploymentManager.md:106-109`): *"Only the **current deployment
+status** per instance is stored in the configuration database ... No deployment
+history table — the audit log (via IAuditService) already captures every
+deployment action."* The audit log is the history; the deployment-record table is
+supposed to hold only the current status. The implementation instead keeps an
+ad-hoc, unindexed history there.
+
+The accumulation also makes the reconciliation read order-sensitive.
+`TryReconcileWithSiteAsync` reads the "prior" record via
+`GetCurrentDeploymentStatusAsync`, which is
+`OrderByDescending(d => d.DeployedAt).FirstOrDefault()` with **no secondary sort
+key** (e.g. `ThenByDescending(d => d.Id)`). `DeployedAt` is a `DateTimeOffset`
+stamped with `DateTimeOffset.UtcNow` at record creation; two records inserted
+within the same clock tick (rapid redeploy, or a redeploy immediately after a
+timed-out attempt) tie on `DeployedAt`, and SQL Server's choice between equal
+sort keys is undefined. Reconciliation could then read the *wrong* prior record
+(e.g. an older Success instead of the latest stuck InProgress), skipping the
+intended site query, or vice-versa.
+
+**Recommendation**
+
+Either (a) make the deploy path upsert the instance's single current record
+(update-in-place when one exists, insert only on first deploy) so the table holds
+exactly one row per instance per the design, or (b) if multiple rows are
+deliberately retained, add a deterministic tiebreaker to
+`GetCurrentDeploymentStatusAsync` (`OrderByDescending(d => d.DeployedAt)
+.ThenByDescending(d => d.Id)`) and document the retention/cleanup story so the
+table does not grow unbounded. Option (a) aligns with the design and is preferred.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added a deterministic `.ThenByDescending(d => d.Id)` tiebreaker to `GetCurrentDeploymentStatusAsync` so same-tick deployment records resolve to the newest row. Insert-per-deploy behaviour unchanged (history-vs-upsert remains a separate design question).
+
+### DeploymentManager-027 — Artifact tests assert that notification lists and SMTP configs are shipped, cementing the DeploymentManager-025 design violation
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Resolved |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:173-174,200-201` |
+
+**Description**
+
+`ArtifactDeploymentServiceTests` asserts, via NSubstitute, that the artifact
+deployment path queries the forbidden artifact sets exactly once per deployment:
+
+```
+await _notificationRepo.Received(1).GetAllNotificationListsAsync(Arg.Any<CancellationToken>());
+await _notificationRepo.Received(1).GetAllSmtpConfigurationsAsync(Arg.Any<CancellationToken>());
+```
+
+(both in the `DeployToAllSitesAsync` global-query-hoisting test at 173-174 and the
+`RetryForSiteAsync` test at 200-201). These assertions pin the exact behaviour the
+current design forbids (DeploymentManager-025): they will keep the service shipping
+notification lists and SMTP configs to sites and will actively block the fix —
+removing the queries makes these `Received(1)` assertions fail. Tests that lock in
+a design violation are worse than no test, because they make the correct change
+look like a regression.
+
+**Recommendation**
+
+When DeploymentManager-025 is fixed, change these to
+`DidNotReceive().GetAllNotificationListsAsync(...)` /
+`DidNotReceive().GetAllSmtpConfigurationsAsync(...)` (and assert the
+`DeployArtifactsCommand`'s `NotificationLists` / `SmtpConfigurations` fields are
+null/empty) so the tests enforce the central-only design instead of contradicting
+it.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the artifact tests no longer assert the forbidden notification/SMTP shipping — flipped `Received(1)` → `DidNotReceive()` and added assertions that the shipped command's NotificationLists/SmtpConfigurations are null.
diff --git a/code-reviews/ExternalSystemGateway/findings.md b/code-reviews/ExternalSystemGateway/findings.md
index 5d2a80af..6bcf263f 100644
--- a/code-reviews/ExternalSystemGateway/findings.md
+++ b/code-reviews/ExternalSystemGateway/findings.md
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway` |
 | Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |
 
 ## Summary
@@ -81,6 +81,42 @@ design-doc drift). Theme: every new finding is in a code path that was added or
 touched by the earlier fix bundle but whose error-propagation contract was not
 verified end-to-end against the S&F engine or the design doc.
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+All twenty-three prior findings (001–023) remain `Resolved`; spot-checks against the
+current source confirm the fixes still hold. Since `1eb6e97` the module gained a new
+SQL error-classification layer (`SqlErrorClassifier` + `TransientDatabaseException` /
+`PermanentDatabaseException`, commits `d0527064` / `de375ff7`) and `DatabaseGateway.CachedWriteAsync`
+was reshaped to attempt the write immediately and classify the outcome (transient →
+buffer, permanent → synchronous `Failed`, mirroring `CachedCallAsync`); the `ScadaLink → ZB.MOM.WW.ScadaBridge`
+rename also landed. The full 10-category checklist was re-walked and surfaced **three new
+findings**, none Critical/High. The most serious (`ExternalSystemGateway-024`, Medium) is
+that `SqlErrorClassifier.IsTransient(Exception)` classifies *every* `InvalidOperationException`
+(and `TimeoutException`) as transient, so a DB-layer authoring/driver-misuse bug that surfaces
+as `InvalidOperationException` is silently buffered and retried instead of propagating —
+contradicting the classifier's own stated "authoring bugs must propagate" contract and the
+HTTP-path symmetry it claims. `-025` (Low) is a caller-cancellation that the SQL driver
+raises as a `SqlException` (mid-flight cancel) being reclassified as a *permanent* DB error
+rather than propagating the cancellation (a narrow asymmetry with the `-008` contract; untested).
+`-026` (Low) is a mis-ordered numbered-label comment block in `ExecuteWriteAsync`. Theme:
+the new SQL-classification seam faithfully mirrors the HTTP path's *shape* but is more
+permissive on the non-typed-exception branch than the HTTP path it cites as its model.
+
+_Re-review (2026-06-20, `4307c381`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | `BuildUrl` / `JsonElementToParameterValue` precision cascade / verb validation all still correct (006/017/020/022). New: caller-cancel surfacing as `SqlException` is misclassified permanent — finding 025. |
+| 2 | Akka.NET conventions | ☑ | Still no actors; `AddExternalSystemGatewayActors` remains a no-op. Cached-call/write lifecycle + audit emission live in SiteRuntime/AuditLog, correct boundary. No issues. |
+| 3 | Concurrency & thread safety | ☑ | Services stateless and DI-scoped; new `ExecuteWriteAsync`/`RunSqlAsync`/`SqlErrorClassifier` seams introduce no shared mutable state (static `HashSet` is read-only). No findings. |
+| 4 | Error handling & resilience | ☑ | New SQL transient/permanent split is the headline area. `SqlErrorClassifier.IsTransient(Exception)` is over-broad — all `InvalidOperationException`/`TimeoutException` are transient, sweeping authoring bugs into retry — finding 024. Number-based set (024 aside) is sound; unknown→permanent fail-fast is correct. |
+| 5 | Security | ☑ | `ExecuteWriteAsync`/`SqlErrorClassifier` use the connection NAME, never the connection string, in error text (verified — credentials not leaked). `ApplyAuth` secrets still never logged. SQL `CommandText` is script-authored by design (parameterised values bound separately). No new findings. |
+| 6 | Performance & resource management | ☑ | `RunSqlAsync` uses `await using`/`using` for connection + command. `EmptyParameters` reused. `client.Timeout` set per rented client (correct — `IHttpClientFactory` clients are per-call). No new findings. |
+| 7 | Design-document adherence | ☑ | Updated doc (lines 65–66) now describes the immediate-attempt-then-buffer cached-write model the code implements — in sync. PATCH reconciled (023). No new drift. |
+| 8 | Code organization & conventions | ☑ | `Transient`/`PermanentDatabaseException` are component-local exception types (parallel to the HTTP ones); acceptable. `SqlErrorClassifier` is a static policy class in the component. No new findings. |
+| 9 | Testing coverage | ☑ | New SQL-classification paths well covered (immediate + retry, transient/permanent/non-Sql/cancel/unexpected). Gaps: caller-cancel-as-`SqlException` (025) and the `InvalidOperationException`-authoring-bug case (024) are untested — recorded under those findings. |
+| 10 | Documentation & comments | ☑ | XML docs accurate and thorough. The numbered case labels in `ExecuteWriteAsync`'s comment block are mis-ordered vs the catch sequence — finding 026. |
+
 ## Checklist coverage
 
 | # | Category | Examined | Notes |
@@ -1312,3 +1348,162 @@ code travel together" rule:
 - If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the
   body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it.
   The design-doc list then remains the single source of truth.
+
+### ExternalSystemGateway-024 — `SqlErrorClassifier.IsTransient(Exception)` classifies all `InvalidOperationException`/`TimeoutException` as transient, so DB authoring/driver-misuse bugs are buffered and retried instead of propagating
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:123-142`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:346-355` |
+
+**Description**
+
+`SqlErrorClassifier.IsTransient(Exception)` is a pure exception-type test:
+
+```csharp
+return exception is InvalidOperationException
+    or IOException
+    or SocketException
+    or TimeoutException
+    or TaskCanceledException
+    or DbException;
+```
+
+`DatabaseGateway.ExecuteWriteAsync` uses it as the third ordered catch
+(`catch (Exception ex) when (SqlErrorClassifier.IsTransient(ex))`) to decide whether a
+write that did *not* surface as a `SqlException` should be buffered and retried. The
+type's own XML doc justifies the set as "a live DB outage does not always surface as a
+`SqlException`" and explicitly contrasts it with authoring bugs: *"Authoring bugs
+(`ArgumentException`, `NullReferenceException`, etc.) are loud, fixable failures —
+silently buffering and retrying them forever would hide the bug."*
+
+The problem is that `InvalidOperationException` and `TimeoutException` are exactly the
+two most common shapes of an authoring/driver-misuse bug on the ADO.NET write path, and
+both are now swept into the transient set:
+
+1. `Microsoft.Data.SqlClient` throws `InvalidOperationException` for a long list of
+   *programming* errors that are NOT outages and will fail identically on every retry —
+   e.g. a `SqlParameter` already belongs to another `SqlParameterCollection`, the
+   command's `Connection` property is not set, an open `SqlDataReader` already exists on
+   the connection, or `CommandText` is invalid for the call. The doc treats
+   `InvalidOperationException` as "the connection is not open / pooled connection
+   broken", but the type carries no such discrimination — any of those authoring
+   conditions is classified transient and buffered.
+2. A bare `TimeoutException` (or any of these) thrown by a custom/wrapped provider for a
+   non-outage reason is likewise classified transient.
+
+Consequence: a `CachedWrite` whose payload triggers one of these deterministic
+programming errors is returned to the script as `WasBuffered: true` (a lie — it will
+never succeed), then re-attempted on every S&F sweep, failing identically each time,
+until `RetryCount >= DefaultMaxRetries` and it is finally parked — a noisy retry loop
+hiding a fixable bug, which is the exact outcome the doc says the design avoids. It is
+also asymmetric with the HTTP path it cites as its model: `ErrorClassifier.IsTransient(Exception)`
+(ExternalSystemClient) deliberately does NOT catch `InvalidOperationException`, so the
+same class of bug propagates on the API side.
+
+**Recommendation**
+
+Narrow the non-`SqlException` transient set to genuine transport/outage shapes. Drop
+the broad `InvalidOperationException` arm — or, if connection-state errors must stay
+transient, narrow it (e.g. only when the message indicates a broken/closed connection,
+or only the `DbException`/`SocketException`/`IOException`/timeout shapes) so that
+ADO.NET programming `InvalidOperationException`s propagate as the design intends.
+Mirror the HTTP path's narrower `ErrorClassifier.IsTransient(Exception)` set unless
+there is a documented reason the SQL path must be broader. Add a regression test that
+throws a programming-shaped `InvalidOperationException` (e.g. "The SqlParameter is
+already contained by another SqlParameterCollection.") from `RunSqlAsync` and asserts
+it propagates (not buffered), companion to the existing
+`CachedWrite_UnexpectedException_Propagates_NotClassifiedTransient`.
+
+**Resolution**
+
+Deferred 2026-06-20: narrowing the over-broad SQL transient set (currently all `InvalidOperationException`/`TimeoutException` are treated transient) to match the HTTP path is a classification-policy decision for the owner; recommended narrowing recorded. No data loss today (bounded noisy retry returning a false WasBuffered), so deferred rather than forced.
+
+### ExternalSystemGateway-025 — Caller-token cancellation surfacing from the SQL driver as a `SqlException` is reclassified as a permanent DB error instead of propagating
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:329-345`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:158-174` |
+
+**Description**
+
+`ExecuteWriteAsync`'s ordered catches handle caller cancellation only via
+`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`.
+That covers the common case where the driver observes the token before/around the
+network call and raises `OperationCanceledException`/`TaskCanceledException`. However,
+when `Microsoft.Data.SqlClient` is cancelled *mid-statement* by the caller's token it
+can instead raise a **`SqlException`** (historically error number `0` / message
+"Operation cancelled by user", and on some paths `3980` "the request failed to run
+because the batch is aborted"). Such a `SqlException` does not derive from
+`OperationCanceledException`, so the first filter is skipped; it falls into
+`catch (SqlException ex) { throw SqlErrorClassifier.Throw(connectionName, ex); }`, and
+because `0`/`3980` are not in the transient set, `SqlErrorClassifier.Throw` raises a
+`PermanentDatabaseException`.
+
+The caller-requested cancellation is therefore reported to the script as a *permanent
+database failure* (`ExternalCallResult.Success == false`, "Permanent database error: …")
+rather than propagating as an `OperationCanceledException`. This contradicts the
+`ExternalSystemGateway-008` contract ("the caller asked to abandon the work — do not
+reclassify") that the comment block at `DatabaseGateway.cs:333-337` explicitly cites,
+and it is the SQL-path analogue of the cancellation/timeout conflation that `-008`
+fixed for HTTP. The existing `CachedWrite_CancellationRequested_PropagatesOperationCanceled_NotReclassified`
+test only exercises the `OperationCanceledException` shape, so the `SqlException`
+cancellation path is uncovered.
+
+**Recommendation**
+
+Before classifying a `SqlException`, re-check `cancellationToken.IsCancellationRequested`
+and, if set, rethrow as `OperationCanceledException(cancellationToken)` (or call
+`cancellationToken.ThrowIfCancellationRequested()`), so a caller-initiated cancel
+propagates regardless of whether the driver surfaced it as `OperationCanceledException`
+or as a cancellation-shaped `SqlException`. Add a regression test that cancels the
+caller token and throws a cancellation-shaped `SqlException` (or, since `SqlException`
+has no public constructor, drive it through `RunSqlAsync` with a token-cancelled
+`OperationCanceledException` wrapped to assert the guard) and asserts an
+`OperationCanceledException` propagates and nothing is buffered.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `ExecuteWriteAsync` now calls `cancellationToken.ThrowIfCancellationRequested()` at the top of the `catch (SqlException)` block (before `SqlErrorClassifier.Throw`), so a caller-token cancel that surfaces as a SqlException propagates as `OperationCanceledException` regardless of the driver's exception shape (version-independent). Test added (verified failing pre-fix).
+
+### ExternalSystemGateway-026 — `ExecuteWriteAsync` comment-block case labels are mis-ordered relative to the catch sequence
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:316-355` |
+
+**Description**
+
+The header comment for `ExecuteWriteAsync` (lines 316-328) numbers the classification
+cases 1–4 in catch order: `1.` caller-requested cancellation propagates unchanged; `2.`
+a `SqlException` is classified by number; `3.` a non-`SqlException` transport/connection
+failure is transient; `4.` unexpected exceptions propagate. The inline labels on the
+catch bodies, however, are inverted: the cancellation catch (line 333) is annotated
+`// [2]` and the non-`SqlException` transient catch (line 348) is annotated `// [1]`.
+The labels also do not match the parallel labelling in the test file, where `[1]` is
+used for the non-Sql-outage case and `[2]` for cancellation — so the source comment and
+the tests disagree with the source's own header list.
+
+This is cosmetic but actively misleading: a reader cross-referencing the numbered
+rationale in the header against the `[n]`-tagged catch bodies will map each case to the
+wrong rationale, and the discrepancy undermines confidence in an otherwise carefully
+documented classification seam.
+
+**Recommendation**
+
+Renumber the inline `// [n]` labels to match the header list (cancellation = `[1]`,
+non-`SqlException` transient = `[3]`), or drop the bracketed numbers entirely and keep
+the prose, so the comment is self-consistent. Align the test-file annotations to
+whatever scheme is chosen.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): renumbered the inline `// [n]` case labels in `ExecuteWriteAsync` to match the method header's rationale list (cancellation=[1], SqlException=[2], non-Sql=[3]), and realigned the three corresponding labels in the test file. Comment-only.
diff --git a/code-reviews/HealthMonitoring/findings.md b/code-reviews/HealthMonitoring/findings.md
index 7a492167..076f1173 100644
--- a/code-reviews/HealthMonitoring/findings.md
+++ b/code-reviews/HealthMonitoring/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring` |
 | Design doc | `docs/requirements/Component-HealthMonitoring.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 2 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
 
 ## Summary
 
@@ -80,6 +80,45 @@ covers (HealthMonitoring-023). All sequence-number and offline-detection
 arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
 monotonic mismatch was observed.
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+All twenty-three prior findings (HealthMonitoring-001..023) remain `Resolved`
+and were spot-verified against the current source: the HealthMonitoring-017/018
+"snapshot-then-restore-on-failure" counter logic is in place
+(`HealthReportSender.cs:158-176`, `CentralHealthReportLoop.cs:118-134`,
+`SiteHealthCollector.AddIntervalCounters`), the HealthMonitoring-020 offline→online
+`max(receivedAt, now)` anchor is correct (`CentralHealthAggregator.cs:138-149`),
+and the HealthMonitoring-021 `$central` sentinel is honoured by the aggregator and
+the Central UI. The full 10-category checklist produced **2 new findings, both Low,
+none crash-class** — both confined to the new M6 `SiteHealthKpiSampleSource`
+(KPI History sampling) added since the last baseline: the synthetic `$central`
+self-report is sampled as a real `KpiScopes.Site` series with meaningless
+zero connection/instance/S&F values (HealthMonitoring-024), and 8 of the 12
+emitted Site Health metrics are persisted every minute per site but never read by
+any chart while the design doc claims all 12 are "rendered in the dashboard's
+per-site KpiTrendChart panel" (HealthMonitoring-025). The KPI-recorder
+scope/captive-dependency reasoning is sound (recorder opens a per-pass scope and
+the scoped source over a singleton aggregator is the normal lifetime relationship),
+the per-source fault isolation is correct, and the `AddIntervalCounters` /
+`SetSiteEventLogWriteFailures` default-interface no-ops are an appropriate
+test-fake-compatibility seam. No concurrency, security, or sequence-arithmetic
+regressions were found.
+
+_Re-review (2026-06-20, `4307c381`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | x | Re-verified HealthMonitoring-017/018/020 fixes in place and correct. `SiteHealthKpiSampleSource.CollectAsync` arithmetic (connectionsUp/Down split, summed `sfBufferDepth`, null-`SiteAuditBacklog`→0) is correct. No new logic bug. |
+| 2 | Akka.NET conventions | x | Module still contains no actors. `IHealthReportTransport.Send` is fire-and-forget (Tell semantics). KPI source is a plain `IKpiSampleSource` consumed by the KpiHistory recorder singleton (out of scope). No issues found. |
+| 3 | Concurrency & thread safety | x | `SiteHealthCollector` counters use `Interlocked`; `AddIntervalCounters` restore is per-field `Interlocked.Add` and correctly sums with concurrent increments against the zero left by `CollectReport`'s `Exchange`. Aggregator CAS pattern unchanged. `SiteEventLogFailureCountReporter` uses a linked CTS + `Task.Run` loop with catch-all probe isolation — sound. No issues found. |
+| 4 | Error handling & resilience | x | Counter-restore-on-failure (HM-017/018) confirmed. KPI source is exception-isolated by the recorder per-source. `SiteEventLogFailureCountReporter.SafeProbe` logs and continues. No issues found. |
+| 5 | Security | x | No issues found. Numeric/string operational metrics only; no secrets, auth surface, or untrusted-input parsing. |
+| 6 | Performance & resource management | x | `PeriodicTimer`/CTS disposed correctly. New: `SiteHealthKpiSampleSource` writes 12 Site-scoped rows per site per minute, of which 8 are never charted (HealthMonitoring-025) — bounded but wasteful; also emits a full row-set for the synthetic `$central` (HealthMonitoring-024). |
+| 7 | Design-document adherence | x | Design doc line 119 claims all 12 emitted Site Health metrics are "rendered in the dashboard's per-site KpiTrendChart panel" — only 4 (`connectionsDown`/`scriptErrors`/`sfBufferDepth`/`deadLetters`) are actually charted (HealthMonitoring-025). Synthetic `$central` is sampled as a real Site scope (HealthMonitoring-024). |
+| 8 | Code organization & conventions | x | KPI source's metric-name catalog is split: 4 charted metrics share `KpiMetrics.SiteHealth` (Commons), 8 are private literals — noted under HealthMonitoring-025. Options/validator ownership and idempotent registration correct. `AddCentralHealthAggregation` registers the scoped KPI source via `TryAddEnumerable` — correct. |
+| 9 | Testing coverage | x | `SiteHealthKpiSampleSourceTests` covers the populated/heartbeat-only/null-backlog/empty paths and exact (metric,value) tuples well. No test asserts `$central` is excluded (it currently is not — HealthMonitoring-024). Otherwise coverage is strong (73 tests). |
+| 10 | Documentation & comments | x | XML docs accurate. The only doc drift is design-doc line 119's "all 12 rendered" claim (HealthMonitoring-025). |
+
 ## Checklist coverage
 
 | # | Category | Examined | Notes |
@@ -1153,3 +1192,110 @@ a non-bug.
 Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
 (or similar) and update the test body's intent — purely a documentation /
 maintainability fix; no behaviour change.
+
+### HealthMonitoring-024 — `SiteHealthKpiSampleSource` samples the synthetic `$central` self-report as a real `KpiScopes.Site` series
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:69-86`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:115-120` |
+
+**Description**
+
+`SiteHealthKpiSampleSource.CollectAsync` iterates
+`ICentralHealthAggregator.GetAllSiteStates()` and emits the 12-metric catalog
+for **every** entry whose `LatestReport` is non-null. The aggregator keyspace
+includes the synthetic central self-report under `CentralHealthReportLoop.CentralSiteId`
+(`"$central"`), which `CentralHealthReportLoop` feeds via `ProcessReport` every
+30s — so `$central` always has a non-null `LatestReport` once the loop has run.
+The source therefore writes 12 `KpiSample` rows per minute with
+`Scope = KpiScopes.Site`, `ScopeKey = "$central"` for the central cluster.
+
+The class XML doc, the design doc (`Component-HealthMonitoring.md:119`,
+"emits `IKpiSampleSource` (`SiteHealthKpiSampleSource`, per-Site)"), and the
+`KpiScopes.Site` scope all describe this as a **per-real-site** series.
+`$central` is not a site — its `DataConnectionStatuses`, `StoreAndForwardBufferDepths`,
+`ParkedMessageCount`, and instance counts are structurally empty/zero
+(the central node runs no DCL, no S&F engine, no instances), so the persisted
+`connectionsUp`/`connectionsDown`/`sfBufferDepth`/`parkedMessages`/`deployedInstances`
+trend rows for `$central` are meaningless constant-zero noise. The Central UI
+trend selector (`Health.razor:570-572`) deliberately pins `$central` first and
+queries `KpiScopes.Site`/`scopeKey="$central"`, so the dashboard will plot these
+flat-zero "Central Cluster" trends.
+
+This is consistent (the UI renders central as a card by design) so it is not a
+behaviour bug — but it conflates a synthetic non-site with real-site KPI history,
+permanently stores meaningless zero series, and contradicts the "per-Site"
+contract. None of the other M6 sample sources sample a synthetic scope key.
+
+**Recommendation**
+
+Decide whether central-cluster trends belong in the per-Site KPI store. If not,
+skip the `CentralHealthReportLoop.CentralSiteId` entry in `CollectAsync` (a single
+`if (siteId == CentralHealthReportLoop.CentralSiteId) continue;` guard) and remove
+the `$central` option from the trend selector. If central trends are intended,
+give them a distinct scope (e.g. a `KpiScopes.Central` scope or a central-specific
+metric subset) so the data is not labelled as a real site, and update the design
+doc + class XML doc to say so. Either way add a test asserting the chosen
+behaviour for `$central`.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the design doc now documents that the synthetic `$central` self-report is intentionally sampled into the per-site KPI store (the Central UI deliberately pins `$central` first in the trend selector as 'Central Cluster'), and that its zero connection/instance/S&F values reflect the self-report carrying no site-runtime data — not a defect. Doc-only.
+
+### HealthMonitoring-025 — 8 of the 12 emitted Site Health KPI metrics are never charted; design doc claims all 12 are rendered
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:40-51`, `docs/requirements/Component-HealthMonitoring.md:119` |
+
+**Description**
+
+`SiteHealthKpiSampleSource` emits 12 metrics per site per sampling pass:
+`connectionsUp`, `connectionsDown`, `scriptErrors`, `alarmEvalErrors`,
+`sfBufferDepth`, `deadLetters`, `parkedMessages`, `deployedInstances`,
+`enabledInstances`, `disabledInstances`, `auditBacklogPending`,
+`eventLogWriteFailures`. Only **four** are ever read back: the Central UI
+per-site trend panel calls `LoadTrendSeriesAsync` for exactly
+`KpiMetrics.SiteHealth.ConnectionsDown`, `.ScriptErrors`, `.SfBufferDepth`, and
+`.DeadLetters` (`Health.razor:596-603`) — and only those four are in the public
+`KpiMetrics.SiteHealth` catalog (`Commons/Types/Kpi/KpiMetrics.cs:84-97`). The
+other 8 metric names are private string literals in the source and have no
+reader anywhere in the codebase.
+
+Two consequences:
+
+1. **Stale design doc.** `Component-HealthMonitoring.md:119` lists all 12 metric
+   names and states they are "rendered in the dashboard's per-site
+   `KpiTrendChart` panel." Eight of them are never rendered. A reader / future
+   maintainer is misled about what the trend panel shows.
+2. **Persisted-but-dead samples.** The `KpiHistoryRecorderActor` writes all 12
+   rows to the central `KpiSample` table every minute for every reporting site;
+   8 are pure write-amplification that nothing queries and that the 90-day
+   purge eventually drops. Bounded, but wasteful — roughly two-thirds of the
+   Site Health KPI rows are never read.
+
+A related maintainability smell: the split catalog (4 metrics shared via
+`KpiMetrics.SiteHealth`, 8 private literals) means a future chart added for, say,
+`alarmEvalErrors` would have to re-type the literal and risk a silent typo
+mismatch against the source.
+
+**Recommendation**
+
+Reconcile intent in one of two directions: (a) if only the four charted metrics
+are wanted, stop emitting the other eight (and correct the design doc's list to
+the four actually rendered); or (b) if the full 12-metric history is intended for
+future charts / ad-hoc query, promote all 12 names into the public
+`KpiMetrics.SiteHealth` catalog so source and any future UI binding key off one
+symbol, and either add the missing charts or soften the doc's "rendered … panel"
+wording to "recorded for trend history." Add the design-doc fix in the same
+session per the repo's doc-and-code-travel-together rule.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): corrected the design doc — of the 12 sampled Site Health KPI metrics only four are charted (connectionsDown, scriptErrors, sfBufferDepth, deadLetters); the other eight are now documented as sampled-but-not-charted (retained for future surfaces / ad-hoc query). Doc-only.
diff --git a/code-reviews/Host/findings.md b/code-reviews/Host/findings.md
index d0ed36ac..fccbb7f1 100644
--- a/code-reviews/Host/findings.md
+++ b/code-reviews/Host/findings.md
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.Host` |
 | Design doc | `docs/requirements/Component-Host.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |
 
 ## Summary
@@ -80,6 +80,33 @@ Serilog is the only logger provider and the section is dead config (Host-021);
 and `ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g.
 a typo) and falls back to `Information` with no warning (Host-022).
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+All twenty-two prior findings (Host-001..022) remain `Resolved` in the current tree
+(Host-008 was deliberately *reverted* in M2.9 — see Host-024). This is the first
+review since the `ScadaBridge` rename, so a `git diff 1eb6e97 HEAD` shows the whole
+directory as additions and is not a useful delta; the review walked the full current
+state plus the per-file commit history. Major changes since `1eb6e97`: REQ-HOST-4a was
+re-architected (the leader-only `active-node` check left `/health/ready`; a new
+leadership-agnostic `RequiredSingletonsHealthCheck` probes the five required central
+singletons via `Identify` — clean), shared `ZB.MOM.WW.Health`/`Telemetry` adoption
+(`/healthz`, `/metrics`, OTel), a site HTTP/1.1 `MetricsPort` (8084) listener, the
+KpiHistory + Transport central registrations, the disable-login dev flag, and the
+library-backed inbound API-key auth. Four new findings, none crash/data-loss class.
+Host-023 (Medium) is a regression-of-resolution: the shipped `appsettings.Site.json`
+second seed-node now targets `localhost:8084` — the new HTTP/1.1 MetricsPort, not an
+Akka remoting endpoint — the same defect class as the resolved Host-004 (which chose
+8084 as a "remoting port" before MetricsPort claimed it); `StartupValidator` rejects a
+seed on the GrpcPort but not on the MetricsPort. Host-024 (Medium) flags the M2.9
+reversal of Host-008: `MachineDataDb` is again *required* for Central yet is consumed
+by no DbContext/repository AND is absent from the shipped `appsettings.Central.json`,
+so a developer running the shipped Central binary fails startup for a value nothing
+uses. Host-025 (Low): the Host directly consumes KpiHistory (`AddKpiHistory`,
+`KpiHistoryRecorderActor`, `KpiHistoryOptions`) but has no direct `ProjectReference` —
+it relies on a transitive pull through CentralUI. Host-026 (Low): the Component
+Registration Matrix in `Component-Host.md` is stale — KpiHistory and Transport are
+registered central-only in `Program.cs` but absent from the matrix.
+
 ## Checklist coverage
 
 | # | Category | Examined | Notes |
@@ -110,6 +137,21 @@ _Re-review (2026-05-28, `1eb6e97`):_
 | 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the unused `CancellationToken` parameter (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). |
 | 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaBridge:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). |
 
+_Re-review (2026-06-20, `4307c381`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | Re-review: shipped `appsettings.Site.json` second seed-node targets MetricsPort 8084 (HTTP/1.1 Kestrel listener), not an Akka remoting endpoint — regression of the resolved Host-004 fix (Host-023). |
+| 2 | Akka.NET conventions | ☑ | All five central singletons + the four site singletons reviewed: ClusterSingletonManager/Proxy pairing, role scoping (site singletons `WithRole(site-*)`, central on base "Central"), `PhaseClusterLeave` graceful-stop drain tasks, receptionist registrations, `RequiredSingletonsHealthCheck` `Identify` probe — all sound. No new issues. |
+| 3 | Concurrency & thread safety | ☑ | `GetOrCreateActorSystem` double-checked lock (HOST-021) is correct; `_trackedDisposables` snapshot-under-lock; `RequiredSingletonsHealthCheck` probes concurrently and never throws. No new issues. |
+| 4 | Error handling & resilience | ☑ | `StartupRetry` now threads `ApplicationStopping` + transient-only classifier (Host-019/015 resolved); migration retry sound. No new issues. |
+| 5 | Security | ☑ | Shipped configs keep `${...}` secret placeholders; pepper fail-fast validated (≥16 chars). Note: docker per-node `appsettings.Central.json` still carry plaintext SQL passwords / JWT key — out of this module's edit scope (Host-003 follow-up), not re-filed. |
+| 6 | Performance & resource management | ☑ | ActorSystem singleton-not-transient avoids per-probe disposal; gRPC stream cancel-on-shutdown wired. No new issues. |
+| 7 | Design-document adherence | ☑ | Re-review: Component Registration Matrix omits KpiHistory + Transport (both central-only in Program.cs) (Host-026); `MachineDataDb` required-but-unconsumed and absent from shipped Central config (Host-024). |
+| 8 | Code organization & conventions | ☑ | Re-review: Host directly consumes KpiHistory types but has no direct `ProjectReference` (transitive via CentralUI) (Host-025). |
+| 9 | Testing coverage | ☑ | Strong suite (RequiredSingletons, Hocon, Serilog, StartupRetry covered). Gaps: no test asserts a seed on the MetricsPort is rejected (Host-023 — the existing 8084-seed test only avoids the grpc rule), and no shipped-config test catches the missing `MachineDataDb` key (Host-024). |
+| 10 | Documentation & comments | ☑ | Re-review: `Component-Host.md` matrix stale vs `Program.cs` (Host-026). Inline XML/comments otherwise accurate and unusually thorough. |
+
 ## Findings
 
 ### Host-001 — `/health/ready` includes the leader-only `active-node` check
@@ -1118,3 +1160,163 @@ behaviour you choose.
 **Resolution**
 
 _Open._
+
+### Host-023 — Shipped Site `SeedNodes` second entry targets the MetricsPort, not a remoting port
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:13-17`, `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:117-127` |
+
+**Description**
+
+The shipped site config sets `Node:RemotingPort = 8082`, `Node:GrpcPort = 8083`,
+`Node:MetricsPort = 8084`, but `Cluster:SeedNodes` is
+`["akka.tcp://scadabridge@localhost:8082", "akka.tcp://scadabridge@localhost:8084"]`.
+The second seed targets `8084` — which `Program.cs:419-422` binds as the **HTTP/1.1
+Kestrel `/metrics` listener** (`HttpProtocols.Http1AndHttp2`), not an Akka.Remote
+endpoint. A node joining via that seed attempts an Akka.Remote TCP association against
+the Prometheus scrape listener and fails. This is the exact defect class as the
+**resolved Host-004** (where the second seed targeted the gRPC port 8083): Host-004's
+fix corrected the seed to `8084` and called it a "remoting port", but the later
+introduction of `MetricsPort` (commits `bbc9f092`/`c41cb41c`) claimed `8084`, silently
+re-breaking the example. `StartupValidator` (`StartupValidator.cs:121-127`) was
+extended by Host-004 to reject a seed whose port equals `GrpcPort`, but it does **not**
+reject a seed whose port equals `MetricsPort`, so this misconfiguration passes
+validation silently. For the single-node dev loopback the first seed (`8082`) succeeds
+and the bug is masked, but it is an incorrect example that copies into multi-node
+configs (the docker site configs correctly use distinct hostnames both on `8082`).
+
+**Recommendation**
+
+The loopback dev site cannot host two distinct remoting endpoints, so drop the
+second seed (mirror Host-016's single-entry `CentralContactPoints` template) or use a
+distinct hostname placeholder both on `8082`, with a comment that multi-node sites
+list each node's *remoting* port. Extend the Site block of `StartupValidator` to also
+reject any seed whose port equals the resolved `MetricsPort` (and ideally any port
+the node binds for non-remoting use). Add a `StartupValidatorTests` case
+`Site_SeedNodeOnMetricsPort_FailsValidation` mirroring
+`Site_SeedNodeOnGrpcPort_FailsValidation`.
+
+**Resolution**
+
+Deferred 2026-06-20: the shipped `appsettings.Site.json` second seed-node points at the MetricsPort and the validator lacks a `seedPort != metricsPort` rule; fixing touches shipped dev config and a single-node-loopback assumption. Recorded for a follow-up (no production node ships this dev config).
+
+### Host-024 — `MachineDataDb` re-required for Central but consumed by nothing and absent from the shipped config
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:63-65`, `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json:22-24` |
+
+**Description**
+
+Commit `76198b36` ("reverts Host-008") re-added a fail-fast `.Require` for
+`ScadaBridge:Database:MachineDataDb` to the Central `.When` block, citing
+REQ-HOST-3/REQ-HOST-4 and that the docker per-node configs carry the key. Two problems
+remain after the reversal:
+
+1. **Still consumed by nothing.** A repo-wide search shows `MachineDataDb` is read only
+   in `StartupValidator.cs` and declared on `DatabaseOptions.cs:11` — no DbContext, no
+   repository, no component reads it. Only `ConfigurationDb` is wired into
+   `AddConfigurationDatabase` (`Program.cs:200-202`). The original Host-008 rationale
+   ("dead configuration that fails startup for a value nothing uses") is therefore
+   still entirely true; the reversal re-instated the fail-fast without adding a
+   consumer.
+2. **Absent from the shipped Central config.** `appsettings.Central.json` (the template
+   a developer running the binary directly uses) contains `Database:ConfigurationDb`
+   (a `${...}` placeholder) but **no `MachineDataDb` key at all**. So a Central node
+   started from the shipped config fails `StartupValidator` immediately with
+   "ScadaBridge:Database:MachineDataDb connection string required for Central" — for a
+   value nothing consumes. The docker configs happen to set it; the shipped default
+   does not. (This is *not* a re-file of Host-008, which was Resolved-then-reverted;
+   the new, distinct defect is the validation breaking the shipped dev config.)
+
+This is also a code-vs-doc inconsistency: the design intent (validate a value the
+system needs) is unmet because the value is needed by nothing.
+
+**Recommendation**
+
+Pick one and make code, shipped config, and doc agree: either (a) wire a machine-data
+store so the value is actually consumed and add a `MachineDataDb` placeholder to the
+shipped `appsettings.Central.json` (matching the `_secrets` env-var pattern), or (b)
+remove the `.Require`, the `DatabaseOptions.MachineDataDb` property, and the docker
+keys again. Add a config-shape regression test asserting the shipped
+`appsettings.Central.json` satisfies `StartupValidator` for the Central role (the
+absence of such a test is why this slipped through).
+
+**Resolution**
+
+Deferred 2026-06-20: `MachineDataDb` is re-required for Central but consumed by nothing and absent from shipped Central config (the shipped Central binary would fail StartupValidator). This is a reversal of the reverted Host-008 and needs a product decision — whether a machine-data store is actually coming (add a config placeholder) or the requirement should be removed again. Recorded for that decision.
+
+### Host-025 — Host directly consumes KpiHistory but has no direct project reference
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:42-63`, `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:118`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:717-757` |
+
+**Description**
+
+`Program.cs` calls `builder.Services.AddKpiHistory(builder.Configuration)` and
+`AkkaHostedService.RegisterCentralActors` constructs the `KpiHistoryRecorderActor`
+cluster singleton from `KpiHistoryOptions` / `KpiHistoryRecorderActor`
+(`ZB.MOM.WW.ScadaBridge.KpiHistory`), yet the Host `.csproj` carries **no
+`ProjectReference` to `ZB.MOM.WW.ScadaBridge.KpiHistory`**. It compiles only because
+CentralUI (`ZB.MOM.WW.ScadaBridge.CentralUI.csproj`) transitively re-exports it. The
+Host is the composition root for every component and the project deliberately declares
+direct dependencies elsewhere — the csproj even comments that `ZB.MOM.WW.Theme` is
+declared directly "rather than leaning on CentralUI re-exporting it transitively"
+(lines 37-39). KpiHistory is the same situation but was not given the same treatment.
+If CentralUI ever drops or restructures its KpiHistory reference, the Host fails to
+compile for a dependency it directly uses — a latent build fragility.
+
+**Recommendation**
+
+Add `<ProjectReference Include="../ZB.MOM.WW.ScadaBridge.KpiHistory/ZB.MOM.WW.ScadaBridge.KpiHistory.csproj" />`
+to the Host csproj's `ProjectReference` group, mirroring the explicit Theme reference
+rationale already documented there.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added an explicit `<ProjectReference>` to the KpiHistory project in the Host csproj, so Host's direct use of KpiHistory types no longer relies on the transitive pull through CentralUI.
+
+### Host-026 — Component Registration Matrix omits KpiHistory and Transport
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Resolved |
+| Location | `docs/requirements/Component-Host.md` (Component Registration Matrix, lines 177-198), `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:92,118` |
+
+**Description**
+
+`Program.cs` registers two central-only components that have shipped since the matrix
+was last updated: **Transport** (`builder.Services.AddTransport()`, line 92, #24) and
+**KpiHistory** (`builder.Services.AddKpiHistory(...)`, line 118, plus the
+`kpi-history-recorder` cluster singleton in `AkkaHostedService`, #26). Neither appears
+in the Component Registration Matrix in `Component-Host.md` (lines 177-198), which is
+the documented source of which components register on which role. The matrix is
+described in CLAUDE.md as something that "must stay in sync with actual component
+documents"; it is now stale against the actual composition root. A maintainer auditing
+role-based registration against the doc would miss both components (and miss that
+KpiHistory's recorder is deliberately *not* readiness-gated — worth a matrix note).
+
+**Recommendation**
+
+Add `Transport` (Central: Yes / Site: No / DI: Yes / Actors: No / Endpoints: No) and
+`KpiHistory` (Central: Yes / Site: No / DI: Yes / Actors: Yes / Endpoints: No) rows to
+the matrix, and note KpiHistory's recorder singleton is intentionally absent from
+`RequiredSingletonsHealthCheck` (not readiness-gated). Re-check the Dependencies list
+("All 19 component libraries") against the actual csproj reference count while editing.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added `KpiHistory` and `Transport` rows to the Component Registration Matrix in Component-Host.md (both Central=Yes/Site=No/DI=Yes), reflecting their actual central-only registration.
diff --git a/code-reviews/InboundAPI/findings.md b/code-reviews/InboundAPI/findings.md
index 291b60a5..d7109c81 100644
--- a/code-reviews/InboundAPI/findings.md
+++ b/code-reviews/InboundAPI/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.InboundAPI` |
 | Design doc | `docs/requirements/Component-InboundAPI.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 0 |
+| Commit reviewed | `4307c381` |
+| Open findings | 4 |
 
 ## Summary
 
@@ -124,6 +124,36 @@ configuration database, but the invariant is undocumented.)
 | 9 | Testing coverage | ☑ | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test (InboundAPI-023); middleware/filter/validator/executor/route are individually covered. |
 | 10 | Documentation & comments | ☑ | No new issues. |
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+All 25 prior findings remain `Resolved` (InboundAPI-022's `IActiveNodeGate` Host
+registration is now present at `Program.cs:260`; InboundAPI-025's POST-only audit
+`UseWhen` predicate is live at `Program.cs:337-344`). The module changed materially:
+the whole project was renamed (ScadaLink → ZB.MOM.WW.ScadaBridge), inbound auth was
+re-architected to the shared `ZB.MOM.WW.Auth.ApiKeys` Bearer/`sbk_` verifier (the SQL
+`ApiKey` store, `ApiKeyValidator`, and peppered `IApiKeyHasher` were retired),
+`ForbiddenApiChecker` became a thin shim over Script Analysis #25's `ScriptTrustValidator`,
+and JSON-Schema `$ref` runtime resolution (`SchemaRefResolver`) plus recursive
+Object/List validation+materialization landed. The dominant new finding is that a
+**read-only `InboundDatabaseHelper`** was added to `InboundScriptContext.Database` —
+reintroducing the very raw-DB-access capability the design doc still explicitly
+forbids (regressing the InboundAPI-007 closure), with no read-only enforcement and a
+sync-over-async, token-ignoring data path. **4 new findings**: 1 High, 1 Medium,
+2 Low — no Critical.
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | Recursive Materialize/Validate look sound; `WaitForAttribute` deadline-vs-wait-timeout tension (InboundAPI-029). |
+| 2 | Akka.NET conventions | ☑ | ASP.NET-hosted; routes via `IInstanceRouter` → `CommunicationService`. Correlation/parent ids stamped on all four `Route.To()` verbs. No issues. |
+| 3 | Concurrency & thread safety | ☑ | `ConcurrentDictionary` caches; per-request DI scope created from root in singleton executor (valid). New DB helper blocks a pool thread (InboundAPI-027). |
+| 4 | Error handling & resilience | ☑ | DB helper's `ExecuteScalar`/`ExecuteReader` are synchronous and ignore `_ct` — a slow query is unbounded by the method timeout (InboundAPI-027). |
+| 5 | Security | ☑ | `InboundDatabaseHelper` runs arbitrary script-supplied SQL (not read-only despite the name) on central with host privileges (InboundAPI-026). Bearer verifier delegates constant-time compare to shared lib. Forbidden-API check delegates to #25. |
+| 6 | Performance & resource management | ☑ | Capped capture/known-bad caches; `ArrayPool` rent. Sync-over-async DB calls (InboundAPI-027). |
+| 7 | Design-document adherence | ☑ | `Database` helper contradicts the doc's "No direct database access" decision (InboundAPI-026, regresses InboundAPI-007); `WaitForAttribute` timeout binding diverges from spec §6 line 192 (InboundAPI-029). |
+| 8 | Code organization & conventions | ☑ | `InboundDatabaseHelper` is a component-local helper (fine); schema types in Commons. No issues. |
+| 9 | Testing coverage | ☑ | `InboundDatabaseHelper` tests cover SELECT happy-path only — no write/DDL-rejection test (because none is enforced); `WaitForAttribute`/`Database` not end-to-end via the endpoint (InboundAPI-028). |
+| 10 | Documentation & comments | ☑ | Helper XML doc claims "read-only" which is unenforced (folded into InboundAPI-026). |
+
 ## Findings
 
 ### InboundAPI-001 — Singleton script handler cache mutated without synchronization
@@ -1299,3 +1329,199 @@ route, not on the `/api/` prefix. Options:
 The endpoint-filter form is the recommended fix — it co-locates the audit-emission
 scope with the route definition and matches how InboundAPI-006/008 gating is
 already wired.
+
+### InboundAPI-026 — `InboundDatabaseHelper` gives inbound scripts arbitrary raw SQL (not read-only); contradicts the design doc's "No direct database access" decision and regresses InboundAPI-007
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Security |
+| Status | Open |
+| Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundDatabaseHelper.cs:13-70`, `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundScriptExecutor.cs:267-281`, `:387`; design doc `docs/requirements/Component-InboundAPI.md:202-215` |
+
+**Description**
+
+A new `InboundDatabaseHelper` is exposed to every inbound API script as
+`InboundScriptContext.Database` (`InboundScriptExecutor.cs:387`), backed by the
+production `IDatabaseGateway` (`AddExternalSystemGateway()` in the central-role branch
+of `Program.cs:81`). Its `QuerySingle<T>`/`Query` methods take a script-supplied `sql`
+string and run it verbatim — `cmd.CommandText = sql; cmd.ExecuteScalar()/ExecuteReader()`
+(`InboundDatabaseHelper.cs:27-29`, `:41-44`) — against a raw ADO.NET `DbConnection`
+obtained from `IDatabaseGateway.GetConnectionAsync` (`IDatabaseGateway.cs:20`, the same
+gateway the External System Gateway uses for SCADA machine-data DBs). Two problems:
+
+1. **It directly contradicts the design doc.** `Component-InboundAPI.md:202-215` — which
+   is the *resolution* of finding InboundAPI-007 — states verbatim: *"**No direct
+   database access.** Inbound API scripts are not given a raw database client. Handing a
+   script a raw `SqlConnection` is in direct tension with the ScadaBridge script trust
+   model … Scripts interact with the system only through the curated `Route` and
+   `Parameters` surfaces … If a method needs data from the configuration or machine-data
+   databases, that access belongs behind a dedicated, scoped helper — not a
+   general-purpose connection — and would be added here as an explicit design change."*
+   The code now ships exactly the capability the doc forbids, and the doc was **not**
+   updated. This is a regression of the closed InboundAPI-007 (whose chosen resolution
+   was to remove the DB API and document its absence). Inbound API scripts are authored
+   by the less-trusted-than-Admin Design role and execute on the central node with the
+   host process's privileges, so this materially widens the trust boundary the design
+   deliberately drew.
+
+2. **"Read-only" is unenforced — it is a comment, not a constraint.** The class summary
+   says *"Read-only database access"* but nothing restricts the SQL: a script may run
+   `Database.Query("conn", "DELETE FROM Machine")`, `UPDATE`, `INSERT`, `DROP`, or any
+   DDL/DML, and `Query`/`QuerySingle` will execute it (`ExecuteReader`/`ExecuteScalar`
+   happily run non-SELECT statements). The gateway opens a normal read-write connection
+   (no `ApplicationIntent=ReadOnly`, no statement allow-listing). The "read-only" claim
+   in the XML doc and the call-site comment (`InboundScriptExecutor.cs:240`,
+   "read-only Database helper") is therefore false and misleading.
+
+Parameter *values* are correctly bound (`AddParameters`, `:59-68`) so value-injection is
+not the issue — the issue is that arbitrary statement text is script-controlled and the
+"read-only" containment does not exist.
+
+**Recommendation**
+
+Reconcile code and design, and make the safety claim real. Either (a) if scoped DB
+access is now intended, update `Component-InboundAPI.md:202-215` to authorize it as the
+explicit design change the prior text demanded, AND enforce read-only — restrict to
+`SELECT` (reject any statement whose first significant token is not `SELECT`/`WITH`, or
+better, route through a read-intent connection / a curated query API that cannot mutate),
+fix the XML summary to describe what is actually enforced, and gate which connection
+names a script may reach; or (b) if direct DB access is still out of scope, remove
+`InboundDatabaseHelper` and the `Database` member (reverting to the InboundAPI-007
+posture). Add a regression test asserting a non-SELECT statement is rejected once
+read-only is enforced (see InboundAPI-028).
+
+**Resolution**
+
+_Unresolved._
+
+### InboundAPI-027 — `InboundDatabaseHelper` is sync-over-async and ignores the method-deadline token — thread-pool starvation and an unbounded slow query
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundDatabaseHelper.cs:25-33`, `:40-44` |
+
+**Description**
+
+`QuerySingle`/`Query` execute on the ASP.NET request thread (the script runs inside
+`CSharpScript.RunAsync` on a thread-pool thread). Both methods do
+`_gateway.GetConnectionAsync(connectionName, _ct).GetAwaiter().GetResult()`
+(`:25`, `:40`) — a synchronous block on an async DB-open — and then call the **synchronous**
+`cmd.ExecuteScalar()` (`:29`) / `cmd.ExecuteReader()` (`:44`). Two consequences:
+
+1. **Thread-pool starvation.** The inbound API has no rate limiting (a deliberate design
+   choice). Each DB-backed method invocation blocks a pool thread for the full duration
+   of the DB round-trip via `.GetAwaiter().GetResult()`. Under concurrent load this
+   consumes pool threads that cannot be released until the DB responds — exactly the
+   pattern ASP.NET guidance warns against (it won't deadlock without a sync context, but
+   it does starve the pool and degrade the whole central node).
+
+2. **The slow query is not bounded by the method timeout.** The cancellation token
+   `_ct` (the method-deadline `cts.Token` passed at `InboundScriptExecutor.cs:268`) is
+   forwarded to `GetConnectionAsync` but **not** to the synchronous `ExecuteScalar()` /
+   `ExecuteReader()` — those overloads ignore cancellation entirely. So once a connection
+   is open, a hung or slow `SELECT` runs to completion (or the provider's own command
+   timeout) regardless of the method timeout, holding the blocked thread the whole time.
+   This contradicts the InboundAPI-016 guarantee that the method timeout bounds the
+   method's work.
+
+**Recommendation**
+
+Use the async ADO.NET path end-to-end and honour the token: `await
+_gateway.GetConnectionAsync(...)`, then `await cmd.ExecuteReaderAsync(_ct)` /
+`await cmd.ExecuteScalarAsync(_ct)`, and make `QuerySingle`/`Query` async (`Task<T?>` /
+`Task<IReadOnlyList<...>>`) so the script `await`s them rather than blocking a pool
+thread. Set a `CommandTimeout` derived from the remaining method deadline as a backstop.
+If a synchronous script-facing signature must be kept, at minimum register the command
+with the token and a command timeout so a slow query cannot run unbounded.
+
+**Resolution**
+
+_Unresolved._
+
+### InboundAPI-028 — `InboundDatabaseHelper` has no negative-path tests; `Database`/`WaitForAttribute` are not exercised end-to-end through the endpoint
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/InboundDatabaseHelperTests.cs:34-67`, `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/EndpointExtensionsTests.cs` |
+
+**Description**
+
+`InboundDatabaseHelperTests` covers only the SELECT happy path (single value, no-rows
+default, multi-column row, empty result). There is no test that:
+
+- a non-SELECT statement (`DELETE`/`UPDATE`/`DROP`) is rejected — because nothing rejects
+  it today (InboundAPI-026); a test would currently *prove* the helper is not read-only;
+- the `InvalidOperationException("Database is not available …")` path fires when the
+  gateway is null (`InboundDatabaseHelper.cs:24`, `:39`);
+- the method-deadline token cancels a slow query (InboundAPI-027).
+
+Separately, `InboundScriptContext.Database` and `RouteTarget.WaitForAttribute` — both
+new script-facing capabilities — are unit-tested at the helper/route level but never
+driven end-to-end through `HandleInboundApiRequest` (the `EndpointExtensionsTests`
+TestServer suite does not invoke a script that touches `Database` or `WaitForAttribute`),
+so a wiring regression that fails to construct the `Database` helper or thread the
+deadline into a wait would be silent.
+
+**Recommendation**
+
+Add negative-path tests to `InboundDatabaseHelperTests` (null-gateway throw; once
+InboundAPI-026 enforces read-only, a write/DDL-rejection test; an InboundAPI-027
+cancellation test). Add an `EndpointExtensionsTests` case driving a method script that
+calls `Database.Query(...)` and one that calls `Route.To(...).WaitForAttribute(...)` so
+the executor→context→helper wiring is covered through the real endpoint.
+
+**Resolution**
+
+_Unresolved._
+
+### InboundAPI-029 — Routed `WaitForAttribute` is cancelled by the method-level deadline, contradicting spec §6 (wait bounded by the wait timeout, not the method timeout)
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/RouteHelper.cs:220-247`, `:302-303`; design doc `docs/requirements/Component-InboundAPI.md:192` |
+
+**Description**
+
+`Component-InboundAPI.md:192` says of `WaitForAttribute`: *"The cluster call is bounded
+by the **wait timeout** rather than the generic integration timeout."* But
+`RouteTarget.WaitForAttribute` routes with `Effective(cancellationToken)` (`:226`),
+which — when the script passes no explicit token, the normal case — resolves to
+`_deadlineToken` (the method-level timeout CTS bound by `WithDeadline` at
+`InboundScriptExecutor.cs:279`). So a method whose timeout (default 30 s) is shorter than
+the requested wait timeout (e.g. `WaitForAttribute("Flag", true, TimeSpan.FromMinutes(5))`)
+has its wait cancelled at 30 s by the method deadline, never reaching the 5-minute wait
+timeout the spec says governs it. The routed wait is thus bounded by the *generic method
+timeout* — exactly what spec line 192 says it should not be.
+
+This is in genuine tension with the InboundAPI-016 decision ("the method timeout covers
+routed calls, including waits"). The two design statements conflict for the
+`WaitForAttribute` case: §6 line 192 wants the wait timeout to win; InboundAPI-016 wants
+the method deadline to be the hard ceiling. The current code silently resolves it in
+favour of the method deadline, which makes the documented `WaitForAttribute` timeout
+semantics (and its `false`-on-timeout return contract) unreliable whenever the method
+timeout is the smaller of the two.
+
+**Recommendation**
+
+Reconcile the two design statements explicitly. Options: (a) treat the method deadline as
+the hard ceiling and update `Component-InboundAPI.md:192` to say the wait is bounded by
+`min(waitTimeout, remaining method deadline)` and that an undersized method timeout will
+cut a wait short; or (b) honour spec §6 by deriving the wait's effective token from the
+wait timeout (a per-wait CTS) rather than the method deadline, accepting that a long wait
+can outlive the generic method timeout. Either way, add a `RouteHelperTests` case pinning
+the chosen semantics when method-timeout < wait-timeout (the current suite only tests
+deadline-token *inheritance*, not the timeout *ordering* conflict).
+
+**Resolution**
+
+_Unresolved._
diff --git a/code-reviews/KpiHistory/findings.md b/code-reviews/KpiHistory/findings.md
new file mode 100644
index 00000000..8582a58c
--- /dev/null
+++ b/code-reviews/KpiHistory/findings.md
@@ -0,0 +1,307 @@
+# Code Review — KpiHistory
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ZB.MOM.WW.ScadaBridge.KpiHistory` |
+| Design doc | `docs/requirements/Component-KpiHistory.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-06-20 |
+| Reviewer | claude-agent |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
+
+## Summary
+
+KpiHistory is a small, well-built observability module: a single ~305-line recorder
+singleton (`KpiHistoryRecorderActor`), a strongly-typed options class with a fail-fast
+validator, and a thin DI composition root. The owned code is clean — the actor is
+textbook best-effort: every sample pass and purge sweep runs off the actor thread via
+`PipeTo`, per-source faults are isolated so one throwing `IKpiSampleSource` never aborts
+a tick or the other sources, the repository write/purge is guarded, no exception escapes
+either tick handler, and a lifecycle `CancellationTokenSource` is cancelled in `PostStop`
+so an in-flight pass observes shutdown promptly. The singleton is wired correctly in the
+Host (ClusterSingletonManager + proxy + `PhaseClusterLeave` drain) and is deliberately
+absent from the readiness barrier, exactly as the design requires. The options validator,
+the EAV table mapping, both named indexes, and all four `IKpiSampleSource` implementations
+exist and are registered on the central host as designed.
+
+The dominant theme is **unbounded work under load**, in two places. First, the recorder
+has **no in-flight guard** on its sample timer — directly contradicting its own XML-doc
+claim to "mirror the NotificationOutboxActor timer + scope-per-tick + PipeTo pattern,"
+because the NotificationOutbox dispatcher *does* hold an in-flight guard and the recorder
+does not. When a sample pass runs longer than `SampleInterval` (slow/recovering DB), Akka
+periodic timers enqueue, not coalesce, so overlapping `RunSamplePass` tasks pile up,
+multiplying DB load at exactly the moment the store is struggling and double-writing
+samples for overlapping windows. Second, `GetRawSeriesAsync` has **no server-side row
+cap**: the design's `DefaultMaxSeriesPoints` ceiling is applied by `KpiSeriesBucketer`
+only *after* the full raw window is materialised into memory — a 7-day window at the
+default 60 s cadence is ~10 080 rows per series pulled to the Central UI before
+downsampling, defeating the stated intent of the cap. A secondary theme is **bucketer
+contract drift** (the "largest-timestamp-wins for unsorted input" doc claim is not what
+the code does, and the short-series early-return emits raw capture timestamps where the
+downsample path emits bucket-boundary timestamps) — both live in Commons but are core to
+this module's query reducer. No Critical findings; one High, three Medium, two Low.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | Yes | `capturedAt`/cut-off captured on the actor thread (correct). `KpiSeriesBucketer` short-series early-return emits raw capture timestamps vs bucket-boundary timestamps on the downsample path (KpiHistory-005). Bucketer "largest-timestamp-wins for unsorted input" doc claim is false — it is last-in-iteration (KpiHistory-006). |
+| 2 | Akka.NET conventions | Yes | `PipeTo` + scope-per-tick + off-thread I/O + `IWithTimers` all correct; sender not captured across awaits; messages immutable singletons. But no in-flight guard despite the XML claiming to mirror NotificationOutbox (KpiHistory-001). |
+| 3 | Concurrency & thread safety | Yes | Actor state is effectively stateless; `_shutdownCts` lifecycle is correct. The missing in-flight guard allows overlapping sample passes under DB latency (KpiHistory-001). |
+| 4 | Error handling & resilience | Yes | Per-source isolation, write/purge guards, and `OperationCanceledException` shutdown handling are all correct and tested. No issues beyond KpiHistory-001's load amplification. |
+| 5 | Security | Yes | No injection surface — all queries are parameterised LINQ; `Source`/`Metric`/`Scope`/`ScopeKey` are equality predicates, never interpolated SQL. Scope isolation via `ScopeKey == scopeKey` (incl. `IS NULL` for Global) is correct. No issues found. |
+| 6 | Performance & resource management | Yes | `GetRawSeriesAsync` returns the entire window with no server-side cap before bucketing (KpiHistory-002). `RecordSamplesAsync` short-circuits empty batches (good). Purge is set-based `ExecuteDeleteAsync` but is a single unbatched statement (KpiHistory-004). |
+| 7 | Design-document adherence | Yes | The recorder XML claims to mirror NotificationOutbox's pattern but omits its in-flight guard (KpiHistory-001). `DefaultMaxSeriesPoints`-as-a-true-cap intent is undermined by KpiHistory-002. Otherwise faithful: singleton, not-readiness-gated, daily purge, EAV schema, indexes. |
+| 8 | Code organization & conventions | Yes | Options class owned by the component (correct); validator co-located; `IKpiHistoryRepository`/`IKpiSampleSource`/`KpiSample`/bucketer in Commons, impl in ConfigurationDatabase (correct); singleton Props built in Host. No issues found. |
+| 9 | Testing coverage | Yes | Recorder isolation, faulted-tick recovery, and purge cut-off are tested; validator bounds fully covered; bucketer has strong sorted-input coverage. Gaps: no overlapping-tick test, no unsorted-input bucketer test, no short-series timestamp-semantics assertion (KpiHistory-003). |
+| 10 | Documentation & comments | Yes | XML is generally excellent. Two drift points: the "mirror NotificationOutbox" claim (KpiHistory-001) and the bucketer unsorted-input claim (KpiHistory-006). `SampleComplete`/`PurgeComplete` no-op handlers are documented. |
+
+## Findings
+
+### KpiHistory-001 — Recorder has no in-flight guard; overlapping sample passes pile up under DB latency
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Akka.NET conventions |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.KpiHistory/KpiHistoryRecorderActor.cs:89-92`, `:103-107`, `:143-159` |
+
+**Description**
+
+The sample timer is a plain Akka periodic timer
+(`Timers.StartPeriodicTimer(SampleTimerKey, SampleTick.Instance, …, interval: _options.SampleInterval)`).
+`HandleSampleTick` launches `RunSamplePass(...)` off-thread via `PipeTo` and returns
+immediately; the piped-back `SampleComplete` is a deliberate no-op
+(`Receive<SampleComplete>(_ => { })`). There is **no in-flight guard** — nothing prevents
+a second `SampleTick` from starting a second `RunSamplePass` while the first is still
+awaiting its DB round-trip.
+
+Akka periodic timers do not coalesce missed ticks — they enqueue. So when a sample pass
+takes longer than `SampleInterval` (a slow, contended, or recovering central MS SQL —
+exactly the regime where observability matters most), each subsequent tick spawns *another*
+concurrent pass. Each pass opens its own DI scope + `DbContext` and issues its own
+`AddRange` + `SaveChangesAsync`. The result is a self-amplifying load spiral against the
+struggling store, plus duplicate sample rows whose `CapturedAtUtc` values straddle the same
+real-time window (the design's "one shared tick timestamp" invariant assumes one pass per
+tick).
+
+The actor's own XML-doc states it "mirrors the `NotificationOutboxActor` timer +
+scope-per-tick + PipeTo pattern" (lines 37-39) — but `NotificationOutboxActor` holds an
+explicit in-flight boolean cleared on `DispatchComplete` precisely to serialise sweeps.
+This recorder dropped that half of the pattern. The piped-back `SampleComplete` message is
+the natural place the guard would be lowered, and its current empty body is the tell that
+the guard was intended but omitted.
+
+**Recommendation**
+
+Add an in-flight guard: set a `_sampleInFlight` flag at the top of `HandleSampleTick`, skip
+(and log at debug) the tick if already set, and clear it in the `SampleComplete` handler
+(both success and failure projections already route there). This matches the
+NotificationOutbox pattern the doc claims to follow and bounds the recorder to one pass per
+tick. Add a regression test that drives two `SampleTick`s before the first pass completes
+(e.g. a repository whose `RecordSamplesAsync` blocks on a gate) and asserts only one pass
+ran. The purge tick is daily + idempotent so a guard there is optional, but consider the
+same treatment for symmetry.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added a `_sampleInFlight` guard to the recorder — `HandleSampleTick` skips (debug-logs) if a pass is in flight; the flag is cleared via a `SampleComplete` message piped on BOTH success and failure, so a faulted source can't wedge the guard. Overlapping-tick regression test added.
+
+### KpiHistory-002 — `GetRawSeriesAsync` materialises the entire window with no server-side cap; `DefaultMaxSeriesPoints` is applied only after the fact
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IKpiHistoryRepository.cs:39-41` (contract); `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:45-62` (impl); consumed by `src/ZB.MOM.WW.ScadaBridge.CentralUI/Services/KpiHistoryQueryService.cs:77-93` |
+
+**Description**
+
+`GetRawSeriesAsync` has no `limit`/`maxPoints`/`Take` parameter. The implementation issues a
+`Where(...).OrderBy(CapturedAtUtc).Select(...).ToListAsync()` that pulls **every** sample in
+`[fromUtc, toUtc]` for the series into memory. The `DefaultMaxSeriesPoints` ceiling
+(default 200, design-described as the cap that prevents "a single trend query [from
+streaming] an arbitrarily large series to the Central UI" — `KpiHistoryOptionsValidator`
+XML) is enforced by `KpiSeriesBucketer.Bucket` **after** the full raw set has already been
+fetched across the wire and allocated as a `List<KpiSeriesPoint>` in the query service.
+
+At the default 60 s sample cadence, a 24 h window is ~1 440 rows/series, a 7 d window is
+~10 080 rows/series, and a 30 d window ~43 200 rows/series — per chart, with up to four
+trend panels on a page each issuing its own query. The cap that the design relies on for
+back-pressure provides none at the data tier; it only trims the in-memory result the UI
+binds to. The `IX_KpiSample_Series` index makes the *range scan* efficient, but the row
+count returned is still unbounded by anything except the window the parent page chooses.
+
+**Recommendation**
+
+Push the downsampling toward the store, or at least bound the fetch. Cheapest correct fix:
+add an optional `int? maxRows` to `GetRawSeriesAsync` and have the query service pass a
+generous multiple of `effectiveMax` (e.g. `effectiveMax * k`) so the bucketer still has
+enough density to pick representative last-values while the DB-side `Take` caps the
+transfer. A more thorough fix is a server-side bucketed aggregation (GROUP BY a computed
+bucket index, MAX(CapturedAtUtc) per bucket), but that is a larger change the design
+explicitly deferred ("v1 ships exactly one aggregation"). At minimum, document the
+unbounded-fetch behaviour and the practical window ceiling so an operator does not point a
+30 d chart at a busy multi-site deployment.
+
+**Resolution**
+
+Deferred 2026-06-20: bounding the raw window fetch (cheap `Take`-cap vs. server-side bucketed aggregation) is a design decision and the code lives in ConfigurationDatabase/Commons, not this project. Recorded as a query-path enhancement; the practical window ceiling should be documented when addressed.
+
+### KpiHistory-003 — Missing tests: overlapping ticks, unsorted-input bucketing, short-series timestamp semantics
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Testing coverage |
+| Status | Resolved |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.KpiHistory.Tests/KpiHistoryRecorderActorTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.Commons.Tests/Kpi/KpiSeriesBucketerTests.cs` |
+
+**Description**
+
+The recorder and bucketer tests are otherwise strong (per-source isolation, faulted-tick
+recovery, purge cut-off, validator bounds, sorted-input bucketing, right-edge handling,
+empty-bucket omission, out-of-window filtering). Three behaviour gaps remain, each tied to
+a finding here:
+
+1. **No overlapping-tick test** (KpiHistory-001). Every recorder test sends a single tick
+   and awaits its effect; nothing exercises a second `SampleTick` arriving while the first
+   pass is still in flight, so the missing in-flight guard is invisible to the suite. A
+   gated-repository test (block `RecordSamplesAsync`, send two ticks, count passes) would
+   pin the intended one-pass-per-tick behaviour.
+2. **No unsorted-input bucketer test** (KpiHistory-006). All bucketer tests pass ascending
+   input, so the doc's "largest-timestamp-wins for unsorted input" claim is never checked —
+   and it is in fact wrong.
+3. **No short-series timestamp-semantics assertion** (KpiHistory-005).
+   `Bucket_BucketStartUtc_IsSetToBucketStart…` covers only the downsample path; no test
+   asserts what `BucketStartUtc` the early-return (`raw.Count <= maxPoints`) path emits, so
+   the inconsistency between the two paths is untested.
+
+**Recommendation**
+
+Add the three tests above. The overlapping-tick test belongs in
+`KpiHistoryRecorderActorTests` (it is a recorder behaviour); the two bucketer tests belong
+in `KpiSeriesBucketerTests`.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added the overlapping-tick gated-repository test plus unsorted-input and short-series bucketer tests (the latter pin the documented short-series behaviour).
+
+### KpiHistory-004 — Retention purge is a single unbatched `ExecuteDeleteAsync`; a large backlog deletes in one transaction
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:65-71` |
+
+**Description**
+
+`PurgeOlderThanAsync` runs one set-based `Where(s => s.CapturedAtUtc < before).ExecuteDeleteAsync(...)`.
+For the steady-state daily cadence this deletes one day of expired rows and is fine. But if
+the purge has not run for an extended period (singleton down across a long failover window,
+`PurgeInterval` mis-set and later corrected, or `RetentionDays` shortened), a single
+unbounded `DELETE` can touch a very large row count in one transaction — lock escalation on
+`KpiSample`, a long-running transaction, and transaction-log growth, which on the shared
+central MS SQL can affect operational tables. The Audit Log purge path, by contrast, uses a
+bounded batched delete for exactly this reason.
+
+This is observability data on a non-partitioned `[PRIMARY]` table, so the blast radius is
+bounded and the severity is Low — but the unbatched delete is a latent operational hazard
+on the shared store.
+
+**Recommendation**
+
+Loop a bounded delete (`DELETE TOP (N) … WHERE CapturedAtUtc < @before`, or EF Core's
+batching) until zero rows are affected, mirroring the Audit Log purge shape. Keep the
+returned total for the existing log line.
+
+**Resolution**
+
+Deferred 2026-06-20: batching the retention purge is a shared-MS-SQL-store tradeoff (vs. the Audit Log's batched-delete precedent); low severity for non-partitioned observability data. Recorded as a future enhancement.
+
+### KpiHistory-005 — `KpiSeriesBucketer` short-series early-return emits raw capture timestamps where the downsample path emits bucket-boundary timestamps
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:57-58` (early return) vs `:93-94` (downsample) |
+
+**Description**
+
+`KpiSeriesBucketer.Bucket` has two return paths that disagree on what `BucketStartUtc`
+means. When `raw.Count <= maxPoints` it returns `raw` unchanged — those points carry the
+raw `CapturedAtUtc` as their `BucketStartUtc` (the repository builds
+`new KpiSeriesPoint(s.CapturedAtUtc, s.Value)`). When `raw.Count > maxPoints` it returns
+points whose `BucketStartUtc` is the **bucket boundary**
+(`fromUtc + bucketIndex * bucketWidthTicks`), as the dedicated test
+`Bucket_BucketStartUtc_IsSetToBucketStartNotRawPointTimestamp` asserts.
+
+So the same series, charted at a density below vs above `maxPoints`, plots its points at
+different x-positions: actual capture instants in the sparse case, evenly-spaced bucket
+starts in the dense case. The downstream `KpiTrendChart` normalises X across
+`[min(BucketStartUtc), max(BucketStartUtc)]`, so the visual impact is minor (the time range
+is essentially the same), but the contract is inconsistent and the x-axis "tick spacing"
+subtly changes as a window crosses the cap. This is the bucketer that the KpiHistory design
+defines as the module's query reducer, so the inconsistency is in-scope even though the file
+lives in Commons.
+
+**Recommendation**
+
+Make the two paths agree. Either document the difference explicitly on `Bucket` (the
+short-series path returns raw capture instants; the downsample path returns bucket starts),
+or — cleaner — have the short-series path also project onto a consistent timestamp basis if
+exact bucket-start semantics are part of the contract. Add the short-series timestamp test
+from KpiHistory-003.
+
+**Resolution**
+
+Deferred 2026-06-20: whether the bucketer's short-series early-return and the downsample path must agree on `BucketStartUtc` semantics (vs. documenting the difference) is a contract decision for the design owner. The doc was corrected (KpiHistory-006) to describe current behaviour; the contract change is deferred.
+
+### KpiHistory-006 — `KpiSeriesBucketer` doc claims "largest timestamp within each bucket is selected" for unsorted input; the code selects last-in-iteration
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:20-21` (doc), `:88-97` (code) |
+
+**Description**
+
+The `raw` param doc states: *"If not sorted, the point with the largest timestamp within
+each bucket is selected."* The code does not do this. When a point is stored into a bucket,
+`best[bucketIndex].BucketStartUtc` is set to the **bucket-start** timestamp
+(`fromUtc + bucketIndex * bucketWidthTicks`), not the raw point's timestamp. The
+last-value comparison for a subsequent point in the same bucket is then
+`point.BucketStartUtc > best[bucketIndex].BucketStartUtc` — i.e. it compares the new raw
+point's capture time against the *bucket start*, which (for any in-bucket point) is almost
+always true. The effect is that each later-in-iteration point overwrites the previous one
+regardless of their relative timestamps, so the **last point in iteration order** wins, not
+the point with the largest timestamp.
+
+For the production caller this is harmless: `KpiHistoryRepository.GetRawSeriesAsync` always
+`OrderBy(CapturedAtUtc)`, so iteration order equals time order and last-in-iteration is the
+largest timestamp. But the documented contract for unsorted input is simply false, and the
+"if ties, keep first encountered — stable" comment (line 89) is also inaccurate — the
+overwrite triggers on equal-as-well-as-greater for any in-bucket point. A future caller that
+trusts the unsorted-input guarantee will get wrong results silently.
+
+**Recommendation**
+
+Either (a) fix the comparison to track the selected raw point's actual timestamp (store the
+raw `point.BucketStartUtc` alongside the emitted value and compare against that), making the
+"largest timestamp wins" claim true for unsorted input; or (b) tighten the doc to state the
+method requires ascending-sorted input and selects last-in-iteration otherwise, and drop the
+inaccurate "largest timestamp" / "stable ties" language. Pair with the unsorted-input test
+from KpiHistory-003.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): corrected the `KpiSeriesBucketer` param XML doc — dropped the false 'largest-timestamp-wins / stable ties' claim; now states it requires ascending-sorted input and selects last-in-iteration otherwise. Doc-only.
diff --git a/code-reviews/NotificationService/findings.md b/code-reviews/NotificationService/findings.md
index 401b94a8..fddec49e 100644
--- a/code-reviews/NotificationService/findings.md
+++ b/code-reviews/NotificationService/findings.md
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.NotificationService` |
 | Design doc | `docs/requirements/Component-NotificationService.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |
 
 ## Summary
@@ -100,6 +100,39 @@ and `CredentialRedactor` masks any component of the credential string that is
 ≥ 4 characters long — a 4-character user name like `root` or a 4-char tenant
 prefix could be aggressively scrubbed out of unrelated log text (NS-025).
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Re-reviewed the whole current state. Since `1eb6e97` the module was renamed
+(`ScadaLink → ZB.MOM.WW.ScadaBridge`, so the diff shows every file as "added" at
+the new path), gained complete XML-doc coverage, and — most importantly — **NS-021
+is now RESOLVED**: `ISmtpClientWrapper.AuthenticateAsync` threads an `oauth2UserName`
+and the OAuth2 branch builds `new SaslMechanismOAuth2(oauth2UserName, token)`,
+throwing `SmtpPermanentException` if the identity is empty; the real production
+caller `EmailNotificationDeliveryAdapter.SendAsync` passes `config.FromAddress`
+(`src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Delivery/EmailNotificationDeliveryAdapter.cs:196`),
+and `MailKitSmtpClientWrapperTests` pins both the throw and the XOAUTH2 framing.
+All ten prior NS-019..NS-025 findings hold; the module is now cleanly scoped to the
+shared central SMTP primitives (no orphaned site-delivery code). The re-review
+surfaced **three new findings**, none above Medium: the OAuth2 token endpoint and
+scope are hardcoded to Microsoft 365 so no other OAuth2 provider can authenticate
+despite the design doc promising "other modern SMTP providers" (NS-026), one test
+file is misplaced in this test project (its system-under-test lives in
+NotificationOutbox — NS-027), and one test's XML doc still names the
+NS-019-deleted `NotificationDeliveryService` (NS-028).
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | NS-021 now resolved (oauth2 user identity threaded + tested). Classification/redaction correct. No new correctness defect. |
+| 2 | Akka.NET conventions | ☑ | No actors. `AddNotificationServiceActors` is a documented no-op. |
+| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` per-credential `ConcurrentDictionary` + per-entry `SemaphoreSlim` double-check is correct; concurrent-fetch test passes. No issues. |
+| 4 | Error handling & resilience | ☑ | Delivery error path now lives entirely in NotificationOutbox; the shared `SmtpErrorClassifier` is sound (typed exceptions + numeric SMTP code, cancellation excluded). No issues here. |
+| 5 | Security | ☑ | OAuth2 secret/token never logged (tenant only); `CredentialRedactor` tightened (NS-025). At-rest encryption of `SmtpConfiguration.Credentials` still NS-013 Deferred (Commons-scoped). OAuth2 endpoint hardcoded to M365 (NS-026). |
+| 6 | Performance & resource management | ☑ | Single `SmtpClient` per wrapper, disposed by caller's `using`; per-delivery factory documented (NS-022). No leak. |
+| 7 | Design-document adherence | ☑ | Central-only redesign clean. New gap: OAuth2 hardcoded to M365 token URL + outlook.office365.com scope contradicts "other modern SMTP providers" in the doc (NS-026). |
+| 8 | Code organization & conventions | ☑ | One-type-per-file holds. `NotificationOutboxKpiSampleSourceTests.cs` is in this test project but its SUT is in NotificationOutbox (NS-027). |
+| 9 | Testing coverage | ☑ | Primitives well covered (classifier, TLS parser, redactor, OAuth2 token cache/refresh/concurrency, auth-wrapper negative paths incl. NS-021). No false-coverage; orphaned tests gone (NS-024). |
+| 10 | Documentation & comments | ☑ | XML docs accurate after NS-019/023, EXCEPT `SmtpErrorClassifierTests.cs:11` still references the deleted `NotificationDeliveryService` (NS-028). |
+
 ## Checklist coverage
 
 | # | Category | Examined | Notes |
@@ -860,3 +893,118 @@ The threshold is a defence-in-depth choice; the existing tests assert that `Hunt
 **Recommendation**
 
 Tighten the redaction policy: mask only the obviously-secret components — the password (Basic), the client secret (OAuth2), and the whole packed string — not the user name / tenant / client id. The simplest implementation is to redact only the **last** colon-separated component (the secret) plus the full packed string. Bump the per-component minimum length to something high enough that a typical short user name does not match (≥ 12 chars is the usual heuristic for a password). Add a test asserting `Scrub("/root/.config", "root:hunter2")` does not mask `/root/.config`'s `root`.
+
+### NotificationService-026 — OAuth2 token endpoint and scope are hardcoded to Microsoft 365; no other OAuth2 SMTP provider can authenticate
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.NotificationService/OAuth2TokenService.cs:73`, `src/ZB.MOM.WW.ScadaBridge.NotificationService/OAuth2TokenService.cs:80` |
+
+**Description**
+
+`OAuth2TokenService.GetTokenAsync` constructs the token URL as
+`$"https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token"` (`:73`) and
+requests the fixed scope `"https://outlook.office365.com/.default"` (`:80`). Both
+are Microsoft-365-specific: the authority host is Azure AD's, and the scope is the
+Outlook/Office-365 SMTP-send resource. The design doc (`Component-NotificationService.md`,
+"Email Server Configuration") states the OAuth2 Client Credentials mode is
+"For Microsoft 365 **and other modern SMTP providers** that require OAuth2", and
+CLAUDE.md's External Integrations decision likewise says "OAuth2 Client Credentials
+(Microsoft 365) **or** Basic Auth". As implemented, only Microsoft 365 can ever
+work: a Google Workspace, on-prem Keycloak/ADFS, or any non-Azure OAuth2 SMTP relay
+would need a different authority host and a different scope, neither of which is
+configurable. The credential triple carries only `tenantId:clientId:clientSecret`
+— there is no authority-host or scope field to override either constant. An operator
+who configures an `oauth2` SMTP config pointing at a non-M365 provider gets a token
+request sent to `login.microsoftonline.com` with `{tenantId}` interpolated as a path
+segment, which fails (404/400) and surfaces as a permanent delivery failure with no
+hint that the provider is simply unsupported.
+
+This is latent rather than actively broken — the only documented/tested target is
+M365 — but it contradicts the design doc's stated breadth and is the kind of gap that
+silently blocks a future deployment against a non-Microsoft tenant.
+
+**Recommendation**
+
+Either (a) make the authority host and scope configurable — add optional
+`OAuth2Authority`/`OAuth2Scope` fields to `SmtpConfiguration` (defaulting to the M365
+values) and thread them into `GetTokenAsync` — and update the cache key to include
+them; or (b) narrow the design doc and CLAUDE.md to say OAuth2 Client Credentials is
+**Microsoft 365 only** and document the M365-only constraint on `OAuth2TokenService`,
+so an operator is not misled into configuring an unsupported provider. Option (b) is
+the smaller change if M365-only is acceptable for v1.
+
+**Resolution**
+
+Deferred 2026-06-20: the OAuth2 token endpoint + scope are hardcoded to Microsoft 365 while the doc implies multi-provider support. Whether to make authority/scope configurable or narrow the doc to 'Microsoft 365 only' for v1 is a product/scope decision. Latent today (only M365 is deployed/tested). Recorded for that decision.
+
+### NotificationService-027 — `NotificationOutboxKpiSampleSourceTests` is in the wrong test project (its system-under-test lives in NotificationOutbox)
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Resolved |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/Kpi/NotificationOutboxKpiSampleSourceTests.cs:1`, `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Kpi/NotificationOutboxKpiSampleSource.cs` |
+
+**Description**
+
+`tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/Kpi/NotificationOutboxKpiSampleSourceTests.cs`
+was added (commit `0d6c026d`, "K6 — NotificationOutbox sample source") inside the
+**NotificationService** test project, but the class it exercises —
+`NotificationOutboxKpiSampleSource` — lives in the **NotificationOutbox** project
+(`src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Kpi/NotificationOutboxKpiSampleSource.cs`),
+confirmed by a repo-wide grep showing the type defined only there. A module's test
+project should test that module: this test belongs in
+`tests/ZB.MOM.WW.ScadaBridge.NotificationOutbox.Tests`. As placed, it forces the
+NotificationService test project to reference the NotificationOutbox project, blurs
+the module boundary the review process treats as the unit of review, and means a
+reviewer scoping "the NotificationService tests" picks up coverage for a different
+component. It also inflates this module's apparent test count with tests that assert
+nothing about NotificationService.
+
+**Recommendation**
+
+Move `Kpi/NotificationOutboxKpiSampleSourceTests.cs` to
+`tests/ZB.MOM.WW.ScadaBridge.NotificationOutbox.Tests/Kpi/`, and drop the
+NotificationOutbox project reference from `ZB.MOM.WW.ScadaBridge.NotificationService.Tests.csproj`
+if nothing else in that test project needs it.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): moved `NotificationOutboxKpiSampleSourceTests` from NotificationService.Tests into NotificationOutbox.Tests (where its SUT lives), with the namespace adjusted to that project's convention. Passes in its new home.
+
+### NotificationService-028 — Stale XML cross-reference: `SmtpErrorClassifierTests` still names the NS-019-deleted `NotificationDeliveryService`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Resolved |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/SmtpErrorClassifierTests.cs:11` |
+
+**Description**
+
+The class-summary XML doc on `SmtpErrorClassifierTests` says the classification
+policy "is shared between `<see cref="NotificationDeliveryService"/>` and the central
+outbox's `EmailNotificationDeliveryAdapter`". `NotificationDeliveryService` was
+deleted as part of the NS-019 resolution (the orphaned site-side sender), so this
+cross-reference points at a type that no longer exists. The `<see cref=...>` will not
+resolve and the sentence describes a sharing relationship that ended with the
+central-only redesign — the classifier is now consumed only by
+`EmailNotificationDeliveryAdapter`. This is the same stale-doc class NS-023 cleaned up
+in the production source, missed in this one test file. Minor, but misleading to a
+maintainer and a dangling `cref`.
+
+**Recommendation**
+
+Drop the `NotificationDeliveryService` reference from the summary; state that the
+classifier is shared by the central `EmailNotificationDeliveryAdapter` (and is a
+reusable primitive of this module), consistent with the NS-019/NS-023 cleanup.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): removed the dangling `<see cref="NotificationDeliveryService"/>` (type deleted by NS-019) from the `SmtpErrorClassifierTests` XML doc, repointing the reference to the actual production caller. Doc-only.
diff --git a/code-reviews/README.md b/code-reviews/README.md
index fb22104b..7869ab63 100644
--- a/code-reviews/README.md
+++ b/code-reviews/README.md
@@ -40,37 +40,39 @@ module file and counted in **Total**.
 | Severity | Open findings |
 |----------|---------------|
 | Critical | 0 |
-| High | 0 |
-| Medium | 0 |
-| Low | 0 |
-| **Total** | **0** |
+| High | 1 |
+| Medium | 1 |
+| Low | 2 |
+| **Total** | **4** |
 
 ## Module Status
 
 | Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
 |--------|---------------|--------|----------------|------|-------|
-| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 11 |
+| [AuditLog](AuditLog/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 16 |
 | [CLI](CLI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 24 |
 | [CentralUI](CentralUI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 36 |
-| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 14 |
+| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 15 |
 | [Commons](Commons/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 |
-| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
+| [Communication](Communication/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 24 |
 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 27 |
-| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
-| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 24 |
-| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 |
-| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 |
-| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
-| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 25 |
+| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 |
+| [DeploymentManager](DeploymentManager/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 |
+| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 |
+| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 |
+| [Host](Host/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 |
+| [InboundAPI](InboundAPI/findings.md) | 2026-06-20 | `4307c381` | 0/1/1/2 | 4 | 29 |
+| [KpiHistory](KpiHistory/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 6 |
 | [ManagementService](ManagementService/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 |
 | [NotificationOutbox](NotificationOutbox/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 13 |
-| [NotificationService](NotificationService/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 25 |
-| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 21 |
-| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 6 |
-| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 |
-| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 26 |
-| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 24 |
-| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
+| [NotificationService](NotificationService/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 28 |
+| [ScriptAnalysis](ScriptAnalysis/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 8 |
+| [Security](Security/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 |
+| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 9 |
+| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 |
+| [SiteRuntime](SiteRuntime/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 31 |
+| [StoreAndForward](StoreAndForward/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 |
+| [TemplateEngine](TemplateEngine/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 |
 | [Transport](Transport/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 15 |
 
 ## Pending Findings
@@ -84,14 +86,21 @@ description, location, recommendation — lives in the module's `findings.md`.
 
 _None open._
 
-### High (0)
+### High (1)
 
-_None open._
+| ID | Module | Title |
+|----|--------|-------|
+| InboundAPI-026 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` gives inbound scripts arbitrary raw SQL (not read-only); contradicts the design doc's "No direct database access" decision and regresses InboundAPI-007 |
 
-### Medium (0)
+### Medium (1)
 
-_None open._
+| ID | Module | Title |
+|----|--------|-------|
+| InboundAPI-027 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` is sync-over-async and ignores the method-deadline token — thread-pool starvation and an unbounded slow query |
 
-### Low (0)
+### Low (2)
 
-_None open._
+| ID | Module | Title |
+|----|--------|-------|
+| InboundAPI-028 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` has no negative-path tests; `Database`/`WaitForAttribute` are not exercised end-to-end through the endpoint |
+| InboundAPI-029 | [InboundAPI](InboundAPI/findings.md) | Routed `WaitForAttribute` is cancelled by the method-level deadline, contradicting spec §6 (wait bounded by the wait timeout, not the method timeout) |
diff --git a/code-reviews/ScriptAnalysis/findings.md b/code-reviews/ScriptAnalysis/findings.md
new file mode 100644
index 00000000..8a6a94ea
--- /dev/null
+++ b/code-reviews/ScriptAnalysis/findings.md
@@ -0,0 +1,407 @@
+# Code Review — ScriptAnalysis
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis` |
+| Design doc | `docs/requirements/Component-ScriptAnalysis.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-06-20 |
+| Reviewer | claude-agent |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
+
+## Summary
+
+This is the first formal review of the ScriptAnalysis component (#25), the single
+authoritative source of truth for the ScadaBridge script trust boundary. The module
+is small (5 source files, ~700 lines) and well-documented, and its core design is
+**sound**: the trust verdict is genuinely **semantic** (Roslyn symbol resolution in
+Pass 1), not a string/regex deny-list, so the classic spelling-based evasions —
+aliases, `using static`, `global::`, partial-namespace qualification — are correctly
+defeated by resolving the symbol rather than the text. The reflection-gateway
+hardening (Pass 2) closes the `typeof(x).Assembly.GetType("System.IO.File")` class of
+bypass that a namespace-only deny-list cannot see. All four documented consumers
+(Template Engine, Site Runtime, Inbound API, Central UI) genuinely delegate to
+`ScriptTrustValidator.FindViolations` and **hard-reject** on a non-empty result; none
+re-implements or relaxes the policy. A reflection-based parity guard for the
+compile-only surfaces exists in `SiteRuntime.Tests`. **No working trust-boundary
+bypass was found at HEAD against the production (cluster-hosted) configuration.**
+
+That said, the review surfaced several real issues. The highest-impact one is a
+**silent degradation of the semantic pass when `TRUSTED_PLATFORM_ASSEMBLIES` is
+unavailable** (single-file/AOT/trimmed host): `AnalysisReferences` falls back to the
+5-assembly minimal set, after which a *bare* forbidden type inside an *allowed*
+namespace (the documented `Process` via `using System.Diagnostics;` case) no longer
+resolves and slips Pass 1 entirely — and Pass 2 never flags bare identifiers, only
+dotted spellings (ScriptAnalysis-001). This is the only finding that can weaken the
+verdict, and it is narrow (the production hosts carry a TPA list, and the downstream
+real `Compile` gate catches it for two of the four consumers), so it is rated High
+rather than Critical. The remaining findings are robustness/correctness/doc issues:
+the semantic pass swallows nothing but the unresolved-symbol fallback is namespace-
+spelling-dependent (Medium); `nameof(...)` over-flags legitimate non-forbidden uses
+inconsistently (Low, documented-intentional but worth pinning); `ParseDiagnostics`/
+`Compile` drop warnings despite the doc promising "errors and warnings"; there is no
+caching/throttling of the Roslyn compilations (performance); the `extraReferences`
+XML-doc on `FindViolations` is contradicted by the Central UI consumer; and the
+adversarial test corpus, though good, has gaps (no extension-method, no
+TPA-fallback, no `file`-scoped/`unsafe`, no verbatim/Unicode-escape identifier
+cases). The trust boundary is **semantically sound in production**; the open items
+harden the edges and align the spec.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | `nameof` over-flags inconsistently (ScriptAnalysis-005); `GetSyntacticScopes` only returns text for dotted names so a bare unresolved forbidden identifier yields no scope (root of ScriptAnalysis-001/002). |
+| 2 | Akka.NET conventions | ☑ | No issues found — module has no actors, no Akka dependency (by design; csproj references only Roslyn + Commons). Stateless static APIs, thread-safe. |
+| 3 | Concurrency & thread safety | ☑ | No issues found — all public surfaces are static/stateless; the shared `static readonly` policy collections are read-only after init; `HardeningWalker` is per-call. `AnalysisReferences` eager-built once at type init. |
+| 4 | Error handling & resilience | ☑ | `RoslynScriptCompiler.Compile` converts exceptions into a fail-the-gate error entry (fail-safe). `BuildAnalysisReferences` swallows per-assembly read errors and silently falls back to the minimal set — the silent fallback is the resilience gap (ScriptAnalysis-001). |
+| 5 | Security | ☑ | Semantic (symbol-based) verdict — no spelling bypass found in production. One conditional degradation (TPA-absent host) weakens Pass 1 to spelling-dependent (ScriptAnalysis-001); `extraReferences` can never cause a false allow (verified). Defence-in-depth only, as documented — not a runtime sandbox (ScriptAnalysis-008). |
+| 6 | Performance & resource management | ☑ | No compilation caching/throttling; every gate call re-parses, rebuilds a compilation, and re-runs Roslyn — `FindViolations` builds a fresh `CSharpCompilation` against the *full framework* reference set on every call (ScriptAnalysis-006). |
+| 7 | Design-document adherence | ☑ | `ParseDiagnostics`/`Compile` drop warnings though doc says "errors and warnings" (ScriptAnalysis-004); `AnalysisReferences` (full-framework set) is not described in REQ-SA-2, which says DefaultReferences (ScriptAnalysis-007). |
+| 8 | Code organization & conventions | ☑ | No issues found — POCO/static layout is clean, namespace correct, depends only on Commons + Roslyn. `ScriptCompileSurface` lives in ScriptAnalysis (not Commons) — acceptable as it mirrors a runtime surface and is consumed here. |
+| 9 | Testing coverage | ☑ | Good adversarial reject/allow corpus, but missing: TPA-fallback degradation, extension methods, verbatim/Unicode-escape identifiers, `unsafe`/pointer, comment/trivia, and `AnalysisReferences`/exception paths (ScriptAnalysis-003). |
+| 10 | Documentation & comments | ☑ | `FindViolations` XML-doc claim "Forbidden references are NOT added here" is contradicted by the Central UI consumer forwarding `compilation.References` (ScriptAnalysis-007, doc half). Otherwise thorough and accurate. |
+
+## Findings
+
+### ScriptAnalysis-001 — Semantic trust pass silently degrades to spelling-dependent when `TRUSTED_PLATFORM_ASSEMBLIES` is unavailable
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Security |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustPolicy.cs:146-184`; interacts with `ScriptTrustValidator.cs:111-136`, `:195-213` |
+
+**Description**
+
+`AnalysisReferences` is the reference set that lets Pass 1 (the semantic symbol pass —
+the *authoritative* guard) resolve every type a script names to its true namespace.
+It is built from `AppContext.GetData("TRUSTED_PLATFORM_ASSEMBLIES")`. When that value
+is absent — a single-file deployment, an AOT/trimmed host, or any runtime that does
+not publish the TPA list — `BuildAnalysisReferences` falls back to the 5-assembly
+`DefaultReferences` set (`ScriptTrustPolicy.cs:183`) **silently** (the per-assembly
+`catch { }` at `:165`/`:178` and the `Count > 0` fallback at `:183` emit nothing).
+
+After that fallback, a *bare* forbidden type that lives inside an *allowed* namespace
+no longer resolves. The policy's own documented example is exactly this case:
+`using System.Diagnostics; var p = Process.Start("x");`. `System.Diagnostics` is an
+allowed namespace (so `VisitUsingDirective` does not flag the import), and bare
+`Process` is an `IdentifierName`. With `Process.dll` absent from the minimal
+reference set, `model.GetSymbolInfo` returns no symbol, so Pass 1 falls to
+`GetSyntacticScopes`, which returns an empty scope list for a bare (dotless)
+identifier (`ScriptTrustValidator.cs:210-212`). Pass 2's `HardeningWalker` only flags
+forbidden *dotted* names and forbidden `using` imports — never a bare identifier
+token (`VisitIdentifierName`, `:373-395`, only checks `dynamic`/`Activator`). The
+forbidden `Process.Start` therefore slips the trust validator entirely. The same
+degradation reopens the broader "forbidden type in an allowed namespace reached as a
+bare identifier" class that the full-framework reference set was specifically added
+(commit `06975720`, M3.6) to close.
+
+This is rated High not Critical because: (a) the production hosts (Akka.NET cluster,
+docker topology) run on a normal framework-dependent runtime that publishes the TPA
+list, so the full set is loaded; and (b) for two of the four consumers (Template
+Engine, Site Runtime) the downstream real `Compile` gate against a curated reference
+set independently rejects an undefined `Process` symbol. But Inbound API and the
+Central UI run gate rely on `FindViolations` for the reflection/namespace verdict,
+and a deployment-mode change that drops the TPA list would silently weaken the
+boundary with no diagnostic.
+
+**Recommendation**
+
+Make the degradation loud and fail-safe: when the TPA list is unavailable, log a
+warning (or expose a diagnostic flag the consumers can surface), and — more
+importantly — add the BCL assemblies that host the forbidden-type-in-allowed-namespace
+cases (at minimum `System.Diagnostics.Process`) to `DefaultAssemblies` so they resolve
+even in the minimal fallback. Alternatively, extend Pass 2 to flag bare identifiers
+whose name matches the simple name of a known forbidden type (e.g. `Process`,
+`WebClient`), independent of symbol resolution, so the verdict never depends solely on
+the reference set being complete.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the TPA-absent fallback now folds the
+assemblies hosting the forbidden anchor types (Process, Socket, File, Thread, Assembly,
+Marshal) into the minimal validation reference set so a bare forbidden type still
+resolves and is flagged; an `AnalysisReferencesDegraded` flag + a `Trace.TraceWarning`
+make the degraded mode observable. CRITICAL: anchors are added ONLY to the validation
+`AnalysisReferences`, never to `DefaultReferences` — keeping the compile gate from being
+able to bind `Process`. Verified by a TPA-fallback test that a bare `Process.Start` is
+still rejected.
+
+### ScriptAnalysis-002 — Pass 2 (syntactic hardening) is spelling-dependent and cannot catch a bare forbidden identifier on its own
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:289-395`; `:195-213` |
+
+**Description**
+
+Pass 2 (`HardeningWalker`) is purely textual: `VisitUsingDirective` flags forbidden
+`using` imports (`:289-295`), `VisitQualifiedName`/`VisitMemberAccessExpression` flag
+forbidden *dotted* names (`:298-369`), and `VisitIdentifierName` flags only the bare
+identifiers `dynamic`/`Activator` (`:373-395`). It cannot independently flag any other
+bare forbidden identifier — it relies entirely on Pass 1's semantic resolution for
+that. The two passes are therefore not the "two independent guards" the design implies
+for the bare-identifier case: for an unresolved bare identifier (see
+ScriptAnalysis-001) **neither** pass fires. `GetSyntacticScopes` deliberately returns
+nothing for dotless names (`:210-212`, "bare local identifiers cannot reach a forbidden
+namespace") — which is true only while Pass 1 resolves them; once resolution fails the
+assumption breaks.
+
+This is the structural root of ScriptAnalysis-001 and is recorded separately because
+it is a design property (the syntactic pass is not self-sufficient against bare names)
+that should be acknowledged even after the TPA-fallback hole is closed: the defence is
+single-layered, not double-layered, for the bare-identifier vector.
+
+**Recommendation**
+
+Document explicitly that Pass 1 is the sole guard for bare (single-identifier)
+forbidden references and that Pass 2 only backstops *spelled* forbidden namespaces. If
+genuine defence-in-depth is wanted for bare names, add a simple-name deny-set to Pass 2
+(the simple names of the forbidden root types — `Process`, `WebClient`, `Marshal`,
+`Registry`, `Thread`, `File`, `Directory`, etc.), accepting the small false-positive
+risk on a script's own identically-named local, which a security gate should tolerate.
+
+**Resolution**
+
+Deferred 2026-06-20: strengthening Pass-2 (syntactic) so a bare forbidden identifier is
+caught without Pass-1 symbol resolution is a defence-in-depth enhancement; the
+production path (TPA present, two-layer compile gate on most call sites) is sound.
+Recorded for a follow-up.
+
+### ScriptAnalysis-003 — Adversarial test corpus has bypass-relevant gaps
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Testing coverage |
+| Status | Resolved |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.ScriptAnalysis.Tests/ScriptTrustValidatorTests.cs`; `RoslynScriptCompilerTests.cs` |
+
+**Description**
+
+The reject/allow corpus is solid for the spelled-namespace, alias, `using static`,
+`global::`, reflection-gateway, generic-argument, and nested-lambda vectors. For a
+security-critical trust boundary, however, several bypass-relevant cases are untested:
+
+- **TPA-fallback degradation** (ScriptAnalysis-001) — no test forces
+  `AnalysisReferences` onto the minimal set and asserts that bare `Process`/`WebClient`
+  is still rejected. This is the highest-value missing test.
+- **Extension methods** — no test that a forbidden-namespace extension method invoked
+  in receiver-position (`x.SomeForbiddenExt()`) is resolved and flagged.
+- **Verbatim / Unicode-escaped identifiers** — `@dynamic`, `dynamic`,
+  `Activator`, or `GetType` are not exercised; `VisitIdentifierName` compares
+  `Identifier.ValueText`, which decodes Unicode escapes (so it likely holds), but the
+  case is not pinned, and `dynamic` is matched via `text == "dynamic"` (`:380`) which is
+  a separate path from the `ForbiddenIdentifiers` set.
+- **`unsafe`/pointer/`stackalloc`** — relies implicitly on `ScriptOptions` defaulting
+  `AllowUnsafe=false`; not asserted.
+- **Comment/trivia and string-literal payloads** — no test that a forbidden namespace
+  inside a comment or string is *not* flagged (false-positive guard) and that a
+  forbidden token reconstructed only at runtime from strings is correctly *not* the
+  validator's concern (documents the sandbox caveat).
+
+**Recommendation**
+
+Add the cases above, prioritising the TPA-fallback test (inject a minimal-reference
+build of `AnalysisReferences` or factor the reference set so a test can force the
+fallback) and the extension-method test. Add a verbatim/Unicode-escape identifier
+reject test for `dynamic`/`Activator`/`GetType`.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added 18 adversarial tests — TPA-fallback
+bare-Process/Socket, extension-method invocation, verbatim/Unicode-escaped identifiers,
+`unsafe` blocks (benign vs forbidden-API-inside), and comment/string-literal
+false-positive guards.
+
+### ScriptAnalysis-004 — `ParseDiagnostics`/`Compile` drop warnings though the design doc promises "errors and warnings"
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/RoslynScriptCompiler.cs:30-33`, `:73-76` |
+
+**Description**
+
+REQ-SA-3 states `Compile` returns the diagnostics "filtered to errors and warnings"
+and `ParseDiagnostics` returns "syntax-level diagnostics (errors and warnings)". The
+code filters `d.Severity == DiagnosticSeverity.Error` only in both methods
+(`RoslynScriptCompiler.cs:31` and `:74`), so warnings are silently dropped. Filtering
+to errors-only is arguably the safer choice for a compile *gate* (a warning should not
+block a deploy), but the code and the spec disagree, and a future maintainer reading
+the doc will expect warnings in the returned list.
+
+**Recommendation**
+
+Reconcile: either change the doc (REQ-SA-3) to say "errors only", or include warnings
+and let callers decide. Given the gate semantics, updating the doc to "error-severity
+diagnostics" is the cleaner fix.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): corrected the design doc (REQ-SA-3) to
+'error-severity only', matching the compile wrapper's existing behaviour (a warning must
+not block a deploy gate). Code unchanged.
+
+### ScriptAnalysis-005 — `nameof(forbiddenType)` flagging is inconsistent between passes and across resolution states
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:42-48`, `:99-136`, `:298-325` |
+
+**Description**
+
+The class comment (`:42-48`) states that a forbidden type inside `nameof(...)` is
+"deliberately flagged" as a conservative fail-safe, and `Rejects_NameOf_ForbiddenType`
+pins `nameof(System.IO.File)`. However the flagging is incidental, not deliberate, and
+inconsistent: `nameof(System.IO.File)` is caught because the inner `System.IO.File` is
+a `QualifiedNameSyntax`/member access that both passes inspect directly — the
+`nameof` wrapper is irrelevant. A bare `nameof(Process)` (after `using
+System.Diagnostics;`, with full TPA loaded) resolves `Process` to a real symbol and
+*is* flagged by Pass 1; but the same `nameof(Process)` under the minimal-reference
+fallback resolves to nothing and is *not* flagged. So the "deliberate fail-safe"
+behaviour actually varies with the reference set, and `nameof` — which produces only a
+compile-time string and reaches no API — is a genuine false positive that will reject a
+legitimate script doing `nameof(System.Net.IPAddress)` for a log message.
+
+**Recommendation**
+
+Decide the intended semantics explicitly. If `nameof` should be allowed (it reaches no
+runtime API), special-case `NameOfExpression`/`InvocationExpression` named `nameof` to
+suppress the inner-reference report. If it should be rejected, make that uniform
+regardless of resolution state (tie to a simple-name check), and keep the test. Either
+way, align the comment with the actual, deterministic behaviour.
+
+**Resolution**
+
+Deferred 2026-06-20: the `nameof(forbiddenType)` over-flagging is a documented
+deliberate fail-safe; whether `nameof` (which reaches no API) should be allowed vs
+uniformly rejected is a design-owner decision. Recorded; no behaviour change this pass.
+
+### ScriptAnalysis-006 — No caching or throttling of Roslyn compilations; full-framework compilation rebuilt per call
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:73-97`; `RoslynScriptCompiler.cs:50-71` |
+
+**Description**
+
+`FindViolations` parses the script and builds a fresh `CSharpCompilation` against the
+**entire framework reference set** (`AnalysisReferences`, potentially hundreds of
+`MetadataReference`s) on every call (`:87-95`). `RoslynScriptCompiler.Compile`
+similarly creates a new `CSharpScript` and calls `.Compile()` per call. Roslyn
+compilation is the expensive part of these calls. The metadata references are cached
+(built once in `AnalysisReferences`), which is the main cost, so this is Low rather
+than Medium — but for the Central UI editor path (which can call the validator on every
+keystroke-debounced analysis) and for batch template validation (many scripts per
+deploy), the absence of any per-source memoisation or a bounded compilation cache means
+repeated work. There is no back-pressure or concurrency cap either; the Inbound API
+review (InboundAPI-009) already noted uncapped recompilation downstream.
+
+**Recommendation**
+
+Consider a small bounded cache keyed on a hash of `(code, globalsType)` for `Compile`
+results and on `code` for `FindViolations` verdicts (the verdict is a pure function of
+the source and the static policy). At minimum, document that callers on hot paths
+(editor live-analysis) should debounce and cache, and confirm the Central UI debounces
+before calling.
+
+**Resolution**
+
+Deferred 2026-06-20: compilation caching/throttling (the full-framework
+`CSharpCompilation` is rebuilt per `FindViolations` call) is a performance enhancement,
+not a correctness defect. Recorded for a follow-up.
+
+### ScriptAnalysis-007 — `FindViolations` `extraReferences` XML-doc contradicts the Central UI consumer; `AnalysisReferences` undocumented in REQ-SA-2
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:57-64`; `ScriptTrustPolicy.cs:125-144`; `docs/requirements/Component-ScriptAnalysis.md:74-80` |
+
+**Description**
+
+Two doc/code mismatches, neither a vulnerability:
+
+1. The `extraReferences` XML-doc on `FindViolations` (`:57-64`) says "Forbidden
+   references are NOT added here". But the Central UI run gate calls
+   `FindViolations(tree.ToString(), compilation.References)`
+   (`CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:1319-1320`), forwarding the full
+   BCL surface as `extraReferences`. This is in fact *safe* (the deny-list is by
+   namespace/type, so adding references can only improve resolution, never produce a
+   false allow — and the policy comment at `ScriptTrustPolicy.cs:138-142` says exactly
+   that), but the `FindViolations` doc as written implies the caller must not pass
+   forbidden assemblies, which is both unnecessary and contradicted in practice.
+
+2. REQ-SA-2 in the design doc describes Pass 1 as using "the full trusted-platform
+   reference set from `ScriptTrustPolicy.DefaultReferences`". The code uses a separate
+   `ScriptTrustPolicy.AnalysisReferences` (the full framework), and `DefaultReferences`
+   is explicitly the *minimal* set used by `RoslynScriptCompiler`. The doc conflates the
+   two and never mentions `AnalysisReferences` or the TPA mechanism — material because
+   that mechanism is the heart of the security verdict (and of ScriptAnalysis-001).
+
+**Recommendation**
+
+Rewrite the `extraReferences` doc to state that extra references only widen symbol
+resolution and never whitelist a forbidden API. Update REQ-SA-2 to describe
+`AnalysisReferences` (full-framework, TPA-sourced) versus `DefaultReferences` (minimal,
+runtime-fidelity), and note the TPA-fallback behaviour.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): corrected the `FindViolations`
+`extraReferences` XML doc (extra references only widen resolution, never whitelist) and
+documented `AnalysisReferences` vs `DefaultReferences` in REQ-SA-2.
+
+### ScriptAnalysis-008 — Static trust gate is defence-in-depth, not a sandbox; correctly documented but worth recording as a standing risk
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Security |
+| Status | Won't Fix |
+| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:34-40`; `docs/requirements/Component-ScriptAnalysis.md:151` |
+
+**Description**
+
+The component is explicit (both in code comments and REQ-SA-5) that it is a
+compile-time deny-list, not a runtime sandbox: scripts execute in-process, and a
+sufficiently determined script could reconstruct a forbidden API at runtime via paths
+the static walker cannot see (e.g. building type names from runtime strings, or any
+gap in the deny-list). The reflection-gateway hardening narrows this, and the
+`dynamic`/`Activator` bans close the obvious late-binding routes, but the residual risk
+is real and inherent to in-process execution of partially-trusted C#. This is recorded
+(not as a defect of this module, which behaves as designed) so the standing risk is
+visible in the audit trail and is revisited if/when the deferred runtime boundary
+(restricted `AssemblyLoadContext` / curated reference set / out-of-process sandbox) is
+scoped.
+
+**Recommendation**
+
+Keep the caveat prominent. Track the runtime-boundary work as a roadmap item; in the
+meantime ensure the curated *execution* reference sets (SiteRuntime `ScriptAssemblies`,
+`RoslynScriptCompiler` `DefaultReferences`) stay minimal so that even a deny-list miss
+cannot bind a forbidden type at execution time (this is the real second layer of
+defence for the compile-running consumers).
+
+**Resolution**
+
+Won't Fix 2026-06-20: the static gate is intentionally defence-in-depth, not a runtime
+sandbox — this is the explicitly-documented design posture, not a defect. Recorded as a
+standing risk for visibility.
diff --git a/code-reviews/Security/findings.md b/code-reviews/Security/findings.md
index 4e84ef39..65998a0a 100644
--- a/code-reviews/Security/findings.md
+++ b/code-reviews/Security/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.Security` |
 | Design doc | `docs/requirements/Component-Security.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 0 (1 deferred — Security-008) |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 (2 deferred — Security-008, Security-022) |
 
 ## Summary
 
@@ -78,6 +78,50 @@ values silently surface only on first login (Security-020); and the
 `RequireHttpsCookie=false` dev opt-out emits no warning, so an HTTP production
 deployment silently transmits the JWT bearer credential in cleartext (Security-021).
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Major architectural change since `1eb6e97`: the project was renamed (ScadaLink →
+ZB.MOM.WW.ScadaBridge) and the bespoke `LdapAuthService`/`LdapTransport`/`LdapAuthResult`
+were **cut over to the external `ZB.MOM.WW.Auth.Ldap` library** (commit `ac34dac4`); the
+prior `SiteScopeAuthorizationHandler` was deleted (Security-017 resolution), and the role
+model was canonicalised + collapsed (`Admin→Administrator`, `Design→Designer`,
+`Deployment→Deployer`, `Audit→Administrator`, `AuditReadOnly→Viewer`; commit `b104760b`)
+with M7 `Operator`/`Verifier` added (`a0ce8b6c`). The interactive cookie session now signs
+in with **bare claims** via the shared `SessionClaimBuilder` and is policed by
+`CookieSessionValidator` (idle-timeout + LDAP-free DB role-refresh, fail-closed, well
+tested). All prior fixes that survived the cutover remain in place (key-length guard,
+issuer/audience binding, `RoleMapper` union semantics, `Roles` constants, cookie hardening
+via `ZbCookieDefaults`, the `RequireHttpsCookie` warning). Security-008 (N+1 scope-rule
+query) is still correctly **Deferred**. This pass surfaced **4 new findings**
+(Security-022..025): three Medium and one Low. The most consequential is **Security-022** —
+the documented "cookie + JWT hybrid" no longer holds at HEAD: `/auth/token` still *mints* a
+bearer JWT but **no code anywhere validates it** (CLI uses HTTP Basic Auth, Inbound API uses
+the `sbk_` ApiKeys verifier, the cookie path uses bare claims), so `JwtTokenService`'s entire
+`ValidateToken`/`RefreshToken`/`RecordActivity`/`ShouldRefresh`/`IsIdleTimedOut` surface is
+dead code minting an orphaned credential. **Security-023** — `LdapGroupMapping.Role` accepts
+arbitrary free strings via CLI/API with no canonical-set validation, and the Central UI
+`RequireClaim` policies are case-sensitive while the ManagementActor authz check is
+case-insensitive, so a mis-cased/typo'd mapping silently grants inconsistent access across
+the two surfaces with no operator feedback. **Security-024** — the M7 `RequireOperator`/
+`RequireVerifier` SoD policies have no functional `AuthorizeAsync` grant/deny test (only
+constant-name assertions), unlike every other policy. **Security-025** (Low) — a unit test
+makes a live network connect to `nonexistent.invalid:9999`.
+
+_Re-review (2026-06-20, `4307c381`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | `RoleMapper` union semantics (Security-016 fix) hold and read correctly. `LdapGroupMapping.Role` is an unvalidated free string + case-sensitive UI policy vs case-insensitive ManagementActor authz (Security-023). |
+| 2 | Akka.NET conventions | ☑ | No actors. `AddSecurityActors` is still a placeholder. No issues. |
+| 3 | Concurrency & thread safety | ☑ | Services stateless/DI-scoped; `CookieSessionValidator` reads injected `TimeProvider`. LDAP now in the external library. No shared mutable state. No issues. |
+| 4 | Error handling & resilience | ☑ | `CookieSessionValidator` fails closed on idle (missing anchor = timed out) and swallows refresh faults keeping current roles (mirrors "LDAP failure: active sessions continue") — correct and well tested. `LdapAuthFailureMessages` maps the library's structured failure enum to stable user text. No issues. |
+| 5 | Security | ☑ | Cookie hardening (HttpOnly/SameSite=Strict/Secure-when-required/sliding) comes from `ZbCookieDefaults` — verified correct. JWT mint binds issuer/audience, pins `MapInbound/OutboundClaims=false`, fails fast on short key. BUT the minted bearer JWT is validated by no consumer (Security-022). DisableLogin auto-login is gated at registration + loud warning + doc'd (acceptable). |
+| 6 | Performance & resource management | ☑ | Security-008 N+1 remains correctly Deferred (gated on `ISecurityRepository`). `CookieSessionValidator` deliberately avoids re-minting on the no-refresh path. No new perf issues. |
+| 7 | Design-document adherence | ☑ | The cookie+JWT-hybrid design is now cookie-with-bare-claims; the JWT half is orphaned/unvalidated (Security-022) — the M2.19 doc note acknowledges bare-claim cookies but the doc still presents JWT-in-cookie as the model and does not flag that `/auth/token`'s JWT is unconsumed. Operator/Verifier policies match the design. |
+| 8 | Code organization & conventions | ☑ | `Roles`/`SecurityOptions` owned by the component; repository interface in Commons; additive `IGroupRoleMapper<string>` adapter. Dead JWT methods (Security-022) are the main org smell. |
+| 9 | Testing coverage | ☑ | `CookieSessionValidator`, `SessionClaimBuilder`, `RoleMapper` union, cookie wiring, `SecurityOptionsValidator`, DisableLogin are well covered. Gaps: no functional authz test for `RequireOperator`/`RequireVerifier` (Security-024); a unit test does live network I/O (Security-025); no test for a non-canonical/mis-cased mapping role (Security-023). |
+| 10 | Documentation & comments | ☑ | XML docs are thorough and current. `SecurityOptions` and `JwtTokenService` docs still describe an active JWT-refresh lifecycle that no longer has a caller (Security-022). |
+
 ## Checklist coverage
 
 | # | Category | Examined | Notes |
@@ -1008,3 +1052,178 @@ SecurityOptions:RequireHttpsCookie=true in production.")`. Optionally, also fail
 startup when `RequireHttpsCookie=false` AND `ASPNETCORE_ENVIRONMENT=Production`. Add a
 regression test that asserts the warning is emitted when the flag is disabled and not
 when it is enabled.
+
+### Security-022 — `/auth/token` mints a bearer JWT that no consumer validates; the entire `JwtTokenService` validate/refresh surface is dead code
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Security/JwtTokenService.cs:149`, `:194`, `:209`, `:239`, `:272`; `src/ZB.MOM.WW.ScadaBridge.CentralUI/Auth/AuthEndpoints.cs:99-148` |
+
+**Description**
+
+The design (Component-Security.md "Cookie + JWT Hybrid", CLAUDE.md "cookie+JWT hybrid …
+HMAC-SHA256, 15-minute expiry w/ sliding refresh") presents a cookie-embedded JWT as the
+session model. At HEAD this is no longer how the interactive session works (the M2.19 note
+acknowledges the cookie now carries **bare claims** policed by `CookieSessionValidator`),
+and — more importantly — the JWT half is now an **orphaned, unconsumed credential**. A grep
+across `src/` shows `JwtTokenService.ValidateToken`, `RefreshToken`, `RecordActivity`, and
+`ShouldRefresh` have **zero production callers**; only `GenerateToken` is called, solely by
+`POST /auth/token` (`AuthEndpoints.cs:134`). That endpoint returns an `access_token` to the
+caller, but **no endpoint in the codebase accepts or validates that JWT** as a credential:
+the CLI authenticates to the Management API with HTTP Basic Auth
+(`CLI/ManagementHttpClient.cs:36`), the Inbound API authenticates with the external
+`ZB.MOM.WW.Auth.ApiKeys` `sbk_<keyId>_<secret>` verifier, and the cookie pipeline uses
+`SessionClaimBuilder` bare claims (never `ValidateToken`). The result: `/auth/token` issues
+HMAC-signed bearer tokens that grant access to nothing, and ~120 lines of carefully-hardened
+issuer/audience/clock-skew/idle/refresh logic (the subject of resolved findings
+Security-006/007/014) are dead code. The hazard is twofold: (a) the design doc materially
+overstates the live session mechanism, misleading any future reader/auditor; and (b) an
+orphaned token mint is a latent footgun — a future endpoint added under the assumption that
+`/auth/token` produces a *usable* credential would wire a second, parallel auth path with
+no test coverage proving it actually validates.
+
+**Recommendation**
+
+Decide the intent explicitly. Either (a) **wire the bearer path**: add
+`AddAuthentication().AddJwtBearer(...)` (or a custom handler delegating to
+`JwtTokenService.ValidateToken`) on whatever surface is meant to accept the token, and add
+an end-to-end test that a `/auth/token`-issued JWT authorizes a protected request — then the
+hybrid model the doc describes is real; or (b) **remove the dead code**: delete
+`/auth/token` and `JwtTokenService.ValidateToken`/`RefreshToken`/`RecordActivity`/
+`ShouldRefresh`/`IsIdleTimedOut` (keep only what the bare-claim cookie path needs, which is
+none of `JwtTokenService`), and rewrite Component-Security.md's "Cookie + JWT Hybrid" /
+"Token Lifecycle" sections to describe the actual bare-claim cookie + `CookieSessionValidator`
+mechanism as the primary model rather than a footnote. Whichever is chosen, the design doc
+and the code must agree on whether a usable JWT exists.
+
+**Resolution**
+
+Deferred 2026-06-20: this needs a product decision — either wire a JWT-bearer auth handler so the documented cookie+JWT hybrid is real, or delete `/auth/token` + the orphaned `JwtTokenService` validate/refresh surface and rewrite the design doc to describe the actual bare-claim cookie model. Recorded for that decision.
+
+### Security-023 — `LdapGroupMapping.Role` is an unvalidated free string; case-sensitive UI policy vs case-insensitive server authz grants inconsistent access
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Security/AuthorizationPolicies.cs:144-157`; `src/ZB.MOM.WW.ScadaBridge.Security/RoleMapper.cs:40-42`; `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Security/LdapGroupMapping.cs:10`; `src/ZB.MOM.WW.ScadaBridge.ManagementService/ManagementActor.cs:106`, `:1893` |
+
+**Description**
+
+The role string an operator maps an LDAP group to (`LdapGroupMapping.Role`) is a free
+`string` with no validation against the canonical `Roles.*` set anywhere on the write path:
+`ManagementActor` (`:1893`) does `new LdapGroupMapping(cmd.LdapGroupName, cmd.Role)` verbatim,
+and the CLI `security role-mapping create --role <value>`
+(`CLI/Commands/SecurityCommands.cs:157`) accepts any string. The Central UI dropdown
+(`LdapMappingForm.razor`) constrains the *UI* path to the canonical constants, but the CLI/API
+path does not. Two enforcement surfaces then disagree on casing:
+
+- **Central UI authorization** uses `policy.RequireClaim(JwtTokenService.RoleClaimType, Roles.Deployer)`
+  (`AuthorizationPolicies.cs:150`). ASP.NET Core's `RequireClaim` compares claim *values*
+  with ordinal (case-**sensitive**) equality.
+- **ManagementActor authorization** uses `user.Roles.Contains(requiredRole, StringComparer.OrdinalIgnoreCase)`
+  (`ManagementActor.cs:106`) — case-**insensitive**.
+- **`RoleMapper`** matches the Deployer scope branch with `StringComparison.OrdinalIgnoreCase`
+  (`RoleMapper.cs:42`) but stamps the **DB row's verbatim casing** into the role claim.
+
+So a mapping created via CLI/API as `--role deployer` (lowercase) or a typo like `Deploy`
+produces a principal whose role claim is `"deployer"`: the **ManagementActor/CLI surface
+authorizes it** (case-insensitive), the **Central UI `RequireDeployment` policy denies it**
+(case-sensitive), and for the Deployer case `RoleMapper` still resolves site scope (matching
+case-insensitively) — a user who can deploy via CLI but is locked out of the equivalent UI
+pages, or vice versa, with no error explaining why. A pure typo (`"Deploer"`) silently grants
+a role claim that *no* policy and no ManagementActor check matches — the user appears to hold
+a role yet is denied everywhere, with no validation feedback at mapping-creation time.
+
+**Recommendation**
+
+Validate `cmd.Role` against the canonical `Roles.All` set (case-insensitively, normalising to
+the canonical casing) when creating/updating a mapping — in `ManagementActor` (the single
+server-side write path) and reject non-canonical values with a clear error; optionally also
+guard in the CLI for a fast local failure. Separately, make the two authorization surfaces
+agree on casing: either normalise the role claim to canonical casing in `SessionClaimBuilder`/
+`RoleMapper` before it is stamped, or make the ManagementActor check case-sensitive to match
+`RequireClaim` (the safer direction once values are validated). Add a regression test for a
+mis-cased mapping role.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added membership validation rejecting any `cmd.Role` not in `Roles.All` (canonical set, case-insensitive) at LDAP-group-mapping create AND update, before any DB write — closing the unvalidated-free-string gap. The casing/comparison-asymmetry half is a larger change and remains a recorded follow-up (membership check is safe: a non-canonical role never worked). No existing test regressed.
+
+### Security-024 — `RequireOperator` / `RequireVerifier` policies have no functional authorization test
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Testing coverage |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Security/AuthorizationPolicies.cs:153-157`; `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/SecurityTests.cs:1045-1186` (AuthorizationPolicyTests); `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/RolesTests.cs:44-45` |
+
+**Description**
+
+The M7 two-person Secured Write feature added the `RequireOperator` and `RequireVerifier`
+policies (`AuthorizationPolicies.cs:153-157`), each a single-role `RequireClaim`. These gate a
+SCADA control-surface write workflow whose entire safety argument is separation of duties, so
+correct grant/deny behaviour is security-critical. But `AuthorizationPolicyTests` — which
+exercises every *other* policy with real `IAuthorizationService.AuthorizeAsync` evaluations
+(Admin/Design/Deployment/OperationalAudit/AuditExport, including the load-bearing
+"Viewer reads but cannot export" SoD case) — has **no** Operator/Verifier test. The only
+coverage is `RolesTests.cs:44-45`, which asserts the *constant string values*
+(`"RequireOperator"`, `"RequireVerifier"`) — not that an `Operator` principal satisfies
+`RequireOperator`, that a `Verifier` does not, or that the two policies are mutually distinct.
+A regression that, say, mapped `RequireOperator` to `Roles.Verifier` (or to
+`OperationalAuditRoles`) would compile and pass the existing tests while silently collapsing
+the separation of duties.
+
+**Recommendation**
+
+Add `AuthorizeAsync`-based `[Theory]` cases to `AuthorizationPolicyTests` mirroring the
+existing policy tests: `RequireOperator` succeeds for `Roles.Operator` and fails for
+`Roles.Verifier`/`Administrator`/empty; `RequireVerifier` succeeds for `Roles.Verifier` and
+fails for `Roles.Operator`; and a combined case asserting an Operator-only principal cannot
+satisfy `RequireVerifier` (the SoD invariant at the policy layer).
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added functional `AuthorizeAsync` grant/deny tests for `RequireOperator`/`RequireVerifier` (and a combined SoD-distinctness test), guarding the separation-of-duties policy wiring against regression.
+
+### Security-025 — Unit test performs a live network connection to `nonexistent.invalid:9999`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Resolved |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/SecurityTests.cs:71-89` (`AuthenticateAsync_ConnectionFailure_FailsClosed_NeverThrows`) |
+
+**Description**
+
+`LdapAuthServiceTests.AuthenticateAsync_ConnectionFailure_FailsClosed_NeverThrows` constructs
+the external library's `LdapAuthService` pointed at `Server = "nonexistent.invalid"`,
+`Port = 9999` with a 2-second `ConnectionTimeoutMs`, then asserts the result fails closed.
+Although the test relies on the connection *failing*, it still performs a real DNS resolution
+and TCP connect attempt from the unit-test process. In an offline or network-sandboxed CI
+environment the resolution/connect can behave differently (immediate failure vs. the full
+2-second timeout vs. a captive-portal redirect), making the test both **slow** (up to ~2s of
+wall-clock dead time) and **environment-dependent**. Unit tests in this suite are otherwise
+network-free (the sibling cases are explicitly named `_NoNetwork`). The behaviour under test
+(fail-closed mapping of an unreachable directory) is really the external library's
+responsibility and is, per the file's own header comment, already covered by the library's own
+test suite through its internal connection seam.
+
+**Recommendation**
+
+Drop this case from the ScadaBridge unit suite (the library owns the fail-closed contract via
+its internal `ILdapConnection` seam, as the region comment notes), or move it to an explicitly
+opt-in integration-test category so it does not run in the default network-free unit pass. If
+kept as a smoke test, point it at a guaranteed-unroutable loopback address (e.g.
+`127.0.0.1` on a closed port) to bound the failure deterministically rather than relying on
+DNS resolution of `.invalid`.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the fail-closed connection test now targets `127.0.0.1` on a closed port instead of `nonexistent.invalid:9999` — deterministic loopback connection-refused, no external DNS/timeout dependency. Same assertion preserved.
diff --git a/code-reviews/SiteCallAudit/findings.md b/code-reviews/SiteCallAudit/findings.md
index f656afb3..83059753 100644
--- a/code-reviews/SiteCallAudit/findings.md
+++ b/code-reviews/SiteCallAudit/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` |
 | Design doc | `docs/requirements/Component-SiteCallAudit.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 2 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
 
 ## Summary
 
@@ -42,6 +42,23 @@ tests using a shared `MsSqlMigrationFixture`.
 | 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
 | 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Since the `1eb6e97` baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A — `OnReconciliationTickAsync` → `ReconcileSiteAsync`, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B — `OnPurgeTickAsync` → `PurgeTerminalAsync`, continue-on-error). All of the new code is clean, idiomatic, and well-tested (dedicated `SiteCallAuditReconciliationTests`, `SiteCallAuditPurgeTests`, `SiteCallRelayTests`, KPI-sample and options suites); the M6 `SiteCallAuditKpiSampleSource` correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one. The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present — currently masked because `Program.cs` co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores `MoreAvailable` so a single-timestamp batch saturation re-pulls without progress). No correctness, concurrency, security, or resilience defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps every edge case safe-by-construction (wasted work, never corruption).
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores `MoreAvailable`; a single-`UpdatedAtUtc` batch saturation re-pulls without progress (Finding 009 — wasted work only, idempotent upsert prevents corruption). Finding 003 (`IngestedAtUtc`) resolved — now stamped in `OnUpsertAsync` and `ReconcileSiteAsync`. |
+| 2 | Akka.NET conventions | Yes | `Sender` captured before first await on every async handler; `PipeTo` used for all read/relay replies; self-ticks scheduled via `ScheduleTellRepeatedlyCancelable` with `Self` sender, cancelled in `PostStop`. No issues found. |
+| 3 | Concurrency & thread safety | Yes | `_reconciliationCursors`, `_centralCommunication`, timers all mutated only on the actor thread. Per-tick/per-message DI scope (`CreateAsyncScope` for the async tick paths, `CreateScope` for sync read paths) disposed in `finally` / `await using`. No issues found. |
+| 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies `Accepted=false`. Deliberate coarse-retry divergence from `SiteAuditReconciliationActor` (no per-row abandon) is documented and safe given idempotent upsert. No issues found. |
+| 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond `SourceSite` (a site id). No issues found. |
+| 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; `MaxPageSize=200` clamp; async DbContext disposal off the dispatcher. `MoreAvailable`-ignoring re-pull is bounded wasted work (Finding 009). |
+| 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). |
+| 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository — a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). `KpiSampleSource` registered via `TryAddEnumerable`; options owned by the component. |
+| 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (`MoreAvailable: true`) or the single-timestamp no-progress edge — every reconciliation test uses `MoreAvailable: false` (relates to Finding 009). |
+| 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. Reconciliation `ReconcileSiteAsync` XML claims parity with `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling reads it for stalled detection; this actor ignores it) — Finding 009 recommendation includes a doc correction. |
+
 ## Findings
 
 ### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
@@ -323,3 +340,221 @@ Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
 stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
 page-size-1 query; (c) asserts each page returns exactly the stuck row, with
 no overlap and all 3 stuck rows visited.
+
+### SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Code organization & conventions |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227` |
+
+**Description**
+
+`StartPurgeTimer` (lines 307-321) gates the daily terminal-row purge tick on
+the **reconciliation** collaborators being present:
+
+```csharp
+private void StartPurgeTimer()
+{
+    if (_pullClient is null || _siteEnumerator is null)
+    {
+        return;
+    }
+    // ... schedule PurgeTick ...
+}
+```
+
+But the purge pass (`OnPurgeTickAsync` → `PurgeWithRepositoryAsync` →
+`ISiteCallAuditRepository.PurgeTerminalAsync`) needs only the repository — it
+has no dependency on `IPullSiteCallsClient` / `ISiteEnumerator`. The two
+collaborators are resolved by the production constructor (lines 202-227) via
+`serviceProvider.GetService<IPullSiteCallsClient>()` /
+`GetService<ISiteEnumerator>()` — both registered by
+`AddAuditLogCentralReconciliationClient`
+(`AuditLog/ServiceCollectionExtensions.cs:473,523`, registered as `ISiteEnumerator`
+and `IPullSiteCallsClient`). `GetService` (not `GetRequiredService`) returns
+`null` if that helper was never called, so a host that registers
+`AddSiteCallAudit()` **without** also calling
+`AddAuditLogCentralReconciliationClient(...)` constructs the actor with both
+collaborators null. In `PreStart` both `StartReconciliationTimer` and
+`StartPurgeTimer` early-return, so the actor runs forever with **no purge timer
+at all** → unbounded growth of the central `SiteCalls` table, with no log line
+to say the purge was skipped.
+
+This is **currently latent, not live**: `Program.cs` co-registers
+`AddAuditLogCentralReconciliationClient(builder.Configuration)` (line 107)
+immediately before `AddSiteCallAudit()` (line 113), so on the real central node
+both collaborators resolve and both timers run today. The risk is a future host
+(or a refactor that splits the reconciliation client out of the central
+composition root, or a test/embedded host that wants only ingest + purge)
+silently losing the purge with no diagnostic. The gate exists for a legitimate
+reason — keeping the repo-only test ctor free of *both* background timers so the
+MSSQL read/upsert tests see no scheduled side effects — but it conflates "no
+reconciliation route" with "no purge", and the actor's own XML
+(lines 36-38) documents the coupling as deliberate rather than flagging it as a
+hazard.
+
+**Recommendation**
+
+Decouple `StartPurgeTimer` from the reconciliation collaborators — purge needs
+only a repository, which both the production and the repo-only test ctors
+always have. Two viable shapes:
+
+- Preferred: gate the purge timer on its own real precondition (a repository is
+  always available, so arm it unconditionally in the production + reconciliation
+  ctors; keep it off only in the repo-only test ctor via an explicit
+  "background timers off" flag rather than by proxy of the reconciliation
+  collaborators). This keeps the MSSQL test isolation while removing the
+  accidental coupling.
+- At minimum: log a `Warning` in `StartPurgeTimer` (and `StartReconciliationTimer`)
+  when the timer is **not** armed on the production path, so a misconfigured host
+  surfaces "SiteCallAudit purge timer not started — `PurgeTerminalAsync` will
+  never run" instead of growing the table silently.
+
+Severity is a judgment call: Medium because the consequence (unbounded central
+table growth) is real and silent, but it is latent today (the only production
+composition root co-registers the reconciliation client), so an argument for Low
+is defensible. Flagged Medium to weight the silent-failure + data-growth aspect.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): decoupled the daily purge timer from the reconciliation collaborators (a new `_backgroundTimersEnabled` flag) so a central node that omits the reconciliation client still purges — no more silent unbounded `SiteCalls` growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client).
+
+### SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `docs/requirements/Component-SiteCallAudit.md:149-154`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62` |
+
+**Description**
+
+The design doc (`Component-SiteCallAudit.md` lines 149-154, the KPI History
+interaction) states:
+
+> "… the resulting `buffered` / `parked` / `failedLastInterval` /
+> `deliveredLastInterval` / `stuck` / `oldestPendingAgeSeconds` series render as
+> trends on the Site Calls page via `KpiTrendChart`."
+
+That lists **six** series as charted. In the code only **three** are charted:
+
+- The public charted catalog `KpiMetrics.SiteCallAudit`
+  (`KpiMetrics.cs:52-62`) exposes exactly `Buffered`, `Parked`, and
+  `FailedLastInterval` — and its own XML says "Charted Site Call Audit (#22)
+  metrics … Rendered by the Central UI Site Calls report trend panel."
+- `SiteCallAuditKpiSampleSource` (lines 42-47) keys those three off the public
+  Commons catalog (`KpiMetrics.SiteCallAudit.*`) and keeps the other three —
+  `deliveredLastInterval`, `stuck`, `oldestPendingAgeSeconds` — as **private**
+  string literals, with the comment "Charted metrics share the public Commons
+  catalog … the uncharted internal metrics stay private here (#178)."
+- The UI confirms it: `SiteCalls/SiteCallsReport.razor.cs` calls
+  `LoadSeriesAsync` for exactly `KpiMetrics.SiteCallAudit.Buffered`, `.Parked`,
+  and `.FailedLastInterval` — three series, no more.
+
+So all six metrics are *sampled* into the `KpiSample` history store, but only
+three are *charted*. The doc reads as if all six are rendered, which is a small
+design-doc drift (over-claiming the UI surface). The code is internally
+consistent and correctly commented; the doc is the stale party.
+
+**Recommendation**
+
+One-line doc edit on `Component-SiteCallAudit.md` lines 149-154: state that the
+three series `buffered` / `parked` / `failedLastInterval` render as trends on
+the Site Calls page via `KpiTrendChart`, and that `deliveredLastInterval` /
+`stuck` / `oldestPendingAgeSeconds` are **sampled into the KPI-history store but
+not charted** (available for future trend panels / ad-hoc query). No code change
+— the code is already the intended state.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the `KpiMetrics` catalog. Doc-only.
+
+### SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores `MoreAvailable`; a single-timestamp batch saturation re-pulls without progress
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535` |
+
+**Description**
+
+`ReconcileSiteAsync` (lines 505-535) pulls rows since the per-site cursor,
+upserts each, and advances the cursor to the maximum `UpdatedAtUtc` observed:
+
+```csharp
+var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue;
+var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...);
+
+var maxUpdated = since;
+foreach (var row in response.SiteCalls)
+{
+    await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc });
+    if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
+}
+_reconciliationCursors[site.SiteId] = maxUpdated;
+```
+
+Two interacting properties:
+
+1. The pull is **inclusive** — `PullAsync(site, since, …)` asks for rows with
+   `UpdatedAtUtc >= since` (documented in the method's XML, lines 497-503), and
+   the cursor advances to the **max** `UpdatedAtUtc` seen, which is itself one
+   of the rows just pulled. So the boundary row is re-pulled every tick and
+   deduped by the idempotent monotonic upsert — intended, harmless.
+2. `response.MoreAvailable` is **never read** at all. The `PullSiteCallsResponse`
+   carries `MoreAvailable` (`PullSiteCallsResponse.cs:14-17`: "True when the
+   site saturated the requested batch size — the caller should advance the
+   cursor and pull again"), but `ReconcileSiteAsync` ignores it and relies on
+   the natural tick cadence to drain the backlog over successive ticks.
+
+The edge case: if a site has **more rows than the batch size all sharing one
+exact `UpdatedAtUtc`** (e.g. a burst written in the same tick / same clock
+value), the saturated batch returns rows whose max `UpdatedAtUtc` equals
+`since`. `maxUpdated` therefore stays at `since`, the cursor does **not**
+advance, and because the pull is inclusive the next tick re-pulls the identical
+window — and again, and again — making no forward progress on that site's
+backlog. Because the upsert is idempotent and the `SiteCalls` table is an
+eventually-consistent mirror (not the source of truth), this is **wasted work,
+never corruption** — but it is an unbounded re-pull loop on a pathological
+input, and any rows in that backlog beyond the batch ceiling never get
+reconciled.
+
+The sibling `SiteAuditReconciliationActor` shares the inclusive-cursor /
+max-timestamp shape, so the same single-timestamp-saturation no-progress edge
+applies to it — but that sibling **does read `MoreAvailable`** (it feeds its
+stalled-detection state machine, `SiteAuditReconciliationActor.cs:324-325`,
+publishing `SiteAuditTelemetryStalledChanged` so a non-draining site surfaces a
+health signal). `ReconcileSiteAsync`'s XML (lines 530-534) claims the
+no-immediate-re-pull behaviour "match[es] `SiteAuditReconciliationActor`", which
+is only half true: it matches the cursor cadence but diverges by dropping
+`MoreAvailable` entirely, so this actor has neither a continuation pull nor a
+stalled signal — a saturated site lags silently with no observability.
+
+**Recommendation**
+
+- Consume `MoreAvailable`: either continue draining within the same tick while
+  `MoreAvailable` is true (bounded by a max-iterations guard), or — matching the
+  sibling — surface a stalled/non-draining signal when a batch comes back
+  saturated so a stuck site is observable rather than silent.
+- Defend the single-timestamp no-progress edge with a tiebreaker beyond the
+  timestamp (e.g. advance on `(UpdatedAtUtc, TrackedOperationId)` as a composite
+  keyset cursor, and ask the pull for rows strictly after that composite), so a
+  burst sharing one `UpdatedAtUtc` cannot pin the cursor.
+- Correct the `ReconcileSiteAsync` XML (lines 530-534): it claims parity with
+  `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling
+  reads it for stalled detection; this actor ignores it).
+
+Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data
+corruption, and the saturated-single-timestamp input is pathological; the cost
+is wasted re-pulls and an un-drained backlog tail on that one input, plus the
+missing observability.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `ReconcileSiteAsync` now consumes `response.MoreAvailable` via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added.
diff --git a/code-reviews/SiteEventLogging/findings.md b/code-reviews/SiteEventLogging/findings.md
index 972a1efd..b663b7cb 100644
--- a/code-reviews/SiteEventLogging/findings.md
+++ b/code-reviews/SiteEventLogging/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging` |
 | Design doc | `docs/requirements/Component-SiteEventLogging.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 2 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
 
 ## Summary
 
@@ -101,6 +101,36 @@ _Re-review (2026-05-28, `1eb6e97`):_
 | 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
 | 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Re-reviewed the module at commit `4307c381`. All twenty-three prior findings are still
+recorded; twenty-two of their resolutions hold up under inspection — **but one does not**.
+SiteEventLogging-016 (**High**, the `From`/`To` UTC-normalisation defect) is marked
+`Resolved` and its Resolution text claims `EventLogQueryService.ExecuteQuery` now calls
+`.ToUniversalTime()` before `ToString("o")` and that a regression test
+`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added — **neither is true at this
+commit**. `EventLogQueryService.cs:77,83` still stringifies `request.From`/`request.To`
+verbatim with `.ToString("o")` and no UTC normalisation, and a repo-wide search finds no
+test by that name. The claimed -016 fix was never committed; the High-severity defect is
+live. This is re-opened as SiteEventLogging-024 (cross-referencing -016) so the audit trail
+shows the resolution was asserted but never landed. Four new findings were recorded: -024
+(the never-landed -016 fix, High), -025 (synchronous severity validation faults
+fire-and-forget callers, Medium), -026 (purge active-node gate diverges from the query
+singleton's placement, Medium), and -027 (the time-range test masks -024, Low).
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | **-024**: `From`/`To` filters still compare non-UTC-normalised ISO 8601 strings against UTC-stored timestamps — the SiteEventLogging-016 fix was never committed (code at `EventLogQueryService.cs:77,83` is unchanged; the claimed regression test does not exist). |
+| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` remains a simple `Receive`/`Tell` bridge; no new findings. |
+| 3 | Concurrency & thread safety | ☑ | Lock-guarded `WithConnection` pattern is correct; no new findings. |
+| 4 | Error handling & resilience | ☑ | **-025**: `LogEventAsync` validates severity/args by throwing synchronously, bypassing the documented "faults the returned Task" contract for fire-and-forget callers. |
+| 5 | Security | ☑ | Queries fully parameterised; `PageSize` now clamped (-017). No new findings. |
+| 6 | Performance & resource management | ☑ | Bounded write queue (-015) and `PageSize` clamp (-017) in place; no new findings. |
+| 7 | Design-document adherence | ☑ | **-026**: the purge active-node gate (`SelfIsPrimary` = cluster leader) can diverge from the query singleton's placement (oldest member of role), leaving the queried node's DB unpurged / over-cap. |
+| 8 | Code organization & conventions | ☑ | Concrete-recorder DI wiring correct; no new findings. |
+| 9 | Testing coverage | ☑ | **-027**: `Query_FiltersByTimeRange` asserts count only with UTC-only inputs, masking -024; no non-UTC-offset regression test exists. |
+| 10 | Documentation & comments | ☑ | No new documentation findings beyond those above. |
+
 ## Findings
 
 ### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -1144,3 +1174,265 @@ Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or cha
 `stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`.
 `CancellationTokenSource` is the canonical .NET pattern and also lets the test
 cooperate with xUnit's `Task.WhenAll` timeout.
+
+### SiteEventLogging-024 — `From`/`To` filters still compare non-normalised ISO 8601 strings against UTC-stored timestamps (the SiteEventLogging-016 fix was never committed)
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/EventLogQueryService.cs:77,83` |
+
+**Description**
+
+This is a re-open of SiteEventLogging-016. That finding was marked **Resolved**
+(2026-05-28) with a Resolution stating that `EventLogQueryService.ExecuteQuery` now calls
+`.ToUniversalTime()` on `request.From`/`request.To` before `ToString("o")`, that
+`EventLogPurgeService.PurgeByRetention` was made defensive with an explicit
+`.ToUniversalTime()`, and that a regression test
+`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added. **None of that is present at
+commit `4307c381`.** The code at `EventLogQueryService.cs` is byte-for-byte the original
+defective form:
+
+```
+if (request.From.HasValue)
+{
+    whereClauses.Add("timestamp >= $from");
+    parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));   // line 77
+}
+
+if (request.To.HasValue)
+{
+    whereClauses.Add("timestamp <= $to");
+    parameters.Add(new SqliteParameter("$to", request.To.Value.ToString("o")));       // line 83
+}
+```
+
+There is no `.ToUniversalTime()` (or `.UtcDateTime`) on either bound, and a repo-wide
+search for `Query_FiltersByTimeRange_HandlesNonUtcOffset` returns zero matches — the
+claimed regression test does not exist. The -016 resolution was asserted but never landed
+in source, so the original High-severity defect is **live**.
+
+The mechanism is exactly as -016 described. Event rows are persisted with
+`timestamp = DateTimeOffset.UtcNow.ToString("o")` (`SiteEventLogger.cs:202`), which always
+emits the round-trip ISO 8601 form ending in the literal offset `+00:00`
+(e.g. `2026-06-20T12:34:56.7890123+00:00`). `request.From`/`request.To` are
+`DateTimeOffset?`, and `ToString("o")` preserves whatever offset the caller passed. A
+central client in, say, `UTC+05:00` that filters with `DateTimeOffset.Now` produces
+`"2026-06-20T17:34:56.0000000+05:00"`, which under SQLite's default `BINARY` collation
+sorts lexicographically *greater* than the equivalent UTC instant string
+`"2026-06-20T12:34:56.0000000+00:00"`. The `timestamp >= $from` / `timestamp <= $to`
+comparison is then a byte-by-byte string compare, so the query silently includes events
+from the wrong hour or excludes events that genuinely fall in the window. The design
+states "All timestamps are UTC throughout the system", but the central-`DateTimeOffset`
+→ SQLite boundary does not enforce it. A central UI rendered in a non-UTC timezone is the
+most likely trigger, and it corrupts precisely the "show me what happened around the
+failover" time-range query that operators most often run. The retention purge
+(`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")`) is UTC by construction and remains
+safe; only the central query path is vulnerable.
+
+**Recommendation**
+
+Apply the one-line fix on each bound that -016 specified but that was never committed:
+
+```
+parameters.Add(new SqliteParameter("$from", request.From.Value.ToUniversalTime().ToString("o")));
+parameters.Add(new SqliteParameter("$to",   request.To.Value.ToUniversalTime().ToString("o")));
+```
+
+so the produced offset is always `+00:00` and the comparison is lexicographically sound
+against the UTC-stored strings. Add the regression test that -016 claimed but never
+delivered (see SiteEventLogging-027): construct a `From`/`To` carrying a non-zero offset
+(e.g. `+05:00`) and assert the matching UTC-stored events are returned and out-of-range
+ones excluded, asserting on returned row identities rather than just count. Optionally
+store `timestamp` as Unix-epoch `INTEGER` to eliminate the lexicographic-comparison hazard
+structurally. When fixing, also reconcile SiteEventLogging-016's Resolution text with
+reality (its commit reference was `<pending>` and the fix never merged).
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `EventLogQueryService` now calls `.ToUniversalTime()` on the `From`/`To` `DateTimeOffset` bounds before `.ToString("o")`, so the comparison strings are UTC and match the UTC-stored values for any non-UTC client offset. This is the fix that the closed -016 claimed but never committed. Regression test added (verified failing pre-fix).
+
+### SiteEventLogging-025 — `LogEventAsync` severity/argument validation throws synchronously, bypassing the "faults the returned Task" contract for fire-and-forget callers
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/SiteEventLogger.cs:187-199` |
+
+**Description**
+
+`LogEventAsync` is documented as a fire-and-forget enqueue whose returned `Task`
+"completes once the event is durably persisted and faults if the write fails"
+(`SiteEventLogger.cs:24-29`), and the disposed-drop paths were deliberately reworked under
+SiteEventLogging-012 to *fault the Task* rather than throw. But the argument and severity
+validation at the top of the method throws **synchronously** on the caller's thread before
+any `Task` is returned:
+
+```
+ArgumentException.ThrowIfNullOrWhiteSpace(eventType);
+ArgumentException.ThrowIfNullOrWhiteSpace(severity);
+ArgumentException.ThrowIfNullOrWhiteSpace(source);
+ArgumentException.ThrowIfNullOrWhiteSpace(message);
+
+if (!AllowedSeverities.Contains(severity))      // SiteEventLogging-020 closed set
+{
+    throw new ArgumentException(
+        $"Severity '{severity}' is not one of the allowed values: Info, Warning, Error.",
+        nameof(severity));
+}
+```
+
+Every recording caller invokes this fire-and-forget as `_ = _eventLogger.LogEventAsync(...)`
+(no `await`, return discarded), so a thrown `ArgumentException` does **not** flow into the
+discarded `Task` — it propagates straight up the calling actor's message-handling stack.
+For the recorder hot-path callers (the computed `AlarmActor`, the `DataConnectionActor`,
+the `ScriptActor`/`ScriptExecutionActor`) an unhandled synchronous throw would fault the
+actor and trip its supervision strategy, when the documented and intended behaviour for a
+best-effort audit write is that a bad event is dropped/faulted, never that it crashes the
+subsystem doing the logging.
+
+This is **currently latent**: every production call site passes a hard-coded canonical
+`severity` literal (`"Info"`/`"Warning"`/`"Error"`) and non-empty `eventType`/`source`/
+`message`, so the throw path is never reached today. The exposure is a future caller that
+passes a dynamic, computed, or typo'd severity (e.g. `"error"`, `"ERR"`, a value derived
+from an alarm payload) — the SiteEventLogging-020 closed-set check, added defensively, then
+becomes a crash vector for the very actor it is meant to protect. The mismatch is that
+-012 chose Task-faulting for one drop reason (disposal) while validation kept synchronous
+throwing for another (bad input), so the method's failure contract is now inconsistent.
+
+**Recommendation**
+
+Make the failure mode uniform with the rest of the method: return a faulted `Task` for
+validation failures instead of throwing synchronously — e.g.
+`return Task.FromException(new ArgumentException(...))` for the severity check (and,
+consistently, for the null/whitespace guards) — so a fire-and-forget caller is never
+crashed by a bad audit-event payload and an `await`-ing caller still observes the failure.
+Alternatively, accept a `severity` enum on `ISiteEventLogger.LogEventAsync` and convert to
+the canonical string at the boundary, eliminating the runtime-validation throw entirely.
+The exact fix shape is a judgment call — faulted-Task is the minimal-diff change that
+matches the documented contract; the enum is the stronger design but a wider
+interface/caller change. Whichever is chosen, update the XML doc to state the failure
+semantics for invalid input.
+
+**Resolution**
+
+Deferred 2026-06-20: the fix shape (return a faulted Task vs. validate against a severity enum) is a design decision, and the defect is latent (every current caller passes a canonical severity literal). Recorded; no change this pass.
+
+### SiteEventLogging-026 — Purge active-node gate (cluster leader) can diverge from the query singleton's placement (oldest member of role)
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs:116-120`, `src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterNodeProvider.cs:29-39`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:889-895` |
+
+**Description**
+
+SiteEventLogging-019 closed the "purge runs on every node" finding by gating
+`EventLogPurgeService` on an optional `SiteEventLogActiveNodeCheck`. The Host wires that
+check to `IClusterNodeProvider.SelfIsPrimary` (`SiteServiceRegistration.cs:116-120`), which
+returns true only when "this node is `Up` AND is the **cluster leader**"
+(`AkkaClusterNodeProvider.cs:29-39`: `cluster.State.Leader.Equals(cluster.SelfAddress)`).
+
+The query path, however, is served by the `event-log-handler` **cluster singleton**
+(`AkkaHostedService.cs:889-895`), which Akka.NET pins to the **oldest member of the role**,
+not the cluster leader. These are two different election concepts:
+
+- The Akka **cluster leader** is the lowest-address member in the `Up`/`Leaving` state set
+  for the *whole cluster* (role-agnostic) — it is what `SelfIsPrimary` tests.
+- A **cluster singleton** is hosted on the *oldest* member of its configured role
+  (`siteRole`) — it is what owns the event-log query DB.
+
+In a steady-state 2-node site cluster these usually coincide, so the gate works in
+practice. But they are not guaranteed equal — most plausibly during membership churn /
+after a failover, when the oldest-member computation and the leader election can settle on
+different nodes for a window. When they diverge, the query singleton reads node X's
+SQLite DB while the purge runs on node Y. The result is that the node whose DB is actually
+queried (the populated, active one) is left **unpurged** — its 30-day retention sweep and
+1 GB cap-purge do not run — so the queried database can drift over the documented 1 GB cap
+and retain rows past the 30-day window, which is exactly the invariant the storage section
+of Component-SiteEventLogging requires. The SiteEventLogging-004 re-triage note already
+established the correct co-location principle: the *query* singleton and the
+*deployment-manager* singleton share `siteRole` and so co-locate on the oldest member; the
+purge gate keyed on a *different* signal (`SelfIsPrimary`/leader) silently breaks that
+co-location for the purge.
+
+**Recommendation**
+
+Re-gate the purge on the **same** ownership signal as the query singleton so the two never
+diverge — i.e. gate on "this node hosts the `event-log-handler` singleton" (host the purge
+inside that same cluster singleton, or test oldest-member-of-`siteRole` rather than cluster
+leader). Alternatively, drop the purge gate entirely and let every node purge its own local
+DB: the SiteEventLogging-019 and -004 analysis already showed standby purge is harmless
+(the standby receives no writes), so running it on both nodes guarantees the queried DB is
+always purged at the cost of a no-op sweep on the standby. The direction is a **judgment
+call** — re-gating on singleton ownership keeps the "active node only" intent but adds
+coupling to the singleton's lifecycle; dropping the gate is simpler and provably correct
+for the cap/retention invariant but reverts the -019 "active node only" optimisation.
+Whichever is chosen, the gate (`SelfIsPrimary` = leader) and the singleton (oldest member)
+must not be left keyed on two different elections.
+
+**Resolution**
+
+Deferred 2026-06-20: the direction (re-gate purge on singleton ownership vs. drop the gate per the -019 analysis) is a design-owner decision. Divergence between the Akka leader and the oldest-member singleton is a churn-window condition. Recorded; no change this pass.
+
+### SiteEventLogging-027 — Time-range test asserts count only with UTC-only inputs, masking SiteEventLogging-024
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Resolved |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteEventLogging.Tests/EventLogQueryServiceTests.cs:197-214` |
+
+**Description**
+
+`Query_FiltersByTimeRange` is the only test covering the `From`/`To` time-range filter, and
+it cannot catch SiteEventLogging-024 by construction:
+
+```
+var now = DateTimeOffset.UtcNow;                                  // UTC-only input
+InsertEventAt(now.AddHours(-2),   "script", "Info", null, "S1", "Old event");
+InsertEventAt(now.AddMinutes(-30),"script", "Info", null, "S2", "Recent event");
+InsertEventAt(now,                "script", "Info", null, "S3", "Now event");
+
+var response = _queryService.ExecuteQuery(MakeRequest(
+    from: now.AddHours(-1),
+    to:   now.AddMinutes(1)));
+
+Assert.True(response.Success);
+Assert.Equal(2, response.Entries.Count);                          // count only
+```
+
+Both shortcomings hide the live -024 defect:
+
+1. **UTC-only inputs.** `now` is `DateTimeOffset.UtcNow`, so `from`/`to` already carry a
+   `+00:00` offset. `ToString("o")` then produces a correctly-comparable string and the
+   missing `.ToUniversalTime()` is never exercised. The defect only manifests for a
+   `DateTimeOffset` with a *non-zero* offset, which this test never constructs.
+2. **Count-only assertion.** It asserts `response.Entries.Count == 2` but never checks
+   *which* rows came back. A boundary off-by-an-hour bug that swapped one in-range row for
+   an out-of-range row (the -024 symptom) could still yield a count of 2 and pass.
+
+The net effect is a green test suite that gives false confidence the time-range filter is
+correct — and almost certainly contributed to SiteEventLogging-016 being marked Resolved
+when its fix was never committed.
+
+**Recommendation**
+
+Add the non-UTC-offset regression test that SiteEventLogging-016 claimed but never
+delivered (`Query_FiltersByTimeRange_HandlesNonUtcOffset`): construct `From`/`To` as a
+`DateTimeOffset` with a non-zero offset (e.g. `+05:00`) bracketing UTC-stored events, and
+assert on the **identities** of the returned rows (e.g. by `Source`/`Message` or `Id`), not
+just `Count`, so a window shifted by the offset is detected. Strengthen the existing
+`Query_FiltersByTimeRange` likewise to assert which rows are returned. These tests should
+fail against the current code and pass once SiteEventLogging-024 is fixed.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added `Query_FiltersByTimeRange_HandlesNonUtcOffset` — seeds known UTC instants, queries with a +05:00 offset, and asserts the returned row identities (not just count). Verified failing pre-fix, passing post-fix.
diff --git a/code-reviews/SiteRuntime/findings.md b/code-reviews/SiteRuntime/findings.md
index 78087f94..0565a1b7 100644
--- a/code-reviews/SiteRuntime/findings.md
+++ b/code-reviews/SiteRuntime/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` |
 | Design doc | `docs/requirements/Component-SiteRuntime.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 3 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
 
 ## Summary
 
@@ -109,6 +109,40 @@ _Re-review (2026-05-28, `1eb6e97`):_
 
 ## Findings
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+The module was re-reviewed in full at `4307c381` (current HEAD). The diff against the
+prior baseline `1eb6e97` shows as 100% additions because of the ScadaBridge rename
+(the project moved paths), so the whole module was re-read at its current state rather
+than relying on the diff. Health is good: all prior findings 001–026 remain
+Resolved/Deferred with no regressions observed in their fixed call sites
+(SetAttribute DCL routing, the watch+buffer redeploy, the SiteRuntime-020 terminating
+shadow, the OperationTrackingStore read/write split + safe Dispose, the
+AuditingDb `Inner` accessor, invariant-culture numeric parsing, the
+`_attributes` child-snapshot isolation). The new M7 surface (NativeAlarmActor,
+CertStoreActor) and the WaitForAttribute/batch-write additions are generally
+well-built. Five new findings were recorded, all in the native-alarm and
+deployment-lifecycle bookkeeping: an unbounded `_latestAlarmEvents` map on the
+Instance Actor (027, Medium), a phantom-active alarm left behind by
+`NativeAlarmActor.EnforceCap` (028, Medium), a delete-during-pending-redeploy
+that both over-decrements the deployed counter and resurrects the deleted
+instance (029, Medium), missing tests for the cap/retention/replication paths
+(030, Low), and stale site-side persistence of notification-list + SMTP config
+that the design decision moved to central-only (031, Low).
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | EnforceCap leaves phantom-active alarm (028); delete-during-redeploy over-decrements + resurrects (029). HiLo/RoC/edge logic otherwise sound. |
+| 2 | Akka.NET conventions | ✓ | Supervision (Resume coordinators / Stop short-lived) correct; execution actors on ScriptExecutionScheduler; cert broadcast uses PipeTo, captures `sender` before async. Trigger-eval mailbox-block stays Deferred (014). |
+| 3 | Concurrency & thread safety | ✓ | Child-snapshot isolation (017) intact; `_createdConnections` actor-confined via ApplyArtifact… dispatch (021); OperationTrackingStore read/write split (024). No new actor-thread violations. |
+| 4 | Error handling & resilience | ✓ | Best-effort persistence/replication paths log-on-fault; deploy reports Success only post-persist (005). Native source-unavailable retains+marks-uncertain. |
+| 5 | Security | ✓ | Trust verdict delegated to shared ScriptTrustValidator; CertStoreActor path-traversal-guarded; no reflection in scripts; SQL params captured per M4 design (redaction at write time). |
+| 6 | Performance & resource management | ✓ | `_latestAlarmEvents` unbounded growth on native-alarm churn (027). ScriptExecutionScheduler background threads OK; per-call SQLite connections acceptable. |
+| 7 | Design-document adherence | ✓ | Actor hierarchy / native-alarm wiring conform. Stale site persistence of notification-list + SMTP config vs. "central-only, not deployed to sites" (031). |
+| 8 | Code organization & conventions | ✓ | Reflection anti-pattern eliminated (006/022); Options owned by component; additive message evolution honoured. |
+| 9 | Testing coverage | ✓ | Broad new suites (NativeAlarmActor, WaitForAttribute, SetAttribute, ExecutionActor, scheduler). EnforceCap/retention-drop + SiteReplicationActor still untested (030). |
+| 10 | Documentation & comments | ✓ | XML docs accurate and extensive; ReplicationMessages now documented (026). No new stale comments found. |
+
 ### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
 
 | | |
@@ -1370,3 +1404,237 @@ the direction (outbound to peer / inbound from peer) and what is replicated.
 The two pre-existing group-header XML blocks were converted to plain `//`
 comments to avoid orphaned doc-summaries above the first record in each group.
 Marker-base-type idea left out of scope.
+
+### SiteRuntime-027 — `InstanceActor._latestAlarmEvents` grows without bound as native conditions churn
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:67`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:1007` |
+
+**Description**
+
+`HandleAlarmStateChanged` stores the latest enriched `AlarmStateChanged` per
+`changed.AlarmName` into `_latestAlarmEvents` (line 1013) and **never removes
+anything from that dictionary**. For computed alarms the key set is bounded (one
+entry per configured alarm). For **native** alarms, however, the key is the
+per-condition `SourceReference` (every `NativeAlarmActor.Emit` stamps
+`AlarmName = t.SourceReference`). When a native condition fully runs its course —
+`NativeAlarmActor.ApplyLiveTransition` drops it from the mirror once
+`!Active && Acknowledged` (NativeAlarmActor.cs:243), or `EnforceCap` evicts the
+oldest — the Instance Actor still holds the final `AlarmStateChanged` for that
+`SourceReference` forever as a (Normal) entry. There is no message back to the
+Instance Actor to evict it.
+
+On an OPC UA A&C source that mints a fresh `SourceReference` per occurrence
+(conditionId/branchId style references are common), the map accumulates one
+permanently-retained entry per distinct condition the instance has *ever* seen.
+The `MirroredAlarmCapPerSource` cap (default 1000) bounds the
+`NativeAlarmActor._alarms` set but does **not** bound `_latestAlarmEvents`, which
+is on the long-lived Instance Actor. Over weeks of uptime on a busy site this is a
+steadily-growing per-instance memory leak, and `BuildAlarmStatesSnapshot`
+(line 1055, `_latestAlarmEvents.Values.ToList()`) makes every DebugView snapshot
+proportionally larger and slower, returning thousands of stale Normal rows.
+
+**Recommendation**
+
+Evict from `_latestAlarmEvents` when a native condition reaches its terminal
+return-to-normal+dropped state. The cleanest signal is the `AlarmStateChanged`
+the `NativeAlarmActor` already emits on drop-out: have the `NativeAlarmActor`
+stamp a "final/dropped" marker (an additive bool on `AlarmStateChanged`, or a
+dedicated `NativeAlarmDropped(sourceReference)` Tell to the parent), and have
+`HandleAlarmStateChanged` `_latestAlarmEvents.Remove(...)` on it. Alternatively
+cap/prune `_latestAlarmEvents` for native keys mirroring
+`MirroredAlarmCapPerSource`. Computed-alarm keys must stay (they are
+configuration-bounded).
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added an additive `NativeAlarmDropped(SourceReference)` Tell emitted by `NativeAlarmActor` at all terminal-drop sites (snapshot-swap removal, retention drop, cap eviction) after the return-to-normal emit; `InstanceActor.HandleNativeAlarmDropped` removes the native key from `_latestAlarmEvents` (and `_alarmStates`/`_alarmTimestamps`). Computed-alarm keys are never dropped.
+
+### SiteRuntime-028 — `NativeAlarmActor.EnforceCap` drops a condition without a return-to-normal, leaving a phantom Active alarm
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/NativeAlarmActor.cs:281` |
+
+**Description**
+
+`EnforceCap` evicts the oldest mirrored conditions once the per-source cap is
+exceeded: it removes them from `_alarms`, calls `PersistDelete`, and logs the
+eviction — but it does **not** `Emit` a return-to-normal `AlarmStateChanged` for
+the dropped condition. If the evicted condition was still `Active`, the Instance
+Actor's `_latestAlarmEvents` (and, downstream, central's gRPC stream and the
+operator Alarm Summary page) keep showing it as **Active** indefinitely. The
+mirror has silently forgotten the condition, so no later transition can ever clear
+it — a phantom stuck-active alarm.
+
+This is inconsistent with the sibling drop path `ApplySnapshotSwap`
+(NativeAlarmActor.cs:205), which correctly emits `prior.Condition with { Active =
+false }` for every condition that falls out of the new snapshot. The retention
+drop in `ApplyLiveTransition` (line 243) is safe only because it drops *after*
+emitting the condition's own already-inactive state; `EnforceCap` drops a
+condition whose last-emitted state may still be Active, with no compensating emit.
+
+The design doc explicitly states the cap eviction is "logged — there is no silent
+truncation," but logging alone does not reconcile the operator-visible state: the
+alarm view is silently wrong.
+
+**Recommendation**
+
+In `EnforceCap`, before removing each overflow condition, `Emit(a, a.Condition
+with { Active = false })` (mirroring `ApplySnapshotSwap`) so the eviction produces
+a return-to-normal on the stream and clears the phantom. Add a test asserting that
+exceeding `MirroredAlarmCapPerSource` with an active oldest condition emits a
+Normal `AlarmStateChanged` for the evicted `SourceReference`.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `NativeAlarmActor.EnforceCap` now emits `Emit(evicted, evicted.Condition with { Active = false })` for a still-active evicted condition before removing it, clearing the phantom stuck-Active alarm on the stream/UI. Cap-eviction test added.
+
+### SiteRuntime-029 — A delete (or disable) arriving during a pending redeploy over-decrements the counter and is undone by `HandleTerminated`
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:627`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:402` |
+
+**Description**
+
+`HandleDelete` and `HandleDisable` only consult `_instanceActors` to find the live
+actor; neither consults the SiteRuntime-020 `_terminatingActorsByName` /
+`_pendingRedeploys` bookkeeping. When a redeployment is in flight, `HandleDeploy`
+has already removed the instance from `_instanceActors`, stopped the predecessor,
+and buffered a `PendingRedeploy` keyed by the terminating ref. If a
+`DeleteInstanceCommand` then arrives for that same instance *before* the
+`Terminated` signal fires:
+
+1. `_instanceActors.TryGetValue` misses (the entry was removed at redeploy time),
+   so no actor is stopped.
+2. `_totalDeployedCount = Math.Max(0, _totalDeployedCount - 1)` runs anyway —
+   over-decrementing, because the redeploy path did not increment for an update
+   (the count had already been adjusted), so the deployed/disabled counts reported
+   to the health collector drift.
+3. `RemoveDeployedConfigAsync` deletes the SQLite row and the deleter is told
+   success.
+4. The buffered `_pendingRedeploys` entry is **untouched**, so when `Terminated`
+   fires, `HandleTerminated` calls `ApplyDeployment(..., isRedeploy: true)`, which
+   re-creates the Instance Actor and re-writes the deployed config to SQLite —
+   **resurrecting the instance the operator just deleted**, with the counter now
+   inconsistent.
+
+`HandleDisable` has the milder version: the disable persists `is_enabled = false`,
+but `HandleTerminated` then re-stores the config with `isEnabled: true` and
+re-creates the actor, so a disable issued mid-redeploy is silently reverted to
+enabled.
+
+The window is the redeploy-termination interval — small, but reliably hit when
+central issues a delete/disable immediately after a deploy (e.g. an operator
+correcting a mistaken deploy, or an automated teardown), exactly the kind of rapid
+command sequence SiteRuntime-020 was filed to harden.
+
+**Recommendation**
+
+Make `HandleDelete`/`HandleDisable` authoritative over the mid-redeploy state:
+before falling through, check `_terminatingActorsByName`. On a hit, drop the
+buffered `_pendingRedeploys` entry (so `HandleTerminated` does not resurrect),
+clear the shadow, and for delete tell the buffered redeploy's `OriginalSender` it
+was superseded — mirroring the last-write-wins handling already in `HandleDeploy`.
+Only decrement `_totalDeployedCount` when an instance was actually present
+(in `_instanceActors` **or** terminating). Add a regression test:
+deploy → redeploy → delete-before-Terminated asserts the instance stays deleted
+and the counter is correct.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `HandleDelete`/`HandleDisable` now check `_terminatingActorsByName` before the `_instanceActors` fall-through; on a hit they drop the buffered `_pendingRedeploys` entry and clear the shadow so `HandleTerminated` can't resurrect the instance, and `_totalDeployedCount` is decremented only when an instance was actually present. Delete-during-redeploy race test added (verified failing pre-fix).
+
+### SiteRuntime-030 — Native-alarm cap/retention-drop and `SiteReplicationActor` paths are untested
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Resolved |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/NativeAlarmActorTests.cs`, `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/` |
+
+**Description**
+
+`NativeAlarmActorTests` covers subscribe, raise, snapshot-swap return-to-normal,
+out-of-order rejection, ack, and site-event emission, but there is **no** test for
+`EnforceCap` (the per-source cap eviction) or for the `ApplyLiveTransition`
+retention drop (`!Active && Acknowledged`). Both are non-trivial state
+transitions, and the cap path harbours an observable defect (SiteRuntime-028) that
+a targeted test would have caught. `SiteReplicationActor` remains entirely
+untested — the carried-forward gap that SiteRuntime-016 explicitly deferred to a
+clustered-ActorSystem harness, still outstanding at this commit (the actor calls
+`Cluster.Get(Context.System)` in its constructor, so it needs a clustered HOCON
+test host). The outbound forward, inbound apply, and `SendToPeer` no-peer-drop
+behaviour are unverified.
+
+**Recommendation**
+
+Add `NativeAlarmActor` tests for (a) cap eviction emits a return-to-normal for an
+evicted active condition (pairs with SiteRuntime-028) and (b) a resolved condition
+(inactive+acked) drops out and deletes its SQLite row. Stand up the clustered
+test harness SiteRuntime-016 called for and cover `SiteReplicationActor`'s
+outbound/inbound/peer-tracking paths.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): native-alarm cap-eviction and retention-drop tests added (pairs with -027/-028). The clustered `SiteReplicationActor` test harness remains deferred (needs a clustered ActorSystem host, consistent with the prior -016 deferral).
+
+### SiteRuntime-031 — Site still persists notification-list and SMTP config that the design moved to central-only
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:90`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:105`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1383`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1417` |
+
+**Description**
+
+The design decision (CLAUDE.md "External Integrations" / Component-NotificationService;
+echoed in Component-SiteRuntime.md "System-Wide Artifact Handling": *"Notification
+lists and SMTP configuration are central-only and are not deployed to sites"*) makes
+notification delivery central-only — sites store-and-forward to central and never
+talk to SMTP. Yet SiteRuntime still:
+
+- creates the `notification_lists` and `smtp_configurations` SQLite tables
+  (`SiteStorageService.InitializeAsync`),
+- writes them from `HandleDeployArtifacts` (`StoreNotificationListAsync` /
+  `StoreSmtpConfigurationAsync`, DeploymentManagerActor.cs:1383/1417) and the
+  replication path `SiteReplicationActor.HandleApplyArtifacts`, and
+- the `DeployArtifactsCommand` contract still carries `NotificationLists` /
+  `SmtpConfigurations`.
+
+The SMTP path persists the SMTP `password` field (SiteStorageService.cs:693) into
+plaintext site SQLite. If central no longer populates these (per the design), this
+is dead code carrying a latent secret-at-rest footprint; if central still does, the
+site is storing SMTP credentials it must never use — both contradict the
+central-only delivery decision. Either the code/tables are stale and should be
+removed, or the design doc is stale and the decision needs re-stating. This straddles
+the SiteRuntime/NotificationService boundary and the shared `DeployArtifactsCommand`
+contract, so the direction is a design-owner call rather than a clean in-module fix.
+
+**Recommendation**
+
+Confirm with the design owner whether central still ships notification-list / SMTP
+artifacts to sites. If not (the stated decision), remove the `notification_lists`
+and `smtp_configurations` tables, the two `Store…Async` writers, the replication
+branches, and the corresponding `DeployArtifactsCommand` fields (a coordinated
+cross-module change). If the decision has changed, update Component-SiteRuntime.md
+and Component-NotificationService.md and treat the plaintext SMTP password as a
+secret-handling finding in its own right.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the site no longer persists notification-list / SMTP config and purges any already-persisted rows (incl. the plaintext SMTP password) on both apply paths (`HandleDeployArtifacts`, replication `HandleApplyArtifacts`), inside the all-or-nothing apply. Paired with DeploymentManager-025 (central stops shipping). Tables retained but kept empty.
diff --git a/code-reviews/StoreAndForward/findings.md b/code-reviews/StoreAndForward/findings.md
index 53728126..8f6f8f6f 100644
--- a/code-reviews/StoreAndForward/findings.md
+++ b/code-reviews/StoreAndForward/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward` |
 | Design doc | `docs/requirements/Component-StoreAndForward.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 0 (3 Deferred: 002, 011, 012; all 5 Open from Re-review 2026-05-28 resolved 2026-05-28) |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) |
 
 ## Summary
 
@@ -110,6 +110,39 @@ mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
 `_replication` after `StopAsync` returns; downstream resources disposed by the host
 shutdown sequence (the DI container) can then NRE through the still-running sweep.
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Full re-review against commit `4307c381` with the same 10-category checklist. All prior
+fixes are intact: the StoreAndForward-024 stop-wait (the captured `_sweepTask` +
+bounded `StopAsync` wait), the StoreAndForward-020 capture-before-requeue, the
+StoreAndForward-005 conditional `UpdateMessageIfStatusAsync` writes, and the
+StoreAndForward-023 site-id sentinel all remain present and correct, and the new WP-14
+in-process `_bufferedCount` queue-depth counter shows **no drift** — every Pending-population
+transition (`BufferAsync` +1, retry-remove −1, conditional Pending→Parked −1, operator
+requeue +1) adjusts it exactly once and only when the underlying conditional storage write
+wins. The three findings this pass surfaces are not core-delivery defects: a **gauge-stop
+resource leak** (025 — the process-global queue-depth provider slot is never cleared on
+`StopAsync`, so a stopped service's frozen depth is reported indefinitely and the dead
+instance is pinned via the closure), plus **two test/doc items** — the gauge stop/lifecycle
+and the `EmitSiteEvent` retry / retry-delivered branches are untested (026), and the
+per-retry-per-sweep `"Retried"` site event is a spec-conformant design-vs-scale tension
+worth flagging (027).
+
+## Checklist coverage — Re-review 2026-06-20
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | No new defects — `_bufferedCount` adjusts once per Pending transition, gated on the winning conditional write; retry-count/park semantics (003/005) intact. |
+| 2 | Akka.NET conventions | ☑ | No issues found — `ParkedMessageHandlerActor` `PipeTo` success/failure projections preserved (007). |
+| 3 | Concurrency & thread safety | ☑ | No new findings — StopAsync now awaits the captured `_sweepTask` (024); requeue captures the row up front (020); conditional writes guard sweep-vs-operator (005). |
+| 4 | Error handling & resilience | ☑ | No new findings — replication wired on all delivery + operator paths (001/016); notification parking documented (019). |
+| 5 | Security | ☑ | No issues found — parameterised SQL throughout; payload JSON opaque; no secret material handled. |
+| 6 | Performance & resource management | ☑ | Queue-depth gauge provider registered into a process-global slot is never cleared on `StopAsync` — frozen reading + dead-instance pin for the process lifetime (025). |
+| 7 | Design-document adherence | ☑ | Per-retry `"Retried"` site event per-row-per-sweep is spec-conformant (`Component-StoreAndForward.md:139`) but a flood risk against the unbounded buffer + capped site log — flagged as a judgment call (027). |
+| 8 | Code organization & conventions | ☑ | No new findings — options/POCO placement unchanged from prior passes (011/012 still Deferred to Commons-owning changes). |
+| 9 | Testing coverage | ☑ | Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested (026). |
+| 10 | Documentation & comments | ☑ | No new findings — the `_bufferedCount` and StopAsync XML docs accurately describe the implemented behaviour. |
+
 ## Checklist coverage — Re-review 2026-05-28
 
 | # | Category | Examined | Notes |
@@ -1520,3 +1553,186 @@ mid-sweep and asserts no further storage activity occurs after `StopAsync` retur
 
 _Unresolved._
 
+### StoreAndForward-025 — Queue-depth gauge provider is never cleared on `StopAsync`, leaking a frozen reading and pinning the dead service
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:347`–`:351`, `:385`–`:425`; `src/ZB.MOM.WW.ScadaBridge.Commons/Observability/ScadaBridgeTelemetry.cs:55`–`:60`, `:85`–`:93` |
+
+**Description**
+
+The WP-14 queue-depth gauge is backed by a **process-global** provider slot.
+`StoreAndForwardService.StartAsync` seeds the in-process `_bufferedCount`, sets a
+one-time instance guard (`_queueDepthProviderRegistered`, line 347), and registers a
+closure that reads this instance's counter into the global slot
+(`ScadaBridgeTelemetry.SetQueueDepthProvider(() => Interlocked.Read(ref _bufferedCount))`,
+line 350). The `scadabridge.store_and_forward.queue.depth` `ObservableGauge`
+(`ScadaBridgeTelemetry.cs:56`–`:60`) invokes that provider on every collector scrape:
+`Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L`.
+
+`SetQueueDepthProvider` (`ScadaBridgeTelemetry.cs:85`–`:93`) is the only mutator of the
+slot — there is **no** clear/reset method anywhere in the codebase (a repository-wide
+search for `ClearQueueDepthProvider` / writes to `_queueDepthProvider` finds only the
+registration). `StopAsync` (`StoreAndForwardService.cs:385`–`:425`) disposes the retry
+timer and (StoreAndForward-024) awaits the in-flight sweep, but it never touches the
+gauge: the global slot still points at the stopped instance's closure, and
+`_queueDepthProviderRegistered` is left at 1.
+
+Two consequences after a graceful stop:
+
+1. **Frozen reading.** The gauge keeps reporting the dead service's last `_bufferedCount`
+   value forever — a stopped (or failed-over-away) site node reports a stale, non-zero
+   queue depth that no longer corresponds to any live buffer, which is misleading on a
+   dashboard / alert built on the metric.
+2. **Instance pin (resource leak).** The captured closure holds a reference to the dead
+   `StoreAndForwardService`, so the stopped instance (and everything it transitively
+   roots) cannot be garbage-collected for the process lifetime. This is the classic
+   "static event / static delegate slot holding a finished object" leak. It also means a
+   later instance's `StartAsync` silently stomps the global slot (last-writer-wins), so
+   the gauge's identity is unmanaged.
+
+The instance guard's own XML doc (lines 156–164) already acknowledges the slot is
+process-global and "the last `StartAsync` wins the global slot" — but nothing on the stop
+path releases it, so a clean stop without an immediately-following start leaves a dangling
+provider.
+
+**Recommendation**
+
+Give `ScadaBridgeTelemetry` an identity-checked clear and call it from `StopAsync`. Add a
+compare-and-clear `ClearQueueDepthProvider(Func<long> expected)` that only nulls the slot
+when the current provider is reference-equal to `expected` (so a later instance that
+already re-registered is not stomped):
+
+```csharp
+public static void ClearQueueDepthProvider(Func<long> expected)
+{
+    Interlocked.CompareExchange(ref _queueDepthProvider, null, expected);
+}
+```
+
+In `StoreAndForwardService`, hold the registered provider delegate in a field at
+`StartAsync` time, and in `StopAsync` (only when this instance actually registered —
+`_queueDepthProviderRegistered == 1`) call
+`ScadaBridgeTelemetry.ClearQueueDepthProvider(_registeredProvider)` and reset
+`_queueDepthProviderRegistered` to 0 via `Interlocked.Exchange` so a subsequent restart
+re-seeds and re-registers cleanly. Add a regression test (extend `QueueDepthGaugeTests`)
+asserting that after `StopAsync` the gauge reports 0 (provider cleared) and that a second
+instance's registration is not clobbered by the first instance's stop.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added an identity-checked `ScadaBridgeTelemetry.ClearQueueDepthProvider(provider)` (compare-and-clear via `Interlocked.CompareExchange`, so a newer instance's provider is never stomped) and call it in `StoreAndForwardService.StopAsync` (with `_queueDepthProviderRegistered` reset). The gauge no longer reports a frozen depth or pin the dead service after a graceful stop. Tests added (incl. the late-stop-doesn't-clobber-successor case).
+
+### StoreAndForward-026 — Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Deferred |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/QueueDepthGaugeTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/StoreAndForwardSiteEventTests.cs`; `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:295`–`:308`, `:747` |
+
+**Description**
+
+Two behaviours exercised by the current code have no test coverage:
+
+1. **Gauge stop / lifecycle.** `QueueDepthGaugeTests` covers the live counter across
+   enqueue / drain / park / requeue (`Gauge_TracksBufferedDepth_AcrossEnqueueDrainAndPark`),
+   the start-up seed from existing Pending rows (`Gauge_SeedsFromExistingPendingRows_OnStart`),
+   and the additive-seed race (`Gauge_SeedAddsToConcurrentPreSeedIncrement_NotClobber`).
+   It calls `StopAsync` only as test teardown (`DisposeAsync`) with no assertion — so the
+   gauge's stop behaviour is unverified. This is the same coverage gap that would have
+   caught StoreAndForward-025: no test asserts what the gauge reports after a service stops
+   (today: a stale frozen value; after the 025 fix: 0). The gauge's behaviour across a
+   stop, and across a second instance starting after a first stops, is entirely untested.
+
+2. **`EmitSiteEvent` retry / retry-delivered branches.** `StoreAndForwardSiteEventTests`
+   covers buffer-for-retry (`"Queued"`), park (`"Parked"` for both notification and
+   cached-call categories), routine-enqueue-no-event, and immediate-delivered-no-event.
+   But two `EmitSiteEvent` branches that the retry sweep drives are uncovered:
+   - the per-retry `"Retried"` action (`StoreAndForwardService.cs:747`), which maps to a
+     `store_and_forward` site event at `Warning` severity (the `_ => "Warning"` arm of the
+     severity switch, lines 298–303);
+   - the retry-loop `"Delivered"` action whose detail is `"Delivered to … after N retries"`
+     (line 644–645), which — because the detail does **not** start with `"Immediate"` —
+     is **not** short-circuited at line 268 and instead emits a `store_and_forward` event
+     at `Info` severity (the `"Delivered" => "Info"` arm, line 301). This "recovery
+     recorded" path is the only one that produces an `Info`-severity S&F site event, and
+     nothing asserts it fires (or that the immediate-delivered path is correctly suppressed
+     by contrast).
+
+**Recommendation**
+
+Add a gauge-stop test to `QueueDepthGaugeTests`: enqueue to a known non-zero depth, call
+`StopAsync`, and assert the gauge reports the cleared value (0 after the StoreAndForward-025
+fix) — and, ideally, that a second `StoreAndForwardService` started afterward owns the slot
+without being clobbered by the first instance's stop. Add two `StoreAndForwardSiteEventTests`
+cases: (a) a transient handler with `MaxRetries > 1` driven through one sweep, asserting a
+`store_and_forward` / `Warning` event whose message contains `"retried"`; and (b) a handler
+that fails on the immediate attempt then succeeds on the sweep, asserting a `store_and_forward`
+/ `Info` event whose message contains `"delivered"` (recording the recovery), distinct from
+the suppressed immediate-delivery path.
+
+**Resolution**
+
+Deferred 2026-06-20: additional gauge-lifecycle and `EmitSiteEvent` retry-branch tests are low-value coverage gaps; recorded for a follow-up.
+
+### StoreAndForward-027 — Per-message `"Retried"` site event fires per-row-per-sweep, risking a flood of the capped site event log
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:734`–`:759`, `:295`–`:308`; `docs/requirements/Component-StoreAndForward.md:72`, `:139` |
+
+**Description**
+
+When `_siteEventLogger` is wired, `EmitSiteEvent` is subscribed to `OnActivity`, so every
+`RaiseActivity("Retried", …)` call in `RetryMessageAsync`'s transient-not-yet-max branch
+(`StoreAndForwardService.cs:747`–`:748`) produces one `store_and_forward` site event at
+`Warning` severity (lines 295–308). That branch fires **once per still-pending row per
+retry sweep**. Combined with the design's deliberate "**no maximum buffer size**"
+(`Component-StoreAndForward.md:72`), a sustained outage can buffer a large Pending
+population, and each fixed-interval sweep then emits a `"Retried"` event for every one of
+those rows — `(pending rows) × (sweeps)` events. The site event log is **capped** (default
+1 GB, 30-day retention, with oldest-first purge when the cap is hit before the retention
+window — `Component-SiteEventLogging.md:45`–`:46`), so a large, long-lived backlog can
+churn the log and evict genuinely useful operational history (deployments, connection
+events, alarms) well before the 30-day window.
+
+**This is a judgment call, not a clear bug.** The behaviour is **spec-conformant**:
+`Component-StoreAndForward.md:139` explicitly lists "retried" as a logged store-and-forward
+activity ("Logs store-and-forward activity (queued, delivered, retried, parked)"), so the
+code is doing exactly what the design doc says. The tension is between that
+per-retry-logging contract and the combination of an unbounded buffer + a capped site log:
+at scale the "retried" stream can dominate the log. The site-event-log cap bounds the
+absolute disk risk (it cannot exhaust disk), so the residual concern is **eviction of
+other history**, not unbounded growth — hence Low and flagged as a design-vs-scale
+decision rather than a defect.
+
+**Recommendation**
+
+Reconcile the design-vs-scale tension deliberately; this is a judgment call for the owners,
+not a forced fix. Options, in rough order of preference:
+
+1. **Drop or coarsen the per-retry event.** Stop emitting a `store_and_forward` event for
+   the routine `"Retried"` action (keep `"Queued"`, `"Parked"`, and the retry-recovery
+   `"Delivered"`), since the per-attempt detail is already available via the
+   `OnActivity`/health/telemetry surfaces and the central audit log. The first-buffer and
+   the terminal park/deliver events are the operationally interesting ones.
+2. **Sample / threshold it.** Emit a `"Retried"` event only on the first retry, on a count
+   boundary (e.g. every Nth retry), or once per sweep as a rolled-up summary
+   ("retried N messages this sweep") rather than one per row.
+3. **Document the volume risk.** If per-retry logging is intentionally retained, add a note
+   to `Component-StoreAndForward.md` / `Component-SiteEventLogging.md` that a large buffered
+   backlog can dominate the capped site event log and evict other history, so operators
+   size the cap / retention accordingly.
+
+**Resolution**
+
+Deferred 2026-06-20: the per-retry `"Retried"` site-event volume is spec-conformant (Component-StoreAndForward.md lists 'retried' as a logged activity) and the site log is capped, so the residual concern is history eviction, not disk exhaustion. Whether to drop/sample/threshold the event vs. document the volume is a design-vs-scale decision for the owner.
+
diff --git a/code-reviews/TemplateEngine/findings.md b/code-reviews/TemplateEngine/findings.md
index 786f36b4..de52e5c4 100644
--- a/code-reviews/TemplateEngine/findings.md
+++ b/code-reviews/TemplateEngine/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine` |
 | Design doc | `docs/requirements/Component-TemplateEngine.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 1 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |
 
 ## Summary
 
@@ -87,6 +87,45 @@ the duplicate-Id / null-sentinel fix from the last batch, and audit-write
 ordering inconsistency between `TemplateService` (logs then saves) and
 `InstanceService` (saves then logs).
 
+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Re-reviewed the whole module against all ten checklist categories at commit
+`4307c381`. The large delta since `1eb6e97` (722 commits, including the
+`ScadaLink → ZB.MOM.WW.ScadaBridge` rename, the Script Analysis #25 extraction,
+native-alarm-source authoring/flatten, the read-only inheritance-resolve service,
+List-attribute semantics, and Transport line-diff) holds up well. All 22 prior
+findings remain closed and verified: the deploy gate now delegates to the
+authoritative `ScriptTrustValidator` + `RoslynScriptCompiler` (real semantic
+compile, no residual substring scan or hand-rolled brace counter in the security
+path — `ScriptCompiler` and `ValidationService.CheckExpressionSyntax` both call
+the shared analyzer); the recursive composition walk, per-slot alarm override,
+duplicate-Id-tolerant `BuildLookup`, `int?`-parent walk, and the revision-hash /
+diff `Description`+`Connections`+`ElementDataType` coverage are all in place and
+test-guarded. The new `TemplateInheritanceResolver` (M9-T26a) mirrors the
+flattener's inheritance precedence exactly (including HiLo per-setpoint merge),
+verified by `Resolve_AgreesWithFlatteningService_*` tests. Three new findings
+surfaced — all lock/audit asymmetries where a newer feature did not pick up an
+invariant an older sibling already enforces: native-alarm-source instance
+overrides bypass the lock at flatten time (the unfixed-for-native sibling of the
+closed TemplateEngine-008; `ResolvedNativeAlarmSource` carries no `IsLocked`
+flag), the four `TemplateFolderService` mutators stage their audit row *after*
+the final `SaveChangesAsync` so the row is never persisted, and a dead/misnamed
+`SemanticValidator.ParseParameterDefinitions` legacy branch returns types not
+names.
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | New: native-source instance override applies with no lock check (TemplateEngine-023). Flattening inherit/compose/override precedence + recursion verified correct for attrs/alarms/scripts/native-sources; HiLo per-setpoint merge consistent with resolver. |
+| 2 | Akka.NET conventions | ✓ | No actors. `AddTemplateEngineActors` is still an empty placeholder. Nothing to assess. |
+| 3 | Concurrency & thread safety | ✓ | Services stateless/scoped; static helpers hold no mutable state. No new findings. |
+| 4 | Error handling & resilience | ✓ | `Result<T>` used consistently; `Flatten` wraps in try/catch; JSON parse failures fall back safely. No store-and-forward/failover surface. |
+| 5 | Security | ✓ | Deploy gate is now authoritative via Script Analysis #25 (`ScriptTrustValidator.FindViolations` + `RoslynScriptCompiler.Compile`) — alias/`using static`/`global::` bypasses caught by symbol resolution; no residual substring scan in the security path. No new findings. |
+| 6 | Performance & resource management | ✓ | TemplateEngine-009 N+1 fix holds (`Compositions` navigation reused). `JsonDocument`/streams disposed. No new leaks. |
+| 7 | Design-document adherence | ✓ | Native-alarm-source lock rule ("Lock applies to the entire source") not enforced in flattening (TemplateEngine-023). Otherwise the doc matches: revision-hash/diff payload coverage, semantic validation set, acyclicity, derived-naming all align. |
+| 8 | Code organization & conventions | ✓ | New: `TemplateFolderService` Create/Rename/Move/Reorder log after the final save → audit row never persisted (TemplateEngine-024), asymmetric with `TemplateService` (which always saves after `LogAsync`). |
+| 9 | Testing coverage | ✓ | Strong overall (inheritance-resolver agrees-with-flattener tests, hash determinism test). Gaps: no test for a locked native source rejecting an instance override; folder tests mock `IAuditService` so the never-saved audit row is invisible; `ParseParameterDefinitions` legacy branch untested. |
+| 10 | Documentation & comments | ✓ | New: `ParseParameterDefinitions` is documented as returning parameter NAMES but its legacy-array branch returns the type field (TemplateEngine-025). XML docs otherwise accurate and thorough. |
+
 ## Checklist coverage
 
 _Re-review (2026-05-17, `39d737e`):_
@@ -1191,3 +1230,161 @@ false rejected, false→true / true→true / false→false accepted) and
 `TemplateServiceTests.UpdateAttribute_LockedInDerivedDowngrade_OnBase_Rejected`
 (end-to-end on the attribute update path).
 
+### TemplateEngine-023 — Instance native-alarm-source override bypasses the source lock during flattening
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs:775`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Flattening/FlattenedConfiguration.cs:137` |
+
+**Description**
+
+`ApplyInstanceNativeAlarmSourceOverrides` applies a per-instance native-alarm-source
+override unconditionally — it looks the binding up by canonical name and, if found,
+overlays the override's non-null fields with **no lock check**:
+
+```csharp
+foreach (var ovr in overrides)
+{
+    if (!sources.TryGetValue(ovr.SourceCanonicalName, out var existing))
+        continue; // Cannot add new bindings via overrides.
+
+    sources[ovr.SourceCanonicalName] = existing with { ConnectionName = ..., ... };
+}
+```
+
+Contrast `ApplyInstanceOverrides` (attributes, `:309` — `if (existing.IsLocked) continue;`)
+and `ApplyInstanceAlarmOverrides` (alarms, `:337` — `if (existing.IsLocked) continue;`).
+The design (`Component-TemplateEngine.md` Override Granularity, Native alarm sources:
+*"Lock applies to the entire source"*) requires the same treatment. The root cause is
+in Commons: `ResolvedNativeAlarmSource` (`FlattenedConfiguration.cs:137`) carries **no
+`IsLocked` field**, so the flattener has nothing to gate on even if it wanted to —
+unlike `ResolvedAttribute`/`ResolvedAlarm`, which both expose `IsLocked`. The
+inheritance-side `ResolveInheritedNativeAlarmSources` does track a base lock (via the
+`lockedNames` set) so a *derived template* cannot override a base-locked source, but
+that lock state is dropped on the way out and the *instance* layer is unguarded.
+
+This is the same defect class as the closed **TemplateEngine-008** (which fixed the
+identical instance-override lock-bypass for alarms), never applied to native alarm
+sources because they were introduced after that finding. The companion gap lives in
+`ManagementService.HandleSetInstanceNativeAlarmSourceOverride`
+(`ManagementActor.cs:868`), which — unlike `InstanceService.SetAlarmOverrideAsync` /
+`SetAttributeOverrideAsync` — performs neither an existence check (an override for an
+unknown source name is silently persisted as a dead record) nor a lock check before
+writing the override row straight through the repository, bypassing `InstanceService`
+entirely. Blast radius is limited (native alarms are a read-only mirror — no write-back
+to the source), but the documented lock invariant is not enforced anywhere on this path.
+
+**Recommendation**
+
+Add an `IsLocked` field to `ResolvedNativeAlarmSource` (Commons) and carry it through
+`ResolveInheritedNativeAlarmSources` / `ResolveComposedNativeAlarmSources`; then gate
+`ApplyInstanceNativeAlarmSourceOverrides` on `if (existing.IsLocked) continue;`,
+mirroring the attribute/alarm paths. Add the matching existence + lock guard to
+`HandleSetInstanceNativeAlarmSourceOverride` (resolve the effective native-source set
+via `TemplateResolver`/flattening as `SetAlarmOverrideAsync` does). Add a flattening
+regression test (`Flatten_LockedNativeSource_InstanceOverrideIgnored`) and a management
+test rejecting an override of a locked / unknown source.
+
+**Resolution**
+
+Deferred 2026-06-20: the native-alarm-source override lock-bypass fix spans Commons (`ResolvedNativeAlarmSource` has no `IsLocked` field to gate on) and ManagementService (the override handler skips the existence/lock checks), and the blast radius is limited (native alarms are read-only mirrors, no write-back). Recorded as a cross-module follow-up.
+
+### TemplateEngine-024 — `TemplateFolderService` mutators stage their audit row after the final save, so it is never persisted
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Code organization & conventions |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:58`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:91`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:154`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:204` |
+
+**Description**
+
+`IAuditService.LogAsync` only *stages* the audit row in the EF change tracker — the
+implementation (`ConfigurationDatabase/Services/AuditService.cs:61`) ends with
+`AddAsync(entry)` and explicitly documents *"caller is responsible for calling
+SaveChangesAsync to ensure atomicity."* Four of the five `TemplateFolderService`
+mutators call `LogAsync` **after** their final `SaveChangesAsync` and never save again:
+
+- `CreateFolderAsync` (`:56-58`): `AddFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save).
+- `RenameFolderAsync` (`:89-91`): `UpdateFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save).
+- `MoveFolderAsync` (`:152-154`): `UpdateFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save).
+- `ReorderFolderAsync` (`:201-204`): `Update`×2 → `SaveChangesAsync` → `LogAsync` (no save).
+
+`ManagementActor.ProcessCommand` (`ManagementActor.cs:130`) wraps each command in a DI
+scope, dispatches, and disposes the scope with no outer `SaveChangesAsync`; EF Core does
+**not** flush on dispose. The folder operation itself persists (it was saved *before* the
+`LogAsync`), but the staged `TemplateFolder` audit row is silently discarded when the
+scoped `DbContext` is disposed. Every folder Create / Rename / Move / Reorder therefore
+leaves **no audit trail**, even though `_auditService.LogAsync(...)` is called and reads
+as if it does. (`DeleteFolderAsync` has the same shape but the lost row is the *Delete*
+record specifically.) This is asymmetric with `TemplateService`, where every mutator
+ends with `LogAsync` → `SaveChangesAsync` (e.g. `:225-226`). The folder tests mock
+`IAuditService` and the repository, so they never exercise the real change-tracker
+lifecycle and the gap is invisible to them.
+
+**Recommendation**
+
+Add a final `await _repository.SaveChangesAsync(cancellationToken);` after the `LogAsync`
+in all five folder mutators (matching `TemplateService`), or — cleaner — reorder each to
+mutate → log → single `SaveChangesAsync` so the entity change and its audit row commit
+atomically. Add a test that uses a real (in-memory) `AuditService` + `DbContext` and
+asserts an `AuditLogEntry` row exists after a folder create/rename/move/reorder.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added `await _repository.SaveChangesAsync(...)` after `LogAsync` in all five `TemplateFolderService` mutators (Create/Rename/Move/Reorder/Delete), so folder-mutation audit rows are now persisted (previously staged then discarded with the scope). Matches the `TemplateService` two-save pattern. Tests added.
+
+### TemplateEngine-025 — `SemanticValidator.ParseParameterDefinitions` legacy-array branch returns types, not names; method is dead production code
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Validation/SemanticValidator.cs:573` |
+
+**Description**
+
+`ParseParameterDefinitions` is documented and named as returning the declared parameter
+**names**:
+
+```csharp
+/// ... returns the declared parameter names.
+internal static List<string> ParseParameterDefinitions(string? parameterDefinitionsJson)
+```
+
+Its JSON-Schema branch correctly returns `props.EnumerateObject().Select(p => p.Name)`,
+but its legacy flat-array branch returns the `type` field:
+
+```csharp
+else if (doc.RootElement.ValueKind == JsonValueKind.Array)
+{
+    return doc.RootElement.EnumerateArray()
+        .Select(e => e.TryGetProperty("type", out var t) ? t.GetString() ?? "unknown" : "unknown")
+        .ToList();
+}
+```
+
+This is a copy-paste from the sibling `ParseParameterTypes` (`:491`, which legitimately
+returns types) — so for legacy `[{name,type}]` definitions this method returns the wrong
+field (types instead of names). The method has **no non-test caller** in `src/` (only its
+own `SemanticValidatorTests`), and the test exercises only the JSON-Schema branch, so the
+buggy legacy branch is both dead and untested. The live param-count/type validation path
+uses `ParseParameterTypes`, not this method, so there is no behavioural impact today — but
+the method is `internal` (callable from the rest of the module) and silently returns
+mislabelled data for one input shape.
+
+**Recommendation**
+
+Delete `ParseParameterDefinitions` (and its tests) as dead code, or — if it is meant to
+back a future feature — fix the legacy branch to return `e.TryGetProperty("name", ...)`
+and add a legacy-array test asserting names are returned.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): deleted the dead, misnamed `SemanticValidator.ParseParameterDefinitions` (no production caller per grep; its only references were two tests, which asserted the buggy type-not-name output). Tests removed.
+