docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.AuditLog` |
 | Design doc | `docs/requirements/Component-AuditLog.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |
 ## Summary
@@ -59,6 +59,231 @@ chain doesn't reject a central composition root that mistakenly calls the site b
 ## Findings
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 Since `1eb6e97` the module was renamed (`ScadaLink → ZB.MOM.WW.ScadaBridge`) and substantially
 re-architected: the site SQLite store is now a two-table canonical design (`audit_event`
 append-only + `audit_forward_state` sidecar with a precomputed `IsCachedKind`); the payload
 filter became the `IAuditRedactor` seam (`ScadaBridgeAuditRedactor` + `SafeDefaultAuditRedactor`
 + stateless `AuditRedactionPrimitives`); and milestone work landed (M5.3 T7 inbound response
 capture, M5.5 T3 per-channel retention, K8 KPI source, central health snapshot). The module
 remains one of the best-engineered in the tree: the AuditLog-001..011 fixes from the last pass
 are all present and intact (dual read connection, async scopes, per-EventId reconciliation
 retry escape valve, thread-pool-hop `Dispose`, lifecycle CTS), the best-effort "never abort the
 action" contract is honoured throughout, and redaction over-redacts on fault. This pass found
 **no Critical/High** issues; two design-adherence drifts where documented per-target/purge
 config knobs are silently inert, one error-handling asymmetry that lets a transient DI fault
 restart the ingest singleton, and two Low items. Five new Open findings (AuditLog-012..016).
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | Yes | `IsErrorRow` case-sensitive parse is correct (Status is `enum.ToString()` PascalCase). `IsCachedKind` parse falls back to `InboundRequest` (non-cached) on bad JSON — safe. `OnIngestAsync` scope-resolution throw is uncaught (AuditLog-014). |
 | 2 | Akka.NET conventions | Yes | Sender/EventStream captured before first await across all three central actors; Tell for hot path, Ask only at the gRPC/ClusterClient boundary; supervisor-strategy comments now correctly credit per-row catch (AuditLog-002 fix intact). No findings. |
 | 3 | Concurrency & thread safety | Yes | WAL dual-connection + `_readLock`/`_writeLock` split intact; `Interlocked` counters; `_drainGate` serialises ring drain. `WriteAsync` fast path ignores `ct` when `TryWrite` succeeds (AuditLog-015). |
 | 4 | Error handling & resilience | Yes | Best-effort contract honoured everywhere; gRPC pull collapses tolerable faults to empty; reconciliation retry escape valve (AuditLog-004) intact. `OnIngestAsync` lacks the scope-creation try/catch its sibling `OnCachedTelemetryAsync` has (AuditLog-014). |
 | 5 | Security | Yes | Header/body/SQL-param redaction + over-redact-on-fault safety net solid; `SafeDefaultAuditRedactor` fallback wired at all three writer sites (AuditLog-008 fix intact); SQL built with bound parameters only. No findings. |
 | 6 | Performance & resource management | Yes | Hot path batched + back-pressured; backlog scan off the write lock; partition switch metadata-only; per-channel DELETE batched + clamped; gRPC channel cache race-safe. No findings. |
 | 7 | Design-document adherence | Yes | Per-target `CapBytes` override is documented + has a config property but is read NOWHERE (AuditLog-012); purge config section/key drift from the doc's `AuditLogPurge`/`ChannelPurgeBatchSize` (AuditLog-013). Combined-telemetry transport (AuditLog-001) now wired. `SkipBodyCapture` honoured in InboundAPI middleware (out of module). |
 | 8 | Code organization & conventions | Yes | Composition root well-segmented with "safe from any root" invariant; `INodeIdentityProvider` standardised on `GetRequiredService` (AuditLog-007 fix intact); idempotency sentinel on the health bridge (AuditLog-011 fix intact). No new findings. |
 | 9 | Testing coverage | Yes | Broad (~13k lines). Gaps: no test asserts per-target `CapBytes` is applied (because it isn't — AuditLog-012), and no test binds `AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc (would catch AuditLog-013). |
 | 10 | Documentation & comments | Yes | Mostly excellent and accurate. `AuditLogIngestActor` class XML has a duplicated "the central-side" phrase and several stale "Bundle X" milestone tags that mislead (AuditLog-016). |
 ### AuditLog-012 — Per-target `CapBytes` override is documented and bindable but never read by the redactor
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Redaction/ScadaBridgeAuditRedactor.cs:185-196`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/PerTargetRedactionOverride.cs:13` |
 **Description**
 The design (Component-AuditLog.md §"Payload Capture Policy" / §"Configuration") documents a
 per-target payload-cap override and gives it as a worked example:
 `"PerTargetOverrides": { "Weather/GetForecast": { "CapBytes": 4096 } }`. The
 `PerTargetRedactionOverride.CapBytes` property exists and its XML doc promises "Optional payload
 cap override (bytes); null inherits the global cap." But `ScadaBridgeAuditRedactor.SelectCap`
 chooses among `InboundMaxBytes` / `ErrorCapBytes` / `DefaultCapBytes` only — it never consults
 `opts.PerTargetOverrides[target].CapBytes`. A repo-wide grep confirms `CapBytes` (the per-target
 property) is read in zero production code paths. An operator who sets a per-target cap to bound a
 chatty target's payloads gets the global `DefaultCapBytes`/`ErrorCapBytes` instead, with no error
 and no log — the knob is silently inert. This is a code-vs-spec drift: either the feature was
 dropped in the redactor re-architecture (C2/C3) without updating the doc + override XML, or it
 was never implemented. The `AdditionalBodyRedactors` and `RedactSqlParamsMatching` per-target
 fields ARE honoured, which makes the silent `CapBytes` omission especially easy to miss.
 **Recommendation**
 Either (a) implement it: in `SelectCap`, when `category != ApiInbound` and the resolved
 `PerTargetOverrides` entry for `rawEvent.Target` has a non-null `CapBytes`, use
 `Math.Min`/`Max(over.CapBytes.Value, …)` per the intended semantics (the doc says it *overrides*
 the global cap — clarify whether error rows still get the larger of the two), and add a
 `ScadaBridgeAuditRedactorTests` case asserting a per-target cap shortens a payload the global cap
 would not; or (b) if the feature is intentionally dropped, delete `PerTargetRedactionOverride.CapBytes`,
 remove the `CapBytes` example from the Component-AuditLog.md Configuration block, and note the
 removal in the design doc so the override surface stops advertising a no-op.
 **Resolution**
 Deferred 2026-06-20: whether to implement per-target `CapBytes` in `SelectCap` or delete the unused property + doc example is a design-owner decision (was it intentionally dropped in the redactor re-architecture, or lost?). Recorded; no change this pass.
 ### AuditLog-013 — Purge config section + key drift: documented `AuditLogPurge`/`ChannelPurgeBatchSize` does not bind
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/ServiceCollectionExtensions.cs:56,370-371`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeOptions.cs:40,47` |
 **Description**
 Component-AuditLog.md §"Retention & Purge" / §"Configuration" documents the purge tuning as a
 top-level section named `AuditLogPurge` with keys `IntervalHours` and `ChannelPurgeBatchSize`:
 ```jsonc
 "AuditLogPurge": { "IntervalHours": 24, "ChannelPurgeBatchSize": 5000 }
 ```
 The code does not match on either axis. (1) `ServiceCollectionExtensions.PurgeSectionName` is
 `"AuditLog:Purge"` — a nested subsection, not the documented top-level `AuditLogPurge`. (2) The
 batch-size property is `AuditLogPurgeOptions.ChannelPurgeBatchSizeConfigured`, so its bind key is
 `ChannelPurgeBatchSizeConfigured`, not the documented `ChannelPurgeBatchSize` (which is a
 computed read-only property the binder ignores). There is no `[ConfigurationKeyName]` attribute
 to reconcile the names. The net effect: an operator who follows the design doc and writes
 `"AuditLogPurge": { "ChannelPurgeBatchSize": 1000 }` has BOTH values silently ignored — the
 wrong section path means `IntervalHours` falls back to the 24 h default, and the property-name
 mismatch means the channel batch size falls back to 5000. The actor still functions on defaults,
 so nothing crashes, but the documented operator-facing tuning surface is inert. No test binds
 `AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc, so the drift is unguarded.
 (The §407 doc text says `AuditLogPurge:ChannelPurgeBatchSize`, agreeing with the JSON block and
 disagreeing with the code on both section and key.)
 **Recommendation**
 Pick one canonical shape and make code, doc, and a binding test agree. Recommended: keep the
 nested `AuditLog:Purge` section (consistent with `AuditLog:SiteWriter` / `AuditLog:SiteTelemetry`
 / `AuditLog:Reconciliation`), but then fix the doc's JSON block + §407 to read
 `"AuditLog": { "Purge": { "IntervalHours": …, "ChannelPurgeBatchSize": … } }`, AND add
 `[ConfigurationKeyName("ChannelPurgeBatchSize")]` to `ChannelPurgeBatchSizeConfigured` (or rename
 it and move the clamp into the `Interval`-style computed getter pattern) so the documented key
 binds. Add an `AuditLogOptionsBindingTests`-style test that binds the purge section from an
 in-memory config and asserts both `IntervalHours` and the channel batch size take effect.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added `[ConfigurationKeyName("ChannelPurgeBatchSize")]` so the documented key binds, reconciled the design doc to the real nested `AuditLog:Purge` section, and added a binding test proving the purge options bind from the documented keys.
 ### AuditLog-014 — `AuditLogIngestActor.OnIngestAsync` does not guard scope/repository resolution, so a transient DI fault restarts the singleton and drops the reply
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Error handling & resilience |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:135-148` |
 **Description**
 `OnIngestAsync` opens `await using var scope = _serviceProvider!.CreateAsyncScope();` and
 `scope.ServiceProvider.GetRequiredService<IAuditLogRepository>()` with NO surrounding try/catch.
 The per-row try/catch lives inside `IngestWithRepositoryAsync`, so it only protects the insert
 loop — a throw from scope creation or from `GetRequiredService` (a `DbContext` factory throw on
 transient SQL-connection exhaustion, a pooled-context init fault, a DI resolution race during
 host churn) propagates straight out of the message handler. When a `ReceiveActor`'s async handler
 faults, Akka applies the parent's supervision and restarts the singleton; the captured
 `replyTo.Tell(...)` at line 147 never runs, so the site's `Ask<IngestAuditEventsReply>` times out
 and (correctly) retries — but the singleton has been bounced over a transient fault. The class
 XML even claims "that per-row catch is what keeps this actor alive across handler throws" — which
 is true for insert throws but false for the scope-resolution throw this handler leaves uncaught.
 The sibling `OnCachedTelemetryAsync` (line 216) wraps its entire scope-creation-plus-loop body in
 a try/catch and logs + replies-with-partial precisely to avoid this; `OnIngestAsync` is the
 asymmetric one. `AuditLogPurgeActor.OnTickAsync` and `SiteAuditReconciliationActor.OnTickAsync`
 both guard their `GetRequiredService` with try/catch too — `OnIngestAsync` is the outlier.
 **Recommendation**
 Wrap the scope-creation + repository-resolution in `OnIngestAsync` in a try/catch that mirrors
 `OnCachedTelemetryAsync`: on failure, log the resolution fault and `replyTo.Tell(new
 IngestAuditEventsReply(accepted))` (accepted will be whatever was processed before the throw,
 typically empty) so the site keeps its rows `Pending` and retries on the next drain without
 bouncing the singleton. Optionally bump the `ICentralAuditWriteFailureCounter` on that path too,
 so a sustained DI/connection fault is visible on the dashboard.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): `OnIngestAsync` now wraps the scope-creation + repo resolution + ingest in a try/catch that logs, best-effort bumps the write-failure counter, and still `replyTo.Tell(IngestAuditEventsReply(accepted))` — a transient DI/DbContext fault no longer bounces the central ingest singleton or drops the reply (mirrors `OnCachedTelemetryAsync`).
 ### AuditLog-015 — `SqliteAuditWriter.WriteAsync` ignores the cancellation token on the fast (`TryWrite`) path
 | | |
 |--|--|
 | Severity | Low |
 | Category | Concurrency & thread safety |
 | Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:254-275` |
 | Status | Deferred |
 **Description**
 `WriteAsync(AuditEvent, CancellationToken ct)` accepts a `ct` but, on the common fast path where
 `_writeQueue.Writer.TryWrite(pending)` succeeds, it returns `pending.Completion.Task` without ever
 observing `ct`. The token is only honoured on the slow path (`WriteSlowPathAsync`, which passes it
 to `WriteAsync`). So a caller that passes an already-cancelled (or soon-cancelled) token still
 enqueues the row and then awaits an uncancellable completion. In practice the hot-path callers
 (`FallbackAuditWriter`, `CachedCallTelemetryForwarder`) thread a real `ct` through, and the audit
 contract is best-effort durable-in-microseconds, so the impact is minor — the row simply gets
 written instead of cancelled. But the signature advertises cancellation it does not deliver on the
 dominant path, which is a latent surprise for any future caller that relies on it (e.g. to bound a
 shutdown drain). The completion `Task` returned is not linked to `ct`, so an awaiter cannot abandon
 the wait either.
 **Recommendation**
 Either (a) honour the token on the fast path — `ct.ThrowIfCancellationRequested()` before
 `TryWrite`, and return a `ct`-linked wait (e.g. `pending.Completion.Task.WaitAsync(ct)`) so an
 awaiter can abandon a stuck completion; or (b) if cancellation is intentionally a no-op for this
 best-effort hot path, drop the `ct` parameter (or document on the method that it is observed only
 on the back-pressured slow path) so the surface stops advertising behaviour it does not provide.
 **Resolution**
 Deferred 2026-06-20: whether cancellation should be honoured on the best-effort `TryWrite` hot path (vs. intentionally a no-op) is a design decision. Recorded; no change this pass.
 ### AuditLog-016 — `AuditLogIngestActor` class XML has a duplicated phrase and stale "Bundle"/`IngestedAtUtc`-column wording
 | | |
 |--|--|
 | Severity | Low |
 | Category | Documentation & comments |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:13-46` |
 **Description**
 The `AuditLogIngestActor` class summary reads "Each row is stamped with the central-side **the
 central-side IngestedAtUtc**" — the phrase "the central-side" is duplicated, a copy/paste slip
 that survived the rename. The same summary describes `IngestedAtUtc` as if it were a promoted
 column, but post-C3 it is a `DetailsJson` field stamped via `AuditRowProjection.WithIngestedAtUtc`
 (as the handler bodies and their own inline comments correctly note) — the class-level prose is
 now stale relative to the canonical-record shim. The remarks also still narrate the design in
 "Bundle A / Bundle D / Bundle E" milestone terms (and the two-constructor rationale references
 "Bundle D's tests" / "Bundle E's host wiring"), which no longer map to any current artifact and
 read as archaeology to a new maintainer. None of this is wrong behaviourally, but the duplicated
 phrase plus the column-vs-DetailsJson mismatch is exactly the kind of doc rot that misleads.
 **Recommendation**
 Fix the duplicated "the central-side the central-side" to a single phrase; reword the
 `IngestedAtUtc` sentence to say it is stamped into `DetailsJson` (matching the handler comments);
 and either drop the "Bundle X" tags in favour of the feature/milestone names used elsewhere in
 the module, or add a one-line legend. A quick sweep of the other central actors' class XML for
 the same "Bundle" archaeology would be worthwhile while in there.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): class XML doc deduped ('the central-side' duplicate removed), `IngestedAtUtc` reworded to 'stamped into DetailsJson (no promoted column)', and the stale Bundle milestone tags replaced with role-neutral wording.
 ### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production
 | | |
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure` |
 | Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 1 |
+| Open findings | 0 |
 ## Summary
@@ -82,6 +82,32 @@ either did not surface or that have aged into the file:
  but the method now offers nothing — keeping it is API-surface noise that an
  IDE will still suggest via auto-complete.
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 The module remains small and clean — `ClusterOptions`, `ClusterOptionsValidator`,
 and `ServiceCollectionExtensions`, all well-documented and well-tested. The
 split-brain-resolver, seed-node, and failure-detector configuration contract
 matches the design doc, and all fourteen prior findings (CI-001..014) still hold
 as resolved: `DownIfAlone` is now consumed by the Host's HOCON builder, the dead
 `AddClusterInfrastructureActors` surface is gone, and the validator enforces every
 catastrophic value. The only new observation is one Low code-organization item — the
 validator hand-rolls every rule via raw `RequireThat` rather than the base's
 `MinCount`/`OneOf`/`PositiveTimeSpan` primitives (CI-015), a deliberate trade-off
 flagged as a judgment call rather than a clear defect.
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ✓ | Validator rules and DI registration are correct. No new defects. |
 | 2 | Akka.NET conventions | ✓ | No actors in this module (legitimate, per CI-001 resolution). Nothing actor-shaped to evaluate. |
 | 3 | Concurrency & thread safety | ✓ | Validator and DI extension remain stateless. No issues. |
 | 4 | Error handling & resilience | ✓ | Validator rejects every catastrophic value the design doc enumerates, now including `DownIfAlone = false` (CI-010) and `SeedNodes.Count < 2` (CI-012). No issues. |
 | 5 | Security | ✓ | No authn/authz surface, no secret handling, no remoting transport configured here. No issues. |
 | 6 | Performance & resource management | ✓ | No resources held; validator allocates a small failure list per call only. No issues. |
 | 7 | Design-document adherence | ✓ | `ClusterOptions` contract complete; SBR/seed/failure-detector config matches the design doc. No new drift. |
 | 8 | Code organization & conventions | ✓ | Options/validator placement and Options pattern correct; CI-014 dead surface removed. New — validator hand-rolls every rule via raw `RequireThat` instead of the `OptionsValidatorBase` primitives `MinCount`/`OneOf`/`PositiveTimeSpan` (CI-015, a judgment call). |
 | 9 | Testing coverage | ✓ | 19 tests across three classes covering options, validator (incl. single-seed and `DownIfAlone=false`), and DI registration. No gaps. |
 | 10 | Documentation & comments | ✓ | XML docs accurate across all source files; CI-013 intent comment present. No issues. |
 ## Checklist coverage
 Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
@@ -857,3 +883,72 @@ explicitly stating that this project exposes no actor-registration extension
 (actor wiring lives in `ZB.MOM.WW.ScadaBridge.Host`). If the user prefers to keep the
 "fail-fast" trap, mark the method `[Obsolete(true, error: true)]` so the compiler —
 not the runtime — rejects the call.
 ### ClusterInfrastructure-015 — Validator hand-rolls every rule via raw `RequireThat` instead of the base primitives
 | | |
 |--|--|
 | Severity | Low |
 | Category | Code organization & conventions |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/ClusterOptionsValidator.cs:29-57` |
 **Description**
 `ClusterOptionsValidator` derives from `OptionsValidatorBase<ClusterOptions>`
 (`ZB.MOM.WW.Configuration`), whose `ValidationBuilder` exposes a set of typed
 validation primitives — `MinCount<T>` ("requires a collection with at least N
 items"), `OneOf` ("requires the value to be one of `allowed`, case-insensitive"),
 `PositiveTimeSpan` ("requires a strictly positive duration"), plus `Required`,
 `Port`, and `HostPort` — alongside the lower-level `RequireThat(bool, message)`
 escape hatch for "custom or cross-field rules". `ClusterOptionsValidator.Validate`
 expresses *every* one of its rules through raw `RequireThat`, even where a
 first-class primitive exists for the exact check:
 - the `SeedNodes.Count >= 2` rule (29-33) is what `MinCount<string>(options.SeedNodes, 2, …)` does;
 - the `keep-oldest`-only strategy rule (35-39) is what `OneOf(options.SplitBrainResolverStrategy, AllowedStrategies, …)` does;
 - the three `> TimeSpan.Zero` timing rules (45-52) are each what `PositiveTimeSpan(…)` does.
 Only the `MinNrOfMembers == 1` equality check, the cross-field
 `HeartbeatInterval < FailureDetectionThreshold` comparison, and the `DownIfAlone`
 boolean (59-63) genuinely require `RequireThat` (no primitive covers them). Routing
 the three primitive-eligible rules through `RequireThat` makes this validator
 inconsistent with the *intent* of the shared base (the primitives exist precisely so
 common rules read uniformly across the codebase's validators) and risks subtle
 wording drift between modules for identical checks.
 This is **explicitly a deliberate trade-off, not a clear win**, and is filed as a
 judgment call rather than a defect. Adopting the primitives would **lose** what makes
 the current messages valuable: each hand-rolled message cites the governing design-doc
 section (`Component-ClusterInfrastructure.md → Node Configuration` / `→ Split-Brain
 Resolution`) and spells out the catastrophic consequence of the misconfiguration
 ("would risk a total cluster shutdown on a partition", "blocks the cluster singleton
 after failover and halts all data collection", "the oldest node can run as an isolated
 single-node cluster"). The shared primitives emit terse, generic messages
 ("requires a collection with at least 2 items") that cannot carry that
 safety-critical context. The peer `AuditLogOptionsValidator` likewise prefers raw
 `RequireThat` for the same reason, so the current style is at least internally
 consistent.
 **Recommendation**
 Treat this as a conscious choice between two competing goods rather than assuming a
 fix is warranted:
 - **Adopt the primitives** (`MinCount`/`OneOf`/`PositiveTimeSpan` for the three
  eligible rules; keep `RequireThat` for the equality, cross-field, and boolean
  checks) if codebase-wide *wording consistency* across validators is the priority —
  accepting that the design-doc citations and consequence explanations move out of the
  failure message (e.g. into XML docs or an appended suffix).
 - **Keep as-is** if the *richer, safety-critical failure messages* are the priority
  for a module whose misconfiguration is cluster-fatal — accepting the minor
  inconsistency with the base's primitive surface.
 Do not assume either direction; decide deliberately. If kept as-is, a one-line code
 comment recording the choice (richer messages over primitive uniformity) would stop a
 future reader from "tidying" the validator into the primitives and silently dropping
 the consequence text.
 **Resolution**
 Deferred 2026-06-20: this is a deliberate trade-off, not a clear win — the hand-rolled `RequireThat` messages carry design-doc citations and cluster-fatal consequence text that the shared base primitives (`MinCount`/`OneOf`/`PositiveTimeSpan`) cannot, and the peer `AuditLogOptionsValidator` follows the same style. Adopting the primitives would prioritize cross-validator wording consistency over those richer safety-critical messages — a deliberate decision left to the owner. No change this pass.
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.Communication` |
 | Design doc | `docs/requirements/Component-Communication.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 2 |
+| Open findings | 0 |
 ## Summary
@@ -68,6 +68,41 @@ keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
 (Communication-022). Seven new findings, all Open: one High, one Medium, five
 Low.
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 All prior findings (Communication-001..022) remain closed (Resolved, or Resolved-by-Comm-016).
 This module was renamed `ScadaLink → ZB.MOM.WW.ScadaBridge` since `1eb6e97`, so the
 `git diff 1eb6e97 HEAD` shows the whole module as added; the review re-examined the
 full current state across all 10 categories, with focus on the surface added since the
 last review — the four new gRPC RPCs on `SiteStreamService` (`IngestAuditEvents`,
 `IngestCachedTelemetry`, `PullAuditEvents`, `PullSiteCalls`) and the M7 command/control
 forwards (`SearchAddressSpaceCommand`, `WriteTagRequest`, `VerifyEndpointCommand`,
 cert-trust commands). The module is healthy: actor state stays on the actor thread,
 `PipeTo`/`Forward`/Sender-capture are clean, gRPC channel/stream/CTS lifecycles are
 disciplined, the duplicate-stream and Subscribe-throws cleanup paths are correct, and
 the gRPC ingest/pull handlers degrade best-effort (empty ack / empty response, rows stay
 Pending) on fault. The only new findings are **design-doc drift**: the doc still claims
 `SiteStreamService` has "a single RPC `SubscribeInstance`" and that audit/telemetry
 reconciliation rides ClusterClient, when in fact four RPCs now exist and the
 reconciliation *pull* runs over gRPC (Communication-023); and the AlarmStateUpdate field
 table stops at field 21 while the proto/code carry fields 22–23 (Communication-024). Two
 new findings, both Open: zero Critical/High, zero Medium-code, one Medium-doc, one Low.
 ## Checklist coverage 2026-06-20
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ✓ | `SubscribeInstance` duplicate-removal + Subscribe-throws cleanup (Comm-021) correct; telemetry open/close gauge is balanced (counted only after Subscribe succeeds, decremented in finally). `DebugStreamBridgeActor` stream-first buffer/dedup sound. No new issues. |
 | 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies present on both actors; `Forward` preserves Sender on every site relay; `SiteAddressCacheLoaded` now read-only-typed (Comm-020). gRPC ingest handlers capture Sender before `PipeTo`. No new issues. |
 | 3 | Concurrency & thread safety | ✓ | `_activeStreams`/`_subscriptions` use ownership-preserving `TryRemove(KeyValuePair)`; `GetOrCreate` uses `AddOrUpdate` atomically; bridge state mutated only on the actor thread via `Self.Tell`. No new issues. |
 | 4 | Error handling & resilience | ✓ | gRPC `Ingest*`/`Pull*` degrade to empty ack/response on fault (rows stay Pending) — divergent from the ClusterClient path's `Status.Failure` pipe but behaviourally equivalent (both → Pending+retry), and explicitly documented. `LoadSiteAddressesFromDb` lifecycle-CT bounded (Comm-019). No new issues. |
 | 5 | Security | ✓ | `correlation_id` validated against `ActorPath.IsValidPathElement` before actor naming (Comm-014). `Pull*` reject `batch_size <= 0` with `InvalidArgument`. Secrets ride encrypted blocks, not this layer. No new issues. |
 | 6 | Performance & resource management | ✓ | gRPC channel disposed synchronously (Comm-007); per-stream CTS `CancelAfter(GrpcMaxStreamLifetime)`; bounded `Channel` (1000, DropOldest). `RemoveSiteAsync` still test-only but Comm-013 documents that endpoint change is handled by `GetOrCreate`. No new issues. |
 | 7 | Design-document adherence | ✓ | Doc claims a *single* RPC and ClusterClient-only reconciliation; reality is 4 RPCs + gRPC reconciliation pulls (Communication-023). AlarmStateUpdate field table stops at 21; proto/code carry 22–23 (Communication-024). |
 | 8 | Code organization & conventions | ✓ | DTO mappers correctly hosted in Communication (owns generated DTOs); proto evolution is additive (field numbers 22–23 appended, never reused); Options pattern intact. No new issues. |
 | 9 | Testing coverage | ✓ | `ProtoContractTests` guards oneof variants; `StreamRelayActorTests` + `ConvertToDomainEvent` cover enriched fields 8–23 end-to-end; dedicated proto tests cover the audit/telemetry/pull DTOs. `ProtoRoundtripTests.AlarmStateUpdate_RoundTrip` only asserts fields 1–5, but the gap is covered elsewhere — no separate finding. |
 | 10 | Documentation & comments | ✓ | In-code XML docs are thorough and accurate (the prior-finding comments are precise). The drift is in the design doc only (Communication-023/024), not the code comments. |
 ## Checklist coverage 2026-05-28
 | # | Category | Examined | Notes |
@@ -1120,3 +1155,102 @@ The `_debugSubscriptions` dictionary, `TrackMessageForCleanup` helper, and the
 `HandleConnectionStateChanged` handler that consumed them were all deleted as
 part of Comm-016's dead-code purge. There is no longer any caller-supplied
 correlation-id keyed map to overwrite — the orphan-on-reuse hazard is gone.
 ---
 ### Communication-023 — Design doc describes `SiteStreamService` as a single RPC and reconciliation as ClusterClient-only; both are now stale
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
 | Status | Resolved |
 | Location | `docs/requirements/Component-Communication.md:85`, `docs/requirements/Component-Communication.md:160-166`, `src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto:8-14` |
 **Description**
 The design doc's "gRPC Proto Definition" section states: *"**Service**: `SiteStreamService`
 with **a single RPC** `SubscribeInstance(InstanceStreamRequest) returns (stream
 SiteStreamEvent)`."* (line 85). The proto and the generated service now expose **five**
 RPCs — `SubscribeInstance`, `IngestAuditEvents`, `IngestCachedTelemetry`,
 `PullAuditEvents`, and `PullSiteCalls` (`sitestream.proto:8-14`,
 `SiteStreamGrpcServer.cs:211/330/386/436/519`). The four new RPCs are not mentioned
 anywhere in the component doc — neither the proto section, the responsibilities, the
 topology diagram, nor the Dependencies (the server now also depends on `ISiteAuditQueue`
 and `IOperationTrackingStore`).
 This drift also contradicts two transport statements elsewhere in the same doc. The
 "Cached Call Telemetry" section (line 166) and the "Notification Submission" section
 (line 158) both assert **"Transport: ClusterClient"**, and the reconciliation-pull
 paragraph (line 164) describes `CachedCallReconcileRequest`/`CachedCallReconcileResponse`.
 In the shipped code the reconciliation **pull** runs over **gRPC**, not ClusterClient:
 `GrpcPullAuditEventsClient` / `GrpcPullSiteCallsClient`
 (`src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/`) call the `PullAuditEvents` / `PullSiteCalls`
 RPCs against `SiteStreamGrpcServer`. So the gRPC channel — documented as carrying
 **real-time data only** ("`SiteCommunicationActor` and `CentralCommunicationActor` have no
 role in streaming event delivery", "gRPC server-streaming for real-time data") — now also
 carries site→central audit/telemetry **ingest** and central→site **reconciliation pulls**.
 A reader relying on the doc to understand the two-transport split (the central CLAUDE.md
 "Data & Communication" invariant) would be materially misled about what flows over which
 transport.
 **Recommendation**
 Update `Component-Communication.md`: (1) replace "a single RPC" with the full RPC list
 and document each new RPC's direction, payload, and idempotency; (2) reconcile the
 "Transport: ClusterClient" / `CachedCallReconcile*` wording in §10 (and the Site Call
 Audit / Audit Log interaction notes) with the gRPC `PullAuditEvents` / `PullSiteCalls`
 reality; (3) add `ISiteAuditQueue` / `IOperationTrackingStore` to the SiteStreamGrpcServer
 description and Dependencies. If the doc's "gRPC = real-time data only" framing is meant
 to be a hard invariant, state explicitly that audit/telemetry ingest and reconciliation
 are the documented exceptions.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): the design doc (Component-Communication.md) now
 enumerates all five `SiteStreamService` RPCs (the one server-streaming `SubscribeInstance`
 plus the unary `IngestAuditEvents`/`IngestCachedTelemetry`/`PullAuditEvents`/`PullSiteCalls`),
 reconciles the §10 reconciliation wording with the gRPC-pull reality (push telemetry stays
 ClusterClient; reconciliation pulls are gRPC request/response), and adds `ISiteAuditQueue`/`IOperationTrackingStore`
 to Dependencies. Doc-only.
 ---
 ### Communication-024 — AlarmStateUpdate proto carries fields 22–23 but the design-doc field table stops at 21
 | | |
 |--|--|
 | Severity | Low |
 | Category | Design-document adherence |
 | Status | Resolved |
 | Location | `docs/requirements/Component-Communication.md:92-114`, `src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto:87-88` |
 **Description**
 The "Enriched AlarmStateUpdate" section states the message "was extended **additively**:
 existing fields 1–7 are unchanged, and **fields 8–21** carry the enriched native-alarm
 state" (line 92), and the field table enumerates fields 8–21 (lines 94-109). The proto
 now also defines **field 22 `native_source_canonical_name`** and **field 23
 `is_configured_placeholder`** (`sitestream.proto:87-88`), both populated on the server
 (`StreamRelayActor.HandleAlarmStateChanged`, lines 97-98) and read back on the client
 (`SiteStreamGrpcClient.ConvertToDomainEvent`, lines 257-258). The doc's field table and
 its "fields 8–21" / "repopulated from fields 8–21" prose (lines 92, 112) are stale — they
 omit fields 22–23. `is_configured_placeholder` in particular is load-bearing: it is the
 discriminator `StreamRelayActor` uses to drop snapshot-only placeholder rows before
 packing the live stream (lines 63-66), a behaviour the doc does not capture.
 The proto evolution itself is correct (additive, field numbers appended and never reused),
 so this is purely doc staleness, not a contract regression.
 **Recommendation**
 Extend the AlarmStateUpdate field table to fields 22–23, update the "fields 8–21" prose to
 "8–23" in both the intro and the client-mapping bullet, and add a one-line note that
 `is_configured_placeholder` rows are dropped at the relay (never relayed to the live gRPC
 stream).
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): the `AlarmStateUpdate` field table was extended to
 fields 22–23 (`native_source_canonical_name`, `is_configured_placeholder`) with a note that
 placeholder rows are dropped at the relay. Doc-only.
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer` |
 | Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 5 |
+| Open findings | 0 |
 ## Summary
@@ -116,6 +116,37 @@ DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
 ## Findings
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 The 2026-06-20 full re-review walked all 10 checklist categories against the current
 source, focused on the M7 native-alarm subscribe path, the OPC UA node-browser
 (BrowseNext/search/type-info), and the verify-endpoint + site-local cert-trust surface.
 The M7 surface is well-built overall — the verify-endpoint probe correctly *captures
 but never trusts* an untrusted server certificate, the per-node `CertStoreActor`
 broadcast keeps both site nodes' PKI stores consistent across failover, and the
 node-browser paging/search/type-info additions are clean. All 22 prior findings remain
 `Resolved` and their fixes were verified in place. The review found **4 new findings**,
 all clustering on the native-alarm subscribe path, which did **not** inherit the
 guards the tag-subscribe path earned across DCL-018 / DCL-021 / DCL-022: the alarm
 path leaks its adapter feed on a mid-flight unsubscribe (no DCL-021-style obsolete-
 completion guard), has no alarm-resolution retry to match the tag path's
 `tag-resolution-retry` timer, leaks `_alarmCts` on `DisposeAsync` (only
 `DisconnectAsync` tears it down), and carries a stale "first subscriber wins" comment
 on a last-wins filter assignment.
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | x | No new issues; DCL-016/020/021 counter/response fixes verified. Alarm-path snapshot atomic-swap + per-source prefix routing are correct. |
 | 2 | Akka.NET conventions | x | No new issues; DCL-022 `IsTimerActive` gate verified. Alarm subscribe uses `ContinueWith(...).PipeTo(Self)` with `Self`/generation captured (no `Sender`/`this` in closures). |
 | 3 | Concurrency & thread safety | x | Finding 023 — native-alarm mid-flight unsubscribe leaks the adapter alarm feed (the tag path's DCL-021 obsolete-completion guard was not mirrored on the alarm path). |
 | 4 | Error handling & resilience | x | Finding 024 — no alarm-resolution retry: a failed initial alarm subscribe strands the subscriber with no feed until the next full reconnect (the tag path retries on `tag-resolution-retry`). |
 | 5 | Security | x | No new issues. The M7 verify-endpoint probe builds a temporary `RealOpcUaClient`, captures the untrusted server cert, and NEVER trusts it; cert trust is gated through `CertStoreActor` broadcast to both nodes. DCL-012/014 auto-accept warning + Commons default remain out-of-scope follow-ups. |
 | 6 | Performance & resource management | x | Finding 025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts` (alarm task + CTS leak on failover/stop); `DisconnectAsync` already cancels+disposes it under `_alarmLock`. |
 | 7 | Design-document adherence | x | No new issues; M7 native-alarm read-only mirror, `AlarmKind` discriminator, and per-source `IAlarmSubscribableConnection` feed match the design. DCL-009 doc action still open at doc level (out of scope). |
 | 8 | Code organization & conventions | x | No issues — alarm messages/POCOs in Commons, `IAlarmSubscribableConnection` capability seam in the module, options class owned by component. |
 | 9 | Testing coverage | x | DCL001–022 regression tests present. Gaps for findings 023 (alarm unsubscribe mid-flight), 024 (failed alarm subscribe → no retry), 025 (`DisposeAsync` alarm-CTS leak). |
 | 10 | Documentation & comments | x | Finding 028 — the `_alarmSourceFilter` "first subscriber wins" XML comment contradicts the last-wins assignment in `HandleSubscribeAlarms`. |
 ### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread
 | | |
@@ -1267,3 +1298,209 @@ calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply
 same gate at both call sites. Add a regression test that fires 5 subscribes with
 unresolved tags within one retry interval and asserts the retry fires at most one
 interval after the first failure (not after the fifth subscribe).
 ### DataConnectionLayer-023 — Native-alarm mid-flight unsubscribe leaks the adapter alarm feed
 | | |
 |--|--|
 | Severity | High |
 | Category | Concurrency & thread safety |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1773,1790-1809,1864-1882` |
 **Description**
 The native-alarm subscribe path mirrors the tag-subscribe path's
 dispatch-I/O-then-`PipeTo(Self)` shape but never inherited the obsolete-completion
 guard the tag path earned in DCL-021. `HandleSubscribeAlarms` adds the source to
 `_alarmSubscribesInFlight` (line 1773) and dispatches
 `alarmable.SubscribeAlarmsAsync(...)` whose `ContinueWith` pipes an
 `AlarmSubscribeCompleted` back to `Self`. If an `UnsubscribeAlarmsRequest` for the
 last (or only) subscriber is processed on the actor thread between that dispatch and
 the completion, `HandleUnsubscribeAlarms` (lines 1864-1882) removes the subscriber,
 empties `_alarmSourceSubscribers`, and removes the filter entries — but it tries to
 tear down the adapter feed via `_alarmSubscriptionIds.Remove(...)`, which **fails**
 because the subscription id has not been stored yet (it is still in flight). It also
 leaves the stale `_alarmSubscribesInFlight` entry in place. The late
 `AlarmSubscribeCompleted` then reaches `HandleAlarmSubscribeCompleted` (lines
 1790-1809), which **unconditionally** stores `_alarmSubscriptionIds[msg.SourceReference]
 = msg.SubscriptionId` (line 1796) without re-checking whether any subscriber still
 exists in `_alarmSourceSubscribers`.
 The net result is an orphaned adapter alarm feed: the OPC UA monitored-item set / the
 MxGateway gateway-wide alarm stream stays alive for a source with **zero** subscribers,
 streaming transitions into `HandleAlarmTransitionReceived` that match no subscriber set
 and are silently dropped — the feed is never torn down for the lifetime of the adapter
 (only a full `ReSubscribeAllAlarms` on reconnect clears `_alarmSubscriptionIds`, and
 even that re-subscribes from `_alarmSourceSubscribers`, which no longer contains the
 orphaned source, so the stale device-side feed is simply abandoned, not closed). The
 tag path closes exactly this race in `HandleSubscribeCompleted` (lines 834-867) by
 releasing the just-created adapter handle and clearing `_subscribesInFlight` when the
 instance entry has gone; the alarm path does neither. This matters because native-alarm
 sources are created/destroyed on every deploy/undeploy and instance stop, so each
 mid-flight unsubscribe permanently leaks one device-side alarm subscription and its
 publish traffic.
 **Recommendation**
 Mirror the closed DCL-021 fix on the alarm path. In `HandleAlarmSubscribeCompleted`,
 before storing the id, guard on the live subscriber set: `if
 (!_alarmSourceSubscribers.ContainsKey(msg.SourceReference)) {
 if (msg.Success && msg.SubscriptionId != null && _adapter is
 IAlarmSubscribableConnection alarmable) _ =
 alarmable.UnsubscribeAlarmsAsync(msg.SubscriptionId); return; }` so a feed that
 completed after its last subscriber left is released at the adapter rather than stored.
 Additionally clear the `_alarmSubscribesInFlight` entry in `HandleUnsubscribeAlarms`
 when the last subscriber leaves, so the in-flight marker does not linger. Add a
 regression test that subscribes a source, sends `UnsubscribeAlarmsRequest` while the
 alarm subscribe I/O is in flight, completes the subscribe, and asserts
 `UnsubscribeAlarmsAsync` is called and `_alarmSubscriptionIds` /
 `_alarmSubscribesInFlight` are clean.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): `HandleAlarmSubscribeCompleted` now guards against an orphaned completion (if no live subscriber remains for the source, it tears down the just-created adapter subscription and returns) and `HandleUnsubscribeAlarms` clears the in-flight marker when the last subscriber leaves — mirroring the DCL-021 tag-path fix. Regression test added (verified failing pre-fix).
 ### DataConnectionLayer-024 — No alarm-resolution retry: a failed initial alarm subscribe strands the subscriber until the next full reconnect
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Error handling & resilience |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1752-1757,1790-1809,1885-1908` |
 **Description**
 When an initial native-alarm subscribe fails, the subscriber is left registered but
 with no feed and no path to recovery short of a full connection reconnect.
 `HandleSubscribeAlarms` registers the subscriber in `_alarmSourceSubscribers` (lines
 1752-1757) **before** issuing the adapter subscribe, so a transition arriving
 mid-subscribe is still routed. If `alarmable.SubscribeAlarmsAsync` fails, the piped
 `AlarmSubscribeCompleted` reaches `HandleAlarmSubscribeCompleted` (lines 1790-1809),
 which on the failure branch only logs a warning (line 1801) and replies a
 `SubscribeAlarmsResponse(success: false, ...)`. It does **not** re-arm any retry, does
 **not** push `NativeAlarmSourceUnavailable` to the subscriber, and does **not** drop
 the source from `_alarmSourceSubscribers`. The subscriber therefore stays in the
 routing map with the adapter feed never established.
 Recovery only happens via `ReSubscribeAllAlarms` (lines 1885-1908), which is invoked
 solely from `BecomeConnected` after a full connection reconnect (line 583). So a
 transient device-side failure on the *initial* alarm subscribe — the alarm server
 still booting, a momentary fault — leaves the source dark until the entire connection
 happens to cycle through `Reconnecting`. This diverges from the tag path, which arms a
 periodic `tag-resolution-retry` timer (lines 989-993, 1678-1682; fired by
 `HandleRetryTagResolution` at line 1491) so a tag that fails to resolve at subscribe
 time is retried every `TagResolutionRetryInterval` (10 s default) without needing a
 reconnect. The native-alarm source has no analogous self-healing.
 This is partly a **design judgment call** rather than a clear-cut defect: the M7 design
 models native alarms as a read-only mirror, and "drop the source and signal the
 subscriber" is an equally valid contract to "retry the alarm subscribe periodically."
 The defect is that the code currently does **neither** — it silently leaves a
 registered subscriber with a permanently-dark feed and no signal — which is the worst
 of both.
 **Recommendation**
 Pick one of the two valid contracts and implement it (this is a judgment call — do not
 assume which is wanted):
 - **Retry:** add an `alarm-resolution-retry` periodic timer (mirroring
  `tag-resolution-retry`): track sources whose subscribe failed, re-issue
  `SubscribeAlarmsAsync` on the timer, and gate the timer start with
  `IsTimerActive` (per the DCL-022 pattern) so a burst of failed subscribes does not
  reset it.
 - **Fail fast:** on the failure branch of `HandleAlarmSubscribeCompleted`, push
  `NativeAlarmSourceUnavailable` to the subscriber(s) for that source and remove the
  source from `_alarmSourceSubscribers` / `_alarmSubscribesInFlight`, so the
  `NativeAlarmActor` learns the feed is unavailable rather than waiting forever.
 Either way, add a regression test for "initial alarm subscribe fails → subscriber is
 not silently left dark."
 **Resolution**
 Deferred 2026-06-20: the fix (add an alarm-resolution retry timer vs. push `NativeAlarmSourceUnavailable` and drop the source on initial-subscribe failure) is a design-owner decision; the current code does neither. Awaiting that decision before implementing.
 ### DataConnectionLayer-025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts`, leaking the alarm task on failover/stop
 | | |
 |--|--|
 | Severity | Low |
 | Category | Performance & resource management |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/MxGatewayDataConnection.cs:277-283,121-134` |
 **Description**
 `MxGatewayDataConnection.DisposeAsync` (lines 277-283) cancels `_eventLoopCts` and
 disposes the underlying `_client`, but never touches `_alarmCts`. The alarm feed is
 established in `SubscribeAlarmsAsync` (lines 153-175): under `_alarmLock` it lazily
 creates `_alarmCts = new CancellationTokenSource()` and launches a long-running
 `Task.Run(() => client.RunAlarmStreamAsync(null, ..., token))`. That background alarm
 stream is bound only to `_alarmCts.Token`. `DisconnectAsync` (lines 121-134) already
 tears it down correctly — under `_alarmLock` it cancels and disposes `_alarmCts`,
 nulls it, and resets `_alarmSubCount` — but `DisposeAsync` does not, so the
 `CancellationTokenSource` and the alarm-stream `Task` are both leaked whenever the
 adapter is disposed without a prior `DisconnectAsync`.
 The `DataConnectionActor` disposes adapters fire-and-forget on failover (and on
 actor/connection stop) via `_adapter.DisposeAsync()` without necessarily calling
 `DisconnectAsync` first, so every MxGateway failover or connection teardown that goes
 through `DisposeAsync` leaks one running alarm-stream task plus its CTS. Severity is
 Low because the leaked task is cancellation-bound to a CTS that will be GC-reclaimed
 eventually and the gRPC stream it holds will fault when `_client` is disposed, but a
 still-running alarm-stream loop racing a disposed client is precisely the class of
 dangling background work the lock-guarded `DisconnectAsync` block was written to avoid.
 **Recommendation**
 In `DisposeAsync`, cancel and dispose `_alarmCts` under `_alarmLock` before disposing
 the client — copy the block already present in `DisconnectAsync` (lines 124-130):
 `lock (_alarmLock) { _alarmCts?.Cancel(); _alarmCts?.Dispose(); _alarmCts = null;
 _alarmSubCount = 0; }`. This guarantees the alarm-stream task is cancelled
 deterministically on every teardown path, not just the `DisconnectAsync` one.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): `MxGatewayDataConnection.DisposeAsync` now cancels and disposes `_alarmCts` under `_alarmLock`, matching `DisconnectAsync` — no more alarm task/CTS leak on failover/stop.
 ### DataConnectionLayer-028 — `_alarmSourceFilter` "first subscriber wins" comment contradicts the last-wins code
 | | |
 |--|--|
 | Severity | Low |
 | Category | Documentation & comments |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:102-103,1758-1761` |
 **Description**
 The XML doc on the `_alarmSourceFilter` field (lines 102-103) reads "sourceReference →
 raw condition filter string passed to the adapter (**first subscriber wins**)." The
 code does the opposite: `HandleSubscribeAlarms` **unconditionally** overwrites the
 filter on every subscribe — `_alarmSourceFilter[request.SourceReference] =
 request.ConditionFilter;` (line 1758) and likewise re-parses
 `_alarmSourceFilterPredicate[request.SourceReference] =
 AlarmConditionFilter.Parse(request.ConditionFilter)` (line 1761) — with no
 "already present" guard. So a **second** subscriber to the same source reference
 re-filters the shared feed with its own condition filter, i.e. **last subscriber
 wins**, not first. The comment will mislead a maintainer reasoning about which filter
 governs a shared alarm feed when two instances subscribe to the same source with
 different condition filters, and it understates a real behavioural subtlety (the second
 subscriber silently changes the gate applied to the first subscriber's transitions in
 `HandleAlarmTransitionReceived`).
 **Recommendation**
 Correct the comment to "last subscriber wins" (and note that the shared feed carries a
 single filter, so co-subscribers to one source reference must agree on the condition
 filter). If per-subscriber filtering is actually intended, make the predicate
 per-subscriber rather than per-source so each subscriber's own filter governs only its
 own deliveries — but that is a behaviour change, not a comment fix, and should be
 decided explicitly.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): the `_alarmSourceFilter` comment corrected from 'first subscriber wins' to 'last subscriber wins' (and notes co-subscribers share the single filter).
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager` |
 | Design doc | `docs/requirements/Component-DeploymentManager.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |
 ## Summary
@@ -83,8 +83,51 @@ The remaining six findings are medium/low: lifecycle-timeout audit gap
 (DeploymentManager-023), and shared static state across `*ProbeActor` tests
 (DeploymentManager-024).
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 Re-reviewed the whole current module at HEAD after the rename, the
 cert-broadcast / Transport `IStaleInstanceProbe` work, and milestone changes.
 DeploymentManager-001..024 all remain `Resolved` and verified against source —
 the ref-counted `OperationLockManager`, the broadened/non-cancellable failure
 writes, the `ApplyPostSuccessSideEffectsAsync` shared helper with
 `forceEnabledState` (Disabled-preservation), the lifecycle-timeout audit helper,
 the structured diff with List-value normalization, the hoisted global artifact
 fetch, and the instance-state-aware reconciliation are all present and correct.
 The two flagged cross-module/architectural seams the prompt called out — the
 `TrustServerCert`/`RemoveServerCert` broadcast-to-both-nodes and the
 `DeploymentManagerActor` deploy-state query handler — live in **SiteRuntime /
 Communication / CentralUI**, not this module, so they are out of scope here.
 This review found **3 new findings**. The material one is DeploymentManager-025:
 the system-wide artifact path still fetches and broadcasts **notification lists
 and SMTP configurations (including SMTP credentials)** to every site, in direct
 contradiction of the now-explicit design decision that these are central-only
 and "no SMTP credential is ever distributed to sites" (Component-DeploymentManager.md
 lines 142-146; CLAUDE.md notification-central-only decision). This supersedes
 the earlier accepted-deployable-artifact framing of the closed
 DeploymentManager-013. DeploymentManager-026 (deployment records are insert-only
 — a new row per deploy accumulates per instance, contradicting "only current
 status stored, no history table", and the same-tick `OrderByDescending(DeployedAt)`
 read has no tiebreaker) and DeploymentManager-027 (artifact tests *assert* the
 forbidden notif/SMTP shipping, cementing the DeploymentManager-025 violation)
 are the remaining two.
 ## Checklist coverage
 #### Re-review 2026-06-20 (commit `4307c381`)
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ✓ | New: deployment records are insert-only — `DeployInstanceAsync` Adds a new row per deploy; reconciliation's `GetCurrentDeploymentStatusAsync` orders by `DeployedAt` with no tiebreaker (DeploymentManager-026). |
 | 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. The deploy-state-query/cert-broadcast actors live in SiteRuntime, out of scope. No issues. |
 | 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting + gate re-verified; `DeployToAllSitesAsync` prebuilds per-site commands before the parallel phase (no shared DbContext under `Task.WhenAll`). No issues. |
 | 4 | Error handling & resilience | ✓ | Failure-status writes use `CancellationToken.None`; lifecycle timeouts now audit; delete-orphan path surfaced. No new issues. |
 | 5 | Security | ✓ | New: SMTP credentials are still serialized into the per-site artifact command and broadcast to every site, which the current design forbids outright (DeploymentManager-025). |
 | 6 | Performance & resource management | ✓ | Global artifact queries hoisted (DM-023 resolved). Deployment-record row growth is unbounded per instance (part of DeploymentManager-026). |
 | 7 | Design-document adherence | ✓ | New: notification lists + SMTP configs are still treated as deployable artifacts, contradicting the "central-only, never distributed to sites" design (DeploymentManager-025). |
 | 8 | Code organization & conventions | ✓ | Options bound via Host; `OptionsSection` constant correct. No new issues. |
 | 9 | Testing coverage | ✓ | Broad and current. New: artifact tests assert the forbidden notif/SMTP shipping (DeploymentManager-027). |
 | 10 | Documentation & comments | ✓ | `ArtifactDeploymentService` class XML doc still lists notification lists + SMTP as broadcast artifacts (stale vs design — folded into DeploymentManager-025). |
 #### Re-review 2026-05-28 (commit `1eb6e97`)
 | # | Category | Examined | Notes |
@@ -1234,3 +1277,164 @@ recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg`
 in each test. Where the simpler counter shape is preferred, pass a
 shared-state object into the actor's constructor so each test owns its own
 instance — never reach for `static` mutable test state.
 ### DeploymentManager-025 — Notification lists and SMTP configurations (with credentials) are still broadcast to every site, contradicting the central-only design
 | | |
 |--|--|
 | Severity | High |
 | Category | Design-document adherence |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:13-22,128,130,150-151,181-188` |
 **Description**
 The design now states explicitly, in two authoritative places, that notification
 lists and SMTP configuration are **not** deployable artifacts:
 - `docs/requirements/Component-DeploymentManager.md:142-146`: *"Notification lists
  and SMTP configuration are **not** deployable artifacts — they are central-only
  definitions managed by the Notification Service ... Notification delivery happens
  on the central cluster, so **no notification artifact or SMTP credential is ever
  distributed to sites**."*
 - CLAUDE.md "External Integrations" decision: *"Notification delivery is
  central-only ... Notification lists and SMTP config are no longer deployed to
  sites; recipient resolution happens at central, at delivery time."*
 `ArtifactDeploymentService` still does the opposite. `FetchGlobalArtifactsAsync`
 queries `_notificationRepo.GetAllNotificationListsAsync` and
 `GetAllSmtpConfigurationsAsync` (lines 150-151), maps them into
 `NotificationListArtifact`s and `SmtpConfigurationArtifact`s — the SMTP artifact
 carrying `smtp.Credentials` verbatim (line 188) — and `BuildDeployArtifactsCommandAsync`
 places both into the `DeployArtifactsCommand` sent to **every site** (lines 128,
 130). The site side persists them: `SiteReplicationActor` (lines 192-201) and
 `DeploymentManagerActor` (lines 1383-1419) loop over `command.NotificationLists`
 and `command.SmtpConfigurations` and write them to site SQLite via
 `SiteNotificationRepository`.
 This is the precise scenario the design says must never happen: SMTP credentials
 travel across the inter-cluster transport and land on every site's SQLite. It
 supersedes the framing of the now-closed DeploymentManager-013, which accepted
 SMTP-as-deployable-artifact as a documented design decision — the design has since
 flipped to forbid distribution entirely, so this is a fresh divergence, not the
 same finding. The class-level XML doc (lines 13-22) is correspondingly stale: it
 still advertises "notification lists ... and SMTP configurations" as artifacts the
 service "broadcasts ... to all sites."
 Secondary defect in the same mapping: `NotificationListArtifact` is built from
 `nl.Recipients.Where(r => r.EmailAddress is not null)` (line 182), which silently
 drops every SMS-only recipient (`PhoneNumber` set, `EmailAddress` null) of a
 `NotificationType.Sms` list. Even if list distribution were intended, the SMS
 recipient set would be lost — but since lists must not be distributed at all, this
 is subsumed by the primary fix.
 **Recommendation**
 Stop fetching, mapping, and shipping notification lists and SMTP configurations
 from the artifact deployment path. Drop the `_notificationRepo` queries from
 `FetchGlobalArtifactsAsync`, pass `null` (or empty) for the `NotificationLists`
 and `SmtpConfigurations` fields of `DeployArtifactsCommand`, and update the class
 XML doc to remove both from the artifact list. The message fields can remain on
 `DeployArtifactsCommand` for additive compatibility but must never be populated
 from central. Coordinate removal of the consuming code in SiteRuntime
 (`SiteReplicationActor`, `DeploymentManagerActor`, `SiteNotificationRepository`)
 in the same session per the project's "design + code + tests travel together"
 rule. Update the contradicting tests (see DeploymentManager-027).
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): central `FetchGlobalArtifactsAsync` no longer queries or ships notification lists / SMTP configs (passes null; `DeployArtifactsCommand` fields kept for contract compatibility), and the site purges any already-persisted notification_lists / smtp_configurations rows (clearing the plaintext SMTP password) on both apply paths — enforcing the central-only delivery design. Verified no site runtime/delivery path reads this config.
 ### DeploymentManager-026 — `DeployInstanceAsync` inserts a new deployment record every deploy; per-instance rows accumulate and the "current status" read has no tiebreaker
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:215-225`, `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:55-61` |
 **Description**
 `DeployInstanceAsync` always creates a brand-new `DeploymentRecord` and calls
 `_repository.AddDeploymentRecordAsync(record, …)` (line 223) — it never reuses or
 updates the instance's existing record. There is no update-in-place or
 delete-prior path on the deploy flow, so every successful or failed deployment of
 an instance leaves its predecessor row behind. Over the life of a central process
 — amplified by the bulk "deploy all out-of-date instances" workflow and by repeated
 redeploys after timeouts — the `DeploymentRecords` table grows without bound, one
 row per deploy attempt per instance.
 This contradicts the design's "Deployment Status Persistence" section
 (`Component-DeploymentManager.md:106-109`): *"Only the **current deployment
 status** per instance is stored in the configuration database ... No deployment
 history table — the audit log (via IAuditService) already captures every
 deployment action."* The audit log is the history; the deployment-record table is
 supposed to hold only the current status. The implementation instead keeps an
 ad-hoc, unindexed history there.
 The accumulation also makes the reconciliation read order-sensitive.
 `TryReconcileWithSiteAsync` reads the "prior" record via
 `GetCurrentDeploymentStatusAsync`, which is
 `OrderByDescending(d => d.DeployedAt).FirstOrDefault()` with **no secondary sort
 key** (e.g. `ThenByDescending(d => d.Id)`). `DeployedAt` is a `DateTimeOffset`
 stamped with `DateTimeOffset.UtcNow` at record creation; two records inserted
 within the same clock tick (rapid redeploy, or a redeploy immediately after a
 timed-out attempt) tie on `DeployedAt`, and SQL Server's choice between equal
 sort keys is undefined. Reconciliation could then read the *wrong* prior record
 (e.g. an older Success instead of the latest stuck InProgress), skipping the
 intended site query, or vice-versa.
 **Recommendation**
 Either (a) make the deploy path upsert the instance's single current record
 (update-in-place when one exists, insert only on first deploy) so the table holds
 exactly one row per instance per the design, or (b) if multiple rows are
 deliberately retained, add a deterministic tiebreaker to
 `GetCurrentDeploymentStatusAsync` (`OrderByDescending(d => d.DeployedAt)
 .ThenByDescending(d => d.Id)`) and document the retention/cleanup story so the
 table does not grow unbounded. Option (a) aligns with the design and is preferred.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added a deterministic `.ThenByDescending(d => d.Id)` tiebreaker to `GetCurrentDeploymentStatusAsync` so same-tick deployment records resolve to the newest row. Insert-per-deploy behaviour unchanged (history-vs-upsert remains a separate design question).
 ### DeploymentManager-027 — Artifact tests assert that notification lists and SMTP configs are shipped, cementing the DeploymentManager-025 design violation
 | | |
 |--|--|
 | Severity | Low |
 | Category | Testing coverage |
 | Status | Resolved |
 | Location | `tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:173-174,200-201` |
 **Description**
 `ArtifactDeploymentServiceTests` asserts, via NSubstitute, that the artifact
 deployment path queries the forbidden artifact sets exactly once per deployment:
 ```
 await _notificationRepo.Received(1).GetAllNotificationListsAsync(Arg.Any<CancellationToken>());
 await _notificationRepo.Received(1).GetAllSmtpConfigurationsAsync(Arg.Any<CancellationToken>());
 ```
 (both in the `DeployToAllSitesAsync` global-query-hoisting test at 173-174 and the
 `RetryForSiteAsync` test at 200-201). These assertions pin the exact behaviour the
 current design forbids (DeploymentManager-025): they will keep the service shipping
 notification lists and SMTP configs to sites and will actively block the fix —
 removing the queries makes these `Received(1)` assertions fail. Tests that lock in
 a design violation are worse than no test, because they make the correct change
 look like a regression.
 **Recommendation**
 When DeploymentManager-025 is fixed, change these to
 `DidNotReceive().GetAllNotificationListsAsync(...)` /
 `DidNotReceive().GetAllSmtpConfigurationsAsync(...)` (and assert the
 `DeployArtifactsCommand`'s `NotificationLists` / `SmtpConfigurations` fields are
 null/empty) so the tests enforce the central-only design instead of contradicting
 it.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): the artifact tests no longer assert the forbidden notification/SMTP shipping — flipped `Received(1)` → `DidNotReceive()` and added assertions that the shipped command's NotificationLists/SmtpConfigurations are null.
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway` |
 | Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |
 ## Summary
@@ -81,6 +81,42 @@ design-doc drift). Theme: every new finding is in a code path that was added or
 touched by the earlier fix bundle but whose error-propagation contract was not
 verified end-to-end against the S&F engine or the design doc.
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 All twenty-three prior findings (001–023) remain `Resolved`; spot-checks against the
 current source confirm the fixes still hold. Since `1eb6e97` the module gained a new
 SQL error-classification layer (`SqlErrorClassifier` + `TransientDatabaseException` /
 `PermanentDatabaseException`, commits `d0527064` / `de375ff7`) and `DatabaseGateway.CachedWriteAsync`
 was reshaped to attempt the write immediately and classify the outcome (transient →
 buffer, permanent → synchronous `Failed`, mirroring `CachedCallAsync`); the `ScadaLink → ZB.MOM.WW.ScadaBridge`
 rename also landed. The full 10-category checklist was re-walked and surfaced **three new
 findings**, none Critical/High. The most serious (`ExternalSystemGateway-024`, Medium) is
 that `SqlErrorClassifier.IsTransient(Exception)` classifies *every* `InvalidOperationException`
 (and `TimeoutException`) as transient, so a DB-layer authoring/driver-misuse bug that surfaces
 as `InvalidOperationException` is silently buffered and retried instead of propagating —
 contradicting the classifier's own stated "authoring bugs must propagate" contract and the
 HTTP-path symmetry it claims. `-025` (Low) is a caller-cancellation that the SQL driver
 raises as a `SqlException` (mid-flight cancel) being reclassified as a *permanent* DB error
 rather than propagating the cancellation (a narrow asymmetry with the `-008` contract; untested).
 `-026` (Low) is a mis-ordered numbered-label comment block in `ExecuteWriteAsync`. Theme:
 the new SQL-classification seam faithfully mirrors the HTTP path's *shape* but is more
 permissive on the non-typed-exception branch than the HTTP path it cites as its model.
 _Re-review (2026-06-20, `4307c381`):_
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ☑ | `BuildUrl` / `JsonElementToParameterValue` precision cascade / verb validation all still correct (006/017/020/022). New: caller-cancel surfacing as `SqlException` is misclassified permanent — finding 025. |
 | 2 | Akka.NET conventions | ☑ | Still no actors; `AddExternalSystemGatewayActors` remains a no-op. Cached-call/write lifecycle + audit emission live in SiteRuntime/AuditLog, correct boundary. No issues. |
 | 3 | Concurrency & thread safety | ☑ | Services stateless and DI-scoped; new `ExecuteWriteAsync`/`RunSqlAsync`/`SqlErrorClassifier` seams introduce no shared mutable state (static `HashSet` is read-only). No findings. |
 | 4 | Error handling & resilience | ☑ | New SQL transient/permanent split is the headline area. `SqlErrorClassifier.IsTransient(Exception)` is over-broad — all `InvalidOperationException`/`TimeoutException` are transient, sweeping authoring bugs into retry — finding 024. Number-based set (024 aside) is sound; unknown→permanent fail-fast is correct. |
 | 5 | Security | ☑ | `ExecuteWriteAsync`/`SqlErrorClassifier` use the connection NAME, never the connection string, in error text (verified — credentials not leaked). `ApplyAuth` secrets still never logged. SQL `CommandText` is script-authored by design (parameterised values bound separately). No new findings. |
 | 6 | Performance & resource management | ☑ | `RunSqlAsync` uses `await using`/`using` for connection + command. `EmptyParameters` reused. `client.Timeout` set per rented client (correct — `IHttpClientFactory` clients are per-call). No new findings. |
 | 7 | Design-document adherence | ☑ | Updated doc (lines 65–66) now describes the immediate-attempt-then-buffer cached-write model the code implements — in sync. PATCH reconciled (023). No new drift. |
 | 8 | Code organization & conventions | ☑ | `Transient`/`PermanentDatabaseException` are component-local exception types (parallel to the HTTP ones); acceptable. `SqlErrorClassifier` is a static policy class in the component. No new findings. |
 | 9 | Testing coverage | ☑ | New SQL-classification paths well covered (immediate + retry, transient/permanent/non-Sql/cancel/unexpected). Gaps: caller-cancel-as-`SqlException` (025) and the `InvalidOperationException`-authoring-bug case (024) are untested — recorded under those findings. |
 | 10 | Documentation & comments | ☑ | XML docs accurate and thorough. The numbered case labels in `ExecuteWriteAsync`'s comment block are mis-ordered vs the catch sequence — finding 026. |
 ## Checklist coverage
 | # | Category | Examined | Notes |
@@ -1312,3 +1348,162 @@ code travel together" rule:
 - If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the
  body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it.
  The design-doc list then remains the single source of truth.
 ### ExternalSystemGateway-024 — `SqlErrorClassifier.IsTransient(Exception)` classifies all `InvalidOperationException`/`TimeoutException` as transient, so DB authoring/driver-misuse bugs are buffered and retried instead of propagating
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Error handling & resilience |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:123-142`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:346-355` |
 **Description**
 `SqlErrorClassifier.IsTransient(Exception)` is a pure exception-type test:
 ```csharp
 return exception is InvalidOperationException
    or IOException
    or SocketException
    or TimeoutException
    or TaskCanceledException
    or DbException;
 ```
 `DatabaseGateway.ExecuteWriteAsync` uses it as the third ordered catch
 (`catch (Exception ex) when (SqlErrorClassifier.IsTransient(ex))`) to decide whether a
 write that did *not* surface as a `SqlException` should be buffered and retried. The
 type's own XML doc justifies the set as "a live DB outage does not always surface as a
 `SqlException`" and explicitly contrasts it with authoring bugs: *"Authoring bugs
 (`ArgumentException`, `NullReferenceException`, etc.) are loud, fixable failures —
 silently buffering and retrying them forever would hide the bug."*
 The problem is that `InvalidOperationException` and `TimeoutException` are exactly the
 two most common shapes of an authoring/driver-misuse bug on the ADO.NET write path, and
 both are now swept into the transient set:
 1. `Microsoft.Data.SqlClient` throws `InvalidOperationException` for a long list of
   *programming* errors that are NOT outages and will fail identically on every retry —
   e.g. a `SqlParameter` already belongs to another `SqlParameterCollection`, the
   command's `Connection` property is not set, an open `SqlDataReader` already exists on
   the connection, or `CommandText` is invalid for the call. The doc treats
   `InvalidOperationException` as "the connection is not open / pooled connection
   broken", but the type carries no such discrimination — any of those authoring
   conditions is classified transient and buffered.
 2. A bare `TimeoutException` (or any of these) thrown by a custom/wrapped provider for a
   non-outage reason is likewise classified transient.
 Consequence: a `CachedWrite` whose payload triggers one of these deterministic
 programming errors is returned to the script as `WasBuffered: true` (a lie — it will
 never succeed), then re-attempted on every S&F sweep, failing identically each time,
 until `RetryCount >= DefaultMaxRetries` and it is finally parked — a noisy retry loop
 hiding a fixable bug, which is the exact outcome the doc says the design avoids. It is
 also asymmetric with the HTTP path it cites as its model: `ErrorClassifier.IsTransient(Exception)`
 (ExternalSystemClient) deliberately does NOT catch `InvalidOperationException`, so the
 same class of bug propagates on the API side.
 **Recommendation**
 Narrow the non-`SqlException` transient set to genuine transport/outage shapes. Drop
 the broad `InvalidOperationException` arm — or, if connection-state errors must stay
 transient, narrow it (e.g. only when the message indicates a broken/closed connection,
 or only the `DbException`/`SocketException`/`IOException`/timeout shapes) so that
 ADO.NET programming `InvalidOperationException`s propagate as the design intends.
 Mirror the HTTP path's narrower `ErrorClassifier.IsTransient(Exception)` set unless
 there is a documented reason the SQL path must be broader. Add a regression test that
 throws a programming-shaped `InvalidOperationException` (e.g. "The SqlParameter is
 already contained by another SqlParameterCollection.") from `RunSqlAsync` and asserts
 it propagates (not buffered), companion to the existing
 `CachedWrite_UnexpectedException_Propagates_NotClassifiedTransient`.
 **Resolution**
 Deferred 2026-06-20: narrowing the over-broad SQL transient set (currently all `InvalidOperationException`/`TimeoutException` are treated transient) to match the HTTP path is a classification-policy decision for the owner; recommended narrowing recorded. No data loss today (bounded noisy retry returning a false WasBuffered), so deferred rather than forced.
 ### ExternalSystemGateway-025 — Caller-token cancellation surfacing from the SQL driver as a `SqlException` is reclassified as a permanent DB error instead of propagating
 | | |
 |--|--|
 | Severity | Low |
 | Category | Error handling & resilience |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:329-345`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:158-174` |
 **Description**
 `ExecuteWriteAsync`'s ordered catches handle caller cancellation only via
 `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`.
 That covers the common case where the driver observes the token before/around the
 network call and raises `OperationCanceledException`/`TaskCanceledException`. However,
 when `Microsoft.Data.SqlClient` is cancelled *mid-statement* by the caller's token it
 can instead raise a **`SqlException`** (historically error number `0` / message
 "Operation cancelled by user", and on some paths `3980` "the request failed to run
 because the batch is aborted"). Such a `SqlException` does not derive from
 `OperationCanceledException`, so the first filter is skipped; it falls into
 `catch (SqlException ex) { throw SqlErrorClassifier.Throw(connectionName, ex); }`, and
 because `0`/`3980` are not in the transient set, `SqlErrorClassifier.Throw` raises a
 `PermanentDatabaseException`.
 The caller-requested cancellation is therefore reported to the script as a *permanent
 database failure* (`ExternalCallResult.Success == false`, "Permanent database error: …")
 rather than propagating as an `OperationCanceledException`. This contradicts the
 `ExternalSystemGateway-008` contract ("the caller asked to abandon the work — do not
 reclassify") that the comment block at `DatabaseGateway.cs:333-337` explicitly cites,
 and it is the SQL-path analogue of the cancellation/timeout conflation that `-008`
 fixed for HTTP. The existing `CachedWrite_CancellationRequested_PropagatesOperationCanceled_NotReclassified`
 test only exercises the `OperationCanceledException` shape, so the `SqlException`
 cancellation path is uncovered.
 **Recommendation**
 Before classifying a `SqlException`, re-check `cancellationToken.IsCancellationRequested`
 and, if set, rethrow as `OperationCanceledException(cancellationToken)` (or call
 `cancellationToken.ThrowIfCancellationRequested()`), so a caller-initiated cancel
 propagates regardless of whether the driver surfaced it as `OperationCanceledException`
 or as a cancellation-shaped `SqlException`. Add a regression test that cancels the
 caller token and throws a cancellation-shaped `SqlException` (or, since `SqlException`
 has no public constructor, drive it through `RunSqlAsync` with a token-cancelled
 `OperationCanceledException` wrapped to assert the guard) and asserts an
 `OperationCanceledException` propagates and nothing is buffered.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): `ExecuteWriteAsync` now calls `cancellationToken.ThrowIfCancellationRequested()` at the top of the `catch (SqlException)` block (before `SqlErrorClassifier.Throw`), so a caller-token cancel that surfaces as a SqlException propagates as `OperationCanceledException` regardless of the driver's exception shape (version-independent). Test added (verified failing pre-fix).
 ### ExternalSystemGateway-026 — `ExecuteWriteAsync` comment-block case labels are mis-ordered relative to the catch sequence
 | | |
 |--|--|
 | Severity | Low |
 | Category | Documentation & comments |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:316-355` |
 **Description**
 The header comment for `ExecuteWriteAsync` (lines 316-328) numbers the classification
 cases 1–4 in catch order: `1.` caller-requested cancellation propagates unchanged; `2.`
 a `SqlException` is classified by number; `3.` a non-`SqlException` transport/connection
 failure is transient; `4.` unexpected exceptions propagate. The inline labels on the
 catch bodies, however, are inverted: the cancellation catch (line 333) is annotated
 `// [2]` and the non-`SqlException` transient catch (line 348) is annotated `// [1]`.
 The labels also do not match the parallel labelling in the test file, where `[1]` is
 used for the non-Sql-outage case and `[2]` for cancellation — so the source comment and
 the tests disagree with the source's own header list.
 This is cosmetic but actively misleading: a reader cross-referencing the numbered
 rationale in the header against the `[n]`-tagged catch bodies will map each case to the
 wrong rationale, and the discrepancy undermines confidence in an otherwise carefully
 documented classification seam.
 **Recommendation**
 Renumber the inline `// [n]` labels to match the header list (cancellation = `[1]`,
 non-`SqlException` transient = `[3]`), or drop the bracketed numbers entirely and keep
 the prose, so the comment is self-consistent. Align the test-file annotations to
 whatever scheme is chosen.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): renumbered the inline `// [n]` case labels in `ExecuteWriteAsync` to match the method header's rationale list (cancellation=[1], SqlException=[2], non-Sql=[3]), and realigned the three corresponding labels in the test file. Comment-only.
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring` |
 | Design doc | `docs/requirements/Component-HealthMonitoring.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 2 |
+| Open findings | 0 |
 ## Summary
@@ -80,6 +80,45 @@ covers (HealthMonitoring-023). All sequence-number and offline-detection
 arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
 monotonic mismatch was observed.
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 All twenty-three prior findings (HealthMonitoring-001..023) remain `Resolved`
 and were spot-verified against the current source: the HealthMonitoring-017/018
 "snapshot-then-restore-on-failure" counter logic is in place
 (`HealthReportSender.cs:158-176`, `CentralHealthReportLoop.cs:118-134`,
 `SiteHealthCollector.AddIntervalCounters`), the HealthMonitoring-020 offline→online
 `max(receivedAt, now)` anchor is correct (`CentralHealthAggregator.cs:138-149`),
 and the HealthMonitoring-021 `$central` sentinel is honoured by the aggregator and
 the Central UI. The full 10-category checklist produced **2 new findings, both Low,
 none crash-class** — both confined to the new M6 `SiteHealthKpiSampleSource`
 (KPI History sampling) added since the last baseline: the synthetic `$central`
 self-report is sampled as a real `KpiScopes.Site` series with meaningless
 zero connection/instance/S&F values (HealthMonitoring-024), and 8 of the 12
 emitted Site Health metrics are persisted every minute per site but never read by
 any chart while the design doc claims all 12 are "rendered in the dashboard's
 per-site KpiTrendChart panel" (HealthMonitoring-025). The KPI-recorder
 scope/captive-dependency reasoning is sound (recorder opens a per-pass scope and
 the scoped source over a singleton aggregator is the normal lifetime relationship),
 the per-source fault isolation is correct, and the `AddIntervalCounters` /
 `SetSiteEventLogWriteFailures` default-interface no-ops are an appropriate
 test-fake-compatibility seam. No concurrency, security, or sequence-arithmetic
 regressions were found.
 _Re-review (2026-06-20, `4307c381`):_
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | x | Re-verified HealthMonitoring-017/018/020 fixes in place and correct. `SiteHealthKpiSampleSource.CollectAsync` arithmetic (connectionsUp/Down split, summed `sfBufferDepth`, null-`SiteAuditBacklog`→0) is correct. No new logic bug. |
 | 2 | Akka.NET conventions | x | Module still contains no actors. `IHealthReportTransport.Send` is fire-and-forget (Tell semantics). KPI source is a plain `IKpiSampleSource` consumed by the KpiHistory recorder singleton (out of scope). No issues found. |
 | 3 | Concurrency & thread safety | x | `SiteHealthCollector` counters use `Interlocked`; `AddIntervalCounters` restore is per-field `Interlocked.Add` and correctly sums with concurrent increments against the zero left by `CollectReport`'s `Exchange`. Aggregator CAS pattern unchanged. `SiteEventLogFailureCountReporter` uses a linked CTS + `Task.Run` loop with catch-all probe isolation — sound. No issues found. |
 | 4 | Error handling & resilience | x | Counter-restore-on-failure (HM-017/018) confirmed. KPI source is exception-isolated by the recorder per-source. `SiteEventLogFailureCountReporter.SafeProbe` logs and continues. No issues found. |
 | 5 | Security | x | No issues found. Numeric/string operational metrics only; no secrets, auth surface, or untrusted-input parsing. |
 | 6 | Performance & resource management | x | `PeriodicTimer`/CTS disposed correctly. New: `SiteHealthKpiSampleSource` writes 12 Site-scoped rows per site per minute, of which 8 are never charted (HealthMonitoring-025) — bounded but wasteful; also emits a full row-set for the synthetic `$central` (HealthMonitoring-024). |
 | 7 | Design-document adherence | x | Design doc line 119 claims all 12 emitted Site Health metrics are "rendered in the dashboard's per-site KpiTrendChart panel" — only 4 (`connectionsDown`/`scriptErrors`/`sfBufferDepth`/`deadLetters`) are actually charted (HealthMonitoring-025). Synthetic `$central` is sampled as a real Site scope (HealthMonitoring-024). |
 | 8 | Code organization & conventions | x | KPI source's metric-name catalog is split: 4 charted metrics share `KpiMetrics.SiteHealth` (Commons), 8 are private literals — noted under HealthMonitoring-025. Options/validator ownership and idempotent registration correct. `AddCentralHealthAggregation` registers the scoped KPI source via `TryAddEnumerable` — correct. |
 | 9 | Testing coverage | x | `SiteHealthKpiSampleSourceTests` covers the populated/heartbeat-only/null-backlog/empty paths and exact (metric,value) tuples well. No test asserts `$central` is excluded (it currently is not — HealthMonitoring-024). Otherwise coverage is strong (73 tests). |
 | 10 | Documentation & comments | x | XML docs accurate. The only doc drift is design-doc line 119's "all 12 rendered" claim (HealthMonitoring-025). |
 ## Checklist coverage
 | # | Category | Examined | Notes |
@@ -1153,3 +1192,110 @@ a non-bug.
 Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
 (or similar) and update the test body's intent — purely a documentation /
 maintainability fix; no behaviour change.
 ### HealthMonitoring-024 — `SiteHealthKpiSampleSource` samples the synthetic `$central` self-report as a real `KpiScopes.Site` series
 | | |
 |--|--|
 | Severity | Low |
 | Category | Design-document adherence |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:69-86`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:115-120` |
 **Description**
 `SiteHealthKpiSampleSource.CollectAsync` iterates
 `ICentralHealthAggregator.GetAllSiteStates()` and emits the 12-metric catalog
 for **every** entry whose `LatestReport` is non-null. The aggregator keyspace
 includes the synthetic central self-report under `CentralHealthReportLoop.CentralSiteId`
 (`"$central"`), which `CentralHealthReportLoop` feeds via `ProcessReport` every
 30s — so `$central` always has a non-null `LatestReport` once the loop has run.
 The source therefore writes 12 `KpiSample` rows per minute with
 `Scope = KpiScopes.Site`, `ScopeKey = "$central"` for the central cluster.
 The class XML doc, the design doc (`Component-HealthMonitoring.md:119`,
 "emits `IKpiSampleSource` (`SiteHealthKpiSampleSource`, per-Site)"), and the
 `KpiScopes.Site` scope all describe this as a **per-real-site** series.
 `$central` is not a site — its `DataConnectionStatuses`, `StoreAndForwardBufferDepths`,
 `ParkedMessageCount`, and instance counts are structurally empty/zero
 (the central node runs no DCL, no S&F engine, no instances), so the persisted
 `connectionsUp`/`connectionsDown`/`sfBufferDepth`/`parkedMessages`/`deployedInstances`
 trend rows for `$central` are meaningless constant-zero noise. The Central UI
 trend selector (`Health.razor:570-572`) deliberately pins `$central` first and
 queries `KpiScopes.Site`/`scopeKey="$central"`, so the dashboard will plot these
 flat-zero "Central Cluster" trends.
 This is consistent (the UI renders central as a card by design) so it is not a
 behaviour bug — but it conflates a synthetic non-site with real-site KPI history,
 permanently stores meaningless zero series, and contradicts the "per-Site"
 contract. None of the other M6 sample sources sample a synthetic scope key.
 **Recommendation**
 Decide whether central-cluster trends belong in the per-Site KPI store. If not,
 skip the `CentralHealthReportLoop.CentralSiteId` entry in `CollectAsync` (a single
 `if (siteId == CentralHealthReportLoop.CentralSiteId) continue;` guard) and remove
 the `$central` option from the trend selector. If central trends are intended,
 give them a distinct scope (e.g. a `KpiScopes.Central` scope or a central-specific
 metric subset) so the data is not labelled as a real site, and update the design
 doc + class XML doc to say so. Either way add a test asserting the chosen
 behaviour for `$central`.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): the design doc now documents that the synthetic `$central` self-report is intentionally sampled into the per-site KPI store (the Central UI deliberately pins `$central` first in the trend selector as 'Central Cluster'), and that its zero connection/instance/S&F values reflect the self-report carrying no site-runtime data — not a defect. Doc-only.
 ### HealthMonitoring-025 — 8 of the 12 emitted Site Health KPI metrics are never charted; design doc claims all 12 are rendered
 | | |
 |--|--|
 | Severity | Low |
 | Category | Design-document adherence |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:40-51`, `docs/requirements/Component-HealthMonitoring.md:119` |
 **Description**
 `SiteHealthKpiSampleSource` emits 12 metrics per site per sampling pass:
 `connectionsUp`, `connectionsDown`, `scriptErrors`, `alarmEvalErrors`,
 `sfBufferDepth`, `deadLetters`, `parkedMessages`, `deployedInstances`,
 `enabledInstances`, `disabledInstances`, `auditBacklogPending`,
 `eventLogWriteFailures`. Only **four** are ever read back: the Central UI
 per-site trend panel calls `LoadTrendSeriesAsync` for exactly
 `KpiMetrics.SiteHealth.ConnectionsDown`, `.ScriptErrors`, `.SfBufferDepth`, and
 `.DeadLetters` (`Health.razor:596-603`) — and only those four are in the public
 `KpiMetrics.SiteHealth` catalog (`Commons/Types/Kpi/KpiMetrics.cs:84-97`). The
 other 8 metric names are private string literals in the source and have no
 reader anywhere in the codebase.
 Two consequences:
 1. **Stale design doc.** `Component-HealthMonitoring.md:119` lists all 12 metric
   names and states they are "rendered in the dashboard's per-site
   `KpiTrendChart` panel." Eight of them are never rendered. A reader / future
   maintainer is misled about what the trend panel shows.
 2. **Persisted-but-dead samples.** The `KpiHistoryRecorderActor` writes all 12
   rows to the central `KpiSample` table every minute for every reporting site;
   8 are pure write-amplification that nothing queries and that the 90-day
   purge eventually drops. Bounded, but wasteful — roughly two-thirds of the
   Site Health KPI rows are never read.
 A related maintainability smell: the split catalog (4 metrics shared via
 `KpiMetrics.SiteHealth`, 8 private literals) means a future chart added for, say,
 `alarmEvalErrors` would have to re-type the literal and risk a silent typo
 mismatch against the source.
 **Recommendation**
 Reconcile intent in one of two directions: (a) if only the four charted metrics
 are wanted, stop emitting the other eight (and correct the design doc's list to
 the four actually rendered); or (b) if the full 12-metric history is intended for
 future charts / ad-hoc query, promote all 12 names into the public
 `KpiMetrics.SiteHealth` catalog so source and any future UI binding key off one
 symbol, and either add the missing charts or soften the doc's "rendered … panel"
 wording to "recorded for trend history." Add the design-doc fix in the same
 session per the repo's doc-and-code-travel-together rule.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): corrected the design doc — of the 12 sampled Site Health KPI metrics only four are charted (connectionsDown, scriptErrors, sfBufferDepth, deadLetters); the other eight are now documented as sampled-but-not-charted (retained for future surfaces / ad-hoc query). Doc-only.
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.Host` |
 | Design doc | `docs/requirements/Component-Host.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |
 ## Summary
@@ -80,6 +80,33 @@ Serilog is the only logger provider and the section is dead config (Host-021);
 and `ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g.
 a typo) and falls back to `Information` with no warning (Host-022).
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 All twenty-two prior findings (Host-001..022) remain `Resolved` in the current tree
 (Host-008 was deliberately *reverted* in M2.9 — see Host-024). This is the first
 review since the `ScadaBridge` rename, so a `git diff 1eb6e97 HEAD` shows the whole
 directory as additions and is not a useful delta; the review walked the full current
 state plus the per-file commit history. Major changes since `1eb6e97`: REQ-HOST-4a was
 re-architected (the leader-only `active-node` check left `/health/ready`; a new
 leadership-agnostic `RequiredSingletonsHealthCheck` probes the five required central
 singletons via `Identify` — clean), shared `ZB.MOM.WW.Health`/`Telemetry` adoption
 (`/healthz`, `/metrics`, OTel), a site HTTP/1.1 `MetricsPort` (8084) listener, the
 KpiHistory + Transport central registrations, the disable-login dev flag, and the
 library-backed inbound API-key auth. Four new findings, none crash/data-loss class.
 Host-023 (Medium) is a regression-of-resolution: the shipped `appsettings.Site.json`
 second seed-node now targets `localhost:8084` — the new HTTP/1.1 MetricsPort, not an
 Akka remoting endpoint — the same defect class as the resolved Host-004 (which chose
 8084 as a "remoting port" before MetricsPort claimed it); `StartupValidator` rejects a
 seed on the GrpcPort but not on the MetricsPort. Host-024 (Medium) flags the M2.9
 reversal of Host-008: `MachineDataDb` is again *required* for Central yet is consumed
 by no DbContext/repository AND is absent from the shipped `appsettings.Central.json`,
 so a developer running the shipped Central binary fails startup for a value nothing
 uses. Host-025 (Low): the Host directly consumes KpiHistory (`AddKpiHistory`,
 `KpiHistoryRecorderActor`, `KpiHistoryOptions`) but has no direct `ProjectReference` —
 it relies on a transitive pull through CentralUI. Host-026 (Low): the Component
 Registration Matrix in `Component-Host.md` is stale — KpiHistory and Transport are
 registered central-only in `Program.cs` but absent from the matrix.
 ## Checklist coverage
 | # | Category | Examined | Notes |
@@ -110,6 +137,21 @@ _Re-review (2026-05-28, `1eb6e97`):_
 | 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the unused `CancellationToken` parameter (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). |
 | 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaBridge:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). |
 _Re-review (2026-06-20, `4307c381`):_
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ☑ | Re-review: shipped `appsettings.Site.json` second seed-node targets MetricsPort 8084 (HTTP/1.1 Kestrel listener), not an Akka remoting endpoint — regression of the resolved Host-004 fix (Host-023). |
 | 2 | Akka.NET conventions | ☑ | All five central singletons + the four site singletons reviewed: ClusterSingletonManager/Proxy pairing, role scoping (site singletons `WithRole(site-*)`, central on base "Central"), `PhaseClusterLeave` graceful-stop drain tasks, receptionist registrations, `RequiredSingletonsHealthCheck` `Identify` probe — all sound. No new issues. |
 | 3 | Concurrency & thread safety | ☑ | `GetOrCreateActorSystem` double-checked lock (HOST-021) is correct; `_trackedDisposables` snapshot-under-lock; `RequiredSingletonsHealthCheck` probes concurrently and never throws. No new issues. |
 | 4 | Error handling & resilience | ☑ | `StartupRetry` now threads `ApplicationStopping` + transient-only classifier (Host-019/015 resolved); migration retry sound. No new issues. |
 | 5 | Security | ☑ | Shipped configs keep `${...}` secret placeholders; pepper fail-fast validated (≥16 chars). Note: docker per-node `appsettings.Central.json` still carry plaintext SQL passwords / JWT key — out of this module's edit scope (Host-003 follow-up), not re-filed. |
 | 6 | Performance & resource management | ☑ | ActorSystem singleton-not-transient avoids per-probe disposal; gRPC stream cancel-on-shutdown wired. No new issues. |
 | 7 | Design-document adherence | ☑ | Re-review: Component Registration Matrix omits KpiHistory + Transport (both central-only in Program.cs) (Host-026); `MachineDataDb` required-but-unconsumed and absent from shipped Central config (Host-024). |
 | 8 | Code organization & conventions | ☑ | Re-review: Host directly consumes KpiHistory types but has no direct `ProjectReference` (transitive via CentralUI) (Host-025). |
 | 9 | Testing coverage | ☑ | Strong suite (RequiredSingletons, Hocon, Serilog, StartupRetry covered). Gaps: no test asserts a seed on the MetricsPort is rejected (Host-023 — the existing 8084-seed test only avoids the grpc rule), and no shipped-config test catches the missing `MachineDataDb` key (Host-024). |
 | 10 | Documentation & comments | ☑ | Re-review: `Component-Host.md` matrix stale vs `Program.cs` (Host-026). Inline XML/comments otherwise accurate and unusually thorough. |
 ## Findings
 ### Host-001 — `/health/ready` includes the leader-only `active-node` check
@@ -1118,3 +1160,163 @@ behaviour you choose.
 **Resolution**
 _Open._
 ### Host-023 — Shipped Site `SeedNodes` second entry targets the MetricsPort, not a remoting port
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:13-17`, `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:117-127` |
 **Description**
 The shipped site config sets `Node:RemotingPort = 8082`, `Node:GrpcPort = 8083`,
 `Node:MetricsPort = 8084`, but `Cluster:SeedNodes` is
 `["akka.tcp://scadabridge@localhost:8082", "akka.tcp://scadabridge@localhost:8084"]`.
 The second seed targets `8084` — which `Program.cs:419-422` binds as the **HTTP/1.1
 Kestrel `/metrics` listener** (`HttpProtocols.Http1AndHttp2`), not an Akka.Remote
 endpoint. A node joining via that seed attempts an Akka.Remote TCP association against
 the Prometheus scrape listener and fails. This is the exact defect class as the
 **resolved Host-004** (where the second seed targeted the gRPC port 8083): Host-004's
 fix corrected the seed to `8084` and called it a "remoting port", but the later
 introduction of `MetricsPort` (commits `bbc9f092`/`c41cb41c`) claimed `8084`, silently
 re-breaking the example. `StartupValidator` (`StartupValidator.cs:121-127`) was
 extended by Host-004 to reject a seed whose port equals `GrpcPort`, but it does **not**
 reject a seed whose port equals `MetricsPort`, so this misconfiguration passes
 validation silently. For the single-node dev loopback the first seed (`8082`) succeeds
 and the bug is masked, but it is an incorrect example that copies into multi-node
 configs (the docker site configs correctly use distinct hostnames both on `8082`).
 **Recommendation**
 The loopback dev site cannot host two distinct remoting endpoints, so drop the
 second seed (mirror Host-016's single-entry `CentralContactPoints` template) or use a
 distinct hostname placeholder both on `8082`, with a comment that multi-node sites
 list each node's *remoting* port. Extend the Site block of `StartupValidator` to also
 reject any seed whose port equals the resolved `MetricsPort` (and ideally any port
 the node binds for non-remoting use). Add a `StartupValidatorTests` case
 `Site_SeedNodeOnMetricsPort_FailsValidation` mirroring
 `Site_SeedNodeOnGrpcPort_FailsValidation`.
 **Resolution**
 Deferred 2026-06-20: the shipped `appsettings.Site.json` second seed-node points at the MetricsPort and the validator lacks a `seedPort != metricsPort` rule; fixing touches shipped dev config and a single-node-loopback assumption. Recorded for a follow-up (no production node ships this dev config).
 ### Host-024 — `MachineDataDb` re-required for Central but consumed by nothing and absent from the shipped config
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:63-65`, `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json:22-24` |
 **Description**
 Commit `76198b36` ("reverts Host-008") re-added a fail-fast `.Require` for
 `ScadaBridge:Database:MachineDataDb` to the Central `.When` block, citing
 REQ-HOST-3/REQ-HOST-4 and that the docker per-node configs carry the key. Two problems
 remain after the reversal:
 1. **Still consumed by nothing.** A repo-wide search shows `MachineDataDb` is read only
   in `StartupValidator.cs` and declared on `DatabaseOptions.cs:11` — no DbContext, no
   repository, no component reads it. Only `ConfigurationDb` is wired into
   `AddConfigurationDatabase` (`Program.cs:200-202`). The original Host-008 rationale
   ("dead configuration that fails startup for a value nothing uses") is therefore
   still entirely true; the reversal re-instated the fail-fast without adding a
   consumer.
 2. **Absent from the shipped Central config.** `appsettings.Central.json` (the template
   a developer running the binary directly uses) contains `Database:ConfigurationDb`
   (a `${...}` placeholder) but **no `MachineDataDb` key at all**. So a Central node
   started from the shipped config fails `StartupValidator` immediately with
   "ScadaBridge:Database:MachineDataDb connection string required for Central" — for a
   value nothing consumes. The docker configs happen to set it; the shipped default
   does not. (This is *not* a re-file of Host-008, which was Resolved-then-reverted;
   the new, distinct defect is the validation breaking the shipped dev config.)
 This is also a code-vs-doc inconsistency: the design intent (validate a value the
 system needs) is unmet because the value is needed by nothing.
 **Recommendation**
 Pick one and make code, shipped config, and doc agree: either (a) wire a machine-data
 store so the value is actually consumed and add a `MachineDataDb` placeholder to the
 shipped `appsettings.Central.json` (matching the `_secrets` env-var pattern), or (b)
 remove the `.Require`, the `DatabaseOptions.MachineDataDb` property, and the docker
 keys again. Add a config-shape regression test asserting the shipped
 `appsettings.Central.json` satisfies `StartupValidator` for the Central role (the
 absence of such a test is why this slipped through).
 **Resolution**
 Deferred 2026-06-20: `MachineDataDb` is re-required for Central but consumed by nothing and absent from shipped Central config (the shipped Central binary would fail StartupValidator). This is a reversal of the reverted Host-008 and needs a product decision — whether a machine-data store is actually coming (add a config placeholder) or the requirement should be removed again. Recorded for that decision.
 ### Host-025 — Host directly consumes KpiHistory but has no direct project reference
 | | |
 |--|--|
 | Severity | Low |
 | Category | Code organization & conventions |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:42-63`, `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:118`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:717-757` |
 **Description**
 `Program.cs` calls `builder.Services.AddKpiHistory(builder.Configuration)` and
 `AkkaHostedService.RegisterCentralActors` constructs the `KpiHistoryRecorderActor`
 cluster singleton from `KpiHistoryOptions` / `KpiHistoryRecorderActor`
 (`ZB.MOM.WW.ScadaBridge.KpiHistory`), yet the Host `.csproj` carries **no
 `ProjectReference` to `ZB.MOM.WW.ScadaBridge.KpiHistory`**. It compiles only because
 CentralUI (`ZB.MOM.WW.ScadaBridge.CentralUI.csproj`) transitively re-exports it. The
 Host is the composition root for every component and the project deliberately declares
 direct dependencies elsewhere — the csproj even comments that `ZB.MOM.WW.Theme` is
 declared directly "rather than leaning on CentralUI re-exporting it transitively"
 (lines 37-39). KpiHistory is the same situation but was not given the same treatment.
 If CentralUI ever drops or restructures its KpiHistory reference, the Host fails to
 compile for a dependency it directly uses — a latent build fragility.
 **Recommendation**
 Add `<ProjectReference Include="../ZB.MOM.WW.ScadaBridge.KpiHistory/ZB.MOM.WW.ScadaBridge.KpiHistory.csproj" />`
 to the Host csproj's `ProjectReference` group, mirroring the explicit Theme reference
 rationale already documented there.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added an explicit `<ProjectReference>` to the KpiHistory project in the Host csproj, so Host's direct use of KpiHistory types no longer relies on the transitive pull through CentralUI.
 ### Host-026 — Component Registration Matrix omits KpiHistory and Transport
 | | |
 |--|--|
 | Severity | Low |
 | Category | Documentation & comments |
 | Status | Resolved |
 | Location | `docs/requirements/Component-Host.md` (Component Registration Matrix, lines 177-198), `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:92,118` |
 **Description**
 `Program.cs` registers two central-only components that have shipped since the matrix
 was last updated: **Transport** (`builder.Services.AddTransport()`, line 92, #24) and
 **KpiHistory** (`builder.Services.AddKpiHistory(...)`, line 118, plus the
 `kpi-history-recorder` cluster singleton in `AkkaHostedService`, #26). Neither appears
 in the Component Registration Matrix in `Component-Host.md` (lines 177-198), which is
 the documented source of which components register on which role. The matrix is
 described in CLAUDE.md as something that "must stay in sync with actual component
 documents"; it is now stale against the actual composition root. A maintainer auditing
 role-based registration against the doc would miss both components (and miss that
 KpiHistory's recorder is deliberately *not* readiness-gated — worth a matrix note).
 **Recommendation**
 Add `Transport` (Central: Yes / Site: No / DI: Yes / Actors: No / Endpoints: No) and
 `KpiHistory` (Central: Yes / Site: No / DI: Yes / Actors: Yes / Endpoints: No) rows to
 the matrix, and note KpiHistory's recorder singleton is intentionally absent from
 `RequiredSingletonsHealthCheck` (not readiness-gated). Re-check the Dependencies list
 ("All 19 component libraries") against the actual csproj reference count while editing.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added `KpiHistory` and `Transport` rows to the Component Registration Matrix in Component-Host.md (both Central=Yes/Site=No/DI=Yes), reflecting their actual central-only registration.
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.InboundAPI` |
 | Design doc | `docs/requirements/Component-InboundAPI.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 0 |
+| Open findings | 4 |
 ## Summary
@@ -124,6 +124,36 @@ configuration database, but the invariant is undocumented.)
 | 9 | Testing coverage | ☑ | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test (InboundAPI-023); middleware/filter/validator/executor/route are individually covered. |
 | 10 | Documentation & comments | ☑ | No new issues. |
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 All 25 prior findings remain `Resolved` (InboundAPI-022's `IActiveNodeGate` Host
 registration is now present at `Program.cs:260`; InboundAPI-025's POST-only audit
 `UseWhen` predicate is live at `Program.cs:337-344`). The module changed materially:
 the whole project was renamed (ScadaLink → ZB.MOM.WW.ScadaBridge), inbound auth was
 re-architected to the shared `ZB.MOM.WW.Auth.ApiKeys` Bearer/`sbk_` verifier (the SQL
 `ApiKey` store, `ApiKeyValidator`, and peppered `IApiKeyHasher` were retired),
 `ForbiddenApiChecker` became a thin shim over Script Analysis #25's `ScriptTrustValidator`,
 and JSON-Schema `$ref` runtime resolution (`SchemaRefResolver`) plus recursive
 Object/List validation+materialization landed. The dominant new finding is that a
 **read-only `InboundDatabaseHelper`** was added to `InboundScriptContext.Database` —
 reintroducing the very raw-DB-access capability the design doc still explicitly
 forbids (regressing the InboundAPI-007 closure), with no read-only enforcement and a
 sync-over-async, token-ignoring data path. **4 new findings**: 1 High, 1 Medium,
 2 Low — no Critical.
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ☑ | Recursive Materialize/Validate look sound; `WaitForAttribute` deadline-vs-wait-timeout tension (InboundAPI-029). |
 | 2 | Akka.NET conventions | ☑ | ASP.NET-hosted; routes via `IInstanceRouter` → `CommunicationService`. Correlation/parent ids stamped on all four `Route.To()` verbs. No issues. |
 | 3 | Concurrency & thread safety | ☑ | `ConcurrentDictionary` caches; per-request DI scope created from root in singleton executor (valid). New DB helper blocks a pool thread (InboundAPI-027). |
 | 4 | Error handling & resilience | ☑ | DB helper's `ExecuteScalar`/`ExecuteReader` are synchronous and ignore `_ct` — a slow query is unbounded by the method timeout (InboundAPI-027). |
 | 5 | Security | ☑ | `InboundDatabaseHelper` runs arbitrary script-supplied SQL (not read-only despite the name) on central with host privileges (InboundAPI-026). Bearer verifier delegates constant-time compare to shared lib. Forbidden-API check delegates to #25. |
 | 6 | Performance & resource management | ☑ | Capped capture/known-bad caches; `ArrayPool` rent. Sync-over-async DB calls (InboundAPI-027). |
 | 7 | Design-document adherence | ☑ | `Database` helper contradicts the doc's "No direct database access" decision (InboundAPI-026, regresses InboundAPI-007); `WaitForAttribute` timeout binding diverges from spec §6 line 192 (InboundAPI-029). |
 | 8 | Code organization & conventions | ☑ | `InboundDatabaseHelper` is a component-local helper (fine); schema types in Commons. No issues. |
 | 9 | Testing coverage | ☑ | `InboundDatabaseHelper` tests cover SELECT happy-path only — no write/DDL-rejection test (because none is enforced); `WaitForAttribute`/`Database` not end-to-end via the endpoint (InboundAPI-028). |
 | 10 | Documentation & comments | ☑ | Helper XML doc claims "read-only" which is unenforced (folded into InboundAPI-026). |
 ## Findings
 ### InboundAPI-001 — Singleton script handler cache mutated without synchronization
@@ -1299,3 +1329,199 @@ route, not on the `/api/` prefix. Options:
 The endpoint-filter form is the recommended fix — it co-locates the audit-emission
 scope with the route definition and matches how InboundAPI-006/008 gating is
 already wired.
 ### InboundAPI-026 — `InboundDatabaseHelper` gives inbound scripts arbitrary raw SQL (not read-only); contradicts the design doc's "No direct database access" decision and regresses InboundAPI-007
 | | |
 |--|--|
 | Severity | High |
 | Category | Security |
 | Status | Open |
 | Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundDatabaseHelper.cs:13-70`, `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundScriptExecutor.cs:267-281`, `:387`; design doc `docs/requirements/Component-InboundAPI.md:202-215` |
 **Description**
 A new `InboundDatabaseHelper` is exposed to every inbound API script as
 `InboundScriptContext.Database` (`InboundScriptExecutor.cs:387`), backed by the
 production `IDatabaseGateway` (`AddExternalSystemGateway()` in the central-role branch
 of `Program.cs:81`). Its `QuerySingle<T>`/`Query` methods take a script-supplied `sql`
 string and run it verbatim — `cmd.CommandText = sql; cmd.ExecuteScalar()/ExecuteReader()`
 (`InboundDatabaseHelper.cs:27-29`, `:41-44`) — against a raw ADO.NET `DbConnection`
 obtained from `IDatabaseGateway.GetConnectionAsync` (`IDatabaseGateway.cs:20`, the same
 gateway the External System Gateway uses for SCADA machine-data DBs). Two problems:
 1. **It directly contradicts the design doc.** `Component-InboundAPI.md:202-215` — which
   is the *resolution* of finding InboundAPI-007 — states verbatim: *"**No direct
   database access.** Inbound API scripts are not given a raw database client. Handing a
   script a raw `SqlConnection` is in direct tension with the ScadaBridge script trust
   model … Scripts interact with the system only through the curated `Route` and
   `Parameters` surfaces … If a method needs data from the configuration or machine-data
   databases, that access belongs behind a dedicated, scoped helper — not a
   general-purpose connection — and would be added here as an explicit design change."*
   The code now ships exactly the capability the doc forbids, and the doc was **not**
   updated. This is a regression of the closed InboundAPI-007 (whose chosen resolution
   was to remove the DB API and document its absence). Inbound API scripts are authored
   by the less-trusted-than-Admin Design role and execute on the central node with the
   host process's privileges, so this materially widens the trust boundary the design
   deliberately drew.
 2. **"Read-only" is unenforced — it is a comment, not a constraint.** The class summary
   says *"Read-only database access"* but nothing restricts the SQL: a script may run
   `Database.Query("conn", "DELETE FROM Machine")`, `UPDATE`, `INSERT`, `DROP`, or any
   DDL/DML, and `Query`/`QuerySingle` will execute it (`ExecuteReader`/`ExecuteScalar`
   happily run non-SELECT statements). The gateway opens a normal read-write connection
   (no `ApplicationIntent=ReadOnly`, no statement allow-listing). The "read-only" claim
   in the XML doc and the call-site comment (`InboundScriptExecutor.cs:240`,
   "read-only Database helper") is therefore false and misleading.
 Parameter *values* are correctly bound (`AddParameters`, `:59-68`) so value-injection is
 not the issue — the issue is that arbitrary statement text is script-controlled and the
 "read-only" containment does not exist.
 **Recommendation**
 Reconcile code and design, and make the safety claim real. Either (a) if scoped DB
 access is now intended, update `Component-InboundAPI.md:202-215` to authorize it as the
 explicit design change the prior text demanded, AND enforce read-only — restrict to
 `SELECT` (reject any statement whose first significant token is not `SELECT`/`WITH`, or
 better, route through a read-intent connection / a curated query API that cannot mutate),
 fix the XML summary to describe what is actually enforced, and gate which connection
 names a script may reach; or (b) if direct DB access is still out of scope, remove
 `InboundDatabaseHelper` and the `Database` member (reverting to the InboundAPI-007
 posture). Add a regression test asserting a non-SELECT statement is rejected once
 read-only is enforced (see InboundAPI-028).
 **Resolution**
 _Unresolved._
 ### InboundAPI-027 — `InboundDatabaseHelper` is sync-over-async and ignores the method-deadline token — thread-pool starvation and an unbounded slow query
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Performance & resource management |
 | Status | Open |
 | Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundDatabaseHelper.cs:25-33`, `:40-44` |
 **Description**
 `QuerySingle`/`Query` execute on the ASP.NET request thread (the script runs inside
 `CSharpScript.RunAsync` on a thread-pool thread). Both methods do
 `_gateway.GetConnectionAsync(connectionName, _ct).GetAwaiter().GetResult()`
 (`:25`, `:40`) — a synchronous block on an async DB-open — and then call the **synchronous**
 `cmd.ExecuteScalar()` (`:29`) / `cmd.ExecuteReader()` (`:44`). Two consequences:
 1. **Thread-pool starvation.** The inbound API has no rate limiting (a deliberate design
   choice). Each DB-backed method invocation blocks a pool thread for the full duration
   of the DB round-trip via `.GetAwaiter().GetResult()`. Under concurrent load this
   consumes pool threads that cannot be released until the DB responds — exactly the
   pattern ASP.NET guidance warns against (it won't deadlock without a sync context, but
   it does starve the pool and degrade the whole central node).
 2. **The slow query is not bounded by the method timeout.** The cancellation token
   `_ct` (the method-deadline `cts.Token` passed at `InboundScriptExecutor.cs:268`) is
   forwarded to `GetConnectionAsync` but **not** to the synchronous `ExecuteScalar()` /
   `ExecuteReader()` — those overloads ignore cancellation entirely. So once a connection
   is open, a hung or slow `SELECT` runs to completion (or the provider's own command
   timeout) regardless of the method timeout, holding the blocked thread the whole time.
   This contradicts the InboundAPI-016 guarantee that the method timeout bounds the
   method's work.
 **Recommendation**
 Use the async ADO.NET path end-to-end and honour the token: `await
 _gateway.GetConnectionAsync(...)`, then `await cmd.ExecuteReaderAsync(_ct)` /
 `await cmd.ExecuteScalarAsync(_ct)`, and make `QuerySingle`/`Query` async (`Task<T?>` /
 `Task<IReadOnlyList<...>>`) so the script `await`s them rather than blocking a pool
 thread. Set a `CommandTimeout` derived from the remaining method deadline as a backstop.
 If a synchronous script-facing signature must be kept, at minimum register the command
 with the token and a command timeout so a slow query cannot run unbounded.
 **Resolution**
 _Unresolved._
 ### InboundAPI-028 — `InboundDatabaseHelper` has no negative-path tests; `Database`/`WaitForAttribute` are not exercised end-to-end through the endpoint
 | | |
 |--|--|
 | Severity | Low |
 | Category | Testing coverage |
 | Status | Open |
 | Location | `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/InboundDatabaseHelperTests.cs:34-67`, `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/EndpointExtensionsTests.cs` |
 **Description**
 `InboundDatabaseHelperTests` covers only the SELECT happy path (single value, no-rows
 default, multi-column row, empty result). There is no test that:
 - a non-SELECT statement (`DELETE`/`UPDATE`/`DROP`) is rejected — because nothing rejects
  it today (InboundAPI-026); a test would currently *prove* the helper is not read-only;
 - the `InvalidOperationException("Database is not available …")` path fires when the
  gateway is null (`InboundDatabaseHelper.cs:24`, `:39`);
 - the method-deadline token cancels a slow query (InboundAPI-027).
 Separately, `InboundScriptContext.Database` and `RouteTarget.WaitForAttribute` — both
 new script-facing capabilities — are unit-tested at the helper/route level but never
 driven end-to-end through `HandleInboundApiRequest` (the `EndpointExtensionsTests`
 TestServer suite does not invoke a script that touches `Database` or `WaitForAttribute`),
 so a wiring regression that fails to construct the `Database` helper or thread the
 deadline into a wait would be silent.
 **Recommendation**
 Add negative-path tests to `InboundDatabaseHelperTests` (null-gateway throw; once
 InboundAPI-026 enforces read-only, a write/DDL-rejection test; an InboundAPI-027
 cancellation test). Add an `EndpointExtensionsTests` case driving a method script that
 calls `Database.Query(...)` and one that calls `Route.To(...).WaitForAttribute(...)` so
 the executor→context→helper wiring is covered through the real endpoint.
 **Resolution**
 _Unresolved._
 ### InboundAPI-029 — Routed `WaitForAttribute` is cancelled by the method-level deadline, contradicting spec §6 (wait bounded by the wait timeout, not the method timeout)
 | | |
 |--|--|
 | Severity | Low |
 | Category | Design-document adherence |
 | Status | Open |
 | Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/RouteHelper.cs:220-247`, `:302-303`; design doc `docs/requirements/Component-InboundAPI.md:192` |
 **Description**
 `Component-InboundAPI.md:192` says of `WaitForAttribute`: *"The cluster call is bounded
 by the **wait timeout** rather than the generic integration timeout."* But
 `RouteTarget.WaitForAttribute` routes with `Effective(cancellationToken)` (`:226`),
 which — when the script passes no explicit token, the normal case — resolves to
 `_deadlineToken` (the method-level timeout CTS bound by `WithDeadline` at
 `InboundScriptExecutor.cs:279`). So a method whose timeout (default 30 s) is shorter than
 the requested wait timeout (e.g. `WaitForAttribute("Flag", true, TimeSpan.FromMinutes(5))`)
 has its wait cancelled at 30 s by the method deadline, never reaching the 5-minute wait
 timeout the spec says governs it. The routed wait is thus bounded by the *generic method
 timeout* — exactly what spec line 192 says it should not be.
 This is in genuine tension with the InboundAPI-016 decision ("the method timeout covers
 routed calls, including waits"). The two design statements conflict for the
 `WaitForAttribute` case: §6 line 192 wants the wait timeout to win; InboundAPI-016 wants
 the method deadline to be the hard ceiling. The current code silently resolves it in
 favour of the method deadline, which makes the documented `WaitForAttribute` timeout
 semantics (and its `false`-on-timeout return contract) unreliable whenever the method
 timeout is the smaller of the two.
 **Recommendation**
 Reconcile the two design statements explicitly. Options: (a) treat the method deadline as
 the hard ceiling and update `Component-InboundAPI.md:192` to say the wait is bounded by
 `min(waitTimeout, remaining method deadline)` and that an undersized method timeout will
 cut a wait short; or (b) honour spec §6 by deriving the wait's effective token from the
 wait timeout (a per-wait CTS) rather than the method deadline, accepting that a long wait
 can outlive the generic method timeout. Either way, add a `RouteHelperTests` case pinning
 the chosen semantics when method-timeout < wait-timeout (the current suite only tests
 deadline-token *inheritance*, not the timeout *ordering* conflict).
 **Resolution**
 _Unresolved._
@@ -0,0 +1,307 @@
 # Code Review — KpiHistory
 | Field | Value |
 |-------|-------|
 | Module | `src/ZB.MOM.WW.ScadaBridge.KpiHistory` |
 | Design doc | `docs/requirements/Component-KpiHistory.md` |
 | Status | Reviewed |
 | Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
 | Commit reviewed | `4307c381` |
 | Open findings | 0 |
 ## Summary
 KpiHistory is a small, well-built observability module: a single ~305-line recorder
 singleton (`KpiHistoryRecorderActor`), a strongly-typed options class with a fail-fast
 validator, and a thin DI composition root. The owned code is clean — the actor is
 textbook best-effort: every sample pass and purge sweep runs off the actor thread via
 `PipeTo`, per-source faults are isolated so one throwing `IKpiSampleSource` never aborts
 a tick or the other sources, the repository write/purge is guarded, no exception escapes
 either tick handler, and a lifecycle `CancellationTokenSource` is cancelled in `PostStop`
 so an in-flight pass observes shutdown promptly. The singleton is wired correctly in the
 Host (ClusterSingletonManager + proxy + `PhaseClusterLeave` drain) and is deliberately
 absent from the readiness barrier, exactly as the design requires. The options validator,
 the EAV table mapping, both named indexes, and all four `IKpiSampleSource` implementations
 exist and are registered on the central host as designed.
 The dominant theme is **unbounded work under load**, in two places. First, the recorder
 has **no in-flight guard** on its sample timer — directly contradicting its own XML-doc
 claim to "mirror the NotificationOutboxActor timer + scope-per-tick + PipeTo pattern,"
 because the NotificationOutbox dispatcher *does* hold an in-flight guard and the recorder
 does not. When a sample pass runs longer than `SampleInterval` (slow/recovering DB), Akka
 periodic timers enqueue, not coalesce, so overlapping `RunSamplePass` tasks pile up,
 multiplying DB load at exactly the moment the store is struggling and double-writing
 samples for overlapping windows. Second, `GetRawSeriesAsync` has **no server-side row
 cap**: the design's `DefaultMaxSeriesPoints` ceiling is applied by `KpiSeriesBucketer`
 only *after* the full raw window is materialised into memory — a 7-day window at the
 default 60 s cadence is ~10 080 rows per series pulled to the Central UI before
 downsampling, defeating the stated intent of the cap. A secondary theme is **bucketer
 contract drift** (the "largest-timestamp-wins for unsorted input" doc claim is not what
 the code does, and the short-series early-return emits raw capture timestamps where the
 downsample path emits bucket-boundary timestamps) — both live in Commons but are core to
 this module's query reducer. No Critical findings; one High, three Medium, two Low.
 ## Checklist coverage
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | Yes | `capturedAt`/cut-off captured on the actor thread (correct). `KpiSeriesBucketer` short-series early-return emits raw capture timestamps vs bucket-boundary timestamps on the downsample path (KpiHistory-005). Bucketer "largest-timestamp-wins for unsorted input" doc claim is false — it is last-in-iteration (KpiHistory-006). |
 | 2 | Akka.NET conventions | Yes | `PipeTo` + scope-per-tick + off-thread I/O + `IWithTimers` all correct; sender not captured across awaits; messages immutable singletons. But no in-flight guard despite the XML claiming to mirror NotificationOutbox (KpiHistory-001). |
 | 3 | Concurrency & thread safety | Yes | Actor state is effectively stateless; `_shutdownCts` lifecycle is correct. The missing in-flight guard allows overlapping sample passes under DB latency (KpiHistory-001). |
 | 4 | Error handling & resilience | Yes | Per-source isolation, write/purge guards, and `OperationCanceledException` shutdown handling are all correct and tested. No issues beyond KpiHistory-001's load amplification. |
 | 5 | Security | Yes | No injection surface — all queries are parameterised LINQ; `Source`/`Metric`/`Scope`/`ScopeKey` are equality predicates, never interpolated SQL. Scope isolation via `ScopeKey == scopeKey` (incl. `IS NULL` for Global) is correct. No issues found. |
 | 6 | Performance & resource management | Yes | `GetRawSeriesAsync` returns the entire window with no server-side cap before bucketing (KpiHistory-002). `RecordSamplesAsync` short-circuits empty batches (good). Purge is set-based `ExecuteDeleteAsync` but is a single unbatched statement (KpiHistory-004). |
 | 7 | Design-document adherence | Yes | The recorder XML claims to mirror NotificationOutbox's pattern but omits its in-flight guard (KpiHistory-001). `DefaultMaxSeriesPoints`-as-a-true-cap intent is undermined by KpiHistory-002. Otherwise faithful: singleton, not-readiness-gated, daily purge, EAV schema, indexes. |
 | 8 | Code organization & conventions | Yes | Options class owned by the component (correct); validator co-located; `IKpiHistoryRepository`/`IKpiSampleSource`/`KpiSample`/bucketer in Commons, impl in ConfigurationDatabase (correct); singleton Props built in Host. No issues found. |
 | 9 | Testing coverage | Yes | Recorder isolation, faulted-tick recovery, and purge cut-off are tested; validator bounds fully covered; bucketer has strong sorted-input coverage. Gaps: no overlapping-tick test, no unsorted-input bucketer test, no short-series timestamp-semantics assertion (KpiHistory-003). |
 | 10 | Documentation & comments | Yes | XML is generally excellent. Two drift points: the "mirror NotificationOutbox" claim (KpiHistory-001) and the bucketer unsorted-input claim (KpiHistory-006). `SampleComplete`/`PurgeComplete` no-op handlers are documented. |
 ## Findings
 ### KpiHistory-001 — Recorder has no in-flight guard; overlapping sample passes pile up under DB latency
 | | |
 |--|--|
 | Severity | High |
 | Category | Akka.NET conventions |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.KpiHistory/KpiHistoryRecorderActor.cs:89-92`, `:103-107`, `:143-159` |
 **Description**
 The sample timer is a plain Akka periodic timer
 (`Timers.StartPeriodicTimer(SampleTimerKey, SampleTick.Instance, …, interval: _options.SampleInterval)`).
 `HandleSampleTick` launches `RunSamplePass(...)` off-thread via `PipeTo` and returns
 immediately; the piped-back `SampleComplete` is a deliberate no-op
 (`Receive<SampleComplete>(_ => { })`). There is **no in-flight guard** — nothing prevents
 a second `SampleTick` from starting a second `RunSamplePass` while the first is still
 awaiting its DB round-trip.
 Akka periodic timers do not coalesce missed ticks — they enqueue. So when a sample pass
 takes longer than `SampleInterval` (a slow, contended, or recovering central MS SQL —
 exactly the regime where observability matters most), each subsequent tick spawns *another*
 concurrent pass. Each pass opens its own DI scope + `DbContext` and issues its own
 `AddRange` + `SaveChangesAsync`. The result is a self-amplifying load spiral against the
 struggling store, plus duplicate sample rows whose `CapturedAtUtc` values straddle the same
 real-time window (the design's "one shared tick timestamp" invariant assumes one pass per
 tick).
 The actor's own XML-doc states it "mirrors the `NotificationOutboxActor` timer +
 scope-per-tick + PipeTo pattern" (lines 37-39) — but `NotificationOutboxActor` holds an
 explicit in-flight boolean cleared on `DispatchComplete` precisely to serialise sweeps.
 This recorder dropped that half of the pattern. The piped-back `SampleComplete` message is
 the natural place the guard would be lowered, and its current empty body is the tell that
 the guard was intended but omitted.
 **Recommendation**
 Add an in-flight guard: set a `_sampleInFlight` flag at the top of `HandleSampleTick`, skip
 (and log at debug) the tick if already set, and clear it in the `SampleComplete` handler
 (both success and failure projections already route there). This matches the
 NotificationOutbox pattern the doc claims to follow and bounds the recorder to one pass per
 tick. Add a regression test that drives two `SampleTick`s before the first pass completes
 (e.g. a repository whose `RecordSamplesAsync` blocks on a gate) and asserts only one pass
 ran. The purge tick is daily + idempotent so a guard there is optional, but consider the
 same treatment for symmetry.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added a `_sampleInFlight` guard to the recorder — `HandleSampleTick` skips (debug-logs) if a pass is in flight; the flag is cleared via a `SampleComplete` message piped on BOTH success and failure, so a faulted source can't wedge the guard. Overlapping-tick regression test added.
 ### KpiHistory-002 — `GetRawSeriesAsync` materialises the entire window with no server-side cap; `DefaultMaxSeriesPoints` is applied only after the fact
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Performance & resource management |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IKpiHistoryRepository.cs:39-41` (contract); `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:45-62` (impl); consumed by `src/ZB.MOM.WW.ScadaBridge.CentralUI/Services/KpiHistoryQueryService.cs:77-93` |
 **Description**
 `GetRawSeriesAsync` has no `limit`/`maxPoints`/`Take` parameter. The implementation issues a
 `Where(...).OrderBy(CapturedAtUtc).Select(...).ToListAsync()` that pulls **every** sample in
 `[fromUtc, toUtc]` for the series into memory. The `DefaultMaxSeriesPoints` ceiling
 (default 200, design-described as the cap that prevents "a single trend query [from
 streaming] an arbitrarily large series to the Central UI" — `KpiHistoryOptionsValidator`
 XML) is enforced by `KpiSeriesBucketer.Bucket` **after** the full raw set has already been
 fetched across the wire and allocated as a `List<KpiSeriesPoint>` in the query service.
 At the default 60 s sample cadence, a 24 h window is ~1 440 rows/series, a 7 d window is
 ~10 080 rows/series, and a 30 d window ~43 200 rows/series — per chart, with up to four
 trend panels on a page each issuing its own query. The cap that the design relies on for
 back-pressure provides none at the data tier; it only trims the in-memory result the UI
 binds to. The `IX_KpiSample_Series` index makes the *range scan* efficient, but the row
 count returned is still unbounded by anything except the window the parent page chooses.
 **Recommendation**
 Push the downsampling toward the store, or at least bound the fetch. Cheapest correct fix:
 add an optional `int? maxRows` to `GetRawSeriesAsync` and have the query service pass a
 generous multiple of `effectiveMax` (e.g. `effectiveMax * k`) so the bucketer still has
 enough density to pick representative last-values while the DB-side `Take` caps the
 transfer. A more thorough fix is a server-side bucketed aggregation (GROUP BY a computed
 bucket index, MAX(CapturedAtUtc) per bucket), but that is a larger change the design
 explicitly deferred ("v1 ships exactly one aggregation"). At minimum, document the
 unbounded-fetch behaviour and the practical window ceiling so an operator does not point a
 30 d chart at a busy multi-site deployment.
 **Resolution**
 Deferred 2026-06-20: bounding the raw window fetch (cheap `Take`-cap vs. server-side bucketed aggregation) is a design decision and the code lives in ConfigurationDatabase/Commons, not this project. Recorded as a query-path enhancement; the practical window ceiling should be documented when addressed.
 ### KpiHistory-003 — Missing tests: overlapping ticks, unsorted-input bucketing, short-series timestamp semantics
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Testing coverage |
 | Status | Resolved |
 | Location | `tests/ZB.MOM.WW.ScadaBridge.KpiHistory.Tests/KpiHistoryRecorderActorTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.Commons.Tests/Kpi/KpiSeriesBucketerTests.cs` |
 **Description**
 The recorder and bucketer tests are otherwise strong (per-source isolation, faulted-tick
 recovery, purge cut-off, validator bounds, sorted-input bucketing, right-edge handling,
 empty-bucket omission, out-of-window filtering). Three behaviour gaps remain, each tied to
 a finding here:
 1. **No overlapping-tick test** (KpiHistory-001). Every recorder test sends a single tick
   and awaits its effect; nothing exercises a second `SampleTick` arriving while the first
   pass is still in flight, so the missing in-flight guard is invisible to the suite. A
   gated-repository test (block `RecordSamplesAsync`, send two ticks, count passes) would
   pin the intended one-pass-per-tick behaviour.
 2. **No unsorted-input bucketer test** (KpiHistory-006). All bucketer tests pass ascending
   input, so the doc's "largest-timestamp-wins for unsorted input" claim is never checked —
   and it is in fact wrong.
 3. **No short-series timestamp-semantics assertion** (KpiHistory-005).
   `Bucket_BucketStartUtc_IsSetToBucketStart…` covers only the downsample path; no test
   asserts what `BucketStartUtc` the early-return (`raw.Count <= maxPoints`) path emits, so
   the inconsistency between the two paths is untested.
 **Recommendation**
 Add the three tests above. The overlapping-tick test belongs in
 `KpiHistoryRecorderActorTests` (it is a recorder behaviour); the two bucketer tests belong
 in `KpiSeriesBucketerTests`.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added the overlapping-tick gated-repository test plus unsorted-input and short-series bucketer tests (the latter pin the documented short-series behaviour).
 ### KpiHistory-004 — Retention purge is a single unbatched `ExecuteDeleteAsync`; a large backlog deletes in one transaction
 | | |
 |--|--|
 | Severity | Low |
 | Category | Performance & resource management |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:65-71` |
 **Description**
 `PurgeOlderThanAsync` runs one set-based `Where(s => s.CapturedAtUtc < before).ExecuteDeleteAsync(...)`.
 For the steady-state daily cadence this deletes one day of expired rows and is fine. But if
 the purge has not run for an extended period (singleton down across a long failover window,
 `PurgeInterval` mis-set and later corrected, or `RetentionDays` shortened), a single
 unbounded `DELETE` can touch a very large row count in one transaction — lock escalation on
 `KpiSample`, a long-running transaction, and transaction-log growth, which on the shared
 central MS SQL can affect operational tables. The Audit Log purge path, by contrast, uses a
 bounded batched delete for exactly this reason.
 This is observability data on a non-partitioned `[PRIMARY]` table, so the blast radius is
 bounded and the severity is Low — but the unbatched delete is a latent operational hazard
 on the shared store.
 **Recommendation**
 Loop a bounded delete (`DELETE TOP (N) … WHERE CapturedAtUtc < @before`, or EF Core's
 batching) until zero rows are affected, mirroring the Audit Log purge shape. Keep the
 returned total for the existing log line.
 **Resolution**
 Deferred 2026-06-20: batching the retention purge is a shared-MS-SQL-store tradeoff (vs. the Audit Log's batched-delete precedent); low severity for non-partitioned observability data. Recorded as a future enhancement.
 ### KpiHistory-005 — `KpiSeriesBucketer` short-series early-return emits raw capture timestamps where the downsample path emits bucket-boundary timestamps
 | | |
 |--|--|
 | Severity | Low |
 | Category | Correctness & logic bugs |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:57-58` (early return) vs `:93-94` (downsample) |
 **Description**
 `KpiSeriesBucketer.Bucket` has two return paths that disagree on what `BucketStartUtc`
 means. When `raw.Count <= maxPoints` it returns `raw` unchanged — those points carry the
 raw `CapturedAtUtc` as their `BucketStartUtc` (the repository builds
 `new KpiSeriesPoint(s.CapturedAtUtc, s.Value)`). When `raw.Count > maxPoints` it returns
 points whose `BucketStartUtc` is the **bucket boundary**
 (`fromUtc + bucketIndex * bucketWidthTicks`), as the dedicated test
 `Bucket_BucketStartUtc_IsSetToBucketStartNotRawPointTimestamp` asserts.
 So the same series, charted at a density below vs above `maxPoints`, plots its points at
 different x-positions: actual capture instants in the sparse case, evenly-spaced bucket
 starts in the dense case. The downstream `KpiTrendChart` normalises X across
 `[min(BucketStartUtc), max(BucketStartUtc)]`, so the visual impact is minor (the time range
 is essentially the same), but the contract is inconsistent and the x-axis "tick spacing"
 subtly changes as a window crosses the cap. This is the bucketer that the KpiHistory design
 defines as the module's query reducer, so the inconsistency is in-scope even though the file
 lives in Commons.
 **Recommendation**
 Make the two paths agree. Either document the difference explicitly on `Bucket` (the
 short-series path returns raw capture instants; the downsample path returns bucket starts),
 or — cleaner — have the short-series path also project onto a consistent timestamp basis if
 exact bucket-start semantics are part of the contract. Add the short-series timestamp test
 from KpiHistory-003.
 **Resolution**
 Deferred 2026-06-20: whether the bucketer's short-series early-return and the downsample path must agree on `BucketStartUtc` semantics (vs. documenting the difference) is a contract decision for the design owner. The doc was corrected (KpiHistory-006) to describe current behaviour; the contract change is deferred.
 ### KpiHistory-006 — `KpiSeriesBucketer` doc claims "largest timestamp within each bucket is selected" for unsorted input; the code selects last-in-iteration
 | | |
 |--|--|
 | Severity | Low |
 | Category | Documentation & comments |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:20-21` (doc), `:88-97` (code) |
 **Description**
 The `raw` param doc states: *"If not sorted, the point with the largest timestamp within
 each bucket is selected."* The code does not do this. When a point is stored into a bucket,
 `best[bucketIndex].BucketStartUtc` is set to the **bucket-start** timestamp
 (`fromUtc + bucketIndex * bucketWidthTicks`), not the raw point's timestamp. The
 last-value comparison for a subsequent point in the same bucket is then
 `point.BucketStartUtc > best[bucketIndex].BucketStartUtc` — i.e. it compares the new raw
 point's capture time against the *bucket start*, which (for any in-bucket point) is almost
 always true. The effect is that each later-in-iteration point overwrites the previous one
 regardless of their relative timestamps, so the **last point in iteration order** wins, not
 the point with the largest timestamp.
 For the production caller this is harmless: `KpiHistoryRepository.GetRawSeriesAsync` always
 `OrderBy(CapturedAtUtc)`, so iteration order equals time order and last-in-iteration is the
 largest timestamp. But the documented contract for unsorted input is simply false, and the
 "if ties, keep first encountered — stable" comment (line 89) is also inaccurate — the
 overwrite triggers on equal-as-well-as-greater for any in-bucket point. A future caller that
 trusts the unsorted-input guarantee will get wrong results silently.
 **Recommendation**
 Either (a) fix the comparison to track the selected raw point's actual timestamp (store the
 raw `point.BucketStartUtc` alongside the emitted value and compare against that), making the
 "largest timestamp wins" claim true for unsorted input; or (b) tighten the doc to state the
 method requires ascending-sorted input and selects last-in-iteration otherwise, and drop the
 inaccurate "largest timestamp" / "stable ties" language. Pair with the unsorted-input test
 from KpiHistory-003.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): corrected the `KpiSeriesBucketer` param XML doc — dropped the false 'largest-timestamp-wins / stable ties' claim; now states it requires ascending-sorted input and selects last-in-iteration otherwise. Doc-only.
@@ -5,9 +5,9 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.NotificationService` |
 | Design doc | `docs/requirements/Component-NotificationService.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
 | Open findings | 0 |
 ## Summary
@@ -100,6 +100,39 @@ and `CredentialRedactor` masks any component of the credential string that is
 ≥ 4 characters long — a 4-character user name like `root` or a 4-char tenant
 prefix could be aggressively scrubbed out of unrelated log text (NS-025).
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 Re-reviewed the whole current state. Since `1eb6e97` the module was renamed
 (`ScadaLink → ZB.MOM.WW.ScadaBridge`, so the diff shows every file as "added" at
 the new path), gained complete XML-doc coverage, and — most importantly — **NS-021
 is now RESOLVED**: `ISmtpClientWrapper.AuthenticateAsync` threads an `oauth2UserName`
 and the OAuth2 branch builds `new SaslMechanismOAuth2(oauth2UserName, token)`,
 throwing `SmtpPermanentException` if the identity is empty; the real production
 caller `EmailNotificationDeliveryAdapter.SendAsync` passes `config.FromAddress`
 (`src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Delivery/EmailNotificationDeliveryAdapter.cs:196`),
 and `MailKitSmtpClientWrapperTests` pins both the throw and the XOAUTH2 framing.
 All ten prior NS-019..NS-025 findings hold; the module is now cleanly scoped to the
 shared central SMTP primitives (no orphaned site-delivery code). The re-review
 surfaced **three new findings**, none above Medium: the OAuth2 token endpoint and
 scope are hardcoded to Microsoft 365 so no other OAuth2 provider can authenticate
 despite the design doc promising "other modern SMTP providers" (NS-026), one test
 file is misplaced in this test project (its system-under-test lives in
 NotificationOutbox — NS-027), and one test's XML doc still names the
 NS-019-deleted `NotificationDeliveryService` (NS-028).
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ☑ | NS-021 now resolved (oauth2 user identity threaded + tested). Classification/redaction correct. No new correctness defect. |
 | 2 | Akka.NET conventions | ☑ | No actors. `AddNotificationServiceActors` is a documented no-op. |
 | 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` per-credential `ConcurrentDictionary` + per-entry `SemaphoreSlim` double-check is correct; concurrent-fetch test passes. No issues. |
 | 4 | Error handling & resilience | ☑ | Delivery error path now lives entirely in NotificationOutbox; the shared `SmtpErrorClassifier` is sound (typed exceptions + numeric SMTP code, cancellation excluded). No issues here. |
 | 5 | Security | ☑ | OAuth2 secret/token never logged (tenant only); `CredentialRedactor` tightened (NS-025). At-rest encryption of `SmtpConfiguration.Credentials` still NS-013 Deferred (Commons-scoped). OAuth2 endpoint hardcoded to M365 (NS-026). |
 | 6 | Performance & resource management | ☑ | Single `SmtpClient` per wrapper, disposed by caller's `using`; per-delivery factory documented (NS-022). No leak. |
 | 7 | Design-document adherence | ☑ | Central-only redesign clean. New gap: OAuth2 hardcoded to M365 token URL + outlook.office365.com scope contradicts "other modern SMTP providers" in the doc (NS-026). |
 | 8 | Code organization & conventions | ☑ | One-type-per-file holds. `NotificationOutboxKpiSampleSourceTests.cs` is in this test project but its SUT is in NotificationOutbox (NS-027). |
 | 9 | Testing coverage | ☑ | Primitives well covered (classifier, TLS parser, redactor, OAuth2 token cache/refresh/concurrency, auth-wrapper negative paths incl. NS-021). No false-coverage; orphaned tests gone (NS-024). |
 | 10 | Documentation & comments | ☑ | XML docs accurate after NS-019/023, EXCEPT `SmtpErrorClassifierTests.cs:11` still references the deleted `NotificationDeliveryService` (NS-028). |
 ## Checklist coverage
 | # | Category | Examined | Notes |
@@ -860,3 +893,118 @@ The threshold is a defence-in-depth choice; the existing tests assert that `Hunt
 **Recommendation**
 Tighten the redaction policy: mask only the obviously-secret components — the password (Basic), the client secret (OAuth2), and the whole packed string — not the user name / tenant / client id. The simplest implementation is to redact only the **last** colon-separated component (the secret) plus the full packed string. Bump the per-component minimum length to something high enough that a typical short user name does not match (≥ 12 chars is the usual heuristic for a password). Add a test asserting `Scrub("/root/.config", "root:hunter2")` does not mask `/root/.config`'s `root`.
 ### NotificationService-026 — OAuth2 token endpoint and scope are hardcoded to Microsoft 365; no other OAuth2 SMTP provider can authenticate
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.NotificationService/OAuth2TokenService.cs:73`, `src/ZB.MOM.WW.ScadaBridge.NotificationService/OAuth2TokenService.cs:80` |
 **Description**
 `OAuth2TokenService.GetTokenAsync` constructs the token URL as
 `$"https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token"` (`:73`) and
 requests the fixed scope `"https://outlook.office365.com/.default"` (`:80`). Both
 are Microsoft-365-specific: the authority host is Azure AD's, and the scope is the
 Outlook/Office-365 SMTP-send resource. The design doc (`Component-NotificationService.md`,
 "Email Server Configuration") states the OAuth2 Client Credentials mode is
 "For Microsoft 365 **and other modern SMTP providers** that require OAuth2", and
 CLAUDE.md's External Integrations decision likewise says "OAuth2 Client Credentials
 (Microsoft 365) **or** Basic Auth". As implemented, only Microsoft 365 can ever
 work: a Google Workspace, on-prem Keycloak/ADFS, or any non-Azure OAuth2 SMTP relay
 would need a different authority host and a different scope, neither of which is
 configurable. The credential triple carries only `tenantId:clientId:clientSecret`
 — there is no authority-host or scope field to override either constant. An operator
 who configures an `oauth2` SMTP config pointing at a non-M365 provider gets a token
 request sent to `login.microsoftonline.com` with `{tenantId}` interpolated as a path
 segment, which fails (404/400) and surfaces as a permanent delivery failure with no
 hint that the provider is simply unsupported.
 This is latent rather than actively broken — the only documented/tested target is
 M365 — but it contradicts the design doc's stated breadth and is the kind of gap that
 silently blocks a future deployment against a non-Microsoft tenant.
 **Recommendation**
 Either (a) make the authority host and scope configurable — add optional
 `OAuth2Authority`/`OAuth2Scope` fields to `SmtpConfiguration` (defaulting to the M365
 values) and thread them into `GetTokenAsync` — and update the cache key to include
 them; or (b) narrow the design doc and CLAUDE.md to say OAuth2 Client Credentials is
 **Microsoft 365 only** and document the M365-only constraint on `OAuth2TokenService`,
 so an operator is not misled into configuring an unsupported provider. Option (b) is
 the smaller change if M365-only is acceptable for v1.
 **Resolution**
 Deferred 2026-06-20: the OAuth2 token endpoint + scope are hardcoded to Microsoft 365 while the doc implies multi-provider support. Whether to make authority/scope configurable or narrow the doc to 'Microsoft 365 only' for v1 is a product/scope decision. Latent today (only M365 is deployed/tested). Recorded for that decision.
 ### NotificationService-027 — `NotificationOutboxKpiSampleSourceTests` is in the wrong test project (its system-under-test lives in NotificationOutbox)
 | | |
 |--|--|
 | Severity | Low |
 | Category | Code organization & conventions |
 | Status | Resolved |
 | Location | `tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/Kpi/NotificationOutboxKpiSampleSourceTests.cs:1`, `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Kpi/NotificationOutboxKpiSampleSource.cs` |
 **Description**
 `tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/Kpi/NotificationOutboxKpiSampleSourceTests.cs`
 was added (commit `0d6c026d`, "K6 — NotificationOutbox sample source") inside the
 **NotificationService** test project, but the class it exercises —
 `NotificationOutboxKpiSampleSource` — lives in the **NotificationOutbox** project
 (`src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Kpi/NotificationOutboxKpiSampleSource.cs`),
 confirmed by a repo-wide grep showing the type defined only there. A module's test
 project should test that module: this test belongs in
 `tests/ZB.MOM.WW.ScadaBridge.NotificationOutbox.Tests`. As placed, it forces the
 NotificationService test project to reference the NotificationOutbox project, blurs
 the module boundary the review process treats as the unit of review, and means a
 reviewer scoping "the NotificationService tests" picks up coverage for a different
 component. It also inflates this module's apparent test count with tests that assert
 nothing about NotificationService.
 **Recommendation**
 Move `Kpi/NotificationOutboxKpiSampleSourceTests.cs` to
 `tests/ZB.MOM.WW.ScadaBridge.NotificationOutbox.Tests/Kpi/`, and drop the
 NotificationOutbox project reference from `ZB.MOM.WW.ScadaBridge.NotificationService.Tests.csproj`
 if nothing else in that test project needs it.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): moved `NotificationOutboxKpiSampleSourceTests` from NotificationService.Tests into NotificationOutbox.Tests (where its SUT lives), with the namespace adjusted to that project's convention. Passes in its new home.
 ### NotificationService-028 — Stale XML cross-reference: `SmtpErrorClassifierTests` still names the NS-019-deleted `NotificationDeliveryService`
 | | |
 |--|--|
 | Severity | Low |
 | Category | Documentation & comments |
 | Status | Resolved |
 | Location | `tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/SmtpErrorClassifierTests.cs:11` |
 **Description**
 The class-summary XML doc on `SmtpErrorClassifierTests` says the classification
 policy "is shared between `<see cref="NotificationDeliveryService"/>` and the central
 outbox's `EmailNotificationDeliveryAdapter`". `NotificationDeliveryService` was
 deleted as part of the NS-019 resolution (the orphaned site-side sender), so this
 cross-reference points at a type that no longer exists. The `<see cref=...>` will not
 resolve and the sentence describes a sharing relationship that ended with the
 central-only redesign — the classifier is now consumed only by
 `EmailNotificationDeliveryAdapter`. This is the same stale-doc class NS-023 cleaned up
 in the production source, missed in this one test file. Minor, but misleading to a
 maintainer and a dangling `cref`.
 **Recommendation**
 Drop the `NotificationDeliveryService` reference from the summary; state that the
 classifier is shared by the central `EmailNotificationDeliveryAdapter` (and is a
 reusable primitive of this module), consistent with the NS-019/NS-023 cleanup.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): removed the dangling `<see cref="NotificationDeliveryService"/>` (type deleted by NS-019) from the `SmtpErrorClassifierTests` XML doc, repointing the reference to the actual production caller. Doc-only.
@@ -40,37 +40,39 @@ module file and counted in **Total**.
 | Severity | Open findings |
 |----------|---------------|
 | Critical | 0 |
-| High | 0 |
+| High | 1 |
-| Medium | 0 |
+| Medium | 1 |
-| Low | 0 |
+| Low | 2 |
-| **Total** | **0** |
+| **Total** | **4** |
 ## Module Status
 | Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
 |--------|---------------|--------|----------------|------|-------|
-| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 11 |
+| [AuditLog](AuditLog/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 16 |
 | [CLI](CLI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 24 |
 | [CentralUI](CentralUI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 36 |
-| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 14 |
+| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 15 |
 | [Commons](Commons/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 |
-| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
+| [Communication](Communication/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 24 |
 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 27 |
-| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
+| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 |
-| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 24 |
+| [DeploymentManager](DeploymentManager/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 |
-| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 |
+| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 |
-| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 |
+| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 |
-| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
+| [Host](Host/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 |
-| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 25 |
+| [InboundAPI](InboundAPI/findings.md) | 2026-06-20 | `4307c381` | 0/1/1/2 | 4 | 29 |
 | [KpiHistory](KpiHistory/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 6 |
 | [ManagementService](ManagementService/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 |
 | [NotificationOutbox](NotificationOutbox/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 13 |
-| [NotificationService](NotificationService/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 25 |
+| [NotificationService](NotificationService/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 28 |
-| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 21 |
+| [ScriptAnalysis](ScriptAnalysis/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 8 |
-| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 6 |
+| [Security](Security/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 |
-| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 |
+| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 9 |
-| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 26 |
+| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 |
-| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 24 |
+| [SiteRuntime](SiteRuntime/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 31 |
-| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
+| [StoreAndForward](StoreAndForward/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 |
 | [TemplateEngine](TemplateEngine/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 |
 | [Transport](Transport/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 15 |
 ## Pending Findings
@@ -84,14 +86,21 @@ description, location, recommendation — lives in the module's `findings.md`.
 _None open._
-### High (0)
+### High (1)
-_None open._
+| ID | Module | Title |
 |----|--------|-------|
 | InboundAPI-026 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` gives inbound scripts arbitrary raw SQL (not read-only); contradicts the design doc's "No direct database access" decision and regresses InboundAPI-007 |
-### Medium (0)
+### Medium (1)
-_None open._
+| ID | Module | Title |
 |----|--------|-------|
 | InboundAPI-027 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` is sync-over-async and ignores the method-deadline token — thread-pool starvation and an unbounded slow query |
-### Low (0)
+### Low (2)
-_None open._
+| ID | Module | Title |
 |----|--------|-------|
 | InboundAPI-028 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` has no negative-path tests; `Database`/`WaitForAttribute` are not exercised end-to-end through the endpoint |
 | InboundAPI-029 | [InboundAPI](InboundAPI/findings.md) | Routed `WaitForAttribute` is cancelled by the method-level deadline, contradicting spec §6 (wait bounded by the wait timeout, not the method timeout) |
@@ -0,0 +1,407 @@
 # Code Review — ScriptAnalysis
 | Field | Value |
 |-------|-------|
 | Module | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis` |
 | Design doc | `docs/requirements/Component-ScriptAnalysis.md` |
 | Status | Reviewed |
 | Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
 | Commit reviewed | `4307c381` |
 | Open findings | 0 |
 ## Summary
 This is the first formal review of the ScriptAnalysis component (#25), the single
 authoritative source of truth for the ScadaBridge script trust boundary. The module
 is small (5 source files, ~700 lines) and well-documented, and its core design is
 **sound**: the trust verdict is genuinely **semantic** (Roslyn symbol resolution in
 Pass 1), not a string/regex deny-list, so the classic spelling-based evasions —
 aliases, `using static`, `global::`, partial-namespace qualification — are correctly
 defeated by resolving the symbol rather than the text. The reflection-gateway
 hardening (Pass 2) closes the `typeof(x).Assembly.GetType("System.IO.File")` class of
 bypass that a namespace-only deny-list cannot see. All four documented consumers
 (Template Engine, Site Runtime, Inbound API, Central UI) genuinely delegate to
 `ScriptTrustValidator.FindViolations` and **hard-reject** on a non-empty result; none
 re-implements or relaxes the policy. A reflection-based parity guard for the
 compile-only surfaces exists in `SiteRuntime.Tests`. **No working trust-boundary
 bypass was found at HEAD against the production (cluster-hosted) configuration.**
 That said, the review surfaced several real issues. The highest-impact one is a
 **silent degradation of the semantic pass when `TRUSTED_PLATFORM_ASSEMBLIES` is
 unavailable** (single-file/AOT/trimmed host): `AnalysisReferences` falls back to the
 5-assembly minimal set, after which a *bare* forbidden type inside an *allowed*
 namespace (the documented `Process` via `using System.Diagnostics;` case) no longer
 resolves and slips Pass 1 entirely — and Pass 2 never flags bare identifiers, only
 dotted spellings (ScriptAnalysis-001). This is the only finding that can weaken the
 verdict, and it is narrow (the production hosts carry a TPA list, and the downstream
 real `Compile` gate catches it for two of the four consumers), so it is rated High
 rather than Critical. The remaining findings are robustness/correctness/doc issues:
 the semantic pass swallows nothing but the unresolved-symbol fallback is namespace-
 spelling-dependent (Medium); `nameof(...)` over-flags legitimate non-forbidden uses
 inconsistently (Low, documented-intentional but worth pinning); `ParseDiagnostics`/
 `Compile` drop warnings despite the doc promising "errors and warnings"; there is no
 caching/throttling of the Roslyn compilations (performance); the `extraReferences`
 XML-doc on `FindViolations` is contradicted by the Central UI consumer; and the
 adversarial test corpus, though good, has gaps (no extension-method, no
 TPA-fallback, no `file`-scoped/`unsafe`, no verbatim/Unicode-escape identifier
 cases). The trust boundary is **semantically sound in production**; the open items
 harden the edges and align the spec.
 ## Checklist coverage
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ☑ | `nameof` over-flags inconsistently (ScriptAnalysis-005); `GetSyntacticScopes` only returns text for dotted names so a bare unresolved forbidden identifier yields no scope (root of ScriptAnalysis-001/002). |
 | 2 | Akka.NET conventions | ☑ | No issues found — module has no actors, no Akka dependency (by design; csproj references only Roslyn + Commons). Stateless static APIs, thread-safe. |
 | 3 | Concurrency & thread safety | ☑ | No issues found — all public surfaces are static/stateless; the shared `static readonly` policy collections are read-only after init; `HardeningWalker` is per-call. `AnalysisReferences` eager-built once at type init. |
 | 4 | Error handling & resilience | ☑ | `RoslynScriptCompiler.Compile` converts exceptions into a fail-the-gate error entry (fail-safe). `BuildAnalysisReferences` swallows per-assembly read errors and silently falls back to the minimal set — the silent fallback is the resilience gap (ScriptAnalysis-001). |
 | 5 | Security | ☑ | Semantic (symbol-based) verdict — no spelling bypass found in production. One conditional degradation (TPA-absent host) weakens Pass 1 to spelling-dependent (ScriptAnalysis-001); `extraReferences` can never cause a false allow (verified). Defence-in-depth only, as documented — not a runtime sandbox (ScriptAnalysis-008). |
 | 6 | Performance & resource management | ☑ | No compilation caching/throttling; every gate call re-parses, rebuilds a compilation, and re-runs Roslyn — `FindViolations` builds a fresh `CSharpCompilation` against the *full framework* reference set on every call (ScriptAnalysis-006). |
 | 7 | Design-document adherence | ☑ | `ParseDiagnostics`/`Compile` drop warnings though doc says "errors and warnings" (ScriptAnalysis-004); `AnalysisReferences` (full-framework set) is not described in REQ-SA-2, which says DefaultReferences (ScriptAnalysis-007). |
 | 8 | Code organization & conventions | ☑ | No issues found — POCO/static layout is clean, namespace correct, depends only on Commons + Roslyn. `ScriptCompileSurface` lives in ScriptAnalysis (not Commons) — acceptable as it mirrors a runtime surface and is consumed here. |
 | 9 | Testing coverage | ☑ | Good adversarial reject/allow corpus, but missing: TPA-fallback degradation, extension methods, verbatim/Unicode-escape identifiers, `unsafe`/pointer, comment/trivia, and `AnalysisReferences`/exception paths (ScriptAnalysis-003). |
 | 10 | Documentation & comments | ☑ | `FindViolations` XML-doc claim "Forbidden references are NOT added here" is contradicted by the Central UI consumer forwarding `compilation.References` (ScriptAnalysis-007, doc half). Otherwise thorough and accurate. |
 ## Findings
 ### ScriptAnalysis-001 — Semantic trust pass silently degrades to spelling-dependent when `TRUSTED_PLATFORM_ASSEMBLIES` is unavailable
 | | |
 |--|--|
 | Severity | High |
 | Category | Security |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustPolicy.cs:146-184`; interacts with `ScriptTrustValidator.cs:111-136`, `:195-213` |
 **Description**
 `AnalysisReferences` is the reference set that lets Pass 1 (the semantic symbol pass —
 the *authoritative* guard) resolve every type a script names to its true namespace.
 It is built from `AppContext.GetData("TRUSTED_PLATFORM_ASSEMBLIES")`. When that value
 is absent — a single-file deployment, an AOT/trimmed host, or any runtime that does
 not publish the TPA list — `BuildAnalysisReferences` falls back to the 5-assembly
 `DefaultReferences` set (`ScriptTrustPolicy.cs:183`) **silently** (the per-assembly
 `catch { }` at `:165`/`:178` and the `Count > 0` fallback at `:183` emit nothing).
 After that fallback, a *bare* forbidden type that lives inside an *allowed* namespace
 no longer resolves. The policy's own documented example is exactly this case:
 `using System.Diagnostics; var p = Process.Start("x");`. `System.Diagnostics` is an
 allowed namespace (so `VisitUsingDirective` does not flag the import), and bare
 `Process` is an `IdentifierName`. With `Process.dll` absent from the minimal
 reference set, `model.GetSymbolInfo` returns no symbol, so Pass 1 falls to
 `GetSyntacticScopes`, which returns an empty scope list for a bare (dotless)
 identifier (`ScriptTrustValidator.cs:210-212`). Pass 2's `HardeningWalker` only flags
 forbidden *dotted* names and forbidden `using` imports — never a bare identifier
 token (`VisitIdentifierName`, `:373-395`, only checks `dynamic`/`Activator`). The
 forbidden `Process.Start` therefore slips the trust validator entirely. The same
 degradation reopens the broader "forbidden type in an allowed namespace reached as a
 bare identifier" class that the full-framework reference set was specifically added
 (commit `06975720`, M3.6) to close.
 This is rated High not Critical because: (a) the production hosts (Akka.NET cluster,
 docker topology) run on a normal framework-dependent runtime that publishes the TPA
 list, so the full set is loaded; and (b) for two of the four consumers (Template
 Engine, Site Runtime) the downstream real `Compile` gate against a curated reference
 set independently rejects an undefined `Process` symbol. But Inbound API and the
 Central UI run gate rely on `FindViolations` for the reflection/namespace verdict,
 and a deployment-mode change that drops the TPA list would silently weaken the
 boundary with no diagnostic.
 **Recommendation**
 Make the degradation loud and fail-safe: when the TPA list is unavailable, log a
 warning (or expose a diagnostic flag the consumers can surface), and — more
 importantly — add the BCL assemblies that host the forbidden-type-in-allowed-namespace
 cases (at minimum `System.Diagnostics.Process`) to `DefaultAssemblies` so they resolve
 even in the minimal fallback. Alternatively, extend Pass 2 to flag bare identifiers
 whose name matches the simple name of a known forbidden type (e.g. `Process`,
 `WebClient`), independent of symbol resolution, so the verdict never depends solely on
 the reference set being complete.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): the TPA-absent fallback now folds the
 assemblies hosting the forbidden anchor types (Process, Socket, File, Thread, Assembly,
 Marshal) into the minimal validation reference set so a bare forbidden type still
 resolves and is flagged; an `AnalysisReferencesDegraded` flag + a `Trace.TraceWarning`
 make the degraded mode observable. CRITICAL: anchors are added ONLY to the validation
 `AnalysisReferences`, never to `DefaultReferences` — keeping the compile gate from being
 able to bind `Process`. Verified by a TPA-fallback test that a bare `Process.Start` is
 still rejected.
 ### ScriptAnalysis-002 — Pass 2 (syntactic hardening) is spelling-dependent and cannot catch a bare forbidden identifier on its own
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Security |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:289-395`; `:195-213` |
 **Description**
 Pass 2 (`HardeningWalker`) is purely textual: `VisitUsingDirective` flags forbidden
 `using` imports (`:289-295`), `VisitQualifiedName`/`VisitMemberAccessExpression` flag
 forbidden *dotted* names (`:298-369`), and `VisitIdentifierName` flags only the bare
 identifiers `dynamic`/`Activator` (`:373-395`). It cannot independently flag any other
 bare forbidden identifier — it relies entirely on Pass 1's semantic resolution for
 that. The two passes are therefore not the "two independent guards" the design implies
 for the bare-identifier case: for an unresolved bare identifier (see
 ScriptAnalysis-001) **neither** pass fires. `GetSyntacticScopes` deliberately returns
 nothing for dotless names (`:210-212`, "bare local identifiers cannot reach a forbidden
 namespace") — which is true only while Pass 1 resolves them; once resolution fails the
 assumption breaks.
 This is the structural root of ScriptAnalysis-001 and is recorded separately because
 it is a design property (the syntactic pass is not self-sufficient against bare names)
 that should be acknowledged even after the TPA-fallback hole is closed: the defence is
 single-layered, not double-layered, for the bare-identifier vector.
 **Recommendation**
 Document explicitly that Pass 1 is the sole guard for bare (single-identifier)
 forbidden references and that Pass 2 only backstops *spelled* forbidden namespaces. If
 genuine defence-in-depth is wanted for bare names, add a simple-name deny-set to Pass 2
 (the simple names of the forbidden root types — `Process`, `WebClient`, `Marshal`,
 `Registry`, `Thread`, `File`, `Directory`, etc.), accepting the small false-positive
 risk on a script's own identically-named local, which a security gate should tolerate.
 **Resolution**
 Deferred 2026-06-20: strengthening Pass-2 (syntactic) so a bare forbidden identifier is
 caught without Pass-1 symbol resolution is a defence-in-depth enhancement; the
 production path (TPA present, two-layer compile gate on most call sites) is sound.
 Recorded for a follow-up.
 ### ScriptAnalysis-003 — Adversarial test corpus has bypass-relevant gaps
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Testing coverage |
 | Status | Resolved |
 | Location | `tests/ZB.MOM.WW.ScadaBridge.ScriptAnalysis.Tests/ScriptTrustValidatorTests.cs`; `RoslynScriptCompilerTests.cs` |
 **Description**
 The reject/allow corpus is solid for the spelled-namespace, alias, `using static`,
 `global::`, reflection-gateway, generic-argument, and nested-lambda vectors. For a
 security-critical trust boundary, however, several bypass-relevant cases are untested:
 - **TPA-fallback degradation** (ScriptAnalysis-001) — no test forces
  `AnalysisReferences` onto the minimal set and asserts that bare `Process`/`WebClient`
  is still rejected. This is the highest-value missing test.
 - **Extension methods** — no test that a forbidden-namespace extension method invoked
  in receiver-position (`x.SomeForbiddenExt()`) is resolved and flagged.
 - **Verbatim / Unicode-escaped identifiers** — `@dynamic`, `dynamic`,
  `Activator`, or `GetType` are not exercised; `VisitIdentifierName` compares
  `Identifier.ValueText`, which decodes Unicode escapes (so it likely holds), but the
  case is not pinned, and `dynamic` is matched via `text == "dynamic"` (`:380`) which is
  a separate path from the `ForbiddenIdentifiers` set.
 - **`unsafe`/pointer/`stackalloc`** — relies implicitly on `ScriptOptions` defaulting
  `AllowUnsafe=false`; not asserted.
 - **Comment/trivia and string-literal payloads** — no test that a forbidden namespace
  inside a comment or string is *not* flagged (false-positive guard) and that a
  forbidden token reconstructed only at runtime from strings is correctly *not* the
  validator's concern (documents the sandbox caveat).
 **Recommendation**
 Add the cases above, prioritising the TPA-fallback test (inject a minimal-reference
 build of `AnalysisReferences` or factor the reference set so a test can force the
 fallback) and the extension-method test. Add a verbatim/Unicode-escape identifier
 reject test for `dynamic`/`Activator`/`GetType`.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added 18 adversarial tests — TPA-fallback
 bare-Process/Socket, extension-method invocation, verbatim/Unicode-escaped identifiers,
 `unsafe` blocks (benign vs forbidden-API-inside), and comment/string-literal
 false-positive guards.
 ### ScriptAnalysis-004 — `ParseDiagnostics`/`Compile` drop warnings though the design doc promises "errors and warnings"
 | | |
 |--|--|
 | Severity | Low |
 | Category | Design-document adherence |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/RoslynScriptCompiler.cs:30-33`, `:73-76` |
 **Description**
 REQ-SA-3 states `Compile` returns the diagnostics "filtered to errors and warnings"
 and `ParseDiagnostics` returns "syntax-level diagnostics (errors and warnings)". The
 code filters `d.Severity == DiagnosticSeverity.Error` only in both methods
 (`RoslynScriptCompiler.cs:31` and `:74`), so warnings are silently dropped. Filtering
 to errors-only is arguably the safer choice for a compile *gate* (a warning should not
 block a deploy), but the code and the spec disagree, and a future maintainer reading
 the doc will expect warnings in the returned list.
 **Recommendation**
 Reconcile: either change the doc (REQ-SA-3) to say "errors only", or include warnings
 and let callers decide. Given the gate semantics, updating the doc to "error-severity
 diagnostics" is the cleaner fix.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): corrected the design doc (REQ-SA-3) to
 'error-severity only', matching the compile wrapper's existing behaviour (a warning must
 not block a deploy gate). Code unchanged.
 ### ScriptAnalysis-005 — `nameof(forbiddenType)` flagging is inconsistent between passes and across resolution states
 | | |
 |--|--|
 | Severity | Low |
 | Category | Correctness & logic bugs |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:42-48`, `:99-136`, `:298-325` |
 **Description**
 The class comment (`:42-48`) states that a forbidden type inside `nameof(...)` is
 "deliberately flagged" as a conservative fail-safe, and `Rejects_NameOf_ForbiddenType`
 pins `nameof(System.IO.File)`. However the flagging is incidental, not deliberate, and
 inconsistent: `nameof(System.IO.File)` is caught because the inner `System.IO.File` is
 a `QualifiedNameSyntax`/member access that both passes inspect directly — the
 `nameof` wrapper is irrelevant. A bare `nameof(Process)` (after `using
 System.Diagnostics;`, with full TPA loaded) resolves `Process` to a real symbol and
 *is* flagged by Pass 1; but the same `nameof(Process)` under the minimal-reference
 fallback resolves to nothing and is *not* flagged. So the "deliberate fail-safe"
 behaviour actually varies with the reference set, and `nameof` — which produces only a
 compile-time string and reaches no API — is a genuine false positive that will reject a
 legitimate script doing `nameof(System.Net.IPAddress)` for a log message.
 **Recommendation**
 Decide the intended semantics explicitly. If `nameof` should be allowed (it reaches no
 runtime API), special-case `NameOfExpression`/`InvocationExpression` named `nameof` to
 suppress the inner-reference report. If it should be rejected, make that uniform
 regardless of resolution state (tie to a simple-name check), and keep the test. Either
 way, align the comment with the actual, deterministic behaviour.
 **Resolution**
 Deferred 2026-06-20: the `nameof(forbiddenType)` over-flagging is a documented
 deliberate fail-safe; whether `nameof` (which reaches no API) should be allowed vs
 uniformly rejected is a design-owner decision. Recorded; no behaviour change this pass.
 ### ScriptAnalysis-006 — No caching or throttling of Roslyn compilations; full-framework compilation rebuilt per call
 | | |
 |--|--|
 | Severity | Low |
 | Category | Performance & resource management |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:73-97`; `RoslynScriptCompiler.cs:50-71` |
 **Description**
 `FindViolations` parses the script and builds a fresh `CSharpCompilation` against the
 **entire framework reference set** (`AnalysisReferences`, potentially hundreds of
 `MetadataReference`s) on every call (`:87-95`). `RoslynScriptCompiler.Compile`
 similarly creates a new `CSharpScript` and calls `.Compile()` per call. Roslyn
 compilation is the expensive part of these calls. The metadata references are cached
 (built once in `AnalysisReferences`), which is the main cost, so this is Low rather
 than Medium — but for the Central UI editor path (which can call the validator on every
 keystroke-debounced analysis) and for batch template validation (many scripts per
 deploy), the absence of any per-source memoisation or a bounded compilation cache means
 repeated work. There is no back-pressure or concurrency cap either; the Inbound API
 review (InboundAPI-009) already noted uncapped recompilation downstream.
 **Recommendation**
 Consider a small bounded cache keyed on a hash of `(code, globalsType)` for `Compile`
 results and on `code` for `FindViolations` verdicts (the verdict is a pure function of
 the source and the static policy). At minimum, document that callers on hot paths
 (editor live-analysis) should debounce and cache, and confirm the Central UI debounces
 before calling.
 **Resolution**
 Deferred 2026-06-20: compilation caching/throttling (the full-framework
 `CSharpCompilation` is rebuilt per `FindViolations` call) is a performance enhancement,
 not a correctness defect. Recorded for a follow-up.
 ### ScriptAnalysis-007 — `FindViolations` `extraReferences` XML-doc contradicts the Central UI consumer; `AnalysisReferences` undocumented in REQ-SA-2
 | | |
 |--|--|
 | Severity | Low |
 | Category | Documentation & comments |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:57-64`; `ScriptTrustPolicy.cs:125-144`; `docs/requirements/Component-ScriptAnalysis.md:74-80` |
 **Description**
 Two doc/code mismatches, neither a vulnerability:
 1. The `extraReferences` XML-doc on `FindViolations` (`:57-64`) says "Forbidden
   references are NOT added here". But the Central UI run gate calls
   `FindViolations(tree.ToString(), compilation.References)`
   (`CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:1319-1320`), forwarding the full
   BCL surface as `extraReferences`. This is in fact *safe* (the deny-list is by
   namespace/type, so adding references can only improve resolution, never produce a
   false allow — and the policy comment at `ScriptTrustPolicy.cs:138-142` says exactly
   that), but the `FindViolations` doc as written implies the caller must not pass
   forbidden assemblies, which is both unnecessary and contradicted in practice.
 2. REQ-SA-2 in the design doc describes Pass 1 as using "the full trusted-platform
   reference set from `ScriptTrustPolicy.DefaultReferences`". The code uses a separate
   `ScriptTrustPolicy.AnalysisReferences` (the full framework), and `DefaultReferences`
   is explicitly the *minimal* set used by `RoslynScriptCompiler`. The doc conflates the
   two and never mentions `AnalysisReferences` or the TPA mechanism — material because
   that mechanism is the heart of the security verdict (and of ScriptAnalysis-001).
 **Recommendation**
 Rewrite the `extraReferences` doc to state that extra references only widen symbol
 resolution and never whitelist a forbidden API. Update REQ-SA-2 to describe
 `AnalysisReferences` (full-framework, TPA-sourced) versus `DefaultReferences` (minimal,
 runtime-fidelity), and note the TPA-fallback behaviour.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): corrected the `FindViolations`
 `extraReferences` XML doc (extra references only widen resolution, never whitelist) and
 documented `AnalysisReferences` vs `DefaultReferences` in REQ-SA-2.
 ### ScriptAnalysis-008 — Static trust gate is defence-in-depth, not a sandbox; correctly documented but worth recording as a standing risk
 | | |
 |--|--|
 | Severity | Low |
 | Category | Security |
 | Status | Won't Fix |
 | Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:34-40`; `docs/requirements/Component-ScriptAnalysis.md:151` |
 **Description**
 The component is explicit (both in code comments and REQ-SA-5) that it is a
 compile-time deny-list, not a runtime sandbox: scripts execute in-process, and a
 sufficiently determined script could reconstruct a forbidden API at runtime via paths
 the static walker cannot see (e.g. building type names from runtime strings, or any
 gap in the deny-list). The reflection-gateway hardening narrows this, and the
 `dynamic`/`Activator` bans close the obvious late-binding routes, but the residual risk
 is real and inherent to in-process execution of partially-trusted C#. This is recorded
 (not as a defect of this module, which behaves as designed) so the standing risk is
 visible in the audit trail and is revisited if/when the deferred runtime boundary
 (restricted `AssemblyLoadContext` / curated reference set / out-of-process sandbox) is
 scoped.
 **Recommendation**
 Keep the caveat prominent. Track the runtime-boundary work as a roadmap item; in the
 meantime ensure the curated *execution* reference sets (SiteRuntime `ScriptAssemblies`,
 `RoslynScriptCompiler` `DefaultReferences`) stay minimal so that even a deny-list miss
 cannot bind a forbidden type at execution time (this is the real second layer of
 defence for the compile-running consumers).
 **Resolution**
 Won't Fix 2026-06-20: the static gate is intentionally defence-in-depth, not a runtime
 sandbox — this is the explicitly-documented design posture, not a defect. Recorded as a
 standing risk for visibility.
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.Security` |
 | Design doc | `docs/requirements/Component-Security.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 0 (1 deferred — Security-008) |
+| Open findings | 0 (2 deferred — Security-008, Security-022) |
 ## Summary
@@ -78,6 +78,50 @@ values silently surface only on first login (Security-020); and the
 `RequireHttpsCookie=false` dev opt-out emits no warning, so an HTTP production
 deployment silently transmits the JWT bearer credential in cleartext (Security-021).
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 Major architectural change since `1eb6e97`: the project was renamed (ScadaLink →
 ZB.MOM.WW.ScadaBridge) and the bespoke `LdapAuthService`/`LdapTransport`/`LdapAuthResult`
 were **cut over to the external `ZB.MOM.WW.Auth.Ldap` library** (commit `ac34dac4`); the
 prior `SiteScopeAuthorizationHandler` was deleted (Security-017 resolution), and the role
 model was canonicalised + collapsed (`Admin→Administrator`, `Design→Designer`,
 `Deployment→Deployer`, `Audit→Administrator`, `AuditReadOnly→Viewer`; commit `b104760b`)
 with M7 `Operator`/`Verifier` added (`a0ce8b6c`). The interactive cookie session now signs
 in with **bare claims** via the shared `SessionClaimBuilder` and is policed by
 `CookieSessionValidator` (idle-timeout + LDAP-free DB role-refresh, fail-closed, well
 tested). All prior fixes that survived the cutover remain in place (key-length guard,
 issuer/audience binding, `RoleMapper` union semantics, `Roles` constants, cookie hardening
 via `ZbCookieDefaults`, the `RequireHttpsCookie` warning). Security-008 (N+1 scope-rule
 query) is still correctly **Deferred**. This pass surfaced **4 new findings**
 (Security-022..025): three Medium and one Low. The most consequential is **Security-022** —
 the documented "cookie + JWT hybrid" no longer holds at HEAD: `/auth/token` still *mints* a
 bearer JWT but **no code anywhere validates it** (CLI uses HTTP Basic Auth, Inbound API uses
 the `sbk_` ApiKeys verifier, the cookie path uses bare claims), so `JwtTokenService`'s entire
 `ValidateToken`/`RefreshToken`/`RecordActivity`/`ShouldRefresh`/`IsIdleTimedOut` surface is
 dead code minting an orphaned credential. **Security-023** — `LdapGroupMapping.Role` accepts
 arbitrary free strings via CLI/API with no canonical-set validation, and the Central UI
 `RequireClaim` policies are case-sensitive while the ManagementActor authz check is
 case-insensitive, so a mis-cased/typo'd mapping silently grants inconsistent access across
 the two surfaces with no operator feedback. **Security-024** — the M7 `RequireOperator`/
 `RequireVerifier` SoD policies have no functional `AuthorizeAsync` grant/deny test (only
 constant-name assertions), unlike every other policy. **Security-025** (Low) — a unit test
 makes a live network connect to `nonexistent.invalid:9999`.
 _Re-review (2026-06-20, `4307c381`):_
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ☑ | `RoleMapper` union semantics (Security-016 fix) hold and read correctly. `LdapGroupMapping.Role` is an unvalidated free string + case-sensitive UI policy vs case-insensitive ManagementActor authz (Security-023). |
 | 2 | Akka.NET conventions | ☑ | No actors. `AddSecurityActors` is still a placeholder. No issues. |
 | 3 | Concurrency & thread safety | ☑ | Services stateless/DI-scoped; `CookieSessionValidator` reads injected `TimeProvider`. LDAP now in the external library. No shared mutable state. No issues. |
 | 4 | Error handling & resilience | ☑ | `CookieSessionValidator` fails closed on idle (missing anchor = timed out) and swallows refresh faults keeping current roles (mirrors "LDAP failure: active sessions continue") — correct and well tested. `LdapAuthFailureMessages` maps the library's structured failure enum to stable user text. No issues. |
 | 5 | Security | ☑ | Cookie hardening (HttpOnly/SameSite=Strict/Secure-when-required/sliding) comes from `ZbCookieDefaults` — verified correct. JWT mint binds issuer/audience, pins `MapInbound/OutboundClaims=false`, fails fast on short key. BUT the minted bearer JWT is validated by no consumer (Security-022). DisableLogin auto-login is gated at registration + loud warning + doc'd (acceptable). |
 | 6 | Performance & resource management | ☑ | Security-008 N+1 remains correctly Deferred (gated on `ISecurityRepository`). `CookieSessionValidator` deliberately avoids re-minting on the no-refresh path. No new perf issues. |
 | 7 | Design-document adherence | ☑ | The cookie+JWT-hybrid design is now cookie-with-bare-claims; the JWT half is orphaned/unvalidated (Security-022) — the M2.19 doc note acknowledges bare-claim cookies but the doc still presents JWT-in-cookie as the model and does not flag that `/auth/token`'s JWT is unconsumed. Operator/Verifier policies match the design. |
 | 8 | Code organization & conventions | ☑ | `Roles`/`SecurityOptions` owned by the component; repository interface in Commons; additive `IGroupRoleMapper<string>` adapter. Dead JWT methods (Security-022) are the main org smell. |
 | 9 | Testing coverage | ☑ | `CookieSessionValidator`, `SessionClaimBuilder`, `RoleMapper` union, cookie wiring, `SecurityOptionsValidator`, DisableLogin are well covered. Gaps: no functional authz test for `RequireOperator`/`RequireVerifier` (Security-024); a unit test does live network I/O (Security-025); no test for a non-canonical/mis-cased mapping role (Security-023). |
 | 10 | Documentation & comments | ☑ | XML docs are thorough and current. `SecurityOptions` and `JwtTokenService` docs still describe an active JWT-refresh lifecycle that no longer has a caller (Security-022). |
 ## Checklist coverage
 | # | Category | Examined | Notes |
@@ -1008,3 +1052,178 @@ SecurityOptions:RequireHttpsCookie=true in production.")`. Optionally, also fail
 startup when `RequireHttpsCookie=false` AND `ASPNETCORE_ENVIRONMENT=Production`. Add a
 regression test that asserts the warning is emitted when the flag is disabled and not
 when it is enabled.
 ### Security-022 — `/auth/token` mints a bearer JWT that no consumer validates; the entire `JwtTokenService` validate/refresh surface is dead code
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.Security/JwtTokenService.cs:149`, `:194`, `:209`, `:239`, `:272`; `src/ZB.MOM.WW.ScadaBridge.CentralUI/Auth/AuthEndpoints.cs:99-148` |
 **Description**
 The design (Component-Security.md "Cookie + JWT Hybrid", CLAUDE.md "cookie+JWT hybrid …
 HMAC-SHA256, 15-minute expiry w/ sliding refresh") presents a cookie-embedded JWT as the
 session model. At HEAD this is no longer how the interactive session works (the M2.19 note
 acknowledges the cookie now carries **bare claims** policed by `CookieSessionValidator`),
 and — more importantly — the JWT half is now an **orphaned, unconsumed credential**. A grep
 across `src/` shows `JwtTokenService.ValidateToken`, `RefreshToken`, `RecordActivity`, and
 `ShouldRefresh` have **zero production callers**; only `GenerateToken` is called, solely by
 `POST /auth/token` (`AuthEndpoints.cs:134`). That endpoint returns an `access_token` to the
 caller, but **no endpoint in the codebase accepts or validates that JWT** as a credential:
 the CLI authenticates to the Management API with HTTP Basic Auth
 (`CLI/ManagementHttpClient.cs:36`), the Inbound API authenticates with the external
 `ZB.MOM.WW.Auth.ApiKeys` `sbk_<keyId>_<secret>` verifier, and the cookie pipeline uses
 `SessionClaimBuilder` bare claims (never `ValidateToken`). The result: `/auth/token` issues
 HMAC-signed bearer tokens that grant access to nothing, and ~120 lines of carefully-hardened
 issuer/audience/clock-skew/idle/refresh logic (the subject of resolved findings
 Security-006/007/014) are dead code. The hazard is twofold: (a) the design doc materially
 overstates the live session mechanism, misleading any future reader/auditor; and (b) an
 orphaned token mint is a latent footgun — a future endpoint added under the assumption that
 `/auth/token` produces a *usable* credential would wire a second, parallel auth path with
 no test coverage proving it actually validates.
 **Recommendation**
 Decide the intent explicitly. Either (a) **wire the bearer path**: add
 `AddAuthentication().AddJwtBearer(...)` (or a custom handler delegating to
 `JwtTokenService.ValidateToken`) on whatever surface is meant to accept the token, and add
 an end-to-end test that a `/auth/token`-issued JWT authorizes a protected request — then the
 hybrid model the doc describes is real; or (b) **remove the dead code**: delete
 `/auth/token` and `JwtTokenService.ValidateToken`/`RefreshToken`/`RecordActivity`/
 `ShouldRefresh`/`IsIdleTimedOut` (keep only what the bare-claim cookie path needs, which is
 none of `JwtTokenService`), and rewrite Component-Security.md's "Cookie + JWT Hybrid" /
 "Token Lifecycle" sections to describe the actual bare-claim cookie + `CookieSessionValidator`
 mechanism as the primary model rather than a footnote. Whichever is chosen, the design doc
 and the code must agree on whether a usable JWT exists.
 **Resolution**
 Deferred 2026-06-20: this needs a product decision — either wire a JWT-bearer auth handler so the documented cookie+JWT hybrid is real, or delete `/auth/token` + the orphaned `JwtTokenService` validate/refresh surface and rewrite the design doc to describe the actual bare-claim cookie model. Recorded for that decision.
 ### Security-023 — `LdapGroupMapping.Role` is an unvalidated free string; case-sensitive UI policy vs case-insensitive server authz grants inconsistent access
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.Security/AuthorizationPolicies.cs:144-157`; `src/ZB.MOM.WW.ScadaBridge.Security/RoleMapper.cs:40-42`; `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Security/LdapGroupMapping.cs:10`; `src/ZB.MOM.WW.ScadaBridge.ManagementService/ManagementActor.cs:106`, `:1893` |
 **Description**
 The role string an operator maps an LDAP group to (`LdapGroupMapping.Role`) is a free
 `string` with no validation against the canonical `Roles.*` set anywhere on the write path:
 `ManagementActor` (`:1893`) does `new LdapGroupMapping(cmd.LdapGroupName, cmd.Role)` verbatim,
 and the CLI `security role-mapping create --role <value>`
 (`CLI/Commands/SecurityCommands.cs:157`) accepts any string. The Central UI dropdown
 (`LdapMappingForm.razor`) constrains the *UI* path to the canonical constants, but the CLI/API
 path does not. Two enforcement surfaces then disagree on casing:
 - **Central UI authorization** uses `policy.RequireClaim(JwtTokenService.RoleClaimType, Roles.Deployer)`
  (`AuthorizationPolicies.cs:150`). ASP.NET Core's `RequireClaim` compares claim *values*
  with ordinal (case-**sensitive**) equality.
 - **ManagementActor authorization** uses `user.Roles.Contains(requiredRole, StringComparer.OrdinalIgnoreCase)`
  (`ManagementActor.cs:106`) — case-**insensitive**.
 - **`RoleMapper`** matches the Deployer scope branch with `StringComparison.OrdinalIgnoreCase`
  (`RoleMapper.cs:42`) but stamps the **DB row's verbatim casing** into the role claim.
 So a mapping created via CLI/API as `--role deployer` (lowercase) or a typo like `Deploy`
 produces a principal whose role claim is `"deployer"`: the **ManagementActor/CLI surface
 authorizes it** (case-insensitive), the **Central UI `RequireDeployment` policy denies it**
 (case-sensitive), and for the Deployer case `RoleMapper` still resolves site scope (matching
 case-insensitively) — a user who can deploy via CLI but is locked out of the equivalent UI
 pages, or vice versa, with no error explaining why. A pure typo (`"Deploer"`) silently grants
 a role claim that *no* policy and no ManagementActor check matches — the user appears to hold
 a role yet is denied everywhere, with no validation feedback at mapping-creation time.
 **Recommendation**
 Validate `cmd.Role` against the canonical `Roles.All` set (case-insensitively, normalising to
 the canonical casing) when creating/updating a mapping — in `ManagementActor` (the single
 server-side write path) and reject non-canonical values with a clear error; optionally also
 guard in the CLI for a fast local failure. Separately, make the two authorization surfaces
 agree on casing: either normalise the role claim to canonical casing in `SessionClaimBuilder`/
 `RoleMapper` before it is stamped, or make the ManagementActor check case-sensitive to match
 `RequireClaim` (the safer direction once values are validated). Add a regression test for a
 mis-cased mapping role.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added membership validation rejecting any `cmd.Role` not in `Roles.All` (canonical set, case-insensitive) at LDAP-group-mapping create AND update, before any DB write — closing the unvalidated-free-string gap. The casing/comparison-asymmetry half is a larger change and remains a recorded follow-up (membership check is safe: a non-canonical role never worked). No existing test regressed.
 ### Security-024 — `RequireOperator` / `RequireVerifier` policies have no functional authorization test
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Testing coverage |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.Security/AuthorizationPolicies.cs:153-157`; `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/SecurityTests.cs:1045-1186` (AuthorizationPolicyTests); `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/RolesTests.cs:44-45` |
 **Description**
 The M7 two-person Secured Write feature added the `RequireOperator` and `RequireVerifier`
 policies (`AuthorizationPolicies.cs:153-157`), each a single-role `RequireClaim`. These gate a
 SCADA control-surface write workflow whose entire safety argument is separation of duties, so
 correct grant/deny behaviour is security-critical. But `AuthorizationPolicyTests` — which
 exercises every *other* policy with real `IAuthorizationService.AuthorizeAsync` evaluations
 (Admin/Design/Deployment/OperationalAudit/AuditExport, including the load-bearing
 "Viewer reads but cannot export" SoD case) — has **no** Operator/Verifier test. The only
 coverage is `RolesTests.cs:44-45`, which asserts the *constant string values*
 (`"RequireOperator"`, `"RequireVerifier"`) — not that an `Operator` principal satisfies
 `RequireOperator`, that a `Verifier` does not, or that the two policies are mutually distinct.
 A regression that, say, mapped `RequireOperator` to `Roles.Verifier` (or to
 `OperationalAuditRoles`) would compile and pass the existing tests while silently collapsing
 the separation of duties.
 **Recommendation**
 Add `AuthorizeAsync`-based `[Theory]` cases to `AuthorizationPolicyTests` mirroring the
 existing policy tests: `RequireOperator` succeeds for `Roles.Operator` and fails for
 `Roles.Verifier`/`Administrator`/empty; `RequireVerifier` succeeds for `Roles.Verifier` and
 fails for `Roles.Operator`; and a combined case asserting an Operator-only principal cannot
 satisfy `RequireVerifier` (the SoD invariant at the policy layer).
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added functional `AuthorizeAsync` grant/deny tests for `RequireOperator`/`RequireVerifier` (and a combined SoD-distinctness test), guarding the separation-of-duties policy wiring against regression.
 ### Security-025 — Unit test performs a live network connection to `nonexistent.invalid:9999`
 | | |
 |--|--|
 | Severity | Low |
 | Category | Testing coverage |
 | Status | Resolved |
 | Location | `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/SecurityTests.cs:71-89` (`AuthenticateAsync_ConnectionFailure_FailsClosed_NeverThrows`) |
 **Description**
 `LdapAuthServiceTests.AuthenticateAsync_ConnectionFailure_FailsClosed_NeverThrows` constructs
 the external library's `LdapAuthService` pointed at `Server = "nonexistent.invalid"`,
 `Port = 9999` with a 2-second `ConnectionTimeoutMs`, then asserts the result fails closed.
 Although the test relies on the connection *failing*, it still performs a real DNS resolution
 and TCP connect attempt from the unit-test process. In an offline or network-sandboxed CI
 environment the resolution/connect can behave differently (immediate failure vs. the full
 2-second timeout vs. a captive-portal redirect), making the test both **slow** (up to ~2s of
 wall-clock dead time) and **environment-dependent**. Unit tests in this suite are otherwise
 network-free (the sibling cases are explicitly named `_NoNetwork`). The behaviour under test
 (fail-closed mapping of an unreachable directory) is really the external library's
 responsibility and is, per the file's own header comment, already covered by the library's own
 test suite through its internal connection seam.
 **Recommendation**
 Drop this case from the ScadaBridge unit suite (the library owns the fail-closed contract via
 its internal `ILdapConnection` seam, as the region comment notes), or move it to an explicitly
 opt-in integration-test category so it does not run in the default network-free unit pass. If
 kept as a smoke test, point it at a guaranteed-unroutable loopback address (e.g.
 `127.0.0.1` on a closed port) to bound the failure deterministically rather than relying on
 DNS resolution of `.invalid`.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): the fail-closed connection test now targets `127.0.0.1` on a closed port instead of `nonexistent.invalid:9999` — deterministic loopback connection-refused, no external DNS/timeout dependency. Same assertion preserved.
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` |
 | Design doc | `docs/requirements/Component-SiteCallAudit.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 2 |
+| Open findings | 0 |
 ## Summary
@@ -42,6 +42,23 @@ tests using a shared `MsSqlMigrationFixture`.
 | 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
 | 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 Since the `1eb6e97` baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A — `OnReconciliationTickAsync` → `ReconcileSiteAsync`, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B — `OnPurgeTickAsync` → `PurgeTerminalAsync`, continue-on-error). All of the new code is clean, idiomatic, and well-tested (dedicated `SiteCallAuditReconciliationTests`, `SiteCallAuditPurgeTests`, `SiteCallRelayTests`, KPI-sample and options suites); the M6 `SiteCallAuditKpiSampleSource` correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one. The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present — currently masked because `Program.cs` co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores `MoreAvailable` so a single-timestamp batch saturation re-pulls without progress). No correctness, concurrency, security, or resilience defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps every edge case safe-by-construction (wasted work, never corruption).
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores `MoreAvailable`; a single-`UpdatedAtUtc` batch saturation re-pulls without progress (Finding 009 — wasted work only, idempotent upsert prevents corruption). Finding 003 (`IngestedAtUtc`) resolved — now stamped in `OnUpsertAsync` and `ReconcileSiteAsync`. |
 | 2 | Akka.NET conventions | Yes | `Sender` captured before first await on every async handler; `PipeTo` used for all read/relay replies; self-ticks scheduled via `ScheduleTellRepeatedlyCancelable` with `Self` sender, cancelled in `PostStop`. No issues found. |
 | 3 | Concurrency & thread safety | Yes | `_reconciliationCursors`, `_centralCommunication`, timers all mutated only on the actor thread. Per-tick/per-message DI scope (`CreateAsyncScope` for the async tick paths, `CreateScope` for sync read paths) disposed in `finally` / `await using`. No issues found. |
 | 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies `Accepted=false`. Deliberate coarse-retry divergence from `SiteAuditReconciliationActor` (no per-row abandon) is documented and safe given idempotent upsert. No issues found. |
 | 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond `SourceSite` (a site id). No issues found. |
 | 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; `MaxPageSize=200` clamp; async DbContext disposal off the dispatcher. `MoreAvailable`-ignoring re-pull is bounded wasted work (Finding 009). |
 | 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). |
 | 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository — a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). `KpiSampleSource` registered via `TryAddEnumerable`; options owned by the component. |
 | 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (`MoreAvailable: true`) or the single-timestamp no-progress edge — every reconciliation test uses `MoreAvailable: false` (relates to Finding 009). |
 | 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. Reconciliation `ReconcileSiteAsync` XML claims parity with `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling reads it for stalled detection; this actor ignores it) — Finding 009 recommendation includes a doc correction. |
 ## Findings
 ### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
@@ -323,3 +340,221 @@ Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
 stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
 page-size-1 query; (c) asserts each page returns exactly the stuck row, with
 no overlap and all 3 stuck rows visited.
 ### SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Code organization & conventions |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227` |
 **Description**
 `StartPurgeTimer` (lines 307-321) gates the daily terminal-row purge tick on
 the **reconciliation** collaborators being present:
 ```csharp
 private void StartPurgeTimer()
 {
    if (_pullClient is null || _siteEnumerator is null)
    {
        return;
    }
    // ... schedule PurgeTick ...
 }
 ```
 But the purge pass (`OnPurgeTickAsync` → `PurgeWithRepositoryAsync` →
 `ISiteCallAuditRepository.PurgeTerminalAsync`) needs only the repository — it
 has no dependency on `IPullSiteCallsClient` / `ISiteEnumerator`. The two
 collaborators are resolved by the production constructor (lines 202-227) via
 `serviceProvider.GetService<IPullSiteCallsClient>()` /
 `GetService<ISiteEnumerator>()` — both registered by
 `AddAuditLogCentralReconciliationClient`
 (`AuditLog/ServiceCollectionExtensions.cs:473,523`, registered as `ISiteEnumerator`
 and `IPullSiteCallsClient`). `GetService` (not `GetRequiredService`) returns
 `null` if that helper was never called, so a host that registers
 `AddSiteCallAudit()` **without** also calling
 `AddAuditLogCentralReconciliationClient(...)` constructs the actor with both
 collaborators null. In `PreStart` both `StartReconciliationTimer` and
 `StartPurgeTimer` early-return, so the actor runs forever with **no purge timer
 at all** → unbounded growth of the central `SiteCalls` table, with no log line
 to say the purge was skipped.
 This is **currently latent, not live**: `Program.cs` co-registers
 `AddAuditLogCentralReconciliationClient(builder.Configuration)` (line 107)
 immediately before `AddSiteCallAudit()` (line 113), so on the real central node
 both collaborators resolve and both timers run today. The risk is a future host
 (or a refactor that splits the reconciliation client out of the central
 composition root, or a test/embedded host that wants only ingest + purge)
 silently losing the purge with no diagnostic. The gate exists for a legitimate
 reason — keeping the repo-only test ctor free of *both* background timers so the
 MSSQL read/upsert tests see no scheduled side effects — but it conflates "no
 reconciliation route" with "no purge", and the actor's own XML
 (lines 36-38) documents the coupling as deliberate rather than flagging it as a
 hazard.
 **Recommendation**
 Decouple `StartPurgeTimer` from the reconciliation collaborators — purge needs
 only a repository, which both the production and the repo-only test ctors
 always have. Two viable shapes:
 - Preferred: gate the purge timer on its own real precondition (a repository is
  always available, so arm it unconditionally in the production + reconciliation
  ctors; keep it off only in the repo-only test ctor via an explicit
  "background timers off" flag rather than by proxy of the reconciliation
  collaborators). This keeps the MSSQL test isolation while removing the
  accidental coupling.
 - At minimum: log a `Warning` in `StartPurgeTimer` (and `StartReconciliationTimer`)
  when the timer is **not** armed on the production path, so a misconfigured host
  surfaces "SiteCallAudit purge timer not started — `PurgeTerminalAsync` will
  never run" instead of growing the table silently.
 Severity is a judgment call: Medium because the consequence (unbounded central
 table growth) is real and silent, but it is latent today (the only production
 composition root co-registers the reconciliation client), so an argument for Low
 is defensible. Flagged Medium to weight the silent-failure + data-growth aspect.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): decoupled the daily purge timer from the reconciliation collaborators (a new `_backgroundTimersEnabled` flag) so a central node that omits the reconciliation client still purges — no more silent unbounded `SiteCalls` growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client).
 ### SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted
 | | |
 |--|--|
 | Severity | Low |
 | Category | Design-document adherence |
 | Status | Resolved |
 | Location | `docs/requirements/Component-SiteCallAudit.md:149-154`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62` |
 **Description**
 The design doc (`Component-SiteCallAudit.md` lines 149-154, the KPI History
 interaction) states:
 > "… the resulting `buffered` / `parked` / `failedLastInterval` /
 > `deliveredLastInterval` / `stuck` / `oldestPendingAgeSeconds` series render as
 > trends on the Site Calls page via `KpiTrendChart`."
 That lists **six** series as charted. In the code only **three** are charted:
 - The public charted catalog `KpiMetrics.SiteCallAudit`
  (`KpiMetrics.cs:52-62`) exposes exactly `Buffered`, `Parked`, and
  `FailedLastInterval` — and its own XML says "Charted Site Call Audit (#22)
  metrics … Rendered by the Central UI Site Calls report trend panel."
 - `SiteCallAuditKpiSampleSource` (lines 42-47) keys those three off the public
  Commons catalog (`KpiMetrics.SiteCallAudit.*`) and keeps the other three —
  `deliveredLastInterval`, `stuck`, `oldestPendingAgeSeconds` — as **private**
  string literals, with the comment "Charted metrics share the public Commons
  catalog … the uncharted internal metrics stay private here (#178)."
 - The UI confirms it: `SiteCalls/SiteCallsReport.razor.cs` calls
  `LoadSeriesAsync` for exactly `KpiMetrics.SiteCallAudit.Buffered`, `.Parked`,
  and `.FailedLastInterval` — three series, no more.
 So all six metrics are *sampled* into the `KpiSample` history store, but only
 three are *charted*. The doc reads as if all six are rendered, which is a small
 design-doc drift (over-claiming the UI surface). The code is internally
 consistent and correctly commented; the doc is the stale party.
 **Recommendation**
 One-line doc edit on `Component-SiteCallAudit.md` lines 149-154: state that the
 three series `buffered` / `parked` / `failedLastInterval` render as trends on
 the Site Calls page via `KpiTrendChart`, and that `deliveredLastInterval` /
 `stuck` / `oldestPendingAgeSeconds` are **sampled into the KPI-history store but
 not charted** (available for future trend panels / ad-hoc query). No code change
 — the code is already the intended state.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the `KpiMetrics` catalog. Doc-only.
 ### SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores `MoreAvailable`; a single-timestamp batch saturation re-pulls without progress
 | | |
 |--|--|
 | Severity | Low |
 | Category | Correctness & logic bugs |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535` |
 **Description**
 `ReconcileSiteAsync` (lines 505-535) pulls rows since the per-site cursor,
 upserts each, and advances the cursor to the maximum `UpdatedAtUtc` observed:
 ```csharp
 var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue;
 var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...);
 var maxUpdated = since;
 foreach (var row in response.SiteCalls)
 {
    await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc });
    if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
 }
 _reconciliationCursors[site.SiteId] = maxUpdated;
 ```
 Two interacting properties:
 1. The pull is **inclusive** — `PullAsync(site, since, …)` asks for rows with
   `UpdatedAtUtc >= since` (documented in the method's XML, lines 497-503), and
   the cursor advances to the **max** `UpdatedAtUtc` seen, which is itself one
   of the rows just pulled. So the boundary row is re-pulled every tick and
   deduped by the idempotent monotonic upsert — intended, harmless.
 2. `response.MoreAvailable` is **never read** at all. The `PullSiteCallsResponse`
   carries `MoreAvailable` (`PullSiteCallsResponse.cs:14-17`: "True when the
   site saturated the requested batch size — the caller should advance the
   cursor and pull again"), but `ReconcileSiteAsync` ignores it and relies on
   the natural tick cadence to drain the backlog over successive ticks.
 The edge case: if a site has **more rows than the batch size all sharing one
 exact `UpdatedAtUtc`** (e.g. a burst written in the same tick / same clock
 value), the saturated batch returns rows whose max `UpdatedAtUtc` equals
 `since`. `maxUpdated` therefore stays at `since`, the cursor does **not**
 advance, and because the pull is inclusive the next tick re-pulls the identical
 window — and again, and again — making no forward progress on that site's
 backlog. Because the upsert is idempotent and the `SiteCalls` table is an
 eventually-consistent mirror (not the source of truth), this is **wasted work,
 never corruption** — but it is an unbounded re-pull loop on a pathological
 input, and any rows in that backlog beyond the batch ceiling never get
 reconciled.
 The sibling `SiteAuditReconciliationActor` shares the inclusive-cursor /
 max-timestamp shape, so the same single-timestamp-saturation no-progress edge
 applies to it — but that sibling **does read `MoreAvailable`** (it feeds its
 stalled-detection state machine, `SiteAuditReconciliationActor.cs:324-325`,
 publishing `SiteAuditTelemetryStalledChanged` so a non-draining site surfaces a
 health signal). `ReconcileSiteAsync`'s XML (lines 530-534) claims the
 no-immediate-re-pull behaviour "match[es] `SiteAuditReconciliationActor`", which
 is only half true: it matches the cursor cadence but diverges by dropping
 `MoreAvailable` entirely, so this actor has neither a continuation pull nor a
 stalled signal — a saturated site lags silently with no observability.
 **Recommendation**
 - Consume `MoreAvailable`: either continue draining within the same tick while
  `MoreAvailable` is true (bounded by a max-iterations guard), or — matching the
  sibling — surface a stalled/non-draining signal when a batch comes back
  saturated so a stuck site is observable rather than silent.
 - Defend the single-timestamp no-progress edge with a tiebreaker beyond the
  timestamp (e.g. advance on `(UpdatedAtUtc, TrackedOperationId)` as a composite
  keyset cursor, and ask the pull for rows strictly after that composite), so a
  burst sharing one `UpdatedAtUtc` cannot pin the cursor.
 - Correct the `ReconcileSiteAsync` XML (lines 530-534): it claims parity with
  `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling
  reads it for stalled detection; this actor ignores it).
 Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data
 corruption, and the saturated-single-timestamp input is pathological; the cost
 is wasted re-pulls and an un-drained backlog tail on that one input, plus the
 missing observability.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): `ReconcileSiteAsync` now consumes `response.MoreAvailable` via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added.
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging` |
 | Design doc | `docs/requirements/Component-SiteEventLogging.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 2 |
+| Open findings | 0 |
 ## Summary
@@ -101,6 +101,36 @@ _Re-review (2026-05-28, `1eb6e97`):_
 | 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
 | 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 Re-reviewed the module at commit `4307c381`. All twenty-three prior findings are still
 recorded; twenty-two of their resolutions hold up under inspection — **but one does not**.
 SiteEventLogging-016 (**High**, the `From`/`To` UTC-normalisation defect) is marked
 `Resolved` and its Resolution text claims `EventLogQueryService.ExecuteQuery` now calls
 `.ToUniversalTime()` before `ToString("o")` and that a regression test
 `Query_FiltersByTimeRange_HandlesNonUtcOffset` was added — **neither is true at this
 commit**. `EventLogQueryService.cs:77,83` still stringifies `request.From`/`request.To`
 verbatim with `.ToString("o")` and no UTC normalisation, and a repo-wide search finds no
 test by that name. The claimed -016 fix was never committed; the High-severity defect is
 live. This is re-opened as SiteEventLogging-024 (cross-referencing -016) so the audit trail
 shows the resolution was asserted but never landed. Four new findings were recorded: -024
 (the never-landed -016 fix, High), -025 (synchronous severity validation faults
 fire-and-forget callers, Medium), -026 (purge active-node gate diverges from the query
 singleton's placement, Medium), and -027 (the time-range test masks -024, Low).
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ☑ | **-024**: `From`/`To` filters still compare non-UTC-normalised ISO 8601 strings against UTC-stored timestamps — the SiteEventLogging-016 fix was never committed (code at `EventLogQueryService.cs:77,83` is unchanged; the claimed regression test does not exist). |
 | 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` remains a simple `Receive`/`Tell` bridge; no new findings. |
 | 3 | Concurrency & thread safety | ☑ | Lock-guarded `WithConnection` pattern is correct; no new findings. |
 | 4 | Error handling & resilience | ☑ | **-025**: `LogEventAsync` validates severity/args by throwing synchronously, bypassing the documented "faults the returned Task" contract for fire-and-forget callers. |
 | 5 | Security | ☑ | Queries fully parameterised; `PageSize` now clamped (-017). No new findings. |
 | 6 | Performance & resource management | ☑ | Bounded write queue (-015) and `PageSize` clamp (-017) in place; no new findings. |
 | 7 | Design-document adherence | ☑ | **-026**: the purge active-node gate (`SelfIsPrimary` = cluster leader) can diverge from the query singleton's placement (oldest member of role), leaving the queried node's DB unpurged / over-cap. |
 | 8 | Code organization & conventions | ☑ | Concrete-recorder DI wiring correct; no new findings. |
 | 9 | Testing coverage | ☑ | **-027**: `Query_FiltersByTimeRange` asserts count only with UTC-only inputs, masking -024; no non-UTC-offset regression test exists. |
 | 10 | Documentation & comments | ☑ | No new documentation findings beyond those above. |
 ## Findings
 ### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -1144,3 +1174,265 @@ Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or cha
 `stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`.
 `CancellationTokenSource` is the canonical .NET pattern and also lets the test
 cooperate with xUnit's `Task.WhenAll` timeout.
 ### SiteEventLogging-024 — `From`/`To` filters still compare non-normalised ISO 8601 strings against UTC-stored timestamps (the SiteEventLogging-016 fix was never committed)
 | | |
 |--|--|
 | Severity | High |
 | Category | Correctness & logic bugs |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/EventLogQueryService.cs:77,83` |
 **Description**
 This is a re-open of SiteEventLogging-016. That finding was marked **Resolved**
 (2026-05-28) with a Resolution stating that `EventLogQueryService.ExecuteQuery` now calls
 `.ToUniversalTime()` on `request.From`/`request.To` before `ToString("o")`, that
 `EventLogPurgeService.PurgeByRetention` was made defensive with an explicit
 `.ToUniversalTime()`, and that a regression test
 `Query_FiltersByTimeRange_HandlesNonUtcOffset` was added. **None of that is present at
 commit `4307c381`.** The code at `EventLogQueryService.cs` is byte-for-byte the original
 defective form:
 ```
 if (request.From.HasValue)
 {
    whereClauses.Add("timestamp >= $from");
    parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));   // line 77
 }
 if (request.To.HasValue)
 {
    whereClauses.Add("timestamp <= $to");
    parameters.Add(new SqliteParameter("$to", request.To.Value.ToString("o")));       // line 83
 }
 ```
 There is no `.ToUniversalTime()` (or `.UtcDateTime`) on either bound, and a repo-wide
 search for `Query_FiltersByTimeRange_HandlesNonUtcOffset` returns zero matches — the
 claimed regression test does not exist. The -016 resolution was asserted but never landed
 in source, so the original High-severity defect is **live**.
 The mechanism is exactly as -016 described. Event rows are persisted with
 `timestamp = DateTimeOffset.UtcNow.ToString("o")` (`SiteEventLogger.cs:202`), which always
 emits the round-trip ISO 8601 form ending in the literal offset `+00:00`
 (e.g. `2026-06-20T12:34:56.7890123+00:00`). `request.From`/`request.To` are
 `DateTimeOffset?`, and `ToString("o")` preserves whatever offset the caller passed. A
 central client in, say, `UTC+05:00` that filters with `DateTimeOffset.Now` produces
 `"2026-06-20T17:34:56.0000000+05:00"`, which under SQLite's default `BINARY` collation
 sorts lexicographically *greater* than the equivalent UTC instant string
 `"2026-06-20T12:34:56.0000000+00:00"`. The `timestamp >= $from` / `timestamp <= $to`
 comparison is then a byte-by-byte string compare, so the query silently includes events
 from the wrong hour or excludes events that genuinely fall in the window. The design
 states "All timestamps are UTC throughout the system", but the central-`DateTimeOffset`
 → SQLite boundary does not enforce it. A central UI rendered in a non-UTC timezone is the
 most likely trigger, and it corrupts precisely the "show me what happened around the
 failover" time-range query that operators most often run. The retention purge
 (`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")`) is UTC by construction and remains
 safe; only the central query path is vulnerable.
 **Recommendation**
 Apply the one-line fix on each bound that -016 specified but that was never committed:
 ```
 parameters.Add(new SqliteParameter("$from", request.From.Value.ToUniversalTime().ToString("o")));
 parameters.Add(new SqliteParameter("$to",   request.To.Value.ToUniversalTime().ToString("o")));
 ```
 so the produced offset is always `+00:00` and the comparison is lexicographically sound
 against the UTC-stored strings. Add the regression test that -016 claimed but never
 delivered (see SiteEventLogging-027): construct a `From`/`To` carrying a non-zero offset
 (e.g. `+05:00`) and assert the matching UTC-stored events are returned and out-of-range
 ones excluded, asserting on returned row identities rather than just count. Optionally
 store `timestamp` as Unix-epoch `INTEGER` to eliminate the lexicographic-comparison hazard
 structurally. When fixing, also reconcile SiteEventLogging-016's Resolution text with
 reality (its commit reference was `<pending>` and the fix never merged).
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): `EventLogQueryService` now calls `.ToUniversalTime()` on the `From`/`To` `DateTimeOffset` bounds before `.ToString("o")`, so the comparison strings are UTC and match the UTC-stored values for any non-UTC client offset. This is the fix that the closed -016 claimed but never committed. Regression test added (verified failing pre-fix).
 ### SiteEventLogging-025 — `LogEventAsync` severity/argument validation throws synchronously, bypassing the "faults the returned Task" contract for fire-and-forget callers
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Error handling & resilience |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/SiteEventLogger.cs:187-199` |
 **Description**
 `LogEventAsync` is documented as a fire-and-forget enqueue whose returned `Task`
 "completes once the event is durably persisted and faults if the write fails"
 (`SiteEventLogger.cs:24-29`), and the disposed-drop paths were deliberately reworked under
 SiteEventLogging-012 to *fault the Task* rather than throw. But the argument and severity
 validation at the top of the method throws **synchronously** on the caller's thread before
 any `Task` is returned:
 ```
 ArgumentException.ThrowIfNullOrWhiteSpace(eventType);
 ArgumentException.ThrowIfNullOrWhiteSpace(severity);
 ArgumentException.ThrowIfNullOrWhiteSpace(source);
 ArgumentException.ThrowIfNullOrWhiteSpace(message);
 if (!AllowedSeverities.Contains(severity))      // SiteEventLogging-020 closed set
 {
    throw new ArgumentException(
        $"Severity '{severity}' is not one of the allowed values: Info, Warning, Error.",
        nameof(severity));
 }
 ```
 Every recording caller invokes this fire-and-forget as `_ = _eventLogger.LogEventAsync(...)`
 (no `await`, return discarded), so a thrown `ArgumentException` does **not** flow into the
 discarded `Task` — it propagates straight up the calling actor's message-handling stack.
 For the recorder hot-path callers (the computed `AlarmActor`, the `DataConnectionActor`,
 the `ScriptActor`/`ScriptExecutionActor`) an unhandled synchronous throw would fault the
 actor and trip its supervision strategy, when the documented and intended behaviour for a
 best-effort audit write is that a bad event is dropped/faulted, never that it crashes the
 subsystem doing the logging.
 This is **currently latent**: every production call site passes a hard-coded canonical
 `severity` literal (`"Info"`/`"Warning"`/`"Error"`) and non-empty `eventType`/`source`/
 `message`, so the throw path is never reached today. The exposure is a future caller that
 passes a dynamic, computed, or typo'd severity (e.g. `"error"`, `"ERR"`, a value derived
 from an alarm payload) — the SiteEventLogging-020 closed-set check, added defensively, then
 becomes a crash vector for the very actor it is meant to protect. The mismatch is that
 -012 chose Task-faulting for one drop reason (disposal) while validation kept synchronous
 throwing for another (bad input), so the method's failure contract is now inconsistent.
 **Recommendation**
 Make the failure mode uniform with the rest of the method: return a faulted `Task` for
 validation failures instead of throwing synchronously — e.g.
 `return Task.FromException(new ArgumentException(...))` for the severity check (and,
 consistently, for the null/whitespace guards) — so a fire-and-forget caller is never
 crashed by a bad audit-event payload and an `await`-ing caller still observes the failure.
 Alternatively, accept a `severity` enum on `ISiteEventLogger.LogEventAsync` and convert to
 the canonical string at the boundary, eliminating the runtime-validation throw entirely.
 The exact fix shape is a judgment call — faulted-Task is the minimal-diff change that
 matches the documented contract; the enum is the stronger design but a wider
 interface/caller change. Whichever is chosen, update the XML doc to state the failure
 semantics for invalid input.
 **Resolution**
 Deferred 2026-06-20: the fix shape (return a faulted Task vs. validate against a severity enum) is a design decision, and the defect is latent (every current caller passes a canonical severity literal). Recorded; no change this pass.
 ### SiteEventLogging-026 — Purge active-node gate (cluster leader) can diverge from the query singleton's placement (oldest member of role)
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs:116-120`, `src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterNodeProvider.cs:29-39`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:889-895` |
 **Description**
 SiteEventLogging-019 closed the "purge runs on every node" finding by gating
 `EventLogPurgeService` on an optional `SiteEventLogActiveNodeCheck`. The Host wires that
 check to `IClusterNodeProvider.SelfIsPrimary` (`SiteServiceRegistration.cs:116-120`), which
 returns true only when "this node is `Up` AND is the **cluster leader**"
 (`AkkaClusterNodeProvider.cs:29-39`: `cluster.State.Leader.Equals(cluster.SelfAddress)`).
 The query path, however, is served by the `event-log-handler` **cluster singleton**
 (`AkkaHostedService.cs:889-895`), which Akka.NET pins to the **oldest member of the role**,
 not the cluster leader. These are two different election concepts:
 - The Akka **cluster leader** is the lowest-address member in the `Up`/`Leaving` state set
  for the *whole cluster* (role-agnostic) — it is what `SelfIsPrimary` tests.
 - A **cluster singleton** is hosted on the *oldest* member of its configured role
  (`siteRole`) — it is what owns the event-log query DB.
 In a steady-state 2-node site cluster these usually coincide, so the gate works in
 practice. But they are not guaranteed equal — most plausibly during membership churn /
 after a failover, when the oldest-member computation and the leader election can settle on
 different nodes for a window. When they diverge, the query singleton reads node X's
 SQLite DB while the purge runs on node Y. The result is that the node whose DB is actually
 queried (the populated, active one) is left **unpurged** — its 30-day retention sweep and
 1 GB cap-purge do not run — so the queried database can drift over the documented 1 GB cap
 and retain rows past the 30-day window, which is exactly the invariant the storage section
 of Component-SiteEventLogging requires. The SiteEventLogging-004 re-triage note already
 established the correct co-location principle: the *query* singleton and the
 *deployment-manager* singleton share `siteRole` and so co-locate on the oldest member; the
 purge gate keyed on a *different* signal (`SelfIsPrimary`/leader) silently breaks that
 co-location for the purge.
 **Recommendation**
 Re-gate the purge on the **same** ownership signal as the query singleton so the two never
 diverge — i.e. gate on "this node hosts the `event-log-handler` singleton" (host the purge
 inside that same cluster singleton, or test oldest-member-of-`siteRole` rather than cluster
 leader). Alternatively, drop the purge gate entirely and let every node purge its own local
 DB: the SiteEventLogging-019 and -004 analysis already showed standby purge is harmless
 (the standby receives no writes), so running it on both nodes guarantees the queried DB is
 always purged at the cost of a no-op sweep on the standby. The direction is a **judgment
 call** — re-gating on singleton ownership keeps the "active node only" intent but adds
 coupling to the singleton's lifecycle; dropping the gate is simpler and provably correct
 for the cap/retention invariant but reverts the -019 "active node only" optimisation.
 Whichever is chosen, the gate (`SelfIsPrimary` = leader) and the singleton (oldest member)
 must not be left keyed on two different elections.
 **Resolution**
 Deferred 2026-06-20: the direction (re-gate purge on singleton ownership vs. drop the gate per the -019 analysis) is a design-owner decision. Divergence between the Akka leader and the oldest-member singleton is a churn-window condition. Recorded; no change this pass.
 ### SiteEventLogging-027 — Time-range test asserts count only with UTC-only inputs, masking SiteEventLogging-024
 | | |
 |--|--|
 | Severity | Low |
 | Category | Testing coverage |
 | Status | Resolved |
 | Location | `tests/ZB.MOM.WW.ScadaBridge.SiteEventLogging.Tests/EventLogQueryServiceTests.cs:197-214` |
 **Description**
 `Query_FiltersByTimeRange` is the only test covering the `From`/`To` time-range filter, and
 it cannot catch SiteEventLogging-024 by construction:
 ```
 var now = DateTimeOffset.UtcNow;                                  // UTC-only input
 InsertEventAt(now.AddHours(-2),   "script", "Info", null, "S1", "Old event");
 InsertEventAt(now.AddMinutes(-30),"script", "Info", null, "S2", "Recent event");
 InsertEventAt(now,                "script", "Info", null, "S3", "Now event");
 var response = _queryService.ExecuteQuery(MakeRequest(
    from: now.AddHours(-1),
    to:   now.AddMinutes(1)));
 Assert.True(response.Success);
 Assert.Equal(2, response.Entries.Count);                          // count only
 ```
 Both shortcomings hide the live -024 defect:
 1. **UTC-only inputs.** `now` is `DateTimeOffset.UtcNow`, so `from`/`to` already carry a
   `+00:00` offset. `ToString("o")` then produces a correctly-comparable string and the
   missing `.ToUniversalTime()` is never exercised. The defect only manifests for a
   `DateTimeOffset` with a *non-zero* offset, which this test never constructs.
 2. **Count-only assertion.** It asserts `response.Entries.Count == 2` but never checks
   *which* rows came back. A boundary off-by-an-hour bug that swapped one in-range row for
   an out-of-range row (the -024 symptom) could still yield a count of 2 and pass.
 The net effect is a green test suite that gives false confidence the time-range filter is
 correct — and almost certainly contributed to SiteEventLogging-016 being marked Resolved
 when its fix was never committed.
 **Recommendation**
 Add the non-UTC-offset regression test that SiteEventLogging-016 claimed but never
 delivered (`Query_FiltersByTimeRange_HandlesNonUtcOffset`): construct `From`/`To` as a
 `DateTimeOffset` with a non-zero offset (e.g. `+05:00`) bracketing UTC-stored events, and
 assert on the **identities** of the returned rows (e.g. by `Source`/`Message` or `Id`), not
 just `Count`, so a window shifted by the offset is detected. Strengthen the existing
 `Query_FiltersByTimeRange` likewise to assert which rows are returned. These tests should
 fail against the current code and pass once SiteEventLogging-024 is fixed.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added `Query_FiltersByTimeRange_HandlesNonUtcOffset` — seeds known UTC instants, queries with a +05:00 offset, and asserts the returned row identities (not just count). Verified failing pre-fix, passing post-fix.
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` |
 | Design doc | `docs/requirements/Component-SiteRuntime.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 3 |
+| Open findings | 0 |
 ## Summary
@@ -109,6 +109,40 @@ _Re-review (2026-05-28, `1eb6e97`):_
 ## Findings
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 The module was re-reviewed in full at `4307c381` (current HEAD). The diff against the
 prior baseline `1eb6e97` shows as 100% additions because of the ScadaBridge rename
 (the project moved paths), so the whole module was re-read at its current state rather
 than relying on the diff. Health is good: all prior findings 001–026 remain
 Resolved/Deferred with no regressions observed in their fixed call sites
 (SetAttribute DCL routing, the watch+buffer redeploy, the SiteRuntime-020 terminating
 shadow, the OperationTrackingStore read/write split + safe Dispose, the
 AuditingDb `Inner` accessor, invariant-culture numeric parsing, the
 `_attributes` child-snapshot isolation). The new M7 surface (NativeAlarmActor,
 CertStoreActor) and the WaitForAttribute/batch-write additions are generally
 well-built. Five new findings were recorded, all in the native-alarm and
 deployment-lifecycle bookkeeping: an unbounded `_latestAlarmEvents` map on the
 Instance Actor (027, Medium), a phantom-active alarm left behind by
 `NativeAlarmActor.EnforceCap` (028, Medium), a delete-during-pending-redeploy
 that both over-decrements the deployed counter and resurrects the deleted
 instance (029, Medium), missing tests for the cap/retention/replication paths
 (030, Low), and stale site-side persistence of notification-list + SMTP config
 that the design decision moved to central-only (031, Low).
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ✓ | EnforceCap leaves phantom-active alarm (028); delete-during-redeploy over-decrements + resurrects (029). HiLo/RoC/edge logic otherwise sound. |
 | 2 | Akka.NET conventions | ✓ | Supervision (Resume coordinators / Stop short-lived) correct; execution actors on ScriptExecutionScheduler; cert broadcast uses PipeTo, captures `sender` before async. Trigger-eval mailbox-block stays Deferred (014). |
 | 3 | Concurrency & thread safety | ✓ | Child-snapshot isolation (017) intact; `_createdConnections` actor-confined via ApplyArtifact… dispatch (021); OperationTrackingStore read/write split (024). No new actor-thread violations. |
 | 4 | Error handling & resilience | ✓ | Best-effort persistence/replication paths log-on-fault; deploy reports Success only post-persist (005). Native source-unavailable retains+marks-uncertain. |
 | 5 | Security | ✓ | Trust verdict delegated to shared ScriptTrustValidator; CertStoreActor path-traversal-guarded; no reflection in scripts; SQL params captured per M4 design (redaction at write time). |
 | 6 | Performance & resource management | ✓ | `_latestAlarmEvents` unbounded growth on native-alarm churn (027). ScriptExecutionScheduler background threads OK; per-call SQLite connections acceptable. |
 | 7 | Design-document adherence | ✓ | Actor hierarchy / native-alarm wiring conform. Stale site persistence of notification-list + SMTP config vs. "central-only, not deployed to sites" (031). |
 | 8 | Code organization & conventions | ✓ | Reflection anti-pattern eliminated (006/022); Options owned by component; additive message evolution honoured. |
 | 9 | Testing coverage | ✓ | Broad new suites (NativeAlarmActor, WaitForAttribute, SetAttribute, ExecutionActor, scheduler). EnforceCap/retention-drop + SiteReplicationActor still untested (030). |
 | 10 | Documentation & comments | ✓ | XML docs accurate and extensive; ReplicationMessages now documented (026). No new stale comments found. |
 ### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
 | | |
@@ -1370,3 +1404,237 @@ the direction (outbound to peer / inbound from peer) and what is replicated.
 The two pre-existing group-header XML blocks were converted to plain `//`
 comments to avoid orphaned doc-summaries above the first record in each group.
 Marker-base-type idea left out of scope.
 ### SiteRuntime-027 — `InstanceActor._latestAlarmEvents` grows without bound as native conditions churn
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Performance & resource management |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:67`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:1007` |
 **Description**
 `HandleAlarmStateChanged` stores the latest enriched `AlarmStateChanged` per
 `changed.AlarmName` into `_latestAlarmEvents` (line 1013) and **never removes
 anything from that dictionary**. For computed alarms the key set is bounded (one
 entry per configured alarm). For **native** alarms, however, the key is the
 per-condition `SourceReference` (every `NativeAlarmActor.Emit` stamps
 `AlarmName = t.SourceReference`). When a native condition fully runs its course —
 `NativeAlarmActor.ApplyLiveTransition` drops it from the mirror once
 `!Active && Acknowledged` (NativeAlarmActor.cs:243), or `EnforceCap` evicts the
 oldest — the Instance Actor still holds the final `AlarmStateChanged` for that
 `SourceReference` forever as a (Normal) entry. There is no message back to the
 Instance Actor to evict it.
 On an OPC UA A&C source that mints a fresh `SourceReference` per occurrence
 (conditionId/branchId style references are common), the map accumulates one
 permanently-retained entry per distinct condition the instance has *ever* seen.
 The `MirroredAlarmCapPerSource` cap (default 1000) bounds the
 `NativeAlarmActor._alarms` set but does **not** bound `_latestAlarmEvents`, which
 is on the long-lived Instance Actor. Over weeks of uptime on a busy site this is a
 steadily-growing per-instance memory leak, and `BuildAlarmStatesSnapshot`
 (line 1055, `_latestAlarmEvents.Values.ToList()`) makes every DebugView snapshot
 proportionally larger and slower, returning thousands of stale Normal rows.
 **Recommendation**
 Evict from `_latestAlarmEvents` when a native condition reaches its terminal
 return-to-normal+dropped state. The cleanest signal is the `AlarmStateChanged`
 the `NativeAlarmActor` already emits on drop-out: have the `NativeAlarmActor`
 stamp a "final/dropped" marker (an additive bool on `AlarmStateChanged`, or a
 dedicated `NativeAlarmDropped(sourceReference)` Tell to the parent), and have
 `HandleAlarmStateChanged` `_latestAlarmEvents.Remove(...)` on it. Alternatively
 cap/prune `_latestAlarmEvents` for native keys mirroring
 `MirroredAlarmCapPerSource`. Computed-alarm keys must stay (they are
 configuration-bounded).
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added an additive `NativeAlarmDropped(SourceReference)` Tell emitted by `NativeAlarmActor` at all terminal-drop sites (snapshot-swap removal, retention drop, cap eviction) after the return-to-normal emit; `InstanceActor.HandleNativeAlarmDropped` removes the native key from `_latestAlarmEvents` (and `_alarmStates`/`_alarmTimestamps`). Computed-alarm keys are never dropped.
 ### SiteRuntime-028 — `NativeAlarmActor.EnforceCap` drops a condition without a return-to-normal, leaving a phantom Active alarm
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/NativeAlarmActor.cs:281` |
 **Description**
 `EnforceCap` evicts the oldest mirrored conditions once the per-source cap is
 exceeded: it removes them from `_alarms`, calls `PersistDelete`, and logs the
 eviction — but it does **not** `Emit` a return-to-normal `AlarmStateChanged` for
 the dropped condition. If the evicted condition was still `Active`, the Instance
 Actor's `_latestAlarmEvents` (and, downstream, central's gRPC stream and the
 operator Alarm Summary page) keep showing it as **Active** indefinitely. The
 mirror has silently forgotten the condition, so no later transition can ever clear
 it — a phantom stuck-active alarm.
 This is inconsistent with the sibling drop path `ApplySnapshotSwap`
 (NativeAlarmActor.cs:205), which correctly emits `prior.Condition with { Active =
 false }` for every condition that falls out of the new snapshot. The retention
 drop in `ApplyLiveTransition` (line 243) is safe only because it drops *after*
 emitting the condition's own already-inactive state; `EnforceCap` drops a
 condition whose last-emitted state may still be Active, with no compensating emit.
 The design doc explicitly states the cap eviction is "logged — there is no silent
 truncation," but logging alone does not reconcile the operator-visible state: the
 alarm view is silently wrong.
 **Recommendation**
 In `EnforceCap`, before removing each overflow condition, `Emit(a, a.Condition
 with { Active = false })` (mirroring `ApplySnapshotSwap`) so the eviction produces
 a return-to-normal on the stream and clears the phantom. Add a test asserting that
 exceeding `MirroredAlarmCapPerSource` with an active oldest condition emits a
 Normal `AlarmStateChanged` for the evicted `SourceReference`.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): `NativeAlarmActor.EnforceCap` now emits `Emit(evicted, evicted.Condition with { Active = false })` for a still-active evicted condition before removing it, clearing the phantom stuck-Active alarm on the stream/UI. Cap-eviction test added.
 ### SiteRuntime-029 — A delete (or disable) arriving during a pending redeploy over-decrements the counter and is undone by `HandleTerminated`
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:627`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:402` |
 **Description**
 `HandleDelete` and `HandleDisable` only consult `_instanceActors` to find the live
 actor; neither consults the SiteRuntime-020 `_terminatingActorsByName` /
 `_pendingRedeploys` bookkeeping. When a redeployment is in flight, `HandleDeploy`
 has already removed the instance from `_instanceActors`, stopped the predecessor,
 and buffered a `PendingRedeploy` keyed by the terminating ref. If a
 `DeleteInstanceCommand` then arrives for that same instance *before* the
 `Terminated` signal fires:
 1. `_instanceActors.TryGetValue` misses (the entry was removed at redeploy time),
   so no actor is stopped.
 2. `_totalDeployedCount = Math.Max(0, _totalDeployedCount - 1)` runs anyway —
   over-decrementing, because the redeploy path did not increment for an update
   (the count had already been adjusted), so the deployed/disabled counts reported
   to the health collector drift.
 3. `RemoveDeployedConfigAsync` deletes the SQLite row and the deleter is told
   success.
 4. The buffered `_pendingRedeploys` entry is **untouched**, so when `Terminated`
   fires, `HandleTerminated` calls `ApplyDeployment(..., isRedeploy: true)`, which
   re-creates the Instance Actor and re-writes the deployed config to SQLite —
   **resurrecting the instance the operator just deleted**, with the counter now
   inconsistent.
 `HandleDisable` has the milder version: the disable persists `is_enabled = false`,
 but `HandleTerminated` then re-stores the config with `isEnabled: true` and
 re-creates the actor, so a disable issued mid-redeploy is silently reverted to
 enabled.
 The window is the redeploy-termination interval — small, but reliably hit when
 central issues a delete/disable immediately after a deploy (e.g. an operator
 correcting a mistaken deploy, or an automated teardown), exactly the kind of rapid
 command sequence SiteRuntime-020 was filed to harden.
 **Recommendation**
 Make `HandleDelete`/`HandleDisable` authoritative over the mid-redeploy state:
 before falling through, check `_terminatingActorsByName`. On a hit, drop the
 buffered `_pendingRedeploys` entry (so `HandleTerminated` does not resurrect),
 clear the shadow, and for delete tell the buffered redeploy's `OriginalSender` it
 was superseded — mirroring the last-write-wins handling already in `HandleDeploy`.
 Only decrement `_totalDeployedCount` when an instance was actually present
 (in `_instanceActors` **or** terminating). Add a regression test:
 deploy → redeploy → delete-before-Terminated asserts the instance stays deleted
 and the counter is correct.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): `HandleDelete`/`HandleDisable` now check `_terminatingActorsByName` before the `_instanceActors` fall-through; on a hit they drop the buffered `_pendingRedeploys` entry and clear the shadow so `HandleTerminated` can't resurrect the instance, and `_totalDeployedCount` is decremented only when an instance was actually present. Delete-during-redeploy race test added (verified failing pre-fix).
 ### SiteRuntime-030 — Native-alarm cap/retention-drop and `SiteReplicationActor` paths are untested
 | | |
 |--|--|
 | Severity | Low |
 | Category | Testing coverage |
 | Status | Resolved |
 | Location | `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/NativeAlarmActorTests.cs`, `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/` |
 **Description**
 `NativeAlarmActorTests` covers subscribe, raise, snapshot-swap return-to-normal,
 out-of-order rejection, ack, and site-event emission, but there is **no** test for
 `EnforceCap` (the per-source cap eviction) or for the `ApplyLiveTransition`
 retention drop (`!Active && Acknowledged`). Both are non-trivial state
 transitions, and the cap path harbours an observable defect (SiteRuntime-028) that
 a targeted test would have caught. `SiteReplicationActor` remains entirely
 untested — the carried-forward gap that SiteRuntime-016 explicitly deferred to a
 clustered-ActorSystem harness, still outstanding at this commit (the actor calls
 `Cluster.Get(Context.System)` in its constructor, so it needs a clustered HOCON
 test host). The outbound forward, inbound apply, and `SendToPeer` no-peer-drop
 behaviour are unverified.
 **Recommendation**
 Add `NativeAlarmActor` tests for (a) cap eviction emits a return-to-normal for an
 evicted active condition (pairs with SiteRuntime-028) and (b) a resolved condition
 (inactive+acked) drops out and deletes its SQLite row. Stand up the clustered
 test harness SiteRuntime-016 called for and cover `SiteReplicationActor`'s
 outbound/inbound/peer-tracking paths.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): native-alarm cap-eviction and retention-drop tests added (pairs with -027/-028). The clustered `SiteReplicationActor` test harness remains deferred (needs a clustered ActorSystem host, consistent with the prior -016 deferral).
 ### SiteRuntime-031 — Site still persists notification-list and SMTP config that the design moved to central-only
 | | |
 |--|--|
 | Severity | Low |
 | Category | Design-document adherence |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:90`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:105`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1383`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1417` |
 **Description**
 The design decision (CLAUDE.md "External Integrations" / Component-NotificationService;
 echoed in Component-SiteRuntime.md "System-Wide Artifact Handling": *"Notification
 lists and SMTP configuration are central-only and are not deployed to sites"*) makes
 notification delivery central-only — sites store-and-forward to central and never
 talk to SMTP. Yet SiteRuntime still:
 - creates the `notification_lists` and `smtp_configurations` SQLite tables
  (`SiteStorageService.InitializeAsync`),
 - writes them from `HandleDeployArtifacts` (`StoreNotificationListAsync` /
  `StoreSmtpConfigurationAsync`, DeploymentManagerActor.cs:1383/1417) and the
  replication path `SiteReplicationActor.HandleApplyArtifacts`, and
 - the `DeployArtifactsCommand` contract still carries `NotificationLists` /
  `SmtpConfigurations`.
 The SMTP path persists the SMTP `password` field (SiteStorageService.cs:693) into
 plaintext site SQLite. If central no longer populates these (per the design), this
 is dead code carrying a latent secret-at-rest footprint; if central still does, the
 site is storing SMTP credentials it must never use — both contradict the
 central-only delivery decision. Either the code/tables are stale and should be
 removed, or the design doc is stale and the decision needs re-stating. This straddles
 the SiteRuntime/NotificationService boundary and the shared `DeployArtifactsCommand`
 contract, so the direction is a design-owner call rather than a clean in-module fix.
 **Recommendation**
 Confirm with the design owner whether central still ships notification-list / SMTP
 artifacts to sites. If not (the stated decision), remove the `notification_lists`
 and `smtp_configurations` tables, the two `Store…Async` writers, the replication
 branches, and the corresponding `DeployArtifactsCommand` fields (a coordinated
 cross-module change). If the decision has changed, update Component-SiteRuntime.md
 and Component-NotificationService.md and treat the plaintext SMTP password as a
 secret-handling finding in its own right.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): the site no longer persists notification-list / SMTP config and purges any already-persisted rows (incl. the plaintext SMTP password) on both apply paths (`HandleDeployArtifacts`, replication `HandleApplyArtifacts`), inside the all-or-nothing apply. Paired with DeploymentManager-025 (central stops shipping). Tables retained but kept empty.
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward` |
 | Design doc | `docs/requirements/Component-StoreAndForward.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 0 (3 Deferred: 002, 011, 012; all 5 Open from Re-review 2026-05-28 resolved 2026-05-28) |
+| Open findings | 0 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) |
 ## Summary
@@ -110,6 +110,39 @@ mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
 `_replication` after `StopAsync` returns; downstream resources disposed by the host
 shutdown sequence (the DI container) can then NRE through the still-running sweep.
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 Full re-review against commit `4307c381` with the same 10-category checklist. All prior
 fixes are intact: the StoreAndForward-024 stop-wait (the captured `_sweepTask` +
 bounded `StopAsync` wait), the StoreAndForward-020 capture-before-requeue, the
 StoreAndForward-005 conditional `UpdateMessageIfStatusAsync` writes, and the
 StoreAndForward-023 site-id sentinel all remain present and correct, and the new WP-14
 in-process `_bufferedCount` queue-depth counter shows **no drift** — every Pending-population
 transition (`BufferAsync` +1, retry-remove −1, conditional Pending→Parked −1, operator
 requeue +1) adjusts it exactly once and only when the underlying conditional storage write
 wins. The three findings this pass surfaces are not core-delivery defects: a **gauge-stop
 resource leak** (025 — the process-global queue-depth provider slot is never cleared on
 `StopAsync`, so a stopped service's frozen depth is reported indefinitely and the dead
 instance is pinned via the closure), plus **two test/doc items** — the gauge stop/lifecycle
 and the `EmitSiteEvent` retry / retry-delivered branches are untested (026), and the
 per-retry-per-sweep `"Retried"` site event is a spec-conformant design-vs-scale tension
 worth flagging (027).
 ## Checklist coverage — Re-review 2026-06-20
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ☑ | No new defects — `_bufferedCount` adjusts once per Pending transition, gated on the winning conditional write; retry-count/park semantics (003/005) intact. |
 | 2 | Akka.NET conventions | ☑ | No issues found — `ParkedMessageHandlerActor` `PipeTo` success/failure projections preserved (007). |
 | 3 | Concurrency & thread safety | ☑ | No new findings — StopAsync now awaits the captured `_sweepTask` (024); requeue captures the row up front (020); conditional writes guard sweep-vs-operator (005). |
 | 4 | Error handling & resilience | ☑ | No new findings — replication wired on all delivery + operator paths (001/016); notification parking documented (019). |
 | 5 | Security | ☑ | No issues found — parameterised SQL throughout; payload JSON opaque; no secret material handled. |
 | 6 | Performance & resource management | ☑ | Queue-depth gauge provider registered into a process-global slot is never cleared on `StopAsync` — frozen reading + dead-instance pin for the process lifetime (025). |
 | 7 | Design-document adherence | ☑ | Per-retry `"Retried"` site event per-row-per-sweep is spec-conformant (`Component-StoreAndForward.md:139`) but a flood risk against the unbounded buffer + capped site log — flagged as a judgment call (027). |
 | 8 | Code organization & conventions | ☑ | No new findings — options/POCO placement unchanged from prior passes (011/012 still Deferred to Commons-owning changes). |
 | 9 | Testing coverage | ☑ | Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested (026). |
 | 10 | Documentation & comments | ☑ | No new findings — the `_bufferedCount` and StopAsync XML docs accurately describe the implemented behaviour. |
 ## Checklist coverage — Re-review 2026-05-28
 | # | Category | Examined | Notes |
@@ -1520,3 +1553,186 @@ mid-sweep and asserts no further storage activity occurs after `StopAsync` retur
 _Unresolved._
 ### StoreAndForward-025 — Queue-depth gauge provider is never cleared on `StopAsync`, leaking a frozen reading and pinning the dead service
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Performance & resource management |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:347`–`:351`, `:385`–`:425`; `src/ZB.MOM.WW.ScadaBridge.Commons/Observability/ScadaBridgeTelemetry.cs:55`–`:60`, `:85`–`:93` |
 **Description**
 The WP-14 queue-depth gauge is backed by a **process-global** provider slot.
 `StoreAndForwardService.StartAsync` seeds the in-process `_bufferedCount`, sets a
 one-time instance guard (`_queueDepthProviderRegistered`, line 347), and registers a
 closure that reads this instance's counter into the global slot
 (`ScadaBridgeTelemetry.SetQueueDepthProvider(() => Interlocked.Read(ref _bufferedCount))`,
 line 350). The `scadabridge.store_and_forward.queue.depth` `ObservableGauge`
 (`ScadaBridgeTelemetry.cs:56`–`:60`) invokes that provider on every collector scrape:
 `Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L`.
 `SetQueueDepthProvider` (`ScadaBridgeTelemetry.cs:85`–`:93`) is the only mutator of the
 slot — there is **no** clear/reset method anywhere in the codebase (a repository-wide
 search for `ClearQueueDepthProvider` / writes to `_queueDepthProvider` finds only the
 registration). `StopAsync` (`StoreAndForwardService.cs:385`–`:425`) disposes the retry
 timer and (StoreAndForward-024) awaits the in-flight sweep, but it never touches the
 gauge: the global slot still points at the stopped instance's closure, and
 `_queueDepthProviderRegistered` is left at 1.
 Two consequences after a graceful stop:
 1. **Frozen reading.** The gauge keeps reporting the dead service's last `_bufferedCount`
   value forever — a stopped (or failed-over-away) site node reports a stale, non-zero
   queue depth that no longer corresponds to any live buffer, which is misleading on a
   dashboard / alert built on the metric.
 2. **Instance pin (resource leak).** The captured closure holds a reference to the dead
   `StoreAndForwardService`, so the stopped instance (and everything it transitively
   roots) cannot be garbage-collected for the process lifetime. This is the classic
   "static event / static delegate slot holding a finished object" leak. It also means a
   later instance's `StartAsync` silently stomps the global slot (last-writer-wins), so
   the gauge's identity is unmanaged.
 The instance guard's own XML doc (lines 156–164) already acknowledges the slot is
 process-global and "the last `StartAsync` wins the global slot" — but nothing on the stop
 path releases it, so a clean stop without an immediately-following start leaves a dangling
 provider.
 **Recommendation**
 Give `ScadaBridgeTelemetry` an identity-checked clear and call it from `StopAsync`. Add a
 compare-and-clear `ClearQueueDepthProvider(Func<long> expected)` that only nulls the slot
 when the current provider is reference-equal to `expected` (so a later instance that
 already re-registered is not stomped):
 ```csharp
 public static void ClearQueueDepthProvider(Func<long> expected)
 {
    Interlocked.CompareExchange(ref _queueDepthProvider, null, expected);
 }
 ```
 In `StoreAndForwardService`, hold the registered provider delegate in a field at
 `StartAsync` time, and in `StopAsync` (only when this instance actually registered —
 `_queueDepthProviderRegistered == 1`) call
 `ScadaBridgeTelemetry.ClearQueueDepthProvider(_registeredProvider)` and reset
 `_queueDepthProviderRegistered` to 0 via `Interlocked.Exchange` so a subsequent restart
 re-seeds and re-registers cleanly. Add a regression test (extend `QueueDepthGaugeTests`)
 asserting that after `StopAsync` the gauge reports 0 (provider cleared) and that a second
 instance's registration is not clobbered by the first instance's stop.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added an identity-checked `ScadaBridgeTelemetry.ClearQueueDepthProvider(provider)` (compare-and-clear via `Interlocked.CompareExchange`, so a newer instance's provider is never stomped) and call it in `StoreAndForwardService.StopAsync` (with `_queueDepthProviderRegistered` reset). The gauge no longer reports a frozen depth or pin the dead service after a graceful stop. Tests added (incl. the late-stop-doesn't-clobber-successor case).
 ### StoreAndForward-026 — Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested
 | | |
 |--|--|
 | Severity | Low |
 | Category | Testing coverage |
 | Status | Deferred |
 | Location | `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/QueueDepthGaugeTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/StoreAndForwardSiteEventTests.cs`; `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:295`–`:308`, `:747` |
 **Description**
 Two behaviours exercised by the current code have no test coverage:
 1. **Gauge stop / lifecycle.** `QueueDepthGaugeTests` covers the live counter across
   enqueue / drain / park / requeue (`Gauge_TracksBufferedDepth_AcrossEnqueueDrainAndPark`),
   the start-up seed from existing Pending rows (`Gauge_SeedsFromExistingPendingRows_OnStart`),
   and the additive-seed race (`Gauge_SeedAddsToConcurrentPreSeedIncrement_NotClobber`).
   It calls `StopAsync` only as test teardown (`DisposeAsync`) with no assertion — so the
   gauge's stop behaviour is unverified. This is the same coverage gap that would have
   caught StoreAndForward-025: no test asserts what the gauge reports after a service stops
   (today: a stale frozen value; after the 025 fix: 0). The gauge's behaviour across a
   stop, and across a second instance starting after a first stops, is entirely untested.
 2. **`EmitSiteEvent` retry / retry-delivered branches.** `StoreAndForwardSiteEventTests`
   covers buffer-for-retry (`"Queued"`), park (`"Parked"` for both notification and
   cached-call categories), routine-enqueue-no-event, and immediate-delivered-no-event.
   But two `EmitSiteEvent` branches that the retry sweep drives are uncovered:
   - the per-retry `"Retried"` action (`StoreAndForwardService.cs:747`), which maps to a
     `store_and_forward` site event at `Warning` severity (the `_ => "Warning"` arm of the
     severity switch, lines 298–303);
   - the retry-loop `"Delivered"` action whose detail is `"Delivered to … after N retries"`
     (line 644–645), which — because the detail does **not** start with `"Immediate"` —
     is **not** short-circuited at line 268 and instead emits a `store_and_forward` event
     at `Info` severity (the `"Delivered" => "Info"` arm, line 301). This "recovery
     recorded" path is the only one that produces an `Info`-severity S&F site event, and
     nothing asserts it fires (or that the immediate-delivered path is correctly suppressed
     by contrast).
 **Recommendation**
 Add a gauge-stop test to `QueueDepthGaugeTests`: enqueue to a known non-zero depth, call
 `StopAsync`, and assert the gauge reports the cleared value (0 after the StoreAndForward-025
 fix) — and, ideally, that a second `StoreAndForwardService` started afterward owns the slot
 without being clobbered by the first instance's stop. Add two `StoreAndForwardSiteEventTests`
 cases: (a) a transient handler with `MaxRetries > 1` driven through one sweep, asserting a
 `store_and_forward` / `Warning` event whose message contains `"retried"`; and (b) a handler
 that fails on the immediate attempt then succeeds on the sweep, asserting a `store_and_forward`
 / `Info` event whose message contains `"delivered"` (recording the recovery), distinct from
 the suppressed immediate-delivery path.
 **Resolution**
 Deferred 2026-06-20: additional gauge-lifecycle and `EmitSiteEvent` retry-branch tests are low-value coverage gaps; recorded for a follow-up.
 ### StoreAndForward-027 — Per-message `"Retried"` site event fires per-row-per-sweep, risking a flood of the capped site event log
 | | |
 |--|--|
 | Severity | Low |
 | Category | Design-document adherence |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:734`–`:759`, `:295`–`:308`; `docs/requirements/Component-StoreAndForward.md:72`, `:139` |
 **Description**
 When `_siteEventLogger` is wired, `EmitSiteEvent` is subscribed to `OnActivity`, so every
 `RaiseActivity("Retried", …)` call in `RetryMessageAsync`'s transient-not-yet-max branch
 (`StoreAndForwardService.cs:747`–`:748`) produces one `store_and_forward` site event at
 `Warning` severity (lines 295–308). That branch fires **once per still-pending row per
 retry sweep**. Combined with the design's deliberate "**no maximum buffer size**"
 (`Component-StoreAndForward.md:72`), a sustained outage can buffer a large Pending
 population, and each fixed-interval sweep then emits a `"Retried"` event for every one of
 those rows — `(pending rows) × (sweeps)` events. The site event log is **capped** (default
 1 GB, 30-day retention, with oldest-first purge when the cap is hit before the retention
 window — `Component-SiteEventLogging.md:45`–`:46`), so a large, long-lived backlog can
 churn the log and evict genuinely useful operational history (deployments, connection
 events, alarms) well before the 30-day window.
 **This is a judgment call, not a clear bug.** The behaviour is **spec-conformant**:
 `Component-StoreAndForward.md:139` explicitly lists "retried" as a logged store-and-forward
 activity ("Logs store-and-forward activity (queued, delivered, retried, parked)"), so the
 code is doing exactly what the design doc says. The tension is between that
 per-retry-logging contract and the combination of an unbounded buffer + a capped site log:
 at scale the "retried" stream can dominate the log. The site-event-log cap bounds the
 absolute disk risk (it cannot exhaust disk), so the residual concern is **eviction of
 other history**, not unbounded growth — hence Low and flagged as a design-vs-scale
 decision rather than a defect.
 **Recommendation**
 Reconcile the design-vs-scale tension deliberately; this is a judgment call for the owners,
 not a forced fix. Options, in rough order of preference:
 1. **Drop or coarsen the per-retry event.** Stop emitting a `store_and_forward` event for
   the routine `"Retried"` action (keep `"Queued"`, `"Parked"`, and the retry-recovery
   `"Delivered"`), since the per-attempt detail is already available via the
   `OnActivity`/health/telemetry surfaces and the central audit log. The first-buffer and
   the terminal park/deliver events are the operationally interesting ones.
 2. **Sample / threshold it.** Emit a `"Retried"` event only on the first retry, on a count
   boundary (e.g. every Nth retry), or once per sweep as a rolled-up summary
   ("retried N messages this sweep") rather than one per row.
 3. **Document the volume risk.** If per-retry logging is intentionally retained, add a note
   to `Component-StoreAndForward.md` / `Component-SiteEventLogging.md` that a large buffered
   backlog can dominate the capped site event log and evict other history, so operators
   size the cap / retention accordingly.
 **Resolution**
 Deferred 2026-06-20: the per-retry `"Retried"` site-event volume is spec-conformant (Component-StoreAndForward.md lists 'retried' as a logged activity) and the site log is capped, so the residual concern is history eviction, not disk exhaustion. Whether to drop/sample/threshold the event vs. document the volume is a design-vs-scale decision for the owner.
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine` |
 | Design doc | `docs/requirements/Component-TemplateEngine.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
+| Commit reviewed | `4307c381` |
-| Open findings | 1 |
+| Open findings | 0 |
 ## Summary
@@ -87,6 +87,45 @@ the duplicate-Id / null-sentinel fix from the last batch, and audit-write
 ordering inconsistency between `TemplateService` (logs then saves) and
 `InstanceService` (saves then logs).
 #### Re-review 2026-06-20 (commit `4307c381`) — full review
 Re-reviewed the whole module against all ten checklist categories at commit
 `4307c381`. The large delta since `1eb6e97` (722 commits, including the
 `ScadaLink → ZB.MOM.WW.ScadaBridge` rename, the Script Analysis #25 extraction,
 native-alarm-source authoring/flatten, the read-only inheritance-resolve service,
 List-attribute semantics, and Transport line-diff) holds up well. All 22 prior
 findings remain closed and verified: the deploy gate now delegates to the
 authoritative `ScriptTrustValidator` + `RoslynScriptCompiler` (real semantic
 compile, no residual substring scan or hand-rolled brace counter in the security
 path — `ScriptCompiler` and `ValidationService.CheckExpressionSyntax` both call
 the shared analyzer); the recursive composition walk, per-slot alarm override,
 duplicate-Id-tolerant `BuildLookup`, `int?`-parent walk, and the revision-hash /
 diff `Description`+`Connections`+`ElementDataType` coverage are all in place and
 test-guarded. The new `TemplateInheritanceResolver` (M9-T26a) mirrors the
 flattener's inheritance precedence exactly (including HiLo per-setpoint merge),
 verified by `Resolve_AgreesWithFlatteningService_*` tests. Three new findings
 surfaced — all lock/audit asymmetries where a newer feature did not pick up an
 invariant an older sibling already enforces: native-alarm-source instance
 overrides bypass the lock at flatten time (the unfixed-for-native sibling of the
 closed TemplateEngine-008; `ResolvedNativeAlarmSource` carries no `IsLocked`
 flag), the four `TemplateFolderService` mutators stage their audit row *after*
 the final `SaveChangesAsync` so the row is never persisted, and a dead/misnamed
 `SemanticValidator.ParseParameterDefinitions` legacy branch returns types not
 names.
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ✓ | New: native-source instance override applies with no lock check (TemplateEngine-023). Flattening inherit/compose/override precedence + recursion verified correct for attrs/alarms/scripts/native-sources; HiLo per-setpoint merge consistent with resolver. |
 | 2 | Akka.NET conventions | ✓ | No actors. `AddTemplateEngineActors` is still an empty placeholder. Nothing to assess. |
 | 3 | Concurrency & thread safety | ✓ | Services stateless/scoped; static helpers hold no mutable state. No new findings. |
 | 4 | Error handling & resilience | ✓ | `Result<T>` used consistently; `Flatten` wraps in try/catch; JSON parse failures fall back safely. No store-and-forward/failover surface. |
 | 5 | Security | ✓ | Deploy gate is now authoritative via Script Analysis #25 (`ScriptTrustValidator.FindViolations` + `RoslynScriptCompiler.Compile`) — alias/`using static`/`global::` bypasses caught by symbol resolution; no residual substring scan in the security path. No new findings. |
 | 6 | Performance & resource management | ✓ | TemplateEngine-009 N+1 fix holds (`Compositions` navigation reused). `JsonDocument`/streams disposed. No new leaks. |
 | 7 | Design-document adherence | ✓ | Native-alarm-source lock rule ("Lock applies to the entire source") not enforced in flattening (TemplateEngine-023). Otherwise the doc matches: revision-hash/diff payload coverage, semantic validation set, acyclicity, derived-naming all align. |
 | 8 | Code organization & conventions | ✓ | New: `TemplateFolderService` Create/Rename/Move/Reorder log after the final save → audit row never persisted (TemplateEngine-024), asymmetric with `TemplateService` (which always saves after `LogAsync`). |
 | 9 | Testing coverage | ✓ | Strong overall (inheritance-resolver agrees-with-flattener tests, hash determinism test). Gaps: no test for a locked native source rejecting an instance override; folder tests mock `IAuditService` so the never-saved audit row is invisible; `ParseParameterDefinitions` legacy branch untested. |
 | 10 | Documentation & comments | ✓ | New: `ParseParameterDefinitions` is documented as returning parameter NAMES but its legacy-array branch returns the type field (TemplateEngine-025). XML docs otherwise accurate and thorough. |
 ## Checklist coverage
 _Re-review (2026-05-17, `39d737e`):_
@@ -1191,3 +1230,161 @@ false rejected, false→true / true→true / false→false accepted) and
 `TemplateServiceTests.UpdateAttribute_LockedInDerivedDowngrade_OnBase_Rejected`
 (end-to-end on the attribute update path).
 ### TemplateEngine-023 — Instance native-alarm-source override bypasses the source lock during flattening
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
 | Status | Deferred |
 | Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs:775`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Flattening/FlattenedConfiguration.cs:137` |
 **Description**
 `ApplyInstanceNativeAlarmSourceOverrides` applies a per-instance native-alarm-source
 override unconditionally — it looks the binding up by canonical name and, if found,
 overlays the override's non-null fields with **no lock check**:
 ```csharp
 foreach (var ovr in overrides)
 {
    if (!sources.TryGetValue(ovr.SourceCanonicalName, out var existing))
        continue; // Cannot add new bindings via overrides.
    sources[ovr.SourceCanonicalName] = existing with { ConnectionName = ..., ... };
 }
 ```
 Contrast `ApplyInstanceOverrides` (attributes, `:309` — `if (existing.IsLocked) continue;`)
 and `ApplyInstanceAlarmOverrides` (alarms, `:337` — `if (existing.IsLocked) continue;`).
 The design (`Component-TemplateEngine.md` Override Granularity, Native alarm sources:
 *"Lock applies to the entire source"*) requires the same treatment. The root cause is
 in Commons: `ResolvedNativeAlarmSource` (`FlattenedConfiguration.cs:137`) carries **no
 `IsLocked` field**, so the flattener has nothing to gate on even if it wanted to —
 unlike `ResolvedAttribute`/`ResolvedAlarm`, which both expose `IsLocked`. The
 inheritance-side `ResolveInheritedNativeAlarmSources` does track a base lock (via the
 `lockedNames` set) so a *derived template* cannot override a base-locked source, but
 that lock state is dropped on the way out and the *instance* layer is unguarded.
 This is the same defect class as the closed **TemplateEngine-008** (which fixed the
 identical instance-override lock-bypass for alarms), never applied to native alarm
 sources because they were introduced after that finding. The companion gap lives in
 `ManagementService.HandleSetInstanceNativeAlarmSourceOverride`
 (`ManagementActor.cs:868`), which — unlike `InstanceService.SetAlarmOverrideAsync` /
 `SetAttributeOverrideAsync` — performs neither an existence check (an override for an
 unknown source name is silently persisted as a dead record) nor a lock check before
 writing the override row straight through the repository, bypassing `InstanceService`
 entirely. Blast radius is limited (native alarms are a read-only mirror — no write-back
 to the source), but the documented lock invariant is not enforced anywhere on this path.
 **Recommendation**
 Add an `IsLocked` field to `ResolvedNativeAlarmSource` (Commons) and carry it through
 `ResolveInheritedNativeAlarmSources` / `ResolveComposedNativeAlarmSources`; then gate
 `ApplyInstanceNativeAlarmSourceOverrides` on `if (existing.IsLocked) continue;`,
 mirroring the attribute/alarm paths. Add the matching existence + lock guard to
 `HandleSetInstanceNativeAlarmSourceOverride` (resolve the effective native-source set
 via `TemplateResolver`/flattening as `SetAlarmOverrideAsync` does). Add a flattening
 regression test (`Flatten_LockedNativeSource_InstanceOverrideIgnored`) and a management
 test rejecting an override of a locked / unknown source.
 **Resolution**
 Deferred 2026-06-20: the native-alarm-source override lock-bypass fix spans Commons (`ResolvedNativeAlarmSource` has no `IsLocked` field to gate on) and ManagementService (the override handler skips the existence/lock checks), and the blast radius is limited (native alarms are read-only mirrors, no write-back). Recorded as a cross-module follow-up.
 ### TemplateEngine-024 — `TemplateFolderService` mutators stage their audit row after the final save, so it is never persisted
 | | |
 |--|--|
 | Severity | Medium |
 | Category | Code organization & conventions |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:58`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:91`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:154`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:204` |
 **Description**
 `IAuditService.LogAsync` only *stages* the audit row in the EF change tracker — the
 implementation (`ConfigurationDatabase/Services/AuditService.cs:61`) ends with
 `AddAsync(entry)` and explicitly documents *"caller is responsible for calling
 SaveChangesAsync to ensure atomicity."* Four of the five `TemplateFolderService`
 mutators call `LogAsync` **after** their final `SaveChangesAsync` and never save again:
 - `CreateFolderAsync` (`:56-58`): `AddFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save).
 - `RenameFolderAsync` (`:89-91`): `UpdateFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save).
 - `MoveFolderAsync` (`:152-154`): `UpdateFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save).
 - `ReorderFolderAsync` (`:201-204`): `Update`×2 → `SaveChangesAsync` → `LogAsync` (no save).
 `ManagementActor.ProcessCommand` (`ManagementActor.cs:130`) wraps each command in a DI
 scope, dispatches, and disposes the scope with no outer `SaveChangesAsync`; EF Core does
 **not** flush on dispose. The folder operation itself persists (it was saved *before* the
 `LogAsync`), but the staged `TemplateFolder` audit row is silently discarded when the
 scoped `DbContext` is disposed. Every folder Create / Rename / Move / Reorder therefore
 leaves **no audit trail**, even though `_auditService.LogAsync(...)` is called and reads
 as if it does. (`DeleteFolderAsync` has the same shape but the lost row is the *Delete*
 record specifically.) This is asymmetric with `TemplateService`, where every mutator
 ends with `LogAsync` → `SaveChangesAsync` (e.g. `:225-226`). The folder tests mock
 `IAuditService` and the repository, so they never exercise the real change-tracker
 lifecycle and the gap is invisible to them.
 **Recommendation**
 Add a final `await _repository.SaveChangesAsync(cancellationToken);` after the `LogAsync`
 in all five folder mutators (matching `TemplateService`), or — cleaner — reorder each to
 mutate → log → single `SaveChangesAsync` so the entity change and its audit row commit
 atomically. Add a test that uses a real (in-memory) `AuditService` + `DbContext` and
 asserts an `AuditLogEntry` row exists after a folder create/rename/move/reorder.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): added `await _repository.SaveChangesAsync(...)` after `LogAsync` in all five `TemplateFolderService` mutators (Create/Rename/Move/Reorder/Delete), so folder-mutation audit rows are now persisted (previously staged then discarded with the scope). Matches the `TemplateService` two-save pattern. Tests added.
 ### TemplateEngine-025 — `SemanticValidator.ParseParameterDefinitions` legacy-array branch returns types, not names; method is dead production code
 | | |
 |--|--|
 | Severity | Low |
 | Category | Documentation & comments |
 | Status | Resolved |
 | Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Validation/SemanticValidator.cs:573` |
 **Description**
 `ParseParameterDefinitions` is documented and named as returning the declared parameter
 **names**:
 ```csharp
 /// ... returns the declared parameter names.
 internal static List<string> ParseParameterDefinitions(string? parameterDefinitionsJson)
 ```
 Its JSON-Schema branch correctly returns `props.EnumerateObject().Select(p => p.Name)`,
 but its legacy flat-array branch returns the `type` field:
 ```csharp
 else if (doc.RootElement.ValueKind == JsonValueKind.Array)
 {
    return doc.RootElement.EnumerateArray()
        .Select(e => e.TryGetProperty("type", out var t) ? t.GetString() ?? "unknown" : "unknown")
        .ToList();
 }
 ```
 This is a copy-paste from the sibling `ParseParameterTypes` (`:491`, which legitimately
 returns types) — so for legacy `[{name,type}]` definitions this method returns the wrong
 field (types instead of names). The method has **no non-test caller** in `src/` (only its
 own `SemanticValidatorTests`), and the test exercises only the JSON-Schema branch, so the
 buggy legacy branch is both dead and untested. The live param-count/type validation path
 uses `ParseParameterTypes`, not this method, so there is no behavioural impact today — but
 the method is `internal` (callable from the rest of the module) and silently returns
 mislabelled data for one input shape.
 **Recommendation**
 Delete `ParseParameterDefinitions` (and its tests) as dead code, or — if it is meant to
 back a future feature — fix the legacy branch to return `e.TryGetProperty("name", ...)`
 and add a legacy-array test asserting names are returned.
 **Resolution**
 Resolved 2026-06-20 (commit `fd618cf1`): deleted the dead, misnamed `SemanticValidator.ParseParameterDefinitions` (no production caller per grep; its only references were two tests, which asserted the buggy type-not-name output). Tests removed.