code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including first-time reviews of the four newer components (AuditLog, NotificationOutbox, SiteCallAudit, Transport) — so the code-reviews/ index reflects today's codebase rather than the 2026-05-16 baseline. 172 new Open findings (0 Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules. regen-readme.py now derives each module's Last reviewed + Commit from its findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future single-module re-reviews show their own date in the Module Status table.
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
@@ -0,0 +1,500 @@
+# Code Review — AuditLog
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.AuditLog` |
+| Design doc | `docs/requirements/Component-AuditLog.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-28 |
+| Reviewer | claude-agent |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 11 |
+
+## Summary
+
+AuditLog is one of the largest and most carefully engineered modules in the codebase.
+The site-side hot-path (`SqliteAuditWriter` + `FallbackAuditWriter` + `RingBufferFallback`)
+implements a textbook bounded-channel + dedicated-writer pattern with batched transactions,
+UTF-8-safe truncation, additive schema migration, and a drop-oldest fallback that
+genuinely honours the "audit-write must NEVER abort the user-facing action" contract.
+The central side mirrors that with per-row try/catch on batch ingest, a transactional
+dual-write for the cached-telemetry path, per-site cursor isolation in reconciliation,
+and a partition-switch purge that is metadata-only. The payload filter is well-factored
+with a compile-time regex cache, per-stage failure isolation, and per-target overrides.
+Test coverage is broad — ~12 000 lines spanning unit, integration, and end-to-end paths.
+
+Themes across findings: (1) the largest issue is a **specced-but-unwired transport path** —
+`ISiteStreamAuditClient.IngestCachedTelemetryAsync` and `AuditLogIngestActor.OnCachedTelemetryAsync`
+both exist and the protobuf RPC is plumbed, but no production code ever calls the cached-telemetry
+client; the cached-call lifecycle audit rows ride the audit-only `IngestAuditEventsAsync` drain
+and the central dual-write transaction is dead code (AuditLog-001). (2) Several
+**Akka.NET supervisor-strategy comments are inaccurate** — multiple actors document
+"`SupervisorStrategy` uses Resume" but the code returns `DefaultDecider` (which Restarts), and
+the strategy applies to children, not to the actor itself (AuditLog-002). (3) The
+**`SqliteAuditWriter` hot-path lock is contended by the 30 s backlog probe** — `GetBacklogStatsAsync`
+takes the same `_writeLock` that serialises every batch INSERT, so a large-backlog scan can
+park the hot-path writer (AuditLog-005). (4) **Sync-over-async in `Dispose`** can deadlock under
+an ASP.NET sync context (AuditLog-006). (5) A handful of **misleading code comments and minor
+configuration drift** (AuditLog-007, AuditLog-008, AuditLog-009). (6) `CancellationToken`
+parameters on the actor drain paths are accepted but immediately replaced with
+`CancellationToken.None` (AuditLog-010). (7) `AddAuditLogHealthMetricsBridge` and
+`AddAuditLogCentralMaintenance` register hosted services without idempotency guards, so
+an accidental second call duplicates the `SiteAuditBacklogReporter` or the
+partition-maintenance service (AuditLog-011). No Critical-severity issues; three Medium, eight Low.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | Yes | `AuditLogIngestActor.OnCachedTelemetryAsync` is unreachable production code (AuditLog-001); reconciliation cursor advances on persistent insert failure (AuditLog-004); `DisposeAsync` comment about `_disposed` ordering is misleading (AuditLog-009). |
+| 2 | Akka.NET conventions | Yes | `SupervisorStrategy` returned by actors does not do what the surrounding doc says (AuditLog-002); per-actor strategy applies to children only, but comments imply self-protection. |
+| 3 | Concurrency & thread safety | Yes | `GetBacklogStatsAsync` contends with hot-path writes on `_writeLock` (AuditLog-005); sync DI scopes block on async EF disposal (AuditLog-003); `_disposed` is set after the wait, contradicting comment (AuditLog-009); no cooperative cancellation through the drain paths (AuditLog-010). |
+| 4 | Error handling & resilience | Yes | Best-effort contract is honoured throughout; `Dispose()` sync-over-async is the one remaining hazard (AuditLog-006); reconciliation silently discards permanently-failing rows (AuditLog-004). |
+| 5 | Security | Yes | Append-only enforcement, redaction stack, and "never under-redact" safety net all present. Test composition roots that omit the filter SILENTLY pass payloads through unredacted (AuditLog-008). |
+| 6 | Performance & resource management | Yes | Hot-path batched + back-pressured. Backlog scan holds the write lock (AuditLog-005); `MarkForwardedAsync` interpolates an `IN (...)` list inside the lock, fine in practice but scales linearly with batch size. |
+| 7 | Design-document adherence | Yes | Combined telemetry transport plumbed but never called (AuditLog-001); other than that the implementation closely tracks the design doc. |
+| 8 | Code organization & conventions | Yes | Composition root well-segmented; `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` for the same dependency across registrations (AuditLog-007); `AddAuditLog*` helpers register hosted services and option bindings without idempotency guards (AuditLog-011). |
+| 9 | Testing coverage | Yes | Excellent surface coverage. Integration tests exist for the dual-write path in `AuditLogIngestActorCombinedTelemetryTests` and `CachedCallCombinedTelemetryTests`, but those drive the actor directly via the test harness — there is no integration test that asserts the production end-to-end emits a `CachedTelemetryBatch` from the site (because nothing does). |
+| 10 | Documentation & comments | Yes | Several large XML-doc paragraphs are accurate, but the `SupervisorStrategy` comments (AuditLog-002), the `DisposeAsync` ordering comment (AuditLog-009), and a few stale "Bundle X" references could mislead a new reader. |
+
+## Findings
+
+### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45`, `src/ScadaLink.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:198` |
+
+**Description**
+
+The design (Component-AuditLog.md §"Cached Operations — Combined Telemetry") specifies a
+single `CachedCallTelemetry` packet per lifecycle event that carries BOTH the audit row
+AND the operational `SiteCalls` upsert, with central writing both rows in one transaction.
+The infrastructure exists: `ISiteStreamAuditClient.IngestCachedTelemetryAsync` is on the
+interface; `ClusterClientSiteAuditClient.IngestCachedTelemetryAsync` builds the
+`IngestCachedTelemetryCommand`; the proto carries `CachedTelemetryBatch`;
+`AuditLogIngestActor.OnCachedTelemetryAsync` performs the dual `InsertIfNotExists` +
+`UpsertAsync` inside a `BeginTransactionAsync`. But a `grep` for callers of
+`IngestCachedTelemetryAsync` in `src/ScadaLink.AuditLog` shows only the interface
+declaration and the two implementations — nothing produces a `CachedTelemetryBatch` for
+the site to push. The `SiteAuditTelemetryActor.OnDrainAsync` only calls
+`IngestAuditEventsAsync` (the audit-only path); cached-call audit rows written by
+`CachedCallTelemetryForwarder` to local SQLite are drained as ordinary audit events,
+and the `SiteCalls` operational half rides a separate `UpsertSiteCallCommand` channel
+into `SiteCallAuditActor`. The "central writes AuditLog + SiteCalls in one transaction"
+guarantee is therefore not delivered — the two writes are now uncorrelated across
+actors and can fail independently, and the dual-write path in `AuditLogIngestActor`
+is dead production code.
+
+**Recommendation**
+
+Either (a) wire the combined path: build a `CachedTelemetryBatch` from the audit rows
+the forwarder writes (alongside the operational half held by `IOperationTrackingStore`),
+add a parallel drain loop that calls `IngestCachedTelemetryAsync`, and gate the audit-only
+drain so cached-call rows don't double-emit; or (b) update the design doc + the
+`AuditLogIngestActor` / `ClusterClientSiteAuditClient` / interface XML comments to
+acknowledge that the two halves now flow via separate transports, and delete the
+unreachable `OnCachedTelemetryAsync` dual-write code (after confirming the
+`AuditLogIngestActorCombinedTelemetryTests` integration tests exercise it via direct
+actor injection only).
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:99-103`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:109-115`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321` |
+
+**Description**
+
+Three central actors (`AuditLogIngestActor`, `AuditLogPurgeActor`, `SiteAuditReconciliationActor`)
+all override `SupervisorStrategy()` and return
+`new OneForOneStrategy(maxNrOfRetries: 0, withinTimeRange: TimeSpan.Zero, decider: DefaultDecider)`.
+The surrounding XML / inline comments variously claim "uses `Resume` so a thrown exception
+inside `ReceiveAsync` does not restart the actor" (AuditLogIngestActor remarks),
+"uses Resume so any leaked exception keeps the singleton alive for the next tick"
+(AuditLogPurgeActor remarks), and "the actor's supervisor strategy keeps it alive
+across any leaked exception with `DefaultDecider`'s Restart semantics — restart resets
+the in-memory cursors, but as noted above that's a safe (over-pull, idempotent) recovery"
+(SiteAuditReconciliationActor remarks — at least correctly says Restart, but conflicts
+with the other two). Two things are wrong: (1) the strategy returned by an actor's
+`SupervisorStrategy()` override governs how that actor supervises its CHILDREN, not how
+its own parent treats it — so it is not the mechanism that protects these singletons
+from their own throws; (2) `DefaultDecider` Restarts for most exceptions, not Resumes.
+The actors are in fact protected by the per-row / per-batch try/catch blocks inside
+the receive handlers — the supervisor override is effectively unused, since these
+actors have no children. The comments mislead a reader into trusting a guarantee
+that the code does not deliver.
+
+**Recommendation**
+
+Pick one of two corrections: either delete the `SupervisorStrategy` override (these
+actors have no children, so the override is dead) and rewrite the comments to credit
+the try/catch blocks for the alive-on-throw guarantee; or, if the override is kept
+as a forward-compat hedge, change the decider to `Decider.From(_ => Directive.Resume)`
+or similar to match the comment, AND add a clear note that the per-row catch is what
+keeps the actor running across handler throws, not the supervisor strategy.
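+
+A minimal sketch of the first option, assuming Akka.NET's `ReceiveActor` API. The names
+`IngestSketchActor`, `IngestBatch`, `IngestResult`, and the `insertRow` delegate are
+hypothetical stand-ins, not types from the module. The actor declares no
+`SupervisorStrategy` override, and the per-row `try`/`catch` is what keeps it running:
+
+```csharp
+using System;
+using System.Threading.Tasks;
+using Akka.Actor;
+
+// Hypothetical message types, for illustration only.
+public sealed record IngestBatch(string[] Rows);
+public sealed record IngestResult(int Inserted, int Failed);
+
+public sealed class IngestSketchActor : ReceiveActor
+{
+    // No SupervisorStrategy() override: this actor has no children, so an
+    // override would never be consulted.
+    public IngestSketchActor(Func<string, Task> insertRow)
+    {
+        ReceiveAsync<IngestBatch>(async batch =>
+        {
+            var replyTo = Sender; // capture before the first await
+            var self = Self;
+            int inserted = 0, failed = 0;
+            foreach (var row in batch.Rows)
+            {
+                try
+                {
+                    await insertRow(row);
+                    inserted++;
+                }
+                catch (Exception)
+                {
+                    // Swallowing here is what keeps the actor running after a
+                    // bad row; the exception never reaches the parent.
+                    failed++;
+                }
+            }
+            replyTo.Tell(new IngestResult(inserted, failed), self);
+        });
+    }
+}
+```
+
+Because the handler never lets an exception escape, the parent's strategy is not
+involved at all, which is the behaviour the rewritten comments should describe.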
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:133`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:139`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:178` |
+
+**Description**
+
+`OnCachedTelemetryAsync` opens `_serviceProvider!.CreateAsyncScope()` and lets
+`await using` dispose it. `OnIngestAsync`, `OnTickAsync` in
+`SiteAuditReconciliationActor`, and `OnTickAsync` in `AuditLogPurgeActor` all open
+`_services.CreateScope()` (the synchronous variant) and dispose it with a synchronous
+`scope.Dispose()` in a `finally` block — even though the per-message work is async and
+the scoped `IAuditLogRepository` resolves an EF Core `DbContext`, which implements
+`IAsyncDisposable`. The synchronous `Dispose()` on a `DbContext` blocks on any pending
+async connection cleanup; under load this can hold the actor thread for the duration
+of a connection close, which on SQL Server may include sending a `SET TRANSACTION
+ISOLATION LEVEL` reset round-trip. Switching to `CreateAsyncScope()` + `await using`
+is the recommended pattern for scoped EF resources.
+
+**Recommendation**
+
+Change `_services.CreateScope()` to `_services.CreateAsyncScope()` in
+`OnIngestAsync`, `SiteAuditReconciliationActor.OnTickAsync`, and
+`AuditLogPurgeActor.OnTickAsync`, and replace the `try/finally { scope?.Dispose(); }`
+pattern with `await using var scope = _services.CreateAsyncScope();`. The DI scope
+will dispose asynchronously and the EF Core context will be released without
+blocking the actor thread.
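+
+The change can be sketched as follows, assuming `Microsoft.Extensions.DependencyInjection`
+(.NET 6 or later, where `CreateAsyncScope` is available). `FakeAuditRepository` and
+`TickHandlerSketch` are hypothetical names that stand in for the scoped repository and
+the actor handler:
+
+```csharp
+using System;
+using System.Threading.Tasks;
+using Microsoft.Extensions.DependencyInjection;
+
+// Hypothetical stand-in for a scoped repository that owns an async-disposable resource.
+public sealed class FakeAuditRepository : IAsyncDisposable
+{
+    public static int AsyncDisposals;
+
+    public Task InsertAsync(string row) => Task.CompletedTask;
+
+    public ValueTask DisposeAsync()
+    {
+        AsyncDisposals++;
+        return default;
+    }
+}
+
+public static class TickHandlerSketch
+{
+    public static async Task OnTickAsync(IServiceProvider services)
+    {
+        // CreateAsyncScope returns an AsyncServiceScope; `await using` disposes
+        // the scoped services through DisposeAsync when the method exits.
+        await using var scope = services.CreateAsyncScope();
+        var repo = scope.ServiceProvider.GetRequiredService<FakeAuditRepository>();
+        await repo.InsertAsync("row-1");
+    }
+}
+```
+
+The `try`/`finally` block disappears entirely, because `await using` covers both the
+success path and the exception path.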
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-004 — `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265` |
+
+**Description**
+
+`PullSiteAsync` iterates the pulled events, calls `InsertIfNotExistsAsync` inside a
+per-row try/catch, and unconditionally updates `maxOccurred = evt.OccurredAtUtc` after
+the try/catch — regardless of whether the insert succeeded or threw. The comment at
+line 247 acknowledges this: "the cursor still advances based on OccurredAtUtc — the
+row was returned by the site, so the next tick won't re-fetch it; if it permanently
+fails to persist, that's an operational concern surfaced by the log, not a hot-loop
+trigger." For a transient fault that flips to success on the next pull the design
+holds. But if a row throws on EVERY central attempt (a truly permanent persistence fault,
+e.g. column-too-long or an FK violation that won't resolve) the cursor still moves
+past it, and central logs the failure once and never re-fetches the row. No alert escalates
+beyond a log line. Worse, the site keeps the row `Pending` (because `MarkReconciledAsync`
+is only called for rows the puller flipped centrally) AND will trip the
+`SiteAuditTelemetryStalled` signal because the backlog never drains, but the central
+log message is the only place an operator could correlate the stall with the
+persistent insert failure.
+
+**Recommendation**
+
+Either (a) only advance the cursor for rows whose `InsertIfNotExistsAsync` returned
+cleanly — leave `maxOccurred` at the previous value for the failing row so the next
+tick retries; or (b) increment a dedicated `CentralAuditPermanentInsertFailure` health
+counter on the per-row catch so the failure is observable on the dashboard instead of
+buried in the log. Option (a) needs a guard against the same row throwing forever
+(saturate the puller) — a small per-event retry counter held in the actor's state with
+a permanent-skip + `LogCritical` threshold is the standard escape valve.
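+
+Option (a) with the retry guard might look like the sketch below. `PulledEvent`,
+`CursorSketch`, and the two callbacks are hypothetical; the sketch also assumes events
+arrive ordered by `OccurredAtUtc` and carry a unique id:
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Hypothetical event shape; only the two fields the cursor logic needs.
+public sealed record PulledEvent(Guid EventId, DateTime OccurredAtUtc);
+
+public sealed class CursorSketch
+{
+    private readonly Dictionary<Guid, int> _failures = new();
+    private readonly int _maxAttempts;
+
+    public CursorSketch(int maxAttempts) => _maxAttempts = maxAttempts;
+
+    // Returns the new cursor value for one pull.
+    public DateTime Advance(DateTime cursor, IEnumerable<PulledEvent> events,
+        Action<PulledEvent> insert, Action<PulledEvent> onPermanentSkip)
+    {
+        foreach (var evt in events)
+        {
+            try
+            {
+                insert(evt);
+                _failures.Remove(evt.EventId);
+            }
+            catch (Exception)
+            {
+                var attempts = _failures.GetValueOrDefault(evt.EventId) + 1;
+                if (attempts < _maxAttempts)
+                {
+                    _failures[evt.EventId] = attempts;
+                    return cursor; // stop here so the next tick retries this row
+                }
+                // Threshold reached: report loudly, then skip past the row.
+                _failures.Remove(evt.EventId);
+                onPermanentSkip(evt);
+            }
+            cursor = evt.OccurredAtUtc;
+        }
+        return cursor;
+    }
+}
+```
+
+The trade-off is head-of-line blocking: rows behind a failing row wait until it either
+succeeds or reaches the skip threshold, so the threshold should stay small.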
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:597-657` |
+
+**Description**
+
+`SqliteAuditWriter.GetBacklogStatsAsync` takes `_writeLock` (the same lock that
+serialises every batch INSERT in `FlushBatch`) and holds it for the duration of a
+`SELECT COUNT(*), MIN(OccurredAtUtc) FROM AuditLog WHERE ForwardState = 'Pending'`.
+`SiteAuditBacklogReporter` calls this on a 30-second timer. On a healthy site with
+few `Pending` rows the index-only scan is fast; under the scenario the metric exists
+to detect — a prolonged central outage growing the backlog "indefinitely" per
+Component-AuditLog.md — a `COUNT(*)` over hundreds of thousands of `Pending` rows
+on the `IX_SiteAuditLog_ForwardState_Occurred` index is no longer cheap, and the
+duration of that scan is added to the hot-path write latency for every concurrent
+script. The hot path is supposed to be "durable in microseconds" per the design doc;
+a multi-hundred-millisecond probe stall in the same period would not be visible
+externally but would back-pressure the bounded write channel. `ReadPendingAsync` and
+`ReadPendingSinceAsync` share the same lock for the same reason and have the same
+exposure under backlog growth.
+
+**Recommendation**
+
+Either (a) move the SELECT outside the write lock by using a second, dedicated
+read-only SQLite connection (with `journal_mode=WAL`, SQLite lets readers on a separate
+connection proceed while a writer is active, which would also benefit the
+hot path); or (b) cache the last snapshot inside the writer and recompute it
+lazily on a dedicated background tick so the reporter reads a pre-computed snapshot
+without acquiring the write lock. Option (a) also unblocks `ReadPendingAsync` /
+`ReadPendingSinceAsync` from competing with the writer.
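+
+A small sketch of the snapshot holder for option (b). `BacklogSnapshot` and
+`BacklogSnapshotCache` are hypothetical names, and the sketch assumes the background
+tick that calls `Publish` runs its query without taking `_writeLock`:
+
+```csharp
+using System;
+using System.Threading;
+
+// Hypothetical snapshot shape for the backlog probe.
+public sealed record BacklogSnapshot(long PendingCount, DateTime? OldestPendingUtc);
+
+public sealed class BacklogSnapshotCache
+{
+    private BacklogSnapshot _current = new(0, null);
+
+    // Called by the background tick after it has computed fresh numbers.
+    // Swapping an immutable reference is atomic, so no lock is needed.
+    public void Publish(BacklogSnapshot snapshot) => Volatile.Write(ref _current, snapshot);
+
+    // Called by the reporter; touches neither the write lock nor the database.
+    public BacklogSnapshot Read() => Volatile.Read(ref _current);
+}
+```
+
+The reporter then sees numbers that are at most one tick old, which is usually
+acceptable for a metric that is itself sampled every 30 seconds.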
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-006 — `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:697-700` |
+
+**Description**
+
+```csharp
+public void Dispose()
+{
+    DisposeAsync().AsTask().GetAwaiter().GetResult();
+}
+```
+
+This is the classic sync-over-async anti-pattern. `DisposeAsync` `await`s the
+writer-loop task with `.ConfigureAwait(false)`, so on a thread with no synchronization
+context (the typical .NET 10 host shutdown path) it's fine; but if any caller invokes
+`Dispose()` from a context that captures (an ASP.NET request, a SynchronizationContext
+test runner, an Akka.NET dispatcher in some configurations) the `GetResult()` blocks
+the captured thread while the continuation tries to resume on it — classic deadlock.
+The writer is registered as a DI singleton, so this is unlikely to bite during the
+host's `IAsyncDisposable` shutdown (DI prefers `DisposeAsync` when available), but
+an integration test or future code path that constructs the writer manually inside
+a sync context will hang.
+
+**Recommendation**
+
+Drop the `IDisposable` interface and rely on `IAsyncDisposable` only — the DI container
+will call `DisposeAsync` on singletons that implement it. If a sync `Dispose` is
+required for compatibility with consumers that don't honour `IAsyncDisposable`,
+implement it as a best-effort that calls `_writeQueue.Writer.TryComplete()` + a
+short wait, without blocking the thread for the full async drain.
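+
+If the synchronous `Dispose` has to stay, a best-effort version could look like this
+sketch. `WriterSketch` is a hypothetical class with the same channel-plus-loop shape;
+the one-second wait is an assumed value, not one taken from the module:
+
+```csharp
+using System;
+using System.Threading.Channels;
+using System.Threading.Tasks;
+
+// Hypothetical writer with a bounded channel and a dedicated drain loop.
+public sealed class WriterSketch : IAsyncDisposable, IDisposable
+{
+    private readonly Channel<string> _writeQueue = Channel.CreateBounded<string>(1024);
+    private readonly Task _writerLoop;
+    public int Written;
+
+    public WriterSketch() => _writerLoop = Task.Run(() => DrainAsync());
+
+    public bool TryWrite(string item) => _writeQueue.Writer.TryWrite(item);
+
+    private async Task DrainAsync()
+    {
+        await foreach (var item in _writeQueue.Reader.ReadAllAsync().ConfigureAwait(false))
+            Written++; // a real writer would persist the item here
+    }
+
+    public async ValueTask DisposeAsync()
+    {
+        _writeQueue.Writer.TryComplete();
+        await _writerLoop.ConfigureAwait(false); // full drain, no blocked thread
+    }
+
+    public void Dispose()
+    {
+        // Best effort: stop intake, then wait a bounded time for the loop.
+        // The loop runs on the thread pool, so this wait does not depend on
+        // the caller's SynchronizationContext.
+        _writeQueue.Writer.TryComplete();
+        _writerLoop.Wait(TimeSpan.FromSeconds(1));
+    }
+}
+```
+
+The sync path can still return before the queue is empty; callers that need the full
+drain should use `DisposeAsync`.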
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:148-218` |
+
+**Description**
+
+`AddAuditLog` registers three components that depend on `INodeIdentityProvider`:
+- `CachedCallTelemetryForwarder` — resolves with `sp.GetService<INodeIdentityProvider>()`
+  (optional, falls back to a null `SourceNode`).
+- `CachedCallLifecycleBridge` — resolves with `sp.GetService<INodeIdentityProvider>()`
+  (optional, same fallback).
+- `CentralAuditWriter` — resolves with `sp.GetRequiredService<INodeIdentityProvider>()`
+  (required, throws at first resolution if unregistered).
+
+The XML comments at lines 153 / 175 / 215 explain the reasoning — the first two are
+optional because tests may skip the registration; the third is required because "the
+production composition root in `SiteServiceRegistration` registers the provider as a
+singleton on both site and central paths". But this is a fragile guarantee — `AddAuditLog`
+itself does NOT register the provider, so a future composition root that calls
+`AddAuditLog` without first calling whatever registers `INodeIdentityProvider` will fail
+on the FIRST resolution of `ICentralAuditWriter` (which is a lazy factory) rather than
+at `AddAuditLog` time. The result: site nodes that "happen to work" because they hold
+a registered provider, central composition test fixtures that fail at runtime instead
+of DI-build time, and a `GetService`/`GetRequiredService` split that gives no clear
+contract to the reader.
+
+**Recommendation**
+
+Either (a) make all three optional: `CentralAuditWriter` already handles a null provider
+gracefully (lines 113-116, null-coalescing the caller's value); the asymmetry buys
+nothing. Or (b) make all three required and either add `services.AddSingleton<INodeIdentityProvider, ...>()`
+inside `AddAuditLog` (with a sensible default — null node name returns `<unknown>`) or
+add an explicit guard at the top of `AddAuditLog` that throws if no provider has been
+registered yet (`services.Any(d => d.ServiceType == typeof(INodeIdentityProvider))`).
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-008 — Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/FallbackAuditWriter.cs:51-77`, `src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:77-104`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:125,155` |
+
+**Description**
+
+`FallbackAuditWriter`, `CentralAuditWriter`, and `AuditLogIngestActor` all accept an
+`IAuditPayloadFilter` as an optional dependency, defaulting to `null = pass-through`.
+The justification in every XML comment is the same: "the M4 test composition roots
+that don't pass one keep working (they only ever write small payloads)". This is fine
+for size — but the filter also performs HEADER REDACTION (`Authorization`, `Cookie`,
+`Set-Cookie`, `X-API-Key`), GLOBAL BODY REDACTORS, and SQL PARAMETER REDACTION. A test
+fixture (or any future composition root that bypasses `AddAuditLog`) that injects a
+real `RequestSummary` will see secrets written to SQLite / MS SQL with no redaction.
+The combination "audit-write must never abort the user-facing action" + "unredacted
+secrets must never persist" (Component-AuditLog.md §Payload Capture Policy) makes the
+no-filter fallback genuinely dangerous — over-redacting on a missing filter is the
+contract the production setup honours, but the code itself defaults to under-redact.
+
+**Recommendation**
+
+Change the three null-coalesce sites to default to a non-null sentinel filter that
+performs the header redaction (`HeaderRedactList`) using the hard-coded defaults
+from `AuditLogOptions`, even when no `IAuditPayloadFilter` is registered. The
+truncation stage can remain optional; the header redaction must not. Alternatively,
+make `IAuditPayloadFilter` non-optional and have `AddAuditLog` register the real
+filter unconditionally — tests that don't bind the options section will resolve the
+default `AuditLogOptions` and get the production-default redact list automatically.
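+
+A sentinel filter of the kind described above could be as small as the following sketch.
+`DefaultHeaderRedactor` and the `[REDACTED]` placeholder are assumptions; only the four
+header names come from the description:
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+public static class DefaultHeaderRedactor
+{
+    // Header names from the finding above; matching is case-insensitive.
+    private static readonly HashSet<string> RedactList = new(StringComparer.OrdinalIgnoreCase)
+    {
+        "Authorization", "Cookie", "Set-Cookie", "X-API-Key",
+    };
+
+    public const string Placeholder = "[REDACTED]"; // assumed placeholder text
+
+    // Returns a copy of the headers with sensitive values replaced.
+    public static Dictionary<string, string> Redact(IReadOnlyDictionary<string, string> headers) =>
+        headers.ToDictionary(
+            h => h.Key,
+            h => RedactList.Contains(h.Key) ? Placeholder : h.Value,
+            StringComparer.OrdinalIgnoreCase);
+}
+```
+
+Using this as the fallback when no `IAuditPayloadFilter` is injected keeps the writers
+on the over-redacting side even in composition roots that bypass `AddAuditLog`.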
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-009 — `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:706-740` |
+
+**Description**
+
+The first `lock (_writeLock)` block in `DisposeAsync` is commented:
+
+> Stop accepting new events. Setting _disposed first ensures any FlushBatch entered
+> after we mark disposed will fault its pending events rather than touching the
+> about-to-close connection.
+
+But the block does NOT set `_disposed = true` — it only calls
+`_writeQueue.Writer.TryComplete()` and captures `_writerLoop`. The `_disposed` flag is
+flipped in the SECOND lock block (line 738), AFTER the 5-second wait on the writer
+loop. During the wait window, a concurrent `WriteAsync` that observed the channel
+NOT-yet-completed (race: it ran before `TryComplete`) and got past `TryWrite` would
+land on the writer loop's `FlushBatch`, which then takes the lock and checks
+`_disposed` — and finds it still `false`. The check at the top of `FlushBatch`
+(line 265) `if (_disposed) { fault pending; return; }` therefore does NOT fire during
+the dispose window. In practice the channel being completed drains the loop cleanly
+and the disposal race is benign, but the comment claims a guarantee that the code
+does not implement.
+
+**Recommendation**
+
+Either set `_disposed = true` in the first lock block to match the comment (and remove
+the duplicate `_disposed` check in the second block); or rewrite the comment to
+describe the actual ordering: the channel is completed first, the loop drains
+remaining items under the lock, and `_disposed = true` is set only after the loop
+exits. The current code is correct; the comment is wrong.
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:228` |
+
+**Description**
+
+The drain loops on `SiteAuditTelemetryActor.OnDrainAsync` and the per-site pull on
+`SiteAuditReconciliationActor.PullSiteAsync` both pass `CancellationToken.None` to
+every async dependency call (queue reads, gRPC client, repository writes). The actor
+has no `CancellationToken` field, so there's no in-flight cancellation source —
+graceful shutdown relies entirely on `PostStop` being called and the actor's
+`Receive` continuation completing naturally. For a healthy gRPC client this is fine,
+but a stuck `IngestAuditEventsAsync` call (slow central, partition switch in progress)
+holds the actor's continuation indefinitely; the host's coordinated-shutdown will then
+time out the actor system and leave the actor in an undefined state. The brief
+references "cancellation on stop" in the partition-maintenance comments but
+`SiteAuditTelemetryActor` does not implement it.
+
+**Recommendation**
+
+Introduce a per-actor `CancellationTokenSource` populated in `PreStart` and cancelled
+in `PostStop`; pass `_lifecycleCts.Token` instead of `CancellationToken.None` in
+every async dependency call. Apply the same change to `SiteAuditReconciliationActor`.
+An `OperationCanceledException` is already swallowed by the existing top-level catch
+in `OnDrainAsync` (line 128), so plumbing the token through is a localised change.
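+
+The lifecycle token can be sketched like this, assuming Akka.NET's `ReceiveActor`
+lifecycle hooks. `DrainSketchActor`, `Drain`, and the `pushBatch` delegate are
+hypothetical stand-ins for the real actor, its tick message, and the gRPC call:
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+using Akka.Actor;
+
+public sealed record Drain; // hypothetical tick message
+
+public sealed class DrainSketchActor : ReceiveActor
+{
+    private CancellationTokenSource? _lifecycleCts;
+
+    public DrainSketchActor(Func<CancellationToken, Task> pushBatch)
+    {
+        ReceiveAsync<Drain>(async _ =>
+        {
+            try
+            {
+                // Pass the lifecycle token instead of CancellationToken.None.
+                await pushBatch(_lifecycleCts!.Token);
+            }
+            catch (OperationCanceledException)
+            {
+                // Expected once the actor is stopping; nothing to retry.
+            }
+        });
+    }
+
+    protected override void PreStart() => _lifecycleCts = new CancellationTokenSource();
+
+    // Cancelling here lets calls that hold the token observe the stop.
+    protected override void PostStop() => _lifecycleCts?.Cancel();
+}
+```
+
+Every dependency call in the handler should receive the same token so that a single
+`Cancel` reaches queue reads, the client call, and repository writes alike.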
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:53-55, 263-276, 301-346` |
+
+**Description**
+
+The XML doc on `AddAuditLog` is explicit: "Idempotent re-registration is not supported;
+call this exactly once per `IServiceCollection`." But `AddAuditLogHealthMetricsBridge`
+calls `services.AddHostedService<SiteAuditBacklogReporter>()` (line 275), which is
+NOT idempotent — every call registers another descriptor, and the host will spin up
+N reporters and have them all poll SQLite every 30 s, all push the same snapshot into
+`ISiteHealthCollector`. The site composition path is supposed to call this exactly
+once, but tests or composition refactors that accidentally call twice will pay 2x the
+SQL probe rate and overwrite the snapshot with conflicting numbers (no race, just
+wasted work). Worse, `AddAuditLogCentralMaintenance` (line 301) is also non-idempotent —
+`AddOptions<AuditLogPartitionMaintenanceOptions>` and `AddHostedService<AuditLogPartitionMaintenanceService>`
+will pile up.
+
+**Recommendation**
+
+Either (a) guard each Add* helper with a "has the marker been seen?" sentinel
+(register a private marker descriptor on first call, no-op on subsequent calls);
+or (b) explicitly document the call-once contract on the public surface of every helper
+and verify it with a unit test in `AddAuditLogTests`. Option (a) matches the pattern other
+SDK extensions use and removes a foot-gun.
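+
+The marker approach in option (a) can be sketched as below, assuming
+`Microsoft.Extensions.DependencyInjection`. `AddBridgeOnce`, `BridgeMarker`, and
+`BacklogReporter` are hypothetical names; a real helper would call
+`AddHostedService<T>()` where the sketch registers a singleton:
+
+```csharp
+using System.Linq;
+using Microsoft.Extensions.DependencyInjection;
+
+public static class IdempotentRegistrationSketch
+{
+    // Private marker type: its presence in the collection means "already added".
+    private sealed class BridgeMarker { }
+
+    // Hypothetical stand-in for the hosted service being registered.
+    public sealed class BacklogReporter { }
+
+    public static IServiceCollection AddBridgeOnce(this IServiceCollection services)
+    {
+        if (services.Any(d => d.ServiceType == typeof(BridgeMarker)))
+            return services; // second and later calls are no-ops
+
+        services.AddSingleton<BridgeMarker>();
+        services.AddSingleton<BacklogReporter>(); // AddHostedService<T>() would go here
+        return services;
+    }
+}
+```
+
+A unit test only needs to call the helper twice and count the descriptors for the
+registered service type.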
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.CLI` |
 | Design doc | `docs/requirements/Component-CLI.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |

 ## Summary

@@ -47,6 +47,36 @@ and `WriteAsTable` derives table columns from only the first array element, sile
 dropping columns for any later element with a different shape (CLI-016). No
 Critical/High issues; the module remains healthy.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+The CLI has grown two substantial new command groups since the last re-review —
+`scadalink audit` (Audit Log #23 M8) and `scadalink bundle` (Transport #24) — together
+adding ~1,500 lines of new production code. The new `audit` surface is well-tested and
+well-factored (pure helpers + a clear `IAuditFormatter` seam), but the new `bundle`
+surface is untested, duplicates the URL/credential resolution that already exists in
+`CommandHelpers`, and inherits a partial authorization-exit-code regression that also
+appears in the audit path. Two longstanding fragility gaps that the prior reviews missed
+also surface in this pass: `CliConfig.Load` parses the config file with no try/catch, and
+`CommandTreeTests` still pins the old 14-group count so the two new groups are excluded
+from the leaf-action and registry-resolution coverage that protected the rest of the
+tree. Module health is broadly good but the consolidated count is now seven Open
+findings (none Critical, three Medium).
+
+- **CLI-017** — `BundleCommands` duplicates `ExecuteCommandAsync` and skips the
+  `FORBIDDEN`/`UNAUTHORIZED` exit-code mapping (auth exit 2 contract regression).
+- **CLI-018** — `AuditQueryHelpers.RunQueryAsync` / `AuditExportHelpers.RunExportAsync`
+  return exit 1 on every error, never the documented exit 2 for authorization failure.
+- **CLI-019** — `BundleCommands.bundle export` decodes the entire base64 bundle in
+  memory and writes synchronously — 100 MB bundles double-buffer.
+- **CLI-020** — `BundleCommands.bundle export` parses the success body with bare
+  `JsonDocument.Parse` + `GetProperty` and throws on a malformed/abbreviated envelope.
+- **CLI-021** — `CliConfig.Load` crashes the whole CLI when `~/.scadalink/config.json`
+  is malformed or unreadable, even if `--url` was supplied on the command line.
+- **CLI-022** — `AuditCommands` and `BundleCommands` are absent from `CommandTreeTests`;
+  the test still pins `Equal(14, groups.Count)` and silently excludes the new groups.
+- **CLI-023** — `Component-CLI.md` says the audit commands ride `POST /management`,
+  but the implementation calls a new `GET /api/audit/*` REST endpoint pair.
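+
+For the envelope parsing described in CLI-020, a defensive read could follow the sketch
+below, which uses only `System.Text.Json`. `EnvelopeSketch.TryReadString` is a
+hypothetical helper, and the property name is left to the caller because the real
+envelope shape is not shown here:
+
+```csharp
+using System.Text.Json;
+
+public static class EnvelopeSketch
+{
+    // Returns false instead of throwing when the body is not JSON, is not an
+    // object, or lacks the requested string property.
+    public static bool TryReadString(string body, string property, out string? value)
+    {
+        value = null;
+        try
+        {
+            using var doc = JsonDocument.Parse(body);
+            if (doc.RootElement.ValueKind == JsonValueKind.Object
+                && doc.RootElement.TryGetProperty(property, out var element)
+                && element.ValueKind == JsonValueKind.String)
+            {
+                value = element.GetString();
+                return true;
+            }
+            return false;
+        }
+        catch (JsonException)
+        {
+            return false;
+        }
+    }
+}
+```
+
+The caller can then map a `false` result to a clean error message and a non-zero exit
+code instead of surfacing an unhandled exception.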
+
 ## Checklist coverage

 _Original review (2026-05-16, `9c60592`):_
@@ -79,6 +109,21 @@ _Re-review (2026-05-17, `39d737e`):_
 | 9 | Testing coverage | ☑ | Substantially expanded (`CommandTreeTests`, `ManagementHttpClientTests`, `DebugStreamTests`). No new gaps. |
 | 10 | Documentation & comments | ☑ | XML docs accurate. `Component-CLI.md` drift folded into CLI-015. |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | `BundleCommands.BuildExport` unguarded `JsonDocument.Parse` + `GetProperty` (CLI-020); `CliConfig.Load` unguarded JSON parse (CLI-021). |
+| 2 | Akka.NET conventions | ☑ | Not applicable — pure HTTP/SignalR/REST client. No issues. |
+| 3 | Concurrency & thread safety | ☑ | No new concurrency surface; `debug stream` unchanged since CLI-011/012. No issues. |
+| 4 | Error handling & resilience | ☑ | Bundle and audit paths skip the auth exit-code contract (CLI-017, CLI-018); bundle JSON-envelope parse is brittle (CLI-020); config-file parse aborts the process (CLI-021). |
+| 5 | Security | ☑ | No new credential or trust-boundary issues. No issues. |
+| 6 | Performance & resource management | ☑ | `bundle export` double-buffers the whole bundle in memory (CLI-019). |
+| 7 | Design-document adherence | ☑ | `Component-CLI.md` claims audit commands ride `POST /management`; implementation uses new REST endpoints (CLI-023). |
+| 8 | Code organization & conventions | ☑ | `BundleCommands.RunBundleCommandAsync` re-implements credential/URL resolution that `CommandHelpers.ExecuteCommandAsync` already provides — drift waiting to happen (CLI-017). |
+| 9 | Testing coverage | ☑ | `BundleCommands` has no tests; `CommandTreeTests` pins `Equal(14, …)` and excludes the new `AuditCommands` + `BundleCommands` groups (CLI-022). |
+| 10 | Documentation & comments | ☑ | XML docs accurate; doc-vs-code transport drift folded into CLI-023. No other issues. |
+
 ## Findings

 ### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken
@@ -741,3 +786,284 @@ list and `OutputFormatter.WriteTable` pads missing cells, so heterogeneous array
 render every column. Regression tests added in `TableHeaderUnionTests` (3 tests:
 later-element-only column included, first-seen column order preserved,
 first-element-extra column still rendered).
+
+### CLI-017 — `BundleCommands.RunBundleCommandAsync` duplicates `ExecuteCommandAsync` and breaks the auth exit-code contract
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.CLI/Commands/BundleCommands.cs:244-289` (vs. `src/ScadaLink.CLI/Commands/CommandHelpers.cs:20-73`, `:159-174`) |
+
+**Description**
+
+`BundleCommands.RunBundleCommandAsync` re-implements the URL/credential resolution,
+validation, and HTTP plumbing that `CommandHelpers.ExecuteCommandAsync` already provides
+for every other command group — to attach a 5-minute timeout (`BundleCommandTimeout`)
+and a caller-supplied success handler. In duplicating it, two contracts that
+`CommandHelpers` carefully establishes were dropped:
+
+1. **Authorization exit code.** `CommandHelpers.HandleResponse` routes through
+   `IsAuthorizationFailure`, which returns exit 2 for **either** HTTP 403 **or** an
+   `UNAUTHORIZED`/`FORBIDDEN` error code on any status (resolution of CLI-009). The
+   bundle path at line 287 uses a bare `if (response.StatusCode == 403) return 2;` — a
+   server that signals authorization failure via the `code` field on a non-403 status
+   (the same channel the rest of the CLI honours) will exit 1 instead of 2 from
+   `bundle export`/`preview`/`import`. `Component-Transport.md:289` explicitly states
+   "Exit codes follow the project convention: `0` = success, `1` = command failure,
+   `2` = authorization failure," so this is a contract regression.
+2. **Error-message phrasing drift.** The two duplicated error paths
+   (`BundleCommands.cs:258-260`, `:264-266`) emit shorter messages that omit the
+   `SCADALINK_MANAGEMENT_URL` / `SCADALINK_USERNAME` env-var hints the canonical paths
+   give, which is confusing for a user trying to work out which input is missing.
+
+**Recommendation**
+
+Refactor `CommandHelpers.ExecuteCommandAsync` to accept an optional `TimeSpan` timeout
+and an optional success handler, and have `BundleCommands` call it. Failing that,
+promote `CommandHelpers.IsAuthorizationFailure` to `internal` and call it from
+`RunBundleCommandAsync` in place of the bare 403 check, and copy the canonical error
+messages verbatim.
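+
+A minimal sketch of the fallback option, assuming a shared predicate shaped like the
+one described above. `AuthExitCodes`, its members, and the parameter names are
+hypothetical; the real `IsAuthorizationFailure` signature in `CommandHelpers` may differ.
+
+```csharp
+using System;
+
+// Hypothetical shared helper; names are illustrative, not the existing CommandHelpers API.
+internal static class AuthExitCodes
+{
+    public const int CommandFailure = 1;
+    public const int AuthorizationFailure = 2;
+
+    // True for HTTP 403, or for an UNAUTHORIZED/FORBIDDEN error code on any status.
+    public static bool IsAuthorizationFailure(int statusCode, string? errorCode) =>
+        statusCode == 403
+        || string.Equals(errorCode, "FORBIDDEN", StringComparison.OrdinalIgnoreCase)
+        || string.Equals(errorCode, "UNAUTHORIZED", StringComparison.OrdinalIgnoreCase);
+
+    // Maps a failed response to the documented exit code
+    // (1 = command failure, 2 = authorization failure).
+    public static int ForFailure(int statusCode, string? errorCode) =>
+        IsAuthorizationFailure(statusCode, errorCode) ? AuthorizationFailure : CommandFailure;
+}
+```
+
+Under these assumptions the bundle path would call `ForFailure(...)` in place of the
+bare 403 check, so a `FORBIDDEN` code on a non-403 status still exits 2.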
+
+**Resolution**
+
+_Unresolved._
+
+### CLI-018 — `audit query` and `audit export` never return exit 2 for an authorization failure
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.CLI/Commands/AuditQueryHelpers.cs:186-193`, `src/ScadaLink.CLI/Commands/AuditExportHelpers.cs:147-153` |
+
+**Description**
+
+The two audit-log subcommands (`audit query`, `audit export`) ride a new REST surface
+(`GET /api/audit/query` and `GET /api/audit/export`) — not the `POST /management`
+envelope that goes through `CommandHelpers.HandleResponse`. Both helpers map *any*
+non-success response to a generic `OutputFormatter.WriteError(...)` + `return 1`:
+
+- `AuditQueryHelpers.RunQueryAsync:186-193` returns 1 unconditionally when `JsonData`
+  is null (i.e. any error). It never inspects `StatusCode` or `ErrorCode`.
+- `AuditExportHelpers.RunExportAsync:147-153` returns 1 for every non-success status,
+  again with no 403 / `FORBIDDEN` carve-out.
+
+`Component-CLI.md:295-296` documents exit code 2 for "Authorization failure (insufficient
+role)". `Component-AuditLog.md` (Security & Tamper-Evidence) and `Component-CLI.md:184-187`
+both call out that the audit endpoints are gated by `OperationalAudit` and `AuditExport`
+permissions enforced server-side — i.e. these are exactly the commands most likely to
+return 403 in routine use. The exit-code regression silently downgrades a 403 to a
+generic command failure, breaking the CI/CD scripting contract.
+
+**Recommendation**
+
+Promote `CommandHelpers.IsAuthorizationFailure` to `internal` (or move it to a small
+shared helper) and have `RunQueryAsync` / `RunExportAsync` return 2 when it matches.
+The check needs to use the `ManagementResponse.StatusCode` / `ErrorCode` pair the
+audit `SendGetAsync` already populates.
+
+**Resolution**
+
+_Unresolved._
+
+### CLI-019 — `bundle export` decodes the entire base64 bundle into memory before writing
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.CLI/Commands/BundleCommands.cs:117-124`, `src/ScadaLink.CLI/ManagementHttpClient.cs:47-92` |
+
+**Description**
+
+`Component-Transport.md:271` ceilings the raw bundle at 100 MB and notes the
+per-request body cap is raised to 200 MB once base64-inflated. The CLI's export path
+goes through `ManagementHttpClient.SendCommandAsync`, which reads the entire response
+body into a string (`responseBody = await httpResponse.Content.ReadAsStringAsync(...)`)
+and returns it as `ManagementResponse.JsonData`. `BundleCommands.BuildExport` then:
+
+1. `JsonDocument.Parse(jsonOk)` re-allocates the JSON DOM (~200 MB string + DOM).
+2. `doc.RootElement.GetProperty("base64Bundle").GetString()` materializes the base64
+   payload as another ~200 MB `string`.
+3. `Convert.FromBase64String(base64)` allocates a fresh ~100 MB `byte[]`.
+4. `File.WriteAllBytes(output, bytes)` writes synchronously.
+
+Peak working-set for a 100 MB bundle is therefore ~600 MB, all on the LOH, plus the
+file-I/O is fully synchronous. The streaming `SendGetStreamAsync` path the audit
+export uses (line 155-156) shows the right pattern is already available for plain GETs,
+but bundles ride a `POST /management` envelope so they currently can't reuse it.
+
+**Recommendation**
+
+For the export path specifically, add a streaming variant — either a new
+`POST /api/bundle/export` REST endpoint mirroring the audit pattern, or a chunk-fetch
+follow-up `GET /api/bundle/<exportId>` so the CLI can stream bytes through
+`Stream.CopyToAsync` without buffering the whole envelope. If a v1 stop-gap is needed,
+at minimum switch to `File.WriteAllBytesAsync` and use `Convert.TryFromBase64Chars`
+with a rented buffer to avoid the double-LOH allocation.
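+
+A rough sketch of the v1 stop-gap only, assuming the base64 payload has already been
+extracted as a `string`. `BundleFileWriter` and `TryWriteAsync` are hypothetical names,
+not existing CLI code. It reuses a pooled buffer instead of allocating a fresh `byte[]`
+per export and writes asynchronously; it does not remove the envelope string itself
+from memory.
+
+```csharp
+using System;
+using System.Buffers;
+using System.IO;
+using System.Threading.Tasks;
+
+// Hypothetical stop-gap helper; not part of the existing BundleCommands code.
+internal static class BundleFileWriter
+{
+    // Decodes base64 into a pooled buffer and writes it asynchronously.
+    // Returns false when the input is not valid base64.
+    public static async Task<bool> TryWriteAsync(string base64, string outputPath)
+    {
+        // Upper bound on the decoded size: three bytes per four base64 characters.
+        var rented = ArrayPool<byte>.Shared.Rent(base64.Length / 4 * 3);
+        try
+        {
+            if (!Convert.TryFromBase64Chars(base64, rented, out var written))
+                return false;
+
+            await using var file = new FileStream(
+                outputPath, FileMode.Create, FileAccess.Write, FileShare.None,
+                bufferSize: 81920, useAsync: true);
+            await file.WriteAsync(rented.AsMemory(0, written));
+            return true;
+        }
+        finally
+        {
+            // Hand the buffer back so the next export can reuse it.
+            ArrayPool<byte>.Shared.Return(rented);
+        }
+    }
+}
+```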
+
+**Resolution**
+
+_Unresolved._
+
+### CLI-020 — `bundle export` success-envelope parse is unguarded
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.CLI/Commands/BundleCommands.cs:117-126` |
+
+**Description**
+
+The export success handler does:
+
+```csharp
+using var doc = JsonDocument.Parse(jsonOk);
+var base64 = doc.RootElement.GetProperty("base64Bundle").GetString()!;
+var byteCount = doc.RootElement.GetProperty("byteCount").GetInt32();
+var bytes = Convert.FromBase64String(base64);
+```
+
+None of these calls is wrapped in a `try/catch`. A server-side bug that omits one of
+the two properties, returns a `null` `base64Bundle`, sends invalid base64, or sends a
+malformed JSON envelope will surface as one of `JsonException` /
+`KeyNotFoundException` / `ArgumentNullException` / `InvalidOperationException` /
+`FormatException`: an unhandled stack trace, not a clean `INVALID_RESPONSE` / exit 1,
+contradicting the "graceful-degradation" theme that the prior reviews (CLI-002,
+CLI-003, CLI-005) repeatedly hardened.
+
+**Recommendation**
+
+Wrap the parse + base64-decode in a `try` block that catches `JsonException`,
+`KeyNotFoundException`, `InvalidOperationException`, `FormatException`, and
+`ArgumentNullException` (thrown when `base64Bundle` is JSON `null`), and emits a
+clean `OutputFormatter.WriteError(..., "INVALID_RESPONSE")` + `return 1`. Add a
+regression test against a malformed-envelope stub `HttpMessageHandler`.
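+
+One possible shape for the guarded parse, as a sketch rather than the final fix.
+`BundleEnvelope.TryDecode` is a hypothetical helper; only the `base64Bundle` property
+name comes from the snippet above, and the `byteCount` read is left out for brevity.
+The caller would turn a `false` result into the `INVALID_RESPONSE` error and exit 1.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Text.Json;
+
+// Hypothetical helper; not the existing BundleCommands code.
+internal static class BundleEnvelope
+{
+    // Returns true and the decoded bytes when the envelope is well-formed; false otherwise.
+    public static bool TryDecode(string json, out byte[] bytes, out string? error)
+    {
+        bytes = Array.Empty<byte>();
+        error = null;
+        try
+        {
+            using var doc = JsonDocument.Parse(json);
+            var base64 = doc.RootElement.GetProperty("base64Bundle").GetString();
+            if (base64 is null)
+            {
+                // JSON null: reject explicitly instead of letting FromBase64String throw.
+                error = "base64Bundle is null";
+                return false;
+            }
+            bytes = Convert.FromBase64String(base64);
+            return true;
+        }
+        catch (Exception ex) when (ex is JsonException
+                                      or KeyNotFoundException
+                                      or InvalidOperationException
+                                      or FormatException)
+        {
+            error = ex.Message;
+            return false;
+        }
+    }
+}
+```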
+
+**Resolution**
+
+_Unresolved._
+
+### CLI-021 — `CliConfig.Load` crashes the CLI on a malformed config file
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.CLI/CliConfig.cs:41-53` |
+
+**Description**
+
+`CliConfig.Load` is the first thing every command runs (via `ExecuteCommandAsync`,
+`AuditCommandHelpers.ResolveConnection`, and `BundleCommands.RunBundleCommandAsync`).
+Its config-file branch is:
+
+```csharp
+if (File.Exists(configPath))
+{
+    var json = File.ReadAllText(configPath);
+    var fileConfig = JsonSerializer.Deserialize<CliConfigFile>(json, ...);
+    ...
+}
+```
+
+Neither call is guarded. If `~/.scadalink/config.json` exists but is malformed
+(stale, partial, or someone's `vim` swap), `JsonSerializer.Deserialize` throws
+`JsonException`. If the file exists but isn't readable (mode 0000),
+`File.ReadAllText` throws `UnauthorizedAccessException`. Either fault aborts every
+CLI invocation with an unhandled stack trace — even invocations that supply every
+input on the command line and don't need the config file at all (`--url`,
+`--username`, `--password`, `--format` all on the CLI).
+
+**Recommendation**
+
+Wrap the file-read and the `JsonSerializer.Deserialize` in a single
+`try/catch (Exception)` (or specifically `JsonException` +
+`UnauthorizedAccessException` + `IOException`). On failure, write a single one-line
+warning to `Console.Error` ("ignoring malformed `~/.scadalink/config.json`: {message}")
+and return the default `CliConfig`, so the rest of the precedence chain (env vars +
+command-line flags) still works.
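+
+A sketch of the guarded load, under stated assumptions: `CliConfigFileSketch` and
+`ConfigLoader.TryLoad` are hypothetical stand-ins for the real `CliConfigFile` type and
+`CliConfig.Load` branch, and the single `ManagementUrl` property is illustrative only.
+
+```csharp
+using System;
+using System.IO;
+using System.Text.Json;
+
+// Hypothetical stand-in for the real config-file type.
+internal sealed class CliConfigFileSketch
+{
+    public string? ManagementUrl { get; set; }
+}
+
+internal static class ConfigLoader
+{
+    // Returns the parsed file, or null when it is missing, unreadable, or malformed.
+    public static CliConfigFileSketch? TryLoad(string configPath, TextWriter stderr)
+    {
+        if (!File.Exists(configPath)) return null;
+        try
+        {
+            var json = File.ReadAllText(configPath);
+            return JsonSerializer.Deserialize<CliConfigFileSketch>(json);
+        }
+        catch (Exception ex) when (ex is JsonException
+                                      or UnauthorizedAccessException
+                                      or IOException)
+        {
+            // One-line warning; the caller falls back to env vars and command-line flags.
+            stderr.WriteLine($"warning: ignoring config file '{configPath}': {ex.Message}");
+            return null;
+        }
+    }
+}
+```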
+
+**Resolution**
+
+_Unresolved._
+
+### CLI-022 — `CommandTreeTests` excludes the two new command groups
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.CLI.Tests/CommandTreeTests.cs:21-37`, `:55-58` (vs. `src/ScadaLink.CLI/Program.cs:21-36`) |
+
+**Description**
+
+`CommandTreeTests.AllCommandGroups()` builds 14 command groups; `Program.cs` now
+registers 16 (`AuditCommands` and `BundleCommands` were added since the last
+re-review). Worse, the smoke test pins `Assert.Equal(14, groups.Count)`, so the
+assertion only checks the harness's own array and stays green even though the
+real production tree is two groups larger. The downstream assertions
+(`EveryLeafCommand_HasAnAction`, `CommandPayloadTypes_ResolveViaRegistry`) therefore
+also do NOT cover the new audit / bundle leaves — and `BundleCommands` has zero
+test coverage of any kind (no parsing tests, no success-handler tests, no
+registry-resolution tests).
+
+**Recommendation**
+
+Add `AuditCommands.Build(...)` and `BundleCommands.Build(...)` to the
+`AllCommandGroups()` array, bump the assertion to `Equal(16, groups.Count)`, and add
+representative payload types to `CommandPayloadTypes_ResolveViaRegistry`
+(`ExportBundleCommand`, `PreviewBundleCommand`, `ImportBundleCommand`). Optionally,
+add a `BundleCommandsTests` file covering the success-envelope parse and the
+`NameListOption` comma-split parser.
+
+**Resolution**
+
+_Unresolved._
+
+### CLI-023 — `Component-CLI.md` claims audit commands ride `POST /management`; implementation uses REST endpoints
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `docs/requirements/Component-CLI.md:310-311` (vs. `src/ScadaLink.CLI/Commands/AuditQueryHelpers.cs:186`, `src/ScadaLink.CLI/Commands/AuditExportHelpers.cs:126`, `src/ScadaLink.CLI/ManagementHttpClient.cs:94-156`) |
+
+**Description**
+
+`Component-CLI.md:310` states: "The `scadalink audit` command group rides this same
+transport — there is no separate audit endpoint." But the implementation calls a
+new REST surface — `GET /api/audit/query` and `GET /api/audit/export` — via two new
+methods on `ManagementHttpClient` (`SendGetAsync`, `SendGetStreamAsync`), distinct
+from the `POST /management` envelope. The plan document
+(`docs/plans/2026-05-20-audit-log-code-roadmap.md:1583`) corroborates the
+implementation: "REST endpoints `GET /api/audit/query` (paged) and
+`GET /api/audit/export` (streaming)" — i.e. the design doc is the stale one.
+
+A reader following `Component-CLI.md` would expect the audit endpoints to share
+the management envelope's authentication + dispatch path and route through
+`ManagementActor`, neither of which is true. The auth-exit-code regression
+(CLI-018) is itself a direct consequence of this divergence: the audit helpers
+duplicate the management envelope's response handling instead of riding it, and
+omit the auth carve-out.
+
+**Recommendation**
+
+Update `Component-CLI.md:310-311` (and the Dependencies bullet at `:311`) to
+describe the actual REST surface: `GET /api/audit/query` (paged) and
+`GET /api/audit/export` (streaming), with HTTP Basic Auth shared with the
+management envelope and permission checks enforced by the server-side
+`AuditController`. Optionally cross-link to
+`docs/plans/2026-05-20-audit-log-code-roadmap.md` (M8 task list) as the
+authoritative source.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.CentralUI` |
 | Design doc | `docs/requirements/Component-CentralUI.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 8 |

 ## Summary

@@ -73,6 +73,55 @@ cross-thread `Dictionary`; CentralUI-022 unguarded `InvokeAsync`), category 4
 claims), category 9 (CentralUI-025 untested `SessionExpiry` poll). Categories
 1, 2, 5, 6, 7, 10 produced no new findings.

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | CentralUI-026 (AuditFilterBar UTC), CentralUI-027 (3 other pages with same UTC bug). |
+| 2 | Akka.NET conventions | ☑ | No new findings — module is presentation; `DebugStreamService` actor usage unchanged. |
+| 3 | Concurrency & thread safety | ☑ | CentralUI-030 (StringWriter capture buffer not thread-safe under intra-script `Task.WhenAll`). |
+| 4 | Error handling & resilience | ☑ | No new findings — the prior CentralUI-018/023 patterns hold. |
+| 5 | Security | ☑ | CentralUI-028 (NotificationReport + SiteCallsReport not site-scoped — CentralUI-002 regression on new pages). |
+| 6 | Performance & resource management | ☑ | CentralUI-031 (TransportImport buffers full bundle bytes in component state). |
+| 7 | Design-document adherence | ☑ | CentralUI-032 (AuditResultsGrid forward-only paging diverges from the bi-directional navigation implied by "keyset paginated"). |
+| 8 | Code organization & conventions | ☑ | CentralUI-029 (`JS.InvokeAsync<int>("eval", ...)` in ConfigurationAuditLog vs the `_content/.../BrowserTime` module pattern). |
+| 9 | Testing coverage | ☑ | CentralUI-033 (TransportImport / SiteCallsReport query-string drill-in code paths untested). |
+| 10 | Documentation & comments | ☑ | No new findings — code comments accurately describe intent. |
+
+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+All 25 prior findings remain closed. This re-review re-examined the full
+module against the 10-category checklist with attention to the
+recently-added Transport export/import wizards (`TransportExport`,
+`TransportImport`) and the operational Audit Log page (Bundle B..G). The
+most consequential pattern in this pass is that the **CentralUI-008
+local-input-treated-as-UTC** bug, fixed for the legacy
+`AuditLog.razor` via the `BrowserTime.LocalInputToUtc` helper, has been
+silently recreated on every other page that exposes a
+`<input type="datetime-local">` filter — `AuditFilterBar` (the new
+operational Audit Log filter, CentralUI-026), `SiteCallsReport`,
+`NotificationReport`, and `EventLogs` (CentralUI-027). The Audit Log
+page CSV export URL therefore mis-shifts the From/To filter window by
+the operator's UTC offset, and the same offset bug silently corrupts
+audit-style queries on Site Calls / Notification Report / Event Logs.
+Second-most consequential is **CentralUI-028**: the new `NotificationReport`
+and `SiteCallsReport` pages (both `[Authorize(RequireDeployment)]`) do
+NOT filter their site dropdown or row data through `SiteScopeService`,
+and the relay actions (`RetryNotification`/`DiscardNotification`,
+`RetrySiteCall`/`DiscardSiteCall`) issue no server-side site-scope
+re-check before relaying to the owning site — so a site-scoped Deployment
+user can read and act on notifications and cached calls for sites
+outside their grant, replicating the original CentralUI-002 defect on
+the two pages added after the CentralUI-002 fix landed. The remaining
+new findings (CentralUI-029..CentralUI-033) cover a residual `JS.InvokeAsync<int>("eval", ...)`
+in `ConfigurationAuditLog`; a non-thread-safe `StringWriter` capture buffer
+in the Test Run sandbox (a sandboxed script that uses `Task.WhenAll` can
+write concurrently); `TransportImport` holding the full bundle in memory
+(a `using var` `MemoryStream` followed by `ms.ToArray()`); the absence of a
+Previous-page control on `AuditResultsGrid` (forward-only navigation,
+a UX/design adherence gap); and the untested `TransportImport` /
+`SiteCallsReport` query-string drill-in code paths.
+
 ## Findings

 ### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement
@@ -1216,3 +1265,278 @@ also forces the CentralUI-020 fix.
 **Resolution**

 2026-05-17 — added `SessionExpiryComponentTests` (bUnit): an expired ping (401) redirects to `/login`, a live ping (200) and a transient failure (status 0) do not, and on the `/login` route the component neither pings nor redirects; also added `AuthPingEndpointTests` covering the `/auth/ping` endpoint contract.
+
+### CentralUI-026 — `AuditFilterBar` From/To filters treat browser-local datetimes as UTC
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Audit/AuditFilterBar.razor:97-104`; `src/ScadaLink.CentralUI/Components/Audit/AuditQueryModel.cs:56-58,150-178,203-213` |
+
+**Description**
+
+The new operational Audit Log filter bar binds two `<input type="datetime-local">` controls
+straight to `AuditQueryModel.CustomFromUtc` / `CustomToUtc` (`DateTime?`), and `ToFilter`
+emits those values as `AuditLogQueryFilter.FromUtc` / `ToUtc` without converting from
+the browser's local time zone. A `datetime-local` input yields the user's *browser-local*
+wall-clock value, so for any non-UTC user the audit query window is shifted by their UTC
+offset — returning the wrong rows from the central `AuditLog` table and producing a
+mis-shifted CSV export through `AuditLogPage.BuildExportUrl`, which round-trips the
+filter's `FromUtc`/`ToUtc` straight into `?from=`/`?to=` query params. This is the same
+defect CentralUI-008 fixed for the legacy `Components/Pages/Monitoring/AuditLog.razor`
+via the `BrowserTime.LocalInputToUtc(value, _browserUtcOffsetMinutes)` helper — but the
+new Audit Log v2 filter bar does not use that helper, so a Bundle B/C/D/E/F regression
+re-introduced the bug for the page-replacement target. The CLAUDE.md "all timestamps are
+UTC throughout" decision is satisfied at the wire level but violated at the input
+boundary, exactly as the original finding called out.
+
+**Recommendation**
+
+Fetch the browser offset once via JS interop (mirroring `ConfigurationAuditLog.OnAfterRenderAsync`
+and `AuditLog.razor`'s implementation), pipe both `CustomFromUtc` and `CustomToUtc` through
+`BrowserTime.LocalInputToUtc(value, offsetMinutes)` inside `AuditQueryModel.ToFilter`
+(or in the filter-bar Apply path before calling `ToFilter`), and add a regression test
+that pins the non-UTC behaviour (mirroring `BrowserTimeTests.LocalInputToUtc_NonUtcBrowser_DoesNotEqualNaiveRelabelling`).
+The label "Custom From / To" should also be clarified ("UTC" vs "local") in the UI.
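+
+For illustration only, the conversion the filter needs can be sketched as below.
+`BrowserTimeSketch` is a hypothetical re-creation, not the existing `BrowserTime`
+helper, and it assumes the offset follows JavaScript's `Date.getTimezoneOffset()`
+convention (UTC minus local time, in minutes).
+
+```csharp
+using System;
+
+// Hypothetical re-creation for illustration; the real BrowserTime helper may differ.
+internal static class BrowserTimeSketch
+{
+    // offsetMinutes: UTC minus local, in minutes (for example 240 for UTC-04:00).
+    public static DateTime? LocalInputToUtc(DateTime? localInput, int offsetMinutes) =>
+        localInput.HasValue
+            ? DateTime.SpecifyKind(localInput.Value.AddMinutes(offsetMinutes), DateTimeKind.Utc)
+            : (DateTime?)null;
+}
+```
+
+With an offset of 240, a local input of 10:00 becomes 14:00 UTC, whereas stamping
+`DateTimeKind.Utc` on the raw value would leave it at 10:00 and shift the query window
+by four hours.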
+
+### CentralUI-027 — Same UTC misinterpretation in `SiteCallsReport`, `NotificationReport`, and `EventLogs`
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor:74-80`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor.cs:421-425`; `src/ScadaLink.CentralUI/Components/Pages/Notifications/NotificationReport.razor:75-81,639-640`; `src/ScadaLink.CentralUI/Components/Pages/Monitoring/EventLogs.razor:62-73,261-262` |
+
+**Description**
+
+The same `datetime-local`-treated-as-UTC bug from CentralUI-008 and CentralUI-026 is
+present on three other pages:
+
+- `SiteCallsReport.ToUtc` stamps `DateTimeKind.Utc` on the local-input value
+  (`DateTime.SpecifyKind(value.Value, DateTimeKind.Utc)`).
+- `NotificationReport.ToUtc` does the same — `new DateTimeOffset(DateTime.SpecifyKind(local.Value, DateTimeKind.Utc))`.
+- `EventLogs.FetchPage` emits `new DateTimeOffset(_filterFrom.Value, TimeSpan.Zero)`,
+  which labels the browser-local wall-clock value as UTC (the exact pre-fix shape of
+  CentralUI-008).
+
+For any non-UTC operator, every Site-Calls / Notification / Event-Log query is silently
+shifted by their UTC offset. The bug has been recreated on every page added after
+CentralUI-008 landed — the `BrowserTime` helper exists but is only used by the legacy
+Audit Log page and `ConfigurationAuditLog`.
+
+**Recommendation**
+
+Plumb the browser offset into each of these pages (mirroring
+`ConfigurationAuditLog`/`AuditLog.razor`, but preferably through a dedicated JS module
+rather than `eval` interop; see CentralUI-029) and route every
+local-input value through `BrowserTime.LocalInputToUtc(value, offsetMinutes)` before
+constructing the wire filter. Add regression tests pinning the non-UTC behaviour for
+at least one representative page so the helper's continued use is enforced.
+
+### CentralUI-028 — `NotificationReport` and `SiteCallsReport` bypass `SiteScopeService` — Deployment role site-scoping defeated on the two new central-mirror pages
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Pages/Notifications/NotificationReport.razor:2,434,472,502`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor:2,52-59`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor.cs:97-110,201,250-251,278-279` |
+
+**Description**
+
+Both pages are `[Authorize(Policy = RequireDeployment)]` and, per CLAUDE.md "Security &
+Auth", the Deployment role must be site-scoped. CentralUI-002 fixed this for every
+Deployment/Monitoring page that existed at the time by introducing `SiteScopeService`
+and threading `FilterSitesAsync` / `IsSiteAllowedAsync` through the site dropdowns and
+mutating calls. The two new central-mirror pages — Notification Report (Notification
+Outbox queryable list) and Site Calls Report (Site Call Audit queryable list) — do NOT
+inject `SiteScopeService`, do NOT filter their Source-Site `<select>` lists (they
+enumerate `await SiteRepository.GetAllSitesAsync()` straight to the dropdown), do NOT
+narrow the query results by permitted site, and do NOT re-check the user's grant
+before relaying Retry/Discard to the owning site. `NotificationReport.RetryNotificationAsync`,
+`NotificationReport.DiscardNotificationAsync`, `SiteCallsReport.RetrySiteCallAsync`,
+and `SiteCallsReport.DiscardSiteCallAsync` all dispatch with the row's `SourceSiteId` /
+`SourceSite` unchecked. A scoped Deployment user can therefore (a) browse every row in
+the central `Notifications` / `SiteCalls` table including those for sites outside their
+grant, and (b) submit Retry/Discard URLs hand-crafted from the row metadata; (c) the
+site relay then completes successfully because the CommunicationService only sees the
+row's source-site identifier, not the user's grant. This is a direct regression of the
+CentralUI-002 contract on the two pages that landed after CentralUI-002 was closed.
+
+**Recommendation**
+
+Inject `SiteScopeService` into both pages; filter the source-site dropdown through
+`FilterSitesAsync`; default the filter to the permitted-site set so a scoped user sees
+only their own rows (or push the predicate into the central query — preferred, so the
+filter cannot be bypassed by URL manipulation); and re-check `IsSiteAllowedAsync` in
+`RetryNotificationAsync`/`DiscardNotificationAsync`/`RetrySiteCallAsync`/`DiscardSiteCallAsync`
+before the CommunicationService call, surfacing a "not permitted for this site" toast
+on failure (mirroring `ParkedMessages.razor`'s `SelectedSiteIsPermitted` guard).
+Add `Site_ScopedDeploymentUser_OnlySeesPermittedRows` and
+`Site_ScopedDeploymentUser_CannotRetryRowOnNonPermittedSite` regression tests modelled
+on `TopologyPageTests.SiteScoping_*`.
+
+### CentralUI-029 — `ConfigurationAuditLog` uses `JS.InvokeAsync<int>("eval", ...)` instead of a dedicated JS module
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Pages/Audit/ConfigurationAuditLog.razor:248-263` |
+
+**Description**
+
+`OnAfterRenderAsync` fetches the browser's UTC offset with
+`JS.InvokeAsync<int>("eval", "new Date().getTimezoneOffset()")`. Calling `eval` over
+JS interop is a code smell: it widens the JS-interop attack surface (any future
+attacker who can influence the second argument runs arbitrary JS), it is brittle
+under stricter Content-Security-Policy headers (CSP `script-src` directives commonly
+forbid `unsafe-eval`), and it bypasses the existing module-import pattern the rest
+of the module follows (`session-expiry.js`, `audit-grid.js`, `nav-state.js`,
+`transport.js` are all loaded as `IJSObjectReference` modules). The legacy
+`AuditLog.razor` (CentralUI-008 fix) and the planned helper exist precisely to avoid
+this. Today the eval text is a static string so there is no live bug; the issue is
+that the pattern invites a future caller to compose the argument from page state.
+
+**Recommendation**
+
+Move the offset lookup into a small wwwroot JS module (e.g.
+`wwwroot/js/browser-time.js` exporting `getTimezoneOffsetMinutes()`) and `import` it
+via `IJSObjectReference` like the other helpers. Replace the `eval` call with
+`module.InvokeAsync<int>("getTimezoneOffsetMinutes")`. The fix is local and removes
+a residual eval surface; the same module can host the rest of the `BrowserTime`
+plumbing CentralUI-027 will need.
+
+### CentralUI-030 — `SandboxConsoleCapture`'s per-call `StringWriter` is not thread-safe under intra-script concurrency
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/SandboxConsoleCapture.cs:31-118`; `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:401-404` |
+
+**Description**
+
+CentralUI-003 correctly routed console capture through an `AsyncLocal<StringWriter?>`
+so concurrent Test Runs cannot cross-contaminate. `BeginCapture` flows the capture
+buffer through the call-tree, and `Target` reads it on every `Write`. But a single
+script execution can still write to its captured `StringWriter` from multiple threads
+within one call-tree: the script trust model allows `System.Threading.Tasks`, so a
+user script can `await Task.WhenAll(t1, t2, t3)` where each task is `Task.Run(() => Console.WriteLine(...))`,
+and `_current.Value` flows into each `Task.Run`. The capture buffer is a plain
+`StringWriter` (`captured = new StringWriter()` in `RunInSandboxAsync`), which is
+**not** thread-safe: concurrent `WriteLine` calls can throw or interleave at the
+character level. The Akka/gRPC-thread race fixed by CentralUI-003 is gone, but the
+intra-script-concurrency race is a residual hazard for any script that exercises
+parallel tasks (a realistic shape for a Test Run that calls multiple `External.Call`s
+concurrently). Severity is Low because the symptom is a corrupted ConsoleOutput
+string, not a security/data-loss issue, and the script must opt into Task-based
+concurrency to trigger it.
+
+**Recommendation**
+
+Wrap the capture buffer with `TextWriter.Synchronized(new StringWriter())` (the
+BCL's purpose-built thread-safe wrapper), or hold a lock inside `SandboxConsoleCapture.Write*`
+on the current scope's `StringWriter`. Add a focused test that runs `await Task.WhenAll(...)`
+with `Console.WriteLine` in each task and asserts the resulting `ConsoleOutput` has
+the expected line count regardless of thread interleaving.
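+
+A small self-contained sketch of the first option. `SynchronizedCaptureDemo` is a
+hypothetical name and the sandbox wiring is omitted; it only shows that a
+`TextWriter.Synchronized` wrapper keeps each concurrent `WriteLine` call intact.
+
+```csharp
+using System;
+using System.IO;
+using System.Linq;
+using System.Threading.Tasks;
+
+// Hypothetical demo; not the existing SandboxConsoleCapture code.
+internal static class SynchronizedCaptureDemo
+{
+    // Writes one line from each of `writers` concurrent tasks and returns the captured text.
+    public static async Task<string> CaptureAsync(int writers)
+    {
+        var buffer = new StringWriter();
+        // The wrapper serializes every call made through it onto the underlying writer.
+        TextWriter captured = TextWriter.Synchronized(buffer);
+
+        await Task.WhenAll(Enumerable.Range(0, writers)
+            .Select(i => Task.Run(() => captured.WriteLine($"line {i}"))));
+
+        // All tasks have completed, so reading the inner buffer is safe here.
+        return buffer.ToString();
+    }
+}
+```
+
+Line order is still nondeterministic; only the integrity of each line is guaranteed,
+which is why the suggested test asserts on line count rather than on exact output.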
+
+### CentralUI-031 — `TransportImport` buffers the full bundle bytes in component state
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Pages/Design/TransportImport.razor.cs:72,104-142,160-161` |
+
+**Description**
+
+`OnFileSelectedAsync` reads the uploaded `.scadabundle` into a `MemoryStream`,
+calls `ms.ToArray()`, and stores the byte array on the component as
+`private byte[]? _bundleBytes`. The bytes live on the Blazor circuit for the
+lifetime of the wizard — through the passphrase step, the diff step (which can
+take an arbitrary amount of operator time on a large bundle), the confirm step,
+and the apply step — and are only cleared in `ResetSessionState` (Done /
+re-upload). For an operator who walks away from the diff step mid-review, up to
+`MaxBundleSizeMb` of bundle bytes (the default is not enforced here; only the
+file-size check on read applies) stays pinned on the central node's heap per
+open circuit. The page has no `IDisposable` to clear the bytes on tear-down
+either. Severity is Low because the cap is checked at upload time and Import
+is Admin-only (limited concurrent users), but the lifetime is longer than the
+strictly-needed retention.
+
+**Recommendation**
+
+Stream the bundle to a temp file (or to the `IBundleImporter`'s session store)
+rather than caching it on the component. Failing that, implement `IDisposable`
+on `TransportImport` and clear `_bundleBytes` (`Array.Clear` for sensitivity)
+on dispose; also clear the cached passphrase string. Tighten `MaxBundleSizeMb`
+docs to call out the in-memory cost per concurrent import session.
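+
+The fallback option could look roughly like the sketch below. `TransportImportState`
+is a hypothetical holder used to keep the example self-contained; in the real page the
+same `Dispose` body would live on the `TransportImport` component, and the field names
+other than `_bundleBytes` are assumptions:
+
+```csharp
+using System;
+
+// Hypothetical stand-in for the state the wizard keeps on the component.
+public sealed class TransportImportState : IDisposable
+{
+    private byte[]? _bundleBytes;
+    private string? _passphrase; // assumed name for the cached passphrase
+
+    public bool HasBundle => _bundleBytes is not null;
+
+    public void Hold(byte[] bundleBytes, string? passphrase)
+    {
+        _bundleBytes = bundleBytes;
+        _passphrase = passphrase;
+    }
+
+    public void Dispose()
+    {
+        if (_bundleBytes is not null)
+        {
+            // Zero the buffer before dropping it so the bundle content does not
+            // linger in memory until the next garbage collection.
+            Array.Clear(_bundleBytes, 0, _bundleBytes.Length);
+            _bundleBytes = null;
+        }
+
+        // Strings are immutable, so the reference can only be dropped, not zeroed.
+        _passphrase = null;
+    }
+}
+```
+
+Clearing on dispose shortens the retention window only when the circuit is torn down.
+It does not help the operator who leaves the wizard open, which is why streaming to a
+temp file remains the stronger fix.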
+
+### CentralUI-032 — `AuditResultsGrid` paging is forward-only, no Previous button
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Audit/AuditResultsGrid.razor:76-82`; `src/ScadaLink.CentralUI/Components/Audit/AuditResultsGrid.razor.cs:65,196-197,219-220` |
+
+**Description**
+
+The Audit Log results grid (Bundle B / M7-T3) renders a single "Next page" button
+and a `Page N · M rows` label, with no Previous control. The design doc says
+"Keyset pagination ordered by `(OccurredAtUtc desc, EventId desc)`. Default page
+size 100." — keyset paging is naturally forward-only, but a usable audit-triage
+workflow needs to step back to the previous page (the `SiteCallsReport` keyset
+implementation correctly maintains a `Stack<(...)> _cursorStack` for exactly this).
+An operator who clicks Next once and misses a row on the first page cannot return
+without re-applying the filter to start a fresh first page. The current shape
+also makes the "Page N" label slightly misleading — there is no in-grid affordance
+to use it as a navigation target.
+
+**Recommendation**
+
+Mirror the `SiteCallsReport.razor.cs` keyset-paging shape: maintain a
+`Stack<(DateTime?, Guid?)> _cursorStack` of previous-page cursors, add a Previous
+button gated on `_cursorStack.Count > 0`, push the current cursor on Next and pop
+on Previous. Either implement this or update the design doc to acknowledge
+forward-only paging on the Audit Log grid.
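+
+A minimal sketch of the cursor-stack shape described above. `KeysetPager` and its
+members are hypothetical names; the actual `SiteCallsReport` implementation is not
+reproduced here, and only the `Stack<(DateTime?, Guid?)>` idea is taken from it:
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Hypothetical helper that tracks keyset cursors so a grid can page backwards.
+public sealed class KeysetPager
+{
+    // Cursors of the pages visited before the current one.
+    private readonly Stack<(DateTime? OccurredAtUtc, Guid? EventId)> _cursorStack = new();
+
+    // Cursor used to fetch the current page; (null, null) means the first page.
+    private (DateTime? OccurredAtUtc, Guid? EventId) _current;
+
+    public (DateTime? OccurredAtUtc, Guid? EventId) Current => _current;
+
+    public bool CanGoPrevious => _cursorStack.Count > 0;
+
+    public int PageNumber => _cursorStack.Count + 1;
+
+    // Next: remember how the current page was reached, then advance to the
+    // cursor taken from the last row of the current page.
+    public void Next((DateTime? OccurredAtUtc, Guid? EventId) lastRowOfCurrentPage)
+    {
+        _cursorStack.Push(_current);
+        _current = lastRowOfCurrentPage;
+    }
+
+    // Previous: restore the cursor that produced the prior page, if any.
+    public bool Previous()
+    {
+        if (_cursorStack.Count == 0)
+        {
+            return false;
+        }
+
+        _current = _cursorStack.Pop();
+        return true;
+    }
+}
+```
+
+The stack would need to be cleared whenever the filter changes, since cursors from
+one result set are meaningless in another.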
+
+### CentralUI-033 — Drill-in / query-string code paths for the new Transport + SiteCalls pages are untested
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `src/ScadaLink.CentralUI/Components/Pages/Design/TransportImport.razor.cs:97-238,267-319`; `src/ScadaLink.CentralUI/Components/Pages/SiteCalls/SiteCallsReport.razor.cs:107-148`; `tests/ScadaLink.CentralUI.Tests/Pages/Design/TransportImportPageTests.cs`; `tests/ScadaLink.CentralUI.Tests/Pages/SiteCallsReportPageTests.cs` |
+
+**Description**
+
+The CentralUI-025 lesson — "a critical drill-in/redirect path was untested, so the
+CentralUI-020 defect was not caught" — applies again to the two newest pages.
+`SiteCallsReport.ApplyQueryStringFilters` parses `?status=` and `?stuck=true` to
+seed the filters from a Health-dashboard KPI tile drill-in; there is no test that
+pins this seeding (an unrecognised status, a missing param, the case-insensitive
+match). `TransportImport` has a 5-step state machine and a 3-strike passphrase
+lockout, both with intricate transition logic
+(`GoFromUploadAsync` re-trying `LoadAsync`, the `_failedUnlockAttempts` reset on
+success, the audit-row write on failure) — none of the step-machine transition
+paths or the lockout reset / lockout-trip behaviours are pinned by tests. The
+existing `TransportImportPageTests` exercise rendering shapes, not the lifecycle.
+
+**Recommendation**
+
+Add bUnit tests for `SiteCallsReport.ApplyQueryStringFilters` covering valid /
+invalid / case-mismatched `?status=` values and the `?stuck=true` toggle, and
+add `TransportImport` lifecycle tests covering: an encrypted-bundle upload
+advances to Step 2 without opening a session; a wrong passphrase increments the
+counter and writes the `BundleImportUnlockFailed` audit row; the lockout resets
+the wizard to Step 1 once `MaxUnlockAttemptsPerSession` is reached; a successful
+unlock resets the counter and advances to Step 3.
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.ClusterInfrastructure` |
 | Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 4 |

 ## Summary

@@ -45,6 +45,43 @@ part of the configuration contract but is never consumed — `ScadaLink.Host`'s
 does not enforce the design doc's requirement that `down-if-alone` be `on` for the
 keep-oldest resolver, so `DownIfAlone = false` is silently accepted (CI-010, Low).

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+The only change to this module between `39d737e` and `1eb6e97` is the
+documentation-only commit `1eb6e97` itself, which added a handful of `<param>`
+XML doc tags to `ClusterOptionsValidator.Validate` and to
+`AddClusterInfrastructureActors` — no source-of-truth changes. Walked all three
+source files and all three test files against the full 10-category checklist
+again. Found **four new issues**, all Low severity, that the prior re-review
+did not surface or that have since aged into the file:
+
+- **CI-011 (Low, Code organization)** — `ClusterOptions.SectionName` is
+  documented as "the single source of truth so binding sites do not hard-code
+  the magic string" (the very justification CI-005's resolution offered), but
+  `ScadaLink.Host.SiteServiceRegistration.BindSharedOptions:100` and three
+  references in `ScadaLink.Host.StartupValidator` all hard-code
+  `"ScadaLink:Cluster"` literals. The constant is decorative — a "single source
+  of truth" that nothing reads. Same pattern as CI-009 (inert configuration knob).
+- **CI-012 (Low, Design-document adherence)** — the validator accepts
+  `SeedNodes.Count == 1` even though the design doc states "both nodes are seed
+  nodes" (a properly-configured deployment lists 2). `Host.StartupValidator:45`
+  already enforces `>= 2`, so this module's own contract validator is the
+  weaker of the two. Inconsistent enforcement across the two projects that
+  share ownership of the cluster contract.
+- **CI-013 (Low, Documentation & comments)** — `ClusterOptionsTests
+  .Properties_CanBeSetToCustomValues` deliberately sets
+  `SplitBrainResolverStrategy = "keep-majority"` and `MinNrOfMembers = 2` — the
+  exact values the design doc warns are catastrophic. The CI-006 resolution
+  acknowledged this is intentional (testing the POCO accepts any value; the
+  validator does the rejecting) but the test has no inline comment saying so,
+  and a future reader could easily misinterpret it as endorsing those values.
+- **CI-014 (Low, Code organization)** — `AddClusterInfrastructureActors` is
+  dead surface: no caller exists anywhere in the solution (verified via
+  `grep -rn`), its XML doc instructs callers "do not call", and its body
+  unconditionally throws. CI-002's resolution chose "fail loudly" over "delete"
+  but the method now offers nothing — keeping it is API-surface noise that an
+  IDE will still suggest via auto-complete.
+
 ## Checklist coverage

 Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
@@ -63,6 +100,21 @@ Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
 | 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). **Re-review:** CI-006 resolved — 16 tests across three classes covering options, validator, and DI registration. No `DownIfAlone`-wiring test exists, but that wiring lives in the Host (CI-009). No new issue here. |
 | 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). **Re-review:** CI-007/CI-008 resolved — full XML docs on all members; skeleton comments gone. Note: the `DownIfAlone` XML doc calls `true` "the design-doc requirement" yet the value is inert (CI-009) and unenforced (CI-010). |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | Validator logic and DI registration are correct. No new defects. |
+| 2 | Akka.NET conventions | ✓ | No actors in this module (legitimate, per CI-001 resolution). Nothing actor-shaped to evaluate. |
+| 3 | Concurrency & thread safety | ✓ | Validator and DI extensions remain stateless. No issues. |
+| 4 | Error handling & resilience | ✓ | Validator now rejects every catastrophic value the design doc enumerates. New — it accepts `SeedNodes.Count == 1` even though the design doc requires both nodes as seeds, and `Host.StartupValidator` enforces `>= 2`, so the module's own validator is the weaker check (CI-012). |
+| 5 | Security | ✓ | No authn/authz surface, no secret handling, no remoting transport configured here. No issues. |
+| 6 | Performance & resource management | ✓ | No resources held; validator allocates a small failure list per call only. No issues. |
+| 7 | Design-document adherence | ✓ | `ClusterOptions` contract complete and validated. New — validator's seed-node count check is weaker than the design (CI-012). |
+| 8 | Code organization & conventions | ✓ | Options/validator placement and Options pattern correct. New — `SectionName` constant documented as "single source of truth" but never read by any binding site (CI-011); `AddClusterInfrastructureActors` is dead surface that no caller invokes (CI-014). |
+| 9 | Testing coverage | ✓ | 16 tests across three classes. New — `ClusterOptionsTests.Properties_CanBeSetToCustomValues` sets the exact catastrophic values the design doc forbids without an inline comment explaining why (CI-013). |
+| 10 | Documentation & comments | ✓ | XML docs accurate across all source files (commit `1eb6e97` filled in the remaining `<param>` tags). New — CI-013 (test lacks intent comment); CI-011 (XML doc for `SectionName` claims a property the code does not deliver). |
+
 ## Findings

 ### ClusterInfrastructure-001 — Module implements none of its documented responsibilities
@@ -628,3 +680,181 @@ message explaining the isolated-single-node-cluster hazard, consistent with how
 validator already rejects quorum split-brain strategies. Developed test-first:
 `ClusterOptionsValidatorTests.DownIfAloneFalse_FailsValidation` was written first,
 confirmed failing, then passing after the fix. Module test suite green (18 passed).
+
+### ClusterInfrastructure-011 — `SectionName` constant is decorative — no binding site references it
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:24-27`, `src/ScadaLink.Host/SiteServiceRegistration.cs:100`, `src/ScadaLink.Host/StartupValidator.cs:43`, `src/ScadaLink.Host/StartupValidator.cs:45`, `src/ScadaLink.Host/StartupValidator.cs:75` |
+
+**Description**
+
+`ClusterOptions.SectionName` was added by CI-005 as `public const string SectionName =
+"ScadaLink:Cluster";`, with an XML doc declaring it "the single source of truth so
+binding sites do not hard-code the magic string". CI-005's resolution likewise framed
+the constant as the canonical reference value. In practice, **no caller in the
+solution reads it**. `grep -rn "ClusterOptions.SectionName" src/` returns zero hits.
+Every site that needs the section name hard-codes the literal:
+
+- `ScadaLink.Host.SiteServiceRegistration.BindSharedOptions:100` —
+  `services.Configure<ClusterOptions>(config.GetSection("ScadaLink:Cluster"));`
+- `ScadaLink.Host.StartupValidator:43,45,75` — three `"ScadaLink:Cluster"` /
+  `"ScadaLink:Cluster:SeedNodes"` literals.
+
+The `SectionName_IsTheExpectedAppSettingsSection` test pins the constant's value but
+does not protect against the underlying drift hazard: if someone changes
+`SectionName` to `"ScadaLink:Akka:Cluster"`, the test still passes (because it tests
+the constant against the same literal), the validator still registers, and binding
+silently goes to whichever string the Host hard-codes. The constant currently
+provides none of the safety its XML doc claims. This is the same pattern of "inert
+configuration knob" CI-009 flagged for `DownIfAlone`, just with the harm being
+configuration drift rather than runtime behaviour.
+
+**Recommendation**
+
+Either (a) replace the hard-coded `"ScadaLink:Cluster"` literals in
+`SiteServiceRegistration.cs:100` and `StartupValidator.cs:43,45,75` with
+`ClusterOptions.SectionName` (a small Host-module change, to be tracked there), or
+(b) if the constant is intentionally decorative, soften the XML doc so it does not
+claim to be the source of truth. Do not leave a public constant whose stated
+guarantee the code does not deliver.
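+
+Option (a) might look like the sketch below. `ClusterOptionsSketch` and
+`SeedNodeReader` are hypothetical stand-ins so the example compiles on its own; the
+real Host code is not reproduced, and the only point illustrated is that the section
+path is derived from the constant rather than repeated as a literal:
+
+```csharp
+using System.Collections.Generic;
+using System.Linq;
+using Microsoft.Extensions.Configuration;
+
+// Stand-in for ClusterOptions, reduced to the constant under discussion.
+public sealed class ClusterOptionsSketch
+{
+    public const string SectionName = "ScadaLink:Cluster";
+}
+
+// Hypothetical Host-side helper that reads seed nodes via the constant.
+public static class SeedNodeReader
+{
+    public static List<string> ReadSeedNodes(IConfiguration configuration) =>
+        configuration
+            .GetSection($"{ClusterOptionsSketch.SectionName}:SeedNodes")
+            .GetChildren()
+            .Select(child => child.Value)
+            .Where(value => value is not null)
+            .Select(value => value!)
+            .ToList();
+}
+```
+
+With this shape, renaming the section touches one constant, and a test that binds
+through the Host path would fail if the two ever diverged.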
+
+**Resolution**
+
+_Open — needs a one-line Host-side change to reference the constant, plus a test
+that proves the section name flows from this module to the Host._
+
+### ClusterInfrastructure-012 — Validator accepts `SeedNodes.Count == 1` despite design requiring both nodes as seeds
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:30-33` |
+
+**Description**
+
+`Component-ClusterInfrastructure.md` (Node Configuration) states:
+
+> Cluster seed nodes: **Both nodes** are seed nodes — each node lists both itself and
+> its partner. Either node can start first and form the cluster; the other joins when
+> it starts. No startup ordering dependency.
+
+A correctly-configured ScadaLink deployment therefore lists **two** seed nodes.
+`ClusterOptionsValidator.Validate` rejects `SeedNodes` only when it is null or
+empty (`Count == 0`). A configuration with a single seed node passes validation
+silently — but that defeats the "no startup ordering dependency" guarantee the
+design doc explicitly calls out.
+
+`ScadaLink.Host.StartupValidator:43-46` does enforce the rule:
+
+```csharp
+var seedNodes = configuration.GetSection("ScadaLink:Cluster:SeedNodes").Get<List<string>>();
+if (seedNodes is null || seedNodes.Count < 2)
+    errors.Add("ScadaLink:Cluster:SeedNodes must have at least 2 entries");
+```
+
+So the rule is enforced, but by the **other** project, after the
+`ClusterOptionsValidator` (the contract owner) has already accepted the value. This is
+inconsistent (two validators with different rules for the same field), and the
+weaker of the two checks belongs to the contract owner. The pre-existing test
+`ServiceCollectionExtensionsTests.AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution`
+even constructs a `SeedNodes` list with one entry and expects validation to succeed
+on that count — locking in the gap.
+
+**Recommendation**
+
+Tighten the validator: require `SeedNodes.Count >= 2` with a message that references
+the "both nodes are seed nodes" design rule. Update
+`AddClusterInfrastructure_ValidatorRejectsBadOptionsAtResolution` to use a two-entry
+list, and add a test case for `SeedNodes.Count == 1` failing validation. Once this
+module's validator enforces the rule, `Host.StartupValidator`'s duplicate check
+becomes redundant and can be removed in the Host's review.
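+
+The tightened rule could be sketched as follows. `SeedNodeRule` is a hypothetical
+helper written as a plain function so the example stands alone; the real check would
+sit inside `ClusterOptionsValidator.Validate`, and the message text is an assumption:
+
+```csharp
+using System.Collections.Generic;
+
+// Hypothetical extraction of the seed-node rule from the options validator.
+public static class SeedNodeRule
+{
+    // The design doc requires both cluster nodes to be listed as seeds.
+    public const int RequiredSeedNodes = 2;
+
+    public static IReadOnlyList<string> Validate(IReadOnlyList<string>? seedNodes)
+    {
+        var failures = new List<string>();
+
+        // Null, empty, and single-entry lists all break the
+        // "no startup ordering dependency" guarantee.
+        if (seedNodes is null || seedNodes.Count < RequiredSeedNodes)
+        {
+            failures.Add(
+                $"SeedNodes must list at least {RequiredSeedNodes} entries: " +
+                "both nodes are seed nodes, each listing itself and its partner.");
+        }
+
+        return failures;
+    }
+}
+```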
+
+**Resolution**
+
+_Open._
+
+### ClusterInfrastructure-013 — Test uses catastrophic config values without an inline-intent comment
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsTests.cs:47-67` |
+
+**Description**
+
+`ClusterOptionsTests.Properties_CanBeSetToCustomValues` deliberately sets two values
+the design doc explicitly warns are catastrophic:
+
+```csharp
+SplitBrainResolverStrategy = "keep-majority",   // design doc: total cluster shutdown on partition
+...
+MinNrOfMembers = 2                              // design doc: blocks singleton, halts data collection
+```
+
+The CI-006 resolution acknowledged this is intentional — the test exercises the POCO
+property setter (which by design accepts any string/int because the validator does
+the rejecting), and `ClusterOptionsValidatorTests.UnsupportedSplitBrainStrategy_FailsValidation`
+and `MinNrOfMembers_NotOne_FailsValidation` prove the validator rejects them. But this
+reasoning is recorded **only** in the CI-006 resolution text in this findings file,
+not in the test itself. A reader landing on the test cold has no signal that these
+values are forbidden in production; they could reasonably infer the test endorses
+them.
+
+**Recommendation**
+
+Add a brief XML-doc / inline comment to `Properties_CanBeSetToCustomValues` stating
+that it exercises only the POCO's setter — these values intentionally do **not**
+represent a valid runtime configuration, and `ClusterOptionsValidator` rejects them
+(with a cross-reference to the relevant validator tests). Two lines is enough; the
+goal is to make the test's intent self-documenting.
+
+**Resolution**
+
+_Open._
+
+### ClusterInfrastructure-014 — `AddClusterInfrastructureActors` is dead surface — no caller, no behaviour
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:42-48` |
+
+**Description**
+
+`AddClusterInfrastructureActors` has now reached a curious state: it is a public
+extension method with an XML doc that ends "Do not call AddClusterInfrastructureActors()"
+and a body that unconditionally throws `NotImplementedException`. CI-002's resolution
+chose "throw loudly" over "delete" specifically because CI-001 was still resolving the
+ownership-split question. That question is settled — the design doc, the README
+component table, and `Component-ClusterInfrastructure.md`'s "Implementation Note — Code
+Placement" all permanently locate the Akka actor bootstrap in `ScadaLink.Host`.
+
+A `grep -rn "AddClusterInfrastructureActors" src/ tests/` confirms there is no caller
+anywhere in the solution. The method's only consumer is its own test
+(`AddClusterInfrastructureActors_ThrowsRatherThanSilentlySucceeding`), which asserts
+that the method throws when called. Keeping it costs API surface (IDE auto-complete
+suggests it, the docs render it, and a future contributor might re-introduce a call
+expecting it to register something), and gives nothing in return.
+
+**Recommendation**
+
+Delete `AddClusterInfrastructureActors`, delete its test, and add a one-line note to
+`docs/requirements/Component-ClusterInfrastructure.md`'s code-placement section
+explicitly stating that this project exposes no actor-registration extension
+(actor wiring lives in `ScadaLink.Host`). If the user prefers to keep the
+"fail-fast" trap, mark the method `[Obsolete("<message>", error: true)]` (the
+attribute's first argument is the message string) so the compiler, not the
+runtime, rejects the call.
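+
+If the trap is kept, the compile-time variant might look like this sketch. The class
+name, the parameterless signature, and the message text are assumptions; the real
+extension method's signature is not reproduced here:
+
+```csharp
+using System;
+
+// Hypothetical stand-in for the extension class that hosts the trap method.
+public static class ClusterInfrastructureTrapSketch
+{
+    // error: true turns any call site into a compile error, so the method can
+    // never be reached from code compiled against this assembly.
+    [Obsolete(
+        "Actor wiring lives in ScadaLink.Host; this method registers nothing.",
+        error: true)]
+    public static void AddClusterInfrastructureActors() =>
+        throw new NotImplementedException();
+}
+```
+
+Note that the existing test which calls the method and asserts that it throws would
+itself stop compiling under this attribute, so it would have to be removed or
+rewritten (for example, to assert the attribute via reflection).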
+
+**Resolution**
+
+_Open._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.Commons` |
 | Design doc | `docs/requirements/Component-Commons.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 9 |

 ## Summary

@@ -46,6 +46,42 @@ indexer that rejects `long` indices (Commons-013) and an `OpcUaEndpointConfigSer
 legacy-fallback path that can mislabel a corrupt new-shape row as `Legacy` (Commons-014).
 No Critical, High, or Medium issues were found.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Commons has grown substantially since `39d737e` — 132 changed files (≈ +4 600 lines), driven
+by the Audit Log (#23), Site Call Audit (#22), and Transport (#24) work. The new surface
+area covers six new entity domain folders (Audit, Transport types under `Types/Transport`),
+ten new service interfaces (`IPartitionMaintenance`, `INodeIdentityProvider`,
+`ISiteAuditQueue`, `ICachedCallLifecycleObserver`, `ICachedCallTelemetryForwarder`,
+`IOperationTrackingStore`, `IBundleExporter` / `IBundleImporter` / `IBundleSessionStore` /
+`IAuditCorrelationContext`), a new `IAuditLogRepository`, and three new message folders
+(`Messages/Audit/`, `Messages/Integration/` extensions, `Messages/Management/TransportCommands`).
+The `SourceNode` thread-through and `ExecutionId` / `ParentExecutionId` additive-evolution
+fields are uniformly applied across `AuditEvent`, `SiteCall`, `Notification`,
+`NotificationSubmit`, `RouteToCallRequest`, `ScriptCallRequest`, and `SiteHealthReport` —
+all as trailing optional parameters, consistent with REQ-COM-5a.
+
+All fourteen prior findings (Commons-001 through Commons-014) remain `Resolved`. Nine new
+findings were recorded this pass: one Medium on the lack of UTC-kind enforcement for the
+new `DateTime`-typed `*Utc` columns (Commons-019), one Medium on unconstrained
+`EncryptionMetadata` (Commons-015), one Medium on the now-substantially-stale design doc
+(Commons-017), and six Low findings covering minor convention drift, missing unit tests
+for the Transport types, an unresolvable `<see cref>` in `IAuditCorrelationContext`, a
+benign lazy-parse race in `ExternalCallResult.Response`, undocumented JSON-blob shapes,
+two interfaces parked in the wrong folder, and a magic-number threshold in `BundleSession`.
+No Critical or High issues were found.
+
+The architectural-constraint tests still enforce the no-Akka/no-EF/no-ASP.NET rule, the
+POCO-entity and message-as-record conventions, and the `ToLocalTime` ban; they do not yet
+cover the new `*Utc`-suffixed `DateTime` properties on `AuditEvent` / `SiteCall`. Test
+coverage for the new types is uneven — `TrackedOperationId`, `SiteCallOperational`,
+`CachedCallTelemetry`, `SiteCallQueries`, `AuditQueryParamParsers`, `ApiKeyHasher`,
+`Notification`, and `SiteCall` are all directly tested; the Transport types
+(`BundleManifest`, `EncryptionMetadata`, `BundleSession`, `BundleSummary`, `ExportSelection`,
+`ImportPreview`, `ImportResolution`, `ImportResult`, `ManifestContentEntry`) have only
+integration-level coverage in `tests/ScadaLink.Transport.IntegrationTests/`, with no
+shape/serialization tests in `ScadaLink.Commons.Tests`.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -61,6 +97,21 @@ No Critical, High, or Medium issues were found.
 | 9 | Testing coverage | ✓ | `ValueFormatter`, `DynamicJsonElement`, `ScriptArgs`, `ManagementCommandRegistry`, `Result<T>`, `ConfigurationDiff`, `AlarmContext`, and the OPC UA serializer round-trip have no tests (Commons-010). |
 | 10 | Documentation & comments | ✓ | `OpcUaEndpointConfigSerializer.Deserialize` XML doc does not mention the silent data-loss path (Commons-005). `Component-Commons.md` is stale relative to the actual file set (Commons-009). `ValueFormatter` uses current-culture formatting without documenting it (Commons-012). |

+## Checklist coverage — Re-review 2026-05-28 (commit `1eb6e97`)
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | `EncryptionMetadata` accepts any algorithm string + any iteration count with no validation (Commons-015). New `*Utc`-suffixed `DateTime` columns on `AuditEvent`/`SiteCall` have no `DateTimeKind.Utc` enforcement and are inconsistent with `Notification`'s `DateTimeOffset` (Commons-019). |
+| 2 | Akka.NET conventions | ✓ | Commons has no actors. All new message contracts (`Messages/Audit`, `Messages/Integration` extensions, `RouteToCallRequest`, `ScriptCallRequest`) are records with trailing optional members per REQ-COM-5a. Correlation IDs present on request/response messages. |
+| 3 | Concurrency & thread safety | ✓ | `IAuditCorrelationContext` documents its scoped/sequential thread-safety contract explicitly (good). `ExternalCallResult.Response` has a benign lazy-parse race — two concurrent reads can both parse and produce distinct wrappers (Commons-021). |
+| 4 | Error handling & resilience | ✓ | The new ingest/upsert command + reply pairs (`UpsertSiteCallReply`, `IngestAuditEventsReply`, `IngestCachedTelemetryReply`) carry idempotency-friendly accepted-id lists and an `Accepted` flag that explicitly does NOT propagate audit-write failure to the user-facing action (alog.md §13). |
+| 5 | Security | ✓ | `ApiKeyHasher` correctly fails-fast on missing / weak pepper (≥16 chars), uses HMAC-SHA256, never accepts a null plaintext, and provides a clearly-labelled `Default` for tests only. `ApiKey.FromHash` is the production constructor; the plaintext constructor only ever uses the unpeppered `Default` and is documented as such. No script-trust violations in any new file. |
+| 6 | Performance & resource management | ✓ | `IBundleSessionStore.EvictExpired` exists for sessions — good. `BundleSession` carries `DecryptedContent` plus `Manifest` per session; the size is bounded by the configured bundle cap but no explicit per-session size accounting. `ExternalCallResult.Response` lazy parse not thread-safe (Commons-021). |
+| 7 | Design-document adherence | ✓ | `Component-Commons.md` is now significantly stale relative to the actual file set: stale enum values for `AuditKind`/`AuditStatus`, missing `AuditEvent`/`SiteCall` entities, missing `IAuditLogRepository`, missing six service interfaces and `Interfaces/Transport/`, missing four `Types/*` folders and `Messages/Audit/` (Commons-017). |
+| 8 | Code organization & conventions | ✓ | `IOperationTrackingStore` and `IPartitionMaintenance` live at the root of `Interfaces/` rather than under `Interfaces/Services/` (Commons-018). `BundleSession.Locked` uses a magic `3` rather than a named constant (Commons-016). Message contracts and entities otherwise follow the additive-evolution / POCO / `record` conventions. |
+| 9 | Testing coverage | ✓ | Transport types (`BundleManifest`, `EncryptionMetadata`, `BundleSession`, `BundleSummary`, `ExportSelection`, `ImportPreview`, `ImportResolution`, `ImportResult`, `ManifestContentEntry`) have no unit tests in `tests/ScadaLink.Commons.Tests/`; only `tests/ScadaLink.Transport.IntegrationTests/` exercises them (Commons-020). `IngestAuditEventsCommand` / `IngestCachedTelemetryCommand` / `UpsertSiteCallCommand` / `PullAuditEventsRequest` / `PullAuditEventsResponse` / `AuditTelemetryEnvelope` shape tests also absent. |
+| 10 | Documentation & comments | ✓ | `IAuditCorrelationContext` references `BundleImporter.ApplyAsync` — an implementation type Commons does not see, so the `<see cref>` is unresolvable (Commons-022b, folded into Commons-022). `ImportPreviewItem.FieldDiffJson` and `Notification.ResolvedTargets` are JSON-string columns with no documented shape contract (Commons-022). |
+
 ## Findings

 ### Commons-001 — `StaleTagMonitor` stale-fire race between timer and `OnValueReceived`
@@ -674,3 +725,415 @@ describe the corrupt-typed-row branch. Regression tests added in
 `OpcUaEndpointConfigSerializerTests` (`Deserialize_TypedShapeWithInvalidEnum_ReportsMalformedNotLegacy`,
 `Deserialize_TypedShapeWithWrongTypeField_ReportsMalformedNotLegacy`,
 `Deserialize_ValidTypedRow_StillReportsTyped`).
+
+### Commons-015 — `EncryptionMetadata` accepts any algorithm string and any iteration count
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Commons/Types/Transport/EncryptionMetadata.cs:3-8` |
+
+**Description**
+
+`EncryptionMetadata` is a positional record that carries the bundle's encryption parameters
+over the wire and into the persistence/audit layer:
+
+```csharp
+public sealed record EncryptionMetadata(
+    string Algorithm,      // "AES-256-GCM"
+    string Kdf,            // "PBKDF2-SHA256"
+    int Iterations,
+    string SaltB64,
+    string IvB64);
+```
+
+The expected values are documented as inline comments only — there is no validation, no
+enum, and no constructor invariant. The consequences:
+
+- A bundle manifest that says `Algorithm = "AES-128-CBC"` (or any garbage string) will
+  deserialize successfully. The mismatch surfaces only when `BundleImporter` tries to
+  decrypt, where it most likely manifests as a misleading exception (or a silent wrong-key
+  result, depending on the implementation).
+- `Iterations` is unconstrained — `0`, negative, or absurdly large values round-trip. A
+  zero/negative iteration count weakens the KDF and a billion-iteration count is a DoS
+  vector against a passphrase-unlock attempt.
+- `SaltB64` / `IvB64` are just `string` — there is no length, format, or non-null check.
+  A null or empty salt/IV silently rides through serialization and surfaces inside the
+  cipher init.
+
+`EncryptionMetadata` is the integrity contract for the bundle's encryption envelope and
+crosses both the file boundary (the on-disk bundle manifest) and the central audit log.
+The defense-in-depth principle says malformed values should be rejected at the type
+boundary, not at the cipher.
+
+**Recommendation**
+
+Validate in a static factory or constructor: reject unsupported `Algorithm`/`Kdf` (an
+enum or a small whitelist of strings), require `Iterations >= 100_000` (or whatever the
+documented PBKDF2 minimum is) and `<= 10_000_000`, require non-blank `SaltB64`/`IvB64`,
+and Base64-decode them at construction so a malformed encoding fails fast. Document the
+accepted values on the record.
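+
+One possible shape for the static factory, offered as a sketch only. The `Create`
+method, the constant names, and the single-value whitelists are assumptions; the
+iteration bounds are the ones suggested above and should be replaced by whatever the
+Transport design actually documents:
+
+```csharp
+using System;
+
+public sealed record EncryptionMetadata(
+    string Algorithm,
+    string Kdf,
+    int Iterations,
+    string SaltB64,
+    string IvB64)
+{
+    public const string SupportedAlgorithm = "AES-256-GCM";
+    public const string SupportedKdf = "PBKDF2-SHA256";
+    public const int MinIterations = 100_000;     // assumed lower bound
+    public const int MaxIterations = 10_000_000;  // assumed upper bound
+
+    // Hypothetical validating factory; rejects malformed values before they
+    // reach the cipher.
+    public static EncryptionMetadata Create(
+        string algorithm, string kdf, int iterations, string saltB64, string ivB64)
+    {
+        if (algorithm != SupportedAlgorithm)
+            throw new ArgumentException($"Unsupported algorithm '{algorithm}'.", nameof(algorithm));
+        if (kdf != SupportedKdf)
+            throw new ArgumentException($"Unsupported KDF '{kdf}'.", nameof(kdf));
+        if (iterations is < MinIterations or > MaxIterations)
+            throw new ArgumentOutOfRangeException(nameof(iterations),
+                $"Iterations must be between {MinIterations} and {MaxIterations}.");
+
+        RequireBase64(saltB64, nameof(saltB64));
+        RequireBase64(ivB64, nameof(ivB64));
+
+        return new EncryptionMetadata(algorithm, kdf, iterations, saltB64, ivB64);
+    }
+
+    private static void RequireBase64(string? value, string paramName)
+    {
+        if (string.IsNullOrWhiteSpace(value))
+            throw new ArgumentException("Value must not be blank.", paramName);
+        try
+        {
+            // Decode eagerly so a malformed encoding fails here, not in cipher init.
+            _ = Convert.FromBase64String(value);
+        }
+        catch (FormatException ex)
+        {
+            throw new ArgumentException("Value is not valid Base64.", paramName, ex);
+        }
+    }
+}
+```
+
+Because the positional constructor stays public, a deserializer can still bypass
+`Create`. Closing that gap would mean validating in the constructor itself or
+validating explicitly after the manifest is deserialized.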
+
+### Commons-016 — `BundleSession.Locked` uses a magic `3` rather than a named constant
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.Commons/Types/Transport/BundleSession.cs:13-16` |
+
+**Description**
+
+`BundleSession` exposes:
+
+```csharp
+public int FailedUnlockAttempts { get; set; }
+public bool Locked => FailedUnlockAttempts >= 3;
+```
+
+The `3` is a magic number with no constant, no XML doc reference, and no symbol to
+search for if a future operator wants to change the threshold (or write a test that
+deliberately exercises the lockout). The XML comment on `Locked` repeats the literal
+("three or more unlock attempts have failed") rather than citing a constant, so a
+change to the threshold would have to be made in three places (the comparison, the XML
+text, and any caller-side `attempts < 3` checks). The lockout count is also a
+security-relevant policy parameter — it deserves a named symbol so a security review
+can find it.
+
+**Recommendation**
+
+Promote the threshold to a `public const int MaxUnlockAttempts = 3;` on `BundleSession`
+(or to the `IBundleSessionStore`/`BundleImporter` if that is the better home), and rewrite
+the `Locked` expression and the XML comment in terms of it. If the threshold is actually
+owned by a Transport-component option, document the link.
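+
+A sketch of the named-constant form, assuming `BundleSession` is a class and showing
+only the two members quoted above (its other members are omitted):
+
+```csharp
+public sealed class BundleSession
+{
+    /// <summary>
+    /// Number of failed unlock attempts at which the session locks.
+    /// </summary>
+    public const int MaxUnlockAttempts = 3;
+
+    public int FailedUnlockAttempts { get; set; }
+
+    /// <summary>
+    /// True once <see cref="MaxUnlockAttempts"/> or more unlock attempts have failed.
+    /// </summary>
+    public bool Locked => FailedUnlockAttempts >= MaxUnlockAttempts;
+}
+```
+
+Callers and tests can then reference `BundleSession.MaxUnlockAttempts` instead of
+repeating the literal.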
+
+### Commons-017 — `Component-Commons.md` is significantly stale (audit enums, new entities, new repositories, new service interfaces, new folders)
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `docs/requirements/Component-Commons.md:41-44`, `:75-79`, `:88-95`, `:107-117`, `:152-232` |
+
+**Description**
+
+The Commons design doc has fallen materially behind the code:
+
+- **REQ-COM-1 audit enums** — the doc's `AuditKind` enum lists
+  `SyncCall, CachedEnqueued, CachedAttempt, CachedTerminal, SyncWrite, SyncRead, Enqueued,
+  Attempt, Terminal, Completed`; the actual enum in `Types/Enums/AuditKind.cs` has
+  *completely different* values: `ApiCall, ApiCallCached, DbWrite, DbWriteCached, NotifySend,
+  NotifyDeliver, InboundRequest, InboundAuthFailure, CachedSubmit, CachedResolve`.
+  Likewise `AuditStatus` — doc says `Success, TransientFailure, PermanentFailure, Enqueued,
+  Retrying, Delivered, Parked, Discarded`; actual values are `Submitted, Forwarded,
+  Attempted, Delivered, Failed, Parked, Discarded, Skipped`. The doc's enum names cannot
+  be matched to the code at all.
+- **REQ-COM-3 entities** — the Audit bullet still lists only `AuditLogEntry`; the
+  actual `Entities/Audit/` folder now contains `AuditEvent` and `SiteCall` as well, and
+  both carry significant additional columns (`SourceNode`, `ExecutionId`,
+  `ParentExecutionId`) that are core to the M3-M7 work and entirely absent from the doc.
+- **REQ-COM-4 repositories** — `IAuditLogRepository` is in the code (with its
+  `InsertIfNotExistsAsync`, `QueryAsync`, `SwitchOutPartitionAsync`,
+  `GetPartitionBoundariesOlderThanAsync`, `GetKpiSnapshotAsync`, `GetExecutionTreeAsync`,
+  `GetDistinctSourceNodesAsync` surface) but missing from the REQ-COM-4 list.
+- **REQ-COM-4a services** — the doc lists seven service interfaces. The code adds
+  `ICachedCallLifecycleObserver`, `ICachedCallTelemetryForwarder`, `INodeIdentityProvider`,
+  `ISiteAuditQueue`, plus the misplaced `IOperationTrackingStore` and `IPartitionMaintenance`
+  (see Commons-018), and the `Interfaces/Transport/` folder with four more interfaces
+  (`IBundleExporter`, `IBundleImporter`, `IBundleSessionStore`, `IAuditCorrelationContext`)
+  — none of which appear in REQ-COM-4a.
+- **REQ-COM-5b folder tree** — missing: `Types/Audit/` (`AuditLogPaging`,
+  `AuditLogQueryFilter`, `AuditQueryParamParsers`, `ExecutionTreeNode`,
+  `SiteCallKpiSnapshot`, `SiteCallPaging`, `SiteCallQueryFilter`,
+  `SiteCallSiteKpiSnapshot`), `Types/Notifications/` (`NotificationKpiSnapshot`,
+  `NotificationOutboxFilter`, `SiteNotificationKpiSnapshot`), `Types/InboundApi/`
+  (`ApiKeyHasher`, `ParameterDefinition`), `Types/Transport/` (nine records),
+  `Messages/Audit/` (seven new message files), `Interfaces/Transport/` (four
+  interfaces), plus the new `AuditLogKpiSnapshot`, `SiteAuditBacklogSnapshot`,
+  `SiteCallOperational`, `TrackingStatusSnapshot` directly under `Types/`.
+
+CLAUDE.md's editing rules state that design docs and code must travel together. The doc is
+now much less useful as a map of the actual file set than it was after the previous
+(Commons-009) refresh.
+
+**Recommendation**
+
+Refresh `Component-Commons.md` against the current file set: rewrite the `AuditKind` /
+`AuditStatus` enum value lists to match the code, add `AuditEvent` and `SiteCall` to
+REQ-COM-3, add `IAuditLogRepository` to REQ-COM-4, expand REQ-COM-4a with the new service
+interfaces (and add a sentence on the Transport interfaces in `Interfaces/Transport/`),
+and rewrite the REQ-COM-5b folder tree to include the new `Types/*`, `Messages/Audit`,
+and `Interfaces/Transport` folders. The same kind of refresh that resolved Commons-009 is
+needed again now.
+
+### Commons-018 — `IOperationTrackingStore` and `IPartitionMaintenance` are at the root of `Interfaces/` instead of `Interfaces/Services/`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.Commons/Interfaces/IOperationTrackingStore.cs`, `src/ScadaLink.Commons/Interfaces/IPartitionMaintenance.cs` |
+
+**Description**
+
+REQ-COM-5b documents the `Interfaces/` folder as having exactly three sub-folders:
+`Protocol/` (REQ-COM-2), `Repositories/` (REQ-COM-4), and `Services/` (REQ-COM-4a). Two
+new interfaces — `IOperationTrackingStore` and `IPartitionMaintenance` — are filed at
+the root of `Interfaces/` (namespace `ScadaLink.Commons.Interfaces`) rather than under
+`Interfaces/Services/` (namespace `ScadaLink.Commons.Interfaces.Services`). They are
+straightforward cross-cutting service interfaces consumed by the Audit Log component (a
+site-local SQLite tracking store; a central partition-maintenance hosted-service helper)
+and conceptually belong alongside `ISiteAuditQueue`, `ICachedCallLifecycleObserver`, etc.
+The inconsistency is small but it breaks the "every interface lives under a sub-folder"
+rule REQ-COM-5b establishes, and it makes the namespace surface inconsistent — every
+other recently-added service interface uses `Interfaces.Services`.
+
+**Recommendation**
+
+Move both files into `Interfaces/Services/` and adjust the namespace to
+`ScadaLink.Commons.Interfaces.Services`. Update consumers in `ScadaLink.AuditLog`,
+`ScadaLink.SiteRuntime`, and `ScadaLink.ConfigurationDatabase`. Add them to the
+REQ-COM-4a list (see Commons-017).
+
+### Commons-019 — New `*Utc`-suffixed `DateTime` columns on `AuditEvent` / `SiteCall` are not enforced as UTC; inconsistent with `Notification`'s `DateTimeOffset`
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Commons/Entities/Audit/AuditEvent.cs:15-18`, `src/ScadaLink.Commons/Entities/Audit/SiteCall.cs:59-68`, `tests/ScadaLink.Commons.Tests/Entities/EntityConventionTests.cs:49-69` |
+
+**Description**
+
+CLAUDE.md mandates UTC throughout the system, "DateTime with DateTimeKind.Utc *or*
+DateTimeOffset". The pre-existing convention in Commons entities is `DateTimeOffset`,
+and the architectural test `AllTimestampProperties_ShouldBeDateTimeOffset` enforces it
+on a name-allowlist (`Timestamp`, `DeployedAt`, `CompletedAt`, `GeneratedAt`,
+`ReportTimestamp`, `SnapshotTimestamp`). The new audit entities deviate:
+
+- `AuditEvent.OccurredAtUtc` and `IngestedAtUtc` — `DateTime` (nullable on the second).
+- `SiteCall.CreatedAtUtc`, `UpdatedAtUtc`, `TerminalAtUtc`, `IngestedAtUtc` — `DateTime`.
+
+The `Notification` entity in the same domain uses `DateTimeOffset` for every timestamp
+(`SiteEnqueuedAt`, `CreatedAt`, `LastAttemptAt`, `NextAttemptAt`, `DeliveredAt`). The
+architectural test does not catch the `*Utc` columns because those property names are not
+on the allowlist. Concretely:
+
+- Nothing prevents a producer from assigning `DateTime.Now` (kind = `Local`) or
+  `new DateTime(2026,1,1)` (kind = `Unspecified`) to an `OccurredAtUtc` column.
+  `System.Text.Json` writes whatever `Kind` it is given (only `Utc` values get the `Z`
+  designator) and does not correct it, and a SQL Server `datetime2` column drops the
+  `Kind` altogether (values read back as `Unspecified`). The `Utc` suffix is
+  convention-only.
+- Comparison across the boundary is now ambiguous — the central `AuditLog.OccurredAtUtc`
+  and the central `Notifications.CreatedAt` are different CLR types, with `DateTimeOffset`
+  carrying an explicit offset and `DateTime` not.
+- The repository query filters (`AuditLogQueryFilter.FromUtc`/`ToUtc`,
+  `SiteCallQueryFilter.FromUtc`/`ToUtc`) also use bare `DateTime`. A caller building one
+  from `DateTime.UtcNow.AddHours(-1)` is fine; a caller using `DateTimeOffset.UtcNow.DateTime`
+  is fine; a caller using `DateTime.Now` is silently wrong.
+
+This is the same defect the architectural test was designed to catch on the
+`DateTimeOffset` side — the test just doesn't cover the new column-naming convention.
+
+**Recommendation**
+
+Pick a single rule:
+
+1. Convert the audit entities to `DateTimeOffset` to match every other Commons entity
+   and the architectural-test allowlist (largest blast radius — touches gRPC proto
+   types, EF mappings, SQL schemas, query filters).
+2. Keep `DateTime` for audit but extend `EntityConventionTests` to recognise the `*Utc`
+   property-name pattern and assert (a) it is `DateTime` (not `DateTimeOffset`) and
+   (b) any constant-default has `DateTimeKind.Utc`. Add a runtime assertion at the
+   write boundary (`SqliteAuditWriter.WriteAsync`, the central upsert) that the
+   incoming `Kind == DateTimeKind.Utc` and reject otherwise.
+
+Option 2 is the smaller change and is consistent with how `AuditLog` rows are stored in
+SQL Server (`datetime2`, no offset). Either way the inconsistency with `Notification`
+should be documented in REQ-COM-1 as a deliberate choice.
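+
+The runtime assertion in option 2 could look roughly like the sketch below. `UtcGuard`
+and `RequireUtc` are hypothetical names, not existing code, and how a rejection is
+surfaced at each write boundary (throw, log and drop, and so on) is left open here:
+
+```csharp
+using System;
+
+// Hypothetical helper; name and placement are assumptions for illustration.
+public static class UtcGuard
+{
+    // Returns the value unchanged when it carries DateTimeKind.Utc,
+    // otherwise throws so a Local or Unspecified value is not stored silently.
+    public static DateTime RequireUtc(DateTime value, string paramName)
+    {
+        if (value.Kind != DateTimeKind.Utc)
+        {
+            throw new ArgumentException(
+                $"Expected DateTimeKind.Utc but got {value.Kind}.", paramName);
+        }
+
+        return value;
+    }
+}
+```
+
+A write path would then call, for example,
+`UtcGuard.RequireUtc(evt.OccurredAtUtc, nameof(evt.OccurredAtUtc))` before persisting.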
+
+### Commons-020 — Transport types and new Audit-message types have no unit tests in `ScadaLink.Commons.Tests`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.Commons.Tests/` |
+
+**Description**
+
+The Transport (#24) work adds nine records under `Types/Transport/` (`BundleManifest`,
+`EncryptionMetadata`, `BundleSession`, `BundleSummary`, `ExportSelection`,
+`ImportPreview` + `ImportPreviewItem`, `ImportResolution`, `ImportResult`,
+`ManifestContentEntry`) and four interfaces under `Interfaces/Transport/`. None of them
+have a focused test file in `tests/ScadaLink.Commons.Tests/` — coverage is entirely
+inside `tests/ScadaLink.Transport.IntegrationTests/`, which exercises the
+end-to-end exporter/importer flow but does not pin the Commons-level wire contracts.
+
+Similarly, the new `Messages/Audit/` folder (`IngestAuditEventsCommand`/`Reply`,
+`IngestCachedTelemetryCommand`/`Reply`, `UpsertSiteCallCommand`/`Reply`,
+`SiteCallRelayMessages`) and the `Messages/Integration/` additions
+(`AuditTelemetryEnvelope`, `PullAuditEventsRequest`/`Response`) have no
+serialization-shape tests in Commons. The existing `MessageConventionTests`,
+`CompatibilityTests`, `ConnectionBindingSerializationTests`, and
+`SiteCallQueriesTests` cover some but not all of the new traffic — `PullAuditEvents`
+and `AuditTelemetryEnvelope` in particular cross the site→central version-skew
+boundary that REQ-COM-5a is designed to enforce, so a JSON round-trip + named-property
+assertion is the minimum protection against a future positional/tuple slip.
+
+This is the same pattern as Commons-010 — behavior-bearing types with no Commons-level
+test coverage, where the integration suite cannot catch a Commons-only contract
+regression.
+
+**Recommendation**
+
+Add focused tests in `tests/ScadaLink.Commons.Tests/Types/Transport/` (round-trip
+serialization for each Transport record, named JSON property assertions for
+`EncryptionMetadata` / `BundleManifest`, the `BundleSession.Locked` threshold —
+see Commons-016, the `ConflictKind`/`ResolutionAction` enum coverage), and in
+`tests/ScadaLink.Commons.Tests/Messages/Audit/` (round-trip + named-property assertions
+for the seven new message files). Prioritise the contracts that cross the site→central
+boundary (`AuditTelemetryEnvelope`, `PullAuditEventsRequest`/`Response`,
+`IngestCachedTelemetryCommand`).
+
+### Commons-021 — `ExternalCallResult.Response` has a benign lazy-parse race
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.Commons/Interfaces/Services/IExternalSystemClient.cs:91-104` |
+
+**Description**
+
+`ExternalCallResult` is a `record` returned to scripts after an outbound HTTP call. The
+`Response` property lazily parses `ResponseJson` into a `DynamicJsonElement`:
+
+```csharp
+public dynamic? Response
+{
+    get
+    {
+        if (!_responseParsed)
+        {
+            _response = string.IsNullOrEmpty(ResponseJson)
+                ? null
+                : new DynamicJsonElement(System.Text.Json.JsonDocument.Parse(ResponseJson).RootElement);
+            _responseParsed = true;
+        }
+        return _response;
+    }
+}
+```
+
+`_response` and `_responseParsed` are plain mutable fields on a `record` that the
+language otherwise treats as immutable. Two threads reading `Response` simultaneously
+can both see `_responseParsed == false`, both call `JsonDocument.Parse`, and produce
+two distinct `DynamicJsonElement` wrappers — the second write wins, and any reference
+the loser thread already held becomes inconsistent with the winner. The race is benign
+in the current usage (scripts get the result on one thread and use it on that thread),
+and `DynamicJsonElement` after Commons-002 clones the underlying `JsonElement`, so the
+duplicate parses do not even leak document handles. But the pattern is fragile — a
+future caller that hands the result to a background continuation or `Task.WhenAll` would
+introduce a real correctness gap, and the laziness is implicit in `record` semantics
+that otherwise suggest immutability.
+
+**Recommendation**
+
+Back the property with a `Lazy<dynamic?>` field created once per instance (using
+`LazyThreadSafetyMode.ExecutionAndPublication`, the default) and drop the mutable
+backing fields, or replace the property with a method named `ParseResponse()` so the
+laziness is explicit and the caller knows to call it once and cache. Either change is
+local. Note that a `Lazy<T>` field takes part in the `record`'s synthesized equality and
+compares by reference, so the first option needs a custom `Equals`/`GetHashCode` to keep
+two results with the same data equal.
+
+### Commons-022 — `IAuditCorrelationContext` references an unresolvable `BundleImporter.ApplyAsync` cref; JSON-blob columns have no documented shape
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.Commons/Interfaces/Transport/IAuditCorrelationContext.cs:11`, `src/ScadaLink.Commons/Types/Transport/ImportPreview.cs:11`, `src/ScadaLink.Commons/Entities/Notifications/Notification.cs:33` |
+
+**Description**
+
+Two related XML-doc weaknesses, both around the new Transport / Audit surface:
+
+1. `IAuditCorrelationContext`'s remarks say
+   `<see cref="BundleImporter.ApplyAsync"/>`. `BundleImporter` is the concrete
+   implementation in `ScadaLink.Transport.Import`, which Commons does not (and must
+   not) reference. The cref is unresolvable from Commons and will surface as a
+   build-time XML doc warning. The correct reference is the interface method
+   `IBundleImporter.ApplyAsync`.
+
+2. Two JSON-string columns flow across components without a documented shape:
+   - `ImportPreviewItem.FieldDiffJson` — described only as "string?" with no remarks on
+     who produces it, who reads it, or what shape it carries. The Central UI renders it,
+     so a drift between producer and renderer is a silent UI regression.
+   - `Notification.ResolvedTargets` — described as "Resolved delivery targets snapshotted
+     at delivery time, for audit" but the shape (newline-separated emails? a JSON array?
+     comma-separated?) is undocumented. Audit consumers and the Central UI both read
+     this field.
+
+   Both are wire/persistence-format strings; an undocumented schema invites the same
+   kind of producer/consumer drift the `ValueTuple` finding in Commons-008 surfaced for
+   the typed messages.
+
+**Recommendation**
+
+- Fix the `<see cref>` in `IAuditCorrelationContext` to point at `IBundleImporter.ApplyAsync`.
+- Add a remarks block to `ImportPreviewItem.FieldDiffJson` describing the JSON shape
+  (e.g. "a JSON object keyed by field name with `{ existing, incoming }` values") or, if
+  the shape is meant to be opaque to the wire, document that explicitly.
+- Add a remarks block to `Notification.ResolvedTargets` documenting the format.
+- Consider replacing both with strong-typed Commons records — `ResolvedTargets` could be
+  `IReadOnlyList<string>` serialised via EF value converter, and `FieldDiffJson` could
+  be a `FieldDiff` record. That is a larger change and is left as a follow-up.
+
+### Commons-023 — Trailing-optional `SourceNode` on positional records mixes additive evolution patterns
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.Commons/Messages/Audit/SiteCallQueries.cs:53-66`, `:110-123`, `src/ScadaLink.Commons/Messages/Notification/NotificationOutboxQueries.cs:26-39`, `:104-123`, `src/ScadaLink.Commons/Types/SiteCallOperational.cs:42-54`, `src/ScadaLink.Commons/Types/TrackingStatusSnapshot.cs:33-46` |
+
+**Description**
+
+The `SourceNode` rollout adds an optional trailing parameter to a long list of positional
+records. Two minor patterns emerge that are worth flagging:
+
+- `SiteCallSummary` (twelve required positional members plus an optional 13th
+  `SourceNode = null`) — and the parallel `NotificationSummary` (ten required + optional
+  `SourceNode = null`) — both push the optional past a `bool IsStuck` flag. A consumer
+  reading the positional signature is now mixing required and optional members. The
+  record otherwise works correctly because every consumer constructs it via named
+  arguments, but a positional constructor call (which the language allows) would silently
+  miss the new field.
+- `TrackingStatusSnapshot` declares `SourceNode` as non-optional (`string? SourceNode`
+  without `= null`), inconsistent with `SiteCallOperational`'s `string? SourceNode` (also
+  without a default, but `SiteCallOperational` is purely positional). The mix of "optional
+  with default" and "optional without default" across the same domain is fine technically
+  but is the kind of inconsistency that bites a future additive field.
+
+Neither pattern is a defect today — every consumer is updated, and JSON serialization
+treats nullable-without-default the same as nullable-with-default. But the conventions
+across the Audit / Notifications message surface have drifted enough that REQ-COM-5a's
+"additive-only" rule deserves a one-paragraph clarification: do new optional fields take
+a `= null` default, or not? The current code is mixed.
+
+**Recommendation**
+
+Add a one-paragraph "How to add a field" sub-section to REQ-COM-5a stating: new optional
+fields on positional records MUST be added at the end of the parameter list AND MUST
+carry a `= null` (or other safe) default value, so existing positional construction
+sites keep compiling. Apply that rule retroactively to `TrackingStatusSnapshot` and any
+other recent record that did not adopt it. No behavioral change required.
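+
+The proposed rule can be illustrated with a small sketch. `ExampleSummary` is a
+hypothetical record, not one of the real Commons types; it only shows why a trailing
+member with a default keeps older construction sites compiling:
+
+```csharp
+using System;
+
+// Hypothetical record used only to illustrate the rule.
+public sealed record ExampleSummary(
+    string Id,
+    bool IsStuck,
+    string? SourceNode = null); // new optional member: last, with a safe default
+
+public static class Demo
+{
+    public static void Main()
+    {
+        // A call site written before SourceNode existed still compiles.
+        var before = new ExampleSummary("op-1", false);
+
+        // Newer call sites can supply the member by name.
+        var after = new ExampleSummary("op-2", false, SourceNode: "node-a");
+
+        Console.WriteLine(before.SourceNode is null);
+        Console.WriteLine(after.SourceNode);
+    }
+}
+```
+
+Without the `= null` default, the first construction would fail to compile as soon as
+the member is added.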
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.Communication` |
 | Design doc | `docs/requirements/Component-Communication.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |

 ## Summary

@@ -42,6 +42,47 @@ gRPC-supplied `correlation_id` flows straight into an Akka actor name
 (Communication-014), and the factory's endpoint-reuse defect is masked by the test
 mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+All prior findings (Communication-001..015) remain `Resolved` in this commit. The
+re-review walked all 10 checklist categories again, this time on a surface that
+earlier passes had not covered: the central↔site command/control routing surface
+(`CentralCommunicationActor`, `SiteCommunicationActor`) rather than the
+previously-mined gRPC streaming surface. It uncovered a cluster of defects
+around the connection-state-change workflow. The single material finding is
+**`HandleConnectionStateChanged` is dead code**: no production code path emits
+`ConnectionStateChanged`, so the documented "kill active debug streams for the
+disconnected site" + "mark in-progress deployments as failed" workflow never
+fires at runtime (Communication-016). The downstream consequence is
+**`_inProgressDeployments` grows unboundedly** — entries are inserted on every
+deployment but only cleaned via that dead path (Communication-017). Five
+smaller items round out the re-review: site heartbeats hard-code
+`IsActive: true` regardless of node role (Communication-018), the
+60-second-periodic `LoadSiteAddressesFromDb` task has no CancellationToken so a
+hung DB query has no upper bound (Communication-019), the
+`SiteAddressCacheLoaded` internal message carries a mutable
+`Dictionary`/`List` (Communication-020), `SiteStreamGrpcServer.SubscribeInstance`
+leaks the StreamRelayActor if `_streamSubscriber.Subscribe` throws between
+`ActorOf` and the `try` block (Communication-021), and `_debugSubscriptions`
+keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
+(Communication-022). Seven new findings, all Open: one High, one Medium, five
+Low.
+
+## Checklist coverage 2026-05-28
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | `HandleConnectionStateChanged` and its `_inProgressDeployments` / `_debugSubscriptions` cleanup never fire — the connection-state workflow is dead (Communication-016, Communication-017). `_debugSubscriptions` correlation-ID overwrite risk (Communication-022). |
+| 2 | Akka.NET conventions | ✓ | `SiteAddressCacheLoaded` carries mutable `Dictionary<string, List<string>>` — violates message-immutability convention (Communication-020). `Forward`/`PipeTo`/Sender-capture all clean. |
+| 3 | Concurrency & thread safety | ✓ | All mutable state mutated on the actor thread. `_subscriptions` ConcurrentDictionary use disciplined. No new issues. |
+| 4 | Error handling & resilience | ✓ | `LoadSiteAddressesFromDb` lacks a `CancellationToken` propagation point (Communication-019). `SubscribeInstance` leaks the relay actor if `Subscribe` throws pre-try (Communication-021). |
+| 5 | Security | ✓ | Correlation-id validation in place (Communication-014). No new issues. |
+| 6 | Performance & resource management | ✓ | `_inProgressDeployments` grows unboundedly (Communication-017). gRPC client/server lifecycles otherwise clean. |
+| 7 | Design-document adherence | ✓ | `ConnectionStateChanged` handler is dead code — the doc-stated "kill streams on disconnect, fail in-progress deployments" workflow does not actually run (Communication-016). Site heartbeats always report `IsActive: true` regardless of role (Communication-018). |
+| 8 | Code organization & conventions | ✓ | Options pattern correct; mapper placement and proto evolution are additive-only. No new issues. |
+| 9 | Testing coverage | ✓ | `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled` exercises a code path that no production caller ever drives — gives false confidence (related to Communication-016). |
+| 10 | Documentation & comments | ✓ | Detailed XML docs added in this commit. No new issues. |
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -726,3 +767,294 @@ gained `On_GrpcError_Reconnects_To_Other_Node_Endpoint`, which uses a new
 per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's
 NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather
 than being masked by an endpoint-agnostic mock.
+
+### Communication-016 — `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:169`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:338-375` |
+
+**Description**
+
+`CentralCommunicationActor.HandleConnectionStateChanged` is wired to
+`Receive<ConnectionStateChanged>` and implements two important workflows on
+`IsConnected == false`: (1) kill every active debug stream for the disconnected
+site (`_debugSubscriptions` walk → `DebugStreamTerminated` Tell to each
+subscriber); (2) mark every in-progress deployment for that site as failed
+(`_inProgressDeployments` walk → entry removal). Both are documented in the
+component design doc's "Connection Failure Behavior" section and in WP-5 of the
+work plan referenced in the class's own XML doc comment.
+
+A repo-wide search (`grep -rn ConnectionStateChanged src/ tests/`) shows **no
+production code ever emits `ConnectionStateChanged`**. The only producers are
+the unit test `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
+(line 137) and the Commons message-roundtrip test. The
+`CentralCommunicationActor` therefore never receives one in production, the
+disconnect-cleanup workflow never fires, and `_debugSubscriptions` /
+`_inProgressDeployments` are never pruned via this path.
+
+Concrete consequences:
+- A site goes down → its active debug streams do **not** get a synchronous
+  `DebugStreamTerminated` notification from central. The bridge actor must
+  detect the disconnect itself via gRPC keepalive timing out (~25s) or TCP RST.
+  Subscribers wait that long for the `OnStreamTerminated` callback instead of
+  the documented "immediately killed by central" behaviour.
+- In-progress deployments to a disconnected site continue to occupy the
+  Ask-reply window and only fail when the Ask times out at the
+  `CommunicationService.DeployInstanceAsync` layer (120s). They are never
+  proactively marked failed.
+- The unit test gives a strong false impression that the workflow works — it
+  exercises a code path that has no production caller.
+
+The design doc and CLAUDE.md mention "ClusterClient handles failover between
+NodeA and NodeB internally — there is no application-level NodeA preference /
+NodeB fallback logic" — so the ClusterClient mechanism is the documented
+failover transport. But that says nothing about *signalling* a fully-down
+remote cluster to central's coordinator actor, which is exactly what
+`ConnectionStateChanged` was meant to do.
+
+**Recommendation**
+
+Pick one of:
+- Wire a producer for `ConnectionStateChanged` — e.g. subscribe to
+  `ClusterClient`'s contact-point/cluster events (`ClusterClient.ContactPoints`
+  Refresh / `ContactPointAdded` / `ContactPointRemoved`) or watch the
+  ClusterClient actor for a "no contact points reachable" state — and have it
+  publish `ConnectionStateChanged` to `Self` on each transition.
+- If the documented "synchronously kill streams on disconnect" behaviour is
+  intentionally being dropped in favour of the slower keepalive-based
+  detection, delete the handler, the `ConnectionStateChanged` record, and the
+  related `_debugSubscriptions` / `_inProgressDeployments` tracking, then
+  update the design doc's "Connection Failure Behavior" section accordingly.
+
+Either way, replace `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
+— at present it asserts a behaviour that no production code triggers.
+
+---
+
+### Communication-017 — `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:73`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:501`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:357-367` |
+
+**Description**
+
+`TrackMessageForCleanup` inserts `_inProgressDeployments[deploy.DeploymentId] =
+envelope.SiteId` on every `DeployInstanceCommand` routed to a site (line 501).
+The only places that *remove* from `_inProgressDeployments` are:
+- `HandleConnectionStateChanged` on `IsConnected == false` (line 366) — which
+  per Communication-016 never fires in production.
+- `PostStop` (line 553) — only on actor death (central failover).
+
+There is **no removal on the normal happy path** — neither when the site replies
+`DeploymentStatusResponse` (the reply goes to the Ask's temporary reply actor,
+not back through `CentralCommunicationActor`), nor on Ask timeout. Every
+successful or failed deployment leaves its entry behind for the lifetime of the
+process.
+
+Memory impact is modest (each entry is ~70-100 bytes), but the dictionary grows
+monotonically. Over months of operation across all sites a central node could
+accumulate tens of thousands of entries — a real, observable leak. More
+seriously, the field is *also* the source-of-truth set the
+`HandleConnectionStateChanged` walk uses to fail in-progress deployments, so
+even if a `ConnectionStateChanged` *were* fired today, the walk would
+"fail" thousands of already-completed deployments and Tell their (now stale)
+correlation-IDs into the void.
+
+`_debugSubscriptions` (line 67) shares the same shape — but a normal debug
+session ends with an `UnsubscribeDebugViewRequest` that *does* drive cleanup
+(line 497), so leaks are only realised when a consumer crashes without
+unsubscribing.
+
+**Recommendation**
+
+Either remove `_inProgressDeployments` entirely (it has no other consumer once
+Communication-016 is fixed by deletion) or, if the disconnect-cleanup workflow
+is retained, add a removal hook on the reply path. The simplest fix is to
+subscribe `CentralCommunicationActor` to the Ask reply: route
+`DeployInstanceCommand` through the actor with the actor as the Ask sender,
+forward the reply to the original caller, and `_inProgressDeployments.Remove`
+in the same handler. (Today the Ask is taken on the *actor* itself by the
+caller, so the reply skips the coordinator.)
+
+---
+
+### Communication-018 — Site heartbeats hard-code `IsActive: true` regardless of node role
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:357-371` |
+
+**Description**
+
+`SiteCommunicationActor.SendHeartbeatToCentral` builds
+`new HeartbeatMessage(_siteId, hostname, IsActive: true, DateTimeOffset.UtcNow)`
+on every periodic tick (line 366), with no inspection of whether this node is
+actually the active site node or a standby. The `HeartbeatMessage.IsActive`
+field thus carries the literal value `true` on every heartbeat from every
+node, and the field is effectively dead — central's `HandleHeartbeat` doesn't
+consume it either (line 297 only passes `SiteId` and `Timestamp` to
+`MarkHeartbeat`).
+
+Per CLAUDE.md's Cluster & Failover section the active/standby distinction is
+real ("Both nodes are seed nodes", "keep-oldest split-brain resolver",
+"automatic dual-node recovery"), so a heartbeat that *could* carry node-role
+information would be useful for the central health dashboard distinguishing
+"active node down, standby up" from "site fully offline". As shipped, the
+field is contract noise and a future implementer might mistakenly assume it
+already carries meaningful state.
+
+**Recommendation**
+
+Either (a) resolve the current cluster role at heartbeat-send time and pass it
+through — e.g. `Cluster.Get(Context.System).SelfRoles.Contains("active")` or
+the project's existing role mechanism — and have the central aggregator
+consume `IsActive`; or (b) drop the `IsActive` field from `HeartbeatMessage`
+(additive-only-evolution: deprecate the field, default to `true`, plan
+removal in a major message contract revision).
+
+---
+
+### Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:397-431` |
+
+**Description**
+
+`LoadSiteAddressesFromDb` runs `await repo.GetAllSitesAsync()` inside
+`Task.Run(async () => ...).PipeTo(self)` with no cancellation token (line 404).
+The repository signature accepts `CancellationToken` (the test mock declares
+`GetAllSitesAsync(Arg.Any<CancellationToken>())`), but the actor calls the
+no-arg overload — so a hung MS SQL connection has no upper bound. The
+60-second-periodic refresh keeps firing; each tick spawns a fresh `Task.Run`
+that piles up if the database is consistently slow. The actor itself is
+unaffected (it's not blocked), but pending tasks and DB connection-pool
+resources accumulate, and the `Status.Failure` handler (Communication-006)
+never fires because the task never faults — it just sits.
+
+**Recommendation**
+
+Maintain a per-load `CancellationTokenSource` with a deadline (e.g. the same
+60s the refresh runs on, or a configurable timeout in `CommunicationOptions`).
+Pass its `Token` to `GetAllSitesAsync`. Cancel the prior token before spinning
+a new load to avoid task accumulation.
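+
+One way to bound each load, sketched outside Akka for brevity. `ISiteSource` and
+`SiteAddressLoader` are hypothetical stand-ins for the repository and the actor's load
+logic, and the return type of `GetAllSitesAsync` is assumed; only
+`CancellationTokenSource` and its timeout constructor are framework APIs:
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Hypothetical stand-in for the site repository; only the method shape matters here.
+public interface ISiteSource
+{
+    Task<IReadOnlyList<string>> GetAllSitesAsync(CancellationToken cancellationToken);
+}
+
+public sealed class SiteAddressLoader : IDisposable
+{
+    private readonly ISiteSource _source;
+    private readonly TimeSpan _timeout; // assumed to come from configuration
+    private CancellationTokenSource? _current;
+
+    public SiteAddressLoader(ISiteSource source, TimeSpan timeout)
+    {
+        _source = source;
+        _timeout = timeout;
+    }
+
+    // Intended to be called from a single thread (for example the actor's handler).
+    public Task<IReadOnlyList<string>> StartLoad()
+    {
+        // Cancel any load still in flight so slow queries do not pile up.
+        _current?.Cancel();
+        _current?.Dispose();
+
+        // The new source cancels itself once the deadline elapses.
+        _current = new CancellationTokenSource(_timeout);
+        return _source.GetAllSitesAsync(_current.Token);
+    }
+
+    public void Dispose()
+    {
+        _current?.Cancel();
+        _current?.Dispose();
+    }
+}
+```
+
+In the actor, the returned task would still be piped to `Self`; a cancelled load then
+surfaces as a failure rather than sitting indefinitely.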
+
+---
+
+### Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:567` |
+
+**Description**
+
+The Akka.NET convention is that messages crossing actor boundaries (even
+internal Self-messages over an async task boundary) are immutable.
+`SiteAddressCacheLoaded(Dictionary<string, List<string>> SiteContacts)` is a
+record but its `SiteContacts` payload is a mutable `Dictionary` whose values
+are mutable `List<string>`. Constructed inside `Task.Run` and handed off to
+the actor, the cache could in principle be mutated by either side; in
+practice nothing does, but the type itself guarantees nothing:
+CLAUDE.md's "message immutability" rule is being followed only by convention.
+
+**Recommendation**
+
+Change the record signature to use `IReadOnlyDictionary<string, IReadOnlyList<string>>`
+(or `ImmutableDictionary` / `ImmutableArray<string>`) and freeze the data
+before piping. The cost is negligible — the payload is built and consumed
+once per refresh tick.
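+
+One possible shape for the frozen payload, sketched with BCL types only. The
+record signature follows the first option above; `SiteContactsSnapshot.Freeze` is
+a hypothetical helper, not existing code:
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Collections.ObjectModel;
+using System.Linq;
+
+// Assumed shape of the revised message; only the payload type changes.
+public sealed record SiteAddressCacheLoaded(
+    IReadOnlyDictionary<string, IReadOnlyList<string>> SiteContacts);
+
+// Hypothetical helper (not in the codebase).
+public static class SiteContactsSnapshot
+{
+    public static IReadOnlyDictionary<string, IReadOnlyList<string>> Freeze(
+        Dictionary<string, List<string>> source)
+    {
+        // Copy every list into a read-only wrapper over a private array so
+        // later changes to `source` cannot leak into the message.
+        var copy = source.ToDictionary(
+            kv => kv.Key,
+            kv => (IReadOnlyList<string>)Array.AsReadOnly(kv.Value.ToArray()));
+        return new ReadOnlyDictionary<string, IReadOnlyList<string>>(copy);
+    }
+}
+```
+
+The copy is taken once per refresh, so the extra allocation should be small
+relative to the database round-trip that produced the data.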
+
+---
+
+### Communication-021 — `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:188-200` |
+
+**Description**
+
+`SubscribeInstance` performs these statements in order (lines 189-194), all
+*before* the `try` block at line 200:
+1. `Interlocked.Increment(ref _actorCounter)`
+2. `_actorSystem!.ActorOf(Props.Create(typeof(StreamRelayActor), ...))`
+3. `_streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor)`
+
+If step 3 throws (the subscriber is wired but its `Subscribe` faults — a stale
+instance name, a temporary index lookup failure, etc.), the exception escapes
+the method as an unhandled `RpcException` *and* leaks the freshly-created
+`relayActor`. The `finally` block at line 211 is unreachable because the
+throw happens before the `try`. The actor's `_activeStreams` entry, the
+`StreamEntry.Cts`, and the `Channel<SiteStreamEvent>` are also leaked.
+
+In normal operation `_streamSubscriber.Subscribe` does not throw, so the bug is
+latent — but a misbehaving site runtime (e.g. `SiteStreamManager` faulted
+because the actor system is shutting down) would surface it.
+
+**Recommendation**
+
+Restructure to either (a) wrap the `Subscribe` call in a `try` whose `catch`
+stops the relay actor and disposes the CTS, or (b) move the actor + subscriber
+creation *inside* the existing `try` block (the `finally` will then handle
+cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down
+past the `try {` brace.
+
+---
+
+### Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:493` |
+
+**Description**
+
+`TrackMessageForCleanup` on `SubscribeDebugViewRequest` does
+`_debugSubscriptions[sub.CorrelationId] = (envelope.SiteId, Sender)` (line 493).
+The dictionary indexer silently overwrites any prior entry for the same
+`CorrelationId`. If two debug sessions ever reuse the same correlation ID (e.g.
+two Blazor users start a stream at the same moment with a non-GUID id, or a
+caller bug, or a malicious caller as flagged in the cousin
+Communication-014), the first subscriber's entry is overwritten and lost —
+on a later `ConnectionStateChanged(false)` (per Communication-016 it never
+actually fires today, but the design intent stands), only the *second*
+subscriber would be notified of the disconnect.
+
+`DebugStreamService.StartStreamAsync` uses `Guid.NewGuid().ToString("N")` as
+the session id (`DebugStreamService.cs:97`), so a real collision is
+astronomically unlikely in normal operation. But the central side is not
+defending itself: a CLI consumer or a future caller is implicitly trusted to
+generate globally-unique ids.
+
+**Recommendation**
+
+When the slot is already occupied, log a Warning and either reject the new
+subscription with an error response or evict the prior subscriber via
+`DebugStreamTerminated` before installing the new one. This mirrors the
+`SiteStreamGrpcServer` defensive behaviour where a duplicate `correlation_id`
+cancels the existing stream (line 167).
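+
+The "reject" variant can be sketched with `Dictionary.TryAdd`, which leaves an
+existing entry untouched instead of overwriting it. `DebugSubscriptionTable` is a
+hypothetical stand-in for the actor's bookkeeping, and `TSubscriber` stands in for
+the actor reference type:
+
+```csharp
+using System.Collections.Generic;
+
+// Hypothetical stand-in for the `_debugSubscriptions` bookkeeping. Not
+// thread-safe; it assumes single-threaded access, as actor state has.
+public sealed class DebugSubscriptionTable<TSubscriber>
+{
+    private readonly Dictionary<string, (string SiteId, TSubscriber Subscriber)> _entries = new();
+
+    // Returns false and keeps the existing entry when the id is already in
+    // use, so the caller can log a Warning and reject the new request.
+    public bool TryRegister(string correlationId, string siteId, TSubscriber subscriber)
+        => _entries.TryAdd(correlationId, (siteId, subscriber));
+
+    public bool Remove(string correlationId) => _entries.Remove(correlationId);
+}
+```
+
+The "evict" variant would instead look up the existing entry, notify that
+subscriber, and only then overwrite the slot.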
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.ConfigurationDatabase` |
 | Design doc | `docs/requirements/Component-ConfigurationDatabase.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 10 |

 ## Summary

@@ -59,6 +59,59 @@ inconsistency — a redundant cast on one of the three `HasConversion` calls
 (`ConfigurationDatabase-014`). The module is otherwise healthy and the prior fixes
 hold up well.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain
+`Resolved`; their fixes still hold (encryption converter, fail-fast guard,
+peppered API-key hash, ephemeral-fallback hardening, etc.). The module has
+grown since the last review — new code includes Audit Log (#23) raw-SQL
+paths in `AuditLogRepository` (partition-switch purge, recursive
+execution-tree CTE, KPI snapshot, partition-boundary discovery), the
+`AuditLogPartitionMaintenance` SPLIT-RANGE roll-forward implementation, the
+`AuditCorrelationContext` scoped service that stamps `BundleImportId`, the
+`SiteCallAuditRepository` monotonic-rank upsert, and the
+`NotificationOutboxRepository` per-site KPI surface — and most of the new
+findings are concentrated in those raw-SQL paths and in latent gaps left
+behind by the CD-012 hash migration.
+
+Ten new findings were recorded. The most material is
+`ConfigurationDatabase-015`: a check-then-act race in
+`NotificationOutboxRepository.InsertIfNotExistsAsync` with no duplicate-key
+catch — unlike the sibling Audit Log / Site Call ingest paths, a concurrent
+ack-after-persist on the same `NotificationId` will surface as an
+unhandled `DbUpdateException` and break the at-least-once site→central
+handoff. `ConfigurationDatabase-016` flags that
+`InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with
+`ApiKeyHasher.Default` (unpeppered) while the production create-path uses
+the configured peppered hasher — any future caller (or test that exercises
+the method) will silently fail to find a real key; the production
+`ApiKeyValidator` happens not to call it, but the method is a publicly
+exposed `IInboundApiRepository` member and a latent bug.
+`ConfigurationDatabase-017` records that the `DeleteDeploymentRecordAsync`
+stub-attach delete bypasses the documented optimistic-concurrency rule on
+`DeploymentRecord.RowVersion` — the SQLite tests pass because the test
+fixture re-maps `RowVersion` as a nullable concurrency token, but in
+production this is likely to throw `DbUpdateConcurrencyException`.
+`ConfigurationDatabase-018` records the `DateTime`-typed `*Utc` columns on
+`AuditEvent` and `SiteCall` re-emerge as `Kind=Unspecified` on read; the
+sibling Commons module flagged the same pattern as Commons-019, and
+`AuditLogPartitionMaintenance.GetMaxBoundaryAsync` already defends against
+it with an explicit `SpecifyKind(Utc)` — but `GetPartitionBoundariesOlderThanAsync`
+does not (`ConfigurationDatabase-020`). `ConfigurationDatabase-019` is the
+SPLIT-RANGE loop in `AuditLogPartitionMaintenance.EnsureLookaheadAsync`
+swallowing every `SqlException` as a Warning and continuing — a genuine
+failure (permissions, deadlock, transient) leaves a missing boundary and
+the next iteration cheerfully splits the following month, creating a hole.
+`ConfigurationDatabase-021` is a low-severity hardening concern around
+`SwitchOutPartitionAsync`'s raw-SQL interpolation of `monthBoundaryStr` /
+`stagingTableName` (currently safe by construction, but truncates fractional
+seconds). `ConfigurationDatabase-022` is the stale "WP-24 stub" XML comment
+on `DeploymentManagerRepository`. `ConfigurationDatabase-023` is a
+design-doc-adherence drift on `IX_AuditLog_CorrelationId` (design says
+`IX_AuditLog_Correlation`). `ConfigurationDatabase-024` is missing test
+coverage for the SPLIT-RANGE failure-continuation behaviour and for the
+production-shape stub-attach delete with a real rowversion.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -74,6 +127,21 @@ hold up well.
 | 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). |
 | 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | `GetPartitionBoundariesOlderThanAsync` returns `DateTimeKind.Unspecified` (CD-020). `GetApiKeyByValueAsync` hashes with the unpeppered default (CD-016). |
+| 2 | Akka.NET conventions | ✓ | No actors in this module. No issues found. |
+| 3 | Concurrency & thread safety | ✓ | `NotificationOutboxRepository.InsertIfNotExistsAsync` check-then-act has no duplicate-key catch (CD-015). Stub-attach delete bypasses documented optimistic concurrency on `DeploymentRecord.RowVersion` (CD-017). |
+| 4 | Error handling & resilience | ✓ | `AuditLogPartitionMaintenance.EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues (CD-019). |
+| 5 | Security | ✓ | `SwitchOutPartitionAsync` interpolates a `DateTime` string and a GUID-suffixed identifier into raw SQL — safe by construction but pattern is risky (CD-021). |
+| 6 | Performance & resource management | ✓ | No new issues found. |
+| 7 | Design-document adherence | ✓ | Index name drift: design says `IX_AuditLog_Correlation`, code uses `IX_AuditLog_CorrelationId` (CD-023). |
+| 8 | Code organization & conventions | ✓ | `DateTime *Utc` columns on `AuditEvent` / `SiteCall` carry no `DateTimeKind` enforcement (CD-018). |
+| 9 | Testing coverage | ✓ | No tests for SPLIT failure continuation and no production-shape rowversion stub-attach test (CD-024). |
+| 10 | Documentation & comments | ✓ | Stale "WP-24 stub" XML comment on `DeploymentManagerRepository` (CD-022). |
+
 ## Findings

 ### ConfigurationDatabase-001 — `GetTemplateWithChildrenAsync` loads child templates then discards them
@@ -816,3 +884,411 @@ no behavioural regression test is meaningful (cf. CD-005); a forward guard was a
 in `SchemaConfigurationTests.cs` —
 `SecretColumns_AllHaveEncryptedStringConverterApplied` (theory over all three secret
 columns) — asserting each column keeps an `EncryptedStringConverter`.
+
+### ConfigurationDatabase-015 — `NotificationOutboxRepository.InsertIfNotExistsAsync` is a check-then-act race with no duplicate-key catch
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs:33-45` |
+
+**Description**
+
+`InsertIfNotExistsAsync` does `AnyAsync(x => x.NotificationId == n.NotificationId)`,
+then — if false — `AddAsync` + `SaveChangesAsync`. There is a check-then-act window
+between the two operations: two sessions can both pass the `AnyAsync` check and both
+attempt the INSERT, and the loser surfaces as a uniqueness violation on the
+`NotificationId` primary key wrapped in a `DbUpdateException` / `SqlException` (error
+2627). The site→central handoff for notifications is documented as **at-least-once
+with ack-after-persist plus insert-if-not-exists**; collisions on the same
+`NotificationId` are therefore not a "should never happen" but the *expected* contention
+mode. As written, the second concurrent ack throws, fails the site→central
+acknowledgement, and the site retries the same row again on its next forward — a
+livelock if the contending pair keeps racing.
+
+The sibling raw-SQL `IF NOT EXISTS … INSERT` paths in `AuditLogRepository.InsertIfNotExistsAsync`
+(see the `SqlErrorUniqueIndexViolation` / `SqlErrorPrimaryKeyViolation` handling at
+`AuditLogRepository.cs:74-89`) and `SiteCallAuditRepository.UpsertAsync`
+(`SiteCallAuditRepository.cs:87-96`) explicitly catch errors 2601/2627 and treat the
+loser as a no-op — exactly the right pattern for "first-write-wins idempotent ingest".
+This repository alone does not.
+
+**Recommendation**
+
+Either (a) rewrite the body as a single raw-SQL `IF NOT EXISTS … INSERT` and apply the
+same 2601/2627 catch-and-log-Debug pattern the AuditLog and SiteCall repositories use,
+or (b) wrap the existing flow in a try/catch around `SaveChangesAsync` that inspects
+the inner `SqlException.Number` and returns `false` (i.e. "another writer won the race")
+on 2601/2627. Option (a) is preferable because it collapses the two round-trips to one
+and matches the established idempotent-ingest pattern used elsewhere in the module.
+Add a regression test that simulates two concurrent `InsertIfNotExistsAsync` calls
+(using two open contexts) for the same `NotificationId` and asserts neither call
+throws and exactly one row lands.
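+
+A rough sketch of option (b), assuming the SQL Server provider surfaces the
+failure as a `DbUpdateException` whose inner exception is a
+`Microsoft.Data.SqlClient.SqlException`. `IdempotentInsert` and its members are
+hypothetical names, not existing repository code:
+
+```csharp
+using System.Threading;
+using System.Threading.Tasks;
+using Microsoft.Data.SqlClient;
+using Microsoft.EntityFrameworkCore;
+
+// Hypothetical helper (not in the codebase).
+public static class IdempotentInsert
+{
+    // 2601 = duplicate key in a unique index,
+    // 2627 = primary key / unique constraint violation.
+    public static bool IsDuplicateKey(DbUpdateException ex) =>
+        ex.InnerException is SqlException { Number: 2601 or 2627 };
+
+    // Returns true when this writer inserted the row, false when another
+    // writer won the race.
+    public static async Task<bool> TryInsertAsync<TEntity>(
+        DbContext db, TEntity entity, CancellationToken ct) where TEntity : class
+    {
+        db.Add(entity);
+        try
+        {
+            await db.SaveChangesAsync(ct).ConfigureAwait(false);
+            return true;
+        }
+        catch (DbUpdateException ex) when (IsDuplicateKey(ex))
+        {
+            // Stop tracking the rejected entity so a later SaveChanges on
+            // the same context does not retry the insert.
+            db.Entry(entity).State = EntityState.Detached;
+            return false;
+        }
+    }
+}
+```
+
+The sketch omits the `AnyAsync` pre-check; keeping it would only save the failed
+INSERT in the common duplicate case, since the catch alone already makes the call
+idempotent.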
+
+### ConfigurationDatabase-016 — `InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with the unpeppered `ApiKeyHasher.Default`
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/InboundApiRepository.cs:35-39` |
+
+**Description**
+
+`GetApiKeyByValueAsync` resolves an API key by its presented plaintext value by hashing
+the candidate and looking up `KeyHash`. The hash, however, is computed with the static
+`ApiKeyHasher.Default` (the fixed, deployment-independent unpeppered hasher used for
+tests). Production key creation uses the DI-registered, *peppered* `IApiKeyHasher`
+constructed from `InboundApiOptions.ApiKeyPepper` (see CD-012 resolution and
+`ApiKeyHasher.ctor(string pepper)`), so the stored `KeyHash` of any real key was
+produced under the deployment pepper. Hashing the candidate with the unpeppered
+`Default` yields a different digest, and the `WHERE KeyHash = @hash` lookup will never
+match a real key.
+
+The production `ApiKeyValidator` (InboundAPI module) deliberately does NOT call this
+method — it fetches all keys and runs a constant-time comparison via the
+DI-registered hasher (`ApiKeyValidator.cs:53-64`) — so the immediate
+authentication path is unaffected. But `GetApiKeyByValueAsync` remains a publicly
+exposed `IInboundApiRepository` member; any new caller (a future admin tool, a CLI
+command, a test) that uses it under a peppered configuration will silently get a
+`null` result for an existing, valid key, and almost certainly mis-route the failure
+as "key not found".
+
+**Recommendation**
+
+Either (a) take `IApiKeyHasher` via constructor injection — alongside the existing
+`ScadaLinkDbContext` and optional `ILogger` — and use it here so the repository
+participates in the same peppered scheme as the rest of the system; or (b) delete
+the method from both the implementation and `IInboundApiRepository` (Commons) on the
+grounds that the production authentication path correctly avoids it for timing
+reasons and there is no remaining valid caller. Add a regression test that constructs
+the repository under a real `ApiKeyHasher("a-strong-pepper-value")`, inserts an
+`ApiKey.FromHash(...)` using the same hasher, and asserts `GetApiKeyByValueAsync`
+returns the row — under option (a) it should pass; under option (b) the method no
+longer exists.
+
+### ConfigurationDatabase-017 — Stub-attach delete on `DeploymentRecord` bypasses optimistic concurrency
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:83-97` |
+
+**Description**
+
+`DeploymentRecord` carries a SQL Server `rowversion` concurrency token (declared
+in `DeploymentConfiguration` and confirmed by `ConcurrencyTests`), per the design
+doc's "Optimistic concurrency is used on deployment status records". When
+`DeleteDeploymentRecordAsync` falls into its stub-attach branch (no tracked entity
+in `_dbContext.DeploymentRecords.Local` for the given id), it constructs
+`new DeploymentRecord("stub", "stub") { Id = id }`, `Attach`es it, and `Remove`s it.
+The stub's `RowVersion` is left at its default `null` (or `byte[0]`).
+
+EF Core's SQL Server provider generates the delete as
+`DELETE FROM DeploymentRecords WHERE Id = @id AND RowVersion = @stubRowVersion` — and
+the stub rowversion is not the row's real rowversion, so on a real SQL Server (with
+`IsRowVersion()` auto-populating the column) the WHERE never matches and `SaveChanges`
+throws `DbUpdateConcurrencyException`. The path is exercised by
+`RepositoryCoverageTests.DeleteDeploymentRecord_ViaStubAttachPath_RemovesEntity` —
+but the test fixture remaps `RowVersion` as a nullable `IsConcurrencyToken()` column
+without auto-population (`SqliteTestHelper.ConfigureForTests`), so the stored
+RowVersion is null AND the stub's RowVersion is null AND the SQLite delete matches.
+Production-shape behaviour is the opposite.
+
+The same stub-attach pattern is used on `SystemArtifactDeploymentRecord`,
+`Site`, and `DataConnection`. Those entities have no rowversion token, so the
+production behaviour is correct for them — the issue is specific to
+`DeploymentRecord`.
+
+**Recommendation**
+
+Replace the stub-attach branch in `DeleteDeploymentRecordAsync` with a real lookup —
+`await _dbContext.DeploymentRecords.FindAsync([id], ct)` then `Remove` if non-null —
+mirroring `DeleteInstanceAttributeOverrideAsync` and `DeleteDeployedSnapshotAsync`.
+This loses the "delete by id without a read" micro-optimisation (a real concern only
+in batched-delete loops) but restores the documented concurrency contract. If the
+optimisation is genuinely required, attach a `DeploymentRecord` with the *caller's*
+known RowVersion (the caller had to fetch the row at some point) and accept the
+`DbUpdateConcurrencyException` as the correct concurrency signal. Add a regression
+test under MS SQL (extend `RepositoryCoverageTests` with a SQL-Server-flavoured
+fixture, or use `MsSqlMigrationFixture`) that asserts the stub-attach delete works
+when the real RowVersion is supplied.
+
+### ConfigurationDatabase-018 — `DateTime`-typed `*Utc` columns on `AuditEvent` / `SiteCall` carry no `DateTimeKind` enforcement
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs`, `Configurations/SiteCallEntityTypeConfiguration.cs` (mappings for `OccurredAtUtc`, `IngestedAtUtc`, `CreatedAtUtc`, `UpdatedAtUtc`, `TerminalAtUtc`) |
+
+**Description**
+
+`AuditEvent.OccurredAtUtc` / `IngestedAtUtc` and `SiteCall.CreatedAtUtc` /
+`UpdatedAtUtc` / `TerminalAtUtc` / `IngestedAtUtc` are declared as `DateTime` (not
+`DateTimeOffset`) per the Audit Log #23 spec, with a UTC suffix convention. SQL Server's
+`datetime2` type stores no `Kind` flag, so values inserted with
+`DateTimeKind.Utc` round-trip as `DateTimeKind.Unspecified` on read. The EF mappings
+add no `HasConversion(...)` to normalise the kind. The sibling Commons module just
+flagged the same pattern as `Commons-019`; in this module the consequence is concrete:
+
+- `AuditLogPartitionMaintenance.GetMaxBoundaryAsync` already defends with an explicit
+  `DateTime.SpecifyKind(dt, DateTimeKind.Utc)` (see `AuditLogPartitionMaintenance.cs:103-104`).
+  That defence is necessary precisely because the EF mapping does not enforce it.
+- `AuditLogRepository.GetPartitionBoundariesOlderThanAsync` does NOT defend — it
+  returns `reader.GetDateTime(0)` directly with `Kind=Unspecified` (separate finding
+  CD-020).
+- Downstream comparisons like `DateTime.UtcNow` (Kind=Utc) against a re-read
+  `OccurredAtUtc` (Kind=Unspecified) do not produce a runtime error, but any code
+  path that converts via `.ToLocalTime()` or `.ToUniversalTime()` will silently
+  interpret an unspecified-kind value as local time and produce wrong results.
+
+**Recommendation**
+
+Apply a value converter on every `DateTime`-typed `*Utc` column that re-tags the
+`Kind` to `Utc` on read (and, optionally, asserts or applies `SpecifyKind` on write to
+defend against an accidental local-kind write). EF Core ships no built-in UTC
+converter, but a `HasConversion` lambda pair is only a few lines per column:
+
+```csharp
+builder.Property(e => e.OccurredAtUtc)
+    .HasConversion(
+        v => v,
+        v => DateTime.SpecifyKind(v, DateTimeKind.Utc));
+```
+
+Apply uniformly to `AuditEvent` (OccurredAtUtc, IngestedAtUtc), `SiteCall`
+(CreatedAtUtc, UpdatedAtUtc, TerminalAtUtc, IngestedAtUtc), and any other
+`DateTime *Utc` columns added later. Add a regression test that inserts a UTC row,
+re-reads it in a fresh context, and asserts `Kind == DateTimeKind.Utc`. Coordinate
+with the sibling `Commons-019` finding so the resolution is consistent across both
+modules.
+
+### ConfigurationDatabase-019 — `EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues, creating partition holes
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/Maintenance/AuditLogPartitionMaintenance.cs:181-199` |
+
+**Description**
+
+`EnsureLookaheadAsync` loops one month at a time from `next` up to `horizon` and
+issues `ALTER PARTITION SCHEME … NEXT USED` + `ALTER PARTITION FUNCTION … SPLIT RANGE`
+per month. The class doc says idempotency is guaranteed by reading the max-boundary
+first and only issuing SPLITs for strictly-greater months — so "boundary already
+exists" (SQL Server msg 7708/7711) cannot occur by construction. Yet the loop wraps
+each iteration in `catch (SqlException ex) { _logger.LogWarning(...); }` and
+continues, with the rationale "the desired end state (boundary present) is satisfied
+by either path."
+
+That rationale is correct only for an "already-exists" error — which the pre-check
+makes impossible. Any *other* `SqlException` — a permissions failure (the
+`scadalink_audit_purger` role's `ALTER ON SCHEMA::dbo` revoked or not granted), a
+deadlock victim, a transient connection drop, a transaction log full, an underlying
+filegroup full — leaves the boundary genuinely **not** created, logs a Warning
+(quiet by default in most appenders), and the next iteration tries to SPLIT the
+following month. That split *can* succeed (it is a different range value), creating
+a permanent **hole** in the partition layout: month N never had a partition created,
+month N+1 does, so any future row in month N lands in the partition that previously
+spanned both months and partition-switch purge for month N becomes impossible.
+
+The class is the central singleton's daily-tick partition roll-forward, so the hole
+persists until an operator notices it and rebuilds manually — by which point months
+of audit retention may be locked behind the unsplit range.
+
+**Recommendation**
+
+Either (a) drop the `try/catch` entirely so any SPLIT failure aborts the loop and
+surfaces to the hosted service (the next tick retries — at-least-once with no holes),
+or (b) keep the catch but narrow it to ONLY the
+"boundary-already-exists" errors (SQL Server msg 7708 and 7711) and log at Debug,
+mirroring how `AuditLogRepository.InsertIfNotExistsAsync` narrowly catches 2601/2627.
+Option (a) is preferable: by class-doc construction the catch should never fire, so
+its only effect is to mask the real-failure case. Add tests that simulate a SPLIT
+failure (e.g. a permission denial via a constrained test login) and assert the loop
+aborts after the first failure with no further SPLITs.
+
+### ConfigurationDatabase-020 — `GetPartitionBoundariesOlderThanAsync` returns `DateTime` with `Kind=Unspecified`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/AuditLogRepository.cs:378-387` |
+
+**Description**
+
+`GetPartitionBoundariesOlderThanAsync` reads `reader.GetDateTime(0)` and adds the
+raw value to the returned list. SQL Server's `datetime2` materialises as
+`DateTimeKind.Unspecified` on the ADO.NET side (see CD-018), so every returned
+boundary has `Kind=Unspecified`. The sibling `AuditLogPartitionMaintenance.GetMaxBoundaryAsync`
+(`AuditLogPartitionMaintenance.cs:103-104`) explicitly defends against this exact
+issue by calling `DateTime.SpecifyKind(dt, DateTimeKind.Utc)` — exactly because EF /
+ADO.NET strips the kind — but the repository method does not. Callers (the
+`AuditLogPurgeActor`) that compare a returned boundary to `DateTime.UtcNow` get a
+silently wrong comparison if they ever serialise to/from a string with a local-kind
+assumption in between.
+
+**Recommendation**
+
+Wrap the read with `DateTime.SpecifyKind(reader.GetDateTime(0), DateTimeKind.Utc)`,
+matching the explicit defensive pattern already in
+`AuditLogPartitionMaintenance.GetMaxBoundaryAsync`. Better still: fix CD-018 (a value
+converter on the column) so the defence at the read site is no longer required.
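+
+The difference between re-tagging and converting can be seen in a small
+standalone program. The literal date is arbitrary and `raw` merely stands in for
+the value returned by `reader.GetDateTime(0)`:
+
+```csharp
+using System;
+
+// Stand-in for `reader.GetDateTime(0)`: a value with Kind=Unspecified.
+DateTime raw = new DateTime(2026, 5, 1, 0, 0, 0, DateTimeKind.Unspecified);
+
+// SpecifyKind re-tags the value without shifting the clock time.
+DateTime tagged = DateTime.SpecifyKind(raw, DateTimeKind.Utc);
+
+Console.WriteLine(tagged.Kind);               // Utc
+Console.WriteLine(tagged.Ticks == raw.Ticks); // True
+
+// ToUniversalTime() treats an Unspecified value as local time, so this
+// shifts by the machine's UTC offset; only on a UTC host are the two equal.
+Console.WriteLine(raw.ToUniversalTime() == tagged);
+```
+
+The last line is the failure mode to guard against: its result depends on the
+time zone of the host that runs it.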
+
+### ConfigurationDatabase-021 — `SwitchOutPartitionAsync` interpolates `monthBoundary` / staging table name into raw SQL
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/AuditLogRepository.cs:192-338` |
+
+**Description**
+
+`SwitchOutPartitionAsync` builds two large SQL batches via interpolated strings
+(`sampleSql` and `sql`) that include `{monthBoundaryStr}` and `{stagingTableName}`
+directly in the SQL text, and executes them via `ExecuteSqlRawAsync` /
+`cmd.ExecuteScalarAsync`. Both values are constructed inside the method —
+`monthBoundaryStr = monthBoundary.ToUniversalTime().ToString("yyyy-MM-dd HH:mm:ss")`
+and `stagingTableName = $"AuditLog_Staging_{Guid.NewGuid():N}"` — and the formats are
+fully controlled. SQL injection is therefore not possible as the code stands.
+
+Two related concerns:
+
+1. The format string `"yyyy-MM-dd HH:mm:ss"` truncates fractional seconds. The
+   partition function is seeded at `T00:00:00` exactly, so truncation happens to
+   produce the right boundary value today. A future change that adds a sub-second
+   boundary (or invokes `SwitchOutPartitionAsync` with a non-midnight value) would
+   silently round to the wrong partition with no error — and SWITCH PARTITION would
+   either fail loudly or succeed against the wrong month. Use
+   `"yyyy-MM-dd HH:mm:ss.fffffff"` to match the precision the migration seeds at,
+   and the rounding ambiguity disappears.
+2. The pattern of "build a multi-statement DDL batch by string concatenation" is
+   robust today only by inspection. A code review tripwire — the CLAUDE.md note "the
+   data-access layer must not concatenate SQL" — would catch the pattern earlier;
+   converting the batch to a parameterised `sp_executesql` invocation (the inner
+   `EXEC sp_executesql @sql` already exists for the SWITCH itself) is the textbook
+   safe form even when the input is internally controlled.
+
+**Recommendation**
+
+(1) Switch `monthBoundaryStr`'s format to `"yyyy-MM-dd HH:mm:ss.fffffff"`. (2)
+Optionally migrate the two batches to fully parameterised `sp_executesql` form so
+the `monthBoundary` value flows as a typed `@boundary datetime2(7)` parameter
+rather than as interpolated text — the only piece that genuinely *cannot* be
+parameterised is the staging table identifier (DDL identifiers are not parameterisable
+in T-SQL), but a server-side `QUOTENAME(@stagingTable)` wrapper covers it. Add a
+regression test that supplies a non-midnight `monthBoundary` value and asserts the
+boundary lookup resolves to the expected partition.
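+
+The truncation in concern (1) is easy to reproduce in isolation. The boundary
+value here is invented for illustration, and `CultureInfo.InvariantCulture` is
+passed only to make the output deterministic:
+
+```csharp
+using System;
+using System.Globalization;
+
+// A boundary with a sub-second component (1,234,567 ticks = 0.1234567 s).
+DateTime boundary = new DateTime(2026, 5, 1, 0, 0, 0, DateTimeKind.Utc).AddTicks(1_234_567);
+
+// Current format: the fraction is silently dropped.
+Console.WriteLine(boundary.ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture));
+// 2026-05-01 00:00:00
+
+// Recommended format: all seven fractional digits survive.
+Console.WriteLine(boundary.ToString("yyyy-MM-dd HH:mm:ss.fffffff", CultureInfo.InvariantCulture));
+// 2026-05-01 00:00:00.1234567
+```
+
+Seven `f` specifiers match the 100-nanosecond resolution of both `DateTime` and
+`datetime2(7)`.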
+
+### ConfigurationDatabase-022 — Stale "WP-24 Stub level sufficient for diff/staleness support" XML comment on `DeploymentManagerRepository`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:8-14` |
+
+**Description**
+
+The class-level XML doc on `DeploymentManagerRepository` reads "WP-24: Stub level
+sufficient for diff/staleness support." WP-24 (Deployment Manager work-package) shipped
+long ago; the repository now covers full `DeploymentRecord` CRUD,
+`SystemArtifactDeploymentRecord` CRUD, `DeployedConfigSnapshot` CRUD, and an
+`Instance` deletion path with explicit Restrict-FK cleanup
+(`DeleteInstanceAsync` at lines 210-229). The comment misleads a reader into
+thinking the repository is incomplete and tempts them not to investigate further
+before adding new behaviour. The same module-context observation was noted but
+not raised in the prior review.
+
+**Recommendation**
+
+Remove the WP-24 line and rewrite the class doc to describe what the repository
+actually does today: EF Core implementation of `IDeploymentManagerRepository`
+covering deployment records, system-artifact deployment records, deployed config
+snapshots, and the Restrict-FK-aware `DeleteInstanceAsync` for the
+deployment pipeline. Cross-reference the optimistic-concurrency contract on
+`DeploymentRecord.RowVersion`.
+
+### ConfigurationDatabase-023 — `AuditLog` correlation-index name drifts from design doc (`IX_AuditLog_CorrelationId` vs `IX_AuditLog_Correlation`)
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs:99-101`, `Migrations/20260520142214_AddAuditLogTable.cs:103-107` |
+
+**Description**
+
+The Component-ConfigurationDatabase design doc lists the AuditLog indexes by name —
+including `IX_AuditLog_Correlation (CorrelationId)` for the "drilldown from a single
+operation" use case. The implemented index name is `IX_AuditLog_CorrelationId` (the
+fluent-config `HasDatabaseName` call and the matching DDL in the migration both use
+the `Id`-suffixed form). The names are syntactically valid SQL Server index names and
+the index does the right work; the drift is cosmetic but it breaks scripted
+maintenance ops that grep for the documented name (e.g. a runbook reindex script).
+
+The other four documented index names (`IX_AuditLog_OccurredAtUtc`,
+`IX_AuditLog_Site_Occurred`, `IX_AuditLog_Channel_Status_Occurred`,
+`IX_AuditLog_Target_Occurred`, plus the post-design additions
+`IX_AuditLog_Execution`, `IX_AuditLog_ParentExecution`, `IX_AuditLog_Node_Occurred`)
+agree with the code.
+
+**Recommendation**
+
+Pick one direction. Updating the design doc to match the code is cheap (one word) and
+preserves the existing migration; renaming the index in the database requires a new
+migration that does `sp_rename`. Document-aligning is the lower-cost option and
+matches the resolution pattern used for CD-005.
+
+### ConfigurationDatabase-024 — Missing test coverage for SPLIT-RANGE failure-continuation and production-shape rowversion delete
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.ConfigurationDatabase.Tests/Maintenance/AuditLogPartitionMaintenanceTests.cs`, `tests/.../RepositoryCoverageTests.cs:855-869` |
+
+**Description**
+
+`AuditLogPartitionMaintenanceTests` exercises the happy-path SPLIT-RANGE behaviour
+(no-op, single-month, three-month, already-exists idempotency) but never simulates a
+SPLIT *failure* — so the catch-and-continue behaviour flagged in CD-019 is
+behaviourally untested. The class is a central singleton driving daily audit purge;
+a regression that turned the failure path into a permanent hole would not surface in
+the test suite.
+
+Separately, `RepositoryCoverageTests.DeleteDeploymentRecord_ViaStubAttachPath_RemovesEntity`
+covers the stub-attach delete path under the SQLite test fixture, but the fixture
+remaps `RowVersion` as a nullable concurrency token (`SqliteTestHelper`), so it does
+not exercise the production-shape `IsRowVersion()` auto-population — the actual
+concurrency-token bug flagged in CD-017 cannot show up. There is an
+`MsSqlMigrationFixture` in the test project already (used by the Audit Log migration
+tests); the stub-attach delete deserves a parallel MS-SQL-flavoured test.
+
+**Recommendation**
+
+(1) Add an `AuditLogPartitionMaintenanceTests` case that constructs a context against
+a constrained login (no `ALTER ON SCHEMA::dbo`), invokes `EnsureLookaheadAsync` for a
+three-month gap, and asserts: only the partition boundaries created BEFORE the
+permissions failure landed remain, and the call aborts cleanly without continuing to
+later months. This pins down the resolution of CD-019. (2) Add a
+`RepositoryCoverageTests` case that uses `MsSqlMigrationFixture` to insert a
+`DeploymentRecord`, clear the change tracker, call `DeleteDeploymentRecordAsync`,
+and assert the row is gone — pinning the resolution of CD-018. Both tests should be
+`[SkippableFact]` so the suite still passes when no MS SQL Server is available.
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.DataConnectionLayer` |
 | Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 5 |

 ## Summary

@@ -30,6 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
 heuristic. Test coverage is adequate for the happy paths and failover but absent for
 tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+The 2026-05-28 re-review walked all 10 checklist categories against the current
+source and found **5 new findings**. All 17 prior findings remain `Resolved` and the
+fixes (reverse-index unsubscribe, atomic disconnect guards, real-logger threading,
+initial-connect failover, per-tag write-batch results, subscribe-response accuracy)
+were verified in place. The new findings cluster around `HandleSubscribe` /
+`HandleSubscribeCompleted` race-induced state drift:
+
+- **High** — concurrent subscribes for the same tag from different instances each see
+  the tag as not-yet-subscribed (the `alreadySubscribed` snapshot was taken before
+  the Task.Run dispatch), so each Task.Run calls `_adapter.SubscribeAsync` and the
+  later `HandleSubscribeCompleted` silently discards the second adapter subscription
+  handle — the orphan never gets `UnsubscribeAsync`'d.
+- **Medium** — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>`
+  mutated from thread-pool continuations of `SubscribeAsync` / `UnsubscribeAsync`
+  running in parallel; this is the same class of bug DCL-003 fixed in
+  `RealOpcUaClient` but missed in the layer above.
+- **Medium** — `HandleSubscribeCompleted`'s success branch never checks
+  `_unresolvedTags`, so a tag that previously failed resolution (incrementing
+  `_totalSubscribed`) and is then successfully subscribed by a different instance gets
+  `_totalSubscribed++` a second time, double-counting; meanwhile the unresolved entry
+  lingers until the retry timer also resolves it, creating an orphaned monitored item.
+- **Medium** — when an instance is unsubscribed mid-flight,
+  `HandleSubscribeCompleted` re-creates an empty `_subscriptionsByInstance[name]`
+  entry and processes the late results, leaking `_tagSubscriberCount` /
+  `_totalSubscribed` / `_resolvedTags` increments for an instance with no
+  `_subscribers` entry to deliver values to.
+- **Medium** — `HandleSubscribeCompleted` calls `Timers.StartPeriodicTimer` on every
+  completed subscribe with unresolved tags; in Akka.NET, `StartPeriodicTimer` with the
+  same key cancels and replaces the existing timer, so a burst of subscribes arriving
+  faster than `TagResolutionRetryInterval` (10 s default) keeps resetting the timer
+  and the retry never actually fires.
+
 #### Re-review 2026-05-17 (commit `39d737e`)

 All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
@@ -50,7 +84,22 @@ so a mid-batch disconnect aborts the whole write batch (the same class of defect
 DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
 `DataConnectionLayer-014`.

-## Checklist coverage
+## Checklist coverage (2026-05-28 re-review, commit `1eb6e97`)
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | x | Findings 020 (double-count `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance) and 021 (leaked `_subscriptionsByInstance` entry + counter increments when instance unsubscribes mid-flight). |
+| 2 | Akka.NET conventions | x | Finding 022 — `Timers.StartPeriodicTimer` reset on every `HandleSubscribeCompleted` for unresolved tags can stall the retry timer indefinitely under a subscribe burst. |
+| 3 | Concurrency & thread safety | x | Finding 018 — concurrent subscribes for the same tag from different instances each spawn an adapter subscription and the second handle is orphaned. Finding 019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from thread-pool continuations (same class of bug as DCL-003 one layer above). |
+| 4 | Error handling & resilience | x | No new issues; DCL-004 / DCL-007 / DCL-015 / DCL-017 fixes verified in place. |
+| 5 | Security | x | No new issues; DCL-012 / DCL-014 fixes verified. The Commons-side `OpcUaEndpointConfig.AutoAcceptUntrustedCerts = true` default surfaced in DCL-012 is still present but is out of this module's scope. |
+| 6 | Performance & resource management | x | No new issues; DCL-008 reverse index verified. (Finding 018's orphaned adapter handle is logged under concurrency.) |
+| 7 | Design-document adherence | x | No new issues. DCL-009's design-doc action (document unstable-disconnect failover trigger + configurable threshold) is still open at the doc level but out of this module's scope. |
+| 8 | Code organization & conventions | x | No issues — POCOs in Commons, options class owned by component, factory + DI registration consistent. |
+| 9 | Testing coverage | x | DCL001–017 regression tests present. Gaps remain for finding 018 (concurrent subscribe of same tag from two instances), 019 (concurrent `_subscriptionHandles` mutation), 020 (resolve-via-different-instance), 021 (unsubscribe-mid-flight), 022 (timer-reset starvation). |
+| 10 | Documentation & comments | x | No new issues; DCL-013 atomic-guard XML comments verified. |
+
+## Checklist coverage (2026-05-17 re-review, commit `39d737e`)

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
@@ -896,3 +945,268 @@ unhandled exception. Regression test
 `DCL017_WriteBatch_ReturnsPerTagResults_WhenConnectionDropsMidBatch` fails against the
 pre-fix code (the batch throws, no map returned) and passes after;
 `DCL017_WriteBatch_CancellationAbortsWholeBatch` guards that cancellation still aborts.
+
+### DataConnectionLayer-018 — Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653` |
+
+**Description**
+
+`HandleSubscribe` snapshots `_subscriptionIds.Keys` into a local `alreadySubscribed`
+set on the actor thread before dispatching the `Task.Run` that performs the adapter
+I/O (line 557). The snapshot is the only basis on which the background task decides
+whether to call `_adapter.SubscribeAsync` — and it is taken **once**, before the I/O
+runs.
+
+If two `SubscribeTagsRequest` messages arrive on the actor thread for different
+instances that both reference the same tag path, both `HandleSubscribe` invocations
+take a snapshot at a time when neither subscribe has completed, so `alreadySubscribed`
+does not contain the shared tag in either snapshot. Both background tasks then call
+`_adapter.SubscribeAsync(tagPath, ...)`, the adapter creates **two** monitored items
+and returns two distinct subscription ids, and each task pipes a `SubscribeCompleted`
+back to the actor with `AlreadySubscribed: false, Success: true`.
+
+`HandleSubscribeCompleted` for the first message takes the success branch and writes
+`_subscriptionIds[tagPath] = subId1`. The second message arrives, hits the
+"already in `_subscriptionIds`" guard at line 653 (`_subscriptionIds.ContainsKey(...)`)
+and `continue`s — but `result.SubscriptionId` (the orphan handle for the second
+adapter subscription) is silently discarded. The orphan monitored item stays alive in
+the OPC UA session for the lifetime of the adapter, sending duplicate data-change
+notifications (whose callbacks were stamped with the captured `generation`) into
+`HandleTagValueReceived` for every value change. Across a deploy that creates many
+instances sharing a few tags, this leaks N-1 monitored items per shared tag and
+doubles/triples the per-tag publish traffic.
+
+DCL-010 fixed an analogous duplicate-dispatch bug for the tag-resolution retry path
+via `_resolutionInFlight`; the equivalent guard is missing on the user-subscribe
+path.
+
+**Recommendation**
+
+Track in-flight subscribes the same way DCL-010 tracks in-flight retries: maintain a
+`HashSet<string> _subscribesInFlight` and add `tagPath` to it on the actor thread
+**before** the `Task.Run` dispatch, only for tags not already in
+`_subscriptionIds` and not already in `_subscribesInFlight`. Tags that are already
+in flight should produce a `SubscribeTagResult(..., AlreadySubscribed: true, ...)`
+without touching the adapter. Remove from `_subscribesInFlight` in
+`HandleSubscribeCompleted` once the result is applied. Add a regression test that
+fans two simultaneous `SubscribeTagsRequest` messages for the same tag and asserts
+exactly one `_adapter.SubscribeAsync(tag, ...)` call (and no orphan subscription id).
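+
+A minimal sketch of that gate, assuming a hypothetical `SubscribeGate` helper that
+stands in for the actor's private state. The class and method names are
+illustrative only; the real fields live on `DataConnectionActor`, and the real code
+would also emit the `AlreadySubscribed: true` result for skipped tags.
+
+```csharp
+using System.Collections.Generic;
+
+// Hypothetical stand-in for the actor-thread bookkeeping; not the real class.
+public sealed class SubscribeGate
+{
+    private readonly Dictionary<string, string> _subscriptionIds = new();
+    private readonly HashSet<string> _subscribesInFlight = new();
+
+    // Call on the actor thread BEFORE the Task.Run dispatch. Returns only the
+    // tags that still need an adapter call and marks them as in flight.
+    public List<string> ClaimTagsToSubscribe(IEnumerable<string> requested)
+    {
+        var toSubscribe = new List<string>();
+        foreach (var tagPath in requested)
+        {
+            if (_subscriptionIds.ContainsKey(tagPath)) continue;
+            // HashSet.Add returns false when the tag is already in flight.
+            if (_subscribesInFlight.Add(tagPath)) toSubscribe.Add(tagPath);
+        }
+        return toSubscribe;
+    }
+
+    // Call on the actor thread once the SubscribeCompleted result is applied.
+    public void Complete(string tagPath, string? subscriptionId)
+    {
+        _subscribesInFlight.Remove(tagPath);
+        if (subscriptionId is not null) _subscriptionIds[tagPath] = subscriptionId;
+    }
+}
+```
+
+Because both methods run only on the actor thread, plain collections are enough
+here; the second request for a shared tag gets an empty claim list and never
+reaches the adapter.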
+
+### DataConnectionLayer-019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,167,177`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:163-164` |
+
+**Description**
+
+`OpcUaDataConnection._subscriptionHandles` is declared as `Dictionary<string,
+string>`. It is mutated from:
+
+- `SubscribeAsync` (line 167): `_subscriptionHandles[subscriptionId] = tagPath;`
+  after an `await _client!.CreateSubscriptionAsync(...)` — i.e. the assignment
+  executes on the continuation thread (a thread-pool thread).
+- `UnsubscribeAsync` (line 177): `_subscriptionHandles.Remove(subscriptionId);`
+  similarly after an `await`.
+
+`DisconnectAsync`, which goes through the underlying `_client.DisconnectAsync`, does
+**not** touch `_subscriptionHandles`, but multiple `SubscribeAsync` /
+`UnsubscribeAsync` calls can run in parallel from the upper layer.
+
+The DCL upper layer calls `_adapter.SubscribeAsync` from multiple places that all
+run off the actor thread:
+
+- `DataConnectionActor.HandleSubscribe` inside its `Task.Run` (multiple invocations
+  can run in parallel — see DCL-018);
+- `HandleRetryTagResolution` issues `_adapter.SubscribeAsync` for every tag in
+  `_unresolvedTags` and pipes the continuation (each subscribe runs concurrently
+  via the SDK's async machinery);
+- `ReSubscribeAll` does the same after a reconnect.
+
+So plain-`Dictionary` mutations occur on multiple thread-pool threads concurrently —
+the exact pattern DCL-003 fixed by switching `RealOpcUaClient._monitoredItems` and
+`_callbacks` to `ConcurrentDictionary<,>`. Plain `Dictionary` mutations during a
+concurrent resize are undefined behaviour: they can throw
+`InvalidOperationException`, corrupt the internal hash buckets, or lose entries.
+
+`_subscriptionHandles` is currently dead state (the dictionary is written to and
+`Remove`d from but **never read**), so a corruption today would not crash the
+subscribe path. The bug is latent, however, and the field will become load-bearing
+the moment any code reads it (e.g., to expose a subscription-id-to-tag-path lookup
+for diagnostics, which is what the dictionary's name suggests it was intended for).
+
+**Recommendation**
+
+Either (a) change `_subscriptionHandles` to
+`ConcurrentDictionary<string, string>` and use `TryAdd` / `TryRemove`, mirroring
+DCL-003's fix one layer down, or (b) delete the field entirely since it is never
+read — the bookkeeping is fully owned by `RealOpcUaClient._monitoredItems` /
+`_callbacks` and `DataConnectionActor._subscriptionIds`. Removing it eliminates the
+race and removes dead state in one stroke. Add a regression test (or extend
+`DCL003_SharedDictionaryFields_AreConcurrentCollections`) that asserts no
+non-concurrent `Dictionary` field is shared across thread boundaries in adapter
+state.
+
+### DataConnectionLayer-020 — `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688` |
+
+**Description**
+
+`HandleSubscribeCompleted`'s success branch (line 656-661) writes
+`_subscriptionIds[result.TagPath] = result.SubscriptionId!; _totalSubscribed++;
+_resolvedTags++;`. The guard at line 653 only skips when the tag is already in
+`_subscriptionIds`; it does **not** check `_unresolvedTags`. So the success branch
+runs for a tag that previously failed resolution from an earlier instance's subscribe
+(which incremented `_totalSubscribed` and added the tag to `_unresolvedTags` at line
+674-676) and is now successfully subscribed by a later instance.
+
+Sequence:
+
+1. Instance A subscribes "Tag1". `_adapter.SubscribeAsync` throws a non-connection-level
+   exception. `HandleSubscribeCompleted` takes the resolution-failure branch:
+   `_unresolvedTags.Add("Tag1"); _totalSubscribed++;` (now 1).
+2. The device finishes booting. Instance B subscribes "Tag1". `_adapter.SubscribeAsync`
+   succeeds, returning `subId`. `HandleSubscribeCompleted` takes the success branch:
+   `_subscriptionIds["Tag1"] = subId; _totalSubscribed++; _resolvedTags++;`
+   (now `_totalSubscribed = 2`, `_resolvedTags = 1`).
+3. `_unresolvedTags` still contains "Tag1". The retry timer fires next tick,
+   `HandleRetryTagResolution` dispatches `SubscribeAsync("Tag1", ...)` against the
+   adapter (creating a **second** monitored item for the same tag), and
+   `HandleTagResolutionSucceeded` runs `_unresolvedTags.Remove("Tag1")` →
+   `_subscriptionIds["Tag1"] = newSubId` (overwriting Instance B's id, orphaning that
+   monitored item) → `_resolvedTags++` (now 2, matching `_totalSubscribed`).
+
+Net effect:
+
+- `_totalSubscribed` is over-counted by 1 from step 2 onward. Until step 3 raises
+  `_resolvedTags` to match, the health report's "subscribed / resolved" ratio is
+  wrong; afterwards both counters read 2 for a single distinct tag.
+- Two adapter subscription handles for the same tag are leaked across this race
+  (DCL-018's orphan plus the retry's second adapter call); the second leaks
+  permanently because `_subscriptionIds["Tag1"]` only stores the most recent id.
+
+**Recommendation**
+
+In `HandleSubscribeCompleted`'s success branch, before the `_totalSubscribed++`,
+check `_unresolvedTags.Remove(result.TagPath)` — if the tag was already counted as
+unresolved, promote it without re-incrementing `_totalSubscribed` (mirror
+`HandleTagResolutionSucceeded`'s shape: only increment `_resolvedTags`,
+`_subscriptionIds[tag] = subId`, and clear `_resolutionInFlight`). Add a regression
+test that asserts `_totalSubscribed` / `_resolvedTags` consistency after the
+"resolve via a second instance" sequence.
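+
+The promotion rule can be sketched as below, assuming a hypothetical
+`SubscriptionCounters` class in place of the actor's private fields. The guard in
+`RecordResolutionFailure` is an assumption of this sketch, not a statement about
+the current code.
+
+```csharp
+using System.Collections.Generic;
+
+// Hypothetical stand-in for the actor's counters; not the real class.
+public sealed class SubscriptionCounters
+{
+    private readonly Dictionary<string, string> _subscriptionIds = new();
+    private readonly HashSet<string> _unresolvedTags = new();
+
+    public int TotalSubscribed { get; private set; }
+    public int ResolvedTags { get; private set; }
+
+    public void RecordResolutionFailure(string tagPath)
+    {
+        // Assumption: count the tag once, when it first becomes unresolved.
+        if (_unresolvedTags.Add(tagPath)) TotalSubscribed++;
+    }
+
+    public void RecordSuccess(string tagPath, string subscriptionId)
+    {
+        if (_subscriptionIds.ContainsKey(tagPath)) return;
+
+        // Remove returns true when the tag was already counted as unresolved,
+        // in which case TotalSubscribed must not be incremented again.
+        var wasUnresolved = _unresolvedTags.Remove(tagPath);
+        _subscriptionIds[tagPath] = subscriptionId;
+        if (!wasUnresolved) TotalSubscribed++;
+        ResolvedTags++;
+    }
+}
+```
+
+Replaying the sequence above (a failure for "Tag1", then a success for "Tag1")
+leaves `TotalSubscribed` and `ResolvedTags` both at 1, and the tag is no longer in
+the unresolved set for the retry timer to pick up.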
+
+### DataConnectionLayer-021 — `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687` |
+
+**Description**
+
+`HandleSubscribe` dispatches a `Task.Run` that performs adapter I/O off the actor
+thread and pipes a `SubscribeCompleted` back. If an `UnsubscribeTagsRequest` for the
+same instance is processed on the actor thread between dispatch and completion,
+`HandleUnsubscribe` removes the instance from both `_subscriptionsByInstance` and
+`_subscribers`. When the late `SubscribeCompleted` arrives,
+`HandleSubscribeCompleted` (line 629-634) **re-creates** the
+`_subscriptionsByInstance[instanceName] = new HashSet<string>()` entry and proceeds
+to apply the results — but `_subscribers[instanceName]` was already removed by the
+unsubscribe and is **not** re-added.
+
+Consequences:
+
+1. `_subscriptionsByInstance` keeps a permanently-leaked entry for an instance that
+   no longer exists. `ReSubscribeAll` derives its tag list from
+   `_subscriptionsByInstance.Values` and will keep re-subscribing the leaked tags on
+   every future reconnect.
+2. For each tag, `_tagSubscriberCount[tagPath]` is incremented (line 647-649), so the
+   reverse index treats the leaked instance as a real subscriber. The only way to
+   drop the count is another `HandleUnsubscribe` for the same instance — which can
+   never arrive because the Instance Actor that owned the instance is gone.
+3. The success branch increments `_totalSubscribed` / `_resolvedTags` (or
+   `_unresolvedTags` for genuine resolution failures), drifting health counters
+   permanently above the actual subscribed instance count.
+4. Subsequent `HandleTagValueReceived` fanout iterates `_subscriptionsByInstance` and
+   skips this entry via the `_subscribers.TryGetValue` check (line 1019), so values
+   are silently dropped — but the work of fanning them out (the iteration and the
+   tag lookup) is still done for every value update on every leaked tag, forever.
+5. The genuine-resolution-failure path at line 682-686 (`subscriber.Tell(new
+   TagValueUpdate(..., QualityCode.Bad, ...))`) also silently no-ops because
+   `_subscribers.TryGetValue` is false — so the design doc's "push bad quality on
+   resolution failure" promise is broken for this case (a minor, edge-case wrinkle).
+
+**Recommendation**
+
+In `HandleSubscribeCompleted`, when `_subscriptionsByInstance.TryGetValue` fails,
+treat the result as obsolete: log it and `return` without re-creating the entry or
+applying any state mutations. Any successfully-created adapter subscriptions in
+`msg.Results` should be cleaned up: iterate the results and call
+`_adapter.UnsubscribeAsync(result.SubscriptionId!)` (fire-and-forget) for each
+successful one so the orphan handles do not leak in the adapter. Add a regression
+test that subscribes from instance A, immediately sends an `UnsubscribeTagsRequest`
+for A while the subscribe I/O is in flight, completes the subscribe, and asserts
+`_subscriptionsByInstance`, `_tagSubscriberCount` and health counters are all clean.
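+
+One possible shape for the obsolete-result check, with hypothetical types
+(`SubscribeResult`, `LateSubscribeCleanup`) and an unsubscribe delegate standing in
+for the real message and adapter. It is a sketch under those assumptions and leaves
+out logging.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Threading.Tasks;
+
+// Hypothetical shapes; the real message and adapter types live in the module.
+public sealed record SubscribeResult(string TagPath, bool Success, string? SubscriptionId);
+
+public static class LateSubscribeCleanup
+{
+    // Returns true when the completion is obsolete and was discarded.
+    public static bool DiscardIfObsolete(
+        IReadOnlyDictionary<string, HashSet<string>> subscriptionsByInstance,
+        string instanceName,
+        IEnumerable<SubscribeResult> results,
+        Func<string, Task> unsubscribeAsync)
+    {
+        // The instance is still registered: the caller applies the results.
+        if (subscriptionsByInstance.ContainsKey(instanceName)) return false;
+
+        foreach (var result in results)
+        {
+            if (!result.Success || result.SubscriptionId is null) continue;
+            // Fire-and-forget: release the adapter handle nobody will own.
+            // A real implementation would also observe and log failures.
+            _ = unsubscribeAsync(result.SubscriptionId);
+        }
+        return true;
+    }
+}
+```
+
+The caller would `return` immediately when this reports `true`, so no entry is
+re-created and no counter is touched for the departed instance.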
+
+### DataConnectionLayer-022 — `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998` |
+
+**Description**
+
+`HandleSubscribeCompleted` (line 691-698) and `HandleTagResolutionFailed` (line
+991-998) both call:
+
+```
+Timers.StartPeriodicTimer(
+    "tag-resolution-retry",
+    new RetryTagResolution(),
+    _options.TagResolutionRetryInterval,
+    _options.TagResolutionRetryInterval);
+```
+
+`Akka.Actor.ITimerScheduler.StartPeriodicTimer(key, ...)` cancels and replaces any
+existing timer registered under the same key. So every additional subscribe (or
+every additional tag-resolution failure) that produces unresolved tags **resets** the
+retry timer's countdown to the full interval — the timer never accumulates
+elapsed time across calls.
+
+With the default `TagResolutionRetryInterval = 10s`, an instance-startup burst that
+produces a new `SubscribeTagsRequest` every 5s (a not-unusual cadence during
+deployment fan-out) will keep cancelling the not-yet-fired retry every 5s, so the
+"periodic" retry never actually fires until subscribes go quiet. In a steady-state
+site with many instances deploying together this can delay tag resolution by tens
+of seconds, leaving attributes at `Bad` quality longer than the documented retry
+interval implies.
+
+**Recommendation**
+
+Start the periodic timer once, when the actor first transitions to having
+non-empty `_unresolvedTags`, and only re-start it after `Timers.Cancel(...)` has
+been called (e.g., when the actor enters `Reconnecting`). The cleanest pattern is to
+gate the start with `if (!Timers.IsTimerActive("tag-resolution-retry"))` before
+calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply the
+same gate at both call sites. Add a regression test that fires 5 subscribes with
+unresolved tags within one retry interval and asserts the retry fires at most one
+interval after the first failure (not after the fifth subscribe).
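+
+The gate could look like the sketch below. `RetryTimerGate` and its
+`EnsureRetryTimer` method are hypothetical names, and `RetryTagResolution` is
+redeclared only so the snippet is self-contained; `IsTimerActive` and
+`StartPeriodicTimer` are the `ITimerScheduler` members referred to above.
+
+```csharp
+using System;
+using Akka.Actor;
+
+// Hypothetical message type, mirroring the RetryTagResolution quoted above.
+public sealed record RetryTagResolution;
+
+public static class RetryTimerGate
+{
+    private const string TimerKey = "tag-resolution-retry";
+
+    // Starts the periodic retry only if it is not already running, so repeated
+    // calls inside one interval no longer reset the countdown.
+    public static void EnsureRetryTimer(ITimerScheduler timers, TimeSpan interval)
+    {
+        if (timers.IsTimerActive(TimerKey)) return;
+        timers.StartPeriodicTimer(TimerKey, new RetryTagResolution(), interval, interval);
+    }
+}
+```
+
+Both call sites would then call `RetryTimerGate.EnsureRetryTimer(Timers,
+_options.TagResolutionRetryInterval)` instead of `StartPeriodicTimer` directly.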
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.DeploymentManager` |
 | Design doc | `docs/requirements/Component-DeploymentManager.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |

 ## Summary

@@ -53,20 +53,52 @@ DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
 it still describes the query-before-redeploy behaviour that actually moved into
 `TryReconcileWithSiteAsync` (DeploymentManager-017).

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Re-reviewed at commit `1eb6e97` after the DeploymentManager-015/016/017 fixes
+and a docs-only XML-comment pass. The three prior findings remain `Resolved`
+and verified — `ApplyPostSuccessSideEffectsAsync` is now invoked from both the
+normal success path and `TryReconcileWithSiteAsync`, the reconciled-success
+branch corrects `prior.RevisionHash` to the target, and `GetDeploymentStatusAsync`'s
+XML doc now describes the local-DB-read it actually performs and cross-refs the
+reconciliation helper. The DiffService wiring, options binding, ref-counted
+operation lock, broadened catch, non-cancellable cleanup, and TestKit-actor
+test seam are still in place. The 7 new findings here are not regressions in
+the DeploymentManager-015/016 fixes — they are issues uncovered by widening
+the lens to the lifecycle paths, reconciliation's interaction with
+intentional `Disabled` state, audit semantics, and operational concerns
+(per-site artifact-build cost, Pending→InProgress double-write).
+
+The single notable correctness issue is DeploymentManager-018: the
+reconciliation shortcut unconditionally sets `instance.State = Enabled` via
+`ApplyPostSuccessSideEffectsAsync`. After a central failover that loses the
+in-memory operation lock, a user can legitimately `Disable` an instance whose
+prior deploy record is still `InProgress`; a subsequent redeploy then reconciles
+and silently re-enables the instance against the user's explicit intent.
+The remaining six findings are medium/low: lifecycle-timeout audit gap
+(DeploymentManager-019), audit-user attribution in reconciliation
+(DeploymentManager-020), silent fallback in `ResolveSiteIdentifierAsync`
+(DeploymentManager-021), back-to-back `Pending`→`InProgress` writes
+(DeploymentManager-022), per-site re-query of system-wide artifacts
+(DeploymentManager-023), and shared static state across `*ProbeActor` tests
+(DeploymentManager-024).
+
 ## Checklist coverage

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ✓ | Re-review 2026-05-17: reconciliation skips instance-state/snapshot updates (DeploymentManager-015) and keeps a stale `RevisionHash` (DeploymentManager-016). Prior: stuck `InProgress` / cancelled-token write (resolved). |
-| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
-| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counts and reclaims semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. No issues at re-review. |
-| 4 | Error handling & resilience | ✓ | Prior gaps DeploymentManager-001/002/003/004 resolved and verified. No new issues. |
-| 5 | Security | ✓ | SMTP credential handling documented as an accepted design decision (DeploymentManager-013). No injection vectors; no authz here (enforced upstream). No new issues. |
-| 6 | Performance & resource management | ✓ | Semaphore leak resolved (DeploymentManager-005). No new issues. |
-| 7 | Design-document adherence | ✓ | Query-before-redeploy and Diff View implemented (DeploymentManager-006/007). Re-review: reconciliation path breaks the deployed-snapshot/instance-state invariants — see DeploymentManager-015. |
-| 8 | Code organization & conventions | ✓ | Options binding resolved (DeploymentManager-008). POCO/repo placement correct. No new issues. |
-| 9 | Testing coverage | ✓ | Broad coverage added (success, lifecycle, lock serialization, reconciliation, artifact matrix). Re-review: reconciled-success path's missing side effects (DeploymentManager-015) are untested. |
-| 10 | Documentation & comments | ✓ | Prior comment findings resolved. Re-review: `GetDeploymentStatusAsync` XML doc is now stale — DeploymentManager-017. |
+| 1 | Correctness & logic bugs | ✓ | New: reconciliation forces `Enabled` even if the user disabled the instance in between (DeploymentManager-018). |
+| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. No issues. |
+| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting verified. Note: test probes hold static state (DeploymentManager-024) — a test concern, not production code. |
+| 4 | Error handling & resilience | ✓ | New: Disable/Enable/Delete timeouts return early without writing any audit entry — deploy has `DeployFailed`, lifecycle has nothing (DeploymentManager-019). |
+| 5 | Security | ✓ | No new issues. SMTP credential decision documented (DeploymentManager-013 closed). |
+| 6 | Performance & resource management | ✓ | New: `BuildDeployArtifactsCommandAsync` re-queries every system-wide artifact set per site in `DeployToAllSitesAsync` (DeploymentManager-023). |
+| 7 | Design-document adherence | ✓ | Reconciliation now performs post-success side effects (DeploymentManager-015 resolved). DeploymentManager-018 surfaces a new gap on `Disabled`-state preservation. |
+| 8 | Code organization & conventions | ✓ | New: redundant `Pending`→`InProgress` back-to-back write with no intervening work (DeploymentManager-022). Silent string-fallback in `ResolveSiteIdentifierAsync` (DeploymentManager-021). |
+| 9 | Testing coverage | ✓ | New: no coverage for the reconciliation-overwrites-Disabled case (part of DeploymentManager-018); test probes share static state across tests (DeploymentManager-024). |
+| 10 | Documentation & comments | ✓ | New: `DeployReconciled` audit uses `prior.DeployedBy` instead of the current `user` parameter — misleading for forensics (DeploymentManager-020). |

 ## Findings

@@ -873,3 +905,293 @@ database as a pure local read, and cross-references `TryReconcileWithSiteAsync`
 as where the query-the-site-before-redeploy reconciliation actually lives.
 Documentation-only change; no regression test (a test asserting comment text
 would be meaningless).
+
+### DeploymentManager-018 — Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:675-682,721-748` |
+
+**Description**
+
+`TryReconcileWithSiteAsync` calls `ApplyPostSuccessSideEffectsAsync` whenever
+the site reports it has the target revision hash, and that helper
+unconditionally writes `instance.State = InstanceState.Enabled`. The
+reconciliation shortcut only runs when the prior `DeploymentRecord` is
+`InProgress` or timeout-`Failed` — exactly the scenarios that survive a central
+failover (the in-memory `OperationLockManager` is lost on failover, by design:
+*"Lost on central failover (acceptable per design — in-progress treated as
+failed)"*).
+
+After such a failover, the per-instance operation lock is gone but the
+deployment record is still `InProgress` in the DB. A user can legitimately
+issue `DisableInstanceAsync` for the same instance — there is nothing in
+`DisableInstanceAsync` that consults the deployment record, only the
+`StateTransitionValidator` over `Instance.State`. If the state is `Enabled`
+(the typical case when the deploy started), the disable proceeds, the site
+honours it (the design states a disabled instance retains its deployed
+configuration), and central now persists `Instance.State = Disabled`. The
+deployment-record row remains `InProgress` (no one transitioned it). Later the
+user retries the deploy: `TryReconcileWithSiteAsync` runs, the site still has
+the target revision hash (Disable doesn't change the deployed config), the
+prior record is marked `Success`, and `ApplyPostSuccessSideEffectsAsync` writes
+`Instance.State = Enabled` — silently overriding the user's explicit Disable.
+
+The same trap exists for any direct DB edit / migration that flipped the state
+between the timed-out deploy and the redeploy. The normal deploy path can
+defensibly assume `Enabled` after a fresh successful apply, but the
+reconciliation path is reconciling *prior* state with *prior* user intent; it
+should preserve `Disabled` if that is the current `Instance.State` at the time
+of reconciliation, mirroring the design's separation between deploy (config
+apply) and disable (subscription/script lifecycle).
+
+**Recommendation**
+
+In the reconciliation branch, do not force `Enabled`. Either:
+- Pass a flag/parameter to `ApplyPostSuccessSideEffectsAsync` telling it
+  whether to touch state, and skip the state write on the reconciliation path
+  (leaving the current `Instance.State` intact, which is already `Enabled`
+  for a fresh deploy that timed out and `Disabled` for the user-disabled
+  follow-up case); or
+- Only set `Enabled` when the current `Instance.State` is `NotDeployed` (i.e.
+  the first-deploy timed-out case), and leave existing `Enabled`/`Disabled`
+  alone.
+
+Add a regression test where an instance with `Instance.State = Disabled` and a
+prior `InProgress` deployment record is reconciled — the resulting
+`Instance.State` must remain `Disabled`, and the deployment record must still
+be marked `Success`.
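+
+A minimal sketch of the second option, assuming a hypothetical
+`ReconciliationState` helper and an `InstanceState` enum that mirrors the three
+states named in this finding (the real enum may have more members):
+
+```csharp
+// Hypothetical mirror of the instance states named in this finding.
+public enum InstanceState { NotDeployed, Enabled, Disabled }
+
+public static class ReconciliationState
+{
+    // Only promote NotDeployed to Enabled; leave an existing Enabled or
+    // Disabled state exactly as the user left it.
+    public static InstanceState AfterReconciledSuccess(InstanceState current) =>
+        current == InstanceState.NotDeployed ? InstanceState.Enabled : current;
+}
+```
+
+The reconciliation branch would assign the result to `instance.State`, while the
+normal deploy path keeps setting `Enabled` after a fresh successful apply.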
+
+### DeploymentManager-019 — Lifecycle command timeout writes no audit entry
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458` |
+
+**Description**
+
+`DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync` each
+wrap the `CommunicationService` call in a linked CTS with
+`LifecycleCommandTimeout` (DeploymentManager-012). On timeout they log a
+warning and `return Result<...>.Failure(...)` — and skip the
+`_auditService.LogAsync` call entirely. As a result, an operator-initiated
+disable/enable/delete that times out at the site leaves **no audit trail**:
+the user, the timestamp, the command id, and the failure mode are not
+recorded in the audit log. The deploy path goes out of its way to write a
+`DeployFailed` audit entry on the same failure mode
+(`DeploymentService.cs:274-276`), with `CancellationToken.None` so the write is
+durable; the lifecycle commands do not.
+
+The design lists audit logging as a Deployment Manager responsibility for "all
+deployment actions, system-wide artifact deployments, and instance lifecycle
+changes" — a timed-out lifecycle command **is** an attempted lifecycle change,
+and the operator action is exactly the kind of event the audit log exists to
+record.
+
+**Recommendation**
+
+In each of the three `catch (Exception ex) when (ex is TimeoutException or
+OperationCanceledException)` blocks, write a `DisableTimeout`/`EnableTimeout`/
+`DeleteTimeout` audit entry (or reuse the existing operation name with a
+failure flag), passing `CancellationToken.None` so a cancelled outer token does
+not prevent the audit write, mirroring `DeployFailed`. Extend the existing test
+`DisableInstanceAsync_SiteUnresponsive_LifecycleCommandTimeoutBoundsTheWait`
+(or add a sibling) to assert that the timeout also produces an audit entry.
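+
+A minimal sketch of the shape this could take, assuming the audit write is
+reachable as a delegate. `RunWithAuditAsync`, `writeAudit`, and the
+`"DisableTimeout"` action name are hypothetical; the real code would call
+`_auditService.LogAsync` with its own arguments.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+public static class LifecycleAuditSketch
+{
+    // Runs a lifecycle command under a timeout and audits the timeout outcome.
+    public static async Task<bool> RunWithAuditAsync(
+        Func<CancellationToken, Task> sendCommand,
+        Func<string, CancellationToken, Task> writeAudit, // hypothetical audit hook
+        TimeSpan timeout,
+        CancellationToken ct)
+    {
+        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
+        cts.CancelAfter(timeout);
+        try
+        {
+            await sendCommand(cts.Token);
+            return true;
+        }
+        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
+        {
+            // CancellationToken.None: the audit write must survive a cancelled caller token.
+            await writeAudit("DisableTimeout", CancellationToken.None);
+            return false;
+        }
+    }
+}
+```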
+
+### DeploymentManager-020 — `DeployReconciled` audit attributes the action to the prior deployer, not the current user
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:683-686` |
+
+**Description**
+
+In `TryReconcileWithSiteAsync` the audit call is:
+
+```
+await _auditService.LogAsync(prior.DeployedBy, "DeployReconciled", ...)
+```
+
+`prior.DeployedBy` is the user who issued the original (timed-out / stuck)
+deployment, not the `user` parameter passed into `DeployInstanceAsync`. The
+current user — the one who triggered the redeploy that produced the
+reconciliation — is dropped on the floor. For audit forensics this is
+misleading: the row will read "user A reconciled their own deployment"
+when in fact user B initiated the action that reconciled it.
+
+The original deployer is interesting context, but it should be carried in the
+audit-detail object (where `DeploymentId` and `RevisionHash` already live), not
+substituted for the actor.
+
+**Recommendation**
+
+Use `user` (the parameter on `DeployInstanceAsync`, threaded through
+`TryReconcileWithSiteAsync`) as the audit actor, and include
+`OriginalDeployer = prior.DeployedBy` in the detail object so the original
+attribution is preserved without misrepresenting who took the action.
+
+### DeploymentManager-021 — `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:107-111` |
+
+**Description**
+
+```
+private async Task<string> ResolveSiteIdentifierAsync(int siteId, CancellationToken cancellationToken)
+{
+    var site = await _siteRepository.GetSiteByIdAsync(siteId, cancellationToken);
+    return site?.SiteIdentifier ?? siteId.ToString();
+}
+```
+
+If the `Site` row is missing (FK was deleted, race with admin delete, DB
+inconsistency), the method silently returns the numeric DB id rendered as a
+string. This is then passed to `CommunicationService.{Deploy,Disable,Enable,
+Delete}InstanceAsync` and `QueryDeploymentStateAsync` as if it were a real
+`SiteIdentifier` (e.g. "site-a"). The communication layer will fail with an
+"unknown site" or routing error, producing a confusing diagnostic that hides
+the actual problem (no site row).
+
+This is a defensive concern, but every mutating operation in the module goes
+through this method, so a stale instance whose site was deleted will produce a
+misleading error every time it is touched.
+
+**Recommendation**
+
+Treat a missing site as a hard validation failure: return a
+`Result.Failure($"Site with ID {siteId} not found")` early from the calling
+operations, instead of fabricating an identifier. The repository already
+returns `Site?`, so the null path is type-visible; just don't paper over it.
+
+### DeploymentManager-022 — `Pending` and `InProgress` are written back-to-back with no intervening work
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:178-194` |
+
+**Description**
+
+`DeployInstanceAsync` does:
+
+```
+record.Status = Pending;
+AddDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
+record.Status = InProgress;
+UpdateDeploymentRecordAsync(record); SaveChangesAsync(); NotifyStatusChange(record);
+```
+
+There is no work between the two writes — flattening, validation, and
+reconciliation have already completed by line 174. The deploy command is sent
+immediately after the `InProgress` write. The `Pending` write therefore costs:
+an extra `SaveChangesAsync` round-trip, an extra `IDeploymentStatusNotifier`
+invocation (which the CentralUI-006 page renders, so the user briefly sees a
+`Pending` flicker before `InProgress`), and an extra row-version bump if EF
+optimistic concurrency is enabled on the table.
+
+The design uses `Pending` to mean "queued, not yet sent" and `InProgress` to
+mean "sent to site, awaiting response". The code's `Pending` slot has no
+queuing — it is set and immediately overwritten — so the state buys nothing
+operationally.
+
+**Recommendation**
+
+Either:
+- Drop the `Pending` write entirely and create the record directly in
+  `InProgress` (one row insert, one notification, simpler UI); or
+- Move the `Pending`→`InProgress` transition to bracket actual queueing/work
+  (e.g. set `Pending` *before* flattening + reconciliation, set `InProgress`
+  immediately before `DeployInstanceAsync` on the comm service) so the two
+  states carry distinguishable semantics worth a separate write.
+
+### DeploymentManager-023 — `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:82-144,169-173` |
+
+**Description**
+
+`DeployToAllSitesAsync` loops over sites and calls
+`BuildDeployArtifactsCommandAsync(site.Id, ...)` for each one. Of the six
+artifact sets the method gathers, **only** `dataConnections` is per-site:
+
+- `_templateRepo.GetAllSharedScriptsAsync` — global.
+- `_externalSystemRepo.GetAllExternalSystemsAsync` — global, plus
+  `GetMethodsByExternalSystemIdAsync` per external system per site.
+- `_externalSystemRepo.GetAllDatabaseConnectionsAsync` — global.
+- `_notificationRepo.GetAllNotificationListsAsync` — global.
+- `_notificationRepo.GetAllSmtpConfigurationsAsync` — global.
+- `_siteRepo.GetDataConnectionsBySiteIdAsync(siteId, ...)` — **per-site**.
+
+With N sites this issues 5·N queries against the global sets where 5 would
+suffice (plus M·N method queries instead of M, where M is the external-system
+count). On a hub-and-spoke deployment with many sites the artifact-deploy path
+is slower than necessary and holds the DbContext longer than needed. Per
+CLAUDE.md, the DbContext is not thread-safe, so the per-site commands are
+correctly built sequentially; the redundant queries therefore cannot overlap,
+and their round-trip cost grows linearly with the site count.
+
+**Recommendation**
+
+Hoist the global queries (shared scripts, external systems + their methods,
+DB connections, notification lists, SMTP configurations) out of
+`BuildDeployArtifactsCommandAsync`, fetch them once in `DeployToAllSitesAsync`,
+and pass them in alongside the site id (or expose a
+`BuildDeployArtifactsCommandAsync(siteId, prefetchedGlobals)` overload).
+`RetryForSiteAsync` (the single-site path) can keep the convenience-overload
+behaviour. Add a test using NSubstitute's `.Received()` to assert
+`_templateRepo.GetAllSharedScriptsAsync` is called exactly once for an
+N-site deployment.
+
+### DeploymentManager-024 — Test probe actors hold mutable static state across tests
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:966-1075`, `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:196-217` |
+
+**Description**
+
+`ReconcileProbeActor.QueryCount` / `DeployCount`, `SerializationProbeActor.MaxConcurrent`
+/ `_current`, and `ArtifactProbeActor.Received` are all `static` fields.
+Each test's actor constructor resets them, but reset-on-construction only
+works as long as no two tests that use the same probe run concurrently. xUnit
+runs the tests within a single class (one collection) sequentially by default,
+so today's tests pass. That guarantee is implicit, though: reuse a probe actor
+from a second test class (separate collections run in parallel by default), or
+adopt a runner or setting that parallelises within a class, and the counters
+race: a deploy in test A could increment `DeployCount` while test B is
+asserting on it.
+
+Static state shared across tests is also why a flaky-test investigation here
+will be unusually painful: the offending interaction is invisible from any
+single test file.
+
+**Recommendation**
+
+Replace the static counters with instance state, hand the actor a probe
+recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg`
+in each test. Where the simpler counter shape is preferred, pass a
+shared-state object into the actor's constructor so each test owns its own
+instance — never reach for `static` mutable test state.
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.ExternalSystemGateway` |
 | Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 6 |

 ## Summary

@@ -51,6 +51,36 @@ both substantive findings are second-order defects in earlier fixes — the earl
 resolutions did not verify the downstream contract of the S&F engine they integrate
 with.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+All seventeen prior findings (001–017) remain `Resolved`; spot-checks against the
+current source confirm the fixes still hold. Between `39d737e` and this re-review the
+only source changes to the module are the documentation-only commit `1eb6e97` (XML
+doc additions) and the `executionId` / `sourceScript` / `parentExecutionId` plumbing
+threaded through `CachedCallAsync` / `CachedWriteAsync` to the S&F enqueue (Audit Log
+#23 Tasks 4/6). The re-review walked the full 10-category checklist again and
+surfaced **six new findings**, none Critical. The most serious
+(`ExternalSystemGateway-018`, High) is that `DeliverBufferedAsync` on both
+`ExternalSystemClient` and `DatabaseGateway` lets a `JsonException` from
+`JsonSerializer.Deserialize` propagate out of the delivery handler — the S&F engine
+treats any thrown exception as a transient retry, so a corrupted or
+schema-incompatible buffered row becomes a permanent poison message that is retried
+on every sweep forever (the same retry-forever class of hazard `-015` already
+addressed for a different cause). `ExternalSystemGateway-019` (Medium) is that
+`HttpClient.Timeout` is never set, so any operator-configured `DefaultHttpTimeout`
+greater than 100s is silently clipped by `HttpClient`'s built-in 100s default and the
+gateway's "timeout applies to the HTTP request round-trip" guarantee no longer
+holds — a partial reopen of the `-002` contract for the long-timeout case.
+`ExternalSystemGateway-020` (Medium) is a silent precision-loss bug in the cached-DB-write
+retry path: `JsonElementToParameterValue` collapses any JSON number that is not
+Int64-convertible to `double`, so a script's `decimal` SQL parameter is downcast on
+retry and only on retry. The remaining three (`-021`/`-022`/`-023`, Low) are an
+unauthenticated-by-default `ApplyAuth` for unknown `AuthType` / malformed Basic config,
+runtime-only HTTP-verb validation, and an undocumented PATCH HTTP method (code vs
+design-doc drift). Theme: every new finding is in a code path that was added or
+touched by the earlier fix bundle but whose error-propagation contract was not
+verified end-to-end against the S&F engine or the design doc.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -66,6 +96,21 @@ with.
 | 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
 | 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | `JsonException` not caught in either `DeliverBufferedAsync`, so a corrupt buffered payload becomes a permanent poison-message retried forever — finding 018. `JsonElementToParameterValue` collapses a non-Int64 number to `double`, silently losing precision for `decimal` SQL parameters on cached-write retry — finding 020. `new HttpMethod(method.HttpMethod)` accepts any string at runtime, so an invalid HTTP verb is only diagnosed at call time — finding 022. |
+| 2 | Akka.NET conventions | ☑ | Still no actors in this module; `AddExternalSystemGatewayActors` remains a no-op. The cached-call lifecycle/audit emission lives in `ScriptRuntimeContext` / `CachedCallTelemetryForwarder` (SiteRuntime / AuditLog), not here, and that boundary is correct. No issues found. |
+| 3 | Concurrency & thread safety | ☑ | Services are still stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. The added `executionId` / `sourceScript` / `parentExecutionId` plumbing flows through method arguments only — no shared state introduced. No findings. |
+| 4 | Error handling & resilience | ☑ | The poison-payload retry-forever path is the headline resilience issue (finding 018). `HttpClient.Timeout` not being set leaves the gateway's per-call round-trip cap clipped to the framework's 100s default whenever the configured `DefaultHttpTimeout` is larger — finding 019 (partial reopen of the `-002` contract). |
+| 5 | Security | ☑ | Auth secrets still never logged; error bodies still truncated. `ApplyAuth` is silent on unknown `AuthType` / empty `AuthConfiguration` / malformed Basic config — finding 021 (fail-open is a real but bounded risk; recorded Low because misconfiguration is the precondition). Connection-string handling in `DatabaseGateway` reads from the entity verbatim and never logs it. |
+| 6 | Performance & resource management | ☑ | Disposal paths from findings 005/010 still hold. The `IHttpClientFactory` name-keyed-options registration (finding 016 fix) creates a fresh `SocketsHttpHandler` per primary-handler build — acceptable because `IHttpClientFactory` recycles handlers. No new findings. |
+| 7 | Design-document adherence | ☑ | The design doc enumerates GET/POST/PUT/DELETE but the code also serializes a body for PATCH (and accepts arbitrary HTTP verbs at runtime) — finding 023 (drift to be reconciled). The per-call timeout guarantee is partially defeated by the unset `HttpClient.Timeout` for option values > 100s — finding 019. |
+| 8 | Code organization & conventions | ☑ | The `-016` fix replaced `ConfigureHttpClientDefaults` with a scoped `IConfigureNamedOptions<HttpClientFactoryOptions>` — verified clean, no new conventions issue. `internal virtual CreateConnection` (DatabaseGateway) and `internal InvokeHttpAsync` (ExternalSystemClient) are exposed via `InternalsVisibleTo` for tests — acceptable. No new findings. |
+| 9 | Testing coverage | ☑ | The `JsonException` deserialization path for `DeliverBufferedAsync` is untested; the `JsonElementToParameterValue` `double`-downcast path is untested; `ApplyAuth`'s unknown-AuthType / empty-config / malformed-Basic branches are untested. Recorded under findings 018 / 020 / 021 rather than a standalone coverage finding. |
+| 10 | Documentation & comments | ☑ | XML doc additions in `1eb6e97` are accurate and consistent. PATCH support is undocumented in the design doc (finding 023). The inline `ExternalSystemGateway-015` block-comment in `CachedCallAsync` (lines 126–133) and the equivalent in `DatabaseGateway.cs:106–113` now correctly describe the "treat 0 as unset" semantics. |
+
 ## Findings

 ### ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered
@@ -951,3 +996,298 @@ method whose effective parameter set is empty produces a URL identical to the
 no-parameters case. Regression test
 `Call_GetWithAllNullParameters_DoesNotAppendTrailingQuestionMark` asserts the
 captured request URI has no trailing `?`; it was verified to fail before the fix.
+
+### ExternalSystemGateway-018 — `DeliverBufferedAsync` lets `JsonException` propagate, turning a corrupt buffered row into a permanent retry-forever poison message
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:176`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:151` |
+
+**Description**
+
+Both `ExternalSystemClient.DeliverBufferedAsync` and `DatabaseGateway.DeliverBufferedAsync`
+begin with an unguarded `JsonSerializer.Deserialize<...>(message.PayloadJson)`:
+
+```csharp
+var payload = JsonSerializer.Deserialize<CachedCallPayload>(message.PayloadJson);
+if (payload == null || string.IsNullOrEmpty(payload.SystemName) || ...) {
+    _logger.LogError("... unreadable payload; parking.");
+    return false;
+}
+```
+
+The "unreadable payload; parking" branch is only entered when `Deserialize` *succeeds*
+and produces a null / partially-empty object. If `PayloadJson` is **malformed JSON** —
+the column was truncated mid-write, an older payload schema is being deserialized into a
+newer record, or storage corruption occurred — `Deserialize` throws `JsonException`
+before that check is ever reached. The exception propagates out of the delivery handler.
+
+The Store-and-Forward retry loop treats *any* thrown exception from a delivery handler
+as a transient failure (only a returned `false` parks the message); see
+`StoreAndForwardService.RetryMessageAsync`. Combined with the `MaxRetries == 0` →
+"unset → bounded default" fix from `-015`, the resulting behaviour is:
+
+1. Corrupt payload arrives in the buffer.
+2. Every retry sweep deserializes, throws `JsonException`, increments `RetryCount`.
+3. The message is retried until `RetryCount >= MaxRetries` and only then parked. With
+   the bounded S&F default that `-015` substitutes for an unset `MaxRetries` (`0` is
+   the default site configuration today), the message does eventually park, but only
+   after failing noisily for `DefaultMaxRetries` sweeps; on any path where that bound
+   is not applied, it is retried forever.
+4. The script is unaware: the cached call returned `WasBuffered: true` long ago.
+
+This is the same "poison message buffered forever" class of hazard that
+`ExternalSystemGateway-001` (no-handler) and `ExternalSystemGateway-015` (MaxRetries==0)
+already removed for their own causes; corrupt JSON is an alternative arrival path into
+the same bad state.
+
+The `DatabaseGateway.DeliverBufferedAsync` path has the same shape and the same defect:
+`JsonSerializer.Deserialize<CachedWritePayload>` at line 151 is not guarded.
+
+**Recommendation**
+
+Wrap the `Deserialize` call in a `try/catch (JsonException)` block in both
+`DeliverBufferedAsync` methods. A `JsonException` is by definition a permanent failure —
+re-running the same deserialization against the same payload will produce the same
+exception — so the catch should log at `LogError` and **return `false`** so the S&F
+engine parks the message rather than retrying. Add regression tests that feed a
+malformed `PayloadJson` to each handler and assert `delivered == false` (i.e. the
+message parks) and that no exception escapes the handler.
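+
+The guard could look roughly like the sketch below. `CachedCallPayloadSketch`
+and `TryReadPayload` are hypothetical, trimmed-down stand-ins (the real
+payload type has more members and the real handler logs via `_logger`); the
+point is only that a `JsonException` is mapped to the same "park" outcome as a
+null or incomplete payload.
+
+```csharp
+using System;
+using System.Text.Json;
+
+// Hypothetical, trimmed-down payload; the real CachedCallPayload has more members.
+public sealed record CachedCallPayloadSketch(string SystemName, string MethodName);
+
+public static class BufferedDeliverySketch
+{
+    // Returns null for a permanently unreadable payload; the delivery handler
+    // would then return false so the S&F engine parks the message.
+    public static CachedCallPayloadSketch TryReadPayload(string payloadJson, Action<string> logError)
+    {
+        try
+        {
+            var payload = JsonSerializer.Deserialize<CachedCallPayloadSketch>(payloadJson);
+            if (payload is null || string.IsNullOrEmpty(payload.SystemName))
+            {
+                logError("Buffered payload is empty or incomplete; parking.");
+                return null;
+            }
+            return payload;
+        }
+        catch (JsonException ex)
+        {
+            // Malformed JSON fails identically on every retry, so treat it as permanent.
+            logError($"Buffered payload is not valid JSON ({ex.Message}); parking.");
+            return null;
+        }
+    }
+}
+```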
+
+### ExternalSystemGateway-019 — `HttpClient.Timeout` is not set; `DefaultHttpTimeout` > 100s is silently clipped by the framework default
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:226,257-264`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:90-102` |
+
+**Description**
+
+The `-002` fix enforces the per-call timeout via a linked `CancellationTokenSource`
+built from `_options.DefaultHttpTimeout` and passed into `SendAsync`. That correctly
+caps every call to *at most* the configured value when `DefaultHttpTimeout` ≤ 100s.
+However, `HttpClient.Timeout` (the framework default) is never set on either the named
+client or its primary handler — the `GatewayHttpClientConfigurator` only sets
+`MaxConnectionsPerServer`. `HttpClient.Timeout` defaults to **100 seconds**, and
+`SendAsync` enforces it internally by cancelling its own private CTS, raising a
+`TaskCanceledException` from `SendAsync` *without* cancelling either the caller's token
+or the gateway's `timeoutCts`.
+
+Consequences when an operator configures `DefaultHttpTimeout` to anything > 100s
+(a legitimate setting for external systems with long-running endpoints — recipe
+exports, large queries):
+
+1. The gateway's `timeoutCts` (e.g. 5 minutes) has not yet fired.
+2. `HttpClient.Timeout` fires at 100s, `SendAsync` throws.
+3. Neither `when (cancellationToken.IsCancellationRequested)` nor
+   `when (timeoutCts.IsCancellationRequested)` matches, so the exception falls into
+   the generic `catch (Exception ex) when (ErrorClassifier.IsTransient(ex))` branch
+   (line 277) and is re-thrown as a `TransientExternalSystemException` with the
+   message `"Connection error to {Name}: A task was canceled."` — misattributing a
+   timeout as a connection error.
+4. The configured 5-minute round-trip window the design doc promises ("Each external
+   system definition specifies a timeout that applies to all method calls on that
+   system" / "applies to the HTTP request round-trip") is silently overridden.
+
+The opposite case (`DefaultHttpTimeout` < 100s) is the only one the `-002` regression
+test exercises (200ms), so the defect is not caught by the existing suite.
+
+**Recommendation**
+
+Set `HttpClient.Timeout = Timeout.InfiniteTimeSpan` on the gateway's named clients via
+the existing `GatewayHttpClientConfigurator` (add an `HttpClientActions` entry rather than
+just `HttpMessageHandlerBuilderActions`), so the cancellation-token mechanism is the
+sole timeout source. The linked `timeoutCts` then reliably enforces
+`DefaultHttpTimeout` for every value, and the timeout-vs-cancellation classification at
+lines 266–276 stays accurate. Add a regression test that configures `DefaultHttpTimeout`
+to ~150s, hangs the handler, and asserts the call times out at the configured value
+and produces a `"Timeout calling..."` (not `"Connection error to..."`) error.
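+
+As a rough illustration of the intended end state, the sketch below uses a
+plain `HttpClient` rather than the `IHttpClientFactory` registration, and the
+names `GatewayTimeoutSketch`, `CreateClient`, and `SendAsync` are hypothetical.
+It assumes the gateway keeps its linked-CTS pattern from `-002`.
+
+```csharp
+using System;
+using System.Net.Http;
+using System.Threading;
+using System.Threading.Tasks;
+
+public static class GatewayTimeoutSketch
+{
+    // With HttpClient.Timeout disabled, the linked CTS is the only timeout source.
+    public static HttpClient CreateClient(HttpMessageHandler handler) =>
+        new HttpClient(handler) { Timeout = System.Threading.Timeout.InfiniteTimeSpan };
+
+    public static async Task<HttpResponseMessage> SendAsync(
+        HttpClient client, HttpRequestMessage request, TimeSpan timeout, CancellationToken ct)
+    {
+        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
+        timeoutCts.CancelAfter(timeout);
+        try
+        {
+            return await client.SendAsync(request, timeoutCts.Token);
+        }
+        catch (OperationCanceledException)
+            when (!ct.IsCancellationRequested && timeoutCts.IsCancellationRequested)
+        {
+            // Our own timer fired, so this is a timeout rather than a connection error.
+            throw new TimeoutException($"Timeout calling {request.RequestUri} after {timeout}.");
+        }
+    }
+}
+```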
+
+### ExternalSystemGateway-020 — `JsonElementToParameterValue` silently downcasts non-Int64 JSON numbers to `double`, losing precision for `decimal` SQL parameters on retry
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:185-193` |
+
+**Description**
+
+`DatabaseGateway.JsonElementToParameterValue` materialises the buffered cached-write
+SQL parameter values during a retry-sweep delivery:
+
+```csharp
+private static object JsonElementToParameterValue(JsonElement element) => element.ValueKind switch
+{
+    JsonValueKind.String => (object?)element.GetString() ?? DBNull.Value,
+    JsonValueKind.Number => element.TryGetInt64(out var l) ? l : element.GetDouble(),
+    ...
+};
+```
+
+For a JSON number, the helper attempts `Int64` first and otherwise returns a `double`.
+There is no `decimal` branch. The immediate-attempt path is unaffected — `CachedWriteAsync`
+on the original call serializes the script-provided typed parameters via
+`JsonSerializer.Serialize(new { ConnectionName, Sql, Parameters = parameters })` and
+executes the SQL directly outside this code path. But the **retry path** runs through
+`DeliverBufferedAsync` → `JsonElementToParameterValue`, so a script that submitted
+a `decimal` value (e.g. `123.4567890123m`) gets:
+
+1. Immediate attempt: the typed `decimal` parameter is bound directly at full
+   precision; the value never passes through this helper.
+2. Retry attempt(s) after a transient failure: the value is read back as a JSON
+   number, fails `TryGetInt64`, and is converted to `double`, which carries ~15–17
+   significant digits against `decimal`'s 28–29. A SQL column of type
+   `decimal(18, 6)` or `numeric` receives a value that has been rounded to `double`
+   precision before parameter binding.
+
+Two further consequences worth recording:
+
+- The downcast is **silent** — there is no log, no error, and the cached-write
+  acknowledgement to the script has long since happened. Data drift between a
+  same-call immediate-success delivery and a same-call retry delivery is the worst
+  shape of "looks like the right value but isn't" defect.
+- For SCADA telemetry (process variables, totals, currency-denominated quality
+  reports) `decimal` is the correct CLR type and `double`'s representation error
+  changes the persisted value.
+
+**Recommendation**
+
+Replace the `Number` branch with a precision-preserving cascade — try `Int64`, then
+`decimal` (`element.TryGetDecimal(out var d) ? d : element.GetDouble()`), and only
+fall back to `double` when even `decimal` fails. Add a regression test against
+`DatabaseGateway.DeliverBufferedAsync` that buffers a write with a high-precision
+`decimal` parameter, drives the delivery, and asserts the SQL parameter bound is a
+`decimal` (or compares the round-tripped value to the original at the parameter level)
+rather than a `double` with truncated precision. While there, re-check the other
+branches: `JsonValueKind.True`/`False`/`Null` (currently fine) and a string that
+happens to encode a number (already correctly returned as `string`).
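+
+A minimal sketch of the cascade, under the assumption that the elided arms of
+the original switch behave as shown here (`ParameterValueSketch`,
+`ToParameterValue`, and the fallback arm are hypothetical). The explicit
+`(object)` casts matter: without them C# gives a `long`/`double` conditional
+the natural type `double`, so the integer branch would be widened before
+boxing.
+
+```csharp
+using System;
+using System.Text.Json;
+
+// Hypothetical stand-in for DatabaseGateway.JsonElementToParameterValue.
+public static class ParameterValueSketch
+{
+    public static object ToParameterValue(JsonElement element) => element.ValueKind switch
+    {
+        JsonValueKind.String => (object)element.GetString() ?? DBNull.Value,
+        // Int64 first, then decimal, and double only when decimal cannot hold the value.
+        JsonValueKind.Number => element.TryGetInt64(out var l) ? (object)l
+            : element.TryGetDecimal(out var d) ? (object)d
+            : element.GetDouble(),
+        JsonValueKind.True => true,
+        JsonValueKind.False => false,
+        JsonValueKind.Null => DBNull.Value,
+        _ => element.GetRawText(), // assumption: other kinds are passed through as raw JSON text
+    };
+}
+```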
+
+### ExternalSystemGateway-021 — `ApplyAuth` silently sends an unauthenticated request on unknown `AuthType`, empty `AuthConfiguration`, or malformed Basic config
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:385-415` |
+
+**Description**
+
+`ApplyAuth` has three fail-open paths that all result in an HTTP request being sent
+**without** the credential the operator configured:
+
+1. Line 387 — `if (string.IsNullOrEmpty(system.AuthConfiguration)) return;` returns
+   early regardless of `AuthType`. A system entity with `AuthType = "apikey"` but an
+   empty `AuthConfiguration` (e.g. the secret column failed to deploy, or the
+   protector key changed and decryption produced `""`) sends every request with no
+   `X-API-Key` header — the gateway is silent.
+2. The `switch` has no `default` arm. A system entity with `AuthType = "bearer"`,
+   `"oauth2"`, a typo like `"ApiKey "` (trailing space) or even `"none"` falls off the
+   `switch` and the request is sent without any auth header — again silent.
+3. Line 408 — `if (basicParts.Length == 2)` skips the auth attach when
+   `AuthConfiguration` for `basic` lacks a `:` separator. The request is sent with no
+   `Authorization` header.
+
+Effectively the gateway treats every misconfiguration as "send anonymously" and
+relies on the remote system rejecting it with a 401/403. That is a defensible default
+on its own, but combined with `-007`'s 2 KB error-body cap and the fact that no audit
+or warning is emitted, an operator debugging "why does my external system always
+return 401" has nothing to go on inside ScadaLink — the gateway never says it failed
+to apply auth. For `AuthType = "none"` (the design's expected sentinel for
+unauthenticated systems) the fall-through is correct; the failure mode is misconfig.
+
+**Recommendation**
+
+Add a `default:` arm to the `switch` that logs `_logger.LogWarning(...)` naming the
+unknown `AuthType` and the system, and emit a similar warning when
+`AuthConfiguration` is empty for an `AuthType` of `"apikey"` or `"basic"` (those
+require a value; `"none"` does not). For Basic auth specifically, the
+`basicParts.Length != 2` branch should also warn. Do **not** include the
+`AuthConfiguration` value in the log message — secrets must stay out of the log
+(consistent with the existing module). A small set of `ApplyAuth` unit tests
+verifying the warning emission and that no `Authorization` / `X-API-Key` header is
+ever leaked in the warning text would close the test gap as well.
+
+### ExternalSystemGateway-022 — `new HttpMethod(method.HttpMethod)` accepts any string at runtime; an invalid HTTP verb fails only at call time
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:233` |
+
+**Description**
+
+`InvokeHttpAsync` constructs the request method directly from the string column:
+`new HttpRequestMessage(new HttpMethod(method.HttpMethod), url)`. `System.Net.Http.HttpMethod(string)`
+performs only a token-character validation (it rejects whitespace and control chars
+but accepts arbitrary non-standard tokens like `"FOO"` or `"GIT"`). The body-vs-query
+selection at lines 239–250 explicitly checks for POST/PUT/PATCH; for any other
+non-standard verb (`"FOO"`) the parameters silently go to neither body nor query
+string and the request is dispatched anyway.
+
+The design doc enumerates GET/POST/PUT/DELETE as the supported set. There is no
+validation at deployment time, at definition save time, or at gateway
+resolution time that `method.HttpMethod` is one of the expected verbs. An operator
+who typos `"DLETE"` discovers the issue only when a script invokes that method and
+the remote server rejects the request — usually as a 4xx that the gateway classifies
+as permanent, which is correct but obscures the root cause.
+
+**Recommendation**
+
+Validate `method.HttpMethod` at gateway entry — either with a small `switch` of
+allowed verbs in `InvokeHttpAsync` that throws `PermanentExternalSystemException` for
+an unsupported verb (cheap, immediate, surfaces a clear error to the script), or by
+adding a validation pass in the Template/Deployment Manager so it can never reach
+the gateway. The first option is local to this module and cheaper to land. Either
+way, the canonical list should agree with `BuildUrl`'s query-vs-body decision (which
+currently knows about POST/PUT/PATCH for body and GET/DELETE for query — note PATCH
+is in the body branch but not the design-doc list; see finding 023).
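+
+The first option might be sketched as an allow-list check ahead of the
+`HttpMethod` construction. `HttpVerbSketch` and `ParseVerb` are hypothetical
+names, and `NotSupportedException` merely stands in for
+`PermanentExternalSystemException`, which this self-contained sketch does not
+define.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Net.Http;
+
+public static class HttpVerbSketch
+{
+    // The documented set; whether PATCH belongs here is left open (finding 023).
+    private static readonly HashSet<string> Allowed =
+        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "GET", "POST", "PUT", "DELETE" };
+
+    // NotSupportedException stands in for PermanentExternalSystemException.
+    public static HttpMethod ParseVerb(string verb)
+    {
+        if (verb is null || !Allowed.Contains(verb))
+            throw new NotSupportedException($"Unsupported HTTP method '{verb}'.");
+        return new HttpMethod(verb.ToUpperInvariant());
+    }
+}
+```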
+
+### ExternalSystemGateway-023 — PATCH HTTP method is supported by code but absent from the design doc; body-vs-query decision drifts from the documented set
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:241`, `docs/requirements/Component-ExternalSystemGateway.md:43` |
+
+**Description**
+
+The component design doc lists the supported HTTP methods as `GET, POST, PUT, or
+DELETE` (line 43: `**HTTP method**: GET, POST, PUT, or DELETE.`). `InvokeHttpAsync`'s
+body-serialization branch at lines 239–250 explicitly includes `PATCH` alongside POST
+and PUT — so PATCH is in fact supported (and routes parameters into the JSON body),
+but operators reading the spec would not know it. Conversely, `BuildUrl`'s
+query-string branch at lines 364–366 lists only `GET` and `DELETE`, so a PATCH
+method's parameters always go to the body, matching the body-branch but not appearing
+anywhere in the documented contract.
+
+This is mild drift — the code is more permissive than the spec. It only becomes a
+real issue if a future change relies on the documented "only GET/POST/PUT/DELETE"
+set and breaks the PATCH path silently, or if PATCH is genuinely out of scope and a
+template author defines a PATCH method on purpose only to learn later it is
+unsupported.
+
+**Recommendation**
+
+Pick one direction and apply it in the same session, per the project's "design doc +
+code travel together" rule:
+
+- If PATCH is intentionally supported, add `PATCH` to the Component-ExternalSystemGateway.md
+  HTTP-method list (line 43) and add a parameterised test confirming a PATCH method
+  sends its parameters in the JSON body and resolves like POST/PUT for error
+  classification.
+- If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the
+  body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it.
+  The design-doc list then remains the single source of truth.
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.HealthMonitoring` |
 | Design doc | `docs/requirements/Component-HealthMonitoring.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |

 ## Summary

@@ -51,6 +51,35 @@ HealthMonitoring + CentralUI change), and `CollectReport` reading
 `TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
 broadly faithful to the design intent.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+All sixteen prior findings (HealthMonitoring-001..016) remain `Resolved`. This
+baseline re-review applied the full 10-category checklist and produced **7 new
+findings** (1 Medium, 6 Low — none crash-class). The most material observation
+is a **metric-loss race** in `HealthReportSender.ExecuteAsync`
+(HealthMonitoring-017): `CollectReport` resets the per-interval error counters
+(`ScriptErrorCount`, `AlarmEvaluationErrorCount`, `DeadLetterCount`,
+`SiteAuditWriteFailures`, `AuditRedactionFailure`) **before**
+`_transport.Send(...)` is attempted, so a transport failure (the existing
+`catch { LogError; }` path) silently discards every error this site recorded in
+the failed interval. This inverts the module-specific concern of "metric counters
+drifting from raw-per-interval to cumulative": the counts do not accumulate, they are lost. A
+parallel hazard exists in `CentralHealthReportLoop` (HealthMonitoring-018). The
+remaining items are smaller: two Audit Log metrics
+(`SiteAuditTelemetryStalled`, `CentralAuditWriteFailures`) listed in the design
+doc never make it into a HealthMonitoring surface (HealthMonitoring-019); a
+heartbeat with `receivedAt <= existing.LastHeartbeatAt` brings an offline site
+back online with a stale heartbeat that can flap right back to offline on the
+next check (HealthMonitoring-020); the reserved `CentralSiteId = "central"`
+constant collides with any real site named `"central"` and silently extends its
+offline grace (HealthMonitoring-021); `CentralHealthReportLoopTests` uses real
+wall-clock 50 ms timers + `Task.Delay`, making it timing-sensitive
+(HealthMonitoring-022); and one obsolete placeholder test name
+(`StoreAndForwardBufferDepths_IsEmptyPlaceholder`) misrepresents what it now
+covers (HealthMonitoring-023). All sequence-number and offline-detection
+arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
+monotonic mismatch was observed.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -66,6 +95,21 @@ broadly faithful to the design intent.
 | 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
 | 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | x | `HealthReportSender` and `CentralHealthReportLoop` reset per-interval counters before the send/process call — counts lost on transport failure (HealthMonitoring-017, HealthMonitoring-018). `MarkHeartbeat` brings an offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt` — site can flap straight back to offline (HealthMonitoring-020). `CentralSiteId = "central"` reserved constant silently collides with any real site named "central" (HealthMonitoring-021). |
+| 2 | Akka.NET conventions | x | Module contains no actors itself. `IHealthReportTransport` cleanly abstracts the Akka-remoting send. `ProcessReport`/`MarkHeartbeat` are called from `CentralCommunicationActor`'s receive — invoked on the actor thread but the aggregator's CAS loops make that safe regardless. No issues found. |
+| 3 | Concurrency & thread safety | x | Verified the resolved `SiteHealthState` immutable-record / CAS-loop pattern still holds across `ProcessReport`, `MarkHeartbeat`, `CheckForOfflineSites`. `SiteHealthCollector` uses `volatile` for reference fields (`_clusterNodes`, `_nodeHostname`, `_siteAuditBacklog`, `_isActiveNode`) and `Interlocked` for counters consistently. `CollectReport`'s `new Dictionary<>(concurrentDict)` snapshots are not strictly atomic but acceptable at the documented scale. No new issues found. |
+| 4 | Error handling & resilience | x | `try/catch` blocks now log all non-fatal failures (resolved HealthMonitoring-010 still in place). Outer `catch (Exception)` in `ExecuteAsync` keeps the loop alive — sound. New: the counter-reset-before-send issue (HealthMonitoring-017, HealthMonitoring-018) is an error-handling gap — transport failure silently swallows the interval's metric data. |
+| 5 | Security | x | No issues found. The module handles only numeric/string operational metrics; no secrets, auth surface, or untrusted input parsing. `MarkHeartbeat` and `ProcessReport` trust the caller (intra-cluster). |
+| 6 | Performance & resource management | x | `PeriodicTimer` instances disposed via `using`. CAS retry loops in `ProcessReport`/`MarkHeartbeat` have no bounded retry cap, but contention is confined to a single dictionary entry (one entry per site), so the loops are effectively wait-free in the common case. No issues found. |
+| 7 | Design-document adherence | x | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` are listed as required dashboard tiles in `Component-HealthMonitoring.md` but have no HealthMonitoring-side surface — both live only in `AuditLog`'s `AuditCentralHealthSnapshot` with no integration into the health aggregator or its consumers (HealthMonitoring-019). |
+| 8 | Code organization & conventions | x | Options class correctly owned by the component, validator registered idempotently across all three `Add*` paths. POCO/messages in Commons. `AddCentralHealthAggregation` implicitly depends on `ISiteHealthCollector` being registered elsewhere (Host calls `AddHealthMonitoring()` first) — works but is a hidden ordering requirement. Minor; not flagged. |
+| 9 | Testing coverage | x | Per-interval reset semantics covered for site-side counters but NOT for the failed-send case (no test asserts counters remain accumulated when the transport throws — would catch HealthMonitoring-017). `CentralHealthReportLoopTests` uses real wall-clock 50 ms `PeriodicTimer` + `Task.Delay(250)` for timing — flake-prone on a slow CI runner (HealthMonitoring-022). The placeholder test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` name is stale (HealthMonitoring-023). |
+| 10 | Documentation & comments | x | XML docs in the new audit-bridge surfaces (`IncrementSiteAuditWriteFailures`, `IncrementAuditRedactionFailure`, `UpdateSiteAuditBacklog`) are accurate. The stale placeholder test name is the only issue (HealthMonitoring-023). |
+
 ## Findings

 ### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
@@ -776,3 +820,314 @@ continues to work via the optional parameter. Regression test
 asserts the timestamp equals a fixed injected instant exactly (not just a
 before/after window); it would not compile against the pre-fix single-arg-less
 constructor.
+
+### HealthMonitoring-017 — `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:140-154`, `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:146-153` |
+
+**Description**
+
+`HealthReportSender.ExecuteAsync` calls `_collector.CollectReport(_siteId)` and
+then `_transport.Send(reportWithSeq)` inside a single `try` block whose `catch`
+logs and continues. `CollectReport` atomically reads and resets the per-interval
+counters via `Interlocked.Exchange(ref _scriptErrorCount, 0)` (and the same for
+`_alarmErrorCount`, `_deadLetterCount`, `_siteAuditWriteFailures`,
+`_auditRedactionFailures`). If `_transport.Send` then throws — Akka remoting
+hiccup, transport not yet associated, central side temporarily unavailable,
+serialization failure on a malformed metric, etc. — the `catch (Exception ex)`
+on line 150 logs an error and the loop simply waits for the next tick. The
+report was never delivered, but the counters have already been reset to zero, so
+**every error this site recorded in the failed interval is gone**: it is neither
+in the (un-sent) report nor in the (zeroed) collector. The very next successful
+report will show "0 script errors / 0 alarm errors" for the entire window in
+which the transport was broken, masking exactly the period the operator most
+needs to triage.
+
+This contradicts the design doc's "raw counts per reporting interval" / "counter
+resets **after each report is sent**" wording — current code resets on each
+report _attempt_, regardless of outcome. The hazard worsens under sustained
+transport failure: every interval's errors are lost; the central dashboard sees
+a quiet site while the site is, in fact, failing.
+
+The same shape exists in `CentralHealthReportLoop` (see HealthMonitoring-018) —
+`CollectReport` is called before `_aggregator.ProcessReport`. The aggregator
+call is in-process and unlikely to throw, but the structural bug is identical.
+
+**Recommendation**
+
+Build the report from a non-destructive read first (`PeekReport(siteId)`,
+returning a snapshot without mutating the counters) and only call a dedicated
+`ResetIntervalCounters()` after a successful `_transport.Send`. Alternatively,
+on a `Send` failure, restore the lost counts via `Interlocked.Add` of the
+captured values back into the collector fields — atomically correct as long as
+no other thread can read them in between, which is true here because the next
+read is the next `CollectReport` on the same loop. The "peek then commit"
+shape is the cleaner public API.
+
+A regression test should add a failing-transport scenario:
+`Send` throws an `InvalidOperationException`; assert that the next successful
+report includes the previously-failed interval's `ScriptErrorCount`.
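+
+The "peek then commit" shape could be sketched as below. This is a simplified
+illustration, not the collector's real API: `IntervalCounter`, `Peek`, `Commit`,
+and `TrySendOnce` are hypothetical names, and a single counter stands in for the
+five real ones. One deliberate variation is labelled in the code: `Commit`
+subtracts the reported value rather than zeroing the field, on the assumption
+that increments may arrive while a send is in flight.
+
+```csharp
+using System;
+using System.Threading;
+
+// Simplified stand-in for the collector: one counter instead of five.
+public sealed class IntervalCounter
+{
+    private int _scriptErrorCount;
+
+    public void IncrementScriptError() => Interlocked.Increment(ref _scriptErrorCount);
+
+    // Non-destructive read: the counter keeps its value.
+    public int Peek() => Volatile.Read(ref _scriptErrorCount);
+
+    // Assumption: subtract only what was reported, so increments that arrived
+    // between Peek and Commit survive into the next interval.
+    public void Commit(int reported) => Interlocked.Add(ref _scriptErrorCount, -reported);
+}
+
+public static class ReportLoopSketch
+{
+    // One tick of the loop; `send` stands in for `_transport.Send`.
+    public static bool TrySendOnce(IntervalCounter counter, Action<int> send)
+    {
+        var snapshot = counter.Peek();
+        try
+        {
+            send(snapshot);
+        }
+        catch (Exception)
+        {
+            return false; // nothing committed; the counts carry over to the next tick
+        }
+
+        counter.Commit(snapshot);
+        return true;
+    }
+}
+```
+
+With this shape, a tick whose `send` throws leaves the counter untouched, so the
+next successful report carries the failed interval's errors as well as its own.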
+
+### HealthMonitoring-018 — Same counter-reset-before-publish hazard in `CentralHealthReportLoop`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:87-98` |
+
+**Description**
+
+`CentralHealthReportLoop.ExecuteAsync` calls `_collector.CollectReport(CentralSiteId)`
+(which resets the per-interval counters on the shared `SiteHealthCollector`
+instance — see HealthMonitoring-017) and then `_aggregator.ProcessReport(reportWithSeq)`
+inside the same `try` block. If `ProcessReport` throws, the central node's own
+per-interval counters (`ScriptErrorCount`, `AlarmEvaluationErrorCount`,
+`DeadLetterCount`, `SiteAuditWriteFailures`, `AuditRedactionFailure`) are lost
+for that interval.
+
+In practice `ProcessReport` is a pure in-memory CAS loop and is very unlikely
+to throw, so the operational impact is small. However, the structural bug is
+identical to HealthMonitoring-017 and would be fixed by the same
+"peek then commit" refactor in `SiteHealthCollector`. The Audit-Log-related
+metrics matter most here: `AuditRedactionFailure` is genuinely incremented at
+central during normal operation (the Notification Outbox dispatcher and
+Inbound API middleware both write through `CentralAuditRedactionFailureCounter`
+which can fan out to the collector via the bridge), so this is not purely
+theoretical.
+
+**Recommendation**
+
+Adopt the same "peek then reset on successful publish" pattern recommended for
+HealthMonitoring-017. Reuse the new `PeekReport` / `ResetIntervalCounters`
+collector API once it lands.
+
+### HealthMonitoring-019 — `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `docs/requirements/Component-HealthMonitoring.md:39,40`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs`, `src/ScadaLink.AuditLog/Central/AuditCentralHealthSnapshot.cs:39-58` |
+
+**Description**
+
+`Component-HealthMonitoring.md` lists `SiteAuditTelemetryStalled` and
+`CentralAuditWriteFailures` (and reiterates them under the Audit Log KPIs
+section and in the Dependencies section) as required dashboard metrics. The
+doc also says they "are central-computed alongside the existing central KPIs"
+(Notification Outbox / Site Call Audit) and surface in the **Audit** dashboard
+tile group.
+
+Tracing the code:
+
+- `SiteAuditTelemetryStalled` is published by `SiteAuditReconciliationActor`,
+  picked up by `SiteAuditTelemetryStalledTracker`, and latched into
+  `AuditCentralHealthSnapshot._stalled` (a `ConcurrentDictionary<string, bool>`
+  in the `ScadaLink.AuditLog` assembly).
+- `CentralAuditWriteFailures` is incremented inside `AuditCentralHealthSnapshot`
+  via `ICentralAuditWriteFailureCounter.Increment()` (also in `ScadaLink.AuditLog`).
+
+Neither metric is referenced anywhere in `src/ScadaLink.HealthMonitoring/`:
+- `ICentralHealthAggregator` does not expose them.
+- `SiteHealthCollector` has no central counterpart (it is site-only).
+- `SiteHealthReport` has no `SiteAuditTelemetryStalled` / `CentralAuditWriteFailures`
+  fields (the site-only `SiteAuditWriteFailures`, `AuditRedactionFailure`, and
+  `SiteAuditBacklog` _are_ wired; the central pair is the gap).
+
+Currently the only consumer of `IAuditCentralHealthSnapshot` is whatever
+Central UI page binds to it directly (out of scope for this module), but the
+design doc places these metrics under HealthMonitoring's responsibility
+("Health Monitoring Dashboard displays aggregated metrics"). At minimum the
+Dependencies section's claim that Health Monitoring provides "the
+central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics"
+is false for `CentralAuditWriteFailures`: nothing under
+`src/ScadaLink.HealthMonitoring/` knows about it.
+
+**Recommendation**
+
+Decide whether HealthMonitoring or the consuming UI page owns the
+`IAuditCentralHealthSnapshot` integration:
+
+- If HealthMonitoring owns it, expose a `CentralKpis` accessor on
+  `ICentralHealthAggregator` (e.g. a `GetCentralAuditHealth()` method that
+  returns a typed DTO derived from the injected `IAuditCentralHealthSnapshot`)
+  so the dashboard has a single read surface mirroring `GetAllSiteStates`.
+- If the UI page binds `IAuditCentralHealthSnapshot` directly, update the
+  HealthMonitoring design doc's Responsibilities / Dependencies sections to
+  reflect that and remove the implied integration.
+
+Either way, add a regression test asserting that the chosen surface returns the
+live counter and per-site stalled state.
+
+### HealthMonitoring-020 — `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:128-147` |
+
+**Description**
+
+The CAS path in `MarkHeartbeat` picks `newHeartbeat = max(receivedAt, existing.LastHeartbeatAt)`,
+then short-circuits only when `newHeartbeat == existing.LastHeartbeatAt &&
+existing.IsOnline`. That short-circuit is correct, but consider the case where
+`existing.IsOnline == false` and `receivedAt <= existing.LastHeartbeatAt`:
+
+1. Suppose a site is marked offline by `CheckForOfflineSites` at time T1.
+2. A late/out-of-order heartbeat carrying a `receivedAt` _older_ than the last
+   stored `LastHeartbeatAt` arrives at T2 (clock skew at the receive site, or a
+   delayed message that was generated before the offline-marking).
+3. `newHeartbeat == existing.LastHeartbeatAt` (kept), but the short-circuit
+   condition fails because `existing.IsOnline == false`, so the CAS produces a
+   new record with `IsOnline = true` and the **stale** `LastHeartbeatAt`.
+4. On the very next `CheckForOfflineSites` tick (≤ `OfflineTimeout/2` later),
+   `now - LastHeartbeatAt` is still ≥ `OfflineTimeout`, so the site is
+   immediately marked offline again — the heartbeat brought it online for less
+   than the check cadence, producing a "flap" in the dashboard.
+
+In practice `receivedAt` is normally `_timeProvider.GetUtcNow()` at the
+`CentralCommunicationActor` receive site, so monotonically increasing — the bug
+is latent. But the contract `MarkHeartbeat(string siteId, DateTimeOffset receivedAt)`
+makes no guarantee about ordering, and an out-of-order delivery (Akka remoting
+ordering across connection re-establishment edge cases) or a small wall-clock
+correction at central would expose it.
+
+**Recommendation**
+
+When transitioning offline → online, use `now` (from the injected
+`TimeProvider`) rather than the caller-supplied `receivedAt` for
+`LastHeartbeatAt`, or take `max(receivedAt, _timeProvider.GetUtcNow())` so the
+recovery point is always recent. A unit test driving `MarkHeartbeat` with a
+`receivedAt` older than the last stored heartbeat on an offline site, then a
+`CheckForOfflineSites` immediately afterwards, would assert the site stays
+online.
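+
+As a rough illustration of the recommended transition rule, the sketch below
+isolates it as a pure function. `SiteState` and `HeartbeatSketch.Apply` are
+hypothetical simplifications of the aggregator's record and CAS body; `now` is
+assumed to come from the injected `TimeProvider`.
+
+```csharp
+using System;
+
+public sealed record SiteState(bool IsOnline, DateTimeOffset LastHeartbeatAt);
+
+public static class HeartbeatSketch
+{
+    // Pure transition function; the real code would run this inside its CAS loop.
+    public static SiteState Apply(SiteState existing, DateTimeOffset receivedAt, DateTimeOffset now)
+    {
+        var newest = receivedAt > existing.LastHeartbeatAt
+            ? receivedAt
+            : existing.LastHeartbeatAt;
+
+        if (existing.IsOnline)
+        {
+            // Online sites: never move the heartbeat backwards.
+            return newest == existing.LastHeartbeatAt
+                ? existing
+                : existing with { LastHeartbeatAt = newest };
+        }
+
+        // Offline -> online: anchor the recovery point at `now` when the
+        // heartbeat timestamp is older, so the next offline check cannot
+        // immediately flip the site back.
+        var recovered = newest > now ? newest : now;
+        return new SiteState(IsOnline: true, LastHeartbeatAt: recovered);
+    }
+}
+```
+
+Under this rule a late heartbeat on an offline site yields `LastHeartbeatAt == now`,
+so an immediate `CheckForOfflineSites` sees an elapsed time of zero.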
+
+### HealthMonitoring-021 — `CentralSiteId = "central"` reserved constant silently collides with a real site named "central"
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:22`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:224-226` |
+
+**Description**
+
+`CentralHealthAggregator.CheckForOfflineSites` looks up the per-site offline
+timeout with:
+
+```csharp
+var timeout = kvp.Key == CentralHealthReportLoop.CentralSiteId
+    ? _options.CentralOfflineTimeout
+    : _options.OfflineTimeout;
+```
+
+`CentralSiteId` is the literal string `"central"`. Site IDs are free-form
+strings set in configuration / the Sites repository; there is no validation
+that excludes the reserved `"central"` name. An operator who creates a real
+site with `SiteId = "central"` will hit two problems:
+
+- The real site's reports arriving via `ProcessReport` are stored in the same
+  dictionary slot as the central self-report (they share the keyspace), so the
+  two repeatedly displace each other through the sequence-number guard:
+  whichever has the higher Unix-ms seed wins, and the other is silently
+  rejected as stale. The dashboard alternates between two unrelated payloads.
+- The real site gets the longer `CentralOfflineTimeout` (default 3 minutes)
+  instead of the normal `OfflineTimeout` (60 s), so a genuinely failed real
+  site named "central" stays falsely online for an extra two minutes.
+
+**Recommendation**
+
+Two options:
+
+1. Reject the reserved name at the Site entity / configuration validation
+   layer (Configuration Database component, out of this module's scope) and
+   document `"central"` as reserved. This is the cleaner UX fix.
+2. As a defence-in-depth inside HealthMonitoring, store the central
+   self-report under a key that cannot collide — e.g. prefix it with a
+   character that is forbidden in real site IDs (`":central"` or `"#central"`)
+   — and adjust `CheckForOfflineSites` accordingly.
+
+Either fix should include a regression test creating a real `SiteHealthReport`
+with `SiteId = "central"` and asserting the central self-report's identity is
+preserved.
+
+### HealthMonitoring-022 — `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs:32-42` |
+
+**Description**
+
+`RunLoopBriefly` starts the hosted service with a 50 ms `PeriodicTimer` and
+then `await Task.Delay(runForMs, CancellationToken.None)` (with `runForMs`
+between 150 ms and 300 ms). `GeneratesCentralReports_WhenSelfIsPrimary` and
+`AssignsMonotonicSequenceNumbers` both assert "at least 2 reports were
+generated" within the window. On a heavily-contended CI runner where the
+hosted-service start-up plus a couple of `PeriodicTimer` ticks can blow past
+300 ms, these tests will fail intermittently.
+
+The rest of the suite (`CentralHealthAggregatorTests`, `SiteHealthCollectorTests`,
+and parts of `HealthReportSenderTests`) was deliberately refactored to use the
+injected `TimeProvider` precisely to avoid this. `CentralHealthReportLoop` and
+`HealthReportSender` already accept a `TimeProvider`, but the loop's
+`PeriodicTimer` still runs on real time because it is not driven by that
+injected `TimeProvider`.
+
+**Recommendation**
+
+Either (a) accept the timing-sensitivity and bump the delay budget
+generously, or (b) refactor the hosted-service loop to use a
+`TimeProvider.CreateTimer`-based tick mechanism so the test can advance a
+fake clock and assert deterministically how many ticks fire. Option (b) is
+the better long-term fix and matches the pattern used elsewhere in the
+module's tests.
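+
+One possible shape for option (b) is sketched below, assuming the project targets
+.NET 8 or later, where `PeriodicTimer` has a constructor overload that takes a
+`TimeProvider`. `TickLoopSketch` and `RunAsync` are hypothetical names. The sketch
+uses that overload rather than a raw `TimeProvider.CreateTimer` call; either route
+puts the tick under the test's clock.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+public static class TickLoopSketch
+{
+    // Runs `onTick` once per period until cancelled. The timer is driven by the
+    // supplied TimeProvider, so a test can pass a fake provider and advance it.
+    public static async Task RunAsync(
+        TimeSpan period, TimeProvider timeProvider, Action onTick, CancellationToken ct)
+    {
+        using var timer = new PeriodicTimer(period, timeProvider);
+        try
+        {
+            while (await timer.WaitForNextTickAsync(ct))
+            {
+                onTick();
+            }
+        }
+        catch (OperationCanceledException)
+        {
+            // Normal shutdown path.
+        }
+    }
+}
+```
+
+Production code would pass `TimeProvider.System`; a test could pass a fake such as
+`FakeTimeProvider` from the `Microsoft.Extensions.TimeProvider.Testing` package and
+advance it by a known number of periods instead of sleeping.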
+
+### HealthMonitoring-023 — `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs:117-122` |
+
+**Description**
+
+The test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` was originally named
+to codify the HealthMonitoring-001 bug ("`SetStoreAndForwardDepths` has no
+callers, so `StoreAndForwardBufferDepths` is always empty"). HealthMonitoring-001
+is `Resolved` — `HealthReportSender` now populates per-category depths from
+the S&F engine, and the same test class has `SetStoreAndForwardDepths_ReflectedInReport`
+covering the populated path. The "placeholder" test still passes because it
+constructs a fresh collector and never calls the setter, so its assertion
+(`Assert.Empty(report.StoreAndForwardBufferDepths)`) is now testing the
+**default empty state of an un-configured collector**. The HealthMonitoring-001
+resolution note explicitly chose to keep it as "the collector-level
+default-state test", but the test method name and the implied semantics no
+longer match.
+
+A maintainer reading the test name today will misread it as documentation that
+the metric is unimplemented (which it isn't), and may waste time investigating
+a non-bug.
+
+**Recommendation**
+
+Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
+(or similar) and make the test's stated intent match. This is purely a
+documentation and maintainability fix with no behaviour change.
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.Host` |
 | Design doc | `docs/requirements/Component-Host.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |

 ## Summary

@@ -48,6 +48,38 @@ Serilog sink setup is hard-coded in `Program.cs` rather than configuration-drive
 REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
 exception type including permanent schema-validation failures (Host-015).

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+All fifteen prior findings (Host-001..015) remain `Resolved` in the current tree
+and the fixes introduced for them (Host-001's predicate, the externalised
+secrets, the Site GrpcPort/RemotingPort/seed-port validation rules, the escaped
+HOCON builder with `DownIfAlone` and millisecond-precision durations, the
+configuration-driven Serilog sinks, the transient-only `StartupRetry`
+classifier) are all still in place. This re-review walked the ten checklist
+categories over the full module again and recorded seven new findings, none of
+them crash/data-loss class. Host-016 (Medium) mirrors the resolved Host-004
+shipped-config bug on the **Communication** side: `appsettings.Site.json`'s
+second `CentralContactPoints` entry points at the site's own remoting port
+(`localhost:8082`) instead of central, an incorrect dev example that is liable to
+be copied into multi-central deployments. Host-017 (Medium) flags a partial REQ-HOST-7
+implementation — the documented site-shutdown ordering (stop accepting streams
+first, cancel active streams via `IHostApplicationLifetime.ApplicationStopping`,
+then tear down actors) is not wired: the site path registers no
+`ApplicationStopping` handler that signals `SiteStreamGrpcServer`, and the gRPC
+server exposes no cancel-all-streams entry point. The remaining five are Low:
+`NodeOptions.NodeName` (the operator-configured value stamped on
+`AuditLog.SourceNode`) is absent from both shipped per-role configs even though
+the docker per-node configs set it (Host-018); the migration `StartupRetry`
+call passes `default` for `CancellationToken`, so a SIGTERM during the
+bounded-retry window is ignored for up to ~2 minutes (Host-019);
+`LoggerConfigurationFactory` layers `MinimumLevel.Is` over
+`ReadFrom.Configuration`, so any `Serilog:MinimumLevel` an operator sets is
+silently overridden by `ScadaLink:Logging:MinimumLevel` (Host-020); the
+shipped `appsettings.json` carries a Microsoft `Logging:LogLevel` block, but
+Serilog is the only logger provider, so the section is dead config (Host-021);
+and `ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g.
+a typo) and falls back to `Information` with no warning (Host-022).
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -63,6 +95,21 @@ exception type including permanent schema-validation failures (Host-015).
 | 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
 | 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | Re-review: `appsettings.Site.json` second `CentralContactPoints` entry targets the site's own remoting port instead of central (Host-016) — same defect class as the resolved Host-004 seed-list bug. |
+| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping, role-scoped site singletons, ClusterClient initial-contact wiring all reviewed; no new issues. |
+| 3 | Concurrency & thread safety | ☑ | `_trackedDisposables` is locked on both sides of the lifecycle; `_actorSystem` publication is safe via the IHost startup `await` boundary. New Low: `StartupRetry` migration call passes `default` `CancellationToken`, so SIGTERM during the retry window is ignored (Host-019). |
+| 4 | Error handling & resilience | ☑ | `IsTransientDatabaseFault` correctly classifies socket / timeout / SqlException; the retry helper itself remains sound. Host-019 is the resilience gap. |
+| 5 | Security | ☑ | Secrets stay externalised; the `_secrets` placeholder comment is intact. No new issues. |
+| 6 | Performance & resource management | ☑ | No new undisposed resources; gRPC stream lifetime cap remains correct. No new issues. |
+| 7 | Design-document adherence | ☑ | Re-review: REQ-HOST-7 site-shutdown ordering — stop accepting new streams, cancel active streams via `ApplicationStopping`, then tear down actors — is not wired in `Program.cs` (Host-017). |
+| 8 | Code organization & conventions | ☑ | Re-review: `NodeOptions.NodeName` is absent from the shipped per-role configs even though it stamps `AuditLog.SourceNode` (Host-018); the appsettings `Logging:LogLevel` Microsoft section is dead config under Serilog (Host-021). |
+| 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the unused `CancellationToken` parameter (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). |
+| 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaLink:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). |
+
 ## Findings

 ### Host-001 — `/health/ready` includes the leader-only `active-node` check
@@ -777,3 +824,278 @@ site now passes it. Regression tests in `StartupRetryTests`:
 when `isTransient` returns false) and `ExecuteWithRetry_TransientThenPermanent_StopsAtPermanent`
 (retries a `TimeoutException` then stops at a permanent `InvalidOperationException`).
 Full Host suite green (182 passed).
+
+### Host-016 — Site `CentralContactPoints` second entry targets the site's own remoting port
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Host/appsettings.Site.json:33-37` |
+
+**Description**
+
+The shipped site config sets `Node:RemotingPort = 8082` and lists
+`Communication:CentralContactPoints` as
+`["akka.tcp://scadalink@localhost:8081", "akka.tcp://scadalink@localhost:8082"]`.
+The second contact point — port `8082` — is the **site's own** remoting endpoint,
+not a central node. `SiteCommunicationActor` / `ClusterClient` uses these
+addresses as initial contacts when discovering the central
+`ClusterClientReceptionist`; a contact pointing at the site itself can never
+reach the central receptionist and will be a permanent failure in the
+initial-contact rotation. For the single-node dev loopback layout the first
+contact (`8081`, central) succeeds and the bug is masked, but this is exactly
+the kind of dev-config "example" that gets duplicated into multi-central
+deployments — the same failure mode the resolved Host-004 finding called out
+for the seed-node list. `StartupValidator` validates seed nodes against the
+gRPC port (Host-004) but does not cross-check `CentralContactPoints` against
+the site's own `RemotingPort`, so the misconfiguration passes silently.
+
+**Recommendation**
+
+Correct the shipped site example to list two central remoting endpoints (e.g.
+`localhost:8081` for `central-a` and a distinct port for `central-b` in a
+multi-node layout). Consider extending `StartupValidator` to reject any
+`Communication:CentralContactPoints` entry whose host+port matches this site
+node's `NodeHostname`+`RemotingPort`. Add a regression test in
+`StartupValidatorTests` mirroring `Site_SeedNodeOnGrpcPort_FailsValidation`.
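+
+A minimal sketch of the proposed cross-check, assuming contact points use the
+`akka.tcp://system@host:port` form shown above and that `System.Uri` is an
+acceptable parser for that form. `ContactPointCheck` and `FindSelfReferences`
+are hypothetical names, not existing `StartupValidator` members.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical helper; not part of the existing StartupValidator.
+public static class ContactPointCheck
+{
+    // Returns every contact point whose host and port equal this node's own remoting endpoint.
+    public static IReadOnlyList<string> FindSelfReferences(
+        IEnumerable<string> contactPoints, string nodeHostname, int remotingPort)
+    {
+        return contactPoints
+            .Where(cp => Uri.TryCreate(cp, UriKind.Absolute, out var uri)
+                && string.Equals(uri.Host, nodeHostname, StringComparison.OrdinalIgnoreCase)
+                && uri.Port == remotingPort)
+            .ToList();
+    }
+}
+```
+
+A validator could fail startup when the returned list is non-empty. The
+comparison is purely textual, so aliases such as `localhost` versus
+`127.0.0.1` would need separate handling.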
+
+**Resolution**
+
+_Open._
+
+### Host-017 — Site-shutdown ordering from REQ-HOST-7 is not wired
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Host/Program.cs:229-265`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs` |
+
+**Description**
+
+REQ-HOST-7 documents an explicit four-step shutdown sequence for site nodes:
+"(1) On `CoordinatedShutdown`, stop accepting new gRPC streams first.
+(2) Cancel all active gRPC streams (triggering client-side reconnect).
+(3) Tear down actors.
+(4) Use `IHostApplicationLifetime.ApplicationStopping` to signal the gRPC
+server." The site path in `Program.cs` (the `role == "Site"` branch) registers
+no `IHostApplicationLifetime.ApplicationStopping` callback, and
+`SiteStreamGrpcServer` exposes no "stop accepting" / "cancel all streams"
+entry point — it has `SetReady` but no corresponding `SetUnavailable` or
+`CancelAllStreams`. In practice, on `SIGTERM` Kestrel closes its listener
+naturally and `AkkaHostedService.StopAsync` runs Akka `CoordinatedShutdown`,
+but there is no explicit, ordered handoff that meets the documented contract:
+in-flight streams are not actively cancelled before actors begin tearing down,
+so clients see a stream that goes silent (and only times out via gRPC
+keepalive) rather than a clean `Cancelled` they can reconnect on. This is a
+contract-vs-code drift — either the design doc is overstating what is
+implemented, or the implementation is incomplete.
+
+**Recommendation**
+
+Add a `SiteStreamGrpcServer.CancelAllStreams()` method that flips a "shutting
+down" flag (so `SubscribeSite` immediately fails new streams with
+`StatusCode.Unavailable`) and cancels every entry's `Cts` in the `_streams`
+map. In `Program.cs` site branch, resolve `IHostApplicationLifetime` and
+register a callback on `ApplicationStopping` that calls `CancelAllStreams()`
+before the Akka hosted service runs `CoordinatedShutdown` (or order via
+`AkkaHostedService.StopAsync` itself — `IHostedService.StopAsync` runs in
+reverse-registration order, so the gRPC server's lifetime can be sequenced
+before Akka shutdown). Alternatively, reconcile REQ-HOST-7 with the actual
+implementation if the explicit ordering is no longer intended. Add an
+integration test under `tests/ScadaLink.Host.Tests` that starts a site host,
+opens a stream, triggers shutdown, and asserts the stream completes with
+`Cancelled` before the actor system tears down.
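+
+The first two shutdown steps could be sketched as follows. `StreamRegistry` is
+a hypothetical stand-in for the `_streams` bookkeeping inside
+`SiteStreamGrpcServer`; the real entry type and the gRPC status mapping are
+assumptions.
+
+```csharp
+using System;
+using System.Collections.Concurrent;
+using System.Threading;
+
+// Hypothetical stand-in for the stream registry inside SiteStreamGrpcServer.
+public sealed class StreamRegistry
+{
+    private readonly ConcurrentDictionary<string, CancellationTokenSource> _streams = new();
+    private volatile bool _shuttingDown;
+
+    // Returns false once shutdown has begun; the caller would map that to StatusCode.Unavailable.
+    public bool TryRegister(string streamId, out CancellationToken token)
+    {
+        token = default;
+        if (_shuttingDown) return false;
+
+        var cts = new CancellationTokenSource();
+        if (!_streams.TryAdd(streamId, cts)) { cts.Dispose(); return false; }
+
+        // Re-check: shutdown may have started between the flag read and the add.
+        if (_shuttingDown) cts.Cancel();
+        token = cts.Token;
+        return true;
+    }
+
+    // Step 1: refuse new streams. Step 2: cancel every active stream.
+    public int CancelAllStreams()
+    {
+        _shuttingDown = true;
+        var cancelled = 0;
+        foreach (var entry in _streams)
+        {
+            if (_streams.TryRemove(entry.Key, out var cts))
+            {
+                cts.Cancel();
+                cancelled++;
+            }
+        }
+        return cancelled;
+    }
+}
+```
+
+An `ApplicationStopping` callback would call `CancelAllStreams()` so that
+clients observe cancellation before actor teardown begins.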
+
+**Resolution**
+
+_Open._
+
+### Host-018 — Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.Host/appsettings.Central.json`, `src/ScadaLink.Host/appsettings.Site.json`, `src/ScadaLink.Host/NodeOptions.cs:10-16` |
+
+**Description**
+
+`NodeOptions.NodeName` is documented as "the operator-configured semantic node
+name used to stamp the SourceNode column on audit rows", with conventional
+values `node-a`/`node-b` for site nodes and `central-a`/`central-b` for
+central nodes. The CLAUDE.md "Centralized Audit Log" key-decision section
+calls this out: `SourceNode` is meant to be carried verbatim through audit
+telemetry and reconciliation, and is indexed via
+`IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc)`. The docker per-node
+configs (`docker/central-node-a/appsettings.Central.json`,
+`docker/site-a-node-a/appsettings.Site.json`, etc.) all set
+`ScadaLink:Node:NodeName`. The **shipped, default** per-role files in
+`src/ScadaLink.Host/` — the templates a developer running the binary
+directly will use — do not. `NodeIdentityProvider` normalises an empty
+`NodeName` to `null`, so dev audit rows carry a null `SourceNode` and the
+indexed lookup never narrows. The dev examples should match the docker
+examples; at minimum the field should appear in the shipped templates with a
+placeholder explaining the convention.
+
+**Recommendation**
+
+Add `"NodeName": "central-a"` (or a placeholder like `"${NODE_NAME}"`) to
+`appsettings.Central.json` and `"NodeName": "node-a"` to
+`appsettings.Site.json`, with a short comment that the value must be set
+per-node in multi-node deployments. Consider validating in `StartupValidator`
+that `NodeName` is non-empty, or accept the null and document explicitly that
+single-node dev deployments leave `SourceNode` null.
+
+**Resolution**
+
+_Open._
+
+### Host-019 — Migration `StartupRetry` call drops the host `CancellationToken`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.Host/Program.cs:154-165` |
+
+**Description**
+
+`StartupRetry.ExecuteWithRetryAsync` accepts an optional
+`CancellationToken cancellationToken = default` and observes it both at the
+top of each attempt and inside the `Task.Delay` between retries. The migration
+call site in `Program.cs` passes no token, so the helper runs with
+`CancellationToken.None`. With `maxAttempts: 8`, `initialDelay: 2s`, and the
+30s cap, a database that stays unreachable can keep the retry loop alive for
+~2 minutes before the host process responds to `SIGTERM` / `Ctrl+C` /
+Windows-Service stop. The host is not yet running at this point in the
+`Program.cs` startup pipeline (the `app` is built but not started), but
+`app.Lifetime.ApplicationStopping` is available as soon as
+`builder.Build()` returns. Threading it into the retry call honours the host
+lifecycle and matches the helper's documented contract.
+
+**Recommendation**
+
+Pass `app.Lifetime.ApplicationStopping` (or `CancellationToken.None`
+explicitly with a comment if intentional) into
+`StartupRetry.ExecuteWithRetryAsync`. Add a `StartupRetryTests` case
+exercising token-cancellation mid-backoff.
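+
+The "~2 minutes" figure can be reproduced with a short calculation, assuming
+the helper doubles the delay after each failed attempt up to the cap (the
+doubling policy is an assumption; only the initial delay, cap, and attempt
+count are stated above). `BackoffMath` is a hypothetical helper.
+
+```csharp
+using System;
+
+// Hypothetical helper: sums the waits between attempts, assuming the delay doubles up to the cap.
+public static class BackoffMath
+{
+    public static TimeSpan TotalBackoff(int maxAttempts, TimeSpan initialDelay, TimeSpan cap)
+    {
+        var total = TimeSpan.Zero;
+        var delay = initialDelay;
+        // There is one wait between each pair of consecutive attempts.
+        for (var attempt = 1; attempt < maxAttempts; attempt++)
+        {
+            total += delay < cap ? delay : cap;
+            delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, cap.Ticks));
+        }
+        return total;
+    }
+}
+```
+
+With 8 attempts, a 2s initial delay, and a 30s cap, the seven waits are
+2 + 4 + 8 + 16 + 30 + 30 + 30 = 120 seconds, excluding the time spent in each
+failed attempt.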
+
+**Resolution**
+
+_Open._
+
+### Host-020 — `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.Host/LoggerConfigurationFactory.cs:36-43` |
+
+**Description**
+
+`LoggerConfigurationFactory.Build` reads the `Serilog` configuration section
+via `ReadFrom.Configuration(configuration)` (which can include a
+`MinimumLevel` block — the standard Serilog way to set the floor) and **then**
+calls `.MinimumLevel.Is(minimumLevel)` derived from
+`ScadaLink:Logging:MinimumLevel`. Serilog's fluent builder applies the later
+call, so any `Serilog:MinimumLevel:Default` an operator sets is silently
+overridden by `ScadaLink:Logging:MinimumLevel` (or by its
+`Information` fallback when the ScadaLink key is absent). There are now two
+documented configuration paths for the same setting with non-obvious
+precedence, and the override direction is the opposite of what most Serilog
+users would expect (the more-specific `Serilog` section being the authority).
+The XML doc on `Build` says "the explicit `MinimumLevel.Is` pins the floor"
+but does not warn that the floor *overrides* the Serilog section's own
+`MinimumLevel`.
+
+**Recommendation**
+
+Pick one mechanism: either (a) drop the `MinimumLevel.Is` call and let
+`ReadFrom.Configuration` consume `Serilog:MinimumLevel`, migrating any docs/
+deployments that reference `ScadaLink:Logging:MinimumLevel`; or (b) keep the
+current "ScadaLink:Logging" path and reject `Serilog:MinimumLevel` if present
+(throw at startup so the operator sees the conflict). At minimum, expand the
+XML doc + REQ-HOST-8 to spell out the precedence explicitly.
+
+**Resolution**
+
+_Open._
+
+### Host-021 — Microsoft `Logging:LogLevel` section in `appsettings.json` is dead config under Serilog
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.Host/appsettings.json:2-6` |
+
+**Description**
+
+`appsettings.json` carries a Microsoft `Logging:LogLevel:Default = Information`
+block. The `Logging:LogLevel` map is consumed by
+`Microsoft.Extensions.Logging.ConfigurationConsoleLoggerOptions` and similar
+provider configurations bound from the standard `Logging` section. The Host
+calls `builder.Host.UseSerilog()`, which replaces the default
+`ILoggerFactory` setup with Serilog as the **only** logger provider; Serilog
+is configured through `ReadFrom.Configuration(...)`, which consumes the
+`Serilog` section, **not** `Logging:LogLevel`. The result is that an operator
+editing `Logging:LogLevel:Default` (a very natural thing to try, since it is
+the .NET convention) sees no behaviour change — the section is dead config.
+
+**Recommendation**
+
+Either remove the `Logging:LogLevel` block from `appsettings.json` (Serilog
+owns logging configuration in this Host), or replace it with a brief comment
+explaining it is intentionally retained for non-Serilog tooling. Document the
+authoritative location (`Serilog` + `ScadaLink:Logging`) in
+`Component-Host.md` REQ-HOST-8 if not already explicit.
+
+**Resolution**
+
+_Open._
+
+### Host-022 — `ParseLevel` silently coerces unrecognised `MinimumLevel` to `Information`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Host/LoggerConfigurationFactory.cs:50-55` |
+
+**Description**
+
+`LoggerConfigurationFactory.ParseLevel` uses
+`Enum.TryParse<LogEventLevel>(level, ignoreCase: true, out var parsed)` and
+returns `LogEventLevel.Information` when parsing fails — without logging the
+fallback. An operator who sets
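+`ScadaLink:Logging:MinimumLevel = "Informaiton"` (a common typo) or any other
+unrecognised name gets the default level silently;
+there is no warning, no log line, no startup error. (`Enum.TryParse` also
+accepts numeric strings and comma-separated lists such as `"Verbose,Debug"`,
+so those inputs parse successfully rather than falling back, which is equally
+unvalidated.) Combined with Host-020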
+(this is the only mechanism that pins the floor), a misspelt value is
+invisible until someone wonders why the level change "didn't take". The
+helper is small and could either fail-fast in `StartupValidator` or emit a
+console warning before the logger is configured.
+
+**Recommendation**
+
+In `LoggerConfigurationFactory.Build`, when `loggingOptions.MinimumLevel` is
+non-null/non-blank but does not parse to a valid `LogEventLevel`, write a
+`Console.Error.WriteLine` warning (the logger is not yet built) and proceed
+with `Information`. Alternatively, validate the value in `StartupValidator`
+and fail fast — that matches the pattern used for other ScadaLink
+configuration keys. Add a `LoggerConfigurationTests` case asserting the
+behaviour you choose.
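+
+A minimal sketch of the warn-and-fall-back option. The `LogEventLevel` enum
+below is a local stand-in for Serilog's type so the example compiles on its
+own, and `LevelParsing` is a hypothetical name; the real helper's signature
+may differ.
+
+```csharp
+using System;
+using System.IO;
+
+// Stand-in for Serilog.Events.LogEventLevel so the sketch compiles without Serilog.
+public enum LogEventLevel { Verbose, Debug, Information, Warning, Error, Fatal }
+
+public static class LevelParsing
+{
+    public static LogEventLevel ParseLevel(string? level, TextWriter warnings)
+    {
+        if (string.IsNullOrWhiteSpace(level)) return LogEventLevel.Information;
+
+        // Match member names only; Enum.TryParse would also accept numeric
+        // strings and comma-separated lists.
+        foreach (var name in Enum.GetNames(typeof(LogEventLevel)))
+        {
+            if (string.Equals(name, level.Trim(), StringComparison.OrdinalIgnoreCase))
+                return (LogEventLevel)Enum.Parse(typeof(LogEventLevel), name);
+        }
+
+        warnings.WriteLine(
+            $"Unrecognised ScadaLink:Logging:MinimumLevel '{level}'; falling back to Information.");
+        return LogEventLevel.Information;
+    }
+}
+```
+
+Passing `Console.Error` as `warnings` gives the pre-logger warning described
+above; a test can pass a `StringWriter` and assert on its contents.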
+
+**Resolution**
+
+_Open._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.InboundAPI` |
 | Design doc | `docs/requirements/Component-InboundAPI.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 8 |

 ## Summary

@@ -64,6 +64,66 @@ statement that the timeout covers routed calls (InboundAPI-016); and (4) `RouteH
 | 9 | Testing coverage | ☑ | Re-review: `RouteHelper`/`RouteTarget` (WP-4 routing) entirely untested (InboundAPI-017); validators/executor/filter well covered. |
 | 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). |

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+All 17 prior findings remain `Resolved`. The module has grown materially since the
+last pass — a new `AuditWriteMiddleware` (Audit Log #23 M4 Bundle D) now lives under
+`src/ScadaLink.InboundAPI/Middleware/`, the `ApiKeyValidator` was rewired to hash the
+candidate with `IApiKeyHasher` (ConfigurationDatabase-012), and an `IInstanceRouter`
+seam was introduced. This re-review re-walked all 10 checklist categories against
+`1eb6e97` and surfaced **8 new findings** concentrated on the new audit middleware
+and a stranded follow-up from InboundAPI-008:
+
+1. The InboundAPI-008 resolution explicitly deferred registering an `IActiveNodeGate`
+   implementation in `ScadaLink.Host` as a "follow-up outside this module's scope" —
+   that follow-up is still unfulfilled (no production registration anywhere in
+   `src/ScadaLink.Host/`), so the design-mandated standby-node gating is silently
+   disabled in production today (`InboundAPI-022`, High).
+2. `AuditWriteMiddleware` is wired in `Program.cs` against `/api/*` rather than the
+   specific `POST /api/{methodName}` route, so GETs against `/api/audit/query` and
+   `/api/audit/export` (audit query endpoints — themselves not script invocations)
+   now emit spurious `AuditChannel.ApiInbound`/`InboundRequest` rows back into the
+   audit log with `Target` set to the last path segment (`InboundAPI-025`, Medium).
+3. The middleware fires its audit write as `_ = _auditWriter.WriteAsync(evt)` — the
+   wrapping try/catch only catches synchronous throws, so a faulted async writer
+   task is unobserved and the row silently disappears with no log line
+   (`InboundAPI-018`, Medium).
+4. `ParentExecutionId` correlation flows only through `RouteToCallRequest` —
+   `RouteToGetAttributesRequest`/`RouteToSetAttributesRequest` have no
+   `ParentExecutionId` field, so attribute reads/writes from inbound scripts lose
+   the inbound→site execution-tree link the Audit Log decision in CLAUDE.md
+   describes (`InboundAPI-021`, Medium).
+5. `EndpointExtensions.HandleInboundApiRequest` — the entire wiring composition
+   that ties validator/executor/route/audit together — has no test coverage; only
+   the components it composes are tested (`InboundAPI-023`, Low).
+6. `EndpointExtensions.HandleInboundApiRequest` does
+   `ContentType?.Contains("json")` (case-sensitive) so a request with
+   `application/JSON` and no Content-Length silently skips JSON body parsing
+   (`InboundAPI-020`, Low).
+7. `AuditWriteMiddleware.InvokeAsync` calls `EnableBuffering()` unconditionally
+   before the empty-body short-circuit, allocating a `FileBufferingReadStream` for
+   every request including bodyless ones (`InboundAPI-019`, Low).
+
+Severity mix: 1 High, 3 Medium, 4 Low — no Critical. (The eighth finding —
+`InboundAPI-024`, Low — is a defensive watch-list item flagging that
+`_knownBadMethods` is unbounded; it is bounded *in practice* today by the
+configuration database, but the invariant is undocumented.)
+
+## Checklist coverage — 2026-05-28 (commit `1eb6e97`)
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | `ContentType?.Contains("json")` is case-sensitive (InboundAPI-020). |
+| 2 | Akka.NET conventions | ☑ | ASP.NET-hosted, no actors of its own; routes via `IInstanceRouter`/`CommunicationService`. No new issues. |
+| 3 | Concurrency & thread safety | ☑ | `ConcurrentDictionary` handler cache (post-001/002 fix). New audit middleware is per-request scoped, no shared mutable state. No new issues. |
+| 4 | Error handling & resilience | ☑ | Audit `WriteAsync` is fire-and-forget; async faults are unobserved (InboundAPI-018). |
+| 5 | Security | ☑ | `IActiveNodeGate` not registered in Host — standby-node gating disabled in production (InboundAPI-022). |
+| 6 | Performance & resource management | ☑ | `EnableBuffering()` unconditional on bodyless requests (InboundAPI-019); audit middleware wraps `Response.Body` and mints `ExecutionId` for non-script /api routes (InboundAPI-025). |
+| 7 | Design-document adherence | ☑ | `ParentExecutionId` not stamped on attribute-read/write routed messages (InboundAPI-021). InboundAPI-008's deferred Host registration still unfulfilled (InboundAPI-022). |
+| 8 | Code organization & conventions | ☑ | No new issues. |
+| 9 | Testing coverage | ☑ | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test (InboundAPI-023); middleware/filter/validator/executor/route are individually covered. |
+| 10 | Documentation & comments | ☑ | No new issues. |
+
 ## Findings

 ### InboundAPI-001 — Singleton script handler cache mutated without synchronization
@@ -844,3 +904,329 @@ now depends on `IInstanceLocator` + `IInstanceRouter` (both substitutable). Adde
 for each routed method, `GetAttribute` delegating to the batch `GetAttributes` and
 returning `null` for an absent key, `SetAttribute` delegating to `SetAttributes`, and
 the InboundAPI-016 deadline-token inheritance behaviour. All 15 pass.
+
+### InboundAPI-018 — `AuditWriteMiddleware` fires `WriteAsync` as `_ = task` — faulted async writes are unobserved
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/Middleware/AuditWriteMiddleware.cs:257` |
+
+**Description**
+
+`EmitInboundAudit` calls `_ = _auditWriter.WriteAsync(evt);` — the returned `Task` is
+discarded with the discard operator inside a synchronous `try` block. The wrapping
+`try/catch (Exception ex)` (lines 198–266) only catches a *synchronous* throw before
+the writer returns a task. Once `WriteAsync` returns a task, any exception that
+faults that task (e.g. a DB timeout in the central audit writer, a serialization
+failure, a cancellation that bubbles up) is never observed: it is not logged, it
+does not increment the `CentralAuditWriteFailures` health-monitoring counter the
+design doc references ("Fail-soft semantics" paragraph), and the audit row is
+silently lost. In .NET, an unobserved task exception is raised through
+`TaskScheduler.UnobservedTaskException` only when the faulted task is finalized
+by the garbage collector, and by default it does not terminate the process;
+either way, the middleware itself has no control over what (if anything) happens
+on a fault. The XML doc comment at line 255 claims "the writer itself swallows"
+but this is an implicit cross-component contract: the abstraction
+`ICentralAuditWriter.WriteAsync` returns `Task` and makes no such guarantee, and
+the only test that exercises a throwing writer (`AuditWriter_Throws_*` in
+`AuditWriteMiddlewareTests.cs`) uses an `OnWrite` callback that throws
+*synchronously*, not asynchronously — so the async fault path is not covered by
+tests either.
+
+This matters because Component-InboundAPI.md states that audit-emission failures
+must increment `CentralAuditWriteFailures` (Health Monitoring #11) — a counter
+that, with the current fire-and-forget, will under-count async-faulted writes.
+
+**Recommendation**
+
+Either (a) await the write and rely on the surrounding try/catch to log the
+failure, accepting an extra await on the request hot path; or (b) keep the
+fire-and-forget for latency but attach a `ContinueWith(t => ..., OnlyOnFaulted)`
+that logs the fault and increments the failure counter, so a faulted async write
+is at least observed. Option (b) preserves "audit emission never blocks the HTTP
+response" while restoring the visibility the design assumes. Add a regression
+test using a writer whose `WriteAsync` returns a faulted `Task` (not a
+synchronous throw) to pin the new contract.
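+
+One way to observe the fault without awaiting on the request path is sketched
+below. It uses an async wrapper rather than the `ContinueWith` continuation
+named in option (b); the effect on a faulted write is the same.
+`AuditWriteObserver` and the `onFault` callback are hypothetical, standing in
+for the logger call and failure-counter increment.
+
+```csharp
+using System;
+using System.Threading.Tasks;
+
+// Hypothetical helper; onFault stands in for logging plus the failure-counter increment.
+public static class AuditWriteObserver
+{
+    // Invokes the write and reports both synchronous throws and async faults
+    // to onFault. The returned task never faults, so it is safe to discard.
+    public static async Task ObserveAsync(Func<Task> write, Action<Exception> onFault)
+    {
+        try
+        {
+            await write().ConfigureAwait(false);
+        }
+        catch (Exception ex)
+        {
+            onFault(ex);
+        }
+    }
+}
+```
+
+The call site would become
+`_ = AuditWriteObserver.ObserveAsync(() => _auditWriter.WriteAsync(evt), ex => ...)`.
+Note that this wrapper also reports cancellation, which `OnlyOnFaulted` would
+not.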
+
+### InboundAPI-019 — `EnableBuffering()` called unconditionally on every request, including bodyless requests
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/Middleware/AuditWriteMiddleware.cs:141` |
+
+**Description**
+
+`InvokeAsync` always calls `ctx.Request.EnableBuffering()` before the empty-body
+short-circuit at `ReadBufferedRequestBodyAsync` line 289 (`if (request.ContentLength
+is 0) return (null, false);`). `EnableBuffering()` swaps the request stream for a
+`FileBufferingReadStream` whose construction allocates an internal buffer (default
+threshold ~30 KB before spilling to a temp file) regardless of whether the request
+actually has a body. The /api scope this middleware lives under will see at least
+some bodyless requests (e.g. GET `/api/audit/query` once that route is in the same
+branch — see InboundAPI-025; future health checks; misbehaving clients) and each
+one pays the buffering allocation cost for no benefit.
+
+**Recommendation**
+
+Defer the `EnableBuffering()` call into `ReadBufferedRequestBodyAsync` after the
+`ContentLength is 0` check, or short-circuit in `InvokeAsync` before enabling
+buffering when `ContentLength is 0` and `Method is "GET" or "HEAD" or "DELETE"`.
+The win is a per-request `FileBufferingReadStream` allocation avoided on every
+bodyless request through the middleware.
+
+### InboundAPI-020 — `ContentType.Contains("json")` is case-sensitive; `application/JSON` with no Content-Length skips body parsing
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/EndpointExtensions.cs:70` |
+
+**Description**
+
+`HandleInboundApiRequest` parses the JSON body only when
+`httpContext.Request.ContentLength > 0 || httpContext.Request.ContentType?.Contains("json") == true`.
+The `string.Contains(string)` overload used here is case-sensitive — a perfectly
+valid HTTP header `Content-Type: application/JSON` (uppercase) would yield
+`false` (`"application/JSON".Contains("json")` is `false`). With no
+Content-Length (e.g. chunked transfer-encoding) and an uppercase content type,
+the handler then leaves `body = null` and `ParameterValidator.Validate` runs
+against a missing body — so a method that declares any required parameter is
+rejected with "Missing required parameters" even though the caller did send a
+well-formed JSON body. RFC 7231 §3.1.1.1 states that the type and subtype tokens
+of a media type are case-insensitive, so `application/JSON` is a valid spelling;
+in practice clients do sometimes uppercase media-type
+tokens, and the framework's own `MediaTypeHeaderValue` is case-insensitive.
+
+**Recommendation**
+
+Use the case-insensitive overload,
+`httpContext.Request.ContentType?.Contains("json", StringComparison.OrdinalIgnoreCase) == true`,
+or rely on the framework's own media-type handling via
+`MediaTypeHeaderValue.TryParse` or `HttpRequest.HasJsonContentType()`. Add a
+regression test posting with `application/JSON` and `Transfer-Encoding: chunked`.
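+
+The difference between the two overloads can be shown in isolation.
+`ContentTypeCheck.LooksLikeJson` is a hypothetical helper that mirrors only
+the content-type half of the handler's condition.
+
+```csharp
+using System;
+
+public static class ContentTypeCheck
+{
+    // Hypothetical helper mirroring the body-parse condition with a case-insensitive match.
+    public static bool LooksLikeJson(string? contentType) =>
+        contentType?.Contains("json", StringComparison.OrdinalIgnoreCase) == true;
+}
+```
+
+With this helper, `application/JSON` and `application/json` are both accepted,
+whereas `"application/JSON".Contains("json")` is `false`.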
+
+### InboundAPI-021 — `ParentExecutionId` correlation flows only through `Call`; attribute reads/writes lose the inbound→site execution-tree link
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:141-143`, `:182-183`, `:225-226`; `src/ScadaLink.Commons/Messages/InboundApi/RouteToInstanceRequest.cs:15-21`, `:36-40`, `:55-59` |
+
+**Description**
+
+CLAUDE.md's Centralized Audit Log section describes `ParentExecutionId` as the
+cross-execution spawn pointer that "every row of a spawned run carries" and
+specifically calls out "the inbound API → routed-site-script case". The current
+implementation honours this only on `RouteToCallRequest` — which carries
+`ParentExecutionId` as its trailing additive field (line 21 of
+`RouteToInstanceRequest.cs`) and is stamped by `RouteTarget.Call` with the
+inbound request's execution id at line 143 of `RouteHelper.cs`.
+
+`RouteToGetAttributesRequest` and `RouteToSetAttributesRequest`, however, have
+**no `ParentExecutionId` field** and the matching `RouteTarget.GetAttributes` /
+`SetAttributes` methods (`RouteHelper.cs:182-183`, `:225-226`) never reference
+`_parentExecutionId`. So when an inbound API script reads or writes a site
+attribute via `Route.To("inst").GetAttribute(...)` /
+`Route.To("inst").SetAttribute(...)`, the site-side audit row for that
+trust-boundary action (an outbound-by-the-script DB / OPC write at the site) is
+emitted with `ParentExecutionId = null` and the execution-tree walk
+`IX_AuditLog_ParentExecution` cannot link it back to the spawning inbound
+request. The two-row pair (inbound + spawned site work) reverts to the
+"top-level / null" state the design says is the *fallback* for non-spawned runs.
+The asymmetry between `Call` and `GetAttributes`/`SetAttributes` is also surprising
+— a script author would reasonably expect the same correlation across all
+`Route.To(...)` calls.
+
+**Recommendation**
+
+Add a trailing `Guid? ParentExecutionId = null` field to
+`RouteToGetAttributesRequest` and `RouteToSetAttributesRequest` (additive
+trailing member, matches the message-evolution rule in CLAUDE.md); stamp it
+from `_parentExecutionId` in `RouteTarget.GetAttributes` and
+`RouteTarget.SetAttributes`; have the site-side handlers thread the field onto
+their emitted audit rows. Add a `RouteHelperTests` regression case asserting
+that an attribute read/write carries the inherited `ParentExecutionId`.
+
+### InboundAPI-022 — `IActiveNodeGate` has no production registration in Host — standby-node gating is silently disabled in production
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/IActiveNodeGate.cs`, `src/ScadaLink.InboundAPI/InboundApiEndpointFilter.cs:52-60`; absent from `src/ScadaLink.Host/Program.cs` |
+
+**Description**
+
+InboundAPI-008's resolution adds `IActiveNodeGate` (lines 17–24 of
+`IActiveNodeGate.cs`) so a standby central node can refuse to serve the inbound
+API. `InboundApiEndpointFilter.InvokeAsync` consults the gate at line 52
+(`var gate = httpContext.RequestServices.GetService<IActiveNodeGate>();`), and
+when `gate is { IsActiveNode: false }` returns HTTP 503. The filter's behaviour
+when **no implementation is registered** (line 51 comment) is to fall through and
+serve the request — the resolution paragraph for InboundAPI-008 closes with:
+
+> "Follow-up (outside this module's scope): `ScadaLink.Host` should register an
+> `IActiveNodeGate` implementation backed by `ActiveNodeHealthCheck` /
+> `Cluster.State.Leader` in the central-role branch of `Program.cs` so the gate is
+> actually enforced in production; until then the endpoint defaults to "allow"."
+
+A grep of the entire `src/ScadaLink.Host/` tree at `1eb6e97` finds **zero**
+`IActiveNodeGate` registrations: `grep -rn "IActiveNodeGate\|AddSingleton.*ActiveNode"
+src/ScadaLink.Host/` returns no matches. The follow-up was never carried out. So
+in production today the standby central node still serves the inbound API exactly
+as InboundAPI-008 described — executes method scripts, runs `Route.To()` calls,
+races the active node, and may operate against stale singleton state. The new
+infrastructure (interface + filter check) is present but unwired; from the user's
+perspective the original High-severity issue is unresolved in deployed binaries.
+
+The design says the inbound API is "Central cluster only (active node)" and
+"fails over with it" — this guarantee is not currently enforced in production.
+
+**Recommendation**
+
+Register an `IActiveNodeGate` implementation in the central-role branch of
+`ScadaLink.Host/Program.cs`. The natural backing is the existing
+`ActiveNodeHealthCheck` (already wired for `/health/active`) or a direct read of
+`Cluster.Get(actorSystem).State.Leader == Cluster.Get(actorSystem).SelfAddress`.
+Add an integration test in the Host that spins up the central role and asserts
+that the gate is resolvable and returns `IsActiveNode` consistent with cluster
+leader state. Until that wiring lands, this finding is the user-facing
+realisation of the InboundAPI-008 vulnerability.
+
+### InboundAPI-023 — `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test coverage
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/EndpointExtensions.cs:31-140`, `tests/ScadaLink.InboundAPI.Tests/` |
+
+**Description**
+
+The endpoint handler `HandleInboundApiRequest` is the wiring composition that
+ties the validator → JSON parse → `ParameterValidator` → `InboundScriptExecutor` →
+result-serialization path together; it is the single piece of code that maps
+validator status codes to HTTP responses, threads the `parentExecutionId` from
+`HttpContext.Items` into the executor, stashes the resolved API key name as
+`AuditActorItemKey`, and emits the request-aborted short-circuit. The test
+project covers each composed component (`ApiKeyValidatorTests`,
+`ParameterValidatorTests`, `InboundScriptExecutorTests`, `RouteHelperTests`,
+`InboundApiEndpointFilter`, `AuditWriteMiddlewareTests`,
+`MiddlewareOrderTests`) but no test exercises `HandleInboundApiRequest` itself —
+so regressions in the wiring (e.g. forgetting to stash the actor name on
+`HttpContext.Items`, the `Contains("json")` case sensitivity from
+InboundAPI-020, or accidentally swapping `validationResult.StatusCode` for a
+literal) are not caught.
+
+**Recommendation**
+
+Add an `EndpointExtensionsTests` suite using `TestServer` (the same pattern
+`MiddlewareOrderTests` uses) covering: the happy path (200 + body), invalid
+JSON (400), validator 401, validator 403, parameter-validation failure (400),
+script-failure 500, client-aborted short-circuit (`Results.Empty`), and the
+actor-stash invariant (`HttpContext.Items[AuditActorItemKey]` is set with the
+resolved key name after successful auth, but is absent on auth failures).
+
+### InboundAPI-024 — `_knownBadMethods` is unbounded — an attacker can grow the cache by spamming distinct method names against the audit middleware path
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:30`, `:77`, `:223`, `:233` |
+
+**Description**
+
+The InboundAPI-009 fix introduced `_knownBadMethods`, a `ConcurrentDictionary<string, byte>`
+of method names whose Roslyn compilation failed, to short-circuit lazy
+recompilation. It is keyed by `method.Name` and entries are only ever removed
+when `CompileAndRegister` succeeds for the same name (line 83). Practically the
+key space is bounded by the configured method definitions in the database, so
+this is bounded in normal operation. But because the cache is mutated from the
+lazy-compile path at `InboundScriptExecutor.cs:233`, and `ExecuteAsync` is called from
+`HandleInboundApiRequest` only **after** `ApiKeyValidator.ValidateAsync` has
+returned `Valid` (i.e. a real method exists), the entry is keyed by a name that
+must have already been resolved through `GetMethodByNameAsync` — so this attack
+surface is gated by the configuration database. The finding is therefore mostly
+defensive: there is no rate limit on inbound API calls (deliberate design), so
+if a future change ever causes `ExecuteAsync` to be called for an unvalidated
+caller-supplied method name (e.g. a refactor that moves method-existence
+checking later), this cache would become attacker-controllable.
+
+**Recommendation**
+
+Optional / defensive: cap `_knownBadMethods` (e.g. an LRU with a fixed size, or
+clear it periodically). At minimum, document the invariant in the executor's
+XML comment that `_knownBadMethods` keys must come from validated
+`ApiMethod.Name` values, so the safety property survives future refactors. No
+immediate change required; this is a watch-list item.
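+
+If a cap is wanted, one simple form is sketched below. `BoundedNameSet` is a
+hypothetical type, and the clear-when-full policy is an assumption chosen for
+brevity rather than a true LRU.
+
+```csharp
+using System.Collections.Concurrent;
+
+// Hypothetical bounded replacement for _knownBadMethods; clear-when-full is an assumed policy.
+public sealed class BoundedNameSet
+{
+    private readonly ConcurrentDictionary<string, byte> _names = new();
+    private readonly int _capacity;
+
+    public BoundedNameSet(int capacity) => _capacity = capacity;
+
+    public int Count => _names.Count;
+    public bool Contains(string name) => _names.ContainsKey(name);
+    public bool Remove(string name) => _names.TryRemove(name, out _);
+
+    // The count is checked before the add, so concurrent adders can briefly
+    // overshoot the capacity; the next add of a new name clears the set.
+    public void Add(string name)
+    {
+        if (_names.Count >= _capacity && !_names.ContainsKey(name)) _names.Clear();
+        _names.TryAdd(name, 0);
+    }
+}
+```
+
+Clearing costs at most one extra failed compilation per evicted name, which
+is acceptable for a cache whose only purpose is to skip repeated recompiles.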
+
+### InboundAPI-025 — `AuditWriteMiddleware` runs against the entire `/api/*` branch — emits spurious `ApiInbound` audit rows for `/api/audit/query` and `/api/audit/export`
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Host/Program.cs:183-185`; consumers: `src/ScadaLink.ManagementService/AuditEndpoints.cs:93-94`; emitter: `src/ScadaLink.InboundAPI/Middleware/AuditWriteMiddleware.cs:175-252` |
+
+**Description**
+
+`Program.cs` wires the audit middleware as
+`app.UseWhen(ctx => ctx.Request.Path.StartsWithSegments("/api"), branch => branch.UseAuditWriteMiddleware())`
+— scoped to the `/api` *prefix*, not to the `POST /api/{methodName}` route.
+Meanwhile, `ScadaLink.ManagementService/AuditEndpoints.cs` maps
+`MapGet("/api/audit/query", ...)` (line 93) and `MapGet("/api/audit/export", ...)`
+(line 94). Both routes therefore inherit `AuditWriteMiddleware`, which emits an
+`AuditEvent { Channel = AuditChannel.ApiInbound, Kind = AuditKind.InboundRequest, ... }`
+row for every call. The middleware's `ResolveMethodName` falls back to the last
+path segment (lines 446–452), so a GET `/api/audit/query?...` is recorded as if a
+caller had invoked an inbound API method named "query"; an export is recorded
+as method "export". Effects:
+
+1. **Audit log is polluted with non-script rows.** The audit log is now
+   recording its own query traffic as if it were inbound script invocations,
+   contradicting Component-AuditLog.md's scope ("script trust boundary actions").
+2. **Audit reads recursively emit audit writes.** Every audit-log query (e.g.
+   from the Central UI Audit Log page or the CLI `audit query` command) writes
+   an additional row into `AuditLog`, growing the table on read.
+3. **`Target` is meaningless.** `/api/audit/query` has no method definition, so
+   the recorded `Target = "query"` is not joinable to any `ApiMethod` row in
+   audit-log drill-ins.
+4. **Wasted resources on health probes / management calls.** Any future routes
+   added under `/api/` will inherit the middleware and pay the
+   `EnableBuffering`, `CapturedResponseStream`, and `JsonSerializer.Serialize`
+   costs even though they are not inbound script invocations.
+
+Tests for the audit middleware (`AuditWriteMiddlewareTests`) and pipeline order
+(`MiddlewareOrderTests`) wire the middleware only against the
+`POST /api/{methodName}` route in test hosts, so this production-only
+mis-scoping is not exercised.
+
+**Recommendation**
+
+Tighten the predicate so the middleware runs only on the inbound API method
+route, not on the `/api/` prefix. Options:
+
+- `app.UseWhen(ctx => ctx.Request.Path.StartsWithSegments("/api") && !ctx.Request.Path.StartsWithSegments("/api/audit") && !ctx.Request.Path.StartsWithSegments("/api/management"), ...)`
+  — defensive, but fragile to future route additions.
+- Move the audit emission from a pipeline middleware to an `IEndpointFilter`
+  applied via `.AddEndpointFilter<>()` on the `MapInboundAPI` registration
+  (alongside `InboundApiEndpointFilter`). This makes the scope explicit on the
+  one route that needs it and survives future `/api/...` route additions
+  unchanged.
+
+The endpoint-filter form is the recommended fix — it co-locates the audit-emission
+scope with the route definition and matches how InboundAPI-006/008 gating is
+already wired.
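+
+A minimal sketch of the endpoint-filter form, assuming ASP.NET Core 7 or later
+with the Web SDK's implicit usings. `AuditWriteEndpointFilter` and the inline
+handlers are hypothetical stand-ins, not the project's actual types, and the
+`MapGet` route only illustrates an endpoint that stays outside the audit scope:
+
+```csharp
+var builder = WebApplication.CreateBuilder(args);
+var app = builder.Build();
+
+// The filter is attached to the inbound method route only.
+app.MapPost("/api/{methodName}", (string methodName) => Results.Ok(methodName))
+   .AddEndpointFilter<AuditWriteEndpointFilter>();
+
+// No filter here, so audit reads emit no ApiInbound rows.
+app.MapGet("/api/audit/query", () => Results.Ok());
+
+app.Run();
+
+// Hypothetical filter: wraps the endpoint it is attached to and nothing else.
+public sealed class AuditWriteEndpointFilter : IEndpointFilter
+{
+    public async ValueTask<object?> InvokeAsync(
+        EndpointFilterInvocationContext context,
+        EndpointFilterDelegate next)
+    {
+        // The route value exists because the filter only runs for the
+        // route that declares {methodName}.
+        var methodName = context.HttpContext.GetRouteValue("methodName") as string;
+        try
+        {
+            return await next(context);
+        }
+        finally
+        {
+            // Placeholder for the real audit emission; it must never throw.
+            Console.WriteLine($"audit: ApiInbound target={methodName}");
+        }
+    }
+}
+```
+
+One caveat with this form: an endpoint filter sees the handler's return value
+rather than the raw response stream, so response capture along the lines of
+`CapturedResponseStream` would need to be re-expressed against that value.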
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.ManagementService` |
 | Design doc | `docs/requirements/Component-ManagementService.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 (1 Deferred — see ManagementService-012) |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 6 (1 Deferred — see ManagementService-012) |

 ## Summary

@@ -46,6 +46,32 @@ that can leave an instance partially modified after an error (015, Medium), raw
 messages from unexpected faults being returned verbatim to HTTP callers (016, Low), and
 `QueryDeploymentsCommand` having no test coverage at all (017, Low).

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+All seventeen prior findings remain correctly closed; ManagementService-012 is still the
+only Deferred entry (marker-interface on `ManagementEnvelope.Command` still belongs in the
+Commons module). The module has grown substantially since the last review (`+1997 lines`):
+the Transport (#24) bundle commands (`ExportBundle`/`PreviewBundle`/`ImportBundle`) have
+been added to `ManagementActor`, and a new `AuditEndpoints.cs` (`/api/audit/query` and
+`/api/audit/export`) ships alongside the existing `/management` endpoint. This re-review
+re-ran the full 10-category checklist and surfaced **six new findings**. The dominant
+theme is the same authorization gap that findings 001/002/003/014 closed for the
+ManagementActor, now resurfacing in the new surfaces:
+**QueryAuditLogCommand has no role gate at all** (018, High) — any authenticated user can
+read the configuration audit log via `/management`, even though the parallel
+`/api/audit/query` requires `OperationalAuditRoles`. The new `/api/audit/{query,export}`
+endpoints build an `AuthenticatedUser` with `PermittedSiteIds` but never enforce site scope
+(019, Medium) — although audit roles are not site-scoped by design, the user-supplied
+`sourceSiteId` filter is honoured verbatim. `HandleUpdateSmtpConfig` returns the full
+SmtpConfiguration entity (including the `Credentials` field, which can carry SMTP passwords
+/ OAuth2 client secrets) in the response and audit row (020, Medium). The Transport (#24)
+bundle commands have zero test coverage in `ManagementActorTests` (021, Medium) — neither
+role gating nor success/error paths. The `Component-ManagementService.md` design doc is
+stale on three fronts: it does not mention Transport bundle commands, the `/api/audit/*`
+endpoints, or the now-wired `CommandTimeout` option (022, Low). Finally,
+`HandleQueryDeployments` issues one `GetInstanceByIdAsync` per unique instance ID when
+filtering for a site-scoped user — an N+1 read pattern on the unfiltered branch (023, Low).
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -61,6 +87,21 @@ messages from unexpected faults being returned verbatim to HTTP callers (016, Lo
 | 9 | Testing coverage | + | Authorization is well covered; site-scope enforcement, the HTTP endpoint, `DebugStreamHub`, and remote-query handlers have no tests. See 013. |
 | 10 | Documentation & comments | + | XML docs are accurate where present; `ManagementServiceOptions` and `ResolveRolesCommand` paths are undocumented dead code (010, 011). |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | + | `HandleImportBundle` correctly dedupes resolutions per (entity,name); `ParseDocument` still allocates a `JsonDocument.Parse("{}")` on the failure path but the caller's `using` disposes it. No new defects. |
+| 2 | Akka.NET conventions | + | PipeTo dispatch from 004 is intact; supervision strategy from 005 is intact; `Sender` correctly captured to local before PipeTo. No new findings. |
+| 3 | Concurrency & thread safety | + | Bundle handlers `await` cleanly; `BundleSession` is not cleaned up if `PreviewAsync`/`ApplyAsync` throws, but that is an `IBundleImporter` contract concern outside this module. No new findings. |
+| 4 | Error handling & resilience | + | `ManagementCommandException` from 016 is applied consistently across the new bundle handlers (curated `CryptographicException`/`ArgumentException` paths). No new findings. |
+| 5 | Security | + | `QueryAuditLogCommand` has no role gate (018, High). New `/api/audit/*` endpoints build `PermittedSiteIds` but never enforce them (019, Medium). `HandleUpdateSmtpConfig` returns + audits `Credentials` verbatim (020, Medium). |
+| 6 | Performance & resource management | + | `HandleQueryDeployments` unfiltered-with-scope branch is N+1 on instance lookups (023, Low). Request body up to 200 MB read into a single `string` in `HandleRequest` (acceptable per Transport bundle requirement). |
+| 7 | Design-document adherence | + | `Component-ManagementService.md` is stale on Transport bundle commands, `/api/audit/*` endpoints, and the now-wired `CommandTimeout` (022, Low). |
+| 8 | Code organization & conventions | + | `AuditEndpoints` duplicates the Basic Auth → LDAP → roles flow from `ManagementEndpoints` (~50 lines). Acknowledged in `AuditEndpoints` XML but worth tracking. No new finding raised. |
+| 9 | Testing coverage | + | Transport bundle commands have zero `ManagementActorTests` coverage — neither role gating nor handler logic (021, Medium). |
+| 10 | Documentation & comments | + | New `AuditEndpoints` XML doc is high quality. `Component-ManagementService.md` not updated for Transport/Audit endpoints (022 covers). |
+
 ## Findings

 ### ManagementService-001 — Remote-query and debug-snapshot handlers bypass site-scope enforcement
@@ -748,3 +789,294 @@ Resolved 2026-05-17 (commit pending). Added seven `QueryDeployments_*` tests to
 Deployment user and an Admin user, in- and out-of-scope
 (`_FilteredByOutOfScopeInstance_ReturnsUnauthorized`, `_FilteredByInScopeInstance_ReturnsRecords`,
 `_UnfilteredForSiteScopedUser_DropsOutOfScopeRecords`, `_UnfilteredForAdminUser_ReturnsAllRecords`).
+
+### ManagementService-018 — QueryAuditLogCommand has no role gate
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:153`–`:207`, `:336`, `:1302` |
+
+**Description**
+
+`QueryAuditLogCommand` is dispatched at line 336 to `HandleQueryAuditLog`, which calls
+`ICentralUiRepository.GetAuditLogEntriesAsync(...)` with no role check, no site-scope
+check, and no actor filter. `GetRequiredRole` (lines 153–207) does not list
+`QueryAuditLogCommand`, so it falls through to the `_ => null` case — i.e. "read-only
+queries — any authenticated user". The parallel `/api/audit/query` endpoint in
+`AuditEndpoints.HandleQuery` correctly enforces `AuthorizationPolicies.OperationalAuditRoles`
+(`{ "Admin", "Audit", "AuditReadOnly" }`), so a CLI authenticated as a user with only the
+`Deployment` role — or no roles at all — is rejected at `/api/audit/query` but can read
+the *same* audit log table through `/management` by sending `QueryAuditLogCommand`. The
+two surfaces enforce different permissions on the same data; the older
+ManagementActor-routed path is the looser one. The audit log records every
+script-trust-boundary action and is operationally sensitive; it should not be readable
+by a default authenticated user.
+
+This is the same authorization-bypass class as findings 001/002/014 and was missed in
+that sweep because `QueryAuditLogCommand` (legacy `Action`/`EntityType` filter) is a
+separate command from the new keyset-paged `IAuditLogRepository.QueryAsync` path the
+`/api/audit/query` endpoint uses.
+
+**Recommendation**
+
+Add `QueryAuditLogCommand` to `GetRequiredRole`. The natural fit is a new
+`"OperationalAudit"`-style role group — but `GetRequiredRole` returns a single string and
+the project's existing role gates do too (`Admin`/`Design`/`Deployment`). Two equally
+defensible options:
+
+1. Add `QueryAuditLogCommand` to the `Admin`-required group. This is strict, and
+   consistent with `AuditExportRoles` including `Admin`. The CLI-017/018 audit work uses
+   `/api/audit/query`, so `QueryAuditLogCommand` may be effectively orphaned anyway.
+2. Extend `GetRequiredRole` to return a role *set* and add an `AuditRoles` group equal to
+   `AuthorizationPolicies.OperationalAuditRoles`, so the two surfaces converge.
+
+Recommended: option 1 plus a deprecation comment on `QueryAuditLogCommand` pointing at
+`/api/audit/query` — the legacy command's filter shape is a subset of the new endpoint's,
+so the ManagementActor route is redundant. Add a regression test asserting that a
+no-role / `Deployment`-only caller gets `ManagementUnauthorized` for `QueryAuditLogCommand`.
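+
+For comparison, option 2 could look roughly like the sketch below. The command
+records, the `RoleGate` class, and the case-insensitive comparison are
+assumptions made for illustration; only the role names are taken from the
+finding:
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Stand-in command types; the real ones live in the Commons module.
+public sealed record QueryAuditLogCommand;
+public sealed record ExportBundleCommand;
+
+public static class RoleGate
+{
+    // Mirrors the role names quoted for AuthorizationPolicies.OperationalAuditRoles.
+    private static readonly string[] OperationalAuditRoles = { "Admin", "Audit", "AuditReadOnly" };
+    private static readonly string[] DesignRoles = { "Design" };
+
+    // Returns the set of acceptable roles, or null for "any authenticated user".
+    public static IReadOnlyCollection<string>? GetRequiredRoles(object command) => command switch
+    {
+        QueryAuditLogCommand => OperationalAuditRoles,
+        ExportBundleCommand => DesignRoles,
+        _ => null,
+    };
+
+    // A caller is authorized when it holds at least one of the required roles.
+    public static bool IsAuthorized(object command, IEnumerable<string> callerRoles)
+    {
+        var required = GetRequiredRoles(command);
+        return required is null
+            || callerRoles.Any(r => required.Contains(r, StringComparer.OrdinalIgnoreCase));
+    }
+}
+```
+
+Under this sketch, `RoleGate.IsAuthorized(new QueryAuditLogCommand(), new[] { "Deployment" })`
+evaluates to `false`, which is the behaviour the regression test should pin down.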
+
+### ManagementService-019 — AuditEndpoints builds PermittedSiteIds but never enforces them
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.ManagementService/AuditEndpoints.cs:358`–`:368`, `:397`–`:437` |
+
+**Description**
+
+`AuditEndpoints.AuthenticateAsync` resolves the caller's roles AND `PermittedSiteIds` and
+wraps them in an `AuthenticatedUser` (lines 358–366), but the returned `AuthenticatedUser`
+is then only used for the `HasAnyRole(...)` role check on lines 114 and 163 — its
+`PermittedSiteIds` are never read. `ParseFilter` (line 397) accepts the caller-supplied
+`sourceSiteId=...` query string verbatim and passes it straight into the
+`IAuditLogRepository.QueryAsync` filter. A user whose `Audit` (or `AuditReadOnly`) role
+mapping carries scope rules — e.g. `AuditReadOnly` scoped to "plant-a" — can still ask
+for `sourceSiteId=plant-b` and get back rows for plant-b.
+
+Today this gap is partially benign because the design treats `Audit`/`AuditReadOnly` as
+non-site-scoped roles (`Component-AuditLog.md` does not list site scoping for the audit
+permissions, and the LDAP role mapping UI does not currently surface site scope rules
+for those roles). But (a) the `RoleMapper` will silently honour scope rules attached to
+any role, including `Audit`, so an operator who *does* configure them gets a UI that
+says "scoped" and an endpoint that ignores the scope — a contract violation; (b) the
+`Admin` role's `PermittedSiteIds` are always empty (system-wide), so enforcing for the
+other roles is cheap. The asymmetry with the `/management` endpoint — which routes every
+site-targeted command through `EnforceSiteScope` — is also a maintenance hazard.
+
+**Recommendation**
+
+Decide explicitly whether the audit endpoints honour site scope. Two options:
+
+1. **Honour scope** — in `HandleQuery` / `HandleExport`, after the role check, intersect
+   the caller-supplied `filter.SourceSiteIds` with `user.PermittedSiteIds`. If the
+   caller supplied no `sourceSiteId` and `PermittedSiteIds` is non-empty, restrict to
+   `PermittedSiteIds`. If the intersection is empty, return an empty page (or a 403 if
+   the caller explicitly asked for an out-of-scope site).
+2. **Document the intentional bypass** — drop the `PermittedSiteIds` field from the
+   `AuthenticatedUser` constructed in `AuthenticateAsync` (or comment it as "ignored —
+   audit roles are not site-scoped") so the code stops carrying a value it does not
+   read, and add an XML doc note on the endpoint class that audit roles are always
+   system-wide by design.
+
+Recommended: option 1, mirroring the `ManagementActor` pattern — same security posture
+across both surfaces. Add a regression test that a site-scoped `AuditReadOnly` user
+filtering on an out-of-scope site gets a 403 (or an empty page).
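+
+The intersection rule in option 1 can be sketched as a small pure helper. The
+`AuditSiteScope` name and the generic signature are hypothetical; the actual
+element type of `PermittedSiteIds` is not shown in this review, so the sketch
+stays generic over it:
+
+```csharp
+using System.Collections.Generic;
+using System.Linq;
+
+public static class AuditSiteScope
+{
+    // Returns the effective site filter:
+    //   null       -> no restriction (system-wide caller, no filter requested)
+    //   empty list -> nothing in scope (caller asked only for out-of-scope sites)
+    public static IReadOnlyList<T>? Resolve<T>(
+        IReadOnlyCollection<T> requested,
+        IReadOnlyCollection<T> permitted)
+    {
+        // An empty permitted set means system-wide, as for the Admin role.
+        if (permitted.Count == 0)
+            return requested.Count == 0 ? null : requested.ToList();
+
+        // Scoped caller with no explicit filter: restrict to the permitted sites.
+        if (requested.Count == 0)
+            return permitted.ToList();
+
+        // Scoped caller with an explicit filter: keep only the overlap.
+        return requested.Intersect(permitted).ToList();
+    }
+}
+```
+
+Whether an empty result becomes an empty page or a 403 is left to the caller,
+matching the choice the recommendation leaves open.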
+
+### ManagementService-020 — UpdateSmtpConfig returns and audits the SMTP Credentials field verbatim
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:1136`–`:1153` |
+
+**Description**
+
+`HandleUpdateSmtpConfig` reads the existing `SmtpConfiguration` entity, applies the
+incoming command, and then **(a)** passes the full `config` object as the `afterState`
+to `AuditAsync` (line 1151) — meaning the SMTP credential string is persisted in the
+audit log — and **(b)** returns the full `config` to the caller (line 1152), which is
+serialized via `SerializeResult` and sent back over HTTP. `SmtpConfiguration.Credentials`
+carries the SMTP-Auth password (for `Basic`) or the OAuth2 client secret (for
+`OAuth2ClientCredentials`); `SmtpConfiguration` has no `[JsonIgnore]` on this field
+and `SerializeResult`'s `JsonSerializerOptions` does not exclude it. The pattern
+parallels what ConfigurationDatabase-012 fixed for inbound API keys: a credential
+artifact must not be echoed back through every read/audit path.
+
+The credential is supplied by the operator in `UpdateSmtpConfigCommand.Credentials`,
+so the caller already has it. But (1) anyone with read access to the audit log
+(`OperationalAuditRoles`) can now retrieve every SMTP credential change verbatim — a
+strictly larger blast radius than `Admin`-only `UpdateSmtpConfig`. (2) The serialized
+`config` echo means the credential moves over the wire in the response even though the
+caller has no need for it. (3) Any read path that returns `SmtpConfiguration` leaks
+the stored credential too; `ListSmtpConfigsCommand` already does so at line 1130, and
+future read paths will inherit the same leak.
+
+**Recommendation**
+
+Three changes, in order of priority:
+
+1. In `HandleUpdateSmtpConfig` and `HandleListSmtpConfigs`, project to a credential-free
+   shape before returning — e.g. `new { config.Id, config.Host, config.Port,
+   config.AuthType, config.FromAddress, config.TlsMode }`. Match the
+   `HandleListApiKeys` pattern.
+2. In `AuditAsync` for the SMTP path, pass a credential-free `afterState` (the same
+   anonymous shape). The fact that *something* changed is auditable; the secret value
+   is not.
+3. Tag `SmtpConfiguration.Credentials` with `[JsonIgnore]` in Commons (out-of-scope edit
+   for this module, but worth a follow-up). Alternatively, configure
+   `ResultSerializerOptions` with a type-info modifier that skips a known set of
+   credential field names (a property naming policy can only rename properties, not
+   omit them), but a per-entity projection is cleaner.
+
+Add regression tests: `UpdateSmtpConfig_DoesNotEchoCredentialsInResponse` and
+`UpdateSmtpConfig_DoesNotPersistCredentialsInAuditLog`.
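+
+A minimal sketch of the credential-free projection from changes 1 and 2. The
+`SmtpConfiguration` class below is a stand-in whose property types are assumed,
+and `SmtpConfigView` is a hypothetical name; a named record is used instead of
+an anonymous type so that the response and the audit `afterState` can share it:
+
+```csharp
+// Stand-in for the real entity in Commons; property types are assumptions.
+public sealed class SmtpConfiguration
+{
+    public int Id { get; set; }
+    public string Host { get; set; } = "";
+    public int Port { get; set; }
+    public string AuthType { get; set; } = "";
+    public string FromAddress { get; set; } = "";
+    public string TlsMode { get; set; } = "";
+    public string? Credentials { get; set; } // secret: must not leave the handler
+}
+
+// Hypothetical view type that simply has no Credentials member.
+public sealed record SmtpConfigView(
+    int Id, string Host, int Port, string AuthType, string FromAddress, string TlsMode)
+{
+    public static SmtpConfigView From(SmtpConfiguration c) =>
+        new(c.Id, c.Host, c.Port, c.AuthType, c.FromAddress, c.TlsMode);
+}
+```
+
+Because the view has no `Credentials` property, neither the serializer nor the
+audit writer can emit the secret, regardless of serializer options.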
+
+### ManagementService-021 — Transport bundle handlers have zero test coverage
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1`; `src/ScadaLink.ManagementService/ManagementActor.cs:1717`–`:1897` |
+
+**Description**
+
+The three Transport (#24) bundle handlers — `HandleExportBundle`, `HandlePreviewBundle`,
+`HandleImportBundle` (~180 lines of handler logic at the bottom of `ManagementActor.cs`)
+— have **no tests** in `ManagementActorTests`. Specifically untested:
+
+1. **Role gating.** `ExportBundleCommand` requires `Design`; `PreviewBundleCommand` and
+   `ImportBundleCommand` require `Admin`. No test asserts that the wrong role gets
+   `ManagementUnauthorized`. CLI-017 / CLI-018 just landed around bundle plumbing — a
+   future refactor that moves these commands between role groups in `GetRequiredRole`
+   would silently regress the gate.
+2. **Name resolution in `HandleExportBundle`.** The inner `ResolveIds<T>` helper raises
+   `ManagementCommandException` for unknown names. The "all entity types" branch
+   (`cmd.All == true`) and the "missing name" branch are both untested.
+3. **`HandleImportBundle` blocker rejection.** The handler aborts before `ApplyAsync`
+   when any `ConflictKind.Blocker` row is present; the produced error message is
+   curated and surfaced to the caller, but no test asserts the abort path or that the
+   importer's `ApplyAsync` was not called.
+4. **Resolution dedupe.** `HandleImportBundle` dedupes `(EntityType, Name)` keys
+   last-write-wins — the dedupe is critical (CLI-014 was about it on the CLI side) but
+   has no actor-side regression test.
+5. **`DecodeBundle` failure modes** (empty/non-base64 input) — both branches return
+   curated `ManagementCommandException` but neither is exercised.
+6. **`ParseConflictPolicy`** for `"skip"`, `"overwrite"`, `"rename"`, and the invalid-
+   value branch — all untested.
+
+Given the size and reach of the bundle path (cross-cutting central configuration
+import), this gap is materially larger than usual for new handler code.
+
+**Recommendation**
+
+Add an `ImportBundleHandlerTests` suite covering:
+- role gating for all three commands (`Design`/`Admin` mismatch -> `ManagementUnauthorized`),
+- `ExportBundleCommand(All: true)` happy-path,
+- `ExportBundleCommand` with an unknown name -> `ManagementError`,
+- `ImportBundleCommand` with a `Blocker` row -> `ManagementError` and `ApplyAsync` not called,
+- `ImportBundleCommand` with duplicate preview items -> dedupe to one resolution per (type, name),
+- `DecodeBundle` empty/invalid base64,
+- `ParseConflictPolicy` all four branches.
+
+Use NSubstitute for `IBundleImporter` / `IBundleExporter` (no need for a real bundle in
+the actor tests; the bundle round-trip belongs in `Transport` tests).
+
+### ManagementService-022 — Design doc is stale on Transport bundle commands, /api/audit/* endpoints, and CommandTimeout
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `docs/requirements/Component-ManagementService.md:77`–`:175`, `:205`–`:209` |
+
+**Description**
+
+`Component-ManagementService.md` does not mention three pieces of shipped functionality:
+
+1. **Transport (#24) bundle commands.** `ExportBundleCommand`, `PreviewBundleCommand`,
+   and `ImportBundleCommand` are dispatched at `ManagementActor.cs:350`–`:352` and
+   role-gated in `GetRequiredRole` (Design for Export; Admin for Preview/Import). The
+   design doc's "Message Groups" section enumerates Templates, Instances, Sites, Data
+   Connections, Deployments, External Systems, Notifications, Security, Audit Log,
+   Shared Scripts, Database Connections, Inbound API Methods, Health, and Remote
+   Queries — but has no "Transport" / "Bundles" group. The CLI now offers `bundle
+   export`/`preview`/`import` (per the recent CLI-017/018 work) and points
+   at these commands.
+2. **`/api/audit/*` endpoints.** The doc's "HTTP Management API" section (line 52)
+   describes only `POST /management`. `AuditEndpoints.MapAuditAPI()` adds
+   `GET /api/audit/query` and `GET /api/audit/export` with their own auth-and-role
+   path mirroring `ManagementEndpoints` (intentionally — see the `AuditEndpoints` XML
+   docs), but the design doc gives no signal that the module exposes more than one
+   route group, no per-endpoint role mapping table, and no mention that the response
+   shape differs (keyset cursor vs. opaque page).
+3. **`CommandTimeout`.** Line 209 still says "Reserved for future configuration —
+   e.g., command timeout overrides", but ManagementService-010 wired the option through
+   `ResolveAskTimeout`. The doc is stale.
+
+**Recommendation**
+
+Update `Component-ManagementService.md`:
+
+- Add a "Transport" entry to "Message Groups" listing `ExportBundle`,
+  `PreviewBundle`, `ImportBundle` with their per-command roles. Cross-reference
+  `Component-Transport.md`.
+- Add an "Audit Log HTTP API" subsection under "HTTP Management API" describing
+  `GET /api/audit/query` (keyset cursor, `OperationalAuditRoles`) and
+  `GET /api/audit/export` (csv/jsonl streaming, `AuditExportRoles`, parquet 501).
+  Note the deliberate divergence in the source-site query-string key
+  (`sourceSiteId` vs CentralUI's `site`).
+- In the "Configuration" table, replace "Reserved for future configuration" with the
+  actual `CommandTimeout` semantics: "Max time the HTTP endpoint will Ask the
+  ManagementActor before returning HTTP 504; falls back to 30 s when unset or
+  non-positive."
+
+### ManagementService-023 — HandleQueryDeployments unfiltered branch is N+1 on instance lookup
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:1276`–`:1295` |
+
+**Description**
+
+The site-scoped unfiltered branch of `HandleQueryDeployments` (added under
+ManagementService-014) reads every `DeploymentRecord` via `GetAllDeploymentRecordsAsync`,
+then for each *unique* `record.InstanceId` calls
+`ITemplateEngineRepository.GetInstanceByIdAsync` to resolve the instance's
+`SiteId`. The handler caches results in `instanceSiteCache` so each instance is loaded
+at most once per call, but for a fleet with N distinct instances having deployment
+history, the handler still issues N round-trips to the configuration database to
+authorize a single query. With a large deployment history the cumulative DB hit can be
+material; it also runs every time a site-scoped user opens the deployments page.
+
+This is acceptable in steady state today (sites tend to have small fleets and few
+deployments) but is a textbook N+1 read pattern, and on a busy day for a site-scoped
+operator the cost will dominate the request. Admin and system-wide Deployment users
+correctly skip the loop (they hit only `GetAllDeploymentRecordsAsync`).
+
+**Recommendation**
+
+Add a batch-resolve method to `ITemplateEngineRepository` — e.g.
+`Task<IDictionary<int, int>> GetInstanceSiteIdsAsync(IEnumerable<int> instanceIds)` —
+backed by a single EF query
+(`Instances.Where(i => instanceIds.Contains(i.Id)).Select(i => new { i.Id, i.SiteId })`).
+`HandleQueryDeployments` would then issue exactly two queries on the unfiltered branch
+(records + sites) regardless of fleet size. The change is additive to
+`ITemplateEngineRepository` and out-of-module for the actual implementation, but the
+handler change is local; a quick interim alternative is to project deployment records
+to include the instance's `SiteId` at the repo level, which removes the second query
+entirely.
+
+Defer until a noticeable hot path emerges, but track it: this is the only N+1 in
+`ManagementActor` once 002 / 014 are folded in.
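+
+On the handler side, the batch result could be consumed roughly as follows. This
+is a sketch under assumptions: `DeploymentRecord` is reduced to the two fields
+that matter here, and `DeploymentScopeFilter` is a hypothetical helper; the
+dictionary shape follows the `GetInstanceSiteIdsAsync` signature proposed above:
+
+```csharp
+using System.Collections.Generic;
+using System.Linq;
+
+// Stand-in for the real record; only the fields the filter needs.
+public sealed record DeploymentRecord(int Id, int InstanceId);
+
+public static class DeploymentScopeFilter
+{
+    // Keeps records whose instance resolves to a permitted site.
+    // Instances missing from the lookup are treated as out of scope.
+    public static List<DeploymentRecord> Apply(
+        IEnumerable<DeploymentRecord> records,
+        IDictionary<int, int> siteIdByInstanceId,
+        ISet<int> permittedSiteIds)
+    {
+        return records
+            .Where(r => siteIdByInstanceId.TryGetValue(r.InstanceId, out var siteId)
+                        && permittedSiteIds.Contains(siteId))
+            .ToList();
+    }
+}
+```
+
+The lookup is built once per request from a single query, so the filter itself
+performs no database round-trips.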
@@ -0,0 +1,488 @@
+# Code Review — NotificationOutbox
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.NotificationOutbox` |
+| Design doc | `docs/requirements/Component-NotificationOutbox.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-28 |
+| Reviewer | claude-agent |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 10 |
+
+## Summary
+
+NotificationOutbox is a small, focused module — one ~985-line actor
+(`NotificationOutboxActor`), a strongly-typed options class, an
+`INotificationDeliveryAdapter` seam, and the single concrete `EmailNotificationDeliveryAdapter`.
+The Akka.NET conventions are textbook: every async path is wrapped with `PipeTo`, the
+dispatcher uses an in-flight guard cleared on `DispatchComplete`, the sender is captured
+before crossing the await, and the actor isolates per-notification failures so one bad row
+never aborts a batch. Test coverage is broad — ingest, dispatch, query, retry/discard,
+purge, KPI, and the new audit-emission paths (B2 attempts + B3 terminals) all have
+dedicated test files — and the audit-write-failure-never-aborts-delivery contract is
+explicitly asserted.
+
+The dominant theme is **trust-boundary leakage between Outbox, NotificationService, and
+ConfigurationDatabase**. The outbox inherits two known defects from its sibling modules
+that are reachable through `EmailNotificationDeliveryAdapter`: the OAuth2 SASL empty-user
+bug (NS-021) ships every M365 send with `user=""`, and the
+`InsertIfNotExistsAsync` check-then-act race (CD-015) lives on the outbox's ack-after-persist
+hot path. Neither is a defect of code under `src/ScadaLink.NotificationOutbox/`, but both
+are surfaced here because production dispatch and ingest go through these exact lines.
+A secondary theme is **dispatcher-fire-and-forget audit writes** (`_ = _auditWriter.WriteAsync(...)`)
+that can race the per-sweep scope dispose under the wrong DI graph, and a few smaller
+drifts: the dispatcher passes `CancellationToken.None` to adapter delivery (no graceful
+shutdown for in-flight SMTP sends), the `StuckAgeThreshold` XML-doc describes a behavior
+the design explicitly forbids (display-only, never reclaim), the `MaxRetries` boundary check
+uses `>=` against a config value that can be zero (immediate park on first transient
+failure), and several `NotificationOutboxOptions` fields are documented in code but absent
+from `Component-NotificationOutbox.md`. No Critical findings; two High, six Medium, two Low.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | Yes | `MaxRetries` zero/negative immediately parks (NotificationOutbox-002); `StuckAgeThreshold` XML doc contradicts design (NotificationOutbox-009); `Guid.TryParse` accepts compact `"N"` ids emitted by sites. |
+| 2 | Akka.NET conventions | Yes | `PipeTo` / sender-capture / in-flight guard pattern is correctly applied throughout. Fire-and-forget `_ = _auditWriter.WriteAsync(...)` raises a scope-lifetime concern (NotificationOutbox-004). |
+| 3 | Concurrency & thread safety | Yes | Actor state mutated only on actor thread. Inherited CD-015 race on `InsertIfNotExistsAsync` (NotificationOutbox-005) is the only race; the dispatcher's in-flight guard correctly serializes sweeps. |
+| 4 | Error handling & resilience | Yes | Outer try/catch on `RunDispatchPass`/`RunPurgePass` keeps the in-flight guard sane; per-notification isolation is correct. CT not threaded into delivery (NotificationOutbox-003). |
+| 5 | Security | Yes | Inherited OAuth2 empty-user (NotificationOutbox-001) reachable through the adapter. No new credential or trust-boundary issues introduced by the outbox itself. |
+| 6 | Performance & resource management | Yes | Dispatch interval & batch size are simple polling; `ResolveAdapters` rebuilds the lookup per sweep (NotificationOutbox-006). No leaks. |
+| 7 | Design-document adherence | Yes | `NotificationOutboxOptions.DispatchBatchSize`, `DeliveredKpiWindow`, `PurgeInterval` are not in the design doc (NotificationOutbox-007). |
+| 8 | Code organization & conventions | Yes | Options class lives in the component project (correct); DI extension lives in the component (correct); adapter is `scoped`, actor singleton — interaction correctly documented in `ServiceCollectionExtensions`. No issues. |
+| 9 | Testing coverage | Yes | Solid actor-behaviour coverage. Missing tests for `FallbackMaxRetries` / empty-SMTP-config dispatch path (NotificationOutbox-008). |
+| 10 | Documentation & comments | Yes | XML on `StuckAgeThreshold` misleading (NotificationOutbox-009); XML on dispatcher's audit `_ =` fire-and-forget says "writer never throws" but `EmitAttemptAudit` still wraps in try/catch — comment contradicts itself (NotificationOutbox-010). |
+
+## Findings
+
+### NotificationOutbox-001 — `EmailNotificationDeliveryAdapter` inherits the OAuth2 empty-user SASL bug (NS-021) on the M365 send path
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationOutbox/Delivery/EmailNotificationDeliveryAdapter.cs:185-191` (calls `smtp.AuthenticateAsync("oauth2", token)`); root cause in `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:76-79` |
+
+**Description**
+
+`EmailNotificationDeliveryAdapter.SendAsync` resolves an OAuth2 access token via
+`_tokenService.GetTokenAsync(...)` and then calls
+`await smtp.AuthenticateAsync(config.AuthType, credentials, cancellationToken);`
+on `ISmtpClientWrapper`. The production implementation (`MailKitSmtpClientWrapper`)
+constructs `new SaslMechanismOAuth2("", credentials)` — an empty user-name field —
+which Microsoft 365 SMTP rejects with `535 5.7.3 Authentication unsuccessful`. The
+sibling NotificationService finding NS-021 documents this in full; the outbox is the
+*new home* for delivery on central, so every OAuth2 send that the outbox dispatches
+hits this code path. The defect is therefore reachable here even though the offending
+constructor lives in the NotificationService project, and the central-only redesign
+means this is now the only delivery path in production. Existing outbox tests do not
+catch it because they all substitute `ISmtpClientWrapper` and assert only that
+`AuthenticateAsync` is invoked with `("oauth2", "<token>")` — the real
+`SaslMechanismOAuth2` is never instantiated. `OAuth2TokenService.GetTokenAsync` is
+explicitly wired to `login.microsoftonline.com/.../oauth2/v2.0/token` with
+`scope=https://outlook.office365.com/.default`, so M365 SMTP is the intended target —
+and is precisely the relay that requires the user field to be populated.
+
+**Recommendation**
+
+Track the NS-021 fix and add an outbox-side regression test once the wrapper signature
+is widened. Concretely, when `ISmtpClientWrapper.AuthenticateAsync` is extended to
+accept the sender mailbox (or a dedicated `oauth2UserName` parameter), update
+`EmailNotificationDeliveryAdapter.SendAsync` to pass `config.FromAddress`, and add a
+test in `EmailNotificationDeliveryAdapterTests` that asserts the OAuth2 path forwards
+the sender identity. Until then, keep this finding open here so the outbox is not
+treated as resolved merely because NS-021 has been closed.
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationOutbox-002 — Dispatcher parks on first transient failure when `SmtpConfiguration.MaxRetries == 0`
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:348-360` |
+
+**Description**
+
+The transient-failure branch increments `RetryCount` then evaluates
+`if (notification.RetryCount >= maxRetries) notification.Status = NotificationStatus.Parked;`.
+`maxRetries` is read from the central `SmtpConfiguration.MaxRetries` column, which has
+no enforced lower bound and is not validated by the outbox. A row whose `MaxRetries`
+is `0` (or any negative value) immediately satisfies `1 >= 0` on the very first
+transient failure, so the notification is parked without a single retry — directly
+contradicting the design doc's "fixed retry interval, reuse central SMTP
+max-retry-count" intent, where a configured value of zero would naturally read as
+"never retry, fail straight to permanent". `SetupSmtpRetryPolicy` in the dispatch
+tests always supplies a positive value, so this path is not exercised.
+
+Additionally, an operator who clears the SMTP config row drops into the
+`FallbackMaxRetries = 10` / `FallbackRetryDelay = 1 min` path
+(`ResolveRetryPolicyAsync` line 251); that path is also untested — see
+NotificationOutbox-008. The operational result is that a single bad SMTP config
+value silently removes the outbox's retry guarantee.
+
+**Recommendation**
+
+Validate `MaxRetries` at the read point: treat a non-positive value as either the
+configured fallback (current `FallbackMaxRetries = 10`) or — preferred — surface the
+mis-configuration to the operator via a health metric and refuse to dispatch until
+the row is corrected. Either way, add a test that asserts the dispatcher's behaviour
+for `MaxRetries == 0` and `MaxRetries < 0`.
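+
+A minimal sketch of the read-point validation, assuming the fallback option is
+chosen. `RetryPolicyGuard` and `Normalize` are hypothetical names that do not exist
+in the codebase; only the `FallbackMaxRetries = 10` value is taken from the actor.
+
+```csharp
+using System;
+
+// Hypothetical helper: the name and the fallback-over-refuse policy are assumptions.
+public static class RetryPolicyGuard
+{
+    // Mirrors the actor's existing fallback constant.
+    public const int FallbackMaxRetries = 10;
+
+    // Returns the retry budget to use and reports whether the stored value was
+    // invalid, so the caller can raise a health metric instead of silently parking.
+    public static int Normalize(int configuredMaxRetries, out bool misconfigured)
+    {
+        misconfigured = configuredMaxRetries <= 0;
+        return misconfigured ? FallbackMaxRetries : configuredMaxRetries;
+    }
+}
+
+public static class Demo
+{
+    public static void Main()
+    {
+        foreach (var configured in new[] { 3, 0, -1 })
+        {
+            var effective = RetryPolicyGuard.Normalize(configured, out var bad);
+            Console.WriteLine($"configured={configured} effective={effective} misconfigured={bad}");
+        }
+    }
+}
+```
+
+The `misconfigured` flag is what the requested test for `MaxRetries == 0` and
+`MaxRetries < 0` would assert on, whichever policy is finally adopted.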
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationOutbox-003 — Dispatcher does not propagate a `CancellationToken` into delivery; in-flight SMTP sends cannot be cancelled on shutdown
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:334`, `src/ScadaLink.NotificationOutbox/Delivery/INotificationDeliveryAdapter.cs:22` |
+
+**Description**
+
+`DeliverOneAsync` calls `var outcome = await adapter.DeliverAsync(notification);` —
+the second parameter of `INotificationDeliveryAdapter.DeliverAsync`, the `CancellationToken`,
+is left at its `default(CancellationToken)` value, meaning `CancellationToken.None`.
+`EmailNotificationDeliveryAdapter.SendAsync` then threads that `None` token into
+`smtp.ConnectAsync`, `smtp.AuthenticateAsync`, and `smtp.SendAsync`. The consequence
+is that during a coordinated cluster shutdown (singleton handover, drain) any
+in-flight SMTP send is uncancellable and the dispatcher's sweep must wait for the
+underlying socket/SMTP timeout (`SmtpConfiguration.ConnectionTimeoutSeconds`) before
+the sweep's task completes and `DispatchComplete` lowers the in-flight guard. With
+the default connect-timeout values this is on the order of tens of seconds per
+notification in the in-progress batch, blocking `CoordinatedShutdown`.
+
+The adapter implementations clearly *expect* a token — the contract type is
+`CancellationToken cancellationToken = default` everywhere — so this is a wiring
+gap, not a missing interface.
+
+**Recommendation**
+
+Wire a per-sweep `CancellationTokenSource` linked to the actor's lifecycle (cancel
+in `PostStop`) and pass its token into `DeliverAsync`. If a more explicit
+per-attempt budget is wanted, a linked source can also enforce the configured
+connection timeout (for example via `CancelAfter`). Add a test that cancels mid-`DeliverAsync` and
+asserts the dispatcher completes promptly and the row is left non-terminal
+(`Pending`/`Retrying` unchanged) for the next sweep.
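+
+The token wiring could look roughly like the sketch below. It is not the actor's
+code: `SweepCancellationSketch`, `DeliverOneAsync`'s signature, and `Stop` are
+hypothetical stand-ins, and only BCL cancellation types are used.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Hypothetical stand-in for the dispatcher; only the token wiring is the point.
+public sealed class SweepCancellationSketch : IDisposable
+{
+    // Cancelled once when the owner stops (PostStop in the real actor).
+    private readonly CancellationTokenSource _lifetime = new();
+
+    public async Task<bool> DeliverOneAsync(
+        Func<CancellationToken, Task> deliver, TimeSpan perAttemptBudget)
+    {
+        // Linked source: cancels on shutdown or when the per-attempt budget expires.
+        using var attempt = CancellationTokenSource.CreateLinkedTokenSource(_lifetime.Token);
+        attempt.CancelAfter(perAttemptBudget);
+        try
+        {
+            await deliver(attempt.Token);
+            return true;
+        }
+        catch (OperationCanceledException)
+        {
+            // Leave the row non-terminal so the next sweep picks it up again.
+            return false;
+        }
+    }
+
+    public void Stop() => _lifetime.Cancel();
+    public void Dispose() => _lifetime.Dispose();
+}
+
+public static class Demo
+{
+    public static async Task Main()
+    {
+        using var sketch = new SweepCancellationSketch();
+        // Simulated slow SMTP send that honours the token.
+        var pending = sketch.DeliverOneAsync(
+            ct => Task.Delay(TimeSpan.FromSeconds(30), ct), TimeSpan.FromSeconds(60));
+        sketch.Stop(); // simulated shutdown while the send is in flight
+        Console.WriteLine(await pending ? "delivered" : "cancelled, left for next sweep");
+    }
+}
+```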
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationOutbox-004 — `EmitAttemptAudit`/`EmitTerminalAudit` fire-and-forget pattern can outlive the per-sweep DI scope
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:425-435`, `463-485` |
+
+**Description**
+
+Both emission helpers issue `_ = _auditWriter.WriteAsync(evt);` — discarding the
+returned task. `CentralAuditWriter.WriteAsync` opens its own `await using var scope =
+_services.CreateAsyncScope();` and resolves a scoped `IAuditLogRepository` (verified
+at `src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:118-121`), so the writer is
+defensively scope-independent. However the dispatcher already holds a per-sweep
+`using var scope = _serviceProvider.CreateScope();` and the per-notification
+`UpdateAsync` runs in that scope. The fire-and-forget pattern means:
+
+1. The dispatcher's outer scope can be disposed (sweep done, `DispatchComplete`
+   piped) while the audit `WriteAsync` task is still running on a *different*
+   scope it owns — works today only because the writer creates its own scope.
+2. A faulted unobserved task is silently lost: if `CentralAuditWriter.WriteAsync`
+   itself were ever refactored to stop catching its own exceptions internally,
+   the dispatcher would never see the fault and the audit row would vanish without
+   the `_logger.LogWarning` reaching the operator.
+3. The XML-doc above `EmitAttemptAudit` says "PipeTo is not used because the writer
+   never throws" — but the surrounding `try { _ = _auditWriter.WriteAsync(evt); }
+   catch (Exception ex)` will only catch a synchronous throw from the *task
+   construction*, not the awaited body of `WriteAsync`. The comment understates the
+   risk: the catch is structurally unreachable for the documented failure mode.
+
+The system actually wants the *invariant* "audit write never affects delivery"
+(verified by the `AuditWriter_Throws_…StillSucceeds` tests). That invariant is
+better expressed by `await`-ing the writer inside the actor's outer try/catch (the
+dispatcher already swallows per-notification exceptions) than by a discard-task,
+which couples the lifetime of the dispatcher's scope to that of the audit task
+through whatever scope graph the writer happens to use today.
+
+**Recommendation**
+
+Either `await _auditWriter.WriteAsync(evt)` inside the existing `try`/`catch` (the
+preferred fix — preserves the invariant, plays well with the per-sweep scope, and
+makes the catch block actually reachable), or — if a true fire-and-forget remains
+desired — capture the returned task and attach a continuation that calls
+`_logger.LogWarning` when the task faults, to keep diagnostics intact. Either way, fix the
+"writer never throws" XML-doc to match the implementation.
+
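+If fire-and-forget is kept, the fault-observing continuation might be sketched as
+follows. `FireAndForgetSketch.Observe` is a hypothetical helper, and the
+`Action<Exception>` callback stands in for the actor's `_logger.LogWarning`.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+public static class FireAndForgetSketch
+{
+    // Hypothetical helper: observes a discarded task so a fault is logged, not lost.
+    // The continuation runs only when the antecedent task faults.
+    public static Task Observe(Task task, Action<Exception> logWarning) =>
+        task.ContinueWith(
+            t => logWarning(t.Exception!.GetBaseException()),
+            CancellationToken.None,
+            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
+            TaskScheduler.Default);
+
+    public static async Task Main()
+    {
+        // Simulated audit write whose awaited body faults.
+        Task failingWrite = Task.FromException(
+            new InvalidOperationException("audit store unavailable"));
+
+        // Awaited here only so the demo process waits for the log line;
+        // the dispatcher itself would not await the returned task.
+        await Observe(failingWrite, ex => Console.WriteLine($"audit write failed: {ex.Message}"));
+    }
+}
+```
+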
+**Resolution**
+
+_Unresolved._
+
+### NotificationOutbox-005 — Ingest persistence inherits the CD-015 check-then-act race; under contention the second writer throws and the site retries
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:127-132` (caller); root cause in `src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs:33-45` |
+
+**Description**
+
+`HandleSubmit` → `PersistAsync` calls `repository.InsertIfNotExistsAsync(notification)`
+on `INotificationOutboxRepository`. The current implementation
+(`src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs`)
+does a check-then-act with no duplicate-key catch — documented as CD-015 (High,
+Open). The Notification Outbox's documented contract is "at-least-once handoff with
+ack-after-persist plus insert-if-not-exists on `NotificationId`" (CLAUDE.md,
+Component-NotificationOutbox.md §Ingest & Idempotency), and the duplicate-insert
+race is the **expected contention pattern** — the site retries the same submission
+after a lost ack. As written, the loser surfaces a `SqlException` (2627 PK
+violation) wrapped in `DbUpdateException`, propagates through `PipeTo`'s failure
+projection as a `NotificationSubmitAck { Accepted: false, Error: "... PRIMARY KEY ..." }`,
+the site treats the ack as a forwarding failure and forwards the message **again**,
+re-entering the same race. If the contending pair keeps racing this can livelock.
+
+The actor side is fine — `PipeTo`'s success/failure projection correctly forwards
+the exception message. The repository side needs the standard `2601/2627 → no-op`
+pattern that AuditLog and SiteCall already use. This finding tracks the outbox-side
+visibility of the CD-015 defect so a re-review of NotificationOutbox surfaces it
+even if the reader has not yet read the ConfigurationDatabase findings.
+
+**Recommendation**
+
+Track CD-015 to resolution. As a defense-in-depth complement here, consider
+treating a duplicate-key `DbUpdateException` in the actor's ingest failure
+projection as `Accepted: true`, so that an ack lost after the first writer has
+persisted the row does not produce a permanent re-forward loop. The cleanest fix
+remains the CD-015 raw-SQL `IF NOT EXISTS … INSERT` with `2601/2627` catch in
+`NotificationOutboxRepository`.
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationOutbox-006 — `ResolveAdapters` rebuilds the `NotificationType → adapter` dictionary on every dispatch sweep
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:267-277` |
+
+**Description**
+
+Every dispatch sweep calls `ResolveAdapters(scope.ServiceProvider)` which enumerates
+`scopedServices.GetServices<INotificationDeliveryAdapter>()` and builds a fresh
+`Dictionary<NotificationType, INotificationDeliveryAdapter>`. Adapter registration
+is decided at startup (`AddNotificationOutbox` registers
+`EmailNotificationDeliveryAdapter`); the registration set does not change at
+runtime. With a default `DispatchInterval = 10s` and only ever one entry today, the
+allocation overhead is trivial — but the comment "the last adapter registered for a
+given type wins, mirroring DI's last-wins resolution semantics" elevates this to a
+behaviour contract, and the per-sweep dictionary construction obscures the lookup's
+identity from one sweep to the next, making any future stateful adapter (rate
+limiter, circuit breaker) silently lose its state.
+
+Per-sweep resolution exists because `EmailNotificationDeliveryAdapter` is
+*scoped*: it holds a scoped `INotificationRepository`. A simple fix is to cache
+the types but resolve the instances: cache the set of declared `NotificationType`
+values and, per sweep, look up each adapter via
+`GetServices<INotificationDeliveryAdapter>()` filtered by `Type`.
+
+**Recommendation**
+
+Document the per-sweep contract explicitly ("each sweep gets a fresh adapter
+instance per the scoped DI contract — adapters must not carry state across
+sweeps") in the actor XML, or — preferred — cache only the *types* at startup
+(`PreStart`) and resolve the scoped instance per sweep, so future adapters with
+stateful intent (timeouts, circuit breakers) cannot accidentally lose state.
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationOutbox-007 — `NotificationOutboxOptions.DispatchBatchSize`, `DeliveredKpiWindow`, and `PurgeInterval` are not in the design document
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs:13`, `:22`, `:25`; `docs/requirements/Component-NotificationOutbox.md:152-160` |
+
+**Description**
+
+`Component-NotificationOutbox.md` §Configuration enumerates three options: dispatch
+interval, stuck-age threshold, and terminal-row retention window. The implemented
+`NotificationOutboxOptions` adds three additional fields:
+
+- `DispatchBatchSize` (default `100`) — caps the per-sweep claim size, but is invisible
+  to anyone reading only the spec.
+- `PurgeInterval` (default `1 day`) — the design doc says "daily purge" as if the
+  cadence is fixed; in code it is configurable.
+- `DeliveredKpiWindow` (default `1 min`) — the KPI section says "Delivered (last
+  interval)" without saying how long "last interval" is or that it is configurable.
+
+The design doc also asserts "Delivery max-retry-count and retry interval are not
+part of `NotificationOutboxOptions` — they are reused from the central SMTP
+configuration" (line 160) — implementation honours this. But the three additions
+above are absent from the design doc. The KPI dashboard cadence and the dispatch
+batch size are both operationally important values an operator/engineer will hunt
+for; their absence from the spec is design drift.
+
+**Recommendation**
+
+Add the three fields to `Component-NotificationOutbox.md §Configuration` with their
+defaults, or remove them from the implementation if they were meant to be fixed
+constants. Cross-link `DeliveredKpiWindow` from the §Monitoring "Delivered (last
+interval)" KPI bullet so a reader sees what controls the bucket length.
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationOutbox-008 — `FallbackMaxRetries` / `FallbackRetryDelay` path is unreachable in production AND untested
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:29-31`, `:251-259`; tests in `tests/ScadaLink.NotificationOutbox.Tests/NotificationOutboxActorDispatchTests.cs` |
+
+**Description**
+
+`ResolveRetryPolicyAsync` falls back to `FallbackMaxRetries = 10` and
+`FallbackRetryDelay = 1 min` when `notificationRepository.GetAllSmtpConfigurationsAsync()`
+returns an empty list (no SMTP configuration row). The comment correctly observes
+that delivery itself will then return `Permanent("No SMTP configuration available")`
+from `EmailNotificationDeliveryAdapter.cs:78-81`, so the fallback retry policy
+never actually retries anything — the row is permanently parked on first attempt
+regardless of retry count or delay.
+
+This produces three concerns. (1) The fallback is essentially dead code — the retry
+policy values are never consulted in practice because delivery always fails
+permanently before the retry branch is reached. (2) The fallback can be reached
+*after* a previously-deployed SMTP config is deleted, which is precisely the
+moment an operator needs accurate audit trails; the row will say `Parked` with
+`LastError = "No SMTP configuration available"` but the audit signal "retry policy
+fell back to defaults" is invisible. (3) Tests never exercise either the fallback
+path or the empty-SMTP-config dispatch path: `SetupSmtpRetryPolicy` always supplies
+a config in every dispatch test.
+
+**Recommendation**
+
+Add a regression test that runs a dispatch sweep with no SMTP config row and
+asserts the row is parked with the documented error. Optionally remove the fallback
+constants if parking-with-no-config is the *intended* operational signal; document
+the choice in the actor XML so a maintainer does not "fix" the unreachable code.
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationOutbox-009 — `StuckAgeThreshold` XML-doc says "in-progress notification is re-claimed" — contradicts the design's display-only stuck detection
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs:15-16` |
+
+**Description**
+
+```csharp
+/// <summary>Age past which an in-progress notification is considered stuck and re-claimed.</summary>
+public TimeSpan StuckAgeThreshold { get; set; } = TimeSpan.FromMinutes(10);
+```
+
+The implementation never reclaims anything based on `StuckAgeThreshold`. It is used
+only as a cutoff for the stuck-count KPI (`StuckCutoff`/`IsStuck` in
+`NotificationOutboxActor.cs:932-942`) and as a `StuckCutoff` filter on paginated
+queries. The design doc is explicit: "A notification is **stuck** if it is `Pending`
+or `Retrying` and older than a configurable age threshold (default 10 minutes).
+Detection is **display-only** — a count KPI and a row badge. There is no automated
+escalation or alerting" (`Component-NotificationOutbox.md:143-145`). A maintainer
+reading the XML and expecting "re-claim" behaviour will be surprised twice — once
+when no re-claim happens, and once when they go looking for the re-claim code and
+find none.
+
+**Recommendation**
+
+Rewrite the XML to match the design: "Age past which a still-`Pending`/`Retrying`
+notification is counted as stuck on the KPI tile and the per-row badge.
+Display-only — does not affect dispatch."
+
+**Resolution**
+
+_Unresolved._
+
+### NotificationOutbox-010 — Comment claims `PipeTo` is not used "because the writer never throws"; the surrounding try/catch is dead code for the documented failure mode
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:469-477` |
+
+**Description**
+
+```csharp
+try
+{
+    var evt = BuildNotifyDeliverEvent(notification, now, AuditStatus.Attempted, errorMessage)
+        with { DurationMs = durationMs };
+    // Fire-and-forget — we do NOT await: the dispatcher loop must not
+    // be blocked by audit IO, and the writer swallows its own faults.
+    // PipeTo is not used because the writer never throws.
+    _ = _auditWriter.WriteAsync(evt);
+}
+catch (Exception ex)
+{
+    _logger.LogWarning(ex, "Failed to emit Attempted audit row …");
+}
+```
+
+The XML-doc on `EmitAttemptAudit` is internally inconsistent and structurally
+incorrect: (1) if "the writer never throws" then the surrounding try/catch is
+unreachable and dead code; (2) if the writer *can* throw (and the catch is
+meaningful) then "never throws" is wrong. In practice the catch only ever fires
+on a synchronous throw from the writer's *task construction* — never on a fault
+in the awaited body — because the discarded task is not observed. The current
+behaviour matches the design intent ("audit failure NEVER aborts delivery"), but
+the comment misleads the next reader on the *why*.
+
+This is the same root cause as NotificationOutbox-004 — they target the same lines
+from different angles (NotificationOutbox-004 is the scope-lifetime /
+fire-and-forget Akka concern, NotificationOutbox-010 is the doc/comment-clarity
+concern). Closing NotificationOutbox-004 by switching to `await` resolves both.
+
+**Recommendation**
+
+If `await`-ing the writer (recommended fix per NotificationOutbox-004): delete the
+"PipeTo is not used because the writer never throws" line entirely and let
+the try/catch's behaviour speak for itself. If keeping fire-and-forget: rewrite
+the comment to "fire-and-forget by design (the writer is responsible for its
+own failure handling); the surrounding try/catch only catches the synchronous
+task-construction throw and is otherwise unreachable."
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.NotificationService` |
 | Design doc | `docs/requirements/Component-NotificationService.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |

 ## Summary

@@ -55,20 +55,65 @@ any code (NS-017, dead config — NS-007 sourced the timeout/limit from
 outside its lock, is sized once and never resized on redeployment, and is never
 disposed (NS-018).

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Re-reviewed at commit `1eb6e97` against the **materially-changed design**: per the
+updated `Component-NotificationService.md` and CLAUDE.md, the Notification Service
+is now **central-only**. Sites no longer deliver notifications over SMTP — a
+script's `Notify.Send` enqueues to the site Store-and-Forward Engine and
+`NotificationForwarder.DeliverAsync` (S&F handler in StoreAndForward) forwards
+the payload to the central Notification Outbox, which dispatches via the
+`INotificationDeliveryAdapter` registered for the list's `Type`. Email delivery
+on central is performed by `EmailNotificationDeliveryAdapter` in the
+NotificationOutbox project — it reuses this module's SMTP machinery
+(`ISmtpClientWrapper`, `OAuth2TokenService`, `SmtpErrorClassifier`,
+`SmtpTlsModeParser`, `EmailAddressValidator`, `CredentialRedactor`,
+`SmtpPermanentException`, `NotificationOptions`) but is the actual production
+caller. The intended residual responsibility of this module is to **supply that
+shared SMTP machinery** plus list/SMTP-config definition management on central.
+
+The re-review surfaced **seven new findings**. The dominant theme is **dead
+code that contradicts the design doc**: `NotificationDeliveryService`, the
+`INotificationDeliveryService` interface in Commons, the `NotificationResult`
+record, the entire `DeliverBufferedAsync` S&F handler, and the prior NS-001…
+NS-018 test fixtures that exercise them are now orphaned — no production code
+path resolves `INotificationDeliveryService` on a site (sites no longer register
+this module per `SiteServiceRegistration.cs:33-38`) and on central the
+NotificationOutbox uses its own `EmailNotificationDeliveryAdapter` (which
+duplicates the connect/auth/send/disconnect sequence rather than delegating to
+`NotificationDeliveryService`). The class is still registered by
+`AddNotificationService` on central (`Program.cs:77`) but no consumer resolves
+it (NS-019). The `S&F handler must be registered` workaround that NS-001 added
+to `AkkaHostedService` is itself superseded by the `NotificationForwarder`
+registered for the same category at `AkkaHostedService.cs:654-660` (NS-020).
+Secondary findings: a real-world correctness gap (the OAuth2
+`SaslMechanismOAuth2` is constructed with an **empty user id** so server-side
+account binding fails for any provider that requires it — NS-021); the SMTP
+client wrapper holds a single `MailKit.SmtpClient` for the lifetime of the
+wrapper but the factory delegate creates a new wrapper per send, so successive
+sends through the same factory share no connection, and each wrapper mutates
+`_client.Timeout` on every connect (benign because every wrapper has its
+own client, but the design comment about pooling is now contradicted — NS-022);
+the design-doc retention/maintenance language has no implementation in this
+module and there is no test affirming the module is central-only (NS-023, NS-024);
+and `CredentialRedactor` masks any component of the credential string that is
+≥ 4 characters long — a 4-character user name like `root` or a 4-char tenant
+prefix could be aggressively scrubbed out of unrelated log text (NS-025).
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | ☑ | Double SMTP client construction; `Auto` socket option for non-TLS; `TimeoutException`/`OperationCanceledException` misclassified. |
-| 2 | Akka.NET conventions | ☑ | No actors in this module (`AddNotificationServiceActors` is a no-op); delivery is a plain DI service. No Akka-specific issues. |
-| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` is a singleton with a shared mutable token cache; double-checked locking present but cache key is wrong (NS-006). |
-| 4 | Error handling & resilience | ☑ | Critical: no S&F delivery handler registered for `Notification` (NS-001). Fragile substring error classification (NS-002, NS-003). |
-| 5 | Security | ☑ | Credentials handled as plaintext strings; OAuth2 client secret in DB credential blob; no recipient address validation. |
-| 6 | Performance & resource management | ☑ | Two `ISmtpClientWrapper` instances created per send, one leaked; connection not pooled; `MaxConcurrentConnections` unenforced. |
-| 7 | Design-document adherence | ☑ | Connection timeout, max concurrent connections, and TLS `SSL`/`None` modes from the design doc are not implemented. |
-| 8 | Code organization & conventions | ☑ | `SmtpPermanentException` in the wrong file; `SmtpConfiguration` POCO has non-nullable strings with no initializer (compiler-warning risk). |
-| 9 | Testing coverage | ☑ | Happy path and main error branches covered; OAuth2 delivery path, `DeliverAsync` permanent fallback, and token-cache concurrency untested. |
-| 10 | Documentation & comments | ☑ | XML comment on `DeliverAsync` ("Throws on failure") and the misleading "OAuth2 token refresh if needed" comment do not match behaviour. |
+| 1 | Correctness & logic bugs | ☑ | Re-review: OAuth2 SASL constructed with empty user id (NS-021); `CredentialRedactor` over-masks short components (NS-025). Earlier NS-005/NS-008 fixes hold. |
+| 2 | Akka.NET conventions | ☑ | No actors in this module. `AddNotificationServiceActors` remains a documented no-op. |
+| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` per-credential locks now correct (NS-006 hold). No new issues. |
+| 4 | Error handling & resilience | ☑ | NS-014/NS-015 classification fixes hold but the entire `DeliverBufferedAsync` / `SendAsync` error path is dead (NS-019/NS-020). |
+| 5 | Security | ☑ | OAuth2 `SaslMechanismOAuth2` empty user id (NS-021); `CredentialRedactor` aggressiveness (NS-025); at-rest encryption still deferred (NS-013). |
+| 6 | Performance & resource management | ☑ | `MailKitSmtpClientWrapper` keeps a single `SmtpClient` for the wrapper lifetime; combined with per-send factory this means no pooling — re-document or fix (NS-022). |
+| 7 | Design-document adherence | ☑ | Critical drift: module still exposes site-style S&F sending; the design doc inverted delivery to central months ago (NS-019). Site registration removed but central still wires the dead service. |
+| 8 | Code organization & conventions | ☑ | `INotificationDeliveryService` lives in Commons and is now unused — should be retired or relocated to a NotificationService-internal namespace (NS-019). Module-vs-NotificationOutbox boundary unclear. |
+| 9 | Testing coverage | ☑ | 56 tests pass but ~40 of them assert behaviour of a code path no production caller exercises (NS-024). No test affirms the central-only design — i.e. that `AddNotificationService` registers no notification-sending service on a site. |
+| 10 | Documentation & comments | ☑ | `NotificationDeliveryService` XML doc still claims "WP-11/12: Notification delivery via SMTP" with no warning that the class is orphaned; `INotificationDeliveryService` Commons doc claims "Implemented by NotificationService, consumed by ScriptRuntimeContext" — both consumers are wrong now (NS-023). |

 ## Findings

@@ -595,3 +640,199 @@ Replace the hand-rolled double-checked init with `Lazy<SemaphoreSlim>` or `LazyI
 **Resolution**

 Resolved 2026-05-17. All three issues confirmed against source. The hand-rolled double-checked init was replaced with a `Lazy<SemaphoreSlim>` — its publication is correctly synchronised, eliminating the lock-free read of a non-`volatile` reference. `NotificationDeliveryService` now implements `IDisposable` and disposes the limiter (if created) under the existing lock, with idempotent re-entry and an `ObjectDisposedException` guard in `SendAsync`/`GetConcurrencyLimiter`; the scoped DI registration disposes it per scope. The limiter remains scoped (not hoisted to a site singleton) — the design doc deploys one SMTP config per site and the per-instance capture is bounded; the redeploy-resize concern is acknowledged as low-impact and not changed here, since hoisting would require a registration change for marginal benefit. Tests `Service_Dispose_DisposesConcurrencyLimiter` plus the existing `Send_MaxConcurrentConnections_LimitsConcurrentDeliveries`.
+
+### NotificationService-019 — `NotificationDeliveryService` and `INotificationDeliveryService` are orphaned by the central-only redesign
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:18-442`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:20-21`, `src/ScadaLink.Commons/Interfaces/Services/INotificationDeliveryService.cs:1-33`, `src/ScadaLink.Host/Program.cs:77` |
+
+**Description**
+
+The updated `Component-NotificationService.md` (re-read in full at this commit) makes the new design unambiguous: "The Notification Service is the central component that manages notification-list and SMTP definitions and provides the per-type delivery adapters used to send notifications. … Notification delivery has been inverted: a site script's notification is store-and-forwarded to the central cluster, and the central **Notification Outbox** owns dispatch and delivery, calling an `INotificationDeliveryAdapter` supplied by this component." The doc explicitly states the service is "central cluster only", "no longer present at site clusters", and "no longer delivers notifications from sites".
+
+The current source does not match. `NotificationDeliveryService` is a site-shaped notification sender: it accepts `(listName, subject, message)`, performs an immediate SMTP `DeliverAsync`, catches transient failures and **buffers them to a `StoreAndForwardCategory.Notification` row**, and exposes `DeliverBufferedAsync` as the matching S&F handler. That is precisely the old site-side flow the design doc says was removed. The doc explicitly notes "there is no … local SQLite copy" of notification lists at sites, yet `DeliverBufferedAsync` re-resolves the list from a repository expected to be reachable on the buffering node.
+
+Who actually calls it?
+
+- **Sites** do **not**. `SiteServiceRegistration.cs:33-38` documents the deliberate omission: "AddNotificationService() is intentionally NOT registered on the site path." Sites register `NotificationForwarder` (in `ScadaLink.StoreAndForward`) as the S&F handler for `StoreAndForwardCategory.Notification` (`AkkaHostedService.cs:654-660`), which Asks the central comms actor and never touches SMTP. `ScriptRuntimeContext.NotifyHelper` (in `SiteRuntime`) enqueues directly to S&F as a serialized `NotificationSubmit`, **not** via `INotificationDeliveryService.SendAsync`.
+- **Central** registers it (`Program.cs:77` calls `AddNotificationService`) but no central component resolves it. The central notification dispatcher is `NotificationOutboxActor` → `INotificationDeliveryAdapter` → `EmailNotificationDeliveryAdapter`. The adapter is a full re-implementation of the connect/auth/send/disconnect sequence (see `EmailNotificationDeliveryAdapter.cs:163-222`) — it deliberately does not call `NotificationDeliveryService.DeliverAsync` (XML-doc on the adapter says "Reuses the `ScadaLink.NotificationService` SMTP machinery — `ISmtpClientWrapper`, `SmtpTlsModeParser`, `OAuth2TokenService` and the typed `SmtpPermanentException`", i.e. only the leaf primitives).
+
+The `NotificationDeliveryService` class, its `DeliverBufferedAsync`, the `Func<ISmtpClientWrapper>` registration consumed only by it, and the `INotificationDeliveryService` interface (still in Commons) and `NotificationResult` record are therefore dead code that contradicts the design. Worse, every prior finding NS-001..NS-018 was reviewed and resolved against this dead path. The 56-test green test suite (NS-012 resolution note) exercises behaviour no production caller invokes — it gives a false sense of coverage. The misleading XML doc on `NotificationDeliveryService` ("WP-11/12: Notification delivery via SMTP") tells a maintainer this is *the* delivery path; the registration on central does the same.
+
+Risk: an operator following the design doc will look here for "the central email delivery code" and find a parallel implementation that is never called; a future feature change (e.g. retry policy tweak) made here will silently have no effect; the `Notify` script-API end-to-end behaviour now depends on `NotificationOutbox` + `EmailNotificationDeliveryAdapter` + `NotificationForwarder`, none of which are tested in this module's suite.
+
+**Recommendation**
+
+Decide and execute one of:
+
+1. **Delete `NotificationDeliveryService`, `DeliverBufferedAsync`, the `BufferedNotification` payload type, the `Func<ISmtpClientWrapper>` scoped registration (move it to NotificationOutbox if still needed there; it already has its own), and `INotificationDeliveryService`/`NotificationResult` in Commons.** Reduce `AddNotificationService` to registering the shared primitives: `OAuth2TokenService`, the `ISmtpClientWrapper` factory, and `NotificationOptions`. Delete the NS-001..NS-018 tests that target the orphaned path; retain the ones that exercise primitives (`SmtpErrorClassifier`, `SmtpTlsModeParser`, `CredentialRedactor`, `EmailAddressValidator`, `MailKitSmtpClientWrapper`, `OAuth2TokenService`), which remain genuinely shared. Update `CompositionRootTests` (`tests/ScadaLink.Host.Tests/CompositionRootTests.cs:208-209`) and `IntegrationSurfaceTests` (`tests/ScadaLink.IntegrationTests/IntegrationSurfaceTests.cs:122-135`) to drop the stale assertions.
+
+2. **Keep the class as the central-only Email delivery primitive** and rewrite `EmailNotificationDeliveryAdapter` to delegate to it. This is the smaller diff but the larger semantic burden — `NotificationDeliveryService.SendAsync` returns `NotificationResult` (Success / WasBuffered) which cannot encode the three-way `DeliveryOutcome` (Success / Transient / Permanent) the outbox needs, so the contract still has to change.
+
+Recommended path is option 1: the parallel implementation in `EmailNotificationDeliveryAdapter` is already complete and matches the new design's `DeliveryOutcome` model; salvaging the old class would re-introduce the very inversion this redesign removed.
+
+### NotificationService-020 — NS-001 fix superseded; `AkkaHostedService` would register two competing `Notification` S&F handlers if both code paths ran
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:654-660`, NS-001 resolution note (this file) |
+
+**Description**
+
+NS-001 was resolved by registering an `S&F → DeliverBufferedAsync` handler for `StoreAndForwardCategory.Notification` at site startup in `AkkaHostedService`. The current source registers a **different** handler for the same category at `AkkaHostedService.cs:654-660` — `NotificationForwarder.DeliverAsync`, which forwards to central instead of sending SMTP. `StoreAndForwardService.RegisterDeliveryHandler` (verified by reading `StoreAndForward/StoreAndForwardService.cs` around line 109) takes a single handler per category — last-write-wins or first-write-wins, either way the two registrations cannot both be active.
+
+The NS-001 resolution note in this file describes a state of the code that no longer exists: it says the handler "is now registered at site startup in `AkkaHostedService`" and points to a handler resolving `NotificationDeliveryService` via a fresh DI scope. That registration is gone from the current `AkkaHostedService` (only `ExternalSystem`, `CachedDbWrite`, and the `NotificationForwarder`-based `Notification` registration are present at the current location). So the NS-001 fix has been silently rolled back / replaced as part of the central-only redesign.
+
+The risk this finding tracks is not the current state per se — `NotificationForwarder` registration is correct under the new design — but the **stale resolution note** plus the fact that `NotificationDeliveryService.DeliverBufferedAsync` still exists in this module and is still tested as an S&F handler. A future merge or revert that re-introduces the NS-001-style registration (because it is what the test suite shape implies) would conflict with `NotificationForwarder`. The two handlers do diametrically opposite things (forward to central vs. send SMTP locally on a site where there is no SMTP config), so a misregistration would cause a silent regression of the design inversion.
+
+**Recommendation**
+
+Mark the NS-001 resolution note in this file as **superseded by NS-019** with a one-line note explaining that the registration was removed when sites stopped delivering. Delete the orphan `DeliverBufferedAsync` and its tests as part of the NS-019 work. Add a comment on `NotificationForwarder` registration in `AkkaHostedService` cross-referencing NS-019/NS-020 so a maintainer searching for the `Notification` S&F handler finds the one canonical registration.
+
+### NotificationService-021 — OAuth2 SASL constructed with empty user identifier; M365 SMTP will reject the auth handshake
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:76-79` |
+
+**Description**
+
+```csharp
+case "oauth2":
+    // OAuth2 token is passed directly as credentials (pre-fetched by token service)
+    var oauth2 = new SaslMechanismOAuth2("", credentials);
+    await _client.AuthenticateAsync(oauth2, cancellationToken);
+    break;
+```
+
+`SaslMechanismOAuth2(string userName, string token)` — MailKit's XOAUTH2 mechanism — sends the SASL initial response as `user=<userName>\x01auth=Bearer <token>\x01\x01`. Microsoft 365 (and most OAuth2-enabled SMTP relays) **require the `userName` field to be the From mailbox identity the token was issued for**; an empty string is rejected with a server response like `535 5.7.3 Authentication unsuccessful` ("Either the user identity does not match the principal in the token, or the user is empty"). Office 365's documentation for SMTP AUTH XOAUTH2 calls this out explicitly.
+
+The token-fetch path supports this: `OAuth2TokenService.GetTokenAsync` issues a Client Credentials grant against `login.microsoftonline.com/{tenantId}/oauth2/v2.0/token` with `scope=https://outlook.office365.com/.default`, which is the Microsoft 365 SMTP send scope — meaning the intended target is M365 SMTP, which is precisely the server that rejects an empty user. The `SmtpConfiguration.FromAddress` field is exactly the user identity that should be passed.
+
+This bug is not caught by tests because every existing test uses a fake `ISmtpClientWrapper` (`Substitute.For<ISmtpClientWrapper>()`, `RecordingAuthClient`, etc.), so `MailKitSmtpClientWrapper.AuthenticateAsync` is never exercised against a real `SaslMechanismOAuth2`. The OAuth2 delivery test (NS-012, `Send_OAuth2Config_AuthenticatesWithResolvedAccessToken`) only asserts the wrapper's `AuthenticateAsync` is invoked with `("oauth2", "<access-token>")`; the wrapper itself is mocked out. `EmailNotificationDeliveryAdapter` has the same defect because it routes through this same `AuthenticateAsync` method.
+
+**Recommendation**
+
+Pass the sender mailbox into the wrapper's `AuthenticateAsync` path. The cleanest fix is to thread `config.FromAddress` (or a dedicated `oauth2UserName` parameter) through `ISmtpClientWrapper.AuthenticateAsync` so the OAuth2 branch can construct `new SaslMechanismOAuth2(config.FromAddress, credentials)`. Add an integration-style test that runs `MailKitSmtpClientWrapper.AuthenticateAsync` against a stub `SmtpClient` and asserts the XOAUTH2 initial-response bytes contain the expected `user=<from>` field, so this regression is caught next time.
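+
+To make the wire-level effect concrete, the following is a minimal sketch of the XOAUTH2 initial response described above, built by hand rather than through MailKit. `BuildXoauth2InitialResponse` and `Decode` are hypothetical helpers written for illustration only; they are not part of MailKit or of this codebase, and the token value is a placeholder.
+
+```csharp
+using System;
+using System.Text;
+
+// Hypothetical helper (an assumption, not MailKit's API): assembles the
+// XOAUTH2 initial client response in the format quoted in this finding.
+static string BuildXoauth2InitialResponse(string userName, string accessToken)
+{
+    // \u0001 is used instead of \x01 because C#'s \x escape is variable-length
+    // and would swallow the 'a' of "auth" as a hex digit.
+    var raw = $"user={userName}\u0001auth=Bearer {accessToken}\u0001\u0001";
+    return Convert.ToBase64String(Encoding.UTF8.GetBytes(raw));
+}
+
+// Decodes the response and renders the 0x01 separators as '|' for readability.
+static string Decode(string base64) =>
+    Encoding.UTF8.GetString(Convert.FromBase64String(base64)).Replace('\u0001', '|');
+
+// Current behaviour: the user field is empty.
+Console.WriteLine(Decode(BuildXoauth2InitialResponse("", "tok")));
+// prints "user=|auth=Bearer tok||"
+
+// Proposed behaviour: the From mailbox is passed as the user.
+Console.WriteLine(Decode(BuildXoauth2InitialResponse("scada-notifications@company.com", "tok")));
+// prints "user=scada-notifications@company.com|auth=Bearer tok||"
+```
+
+A test along the lines recommended above could compare the bytes the real wrapper emits against the second form, assuming a way to capture the SASL exchange is available in the test harness.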
+
+### NotificationService-022 — `MailKitSmtpClientWrapper` holds a long-lived `SmtpClient`; combined with per-send factory, the design comment about pooling is contradicted
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:14`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:19` |
+
+**Description**
+
+`MailKitSmtpClientWrapper` declares `private readonly SmtpClient _client = new();` — a single `SmtpClient` is constructed when the wrapper is constructed and lives for the wrapper's lifetime. The DI registration is `services.AddSingleton<Func<ISmtpClientWrapper>>(_ => () => new MailKitSmtpClientWrapper());` (`ServiceCollectionExtensions.cs:19`) — every invocation of the factory creates a **new** wrapper and therefore a **new** `SmtpClient`. `NotificationDeliveryService.DeliverAsync` (the orphan, per NS-019) and `EmailNotificationDeliveryAdapter.SendAsync` both invoke the factory per send and dispose the wrapper at end of send. So in practice there is no connection pooling — every send pays a full TCP+TLS handshake.
+
+This is internally consistent (and matches MailKit guidance — `SmtpClient` is not thread-safe and reusing across deliveries needs careful guarding). However:
+
+1. The XML doc on the wrapper class says nothing about lifetime; the field-initializer `new SmtpClient()` *implies* a reusable connection. A maintainer might "fix" the factory to reuse a single wrapper (singleton) believing they are enabling pooling, and immediately introduce a concurrency bug: MailKit's `SmtpClient` is not thread-safe and the wrapper carries no synchronization.
+2. `ConnectAsync` mutates `_client.Timeout` (`MailKitSmtpClientWrapper.cs:39-42`) every time it runs. If a wrapper is ever reused across deliveries with different `SmtpConfiguration.ConnectionTimeoutSeconds` values, the timeout is silently overwritten — not a current bug, but a latent footgun.
+3. The design doc requirement "Max concurrent connections (default 5)" is currently honoured by the NS-007 `SemaphoreSlim` on `NotificationDeliveryService`, but `EmailNotificationDeliveryAdapter` has **no equivalent throttle** — see `EmailNotificationDeliveryAdapter.cs:163-222`, no semaphore. So on central, where the actual delivery now happens, the design-doc concurrency limit is no longer enforced. This is a regression introduced by the redesign — the outbox does not carry NS-007's limiter forward.
+
+**Recommendation**
+
+Document the per-send lifecycle on `MailKitSmtpClientWrapper` (XML doc on the class: "one wrapper per delivery; the wrapper owns a single `SmtpClient` that is connected/authenticated/sent/disconnected/disposed once"). Either move the NS-007 `SemaphoreSlim` into a shared per-site holder consumed by `EmailNotificationDeliveryAdapter`, or accept the loss and update the design doc. Add `[Obsolete]` or `internal` to discourage re-using a wrapper across sends.
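+
+If the limiter is carried forward, one possible shape for the shared holder is sketched below. `SmtpSendThrottle` and `RunAsync` are hypothetical names introduced here, and the default of 5 is taken from the design-doc requirement quoted above; how the holder is scoped and registered is left open.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Hypothetical holder (not an existing type in the codebase): caps the number
+// of SMTP deliveries that may be in flight at the same time.
+public sealed class SmtpSendThrottle : IDisposable
+{
+    private readonly SemaphoreSlim _slots;
+
+    public SmtpSendThrottle(int maxConcurrentConnections = 5)
+        => _slots = new SemaphoreSlim(maxConcurrentConnections, maxConcurrentConnections);
+
+    // Waits for a free slot, runs one full delivery, and always frees the slot,
+    // even when the delivery throws or is cancelled.
+    public async Task<T> RunAsync<T>(
+        Func<CancellationToken, Task<T>> send, CancellationToken cancellationToken)
+    {
+        await _slots.WaitAsync(cancellationToken).ConfigureAwait(false);
+        try
+        {
+            return await send(cancellationToken).ConfigureAwait(false);
+        }
+        finally
+        {
+            _slots.Release();
+        }
+    }
+
+    public void Dispose() => _slots.Dispose();
+}
+```
+
+Under this sketch the adapter would wrap its whole connect/auth/send/disconnect sequence in a single `RunAsync` call, so the per-send wrapper lifecycle stays unchanged and only the number of simultaneous sequences is bounded.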
+
+### NotificationService-023 — XML docs on the orphaned classes still describe the removed site-delivery flow; misleading to maintainers
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:12-17`, `src/ScadaLink.Commons/Interfaces/Services/INotificationDeliveryService.cs:3-12`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:8-9` |
+
+**Description**
+
+XML comments still claim the dead path is the live path:
+
+- `NotificationDeliveryService` class summary: "WP-11: Notification delivery via SMTP. WP-12: Error classification and S&F integration. Transient: connection refused, timeout, SMTP 4xx → hand to S&F. Permanent: SMTP 5xx → returned to script." This is the pre-redesign behaviour. The site-S&F branch in particular is dead (see NS-019), and "returned to script" is no longer accurate — `Notify.Send` is async and never returns a permanent error to the script per the design doc.
+- `INotificationDeliveryService` (Commons): "Interface for sending notifications. Implemented by NotificationService, consumed by ScriptRuntimeContext." Verified against source: `ScriptRuntimeContext` does **not** consume this interface — it enqueues directly to `StoreAndForwardService` (see `SiteRuntime/Scripts/ScriptRuntimeContext.cs:1770-1774`). The Commons-level claim therefore documents an interaction that no longer exists.
+- `NotificationResult` is a record returned only by the orphaned `SendAsync`. The Notification Outbox uses `DeliveryOutcome` instead, which encodes the Success/Transient/Permanent three-way that `NotificationResult(Success, ErrorMessage, WasBuffered)` cannot.
+- `ServiceCollectionExtensions.AddNotificationService` XML doc says "Registers the notification delivery services (SMTP, OAuth2 token, delivery adapter)" — no mention that the central-only redesign means most of what it registers is unused.
+
+A reader following the XML docs from any entry point ends up at a path that does not run. The CLAUDE.md "External Integrations" section and `Component-NotificationService.md` describe the new design; the in-source docs contradict them.
+
+**Recommendation**
+
+Tied to NS-019: if the orphan classes are deleted, this finding closes itself. If they are kept temporarily, prepend each summary with "**Obsolete — superseded by NotificationOutbox's `EmailNotificationDeliveryAdapter`. Retained for transitional compatibility; do not add new callers.**" and update `INotificationDeliveryService`'s summary to reflect the inverted flow or remove the interface.
+
+### NotificationService-024 — No test affirms the central-only invariant; the orphaned-path tests give a false coverage signal
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.NotificationService.Tests/NotificationDeliveryServiceTests.cs`, `tests/ScadaLink.IntegrationTests/IntegrationSurfaceTests.cs:118-136`, `tests/ScadaLink.Host.Tests/CompositionRootTests.cs:207-209` |
+
+**Description**
+
+The module test suite has 56 tests; counting `NotificationDeliveryServiceTests.cs`, ~40 of them exercise `NotificationDeliveryService.SendAsync`/`DeliverBufferedAsync` — code paths that, per NS-019, no production caller resolves. They pass against the orphaned class and so the suite stays green, but the green is a false signal: changing the dead implementation (or deleting it) does not flag any regression in the live notification-delivery flow, which now lives in `EmailNotificationDeliveryAdapter` (covered by NotificationOutbox's own tests) and `NotificationForwarder` (covered, if at all, by StoreAndForward's tests).
+
+In particular there is **no test in this module** that affirms the central-only invariant the design doc requires:
+
+- No test that `AddNotificationService()` registered on a *site* role would be inert / no-op'd, or that `SiteServiceRegistration.Configure` does **not** call `AddNotificationService` (an obvious regression vector — re-adding it would silently restore the orphaned site-delivery path).
+- No test that confirms `INotificationDeliveryService` has no production consumer (i.e. an architecture test that fails if anyone re-introduces a constructor parameter or `GetRequiredService<INotificationDeliveryService>()` call).
+- The cross-module `CompositionRootTests` (`tests/ScadaLink.Host.Tests/CompositionRootTests.cs:208-209`) still asserts `NotificationDeliveryService` and `INotificationDeliveryService` are registered, locking in the orphan rather than catching it.
+- `IntegrationSurfaceTests.cs:122-125` constructs `NotificationDeliveryService` directly to validate "the integration surface" — testing a surface that no script actually crosses.
+
+**Recommendation**
+
+After NS-019 is decided:
+
+1. If the orphan is deleted, remove the orphaned-path tests (NS-001/004/005/007/008/009/010/014/015/016/017/018-style tests targeting `SendAsync`/`DeliverBufferedAsync`). Retain `SmtpErrorClassifierTests`, `SmtpTlsModeParserTests`, `CredentialRedactorTests`, `OAuth2TokenServiceTests`, and `MailKitSmtpClientWrapperTests` (primitives genuinely shared). Update `CompositionRootTests` to drop the stale rows and `IntegrationSurfaceTests` to call the live path via `INotificationDeliveryAdapter`/`EmailNotificationDeliveryAdapter`.
+2. Add a one-shot architecture test in `tests/ScadaLink.Architecture.Tests` (if it exists, else this module) that scans for direct references to `INotificationDeliveryService` outside this project and the obsolete-interface declaration in Commons, failing if any new consumer reappears.
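+
+The architecture check in item 2 could start as a plain source scan. The sketch below is framework-agnostic and deliberately coarse: it matches text, not symbols, so comments and strings also count as references. `OrphanInterfaceScan` and `FindConsumers` are hypothetical names, and the directory layout passed in is an assumption about the repository.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.IO;
+using System.Linq;
+
+public static class OrphanInterfaceScan
+{
+    // Returns every .cs file under srcRoot whose text mentions typeName,
+    // skipping the directories (relative to srcRoot) that may mention it.
+    public static IReadOnlyList<string> FindConsumers(
+        string srcRoot, string typeName, params string[] allowedDirs)
+    {
+        // Normalise allowed directories to absolute prefixes ending in a separator
+        // so "Foo" does not accidentally allow "FooBar".
+        var allowed = allowedDirs
+            .Select(d => Path.GetFullPath(Path.Combine(srcRoot, d)) + Path.DirectorySeparatorChar)
+            .ToArray();
+
+        return Directory
+            .EnumerateFiles(srcRoot, "*.cs", SearchOption.AllDirectories)
+            .Select(f => Path.GetFullPath(f))
+            .Where(f => !allowed.Any(a => f.StartsWith(a, StringComparison.Ordinal)))
+            .Where(f => File.ReadAllText(f).Contains(typeName, StringComparison.Ordinal))
+            .OrderBy(f => f, StringComparer.Ordinal)
+            .ToList();
+    }
+}
+```
+
+A test would then assert that `FindConsumers("src", "INotificationDeliveryService", "ScadaLink.NotificationService", "ScadaLink.Commons")` returns an empty list. Allowing the whole Commons directory is looser than the single declaration the item describes; a Roslyn-based or reflection-based check would be more precise if that matters.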
+
+### NotificationService-025 — `CredentialRedactor` over-masks: any 4-character credential component is masked anywhere it appears, including unrelated log text
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.NotificationService/CredentialRedactor.cs:34-48` |
+
+**Description**
+
+```csharp
+var parts = credentials.Split(':')
+    .Where(p => p.Length >= 4)
+    .Append(credentials)
+    .Distinct()
+    .OrderByDescending(p => p.Length);
+
+foreach (var part in parts)
+{
+    result = result.Replace(part, Mask, StringComparison.Ordinal);
+}
+```
+
+The threshold `p.Length >= 4` is permissive enough that common short identifiers used by operators become aggressive global redaction tokens:
+
+- A Basic-Auth credential of `root:hunter2` produces components `["root", "hunter2", "root:hunter2"]`. Every literal `root` anywhere in the exception/log text is masked — including unrelated mentions like file paths (`/root/.config`) or default-account names in the server's reply. This obscures legitimate diagnostic information without protecting any additional secret.
+- An OAuth2 tenant id is a GUID (long, safe). The client id is typically a GUID. The client secret is the high-entropy part. The full `tenant:client:secret` is the actual sensitive triple. A tenant GUID embedded in unrelated text (a tenant-bound error code, a partial URL) will be masked even when the appearance is non-sensitive.
+- The user name in Basic Auth is sometimes the From address (`scada-notifications@company.com`) — masking *the company's notification mailbox* in every log line that mentions it has real operational cost.
+
+The function also uses `String.Replace` with ordinal comparison and no word-boundary awareness, so a 4-character component that happens to be a substring of a longer benign token is masked as well.
+
+The threshold is a defence-in-depth choice; the existing tests assert that `Hunter2pw!` and `Sup3rSecretValue` are masked (good) and that `null` text/credentials are handled (good), but nothing pins the negative behaviour: e.g. a test that a 4-char user name `root` is **not** also masked when it appears in an unrelated path.
+
+**Recommendation**
+
+Tighten the redaction policy: mask only the obviously-secret components — the password (Basic), the client secret (OAuth2), and the whole packed string — not the user name / tenant / client id. The simplest implementation is to redact only the **last** colon-separated component (the secret) plus the full packed string. Bump the per-component minimum length to something high enough that a typical short user name does not match (≥ 12 chars is the usual heuristic for a password). Add a test asserting `Scrub("/root/.config", "root:hunter2")` does not mask `/root/.config`'s `root`.
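+
+A minimal sketch of the "last component plus packed string" policy follows. `TightenedCredentialRedactor` is a hypothetical name, the `"***"` mask value is an assumption (the real `Mask` constant is not shown in this finding), and the minimum length is kept at the current value of 4 so the existing `Hunter2pw!` assertion would still hold; raising it is a separate decision.
+
+```csharp
+using System;
+
+public static class TightenedCredentialRedactor
+{
+    private const string Mask = "***";       // assumption: real mask value may differ
+    private const int MinSecretLength = 4;   // kept at the current threshold
+
+    public static string? Scrub(string? text, string? credentials)
+    {
+        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(credentials))
+            return text;
+
+        // Mask the full packed string first so it is never left half-masked.
+        var result = text.Replace(credentials, Mask, StringComparison.Ordinal);
+
+        // Treat only the last colon-separated component as the secret;
+        // user name, tenant id and client id are left readable.
+        var lastColon = credentials.LastIndexOf(':');
+        var secret = lastColon >= 0 ? credentials[(lastColon + 1)..] : credentials;
+
+        if (secret.Length >= MinSecretLength)
+            result = result.Replace(secret, Mask, StringComparison.Ordinal);
+
+        return result;
+    }
+}
+```
+
+With this sketch, `Scrub("/root/.config", "root:hunter2")` returns the path unchanged, while `Scrub("bad password hunter2", "root:hunter2")` masks the password. One known limitation: a password that itself contains a colon is only masked from its last colon onward when it appears outside the packed string.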
@@ -40,34 +40,38 @@ module file and counted in **Total**.
 | Severity | Open findings |
 |----------|---------------|
 | Critical | 0 |
-| High | 0 |
-| Medium | 0 |
-| Low | 0 |
-| **Total** | **0** |
+| High | 18 |
+| Medium | 62 |
+| Low | 92 |
+| **Total** | **172** |

 ## Module Status

 | Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
 |--------|---------------|--------|----------------|------|-------|
-| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
-| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 25 |
-| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 10 |
-| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
-| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 15 |
-| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
-| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
-| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
-| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
-| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
-| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 15 |
-| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
-| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
-| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 18 |
-| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 15 |
-| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
-| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 19 |
-| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 17 |
-| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
+| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/8 | 11 | 11 |
+| [CLI](CLI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/4 | 7 | 23 |
+| [CentralUI](CentralUI/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/5 | 8 | 33 |
+| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/4 | 4 | 14 |
+| [Commons](Commons/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/6 | 9 | 23 |
+| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/1/5 | 7 | 22 |
+| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/5 | 10 | 24 |
+| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/0 | 5 | 22 |
+| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/1/5 | 7 | 24 |
+| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/3 | 6 | 23 |
+| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 23 |
+| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 22 |
+| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/3/4 | 8 | 25 |
+| [ManagementService](ManagementService/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/3/2 | 6 | 23 |
+| [NotificationOutbox](NotificationOutbox/findings.md) | 2026-05-28 | `1eb6e97` | 0/2/5/3 | 10 | 10 |
+| [NotificationService](NotificationService/findings.md) | 2026-05-28 | `1eb6e97` | 0/2/2/3 | 7 | 25 |
+| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 21 |
+| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 6 |
+| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/6 | 9 | 23 |
+| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/4/3 | 7 | 26 |
+| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/3/3 | 7 | 24 |
+| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/1 | 6 | 22 |
+| [Transport](Transport/findings.md) | 2026-05-28 | `1eb6e97` | 0/3/5/4 | 12 | 12 |

 ## Pending Findings

@@ -80,14 +84,189 @@ description, location, recommendation — lives in the module's `findings.md`.

 _None open._

-### High (0)
+### High (18)

-_None open._
+| ID | Module | Title |
+|----|--------|-------|
+| CentralUI-028 | [CentralUI](CentralUI/findings.md) | `NotificationReport` and `SiteCallsReport` bypass `SiteScopeService` — Deployment role site-scoping defeated on the two new central-mirror pages |
+| Communication-016 | [Communication](Communication/findings.md) | `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires |
+| ConfigurationDatabase-015 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `NotificationOutboxRepository.InsertIfNotExistsAsync` is a check-then-act race with no duplicate-key catch |
+| DataConnectionLayer-018 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle |
+| DeploymentManager-018 | [DeploymentManager](DeploymentManager/findings.md) | Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover |
+| ExternalSystemGateway-018 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `DeliverBufferedAsync` lets `JsonException` propagate, turning a corrupt buffered row into a permanent retry-forever poison message |
+| InboundAPI-022 | [InboundAPI](InboundAPI/findings.md) | `IActiveNodeGate` has no production registration in Host — standby-node gating is silently disabled in production |
+| ManagementService-018 | [ManagementService](ManagementService/findings.md) | QueryAuditLogCommand has no role gate |
+| NotificationOutbox-001 | [NotificationOutbox](NotificationOutbox/findings.md) | `EmailNotificationDeliveryAdapter` inherits the OAuth2 empty-user SASL bug (NS-021) on the M365 send path |
+| NotificationOutbox-002 | [NotificationOutbox](NotificationOutbox/findings.md) | Dispatcher parks on first transient failure when `SmtpConfiguration.MaxRetries == 0` |
+| NotificationService-019 | [NotificationService](NotificationService/findings.md) | `NotificationDeliveryService` and `INotificationDeliveryService` are orphaned by the central-only redesign |
+| NotificationService-021 | [NotificationService](NotificationService/findings.md) | OAuth2 SASL constructed with empty user identifier; M365 SMTP will reject the auth handshake |
+| SiteEventLogging-016 | [SiteEventLogging](SiteEventLogging/findings.md) | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps |
+| StoreAndForward-018 | [StoreAndForward](StoreAndForward/findings.md) | Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant |
+| TemplateEngine-017 | [TemplateEngine](TemplateEngine/findings.md) | Revision hash and diff both ignore `Description` and `Connections`, defeating staleness detection for real deployment changes |
+| Transport-001 | [Transport](Transport/findings.md) | Template Overwrite never syncs attributes / alarms / scripts |
+| Transport-002 | [Transport](Transport/findings.md) | ExternalSystem Overwrite never syncs methods |
+| Transport-003 | [Transport](Transport/findings.md) | Unlock lockout is enforced only client-side; server session is never marked Locked |

-### Medium (0)
+### Medium (62)

-_None open._
+| ID | Module | Title |
+|----|--------|-------|
+| AuditLog-001 | [AuditLog](AuditLog/findings.md) | Combined-telemetry transport is plumbed end-to-end but never invoked in production |
+| AuditLog-004 | [AuditLog](AuditLog/findings.md) | `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows |
+| AuditLog-005 | [AuditLog](AuditLog/findings.md) | `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan |
+| CLI-017 | [CLI](CLI/findings.md) | `BundleCommands.RunBundleCommandAsync` duplicates `ExecuteCommandAsync` and breaks the auth exit-code contract |
+| CLI-018 | [CLI](CLI/findings.md) | `audit query` and `audit export` never return exit 2 for an authorization failure |
+| CLI-019 | [CLI](CLI/findings.md) | `bundle export` decodes the entire base64 bundle into memory before writing |
+| CentralUI-026 | [CentralUI](CentralUI/findings.md) | `AuditFilterBar` From/To filters treat browser-local datetimes as UTC |
+| CentralUI-027 | [CentralUI](CentralUI/findings.md) | Same UTC misinterpretation in `SiteCallsReport`, `NotificationReport`, and `EventLogs` |
+| Commons-015 | [Commons](Commons/findings.md) | `EncryptionMetadata` accepts any algorithm string and any iteration count |
+| Commons-017 | [Commons](Commons/findings.md) | `Component-Commons.md` is significantly stale (audit enums, new entities, new repositories, new service interfaces, new folders) |
+| Commons-019 | [Commons](Commons/findings.md) | New `*Utc`-suffixed `DateTime` columns on `AuditEvent` / `SiteCall` are not enforced as UTC; inconsistent with `Notification`'s `DateTimeOffset` |
+| Communication-017 | [Communication](Communication/findings.md) | `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up |
+| ConfigurationDatabase-016 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with the unpeppered `ApiKeyHasher.Default` |
+| ConfigurationDatabase-017 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Stub-attach delete on `DeploymentRecord` bypasses optimistic concurrency |
+| ConfigurationDatabase-018 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `DateTime`-typed `*Utc` columns on `AuditEvent` / `SiteCall` carry no `DateTimeKind` enforcement |
+| ConfigurationDatabase-019 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues, creating partition holes |
+| DataConnectionLayer-019 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations |
+| DataConnectionLayer-020 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe |
+| DataConnectionLayer-021 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight |
+| DataConnectionLayer-022 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts |
+| DeploymentManager-019 | [DeploymentManager](DeploymentManager/findings.md) | Lifecycle command timeout writes no audit entry |
+| ExternalSystemGateway-019 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `HttpClient.Timeout` is not set; `DefaultHttpTimeout` > 100s is silently clipped by the framework default |
+| ExternalSystemGateway-020 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `JsonElementToParameterValue` silently downcasts non-Int64 JSON numbers to `double`, losing precision for `decimal` SQL parameters on retry |
+| HealthMonitoring-017 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts |
+| HealthMonitoring-019 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface |
+| Host-016 | [Host](Host/findings.md) | Site `CentralContactPoints` second entry targets the site's own remoting port |
+| Host-017 | [Host](Host/findings.md) | Site-shutdown ordering from REQ-HOST-7 is not wired |
+| InboundAPI-018 | [InboundAPI](InboundAPI/findings.md) | `AuditWriteMiddleware` fires `WriteAsync` as `_ = task` — faulted async writes are unobserved |
+| InboundAPI-021 | [InboundAPI](InboundAPI/findings.md) | `ParentExecutionId` correlation flows only through `Call`; attribute reads/writes lose the inbound→site execution-tree link |
+| InboundAPI-025 | [InboundAPI](InboundAPI/findings.md) | `AuditWriteMiddleware` runs against the entire `/api/*` branch — emits spurious `ApiInbound` audit rows for `/api/audit/query` and `/api/audit/export` |
+| ManagementService-019 | [ManagementService](ManagementService/findings.md) | AuditEndpoints builds PermittedSiteIds but never enforces them |
+| ManagementService-020 | [ManagementService](ManagementService/findings.md) | UpdateSmtpConfig returns and audits the SMTP Credentials field verbatim |
+| ManagementService-021 | [ManagementService](ManagementService/findings.md) | Transport bundle handlers have zero test coverage |
+| NotificationOutbox-003 | [NotificationOutbox](NotificationOutbox/findings.md) | Dispatcher does not propagate a `CancellationToken` into delivery; in-flight SMTP sends cannot be cancelled on shutdown |
+| NotificationOutbox-004 | [NotificationOutbox](NotificationOutbox/findings.md) | `EmitAttemptAudit`/`EmitTerminalAudit` fire-and-forget pattern can outlive the per-sweep DI scope |
+| NotificationOutbox-005 | [NotificationOutbox](NotificationOutbox/findings.md) | Ingest persistence inherits the CD-015 check-then-act race; under contention the second writer throws and the site retries |
+| NotificationOutbox-007 | [NotificationOutbox](NotificationOutbox/findings.md) | `NotificationOutboxOptions.DispatchBatchSize`, `DeliveredKpiWindow`, and `PurgeInterval` are not in the design document |
+| NotificationOutbox-010 | [NotificationOutbox](NotificationOutbox/findings.md) | Comment claims `PipeTo` is not used "because the writer never throws"; the surrounding try/catch is dead-letter for the documented failure mode |
+| NotificationService-020 | [NotificationService](NotificationService/findings.md) | NS-001 fix superseded; `AkkaHostedService` would register two competing `Notification` S&F handlers if both code paths ran |
+| NotificationService-024 | [NotificationService](NotificationService/findings.md) | No test affirms the central-only invariant; the orphaned-path tests give a false coverage signal |
+| Security-016 | [Security](Security/findings.md) | `RoleMapper` silently drops the system-wide Deployment grant when a user is also in any site-scoped Deployment group |
+| Security-017 | [Security](Security/findings.md) | `SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are dead code from production callers — `[Authorize(Policy = RequireDeployment)]` does NOT enforce site scoping |
+| SiteCallAudit-001 | [SiteCallAudit](SiteCallAudit/findings.md) | SupervisorStrategy override is dead code; XML claims Resume that is not enforced |
+| SiteCallAudit-003 | [SiteCallAudit](SiteCallAudit/findings.md) | `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it |
+| SiteEventLogging-015 | [SiteEventLogging](SiteEventLogging/findings.md) | Background write queue is unbounded; can grow without limit under sustained writer slowness |
+| SiteEventLogging-017 | [SiteEventLogging](SiteEventLogging/findings.md) | Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale |
+| SiteRuntime-020 | [SiteRuntime](SiteRuntime/findings.md) | Second `DeployInstanceCommand` arriving during a pending redeploy races the still-terminating actor on its name |
+| SiteRuntime-021 | [SiteRuntime](SiteRuntime/findings.md) | `HandleDeployArtifacts` updates `DataConnections` in SQLite but never sends `CreateConnectionCommand` to the DCL |
+| SiteRuntime-022 | [SiteRuntime](SiteRuntime/findings.md) | `AuditingDbCommand.DbConnection.set` uses reflection to read `AuditingDbConnection._inner` |
+| SiteRuntime-024 | [SiteRuntime](SiteRuntime/findings.md) | `OperationTrackingStore` serialises all writes through one connection + `SemaphoreSlim`, and `Dispose()` does sync-over-async |
+| StoreAndForward-019 | [StoreAndForward](StoreAndForward/findings.md) | Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks" |
+| StoreAndForward-020 | [StoreAndForward](StoreAndForward/findings.md) | `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load |
+| StoreAndForward-021 | [StoreAndForward](StoreAndForward/findings.md) | Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime |
+| TemplateEngine-018 | [TemplateEngine](TemplateEngine/findings.md) | `DiffService` reports no entries for added/removed/changed connections |
+| TemplateEngine-019 | [TemplateEngine](TemplateEngine/findings.md) | `TemplateResolver.BuildInheritanceChain` still uses the `0`-as-no-parent sentinel that was removed from `CycleDetector` |
+| TemplateEngine-020 | [TemplateEngine](TemplateEngine/findings.md) | `Create*` audit entries are written with `EntityId = "0"` before `SaveChangesAsync` populates the real key |
+| TemplateEngine-021 | [TemplateEngine](TemplateEngine/findings.md) | `MoveTemplateAsync` skips folder cycle and sibling-name-collision validation |
+| Transport-004 | [Transport](Transport/findings.md) | `MaxUnlockAttemptsPerIpPerHour` option is declared but never enforced |
+| Transport-005 | [Transport](Transport/findings.md) | Manifest fields outside `ContentHash` are not bound to the encrypted payload |
+| Transport-006 | [Transport](Transport/findings.md) | Bundle ZIP read has no per-entry size cap or entry-count cap (zip-bomb / decompression-bomb) |
+| Transport-007 | [Transport](Transport/findings.md) | Failed import sessions retain decrypted plaintext for the full 30-minute TTL |
+| Transport-010 | [Transport](Transport/findings.md) | Critical Overwrite + cross-cutting paths uncovered by tests |

-### Low (0)
+### Low (92)

-_None open._
+| ID | Module | Title |
+|----|--------|-------|
+| AuditLog-002 | [AuditLog](AuditLog/findings.md) | `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider |
+| AuditLog-003 | [AuditLog](AuditLog/findings.md) | `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously |
+| AuditLog-006 | [AuditLog](AuditLog/findings.md) | `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock |
+| AuditLog-007 | [AuditLog](AuditLog/findings.md) | `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations |
+| AuditLog-008 | [AuditLog](AuditLog/findings.md) | Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain |
+| AuditLog-009 | [AuditLog](AuditLog/findings.md) | `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't |
+| AuditLog-010 | [AuditLog](AuditLog/findings.md) | Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream |
+| AuditLog-011 | [AuditLog](AuditLog/findings.md) | `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call |
+| CLI-020 | [CLI](CLI/findings.md) | `bundle export` success-envelope parse is unguarded |
+| CLI-021 | [CLI](CLI/findings.md) | `CliConfig.Load` crashes the CLI on a malformed config file |
+| CLI-022 | [CLI](CLI/findings.md) | `CommandTreeTests` excludes the two new command groups |
+| CLI-023 | [CLI](CLI/findings.md) | `Component-CLI.md` claims audit commands ride `POST /management`; implementation uses REST endpoints |
+| CentralUI-029 | [CentralUI](CentralUI/findings.md) | `ConfigurationAuditLog` uses `JS.InvokeAsync<int>("eval", ...)` instead of a dedicated JS module |
+| CentralUI-030 | [CentralUI](CentralUI/findings.md) | `SandboxConsoleCapture`'s per-call `StringWriter` is not thread-safe under intra-script concurrency |
+| CentralUI-031 | [CentralUI](CentralUI/findings.md) | `TransportImport` buffers the full bundle bytes in component state |
+| CentralUI-032 | [CentralUI](CentralUI/findings.md) | `AuditResultsGrid` paging is forward-only, no Previous button |
+| CentralUI-033 | [CentralUI](CentralUI/findings.md) | Drill-in / query-string code paths for the new Transport + SiteCalls pages are untested |
+| ClusterInfrastructure-011 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `SectionName` constant is decorative — no binding site references it |
+| ClusterInfrastructure-012 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Validator accepts `SeedNodes.Count == 1` despite design requiring both nodes as seeds |
+| ClusterInfrastructure-013 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Test uses catastrophic config values without an inline-intent comment |
+| ClusterInfrastructure-014 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `AddClusterInfrastructureActors` is dead surface — no caller, no behaviour |
+| Commons-016 | [Commons](Commons/findings.md) | `BundleSession.Locked` uses a magic `3` rather than a named constant |
+| Commons-018 | [Commons](Commons/findings.md) | `IOperationTrackingStore` and `IPartitionMaintenance` are at the root of `Interfaces/` instead of `Interfaces/Services/` |
+| Commons-020 | [Commons](Commons/findings.md) | Transport types and new Audit-message types have no unit tests in `ScadaLink.Commons.Tests` |
+| Commons-021 | [Commons](Commons/findings.md) | `ExternalCallResult.Response` has a benign lazy-parse race |
+| Commons-022 | [Commons](Commons/findings.md) | `IAuditCorrelationContext` references an unresolvable `BundleImporter.ApplyAsync` cref; JSON-blob columns have no documented shape |
+| Commons-023 | [Commons](Commons/findings.md) | Trailing-optional `SourceNode` on positional records mixes additive evolution patterns |
+| Communication-018 | [Communication](Communication/findings.md) | Site heartbeats hard-code `IsActive: true` regardless of node role |
+| Communication-019 | [Communication](Communication/findings.md) | `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository |
+| Communication-020 | [Communication](Communication/findings.md) | `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types |
+| Communication-021 | [Communication](Communication/findings.md) | `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try |
+| Communication-022 | [Communication](Communication/findings.md) | `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber |
+| ConfigurationDatabase-020 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `GetPartitionBoundariesOlderThanAsync` returns `DateTime` with `Kind=Unspecified` |
+| ConfigurationDatabase-021 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `SwitchOutPartitionAsync` interpolates `monthBoundary` / staging table name into raw SQL |
+| ConfigurationDatabase-022 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Stale "WP-24 Stub level sufficient for diff/staleness support" XML comment on `DeploymentManagerRepository` |
+| ConfigurationDatabase-023 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `AuditLog` correlation-index name drifts from design doc (`IX_AuditLog_CorrelationId` vs `IX_AuditLog_Correlation`) |
+| ConfigurationDatabase-024 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Missing test coverage for SPLIT-RANGE failure-continuation and production-shape rowversion delete |
+| DeploymentManager-020 | [DeploymentManager](DeploymentManager/findings.md) | `DeployReconciled` audit attributes the action to the prior deployer, not the current user |
+| DeploymentManager-021 | [DeploymentManager](DeploymentManager/findings.md) | `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing |
+| DeploymentManager-022 | [DeploymentManager](DeploymentManager/findings.md) | `Pending` and `InProgress` are written back-to-back with no intervening work |
+| DeploymentManager-023 | [DeploymentManager](DeploymentManager/findings.md) | `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site |
+| DeploymentManager-024 | [DeploymentManager](DeploymentManager/findings.md) | Test probe actors hold mutable static state across tests |
+| ExternalSystemGateway-021 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ApplyAuth` silently sends an unauthenticated request on unknown `AuthType`, empty `AuthConfiguration`, or malformed Basic config |
+| ExternalSystemGateway-022 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `new HttpMethod(method.HttpMethod)` accepts any string at runtime; an invalid HTTP verb fails only at call time |
+| ExternalSystemGateway-023 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | PATCH HTTP method is supported by code but absent from the design doc; body-vs-query decision drifts from the documented set |
+| HealthMonitoring-018 | [HealthMonitoring](HealthMonitoring/findings.md) | Same counter-reset-before-publish hazard in `CentralHealthReportLoop` |
+| HealthMonitoring-020 | [HealthMonitoring](HealthMonitoring/findings.md) | `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt` |
+| HealthMonitoring-021 | [HealthMonitoring](HealthMonitoring/findings.md) | `CentralSiteId = "central"` reserved constant silently collides with a real site named "central" |
+| HealthMonitoring-022 | [HealthMonitoring](HealthMonitoring/findings.md) | `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI |
+| HealthMonitoring-023 | [HealthMonitoring](HealthMonitoring/findings.md) | `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder |
+| Host-018 | [Host](Host/findings.md) | Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null |
+| Host-019 | [Host](Host/findings.md) | Migration `StartupRetry` call drops the host `CancellationToken` |
+| Host-020 | [Host](Host/findings.md) | `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel` |
+| Host-021 | [Host](Host/findings.md) | Microsoft `Logging:LogLevel` section in `appsettings.json` is dead config under Serilog |
+| Host-022 | [Host](Host/findings.md) | `ParseLevel` silently coerces unrecognised `MinimumLevel` to `Information` |
+| InboundAPI-019 | [InboundAPI](InboundAPI/findings.md) | `EnableBuffering()` called unconditionally on every request, including bodyless requests |
+| InboundAPI-020 | [InboundAPI](InboundAPI/findings.md) | `ContentType.Contains("json")` is case-sensitive; `application/JSON` with no Content-Length skips body parsing |
+| InboundAPI-023 | [InboundAPI](InboundAPI/findings.md) | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test coverage |
+| InboundAPI-024 | [InboundAPI](InboundAPI/findings.md) | `_knownBadMethods` is unbounded — an attacker can grow the cache by spamming distinct method names against the audit middleware path |
+| ManagementService-022 | [ManagementService](ManagementService/findings.md) | Design doc is stale on Transport bundle commands, /api/audit/* endpoints, and CommandTimeout |
+| ManagementService-023 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments unfiltered branch is N+1 on instance lookup |
+| NotificationOutbox-006 | [NotificationOutbox](NotificationOutbox/findings.md) | `ResolveAdapters` rebuilds the `NotificationType → adapter` dictionary on every dispatch sweep |
+| NotificationOutbox-008 | [NotificationOutbox](NotificationOutbox/findings.md) | `FallbackMaxRetries` / `FallbackRetryDelay` path is unreachable in production AND untested |
+| NotificationOutbox-009 | [NotificationOutbox](NotificationOutbox/findings.md) | `StuckAgeThreshold` XML-doc says "in-progress notification is re-claimed" — contradicts the design's display-only stuck detection |
+| NotificationService-022 | [NotificationService](NotificationService/findings.md) | `MailKitSmtpClientWrapper` holds a long-lived `SmtpClient`; combined with per-send factory, the design comment about pooling is contradicted |
+| NotificationService-023 | [NotificationService](NotificationService/findings.md) | XML docs on the orphaned classes still describe the removed site-delivery flow; misleading to maintainers |
+| NotificationService-025 | [NotificationService](NotificationService/findings.md) | `CredentialRedactor` over-masks: any 4-character credential component is masked anywhere it appears, including unrelated log text |
+| Security-018 | [Security](Security/findings.md) | Role names are hard-coded magic strings duplicated across `RoleMapper`, `SiteScopeAuthorizationHandler`, and `AuthorizationPolicies` |
+| Security-019 | [Security](Security/findings.md) | Service-account rebind failure is reported as "Invalid username or password" — masks misconfiguration as a user-credential error |
+| Security-020 | [Security](Security/findings.md) | `SecurityOptions` has no startup validation for required fields (`LdapServer`, `LdapSearchBase`) |
+| Security-021 | [Security](Security/findings.md) | `RequireHttpsCookie=false` dev opt-out has no warning path — an HTTP production deployment silently transmits the JWT bearer credential in cleartext |
+| SiteCallAudit-002 | [SiteCallAudit](SiteCallAudit/findings.md) | Singleton failover does not wait for in-flight async upserts |
+| SiteCallAudit-004 | [SiteCallAudit](SiteCallAudit/findings.md) | Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift |
+| SiteCallAudit-005 | [SiteCallAudit](SiteCallAudit/findings.md) | `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing |
+| SiteCallAudit-006 | [SiteCallAudit](SiteCallAudit/findings.md) | Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor |
+| SiteEventLogging-018 | [SiteEventLogging](SiteEventLogging/findings.md) | `FailedWriteCount` is exposed but never consumed by Health Monitoring |
+| SiteEventLogging-019 | [SiteEventLogging](SiteEventLogging/findings.md) | `EventLogPurgeService` runs on every host node; design says "active node" |
+| SiteEventLogging-020 | [SiteEventLogging](SiteEventLogging/findings.md) | `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced |
+| SiteEventLogging-021 | [SiteEventLogging](SiteEventLogging/findings.md) | `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales |
+| SiteEventLogging-022 | [SiteEventLogging](SiteEventLogging/findings.md) | `Cache=Shared` is redundant for a single-connection logger |
+| SiteEventLogging-023 | [SiteEventLogging](SiteEventLogging/findings.md) | Concurrent-stress test uses a non-volatile `stop` flag |
+| SiteRuntime-023 | [SiteRuntime](SiteRuntime/findings.md) | `Convert.ToDouble(value)` in trigger and alarm evaluation is locale-sensitive |
+| SiteRuntime-025 | [SiteRuntime](SiteRuntime/findings.md) | `HandleSetStaticAttribute` persists unknown attribute names as static overrides |
+| SiteRuntime-026 | [SiteRuntime](SiteRuntime/findings.md) | `ReplicationMessages.cs` public record types have no XML documentation |
+| StoreAndForward-022 | [StoreAndForward](StoreAndForward/findings.md) | `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId` |
+| StoreAndForward-023 | [StoreAndForward](StoreAndForward/findings.md) | `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation |
+| StoreAndForward-024 | [StoreAndForward](StoreAndForward/findings.md) | `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown |
+| TemplateEngine-022 | [TemplateEngine](TemplateEngine/findings.md) | `LockEnforcer.ValidateLockChange` enforces "once-locked-stays-locked" for `IsLocked` but not for `LockedInDerived` |
+| Transport-008 | [Transport](Transport/findings.md) | `PreviewAsync` issues an N+1 `GetTemplateWithChildrenAsync` per matching template name |
+| Transport-009 | [Transport](Transport/findings.md) | `IAuditCorrelationContext.BundleImportId` is mutated on the same scoped instance the AuditService reads |
+| Transport-011 | [Transport](Transport/findings.md) | Design doc's Step-1 manifest preview promises decryption-free preview, but `LoadAsync` reads and validates content before passphrase |
+| Transport-012 | [Transport](Transport/findings.md) | "Bundle Import" filter promised in design doc not surfaced in Configuration Audit Log Viewer UI |
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.Security` |
 | Design doc | `docs/requirements/Component-Security.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 (1 deferred — Security-008) |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 6 (1 deferred — Security-008) |

 ## Summary

@@ -48,6 +48,36 @@ omits the separate idle check (Security-014). The two Low findings concern fragi
 DN parsing of group names containing escaped commas and an un-trimmed username flowing
 into the LDAP filter, fallback DN, and JWT claims.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Re-reviewed the module on a fresh baseline. All Security-001..007, 009..015 fixes remain
+in place; the only Open carry-over is Security-008 (still correctly **Deferred** —
+`ISecurityRepository` still exposes no per-set scope-rule query, so the N+1 in
+`RoleMapper` cannot be removed from within this module). The original
+Security-014 fix is now load-bearing: `RefreshToken` calls `IsIdleTimedOut` before
+re-issuing, and the new cookie sliding-expiry tests in `SecurityReviewRegressionTests`
+pin CentralUI-005's Security-side contract. This pass surfaced **6 new findings**
+(Security-016..021): one Medium correctness/security defect, one Medium design-adherence
+defect, and four Low. The most consequential is **Security-016** — when a user is
+mapped to *both* a system-wide Deployment LDAP group (e.g. `SCADA-Deploy-All`) and a
+site-scoped Deployment LDAP group (e.g. `SCADA-Deploy-SiteA`), `RoleMapper` silently
+treats the union as site-scoped (the system-wide grant is dropped); the design's
+"multiple groups grant multiple independent roles" intent is not honoured for this
+mix-and-match case. **Security-017** is the cross-module partner of CentralUI-028:
+`SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are declared and registered
+but no production caller ever instantiates them — `[Authorize(Policy = RequireDeployment)]`
+*does not* enforce the documented site scoping, so callers must remember to inject
+`SiteScopeService` and re-check `IsSiteAllowedAsync` themselves (which the two new
+report pages flagged by CentralUI-028 forgot to do). The remaining Lows are: role names
+are magic strings duplicated across `RoleMapper`, `SiteScopeAuthorizationHandler`, and
+`AuthorizationPolicies` (Security-018); a service-account-rebind failure is reported
+to the user as "Invalid username or password" — masking a misconfiguration as a
+user-credential error (Security-019); required `SecurityOptions` fields
+(`LdapServer`, `LdapSearchBase`) have no `IValidateOptions` startup check, so empty
+values silently surface only on first login (Security-020); and the
+`RequireHttpsCookie=false` dev opt-out emits no warning, so an HTTP production
+deployment silently transmits the JWT bearer credential in cleartext (Security-021).
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -63,6 +93,21 @@ into the LDAP filter, fallback DN, and JWT claims.
 | 9 | Testing coverage | ☑ | No tests for `RoleMapper` N+1 behavior, DN-injection inputs, StartTLS path, or idle-timeout-after-refresh. Insecure-config combinations under-tested (Security-011). |
 | 10 | Documentation & comments | ☑ | `SecurityOptions` XML docs say direct bind uses `cn={username}` while the search filter uses `uid=` — comment is misleading (covered under Security-004). |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | `RoleMapper` drops a system-wide Deployment grant when the user is also in any site-scoped Deployment group (Security-016); hard-coded role-name string `"Deployment"` in two separate places allows a refactor to silently break site scoping (Security-018). |
+| 2 | Akka.NET conventions | ☑ | No actors. `AddSecurityActors` is still a registration placeholder. No issues. |
+| 3 | Concurrency & thread safety | ☑ | Services stateless; LDAP sync calls wrapped in `Task.Run` with the now-bounded timeout (Security-009 resolution holds). No issues found. |
+| 4 | Error handling & resilience | ☑ | A service-account-rebind failure inside `AuthenticateAsync` is reported as "Invalid username or password", masking a misconfiguration as a user-credential error (Security-019). LDAP-failure rule + partial-outage path remain correctly enforced post-Security-012. |
+| 5 | Security | ☑ | `SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are dead code — no policy is registered that uses them and no production caller instantiates them, so declarative `[Authorize]` does not enforce site scoping (Security-017, cross-module partner of CentralUI-028). `RequireHttpsCookie=false` dev opt-out has no warning path — a production misconfiguration silently transmits the JWT bearer credential over HTTP (Security-021). |
+| 6 | Performance & resource management | ☑ | Security-008 N+1 remains correctly Deferred (still gated on `ISecurityRepository`). No new perf issues. |
+| 7 | Design-document adherence | ☑ | `RoleMapper`'s drop-system-wide-on-any-scoped behaviour (Security-016) contradicts the design's "A user can hold multiple roles simultaneously … roles are independent — there is no implied hierarchy" rule for the union case; `SiteScopeRequirement` advertises a site-scope authorization pattern the implementation does not actually wire up (Security-017). |
+| 8 | Code organization & conventions | ☑ | Role-name strings are duplicated as magic literals across `RoleMapper.cs`, `SiteScopeAuthorizationHandler.cs`, and `AuthorizationPolicies.cs` — only the audit roles have a single source of truth via `OperationalAuditRoles` / `AuditExportRoles` (Security-018). `SecurityOptions` defaults pass through to runtime with no `IValidateOptions` for required fields like `LdapServer` / `LdapSearchBase` (Security-020). |
+| 9 | Testing coverage | ☑ | No test covers a user mapped to both a system-wide AND a site-scoped Deployment LDAP group (the Security-016 case). No test covers the `SiteScopeRequirement` cross-page integration — tests evaluate the handler in isolation, not the absence of a policy that uses it (Security-017). |
+| 10 | Documentation & comments | ☑ | `SiteScopeAuthorizationHandler` XML doc describes a permission model no caller actually invokes (Security-017). Otherwise stable. |
+
 ## Findings

 ### Security-001 — StartTLS upgrade path is unreachable dead code
@@ -654,3 +699,226 @@ use the single canonical identity. Regression tests
 `NormalizeUsername_TrimsLeadingAndTrailingWhitespace`,
 `BuildFallbackUserDn_TrimmedUsername_NoLeadingTrailingSpace`,
 `AuthenticateAsync_UsernameWithSurroundingWhitespace_StillRejectedForInsecure`.
+
+### Security-016 — `RoleMapper` silently drops the system-wide Deployment grant when a user is also in any site-scoped Deployment group
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Security/RoleMapper.cs:30-31`, `:41-55`, `:59` |
+
+**Description**
+
+`MapGroupsToRolesAsync` resolves the Deployment role's site scope as a single
+`isSystemWide = hasDeploymentRole && !hasDeploymentWithScopeRules` flag computed across
+ALL matched Deployment mappings. If a user is a member of both a system-wide Deployment
+group (e.g. `SCADA-Deploy-All`, no scope rules) AND a site-scoped Deployment group
+(e.g. `SCADA-Deploy-SiteA`, one scope rule for Site A), the second mapping sets
+`hasDeploymentWithScopeRules = true`, so the final `isSystemWide` becomes `false` and
+the returned `PermittedSiteIds` is just `[SiteA]`. The system-wide grant from
+`SCADA-Deploy-All` is silently dropped — the user loses access to every other site, even
+though one of their LDAP groups was intended to grant them system-wide reach. This
+contradicts the design's "A user can hold multiple roles simultaneously … roles are
+independent — there is no implied hierarchy" intent: the union of grants should be the
+broadest grant in the set, not the narrowest. The mistake is also non-obvious to an
+operator: from the Admin → LDAP Mappings page nothing flags that adding a site-scoped
+Deployment mapping for a user already in `SCADA-Deploy-All` *removes* sites from their
+effective grant. The downstream `SiteScopeService.IsSystemWideAsync` / `FilterSitesAsync`
+faithfully reproduce this narrowing, so the user can no longer see or act on sites
+outside `[SiteA]`.
+
+**Recommendation**
+
+Track the union semantics explicitly: if any matched Deployment mapping has no scope
+rules, the user is system-wide regardless of what the other mappings specify. The simplest
+change is to add a flag `hasUnscopedDeploymentMapping`, set whenever a matched Deployment
+mapping has no scope rules, and then compute
+`isSystemWide = hasUnscopedDeploymentMapping || (hasDeploymentRole && !hasDeploymentWithScopeRules)`.
+Equivalently: collect per-mapping `(hasRules, scopedSiteIds)` first, then
+`isSystemWide = any mapping has hasRules==false`, and `permittedSiteIds = union of all
+scopedSiteIds` (left empty for system-wide users). Add a regression test
+`MapGroupsToRoles_UserInBothSystemWideAndScopedDeploymentGroup_IsSystemWide` covering
+the design's example pair `SCADA-Deploy-All` + `SCADA-Deploy-SiteA`.
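+
+The collect-then-union variant above can be sketched as follows. This is a minimal
+illustration, not the module's actual code: `DeploymentMapping`, `DeploymentScope.Resolve`,
+and the use of `int` for site ids are all hypothetical names and types chosen for the
+example, and the real `RoleMapper` shapes may differ.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical shape: one entry per matched Deployment mapping.
+// An empty ScopedSiteIds collection means the mapping has no scope rules.
+public sealed record DeploymentMapping(string GroupName, IReadOnlyCollection<int> ScopedSiteIds);
+
+public static class DeploymentScope
+{
+    // Union semantics: any unscoped mapping makes the user system-wide;
+    // otherwise the permitted sites are the union of every mapping's scoped sites.
+    public static (bool IsSystemWide, IReadOnlyCollection<int> PermittedSiteIds) Resolve(
+        IEnumerable<DeploymentMapping> matchedDeploymentMappings)
+    {
+        var mappings = matchedDeploymentMappings.ToList();
+
+        // False when there are no Deployment mappings at all (no Deployment role).
+        bool isSystemWide = mappings.Any(m => m.ScopedSiteIds.Count == 0);
+
+        // Left empty for system-wide users, as in the recommendation above.
+        IReadOnlyCollection<int> permittedSiteIds = isSystemWide
+            ? Array.Empty<int>()
+            : mappings.SelectMany(m => m.ScopedSiteIds).Distinct().ToArray();
+
+        return (isSystemWide, permittedSiteIds);
+    }
+}
+```
+
+Under these assumptions, passing one unscoped mapping (`SCADA-Deploy-All`) together with
+one mapping scoped to a single site (`SCADA-Deploy-SiteA`) resolves to a system-wide
+grant, which is the behaviour the proposed regression test would pin.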
+
+### Security-017 — `SiteScopeRequirement` / `SiteScopeAuthorizationHandler` are dead code from production callers — `[Authorize(Policy = RequireDeployment)]` does NOT enforce site scoping
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Security/SiteScopeAuthorizationHandler.cs:8-58`; `src/ScadaLink.Security/AuthorizationPolicies.cs:113-143` |
+
+**Description**
+
+The module declares `SiteScopeRequirement` (an `IAuthorizationRequirement` carrying a
+`TargetSiteId`) and the matching `SiteScopeAuthorizationHandler` that combines the
+Deployment role claim with the `SiteId` claims to enforce the design's site-scoping
+rule. The handler is registered in `AddScadaLinkAuthorization`
+(`services.AddSingleton<IAuthorizationHandler, SiteScopeAuthorizationHandler>()`). But
+no `AddPolicy` call ever wires the requirement to a named policy, and a grep across
+`src/ScadaLink.CentralUI` and `src/ScadaLink.ManagementService` confirms that **no
+production code ever instantiates `new SiteScopeRequirement(...)` or calls
+`AuthorizeAsync(...)` with one** — the only callers are the unit tests in
+`SecurityTests.cs:1146,1166,1185,1203`. The design doc and CLAUDE.md state that "Deployment
+and Monitoring pages must filter every site/instance list through `FilterSitesAsync`
+and re-check `IsSiteAllowedAsync` before any cross-site command", and the
+CentralUI-028 finding (High, Open) confirms this is exactly the contract two new
+report pages forgot — because there is no declarative `[Authorize(Policy = ...)]`
+shortcut, callers must remember to inject `SiteScopeService` and write the check by
+hand, and any new page that forgets is a silent regression with no compile-time or
+test-time signal. The module's published surface advertises an authorization-handler
+pattern that is, in practice, unwired plumbing.
+
+**Recommendation**
+
+Either (a) **delete** `SiteScopeRequirement` and `SiteScopeAuthorizationHandler` (and
+the dead `IAuthorizationHandler` registration) and document `SiteScopeService` as the
+sole site-scoping mechanism — this is the smaller change and matches what the codebase
+actually does today; or, preferably, (b) **finish the wiring**: add a `RequireSiteScope`
+policy that uses `SiteScopeRequirement` and provide a small helper / source generator
+or analyzer that flags Deployment-policy-attributed pages without a site-scope check.
+Either way, address the cross-module gap: CentralUI-028 stays open until production
+pages reliably enforce the rule. If (b) is chosen, a route-parameter-aware
+`IAuthorizationPolicyProvider` is needed so the policy can read the target site id from
+the request — that is a meaningful design extension and would need to be planned
+alongside the Central UI's existing `SiteScopeService` usage rather than replacing it
+piecemeal.
+
+### Security-018 — Role names are hard-coded magic strings duplicated across `RoleMapper`, `SiteScopeAuthorizationHandler`, and `AuthorizationPolicies`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.Security/RoleMapper.cs:41`; `src/ScadaLink.Security/SiteScopeAuthorizationHandler.cs:36`; `src/ScadaLink.Security/AuthorizationPolicies.cs:118,121,124,95,107` |
+
+**Description**
+
+The role-name literals `"Admin"`, `"Design"`, `"Deployment"`, `"Audit"`, and
+`"AuditReadOnly"` are duplicated as magic strings across three separate files:
+`RoleMapper.cs:41` hard-codes `"Deployment"` to detect the site-scope branch;
+`SiteScopeAuthorizationHandler.cs:36` independently hard-codes `"Deployment"` to gate
+the handler; and `AuthorizationPolicies.cs:118,121,124` hard-code the four role names
+as the policy `RequireClaim` values. Only the audit roles have a single source of truth
+(via the `OperationalAuditRoles` / `AuditExportRoles` arrays on
+`AuthorizationPolicies`). A future rename or addition of a role that misses any one of
+these call sites silently breaks the system: e.g. renaming "Deployment" → "Deployer"
+in `RoleMapper` alone would leave the policy still requiring `"Deployment"` (logins
+get the new role name but the policy never matches), while changing it in the policy
+alone would leave `RoleMapper` failing to populate scope rules for the renamed role.
+The bug class is "string drift" — exactly the kind the `OperationalAuditRoles` constant
+was introduced to prevent.
+
+**Recommendation**
+
+Introduce a `public static class Roles { public const string Admin = "Admin"; public const
+string Design = "Design"; public const string Deployment = "Deployment"; public const string
+Audit = "Audit"; public const string AuditReadOnly = "AuditReadOnly"; }` in the Security
+project and replace every magic-string occurrence — including the elements of
+`OperationalAuditRoles` and `AuditExportRoles` — with the constants. A single rename
+will then either succeed everywhere or fail to compile.
+
+### Security-019 — Service-account rebind failure is reported as "Invalid username or password" — masks misconfiguration as a user-credential error
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Security/LdapAuthService.cs:85-89`, `:147-151` |
+
+**Description**
+
+After the user's credentials bind successfully, `AuthenticateAsync` re-binds as the
+configured service account to perform the group/attribute search
+(`connection.Bind(_options.LdapServiceAccountDn, _options.LdapServiceAccountPassword)`).
+A failure of this second bind (wrong service-account password, deleted or disabled
+service account, locked-out service account) throws `LdapException`, which is caught
+by the broad outer `catch (LdapException)` and returned as
+`new LdapAuthResult(false, null, username, null, "Invalid username or password.")`.
+The user sees an "invalid credentials" message for *their* credentials even though
+their bind succeeded and the failure was in the system's own service-account
+configuration. Worse, every user attempting to log in sees the same incorrect message
+during a service-account outage, which routes operators down the wrong incident path
+(reset the user's password) instead of the right one (check the service-account
+credentials). The successful user bind itself is also not auditable as a discrete
+event because the result is "Invalid username or password" — indistinguishable from a
+genuine bad-password attempt.
+
+**Recommendation**
+
+Wrap the service-account rebind in its own `try`/`catch (LdapException)` and surface a
+distinct error: log `_logger.LogError(ex, "Service-account rebind failed; check
+LdapServiceAccountDn / LdapServiceAccountPassword configuration")` and return
+`new LdapAuthResult(false, null, username, null, "Authentication service is misconfigured. Contact an administrator.")`.
+Add a regression test that exercises the service-account-bind failure path (a mocked
+or seamed `LdapConnection.Bind` that throws on the second call) and asserts the
+distinct error message.
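+
+A minimal sketch of the two-stage handling recommended above, assuming the bind call
+can be seamed behind a delegate. `LdapException` here is a local stand-in for the LDAP
+library's exception type, and `AuthOutcome`, `LdapAuthSketch`, and the `bind` /
+`logError` delegates are hypothetical names introduced only for illustration; the real
+`AuthenticateAsync` returns `LdapAuthResult` and would log through `_logger`.
+
+```csharp
+using System;
+
+// Local stand-in so the sketch compiles on its own; the real code catches the
+// LDAP library's LdapException.
+public sealed class LdapException : Exception
+{
+    public LdapException(string message) : base(message) { }
+}
+
+// Hypothetical, trimmed-down result shape (the real LdapAuthResult has more fields).
+public sealed record AuthOutcome(bool Success, string? Error);
+
+public static class LdapAuthSketch
+{
+    public const string InvalidCredentials = "Invalid username or password.";
+    public const string Misconfigured =
+        "Authentication service is misconfigured. Contact an administrator.";
+
+    public static AuthOutcome Authenticate(
+        Action<string, string> bind,          // seam over connection.Bind(dn, password)
+        string userDn, string userPassword,
+        string serviceDn, string servicePassword,
+        Action<Exception, string> logError)   // seam over _logger.LogError
+    {
+        try
+        {
+            // Stage 1: the user's own bind. A failure here is a credential error.
+            bind(userDn, userPassword);
+        }
+        catch (LdapException)
+        {
+            return new AuthOutcome(false, InvalidCredentials);
+        }
+
+        try
+        {
+            // Stage 2: the service-account rebind. A failure here is a system fault.
+            bind(serviceDn, servicePassword);
+        }
+        catch (LdapException ex)
+        {
+            logError(ex, "Service-account rebind failed; check LdapServiceAccountDn / "
+                         + "LdapServiceAccountPassword configuration");
+            return new AuthOutcome(false, Misconfigured);
+        }
+
+        return new AuthOutcome(true, null);
+    }
+}
+```
+
+A delegate that throws only on its second invocation exercises the service-account
+path in a unit test without a live directory, which is the regression test the
+recommendation asks for.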
+
+### Security-020 — `SecurityOptions` has no startup validation for required fields (`LdapServer`, `LdapSearchBase`)
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Security/SecurityOptions.cs:6-7`, `:36-37`; `src/ScadaLink.Security/ServiceCollectionExtensions.cs:13-30` |
+
+**Description**
+
+`SecurityOptions.JwtSigningKey` correctly fails fast at `JwtTokenService` construction
+(Security-003 fix), but the LDAP-side required fields — `LdapServer` (default
+`string.Empty`) and `LdapSearchBase` (default `string.Empty`) — have no equivalent
+guard. `AddSecurity` does not register an `IValidateOptions<SecurityOptions>`. A
+deployment that fails to set `LdapServer` (a typo in the appsettings.json section name,
+a missing environment-variable substitution, a misconfigured Docker compose file)
+starts cleanly — the Central UI comes up, the login page loads, and only the first
+authentication attempt fails with `LdapConnection.Connect("")` throwing a low-level
+exception that bubbles up as the generic "An unexpected error occurred during
+authentication." message. The misconfiguration surfaces minutes or hours into the
+deploy, on the first real user login, rather than at startup where it is cheap to
+diagnose.
+
+**Recommendation**
+
+Add an `IValidateOptions<SecurityOptions>` implementation, register it in `AddSecurity`,
+and enable eager validation via
+`services.AddOptions<SecurityOptions>().ValidateOnStart()` so startup fails when
+`LdapServer` is null/whitespace, `LdapSearchBase` is null/whitespace, or
+`LdapPort <= 0`. Combine with the existing `JwtTokenService` constructor check so
+every required `SecurityOptions` field is enforced at startup, not at first use.
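+
+The validator could look roughly like the sketch below. It assumes the
+`Microsoft.Extensions.Options` and `Microsoft.Extensions.DependencyInjection` packages
+on .NET 8 or later. The `SecurityOptions` class shown is a stand-in carrying only the
+three fields named in this finding (the `389` port default is an assumption), and
+`SecurityOptionsValidator` / `AddSecurityOptionsValidation` are hypothetical names.
+
+```csharp
+using System.Collections.Generic;
+using Microsoft.Extensions.DependencyInjection;
+using Microsoft.Extensions.Options;
+
+// Stand-in with only the fields this finding names; the real class has more.
+public sealed class SecurityOptions
+{
+    public string LdapServer { get; set; } = string.Empty;
+    public string LdapSearchBase { get; set; } = string.Empty;
+    public int LdapPort { get; set; } = 389; // assumed default, for illustration
+}
+
+public sealed class SecurityOptionsValidator : IValidateOptions<SecurityOptions>
+{
+    public ValidateOptionsResult Validate(string? name, SecurityOptions options)
+    {
+        // Collect every problem so one failed startup reports all of them.
+        var failures = new List<string>();
+        if (string.IsNullOrWhiteSpace(options.LdapServer))
+            failures.Add("SecurityOptions.LdapServer is required.");
+        if (string.IsNullOrWhiteSpace(options.LdapSearchBase))
+            failures.Add("SecurityOptions.LdapSearchBase is required.");
+        if (options.LdapPort <= 0)
+            failures.Add("SecurityOptions.LdapPort must be greater than zero.");
+
+        return failures.Count > 0
+            ? ValidateOptionsResult.Fail(failures)
+            : ValidateOptionsResult.Success;
+    }
+}
+
+public static class SecurityValidationRegistration
+{
+    public static IServiceCollection AddSecurityOptionsValidation(
+        this IServiceCollection services)
+    {
+        services.AddSingleton<IValidateOptions<SecurityOptions>, SecurityOptionsValidator>();
+        // ValidateOnStart runs the registered validators when the host starts,
+        // instead of on first resolution of IOptions<SecurityOptions>.Value.
+        services.AddOptions<SecurityOptions>().ValidateOnStart();
+        return services;
+    }
+}
+```
+
+Keeping the rules in a plain class means they can be unit-tested without building a
+host: construct the validator, pass an options instance, and assert on the result.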
+
+### Security-021 — `RequireHttpsCookie=false` dev opt-out has no warning path — an HTTP production deployment silently transmits the JWT bearer credential in cleartext
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.Security/SecurityOptions.cs:100-108`; `src/ScadaLink.Security/ServiceCollectionExtensions.cs:54-59` |
+
+**Description**
+
+The Security-002 fix added `RequireHttpsCookie` (default `true`) so the auth cookie's
+`SecurePolicy` is `Always` in production. The current Docker dev cluster sets
+`RequireHttpsCookie=false` in both central nodes' `appsettings.Central.json`, downgrading
+to `SameAsRequest` so the local HTTP cluster works. The downgrade is documented in the
+XML doc but is silent at runtime: no log line warns that the cookie carrying the JWT
+bearer credential is being sent over plain HTTP. A production deployment that
+inherits a dev-derived appsettings — or that copy-pastes the docker config and forgets
+to flip the flag — transmits the session token in cleartext with no diagnostic signal.
+The default is correct; the gap is that the unsafe override has no operational guard.
+
+**Recommendation**
+
+In the `PostConfigure` block in `AddSecurity`, when `RequireHttpsCookie == false`, log
+a single startup warning along the lines of `_logger.LogWarning("RequireHttpsCookie is
+DISABLED — auth cookie SecurePolicy is SameAsRequest. The cookie-embedded JWT will be
+transmitted over plain HTTP. This setting is intended for local dev only — set
+SecurityOptions:RequireHttpsCookie=true in production.")`. Optionally, also fail
+startup when `RequireHttpsCookie=false` AND `ASPNETCORE_ENVIRONMENT=Production`. Add a
+regression test that asserts the warning is emitted when the flag is disabled and not
+when it is enabled.
@@ -0,0 +1,322 @@
+# Code Review — SiteCallAudit
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.SiteCallAudit` |
+| Design doc | `docs/requirements/Component-SiteCallAudit.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-28 |
+| Reviewer | claude-agent |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 6 |
+
+## Summary
+
+The module is small (one actor + DI extension + options class). The actor is a
+central cluster singleton that exposes three responsibility groups: direct
+`UpsertSiteCallCommand` ingest, paginated/KPI read handlers, and the central→site
+Retry/Discard relay. Ingest idempotency is delegated to the repository's
+monotonic-upsert (the CD-015 check-then-act window is mitigated by the
+duplicate-key swallow on the insert leg). Findings cluster around two themes:
+(a) the `SupervisorStrategy` override is dead code that contradicts the XML
+docstring — it governs children, and this actor has none, so the documented
+"Resume on leaked exception" promise is unenforced; (b) several smaller drifts
+between the design doc and the code (reconciliation puller + daily purge
+schedule are still deferred; `OnUpsertAsync` does not stamp `IngestedAtUtc`
+unlike the dual-write path). The relay path is well covered by Akka TestKit
+unit tests; the ingest + KPI paths are covered by MSSQL-backed integration
+tests using a shared `MsSqlMigrationFixture`.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | Yes | `OnUpsertAsync` does not refresh `IngestedAtUtc` (Finding 003). |
+| 2 | Akka.NET conventions | Yes | `SupervisorStrategy()` override is dead code (Finding 001). `Sender` correctly captured before first await on every handler. `PipeTo` used for read replies. |
+| 3 | Concurrency & thread safety | Yes | `_centralCommunication` mutated only on actor thread via `RegisterCentralCommunication`. DI scope-per-message disposed in `try/finally`. No issues found. |
+| 4 | Error handling & resilience | Yes | Ingest catches all + replies `Accepted=false`. Relay distinguishes `SiteUnreachable` vs `OperationFailed`. Failover handover does not wait for in-flight async work (Finding 002). |
+| 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond `SourceSite`. No issues found. |
+| 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. `MaxPageSize=200` clamp present. No issues found. |
+| 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). |
+| 8 | Code organization & conventions | Yes | `RegisterCentralCommunication` is a top-level record colocated with the actor — by design (carries `IActorRef`, cannot live in Commons). No issues found. |
+| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
+| 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` returns `ack.ErrorMessage` instead of throwing (Finding 005). |
+
+## Findings
+
+### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:32-46`, `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:147-151` |
+
+**Description**
+
+The XML remarks block (lines 32-46) states:
+
+> "The `SupervisorStrategy` uses `Resume` so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."
+
+The override at lines 147-151 returns a `OneForOneStrategy` with `DefaultDecider`
+and `maxNrOfRetries: 0`. Two problems compound:
+
+1. `ActorBase.SupervisorStrategy()` governs the actor's **children**, not the
+   actor itself. `SiteCallAuditActor` creates no children, so this override is
+   dead code.
+2. The returned strategy uses `DefaultDecider` (Restart for most exceptions),
+   **not** `Directive.Resume`. So even if the actor did have children, the
+   strategy would not be Resume — it would be the default Restart-on-most-faults
+   behaviour with `maxNrOfRetries: 0` (which forces a Stop after the first
+   failure).
+
+Net effect: supervision of the actor itself is whatever the parent supplies
+(`SupervisorStrategy.DefaultDecider` from the singleton manager / user
+guardian in tests), which Restarts on most exceptions. If the `try/catch` in
+`OnUpsertAsync` ever leaked (e.g. a synchronous throw constructing `replyTo`),
+the actor would Restart, reset `_centralCommunication` to null, and silently
+break the relay until `RegisterCentralCommunication` runs again.
+
+This same pattern (with the same misleading XML doc) exists in
+`AuditLogIngestActor`, `AuditLogPurgeActor`, and `SiteAuditReconciliationActor`
+— they were likely cargo-culted; this finding documents the local instance.
+
+**Recommendation**
+
+Either:
+
+- Remove the `SupervisorStrategy()` override entirely (it does nothing useful)
+  and revise the XML comment to drop the "Resume" claim. Supervising this actor
+  is the parent's concern (the cluster singleton manager); the `try/catch` in
+  `OnUpsertAsync` is what actually keeps the actor alive.
+- Or, if Resume-on-self-throw is actually desired, that requires wiring a
+  custom supervisor in the parent (`ClusterSingletonManager`) — not overriding
+  `SupervisorStrategy()` here. Simpler path: keep the `try/catch`, drop the
+  override.
+
+The CLAUDE.md "Resume for coordinator actors" decision applies to actors with
+children (Site Runtime hierarchy) — not to leaf cluster singletons.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:455-462` (singleton wiring), `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
+
+**Description**
+
+The singleton is created with `terminationMessage: PoisonPill.Instance`. On
+failover the active node's singleton stops as soon as the mailbox is drained
+of normal messages and the PoisonPill is processed. An in-flight
+`OnUpsertAsync` Task started before the PoisonPill arrived is neither awaited
+nor cancelled: the message handler returns synchronously from the mailbox's
+view, so the actor can stop while the EF Core `ExecuteSqlInterpolatedAsync`
+call is still running, and that call then runs to completion on the old node.
+
+Two consequences:
+
+1. The new singleton on the other node may begin accepting
+   `UpsertSiteCallCommand` for the same `TrackedOperationId` while the old
+   singleton's in-flight upsert is still running. The repository's
+   monotonic-upsert and the SQL duplicate-key swallow protect storage state.
+2. The original `replyTo` sender may receive its `Accepted=true` after the new
+   singleton has already returned a different reply. Idempotency keys protect
+   correctness; wire-level ordering is best-effort by design.
+
+This is consistent with the design ("eventually-consistent mirror, sites are
+source of truth"), but worth documenting as an explicit invariant. The
+Notification Outbox sibling has the same pattern.
+
+**Recommendation**
+
+- Document the failover/handover semantics in the actor's XML remarks: "On
+  cluster singleton handover, in-flight `OnUpsertAsync` tasks complete on the
+  old node and may produce a late `Accepted=true` reply; the repository's
+  monotonic upsert ensures storage state is consistent."
+- Add an integration test that deliberately races two concurrent upserts on
+  the same `TrackedOperationId` to verify the duplicate-key swallow +
+  monotonic rank check (the CD-015 race-pattern check the parent task
+  flagged).
+
+**Resolution**
+
+_Unresolved._
+
+### SiteCallAudit-003 — `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
+
+**Description**
+
+The combined-telemetry hot path (`AuditLogIngestActor.OnCachedTelemetryAsync`)
+stamps `IngestedAtUtc = DateTime.UtcNow` on both the `AuditLog` row and the
+`SiteCall` row at central-side persist time
+(`src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:238-239`). The design
+doc treats `IngestedAtUtc` as "central ingested (or last refreshed) this row"
+— a central-side timestamp.
+
+`SiteCallAuditActor.OnUpsertAsync` writes the supplied `SiteCall` as-is, with
+whatever `IngestedAtUtc` the caller stamped. The only current callers are the
+unit tests (which use `DateTime.UtcNow` at command-construction time). Once
+the deferred reconciliation puller lands and starts emitting
+`UpsertSiteCallCommand`s, the puller (running on central) is responsible for
+stamping a central timestamp — but if a future direct-write caller forgets,
+or constructs from a site DTO, the value could drift (e.g. become a site
+clock value).
+
+This is currently latent because no production caller exists, but it's
+inconsistent with the dual-write code path and undocumented.
+
+**Recommendation**
+
+- Either: stamp `IngestedAtUtc = DateTime.UtcNow` inside `OnUpsertAsync`
+  before calling `UpsertAsync` (matching `AuditLogIngestActor`'s behaviour),
+  using `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }`.
+- Or: document in the `UpsertSiteCallCommand` XML that callers MUST stamp
+  `IngestedAtUtc` to a central-side `DateTime.UtcNow` immediately before
+  sending.
+
+Preferred: stamp inside the actor — same as the combined-telemetry path —
+because callers cannot in general know the actor is colocated on central.
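+
+A small sketch of the preferred option, assuming `SiteCall` is a C# record (the
+`with` expression quoted above implies it). The two record shapes are hypothetical,
+trimmed-down stand-ins, and `IngestStamping.StampIngested` with its injectable clock
+is an illustrative helper, not an existing member of the actor.
+
+```csharp
+using System;
+
+// Hypothetical, trimmed-down shapes; the real records carry more fields.
+public sealed record SiteCall(Guid TrackedOperationId, DateTime IngestedAtUtc);
+public sealed record UpsertSiteCallCommand(SiteCall SiteCall);
+
+public static class IngestStamping
+{
+    // Returns a copy of the row stamped with a central-side ingest time,
+    // discarding whatever value the caller supplied. The clock is passed in
+    // so a test can pin it; production code would pass () => DateTime.UtcNow.
+    public static SiteCall StampIngested(UpsertSiteCallCommand cmd, Func<DateTime> utcNow)
+        => cmd.SiteCall with { IngestedAtUtc = utcNow() };
+}
+```
+
+Because `with` produces a new record, the command the caller sent is left untouched
+and only the value handed to `UpsertAsync` carries the central timestamp.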
+
+**Resolution**
+
+_Unresolved._
+
+### SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:23-30` (actor XML), `src/ScadaLink.SiteCallAudit/ServiceCollectionExtensions.cs:8-13`, `docs/requirements/Component-SiteCallAudit.md:24-32` |
+
+**Description**
+
+The design doc (`Component-SiteCallAudit.md` lines 24-32) lists five
+responsibilities, including:
+
+- "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
+- "Purge terminal audit rows after a configurable retention window."
+
+The repository exposes `PurgeTerminalAsync` but nothing in this module
+schedules a daily call (Notification Outbox owns a `MaintenanceService` for
+its equivalent; no `SiteCallAuditMaintenanceService` exists). The
+reconciliation puller is acknowledged in the actor XML
+(`only reconciliation remains deferred`) but is not surfaced in the design
+doc as deferred — the doc reads as if it ships.
+
+**Recommendation**
+
+- Either: implement the deferred pieces (a hosted service that wakes daily
+  and calls `repo.PurgeTerminalAsync(now - retentionWindow)`, plus a per-site
+  reconciliation puller with a cursor + an `IPullCachedTelemetryClient`).
+- Or: add a "Status" / "Deferred" subsection to the design doc explicitly
+  listing what's not yet implemented (matches the pattern Audit Log uses for
+  its tamper-evidence hash chain).
+
+**Resolution**
+
+_Unresolved._
+
+### SiteCallAudit-005 — `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:548-563` |
+
+**Description**
+
+```csharp
+return outcome switch
+{
+    SiteCallRelayOutcome.Applied => null,
+    SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
+    SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
+    // SiteUnreachable is never produced from a ParkedOperationActionAck —
+    // unreachable responses are built by UnreachableRetry/UnreachableDiscard
+    // before any ack is classified, so this arm is unreachable by construction.
+    SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
+    _ => throw new ArgumentOutOfRangeException(...)
+};
+```
+
+The comment correctly states the `SiteUnreachable` arm is unreachable when
+called from `ClassifyAck`. The arm therefore exists only to satisfy
+exhaustiveness, but instead of throwing or returning a sentinel, it returns
+`ack.ErrorMessage`, which is indistinguishable from the `OperationFailed`
+arm above. If any future caller *does* feed `SiteUnreachable` into
+`AckErrorMessage` (e.g. via refactor), the result will be a silent
+wrong-detail-text bug rather than an immediate crash. The default arm
+correctly throws `ArgumentOutOfRangeException`, so the `SiteUnreachable` arm
+is the inconsistent one.
+
+**Recommendation**
+
+Replace the `SiteUnreachable => ack.ErrorMessage` arm with:
+
+```csharp
+SiteCallRelayOutcome.SiteUnreachable =>
+    throw new InvalidOperationException(
+        "AckErrorMessage cannot be called for SiteUnreachable — those responses "
+        + "are built by UnreachableRetry/UnreachableDiscard before classification."),
+```
+
+— fail fast if the invariant is ever violated by a refactor.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392` |
+
+**Description**
+
+`SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor` covers
+the case where stuck rows are interleaved with non-stuck rows (page-1 returns
+2 stuck rows, page-2 returns the third). It does not cover the edge where
+the row at the keyset cursor boundary (`AfterCreatedAtUtc + AfterId`) is
+itself a non-stuck row — i.e. the cursor points at a row the next page must
+SKIP through to find more stuck rows. The repository's SQL composes the
+cursor predicate (`CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id <
+...)`) with the stuck predicate, so it should be correct, but the test only
+asserts row counts and `IsStuck`, not that the second-page query specifically
+skipped non-stuck rows between the cursor and the next stuck row.
+
+Lower priority because the SQL composition is straightforward, but adding a
+direct test would lock the invariant.
+
+**Recommendation**
+
+Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
+stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
+page-size-1 query; (c) asserts each page returns exactly one stuck row, with
+no overlap and all 3 stuck rows visited.
+
+**Resolution**
+
+_Unresolved._
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.SiteEventLogging` |
 | Design doc | `docs/requirements/Component-SiteEventLogging.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 9 |

 ## Summary

@@ -46,6 +46,31 @@ keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block o
 host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix, the
 premise does not hold on .NET 8+).

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain closed
+and their resolutions hold up under inspection: the lock-guarded `WithConnection`
+overloads, the background-writer `Channel<T>` with disposed-mid-drain fault
+propagation, the `auto_vacuum = INCREMENTAL` schema + logical-size measurement, the
+severity index, the `LIKE` keyword-search escaping, and the concrete-recorder DI
+wiring are all present and correct at this commit. Nine new findings were recorded —
+none are regressions of prior fixes. The most notable (SiteEventLogging-016, **High**)
+is a correctness defect in the query path: timestamps are stored as ISO 8601 strings
+generated from `DateTimeOffset.UtcNow` (so they always have a `+00:00` offset suffix),
+but the `From`/`To` filters are stringified verbatim via `request.From.Value.ToString("o")`
+without normalising to UTC, so a central client that sends a non-UTC `DateTimeOffset`
+gets a broken lexicographic comparison and either spuriously includes or excludes
+events. The next-most-notable findings are SiteEventLogging-015 (unbounded background
+write queue can grow without limit under sustained writer slowness — sister
+`SqliteAuditWriter` uses a bounded channel) and SiteEventLogging-017 (the central
+client's `PageSize` is used verbatim with no upper-bound clamp, defeating the design's
+"prevents broad queries from overwhelming the communication channel" rationale). The
+remaining findings are low-severity hygiene / documentation: an unused
+`FailedWriteCount` metric, untyped severity/event-type fields, non-invariant culture
+parsing, the purge service running on the standby node, the redundant `Cache=Shared`
+on a single-connection logger, and a non-volatile stop flag in a concurrency stress
+test.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -61,6 +86,21 @@ premise does not hold on .NET 8+).
 | 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
 | 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps (-016); `DateTimeOffset.Parse` without invariant culture is culture-sensitive (-021); severity/event-type accept any non-empty string with no schema enforcement (-020). |
+| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` is a simple `Receive`/`Tell` bridge with no supervision concerns of its own; no new findings. |
+| 3 | Concurrency & thread safety | ☑ | Concurrent-write stress test uses a non-volatile `stop` flag (-023). The shared-connection lock pattern is correct post-SiteEventLogging-003. |
+| 4 | Error handling & resilience | ☑ | `FailedWriteCount` is exposed but nothing in Health Monitoring polls it — the metric is unobserved (-018). |
+| 5 | Security | ☑ | Queries are fully parameterised. `PageSize` and `KeywordFilter` from the central client are not bounded (-017) — a hostile or buggy central could request `int.MaxValue` rows or multi-MB `LIKE` patterns. |
+| 6 | Performance & resource management | ☑ | Background write queue is unbounded (-015); `Cache=Shared` is redundant for a single-connection logger (-022); upper-bound on `PageSize` missing (-017). |
+| 7 | Design-document adherence | ☑ | `EventLogPurgeService` is registered as a per-host `BackgroundService` and runs on the standby too, but the design says "the daily background job runs on the active node" (-019). |
+| 8 | Code organization & conventions | ☑ | `FailedWriteCount` is on the concrete `SiteEventLogger`, not on `ISiteEventLogger`, so any future non-concrete consumer cannot read it (-018). |
+| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
+| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |
+
 ## Findings

 ### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -706,3 +746,341 @@ re-triage note). No code change made. A verification test
 `StartAsync_DoesNotBlock_OnTheInitialPurge` was added to pin this behaviour
 (asserts `StartAsync` returns in under 1 s and the initial purge still runs on the
 background scheduler).
+
+### SiteEventLogging-015 — Background write queue is unbounded; can grow without limit under sustained writer slowness
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:58-63` |
+
+**Description**
+
+`SiteEventLogger` creates its background-writer feeder as
+`Channel.CreateUnbounded<PendingEvent>(...)`. The writer thread funnels every write
+through the shared `_writeLock` (acquired by `WithConnection`), so any condition that
+makes a single iteration slow — a long-running query in `EventLogQueryService`
+holding the lock, a `PurgeByStorageCap` run that takes the lock for batched
+`DELETE` + `PRAGMA incremental_vacuum`, a disk stall, or a sustained event burst
+from an alarm storm / script failure loop — drives the queue arbitrarily large.
+Every queued `PendingEvent` retains its `TaskCompletionSource` and its payload
+strings, so there is no upper bound on how much memory the recorder can hold.
+
+The sister centralized-audit component `ScadaLink.AuditLog/Site/SqliteAuditWriter.cs`
+addresses the same hot-path-writer problem with
+`Channel.CreateBounded<...>(new BoundedChannelOptions(_options.ChannelCapacity) { ..., FullMode = BoundedChannelFullMode.Wait })`,
+giving back-pressure to producers. Site event logging picked the riskier choice for
+a component that — per the design — is fed by every site subsystem (script, alarm,
+deployment, DCL, store-and-forward, instance lifecycle, notification) and has both
+a 30-day retention sweep and a 1 GB cap-purge competing for the same lock.
+
+**Recommendation**
+
+Switch to `Channel.CreateBounded<PendingEvent>(...)` with a configurable capacity
+(default on the order of 10 000: large enough to absorb a normal alarm burst,
+small enough to bound memory). Pick a `FullMode` that matches policy: `Wait` for
+back-pressure (callers `await` and serialise their actor thread on the queue, which
+defeats some of the SiteEventLogging-005 win but is safe), or `DropOldest` /
+`DropWrite` with a counter (drop-and-count is closer to "best-effort audit"). Add
+the dropped-event counter to `FailedWriteCount` or a sibling metric. Document the
+chosen policy on `ISiteEventLogger.LogEventAsync`.
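+
+One possible shape for the drop-and-count variant, sketched with
+`System.Threading.Channels` on .NET 6 or later (the `itemDropped` callback overload
+of `Channel.CreateBounded` was added in that release). `BoundedEventQueue` and its
+members are hypothetical names, and `PendingEvent` is a trimmed stand-in for the real
+type, which also carries a `TaskCompletionSource`.
+
+```csharp
+using System.Threading;
+using System.Threading.Channels;
+
+// Trimmed stand-in for the real PendingEvent.
+public sealed record PendingEvent(string EventType, string Message);
+
+public sealed class BoundedEventQueue
+{
+    private readonly Channel<PendingEvent> _channel;
+    private long _droppedCount;
+
+    public BoundedEventQueue(int capacity)
+    {
+        var options = new BoundedChannelOptions(capacity)
+        {
+            SingleReader = true,   // one background writer drains the queue
+            SingleWriter = false,  // many subsystems enqueue events
+            FullMode = BoundedChannelFullMode.DropOldest,
+        };
+
+        // The callback runs once for each event evicted to make room.
+        _channel = Channel.CreateBounded<PendingEvent>(
+            options,
+            _ => Interlocked.Increment(ref _droppedCount));
+    }
+
+    // Candidate for surfacing next to FailedWriteCount.
+    public long DroppedCount => Interlocked.Read(ref _droppedCount);
+
+    // With DropOldest the write never blocks; the oldest queued event is
+    // discarded when the queue is already at capacity.
+    public bool TryEnqueue(PendingEvent evt) => _channel.Writer.TryWrite(evt);
+
+    public ChannelReader<PendingEvent> Reader => _channel.Reader;
+}
+```
+
+If the real `PendingEvent` keeps its `TaskCompletionSource`, the drop callback is also
+the natural place to complete it, so a caller awaiting a dropped event is not left
+waiting indefinitely.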
+
+### SiteEventLogging-016 — `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:67-77`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:159`, `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:72-78` |
+
+**Description**
+
+Event rows are persisted with `timestamp` = `DateTimeOffset.UtcNow.ToString("o")`,
+which always emits the round-trip ISO 8601 form ending in the literal offset
+`+00:00` (e.g. `2026-05-28T12:34:56.7890123+00:00`). The query path filters by
+range using a direct string comparison:
+
+```
+whereClauses.Add("timestamp >= $from");
+parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));
+```
+
+`request.From` is a `DateTimeOffset?` and `ToString("o")` preserves whatever offset
+the caller passed in. If a central client passes a non-UTC `DateTimeOffset` — for
+example the result of `DateTimeOffset.Now` in a `UTC+05:00` timezone — the produced
+string is `"2026-05-28T17:34:56.0000000+05:00"`, which is lexicographically *greater*
+than the equivalent UTC instant string `"2026-05-28T12:34:56.0000000+00:00"`. The
+comparison `timestamp >= $from` is then evaluated as a byte-by-byte string compare
+(SQLite default `BINARY` collation), so the query either spuriously excludes events
+that genuinely occurred in the range, or spuriously includes events from a wholly
+different hour. The same defect applies to `To`. The retention purge does
+`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")` (UTC) so it is safe; only the
+central query path is vulnerable.
+
+The design explicitly states "All timestamps are UTC throughout the system" but the
+boundary between a central `DateTimeOffset` and the SQLite store is not enforced.
+A central UI rendered in a non-UTC timezone is the most likely trigger, and the
+defect silently corrupts every query that filters by time range — exactly the
+filter most likely to be set on a "show me what happened around the failover" query.
+
+**Recommendation**
+
+Normalise `From` / `To` to UTC before serialising:
+`request.From.Value.ToUniversalTime().ToString("o")`, so the produced offset is
+always `+00:00`. Do not substitute `.UtcDateTime.ToString("o")`: that formats a
+`DateTime` of kind `Utc`, which ends in `Z` rather than `+00:00` and therefore
+does not compare byte-for-byte with the stored form. Add a regression test that
+filters with a `DateTimeOffset` carrying a non-zero offset and asserts the
+matching events are returned. Optionally also store timestamps as Unix-epoch
+`INTEGER` and let SQLite compare numerically, eliminating the
+lexicographic-comparison hazard structurally.
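+
+A minimal sketch of the normalisation, using the example instant from the
+description. `TimestampFilter.ToStoredForm` is a hypothetical helper name, not
+an existing member of `EventLogQueryService`:
+
+```csharp
+using System;
+using System.Globalization;
+
+public static class TimestampFilter
+{
+    // Serialise a filter bound in the same shape the rows are stored in:
+    // round-trip ISO 8601 with a zero offset ("...+00:00").
+    public static string ToStoredForm(DateTimeOffset value) =>
+        value.ToUniversalTime().ToString("o", CultureInfo.InvariantCulture);
+}
+
+public static class TimestampFilterDemo
+{
+    public static (string Raw, string Normalised) Run()
+    {
+        // 17:34:56 at UTC+05:00 is the same instant as 12:34:56 UTC.
+        var local = new DateTimeOffset(2026, 5, 28, 17, 34, 56, TimeSpan.FromHours(5));
+        // Raw keeps the caller's +05:00 offset; Normalised always ends in +00:00.
+        return (local.ToString("o", CultureInfo.InvariantCulture),
+                TimestampFilter.ToStoredForm(local));
+    }
+}
+```
+
+Only the normalised string orders correctly against stored rows under SQLite's
+`BINARY` collation.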
+
+### SiteEventLogging-017 — Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:55`, `src/ScadaLink.Commons/Messages/RemoteQuery/EventLogQueryRequest.cs:18` |
+
+**Description**
+
+`EventLogQueryService.ExecuteQuery` resolves the effective page size as
+`var pageSize = request.PageSize > 0 ? request.PageSize : _options.QueryPageSize;`
+and uses it directly as the SQL `LIMIT $limit` (passing `pageSize + 1` to detect
+"has more"). There is no upper bound. A central client — buggy or hostile — can
+send `PageSize = int.MaxValue`, in which case the query attempts to materialise the
+entire (up to 1 GB) event log into a single `List<EventLogEntry>` while holding the
+shared write lock. This:
+
+- Builds a worst-case ~1 GB managed allocation that, depending on Akka.NET cluster
+  message serialisation limits, will then be serialised into an
+  `EventLogQueryResponse` and pushed over the ClusterClient pipe.
+- Blocks all writes (purge, recorder hot path) for the duration of the scan
+  because the read holds `_writeLock`.
+- Stalls the singleton `EventLogHandlerActor`, also blocking subsequent legitimate
+  queries.
+
+The design explicitly justifies pagination as preventing exactly this — "Results
+are paginated with a configurable page size (default: 500 events) ... This prevents
+broad queries from overwhelming the communication channel." The code honours the
+*default* but does not enforce an *upper bound* on a client-supplied override.
+
+**Recommendation**
+
+Clamp `pageSize` to a configurable maximum (e.g. `SiteEventLogOptions.MaxQueryPageSize`,
+default 5000) before using it. Also bound `KeywordFilter.Length` (e.g. 256 chars) —
+a leading-wildcard `LIKE` of an unbounded pattern is itself an expensive operation
+that runs under the same lock. Add a `Success: false, ErrorMessage: "PageSize
+exceeds maximum"` reject path so a misbehaving central is told why its query is
+refused.
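+
+A minimal sketch of the clamp, assuming a hypothetical `PageSizePolicy` helper;
+the `maxSize` argument stands in for the proposed
+`SiteEventLogOptions.MaxQueryPageSize`, which does not exist yet:
+
+```csharp
+using System;
+
+public static class PageSizePolicy
+{
+    // Non-positive requests fall back to the default; anything larger than the
+    // configured maximum is clamped down to it.
+    public static int Resolve(int requested, int defaultSize, int maxSize)
+    {
+        if (maxSize < 1) throw new ArgumentOutOfRangeException(nameof(maxSize));
+        var effective = requested > 0 ? requested : defaultSize;
+        return Math.Min(effective, maxSize);
+    }
+}
+```
+
+Clamping also keeps the `pageSize + 1` "has more" probe far away from
+`int.MaxValue`, so the addition cannot overflow.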
+
+### SiteEventLogging-018 — `FailedWriteCount` is exposed but never consumed by Health Monitoring
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:67-71,225-226` |
+
+**Description**
+
+`SiteEventLogger.FailedWriteCount` was added under SiteEventLogging-008 with the
+XML doc statement "Surfaced so Health Monitoring can detect a logging outage
+instead of relying on a local log line nobody is watching." The implementation is
+correct (`Interlocked.Increment` on write failure, `Interlocked.Read` getter), but
+a repo-wide search shows **no** caller anywhere in `src/` reads the property —
+neither `ScadaLink.HealthMonitoring`, the central health collector, nor the host's
+`/health` endpoint. The metric is never observed: a logging outage still goes
+unnoticed in production, contradicting the original finding's resolution claim.
+
+The property is also exposed only on the concrete `SiteEventLogger`, not on
+`ISiteEventLogger`, so even if Health Monitoring were wired up it would have to
+take a concrete-type dependency (`internal Connection` removed, but
+`FailedWriteCount` remained concrete-only).
+
+**Recommendation**
+
+Either (a) wire `FailedWriteCount` into the existing Health Monitoring metric
+pipeline (e.g. publish it alongside other 30-second-interval site metrics, and
+promote a sustained non-zero value to a Warning), and add it to `ISiteEventLogger`
+so the consumer doesn't downcast; or (b) acknowledge the metric is unobserved by
+softening the XML doc to "Available for future Health Monitoring integration" and
+file a tracking item for the wiring. The current doc claim is misleading.
+
+### SiteEventLogging-019 — `EventLogPurgeService` runs on every host node; design says "active node"
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:21`, `docs/requirements/Component-SiteEventLogging.md:45` |
+
+**Description**
+
+`AddSiteEventLogging` calls `services.AddHostedService<EventLogPurgeService>()`,
+which registers the purge `BackgroundService` per host. On a 2-node site cluster
+both `node-a` and `node-b` start the service independently, so each runs its own
+30-day retention purge and 1 GB cap purge against its own local
+`site_events.db`. The design states only "A daily background job runs on the
+active node and deletes all events older than 30 days." (Component-SiteEventLogging,
+Storage section). In practice the standby node receives no writes, so its purge
+finds nothing to delete and is harmless — but the implementation does not match the
+documented "active node" gating, and the resolution note on SiteEventLogging-004
+already flagged that the *writer* runs on the standby too. The purge has the same
+shape.
+
+Aligning to the design is also a defence against a future change that does write
+to the standby (e.g. local heartbeats), and removes the per-node wake-ups that
+contribute to `Microsoft.Extensions.Hosting` shutdown latency.
+
+**Recommendation**
+
+Either (a) gate the purge service on "this node is the active member of `siteRole`"
+(check the cluster singleton ownership before each `RunPurge()`, or host the
+purge inside the same cluster singleton as `EventLogHandlerActor`), or (b) reword
+the design doc to "the purge runs on every node against its own local database;
+on the standby it is a no-op". Pick one; the current mismatch is a doc-vs-code
+defect.
+
+### SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:144-156`, `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:14-15` |
+
+**Description**
+
+`LogEventAsync` validates `eventType` and `severity` only for non-empty/non-whitespace.
+The XML doc enumerates the allowed values: `eventType` ∈ {script, alarm,
+deployment, connection, store_and_forward, instance_lifecycle}, `severity` ∈
+{Info, Warning, Error}. Nothing in the code enforces either set. Any caller can
+pass `"SCRIPT"`, `"Script"`, `"warn"`, `"ERR"`, or a typo and the row is inserted
+verbatim. Two follow-on consequences:
+
+1. The `EventLogQueryService.Severity` filter is `severity = $severity` (exact
+   match, case-sensitive by SQLite default `BINARY` collation). A row stored as
+   `"error"` will not be returned for a query filtering on `"Error"`. The design
+   lists severity as a first-class filter and the central UI will reasonably
+   normalise to one casing — every row stored with a different casing is silently
+   invisible to that filter.
+2. The `Events Logged` table in the design implicitly relies on a stable
+   `event_type` enumeration to drive UI grouping; a typo'd `event_type` slips in
+   silently and is hard to detect later.
+
+**Recommendation**
+
+Validate `eventType` and `severity` against a known set (or accept `enum`s on the
+interface, converting to canonical string at the call site). Reject unknown values
+with `ArgumentException` and log a single-shot warning during construction if a
+deployment is found to be using an unexpected value. Alternatively, normalise
+casing to the canonical form on both the write path and the query filter, so that
+stored rows and filter values always agree. Update the XML doc to match the
+enforced contract.
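+
+A minimal sketch of validation against the documented severity set.
+`SeverityNames` is a hypothetical helper; the three values come from the XML
+doc quoted above:
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+public static class SeverityNames
+{
+    // Keys compare case-insensitively (ordinal, so culture rules never apply);
+    // values are the canonical stored form.
+    private static readonly Dictionary<string, string> Canonical =
+        new(StringComparer.OrdinalIgnoreCase)
+        {
+            ["Info"] = "Info",
+            ["Warning"] = "Warning",
+            ["Error"] = "Error",
+        };
+
+    // Returns the canonical spelling, or throws for anything outside the set.
+    public static string Canonicalise(string severity)
+    {
+        if (severity is not null && Canonical.TryGetValue(severity, out var canonical))
+            return canonical;
+        throw new ArgumentException($"Unknown severity '{severity}'.", nameof(severity));
+    }
+}
+```
+
+Applying the same function to the query filter value keeps the exact-match
+`severity = $severity` comparison valid without changing the collation.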
+
+### SiteEventLogging-021 — `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:138` |
+
+**Description**
+
+`ExecuteQuery` materialises rows via
+`DateTimeOffset.Parse(reader.GetString(1))`. `DateTimeOffset.Parse(string)` uses
+`CultureInfo.CurrentCulture` and `DateTimeStyles.None`. The stored format is ISO
+8601 round-trip (`"o"`), which is *usually* parseable in any culture, but a
+production node running with a non-default culture (for example one whose default
+calendar is not Gregorian, or one that overrides the date/time separators) can
+parse incorrectly or throw `FormatException`. The exception is caught by the outer
+`try`, so the entire query is converted to a `Success: false` response, but the
+failure mode is silent and culture-dependent.
+
+The recorder side stores via `DateTimeOffset.UtcNow.ToString("o")`. The `"o"`
+specifier is defined to be culture-invariant, so the emitted string is the same
+on every node; the culture dependence is confined to the parse on the read side,
+where the round-trip between insert and query is not guaranteed to be lossless
+without explicit culture pinning.
+
+**Recommendation**
+
+Parse with explicit invariant culture and round-trip style:
+`DateTimeOffset.Parse(reader.GetString(1), CultureInfo.InvariantCulture,
+DateTimeStyles.RoundtripKind)`. Passing `CultureInfo.InvariantCulture` to the
+`ToString("o")` emitters in `SiteEventLogger.LogEventAsync` and
+`EventLogPurgeService.PurgeByRetention` does not change their output, but makes
+the intent explicit. Alternatively switch the schema to store `timestamp` as
+Unix-epoch `INTEGER` and avoid all string-parsing.
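+
+A minimal sketch of the culture-pinned read side. `StoredTimestamp` is a
+hypothetical helper name; the `Parse` overload is the one named in the
+recommendation:
+
+```csharp
+using System;
+using System.Globalization;
+
+public static class StoredTimestamp
+{
+    // Parses the stored round-trip string without consulting CurrentCulture.
+    public static DateTimeOffset Parse(string stored) =>
+        DateTimeOffset.Parse(
+            stored, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);
+}
+```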
+
+### SiteEventLogging-022 — `Cache=Shared` is redundant for a single-connection logger
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:52` |
+
+**Description**
+
+The connection string is built as
+`$"Data Source={options.Value.DatabasePath};Cache=Shared"`. SQLite's
+shared-cache mode is a *cross-connection* optimisation: it lets multiple
+`SqliteConnection`s in the same process share an in-process page cache. This
+logger owns exactly one `SqliteConnection` and serialises all access through
+`_writeLock`, so `Cache=Shared` cannot share with anything — the mode is dormant.
+At best it is dead configuration; at worst it adds (very small) per-statement
+lock overhead inside SQLite. The sister `SqliteAuditWriter` carries the same
+unused option, so the smell is a copy-and-paste pattern.
+
+Shared-cache mode also subtly changes the semantics of `PRAGMA busy_timeout` and
+`PRAGMA locking_mode`, so leaving it on while *not* using it is a small future
+footgun if anyone later opens a second connection to the same file from another
+component on the same host (e.g. a tooling read-only viewer).
+
+**Recommendation**
+
+Drop `Cache=Shared` from the connection string — the logger is single-connection
+and gains nothing from it. If a future need to share the DB across connections in
+the same process arises, reintroduce it deliberately together with the busy_timeout
+and locking_mode review that should accompany it.
+
+### SiteEventLogging-023 — Concurrent-stress test uses a non-volatile `stop` flag
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.SiteEventLogging.Tests/EventLogPurgeServiceTests.cs:282-308` |
+
+**Description**
+
+`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` uses a plain `bool stop = false;`
+that the main test thread mutates after the purge task completes
+(`stop = true;`) while four background writer tasks are spin-checking `while (!stop)`.
+The flag is not declared `volatile`, not wrapped in `Volatile.Read/Volatile.Write`,
+and not behind a memory barrier. On a release build with a relaxed memory model
+the writer threads are permitted to cache the `stop = false` read indefinitely,
+which means in theory the test can hang past xUnit's per-test timeout instead of
+asserting `Empty(exceptions)`. The test relies on observed JIT/runtime behaviour
+that today happens to refresh the field across the `await _eventLogger.LogEventAsync`
+boundary, but that is an implementation detail rather than a contract.
+
+The test is a regression test for SiteEventLogging-003; a flaky / hang-prone
+version of it can mask the very behaviour it is meant to pin.
+
+**Recommendation**
+
+Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or
+change `stop` to a `volatile bool` (which must be a field; C# does not allow
+`volatile` on locals), or access it through `Volatile.Read` / `Volatile.Write`.
+`CancellationTokenSource` is the canonical .NET pattern and also lets the test
+cooperate with xUnit's `Task.WhenAll` timeout.
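+
+A minimal sketch of the `CancellationTokenSource` shape, assuming .NET 6 or later
+for `Task.WaitAsync`. `StressLoop` and its delegate parameters are hypothetical;
+in the real test `work` would be the `LogEventAsync` call and `duringRun` the
+purge:
+
+```csharp
+using System;
+using System.Linq;
+using System.Threading;
+using System.Threading.Tasks;
+
+public static class StressLoop
+{
+    // Runs writerCount background loops until duringRun completes, then cancels
+    // them. Returns the total number of loop iterations performed.
+    public static async Task<long> RunAsync(
+        int writerCount, Func<Task> work, Func<Task> duringRun)
+    {
+        using var cts = new CancellationTokenSource();
+        long iterations = 0;
+
+        var writers = Enumerable.Range(0, writerCount).Select(_ => Task.Run(async () =>
+        {
+            // IsCancellationRequested is a properly synchronised read.
+            while (!cts.IsCancellationRequested)
+            {
+                await work();
+                Interlocked.Increment(ref iterations);
+            }
+        })).ToArray();
+
+        await duringRun();
+        cts.Cancel();
+
+        // Bound the join so a stuck writer fails the test instead of hanging it.
+        await Task.WhenAll(writers).WaitAsync(TimeSpan.FromSeconds(30));
+        return Interlocked.Read(ref iterations);
+    }
+}
+```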
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.SiteRuntime` |
 | Design doc | `docs/requirements/Component-SiteRuntime.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |

 ## Summary

@@ -47,6 +47,36 @@ and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager
 never routes to (SiteRuntime-019, Low). All three were subsequently resolved on
 2026-05-17. Open findings: 0.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+The module was re-reviewed at commit `1eb6e97` as part of the new baseline
+review. The SiteRuntime source surface has grown materially since the prior
+pass — primarily by threading `ExecutionId`/`ParentExecutionId`/`SourceNode`
+through the script-trust-boundary helpers and the cached-call telemetry
+emitters, and by adding `OperationTrackingStore`, the
+`AuditingDbConnection`/`AuditingDbCommand`/`AuditingDbDataReader` decorators,
+and `ScriptExecutionScheduler`. All 10 checklist categories were walked afresh.
+Seven new findings were recorded: a race that throws
+`InvalidActorNameException` when a second `DeployInstanceCommand` arrives for
+the same instance while a redeployment is still terminating its predecessor
+(SiteRuntime-020, Medium); an artifact-only data-connection update that never
+reaches the DCL (SiteRuntime-021, Medium); `AuditingDbCommand.DbConnection.set`
+reaching into `AuditingDbConnection._inner` via reflection — the same anti-
+pattern SiteRuntime-006 eliminated for the repositories, now reintroduced and
+in direct tension with the script trust model that forbids `System.Reflection`
+(SiteRuntime-022, Medium); `Convert.ToDouble(value)` in `ScriptActor` /
+`AlarmActor` running under `CurrentCulture` so a string attribute value
+becomes locale-sensitive (SiteRuntime-023, Low); `OperationTrackingStore`
+serialising every cached-call write through a single connection +
+`SemaphoreSlim` and using sync-over-async in `Dispose()` (SiteRuntime-024,
+Medium); inbound-API `SetAttribute` (and any future caller) accepting unknown
+attribute names and persisting them as overrides, polluting both `_attributes`
+and the SQLite override table (SiteRuntime-025, Low); and the
+`ReplicationMessages.cs` outbound/inbound record types still missing public XML
+docs (SiteRuntime-026, Low). Prior findings 001–019 remain
+Resolved/Deferred — no regressions observed in any of their fixed call sites.
+Open findings: 7.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -62,6 +92,21 @@ never routes to (SiteRuntime-019, Low). All three were subsequently resolved on
 | 9 | Testing coverage | ✓ | No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories. |
 | 10 | Documentation & comments | ✓ | Several XML comments describe behaviour the code does not implement (see findings). |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | Second-deploy race vs. pending redeploy (020); artifact-only data-connection update never reaches DCL (021); unknown-name SetAttribute persists bogus overrides (025). |
+| 2 | Akka.NET conventions | ✓ | Trigger-eval blocking on coordinator mailbox remains Deferred (014); short-lived execution actors and replication actor otherwise conform. |
+| 3 | Concurrency & thread safety | ✓ | DM's `_instanceActors` cache and `_pendingRedeploys` map shifted from old race; new ordering race surfaced (020). `OperationTrackingStore` single-connection + SemaphoreSlim serialises all cached writes (024). |
+| 4 | Error handling & resilience | ✓ | `Task.Run` fire-and-forget replication paths log on faulted (acceptable, per "best-effort replication" design). DM's deploy persistence rollback path (resolved as SiteRuntime-005) intact. |
+| 5 | Security | ✓ | Trust-model semantic analysis (SiteRuntime-011 fix) intact. `AuditingDbCommand` reflects into `AuditingDbConnection._inner` — same anti-pattern as SiteRuntime-006 (022). Audit emitter captures SQL parameter values verbatim per M4 design (M5 will redact). |
+| 6 | Performance & resource management | ✓ | Per-call SQLite connections on hot paths in `SiteStorageService` (existing pattern, acceptable). `OperationTrackingStore` `Dispose()` does sync-over-async (024). `ScriptExecutionScheduler` bounded threads as expected. |
+| 7 | Design-document adherence | ✓ | Artifact-only data-connection update path is silently inert (021) — contradicts the "site is self-contained after artifact deployment" intent. |
+| 8 | Code organization & conventions | ✓ | Repository reflection-via-private-field anti-pattern reintroduced in `AuditingDbCommand` (022). `ReplicationMessages.cs` public records still undocumented (026). |
+| 9 | Testing coverage | ✓ | `SiteReplicationActor` remains uncovered (SiteRuntime-016 deferred that gap to a clustered-ActorSystem harness, still outstanding). New findings have no targeted coverage yet. |
+| 10 | Documentation & comments | ✓ | `ReplicationMessages.cs` records lack XML docs (026); other XML doc surface materially expanded in `1eb6e97`. |
+
 ## Findings

 ### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
@@ -902,3 +947,362 @@ stating the Deployment Manager owns this lifecycle. Regression test:
 `InstanceActorTests.InstanceActor_DoesNotHandleDisableOrEnableCommands` asserts the
 Instance Actor produces no `InstanceLifecycleResponse` for either command
 (confirmed to fail against the pre-fix dead handlers and pass after removal).
+
+### SiteRuntime-020 — Second `DeployInstanceCommand` arriving during a pending redeploy races the still-terminating actor on its name
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:285`, `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:971` |
+
+**Description**
+
+The SiteRuntime-003 fix makes `HandleDeploy` watch + stop a running Instance
+Actor and buffer the in-flight `DeployInstanceCommand` in `_pendingRedeploys`
+until `Terminated` arrives. The handler also removes the instance from
+`_instanceActors` synchronously, in step with the stop request:
+
+```csharp
+if (_instanceActors.TryGetValue(instanceName, out var existing))
+{
+    _instanceActors.Remove(instanceName);
+    _pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
+    Context.Watch(existing);
+    Context.Stop(existing);
+    UpdateInstanceCounts();
+    return;
+}
+
+// Fresh deployment — no existing actor to replace.
+ApplyDeployment(command, Sender, isRedeploy: false);
+```
+
+If a *second* `DeployInstanceCommand` for the same `instanceName` arrives on
+the singleton's mailbox while the predecessor is still terminating, the
+`_instanceActors.TryGetValue` lookup correctly reports "no existing actor" —
+because the first deploy already removed it — and execution falls through to
+`ApplyDeployment(..., isRedeploy: false)`. `ApplyDeployment` immediately calls
+`CreateInstanceActor`, which calls `Context.ActorOf(props, instanceName)`. But
+the predecessor's Akka child name **is still registered** in the parent's
+child registry: that name is only released after the predecessor's `Terminated`
+signal — exactly the asynchronous gap SiteRuntime-003 was created to plug for
+the *first* redeploy. `Context.ActorOf` therefore throws
+`InvalidActorNameException`, which Akka rethrows as
+`ActorInitializationException` — and the supervisor's `Stop` directive on that
+exception (DeploymentManagerActor.cs:179) silently stops the just-created
+child. The second deploy is then quietly lost: `_instanceActors` doesn't
+contain it (the throw aborted the bookkeeping after `CreateInstanceActor`'s
+own `ContainsKey` guard but before `_instanceActors[instanceName] = actorRef`
+would have run), `_totalDeployedCount` was incremented, and the deployer is
+never told the deployment failed (the persistence `Task.Run` is also dropped
+on the throw path). The race is real on a busy site where central retries a
+deploy because the prior attempt timed out — exactly the scenario the
+DeploymentManager-006 query-then-deploy idempotency mechanism was designed for.
+
+The first-redeploy case (SiteRuntime-003) does NOT exhibit this because at
+that point the predecessor is still in `_instanceActors`, so the branch
+correctly buffers. The bug is specific to a second (or later) deploy that
+arrives while a redeploy for the same instance is still pending.
+
+**Recommendation**
+
+The pending-redeploy bookkeeping needs to be authoritative for "we are mid-
+redeploy on this instance", not just the `_instanceActors` cache. Add a second
+keyed lookup — e.g. a `Dictionary<string, IActorRef> _terminatingActorsByName`
+populated when the predecessor is stopped — and check it BEFORE
+`ApplyDeployment(isRedeploy: false)`. On a hit, overwrite (or stash) the
+buffered `PendingRedeploy` for that terminating actor so the latest command
+wins on the `Terminated` signal. Alternatively, defer the deploy by stashing
+all messages for that `instanceName` until the predecessor terminates (Akka
+`Stash` pattern). Either way, the fall-through to "fresh deployment" needs to
+be gated on "no instance with this name is currently terminating".
+
+### SiteRuntime-021 — `HandleDeployArtifacts` updates `DataConnections` in SQLite but never sends `CreateConnectionCommand` to the DCL
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:931` |
+
+**Description**
+
+`HandleDeployArtifacts` persists the artifact bundle (shared scripts, external
+systems, database connections, notification lists, SMTP configs, and
+**data connection definitions**) into local SQLite. For data connection
+definitions specifically (`DataConnections`), the handler calls
+`_storage.StoreDataConnectionDefinitionAsync(...)` — but does NOT issue a
+`CreateConnectionCommand` (or any other DCL command) to the `_dclManager`
+actor. The only path that pushes DCL configuration to the DCL is
+`EnsureDclConnections`, called exclusively from the deploy / startup-batch
+paths against the **flattened instance configuration's** inline `Connections`
+dictionary. There is no equivalent for an artifact-only update.
+
+Concretely: an artifact deployment that changes a data connection's endpoint
+URL, credentials, backup endpoint, or failover retry count is stored
+durably in the site SQLite (so on the *next* node restart the site loads the
+new config and `EnsureDclConnections` picks it up) but is silently inert until
+either an instance using that connection is redeployed or the node restarts.
+This contradicts the design's "after artifact deployment, the site is fully
+self-contained" intent (Component-SiteRuntime.md, "System-Wide Artifact
+Handling") — the runtime DCL keeps using the stale connection until a much
+heavier trigger event occurs. It is also asymmetric with how
+`SharedScripts` are handled in the same method: shared scripts are both
+stored *and* recompiled into `_sharedScriptLibrary` on update so the change is
+live immediately.
+
+(SiteRuntime-010 fixed a related defect inside `EnsureDclConnections` — the
+config-hash cache — but that's only consulted on the inline-config path; the
+artifact-deployment path never reaches `EnsureDclConnections`.)
+
+**Recommendation**
+
+In the `DataConnections` branch of `HandleDeployArtifacts`, after the
+`StoreDataConnectionDefinitionAsync` call, also send a
+`CreateConnectionCommand` to `_dclManager` for each updated definition,
+re-using the SiteRuntime-010 config hash so unchanged connections are skipped.
+Alternatively, refactor `EnsureDclConnections` to accept a flat list of
+`(name, protocol, configurationJson, backupConfigurationJson,
+failoverRetryCount)` tuples that both the inline (`FlattenedConfiguration`)
+and artifact paths can drive through it.
+
+### SiteRuntime-022 — `AuditingDbCommand.DbConnection.set` uses reflection to read `AuditingDbConnection._inner`
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.SiteRuntime/Scripts/AuditingDbCommand.cs:138` |
+
+**Description**
+
+The `DbConnection` setter on `AuditingDbCommand` unwraps an
+`AuditingDbConnection` value by reading its private `_inner` field via
+reflection:
+
+```csharp
+set
+{
+    _wrappingConnection = value;
+    _inner.Connection = value switch
+    {
+        AuditingDbConnection auditing => auditing.GetType()
+            .GetField("_inner", BindingFlags.Instance | BindingFlags.NonPublic)
+            !.GetValue(auditing) as DbConnection,
+        _ => value
+    };
+}
+```
+
+This is the same encapsulation-violating anti-pattern that SiteRuntime-006
+called out for the site repositories. A rename or refactor of
+`AuditingDbConnection._inner` breaks the audit decorator at runtime (no
+compile-time signal), the `!` null-forgiving operator only suppresses the
+compiler warning so the failure surfaces as a `NullReferenceException`, and
+the reflective access trips static analyzers and IL trimming. More
+problematically, the script trust model the same module enforces in
+`ScriptCompilationService.ValidateTrustModel` explicitly forbids
+`System.Reflection` in scripts, yet the auditing helper that scripts run
+through itself uses reflection to reach into a sibling class. Both
+classes are `internal sealed` in the same assembly, so this is purely a
+self-imposed contract violation.
+
+A second smaller concern in the same property: the getter returns
+`_wrappingConnection ?? _inner.Connection`. If the caller obtains a command
+via `AuditingDbConnection.CreateDbCommand()` and immediately reads
+`cmd.Connection`, the getter returns the raw inner connection (not the
+auditing wrapper), because `_wrappingConnection` is only populated when the
+setter is later invoked. That's surprising and at odds with the class's
+audit-everything intent — a script that round-trips a command through
+`cmd.Connection` re-enters the un-audited path.
+
+**Recommendation**
+
+Expose the wrapped connection through a proper API surface. The simplest fix
+that matches the SiteRuntime-006 precedent: add an
+`internal DbConnection Inner { get; }` property to `AuditingDbConnection`
+(both classes are `internal sealed`, so the property stays out of the public
+surface) and replace the reflection switch with `auditing.Inner`. While
+touching the property, also have the getter return `_wrappingConnection` even
+on the synthesised CreateDbCommand path (e.g. set `_wrappingConnection` to
+the parent connection inside `AuditingDbConnection.CreateDbCommand`).
+
+### SiteRuntime-023 — `Convert.ToDouble(value)` in trigger and alarm evaluation is locale-sensitive
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:446`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:340`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:356`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:444` |
+
+**Description**
+
+`ScriptActor.EvaluateCondition` and the three `AlarmActor` evaluators
+(`EvaluateRangeViolation`, `EvaluateRateOfChange`, `EvaluateHiLo`) call
+`Convert.ToDouble(value)` without specifying a culture. When `value` is a
+string (a path that exists today — attribute values that arrive as JSON-
+deserialized numbers can still surface as strings on some code paths,
+particularly array values that are JSON-stringified at
+`InstanceActor.HandleTagValueUpdate:377`), `Convert.ToDouble` parses against
+`CultureInfo.CurrentCulture`. On a host whose locale uses a comma decimal
+separator (German, French, most of continental Europe), `"1.5"` is not read as
+1.5: where `.` is not a valid separator (e.g. `fr-FR`) it throws and the
+condition / alarm silently degrades to its catch-fallthrough (returns `false`
+for range/rate-of-change, keeps current level for HiLo, falls back to
+string-compare for conditionals); where `.` is the group separator (e.g.
+`de-DE`) it parses without error as `15`, so the evaluator acts on a wrong
+value. The CLAUDE.md "All timestamps are UTC" discipline is the equivalent
+rule for time; there is no equivalent invariant-culture discipline applied to
+numeric parsing.
+
+The exposure is bounded — most attribute values arrive as numeric primitives
+from `TagValueUpdate.Value` or static `FlattenedConfiguration.Attributes`
+(also typed) so the implicit-cast `Convert.ToDouble` path is hit. But the
+string path is reachable via inbound API writes
+(`RouteToSetAttributesRequest.AttributeValues` is `IReadOnlyDictionary<string,
+string>`), via the JSON-array stringification at `HandleTagValueUpdate:377`,
+and via static-override values loaded from SQLite (which are persisted as
+strings — see `SetStaticOverrideAsync`).
+
+**Recommendation**
+
+Replace each `Convert.ToDouble(value)` with `Convert.ToDouble(value,
+CultureInfo.InvariantCulture)`, or front-load a typed-numeric extraction
+helper (`if (value is double d) return d; if (value is string s && double.TryParse(s,
+NumberStyles.Float, CultureInfo.InvariantCulture, out var p)) return p;
+return Convert.ToDouble(value, CultureInfo.InvariantCulture);`). The site is a
+deterministic machine-control surface; condition evaluation must not depend
+on the host's regional settings.
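+
+A minimal sketch of such an extraction helper, written as a `Try` method so the
+evaluators can keep their existing fallbacks. `NumericCoercion` is a
+hypothetical name, not an existing type in the module:
+
+```csharp
+using System;
+using System.Globalization;
+
+public static class NumericCoercion
+{
+    // Converts a boxed attribute value to double using invariant-culture rules.
+    public static bool TryToDouble(object? value, out double result)
+    {
+        switch (value)
+        {
+            case double d:
+                result = d;
+                return true;
+            case string s:
+                // NumberStyles.Float excludes group separators, so "1,5" is
+                // rejected rather than read as 15.
+                return double.TryParse(
+                    s, NumberStyles.Float, CultureInfo.InvariantCulture, out result);
+            case IConvertible c:
+                try
+                {
+                    result = c.ToDouble(CultureInfo.InvariantCulture);
+                    return true;
+                }
+                catch (Exception e) when (
+                    e is FormatException or InvalidCastException or OverflowException)
+                {
+                    result = 0;
+                    return false;
+                }
+            default:
+                result = 0;
+                return false;
+        }
+    }
+}
+```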
+
+### SiteRuntime-024 — `OperationTrackingStore` serialises all writes through one connection + `SemaphoreSlim`, and `Dispose()` does sync-over-async
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:39`, `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:360` |
+
+**Description**
+
+`OperationTrackingStore` owns exactly one `SqliteConnection` and gates every
+public method through a single `SemaphoreSlim(1, 1)`. The class XML comment
+calls this out as deliberate ("the M3 brief calls out as 'cleaner than the M2
+Channel<T> pipeline given the volume'"), and the *write* volume is genuinely
+low — at most a handful of lifecycle rows per cached call. But on a busy site
+the *read* path (`GetStatusAsync`) is called by every `Tracking.Status(id)`
+invocation from every executing script, and reads are serialised through the
+same gate as writes. A long-running write (e.g. a Roslyn-script-driven
+`RecordTerminalAsync` competing with an SQLite checkpoint) holds the gate and
+stalls every concurrent status query. SQLite supports concurrent readers with
+a single writer in WAL mode; the gate forfeits that capability.
+
+A separate concern in the same class: `Dispose()` calls
+`DisposeAsyncCore().AsTask().GetAwaiter().GetResult()`. That is
+sync-over-async, the very pattern SiteRuntime-008 flagged. If a caller
+disposes the store from a synchronization context that does not allow
+re-entrance (e.g. an `IHostedService.StopAsync` continuation observed on the
+host's sync context, or a finalizer pumping on the thread pool with a stuck
+continuation), the `.WaitAsync()` inside `DisposeAsyncCore` waits for a
+continuation that will never run, and the dispose deadlocks. The async path
+itself is correct; only the sync `Dispose()` wrapper is risky.
+
+**Recommendation**
+
+For the single-connection gate: split reads and writes into separate gates,
+or — better — keep the writer single-connection and open a fresh read
+connection (or pool of read connections) per `GetStatusAsync` call. SQLite
+connections are cheap; the `SiteStorageService` precedent already uses per-
+call connections on the read path. For `Dispose()`: prefer
+`Dispose() { GC.SuppressFinalize(this); _connection.Dispose(); _gate.Dispose(); }`
+without an awaited disposal, and have the `IAsyncDisposable.DisposeAsync`
+path do the awaiting. If a synchronous disposable is genuinely needed, do
+not bridge it through the async core — duplicate the dispose-once flag check
+into a sync path that calls `_connection.Dispose()` directly.
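+
+A minimal sketch of that split, assuming a hypothetical `TrackingStoreSketch` class
+in which `DbConnection` stands in for `SqliteConnection` and the field names mirror
+the ones quoted above. It is a simplification: the real async path also waits on the
+gate for in-flight operations, which this sketch omits.
+
+```csharp
+using System;
+using System.Data.Common;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Hypothetical shape, not the actual OperationTrackingStore.
+public sealed class TrackingStoreSketch : IDisposable, IAsyncDisposable
+{
+    private readonly DbConnection _connection;
+    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
+    private int _disposed; // 0 = live, 1 = disposed; shared by both dispose paths
+
+    public TrackingStoreSketch(DbConnection connection) => _connection = connection;
+
+    // Sync path: no awaiting, so no continuation can be left stranded.
+    public void Dispose()
+    {
+        if (Interlocked.Exchange(ref _disposed, 1) != 0) return;
+        _connection.Dispose();
+        _gate.Dispose();
+    }
+
+    // Async path: the only place that awaits disposal.
+    public async ValueTask DisposeAsync()
+    {
+        if (Interlocked.Exchange(ref _disposed, 1) != 0) return;
+        await _connection.DisposeAsync().ConfigureAwait(false);
+        _gate.Dispose();
+    }
+}
+```
+
+Because both paths claim the same flag with `Interlocked.Exchange`, whichever runs
+first does the work and the other becomes a no-op, so a host that calls both is safe.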
+
+### SiteRuntime-025 — `HandleSetStaticAttribute` persists unknown attribute names as static overrides
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:223`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:246` |
+
+**Description**
+
+`HandleSetStaticAttribute` resolves the target attribute against
+`_configuration.Attributes` to decide whether to route the write to the DCL or
+treat it as a static-override write. If the lookup fails (`resolved == null`),
+`isDataSourced` is false, and execution falls through to
+`HandleSetStaticAttributeCore` — which unconditionally:
+
+1. inserts the bogus key into the in-memory `_attributes` dictionary,
+2. publishes an `AttributeValueChanged` for the bogus key to the site stream
+   and to every child Script/Alarm actor,
+3. persists a row in `static_attribute_overrides` for the bogus key, and
+4. replies `Success = true` to the caller.
+
+Concretely, an inbound API `Route.To().SetAttribute("notARealAttr", "x")`
+returns success, pollutes the in-memory state with a key that no script can
+legitimately observe (canonical-name lookup will not produce it), persists a
+durable SQLite override row that survives restart, and (on every restart)
+re-injects the polluting key via `HandleOverridesLoaded` at line 608. The
+override is **not** reset on instance redeployment in the same way the
+"genuine" overrides are — `ClearStaticOverridesAsync` does clear by
+`instance_unique_name`, so the row is eventually cleaned, but only on a full
+redeploy; in the meantime each restart resurrects it. The publish-to-stream
+side effect also lets a hostile or buggy inbound caller spam debug-view
+subscribers with synthetic attribute changes.
+
+Worth flagging at Low: the inbound API surface is already authenticated and
+the design assumes its callers are trusted. But the no-validation behaviour
+contradicts the design doc's "Scripts can only read/write attributes on their
+own instance" framing — an inbound API call inherits the same instance-scope
+authority as a script, and the script trust model wouldn't sanction this.
+
+**Recommendation**
+
+In `HandleSetStaticAttribute`, when `resolved == null`, reply
+`SetStaticAttributeResponse(Success: false,
+ErrorMessage: $"Attribute '{command.AttributeName}' not found on instance
+'{_instanceUniqueName}'")` instead of falling through to the override path.
+Optionally also surface the existence check on the `RouteInboundApiSetAttributes`
+fan-out so a multi-attribute write reports the offending key without rolling
+back the others (the per-attribute `Ask` shape already supports a partial
+failure response).
+
+### SiteRuntime-026 — `ReplicationMessages.cs` public record types have no XML documentation
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:10`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:13`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:15`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:17`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:19`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:25`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:28`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:30`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:32`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:34` |
+
+**Description**
+
+The ten public record types in `ReplicationMessages.cs`
+(`ReplicateConfigDeploy`, `ReplicateConfigRemove`, `ReplicateConfigSetEnabled`,
+`ReplicateArtifacts`, `ReplicateStoreAndForward`, `ApplyConfigDeploy`,
+`ApplyConfigRemove`, `ApplyConfigSetEnabled`, `ApplyArtifacts`,
+`ApplyStoreAndForward`) carry no XML documentation. The file header comment
+groups them as "outbound" vs "inbound" but the individual records have no
+`<summary>` and no parameter docs. The XML-doc baseline `1eb6e97` rolled out
+across the rest of the module (the commit being reviewed is literally `docs:
+add XML doc comments across src + Sister Projects section in CLAUDE.md`), so
+this file is now the conspicuous outlier — and the `CommentChecker` skill
+relied on by the `fixdocs` workflow will flag every record as missing docs.
+
+**Recommendation**
+
+Add a `<summary>` per record naming the direction (outbound → peer / inbound
+from peer) and what the operation replicates, and `<param>` docs for each
+record parameter. Mirror the precedent in
+`src/ScadaLink.Commons/Messages/.../*.cs`. While there, consider sealing the
+inbound vs outbound split with a marker base type (currently they're just
+named conventionally) so `Receive<ReplicateXxx>` vs `Receive<ApplyXxx>` is
+expressed at the type level — but that's optional and out of scope for a
+docs-only finding.
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.StoreAndForward` |
 | Design doc | `docs/requirements/Component-StoreAndForward.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 (3 Deferred: 002, 011, 012; 7 new Open: 018–024 — see Re-review 2026-05-28) |

 ## Summary

@@ -55,6 +55,76 @@ StoreAndForward-017 records that the Retry/Discard activity-log entries hard-cod
 `ExternalSystem` category, mislabelling notification and cached-DB-write messages in
 the site event log.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Full re-review against commit `1eb6e97` with the same 10-category checklist. The
+batch-3 / batch-4 resolutions (001, 003–010, 013–017) are still present and intact; no
+regressions detected on prior fixes. Findings 002, 011 and 012 remain validly
+`Deferred` (their preconditions are unchanged) and findings 005, 006, 010, 013, 014,
+016, 017 are confirmed `Resolved` against the current source.
+
+This pass surfaced **seven new findings** clustered around two themes:
+
+The first theme is **design-doc drift on the notification path**, which has acquired
+two now-real defects since the engine became central-targeted. `StoreAndForward-018`
+(High) records that a corrupt notification payload — handled in `NotificationForwarder.
+DeliverAsync` by returning `false` — parks a notification on its first retry-sweep
+encounter, despite the design doc stating "Notifications do not park" (line 47, "Parking
+applies only to the external-system-call and cached-database-write categories"). The
+same path becomes a poison-payload retry-forever trap on the active node if the engine
+ever softened the `false` semantics. `StoreAndForward-019` (Medium) records the
+sibling defect: notifications are enqueued with `MaxRetries` defaulting to
+`StoreAndForwardOptions.DefaultMaxRetries` (50), and the legacy SMTP path
+(`NotificationDeliveryService.SendAsync`) passes a positive bounded `smtpConfig.
+MaxRetries` — so an unreachable central will silently park notifications after a
+finite retry budget rather than "retry at the fixed forward interval until central acks"
+as the design requires. The contract `0 = no limit` is not enforced for the
+notification category.
+
+The second theme is **subtle correctness and contract gaps around the operator paths**
+that survived the StoreAndForward-016/017 batch. `StoreAndForward-020` (Medium) records
+that `RetryParkedMessageAsync` skips replication entirely if `GetMessageByIdAsync`
+returns null after a successful local requeue (a narrow but real race window with a
+concurrent discard / sweep delete), re-introducing the StoreAndForward-016 standby
+divergence in that corner. `StoreAndForward-021` (Medium) is a design-doc-vs-code drift
+that should be reconciled in the doc: the **operation tracking table** is documented
+inside Component-StoreAndForward.md as a S&F responsibility (lines 21, 49, 77–87, 108,
+114), but the actual `OperationTrackingStore` lives in `src/ScadaLink.SiteRuntime/
+Tracking/` and is not consumed by S&F at all — the brief's own note flags this. The
+design doc should be updated to point at SiteRuntime, or the store moved to
+StoreAndForward.
+
+`StoreAndForward-022` (Low) records that `_cachedCallObserver` silently drops audit
+telemetry when a buffered cached-call's `Id` is not a parseable `TrackedOperationId`
+GUID: the engine returns from `NotifyCachedCallObserverAsync` before emitting anything.
+The engine's own default minting produces "N"-formatted GUIDs, which
+`TrackedOperationId.TryParse` accepts, but any caller passing a custom non-GUID id
+silently bypasses the entire `Submitted/Forwarded/Attempted/Delivered/Parked/Discarded`
+audit lifecycle. `StoreAndForward-023` (Low)
+records that `siteId` is silently defaulted to `string.Empty` when no
+`IStoreAndForwardSiteContext` is registered, so a misconfigured host produces audit
+telemetry with `SourceSite = ""` and the central audit-log's `(SourceSite,
+TrackedOperationId)` correlation degrades to a per-id-only index. `StoreAndForward-024`
+(Low) is a stop-time ordering defect: `StopAsync` disposes the timer but a
+mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
+`_replication` after `StopAsync` returns; downstream resources disposed by the host
+shutdown sequence (the DI container) can then cause a `NullReferenceException` in the still-running sweep.
+
+## Checklist coverage — Re-review 2026-05-28
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | Notification corrupt-payload parks contrary to design (018); RetryParkedMessageAsync skips replication when message reload races a deletion (020). |
+| 2 | Akka.NET conventions | ☑ | `ParkedMessageHandlerActor` uses `PipeTo` correctly with success/failure projections (007 resolution preserved). No new findings. |
+| 3 | Concurrency & thread safety | ☑ | Sweep-vs-stop race: a timer callback running while `StopAsync` returns can touch disposed dependencies (024). |
+| 4 | Error handling & resilience | ☑ | Notifications park after `DefaultMaxRetries` exhaustion (019) — contradicts the design doc's "retried until central acks". |
+| 5 | Security | ☑ | No issues found — parameterised SQL throughout, payload JSON opaque, no secret material handled. |
+| 6 | Performance & resource management | ☑ | No new findings — the connection-per-call documented trade-off and pooled `OpenAsync` remain acceptable. |
+| 7 | Design-document adherence | ☑ | Operation Tracking Table documented in StoreAndForward but actually lives in SiteRuntime (021); notification non-parking guarantee broken by 018 + 019. |
+| 8 | Code organization & conventions | ☑ | `IStoreAndForwardSiteContext` silently defaults `SiteId` to empty (023) — a configuration hole rather than an entity placement issue. |
+| 9 | Testing coverage | ☑ | The seven new findings have no regression tests in `tests/ScadaLink.StoreAndForward.Tests/` — particularly the notification-doesn't-park invariant (018, 019), the requeue-after-reload-null replication gap (020), and the stop-during-sweep behaviour (024). |
+| 10 | Documentation & comments | ☑ | `CachedCallAttemptOutcome.ParkedMaxRetries` XML doc says "S&F semantics" but the code applies it to notifications too if 018/019 fire — minor drift, captured under 018. The `TrackedOperationId.TryParse` silent-skip behaviour in `NotifyCachedCallObserverAsync` is documented in the source but not on the public observer contract (022). |
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -914,3 +984,474 @@ the StoreAndForward-016 replication) — and pass it to `RaiseActivity` (falling
 `RetryParkedMessageAsync_ActivityUsesMessageRealCategory` and
 `DiscardParkedMessageAsync_ActivityUsesMessageRealCategory` assert the activity carries
 `Notification` / `CachedDbWrite` respectively; both fail against the pre-fix code.
+
+### StoreAndForward-018 — Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/NotificationForwarder.cs:62`–`:69`, `:105`–`:122`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:369`–`:397` |
+
+**Description**
+
+The Component design doc explicitly carves out notifications from the parking lifecycle:
+
+> "Notifications do not park — they are retried at the fixed forward interval until
+> central acks." (`docs/requirements/Component-StoreAndForward.md:47`)
+> "Parking applies only to the external-system-call and cached-database-write
+> categories." (same line)
+
+`NotificationForwarder.DeliverAsync` violates this. When `TryBuildSubmit` fails to
+deserialize the buffered payload — either because `JsonSerializer.Deserialize` throws a
+`JsonException` (line 114) or because it returns `null` (line 119) — `DeliverAsync`
+returns `false` (line 68). On the **retry path** the S&F engine treats handler `false`
+as a permanent failure and **parks the message immediately** via the conditional
+`UpdateMessageIfStatusAsync(... Parked)` write at `StoreAndForwardService.cs:373`–`:385`.
+Result: a notification with a corrupt buffered payload — a row that the engine itself
+treats as opaque ("Payload: Serialized message content…"; `Component-StoreAndForward.md:
+110`) — enters the parked state and surfaces in the central UI's parked-message list
+under the `Notification` category, contradicting the doc's invariant and the resolved
+StoreAndForward-017's "Notification / CachedDbWrite" Retry/Discard category mapping.
+
+The defect is real today: the inline comment on `NotificationForwarder.cs:64` even
+documents the violation ("An unreadable payload cannot be fixed by retrying — park it
+(return false)") as the intended behaviour, but that behaviour is what the design doc
+forbids. Either the doc needs to acknowledge a poison-payload parking exception for
+notifications, or the forwarder needs a different escape hatch (discard? log + drop?
+permanent-failure-as-`true` to clear the buffer?). Today there is no consistent answer
+between code and design.
+
+Additionally, on the **immediate-delivery** path (a fresh enqueue followed by a
+`DeliverAsync` returning `false`), the engine returns `WasBuffered: false` and the row
+is never persisted — so the corrupt-payload "park" only occurs on the retry path, where
+the message has already been buffered (and replicated to the standby). The
+**inconsistency between the two paths** ("not buffered" vs "parked") for the same
+permanent-failure outcome is itself a contract surprise; the resolved StoreAndForward-004
+documents the immediate vs retry asymmetry, but does not anticipate that the retry
+asymmetry will violate a per-category invariant.
+
+**Recommendation**
+
+Choose one consistent reconciliation. Preferred option: change `NotificationForwarder.
+DeliverAsync` to **discard** a corrupt payload rather than park it — delete the
+buffered row directly, log a Site Event Log entry under `Discard`, and return `true` so
+the engine clears the buffer. This preserves the design's "notifications do not park"
+invariant. Alternatives: (a) update the design doc to acknowledge a poison-payload
+parking exception specifically for notifications, and revise the resolved
+StoreAndForward-017 wording; (b) treat `JsonException` as transient (would retry-forever
+on a corrupt payload — bad); (c) introduce a per-category park-allowed flag on the
+engine and gate the retry-path park behind it for the Notification category.
+Add a regression test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.
+cs` asserting that a corrupt-payload notification reaches a terminal **non-Parked**
+state — today the corrupt-payload behaviour is uncovered.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-019 — Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks"
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:229`, `:407`–`:437`; `src/ScadaLink.StoreAndForward/StoreAndForwardOptions.cs:18`; `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:1773`–`:1778`; `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:149`–`:156` |
+
+**Description**
+
+The design doc requires a buffered notification to be retried indefinitely until
+central acks:
+
+> "The **notification** category retries differently: it has no source-entity setting.
+> The site→central forward uses a single fixed retry interval configured in the host
+> `appsettings.json`. … A buffered notification is retried until central acks it; it is
+> not parked on a retry limit (central, once reachable, owns delivery, retry, and
+> parking from that point on)." (`Component-StoreAndForward.md:55`–`:59`)
+
+The current engine cannot honour that. `RetryMessageAsync` enforces parking at
+`message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
+(`StoreAndForwardService.cs:407`); a `MaxRetries == 0` is the documented "no limit"
+escape hatch (now correctly explained by the resolved StoreAndForward-015). But the two
+notification enqueue paths both supply a positive bounded `MaxRetries`:
+
+- `ScriptRuntimeContext.cs:1773`–`:1778` (the `Notify.Send` site script path) calls
+  `EnqueueAsync` without supplying the `maxRetries` argument, so the engine
+  defaults to `StoreAndForwardOptions.DefaultMaxRetries = 50` (`StoreAndForwardOptions.
+  cs:18`). After 50 retry sweeps with central unreachable, the notification is parked.
+- `NotificationDeliveryService.cs:149`–`:156` (the legacy SMTP-style path retained for
+  the central-side `INotificationDeliveryService` callers) passes
+  `smtpConfig.MaxRetries > 0 ? smtpConfig.MaxRetries : null` — `null` falls back to the
+  same 50-retry default, and any positive `smtpConfig.MaxRetries` still bounds the
+  retry budget. Either way, a long central outage parks the notification.
+
+A parked notification cannot be cleared by a central recovery: it stays parked until an
+operator clicks **Retry** in the parked-message UI. The design's invariant — that
+notification delivery converges automatically as soon as central is reachable — is
+broken: an extended central outage requires manual intervention to clear the backlog,
+which is exactly the behaviour the central-only outbox redesign was meant to remove
+from the site.
+
+This is closely related to (but distinct from) StoreAndForward-018: 018 is the
+*permanent-failure-path* parking violation; 019 is the *transient-failure-path*
+parking violation under the engine's normal max-retries policy.
+
+**Recommendation**
+
+Make the notification enqueue paths pass `maxRetries: 0` so the documented "no limit /
+never parked" semantics apply, and guard against regression by adding an integration
+test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` that runs
+a sweep many more times than `DefaultMaxRetries` against an always-failing handler and
+asserts the buffered notification's status stays `Pending` (not `Parked`). A cleaner
+alternative is to special-case the `Notification` category inside
+`RetryMessageAsync`'s max-retries guard (treat it as `MaxRetries == 0` regardless of
+the field value) so the invariant is enforced at the single chokepoint rather than
+relying on every caller to pass the right value — this also fixes the legacy
+`NotificationDeliveryService` path without editing the consumer.
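+
+The chokepoint variant can be sketched as a pure predicate. This is an illustration
+under stated assumptions: `RetryPolicySketch` and `ShouldPark` are hypothetical names,
+and the enum below is a stand-in listing only the three categories this review
+mentions, not the module's actual definition.
+
+```csharp
+using System;
+
+// Stand-in enum for illustration; the real type lives in the S&F module.
+public enum StoreAndForwardCategory { ExternalSystem, CachedDbWrite, Notification }
+
+public static class RetryPolicySketch
+{
+    // Returns true when a failed retry should move the message to Parked.
+    public static bool ShouldPark(StoreAndForwardCategory category, int maxRetries, int retryCount)
+    {
+        // Notifications never park on a retry limit, whatever MaxRetries the caller passed.
+        if (category == StoreAndForwardCategory.Notification)
+            return false;
+
+        // For other categories, MaxRetries == 0 keeps its documented "no limit" meaning.
+        return maxRetries > 0 && retryCount >= maxRetries;
+    }
+}
+```
+
+Keeping the rule in one predicate also makes the invariant cheap to unit-test without
+driving a full retry sweep.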
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-020 — `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:599`–`:616` |
+
+**Description**
+
+The StoreAndForward-016 resolution wired Requeue replication into operator-initiated
+retry. The fix uses a two-step pattern:
+
+```csharp
+public async Task<bool> RetryParkedMessageAsync(string messageId)
+{
+    var success = await _storage.RetryParkedMessageAsync(messageId);   // step 1
+    if (success)
+    {
+        var message = await _storage.GetMessageByIdAsync(messageId);    // step 2 (no txn)
+        var category = message?.Category ?? StoreAndForwardCategory.ExternalSystem;
+        if (message != null)
+        {
+            _replication?.ReplicateRequeue(message);                    // step 3
+        }
+        RaiseActivity("Retry", category, ...);
+    }
+    return success;
+}
+```
+
+The two storage calls are on separate connections with no surrounding transaction. A
+concurrent writer between step 1 (which moved the row from Parked → Pending) and step 2
+(which re-reads the row) can delete or mutate the row:
+
+- An operator issues `DiscardParkedMessageAsync` immediately after retry. The
+  `DiscardParkedMessageAsync` storage call is conditional on `status = Parked`, so it
+  is a no-op (correct), but a sweep that succeeds in delivering the just-requeued
+  row then calls `_storage.RemoveMessageAsync` (unconditional), which deletes it.
+  The race is real within a single retry-sweep cycle because `DefaultRetryInterval = Zero`
+  is the standard test default and the operator action and a sweep tick can overlap.
+- A `RemoveMessageAsync` runs in step 1's wake; `GetMessageByIdAsync` returns null;
+  step 3 (`_replication?.ReplicateRequeue`) is **skipped entirely**, but step 1
+  already requeued the row locally. The standby is now left in `Parked` state while
+  the active node has Pending-then-Deleted, exactly the standby divergence StoreAndForward-016
+  was supposed to fix. (After a subsequent failover, the new active node holds a Parked
+  copy of a discarded message, the same regression 016 already documented.)
+
+The category-fallback path (`StoreAndForwardCategory.ExternalSystem` when message is
+null) silently mislabels the activity log entry too. This is the same defect that
+StoreAndForward-017 fixed for the non-racy path, except that this branch hands back a
+hard-coded fallback because the re-load returned null. The activity log entry is a minor
+side effect; the missing replication is the real defect.
+
+**Recommendation**
+
+Capture the message **once**, before the local Parked → Pending storage update, so the
+replication path has the row in hand even if a concurrent writer deletes it
+afterwards:
+
+```csharp
+var message = await _storage.GetMessageByIdAsync(messageId);  // before the update
+if (message == null || message.Status != StoreAndForwardMessageStatus.Parked)
+    return false;
+
+var success = await _storage.RetryParkedMessageAsync(messageId);
+if (!success) return false;
+
+// `message` was the parked row; the active node just wrote it back to Pending with
+// retry_count = 0 — construct the replicated state from those known mutations.
+message.Status = StoreAndForwardMessageStatus.Pending;
+message.RetryCount = 0;
+message.LastError = null;
+message.LastAttemptAt = null;
+_replication?.ReplicateRequeue(message);
+RaiseActivity("Retry", message.Category, $"Parked message {messageId} moved back to queue");
+return true;
+```
+
+Add a regression test in `StoreAndForwardReplicationTests` that simulates the
+delete-between-update-and-reload race and asserts the `Requeue` replication
+operation is still emitted with the correct category.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-021 — Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `docs/requirements/Component-StoreAndForward.md:21`, `:49`–`:51`, `:77`–`:87`, `:108`, `:114`; `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:37`; `src/ScadaLink.StoreAndForward/` (whole module) |
+
+**Description**
+
+Component-StoreAndForward.md repeatedly assigns the **Operation Tracking Table** to
+this component:
+
+- **Responsibilities** (line 21): "Maintain a site-local **operation tracking table**
+  holding one row per `TrackedOperationId` for cached calls … the authoritative status
+  record consulted by `Tracking.Status(id)`."
+- **Message Lifecycle** (lines 49–51): "the operation tracking table is the status
+  record and the S&F buffer is purely the retry mechanism. A cached call that succeeds
+  on its first immediate attempt is written directly as a terminal `Delivered` tracking
+  row and never enters the S&F buffer."
+- **Operation Tracking Table** section (lines 77–87): "Alongside the S&F buffer DB,
+  each site node holds a **site-local operation tracking table** in SQLite. … Each row
+  records the operation kind (`TrackedOperationKind`) …"
+
+The actual implementation lives outside this module: `src/ScadaLink.SiteRuntime/
+Tracking/OperationTrackingStore.cs` (and `IOperationTrackingStore`, `OperationTrackingOptions`).
+The StoreAndForward project contains no references to the tracking store, owns no
+`operation_tracking` table, and `StoreAndForwardService.NotifyCachedCallObserverAsync`
+is only a hook handing telemetry context to an `ICachedCallLifecycleObserver` — the
+audit bridge wired in `ScadaLink.AuditLog`. The S&F module is **not** the table's
+owner; SiteRuntime is.
+
+This is a real design-doc drift, not a code defect, and is flagged explicitly in the
+brief's "Module-specific notes". The drift matters because the design doc's
+discussion of the lifecycle — "immediate success writes a terminal Delivered tracking
+row directly here", "operator discard sets terminal `Discarded`", "central never
+mutates the mirror row directly" — places coordination responsibilities on the wrong
+component. A reader looking for the source of truth for `Tracking.Status(id)` would
+read `Component-StoreAndForward.md` and search `src/ScadaLink.StoreAndForward/` in
+vain. The doc also lists Site Call Audit / Audit Log telemetry-emission as a S&F
+responsibility (line 22), but the emission actually happens via the `AuditLog` site
+component subscribing to `ICachedCallLifecycleObserver`.
+
+**Recommendation**
+
+Reconcile the doc with the code. The simplest fix is doc-side: update
+Component-StoreAndForward.md to scope its responsibilities back to the retry
+mechanism + replication + parked-message management, and add a cross-reference to a
+new (or existing) component doc for Operation Tracking (Component-SiteRuntime.md, or
+a new Component-OperationTracking.md). The code is internally consistent — the audit
+bridge subscribes to the observer hook, the SiteRuntime store writes the rows, the S&F
+engine emits attempt telemetry on the cached-call hot path — but the design doc is
+several refactors out of date. The hierarchical map should be:
+
+- `Component-StoreAndForward.md` → S&F buffer + Replication + Parked-message
+  management + Notification forwarding to central + cached-call telemetry **hook**.
+- New doc / SiteRuntime doc → Operation Tracking Table semantics and lifecycle.
+- `Component-SiteCallAudit.md` / `Component-AuditLog.md` → telemetry emission +
+  central-side mirror.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-022 — `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:484`–`:515` |
+
+**Description**
+
+`NotifyCachedCallObserverAsync` (the per-attempt observer notifier wired by the M3
+Bundle E rollout) bails out with no audit emission when
+`TrackedOperationId.TryParse(message.Id, out var trackedId)` returns false
+(`StoreAndForwardService.cs:510`–`:515`). The inline comment justifies the behaviour as
+back-compat for "pre-M3 message (random GUID-N id from S&F itself, no
+TrackedOperationId threaded in)", but the documented contract is broken in two ways:
+
+1. **Silent dropping of every audit row, not just the first one.** The skip means no
+   `Attempted` row, no `CachedResolve` terminal row, no audit trail at all for that
+   operation's S&F lifecycle — yet the rest of the system (script trust boundary,
+   parked-message UI, etc.) still treats the operation as audit-tracked. The drop is
+   not surfaced via a metric, log warning (the path is a silent `return`), or counter,
+   so a misconfigured caller bypasses the audit hot path with zero feedback.
+
+2. **The contract is hidden in field-level XML.** The `ICachedCallLifecycleObserver`
+   public interface contract (defined in `ScadaLink.Commons`) does not document that
+   the observer will be silently skipped when the underlying S&F message id is not a
+   GUID. A consumer reading the interface contract reasonably expects every cached-call
+   attempt to surface — the audit pipeline depends on it. The silent-drop is an
+   implementation detail of the S&F bridge that should be either lifted onto the
+   contract or removed.
+
+The engine itself mints GUID-N ids via `Guid.NewGuid().ToString("N")` (line 224), which
+`TrackedOperationId.TryParse` accepts, so the skip path is unreachable for engine-minted
+ids. It is reachable only for callers that supply their own `messageId` argument with a
+non-GUID format. The current callers (`NotificationOutbox` enqueue path with
+NotificationId, cached-call enqueue path with `TrackedOperationId.ToString()`) all
+supply GUID-shaped ids. The defect is latent — a future caller passing a non-GUID id
+would silently bypass audit.
+
+**Recommendation**
+
+Two options. The cheap fix: change the skip to a `_logger.LogWarning` with the offending
+id so a misconfigured caller is observable, and update the
+`ICachedCallLifecycleObserver` XML doc to mention the "non-GUID id → no telemetry"
+contract explicitly. The more correct fix: emit a still-audited row for the
+non-GUID case (e.g. synthesise a `TrackedOperationId` from the underlying id, or emit a
+distinguished "tracking-id-missing" audit row) so the audit pipeline never has silent
+holes. Add a regression test in `CachedCallAttemptEmissionTests` capturing the chosen
+contract — the existing
+`Attempt_MessageIdNotAGuid_NoObserverNotification` test pins today's silent-skip; if
+the fix is "log + skip", that test should be updated to also assert the log emission;
+if the fix is "emit anyway", the test should be replaced.
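+
+The "log + skip" option could look roughly like the sketch below. It is not the
+actual `StoreAndForwardService` code: `ObserverBridgeSketch`, the `warn` delegate
+(standing in for `_logger.LogWarning`), and the `notifyObserver` delegate are
+hypothetical, and `Guid.TryParseExact` with the `"N"` format is only an assumed
+approximation of `TrackedOperationId.TryParse`.
+
+```csharp
+using System;
+
+// Hypothetical stand-in for the S&F observer bridge; all names are illustrative.
+public sealed class ObserverBridgeSketch
+{
+    private readonly Action<string> _warn;          // stand-in for _logger.LogWarning
+    private readonly Action<Guid> _notifyObserver;  // stand-in for the observer call
+
+    public ObserverBridgeSketch(Action<string> warn, Action<Guid> notifyObserver)
+    {
+        _warn = warn;
+        _notifyObserver = notifyObserver;
+    }
+
+    // Returns true when the observer was notified, false when the id was skipped.
+    public bool Notify(string messageId)
+    {
+        // Assumption: GUID-N parsing approximates TrackedOperationId.TryParse.
+        if (!Guid.TryParseExact(messageId, "N", out var trackedId))
+        {
+            // The skip is now observable instead of a silent return.
+            _warn($"Cached-call audit skipped: message id '{messageId}' is not a tracked operation id.");
+            return false;
+        }
+
+        _notifyObserver(trackedId);
+        return true;
+    }
+}
+```
+
+Returning a `bool` is only a convenience for the sketch; it gives a regression
+test something to assert on alongside the captured warning.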
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-023 — `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/ServiceCollectionExtensions.cs:43`–`:53`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:99`, `:524` |
+
+**Description**
+
+`AddStoreAndForward`'s service-collection factory resolves the optional
+`IStoreAndForwardSiteContext` and falls back to `string.Empty` when not registered:
+
+```csharp
+var siteContext = sp.GetService<IStoreAndForwardSiteContext>();
+var siteId = siteContext?.SiteId ?? string.Empty;
+return new StoreAndForwardService(storage, options, logger, replication,
+    cachedCallObserver, siteId);
+```
+
+The constructor's parameter is even defaulted to `""`. The empty-string `siteId` flows
+straight into every emitted `CachedCallAttemptContext.SourceSite`, which the central
+audit pipeline uses as part of the `(SourceSite, TrackedOperationId)` correlation key.
+A host that registers an `ICachedCallLifecycleObserver` (the audit observer wired by
+`AddAuditLog`) but forgets to register `IStoreAndForwardSiteContext` will produce a
+stream of telemetry rows with `SourceSite = ""` — the central audit mirror cannot
+distinguish them by site, and the central-site routing of
+`RetryParkedOperation`/`DiscardParkedOperation` commands keyed on `SourceSite` will
+fail to find the owning site.
+
+The Host's `IStoreAndForwardSiteContext` adapter and the audit observer registration
+are wired in lock-step, so the current configuration is correct, but the silent
+empty-string fallback is a contract hazard for future hosts (CLI test harness, second
+site cluster topology, etc.) and for tests that wire one without the other.
+
+**Recommendation**
+
+Make the contract explicit: when `cachedCallObserver` is non-null, require
+`IStoreAndForwardSiteContext` to be registered — throw an `InvalidOperationException`
+with a clear "Audit observer registered without a site context — register
+IStoreAndForwardSiteContext" message at construction time. When the audit observer is
+absent (no `AddAuditLog`), keep the empty-string default since `_siteId` is unused.
+Alternatively, change `siteId` from a parameter to a `Func<string>` resolved lazily
+from the service provider so a late-registered context still takes effect.
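+
+A minimal sketch of the fail-fast variant, assuming the factory can tell whether
+an observer was resolved and passes `null` when no site context is registered.
+`SiteIdGuardSketch` and `ResolveSiteId` are hypothetical names, not existing
+members of the module.
+
+```csharp
+using System;
+
+public static class SiteIdGuardSketch
+{
+    // observerRegistered: whether an ICachedCallLifecycleObserver was resolved.
+    // siteId: value read from IStoreAndForwardSiteContext, or null when none is registered.
+    public static string ResolveSiteId(bool observerRegistered, string? siteId)
+    {
+        if (observerRegistered && siteId is null)
+        {
+            throw new InvalidOperationException(
+                "Audit observer registered without a site context. Register IStoreAndForwardSiteContext.");
+        }
+
+        // Without an observer the site id is unused, so the empty default is kept.
+        return siteId ?? string.Empty;
+    }
+}
+```
+
+Whether a registered context that returns an empty `SiteId` should also be
+rejected is left open here; the sketch only covers the missing-registration case.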
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-024 — `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:122`–`:127`, `:136`–`:143`, `:303`–`:329` |
+
+**Description**
+
+`StartAsync` arms `_retryTimer` with `_ => _ = RetryPendingMessagesAsync()` (line 123).
+The `_ =` discards the returned `Task`, so when the timer fires the sweep runs **fire
+and forget** on a thread-pool thread. `StopAsync` (lines 136–143) disposes the timer:
+
+```csharp
+if (_retryTimer != null)
+{
+    await _retryTimer.DisposeAsync();
+    _retryTimer = null;
+}
+```
+
+`Timer.DisposeAsync()` returns once any in-flight timer **callback** has completed —
+but the timer callback in this service is a one-line `_ = RetryPendingMessagesAsync()`
+that synchronously returns immediately and leaves the actual sweep running on the
+thread pool. So `Timer.DisposeAsync` does not wait for the sweep, only for the
+synchronous `_ = ...` discarding step. `StopAsync` returns while a sweep is potentially
+still running, touching `_storage` (which the host will dispose), `_replication`
+(which the host will tear down), and `_cachedCallObserver` (whose downstream gRPC
+channel the host will shut down).
+
+The host shutdown sequence (`AkkaHostedService`) tears down the actor system and the
+DI container after this service's `StopAsync` completes — meaning a sweep that runs
+past `StopAsync` can call into disposed `SqliteConnection`s (yielding
+`ObjectDisposedException`, caught by the sweep's outer `try/catch` as a log) or, more
+seriously, push a replication operation into a half-disposed Akka actor pipeline and
+trigger noisy dead-letter warnings during a clean shutdown.
+
+The race window is small (the sweep typically finishes in <100 ms in tests) but it is
+real, particularly when shutting down a site under load with a non-empty buffer.
+
+**Recommendation**
+
+Track in-flight sweep tasks and `await` them in `StopAsync`:
+
+```csharp
+private Task? _currentSweep;
+
+public async Task StopAsync()
+{
+    if (_retryTimer != null)
+    {
+        await _retryTimer.DisposeAsync();
+        _retryTimer = null;
+    }
+    if (_currentSweep is { } sweep)
+    {
+        try { await sweep; } catch { /* logged inside RetryPendingMessagesAsync */ }
+    }
+}
+```
+
+Change the timer callback to:
+
+```csharp
+_retryTimer = new Timer(_ => _currentSweep = RetryPendingMessagesAsync(), ...);
+```
+
+Add a `CancellationTokenSource` so a long sweep can be cooperatively aborted on stop;
+plumb the cancellation token into `_storage` / `_replication` / `_cachedCallObserver`
+calls. Add a regression test in `StoreAndForwardServiceTests` that calls `StopAsync`
+mid-sweep and asserts no further storage activity occurs after `StopAsync` returns.
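+
+The cancellation part of this recommendation might be combined with the sweep
+tracking as sketched below. This is an illustration under assumptions, not the
+service's code: `SweepLifetimeSketch` is hypothetical, the injected `sweep`
+delegate stands in for a token-aware `RetryPendingMessagesAsync`, and skipping a
+tick while the previous sweep is still running is a design choice of the sketch
+rather than documented behaviour.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+public sealed class SweepLifetimeSketch
+{
+    private readonly CancellationTokenSource _stopCts = new();
+    private readonly Func<CancellationToken, Task> _sweep; // stand-in for the retry sweep
+    private readonly object _gate = new();
+    private Timer? _timer;
+    private Task _currentSweep = Task.CompletedTask;
+
+    public SweepLifetimeSketch(Func<CancellationToken, Task> sweep) => _sweep = sweep;
+
+    public void Start(TimeSpan interval) =>
+        _timer = new Timer(_ => OnTick(), null, interval, interval);
+
+    private void OnTick()
+    {
+        lock (_gate)
+        {
+            // Skip the tick when stopping or when the previous sweep is still running.
+            if (_stopCts.IsCancellationRequested || !_currentSweep.IsCompleted) return;
+            _currentSweep = _sweep(_stopCts.Token);
+        }
+    }
+
+    public async Task StopAsync()
+    {
+        _stopCts.Cancel();                  // ask a running sweep to wind down
+        if (_timer is { } timer)
+        {
+            await timer.DisposeAsync();     // no callback is running after this completes
+            _timer = null;
+        }
+
+        Task pending;
+        lock (_gate) { pending = _currentSweep; }
+        try { await pending; }
+        catch { /* assumed to be logged inside the sweep itself */ }
+    }
+}
+```
+
+Holding the lock while assigning `_currentSweep` keeps `StopAsync` from reading a
+stale task if a tick races with shutdown.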
+
+**Resolution**
+
+_Unresolved._
+
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.TemplateEngine` |
 | Design doc | `docs/requirements/Component-TemplateEngine.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 6 |

 ## Summary

@@ -48,8 +48,49 @@ Both are limited-impact (nested compositions are the less common case and there
 is design-time visibility) but represent genuine drift from the recursive-nesting
 design promise.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Re-reviewed the whole module against all ten checklist categories at commit
+`1eb6e97`. All sixteen prior findings remain closed. Six new findings surfaced,
+clustered in three themes:
+
+1. **Revision-hash / diff coverage gaps** — `RevisionHashService` and
+   `DiffService` both omit `Attributes.Description`, `Alarms.Description`, and
+   the entire `Connections` map. A change that only edits an attribute/alarm
+   description, or a data-connection endpoint, will deploy a new flattened
+   configuration but be invisible to staleness detection and the diff view —
+   the very gap the revision hash was introduced to close (TemplateEngine-017,
+   TemplateEngine-018). Severity Medium/High.
+
+2. **TemplateEngine-013 fix only partially applied** — the `0`-as-no-parent
+   sentinel was removed from `CycleDetector` but `TemplateResolver
+   .BuildInheritanceChain` still uses `currentId != 0` / `ParentTemplateId ?? 0`.
+   A template with a real Id of 0 is treated as "no template" and silently
+   excluded from its own inheritance chain, so every flatten/resolve through
+   that template loses its members. The fix from `adb5e75` did not propagate
+   into the resolver (TemplateEngine-019). Severity Medium.
+
+3. **Audit log integrity / drift** — every `Create` audit entry in
+   `TemplateService` and `SharedScriptService` is written with `EntityId = "0"`
+   *before* `SaveChangesAsync` populates the real key, so the audit trail loses
+   the link back to the created row (TemplateEngine-020); `MoveTemplateAsync`
+   never validates folder-acyclicity / sibling-name-uniqueness even though
+   `TemplateFolderService.MoveFolderAsync` does (TemplateEngine-021); and the
+   advertised `IS NOT_locked & not_LockedInDerived & not_IsInherited`
+   self-reference loop is intact, but `LockEnforcer.ValidateLockChange` permits
+   downgrading a `LockedInDerived` flag on a base template — there is no
+   equivalent of the once-locked-stays-locked rule for the `LockedInDerived`
+   flag (TemplateEngine-022). Severity Low/Medium.
+
+Themes: hash/diff drift from the deployment payload, asymmetric application of
+the duplicate-Id / null-sentinel fix from the last batch, and audit-write
+ordering inconsistency between `TemplateService` (logs then saves) and
+`InstanceService` (saves then logs).
+
 ## Checklist coverage

+_Re-review (2026-05-17, `39d737e`):_
+
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
 | 1 | Correctness & logic bugs | ✓ | Prior bugs (001–005, 013) all resolved and verified. Re-review 2026-05-17 found two new nested-composition defects: rename does not cascade (TemplateEngine-015), composed-script `ParentPath` always empty (TemplateEngine-016). |
@@ -63,6 +104,21 @@ design promise.
 | 9 | Testing coverage | ✓ | Tests exist for every file, but the dead/placeholder paths (TemplateEngine-004, 005) and deep nesting (TemplateEngine-001) are not exercised. |
 | 10 | Documentation & comments | ✓ | Mostly accurate; a misleading converter comment (TemplateEngine-011) and a stale enum/doc mismatch (TemplateEngine-012). |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | New: `TemplateResolver.BuildInheritanceChain` still uses the `0`-as-no-parent sentinel that was removed from `CycleDetector` in `adb5e75` (TemplateEngine-019). `TemplateService.MoveTemplateAsync` performs no folder-acyclicity or sibling-name-uniqueness check (TemplateEngine-021). |
+| 2 | Akka.NET conventions | ✓ | No actors. `AddTemplateEngineActors` is still an empty placeholder. Nothing to assess. |
+| 3 | Concurrency & thread safety | ✓ | Services remain stateless, scoped per request. No new findings. |
+| 4 | Error handling & resilience | ✓ | `Result<T>` used consistently. `MoveTemplateAsync` is missing target-folder validation found elsewhere — see TemplateEngine-021. |
+| 5 | Security | ✓ | No new findings. Forbidden-API limitations still tracked under the closed TemplateEngine-006 (resolved as advisory). |
+| 6 | Performance & resource management | ✓ | `MergeHiLoConfig` / `PrefixTriggerAttribute` allocate a `MemoryStream` + `Utf8JsonWriter` + `Encoding.UTF8.GetString` per call — fine for the per-flatten frequency, no finding. No new resource leaks. |
+| 7 | Design-document adherence | ✓ | New drift: `RevisionHashService` and `DiffService` both omit `Description` fields and the `Connections` map from the deployable payload (TemplateEngine-017, TemplateEngine-018), so the revision hash and diff do not reflect every committed deployment input. |
+| 8 | Code organization & conventions | ✓ | Audit-write ordering asymmetric: `TemplateService.Create*` and `SharedScriptService.CreateSharedScriptAsync` log with `EntityId = "0"` before `SaveChangesAsync`, while `InstanceService.CreateInstanceAsync` saves first then logs with the real Id (TemplateEngine-020). |
+| 9 | Testing coverage | ✓ | New finding paths exercised in part — `RevisionHashServiceTests` does not assert that Description / Connections changes change the hash; no test for `BuildInheritanceChain` with a real Id of 0; no test for `MoveTemplateAsync` rejecting a target folder. |
+| 10 | Documentation & comments | ✓ | New: `LockEnforcer.ValidateLockChange` is documented as enforcing the once-locked-stays-locked rule but has no equivalent for `LockedInDerived` (TemplateEngine-022). |
+
 ## Findings

 ### TemplateEngine-001 — Deeply nested composed members are dropped during flattening
@@ -780,3 +836,313 @@ passes the enclosing module's `prefix` — and the `ScriptScope` now sets
 `SelfPath = "Outer.Inner"` pairs with `ParentPath = "Outer"` and `Parent.X`
 resolves against the real parent module. Regression test:
 `Flatten_NestedComposedScript_ScopeCarriesCorrectParentPath`.
+
+### TemplateEngine-017 — Revision hash and diff both ignore `Description` and `Connections`, defeating staleness detection for real deployment changes
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:128`, `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:156`, `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:42`, `src/ScadaLink.TemplateEngine/Flattening/DiffService.cs:110`, `src/ScadaLink.TemplateEngine/Flattening/DiffService.cs:118` |
+
+**Description**
+
+The design states the revision hash is "computed from the resolved content" and
+backs both staleness detection and diff correlation. The `Hashable*` records,
+however, omit fields that are part of the deployed `FlattenedConfiguration`:
+
+- `HashableAttribute` skips `ResolvedAttribute.Description` and the resolved
+  connection name/protocol (`BoundDataConnectionName`/`BoundDataConnectionProtocol`).
+- `HashableAlarm` skips `ResolvedAlarm.Description`.
+- The top-level `HashableConfiguration` skips the entire `Connections` map —
+  the `ConnectionConfig` per connection name carries the protocol, the primary
+  endpoint JSON, the backup endpoint JSON, and the failover retry count, all
+  of which travel in the deployment package.
+
+The same gaps exist in `DiffService.AttributesEqual` and `AlarmsEqual`, and
+`DiffService` has no entry for `Connections` at all. Concrete consequences:
+
+1. A Design user edits an attribute's `Description` (an authoring-time
+   concern) → the flattened payload changes → no hash change, no diff entry.
+2. A Deployment user edits the primary endpoint JSON of a data connection
+   bound to an instance → the deployment package now ships a different
+   `ConnectionConfig` → no hash change, no diff entry, so the staleness
+   indicator says the instance is up to date and the diff view shows no
+   pending change. The site quietly receives different connection
+   credentials/host on the next redeploy.
+
+The Description case is mostly cosmetic. The `Connections` case is a deployment
+correctness gap — staleness detection is the mechanism that tells operators
+"this instance has drifted from its template and needs redeployment", and a
+connection-endpoint change is exactly the kind of drift it must catch.
+
+**Recommendation**
+
+Add `Description` to `HashableAttribute` and `HashableAlarm` (alphabetically
+placed, per the determinism contract) and to `AttributesEqual` / `AlarmsEqual`.
+Add a `HashableConnections : SortedDictionary<string, HashableConnection>`
+field (or equivalent) to `HashableConfiguration` that includes Protocol,
+ConfigurationJson, BackupConfigurationJson, and FailoverRetryCount, and mirror
+it in `DiffService`. Add tests:
+`Hash_DescriptionEditChangesHash`,
+`Hash_ConnectionEndpointEditChangesHash`,
+`Diff_ConnectionEndpointEdit_ProducesEntry`.
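+
+As a rough illustration of the `Connections` part, the sketch below hashes a
+connection map deterministically. The field names come from the recommendation
+above; the record shape, the `ConnectionHashSketch` helper, and the choice of
+JSON plus SHA-256 are assumptions and may not match how `RevisionHashService`
+actually serialises its input.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Security.Cryptography;
+using System.Text;
+using System.Text.Json;
+
+// Assumed shape; properties are declared alphabetically per the determinism contract.
+public sealed record HashableConnection(
+    string? BackupConfigurationJson,
+    string ConfigurationJson,
+    int FailoverRetryCount,
+    string Protocol);
+
+public static class ConnectionHashSketch
+{
+    public static string Hash(IDictionary<string, HashableConnection> connections)
+    {
+        // Ordinal sorting keeps key order independent of culture and insertion order.
+        var sorted = new SortedDictionary<string, HashableConnection>(
+            connections, StringComparer.Ordinal);
+
+        var json = JsonSerializer.Serialize(sorted);
+        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(json)));
+    }
+}
+```
+
+With this shape, editing only an endpoint JSON string changes the serialised
+text and therefore the hash, which is the property the proposed
+`Hash_ConnectionEndpointEditChangesHash` test would pin.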
+
+**Resolution**
+
+_Unresolved._
+
+### TemplateEngine-018 — `DiffService` reports no entries for added/removed/changed connections
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.TemplateEngine/Flattening/DiffService.cs:19` |
+
+**Description**
+
+`DiffService.ComputeDiff` returns a `ConfigurationDiff` with `AttributeChanges`,
+`AlarmChanges`, and `ScriptChanges` only. The `FlattenedConfiguration` it diffs
+also carries a `Connections` dictionary (per-attribute connection bindings
+collapsed to one connection-config-per-name during flattening — see
+`FlatteningService:99-118`), and this dictionary materially affects what the
+site receives at deploy time. A connection added to or removed from the
+flattened configuration (e.g., an instance gains its first data-sourced
+attribute, or its last binding is cleared) produces no diff entry. Operators
+inspecting the diff view to decide whether to redeploy see "no changes" when
+the site will in fact receive a structurally different deployment package.
+
+This is the diff-view counterpart of TemplateEngine-017's hash gap; they are
+separable because the `ConfigurationDiff` data shape would have to be extended
+even after the hash is fixed.
+
+**Recommendation**
+
+Add `ConnectionChanges` (or equivalent) to `ConfigurationDiff` in `Commons`
+(`Types/Flattening/ConfigurationDiff.cs`), populate it in
+`DiffService.ComputeDiff` via a new `ComputeEntityDiff` over
+`Connections.Keys`, and add a `ConnectionsEqual` helper. Update the Central UI
+diff display to render the new section. Add regression tests:
+`Diff_NewConnectionBinding_ReportedAsAdded`,
+`Diff_ClearedBinding_ReportedAsRemoved`,
+`Diff_EndpointEdit_ReportedAsChanged`.
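+
+The added / removed / changed classification could be sketched as follows.
+`ChangeKind`, `ConnectionDiffSketch`, and the `connectionsEqual` delegate are
+hypothetical; the real change would reuse the module's own `ComputeEntityDiff`
+and `ConfigurationDiff` types, whose signatures are not shown in this review.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+public enum ChangeKind { Added, Removed, Changed }
+
+public static class ConnectionDiffSketch
+{
+    public static List<(string Name, ChangeKind Kind)> Diff<T>(
+        IReadOnlyDictionary<string, T> oldConns,
+        IReadOnlyDictionary<string, T> newConns,
+        Func<T, T, bool> connectionsEqual)
+    {
+        var changes = new List<(string Name, ChangeKind Kind)>();
+
+        // Walk the union of names in a stable order so output is deterministic.
+        foreach (var name in oldConns.Keys.Union(newConns.Keys)
+                                     .OrderBy(n => n, StringComparer.Ordinal))
+        {
+            var inOld = oldConns.TryGetValue(name, out var before);
+            var inNew = newConns.TryGetValue(name, out var after);
+
+            if (!inOld) changes.Add((name, ChangeKind.Added));
+            else if (!inNew) changes.Add((name, ChangeKind.Removed));
+            else if (!connectionsEqual(before!, after!)) changes.Add((name, ChangeKind.Changed));
+        }
+
+        return changes;
+    }
+}
+```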
+
+**Resolution**
+
+_Unresolved._
+
+### TemplateEngine-019 — `TemplateResolver.BuildInheritanceChain` still uses the `0`-as-no-parent sentinel that was removed from `CycleDetector`
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.TemplateEngine/TemplateResolver.cs:117`, `src/ScadaLink.TemplateEngine/TemplateResolver.cs:123` |
+
+**Description**
+
+TemplateEngine-013 removed the `0`-as-no-parent sentinel from `CycleDetector`
+(`adb5e75`) — `ParentTemplateId` is `int?`, so a missing value means "no
+parent" and a real Id of 0 must walk the chain like any other node. The fix
+did not propagate into `TemplateResolver.BuildInheritanceChain`:
+
+```csharp
+var currentId = templateId;
+...
+while (currentId != 0 && lookup.TryGetValue(currentId, out var current))
+{
+    ...
+    currentId = current.ParentTemplateId ?? 0;
+}
+```
+
+The seeded `currentId = templateId` is treated as "no template" when
+`templateId == 0`, so `ResolveAllMembers(0, ...)` returns an empty chain even
+when a template with Id 0 exists. Walking up, `current.ParentTemplateId ?? 0`
+followed by the `currentId != 0` test maps a real parent of Id 0 onto the "no
+parent" exit, silently truncating the chain. The chain is the input to every
+flatten/resolve/validate path through `FlatteningService`, `TemplateService
+.ResolveTemplateMembersAsync`, and `InstanceService.SetAlarmOverrideAsync` — a
+template with a real Id of 0 (which EF identity sequences avoid in production
+but which any in-memory test or import-staging path can produce) silently
+loses its inheritance contribution. The duplicate-tolerant `BuildLookup` added
+in `adb5e75` is used here, so the test gap is one half of the same fix.
+
+**Recommendation**
+
+Switch the walk to the `int?` form, mirroring `CycleDetector
+.DetectInheritanceCycle`:
+
+```csharp
+int? currentId = templateId;
+while (currentId.HasValue && lookup.TryGetValue(currentId.Value, out var current))
+{
+    if (!visited.Add(currentId.Value)) break;
+    chain.Add(current);
+    currentId = current.ParentTemplateId;
+}
+```
+
+Add regression test
+`TemplateResolverTests.BuildInheritanceChain_RealIdZero_StillResolves`.
+
+**Resolution**
+
+_Unresolved._
+
+### TemplateEngine-020 — `Create*` audit entries are written with `EntityId = "0"` before `SaveChangesAsync` populates the real key
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:77`, `src/ScadaLink.TemplateEngine/TemplateService.cs:256`, `src/ScadaLink.TemplateEngine/TemplateService.cs:407`, `src/ScadaLink.TemplateEngine/TemplateService.cs:556`, `src/ScadaLink.TemplateEngine/TemplateService.cs:734`, `src/ScadaLink.TemplateEngine/SharedScriptService.cs:71` |
+
+**Description**
+
+`IAuditService.LogAsync` takes a `string entityId` argument and `TemplateService
+.CreateTemplateAsync`, `AddAttributeAsync`, `AddAlarmAsync`, `AddScriptAsync`,
+`AddCompositionAsync`, and `SharedScriptService.CreateSharedScriptAsync` all
+hard-code it to `"0"`:
+
+```csharp
+await _repository.AddTemplateAsync(template, cancellationToken);
+await _auditService.LogAsync(user, "Create", "Template", "0", name, template, cancellationToken);
+await _repository.SaveChangesAsync(cancellationToken);
+```
+
+EF Core populates `template.Id` only when `SaveChangesAsync` runs, but the
+audit row is written and queued in the change tracker *before* the save with a
+literal `"0"`. The single save then commits the audit row with `EntityId =
+"0"` and the new template/attribute/alarm/script with its real Id. Every
+"Create" entry in the audit trail therefore loses the link back to the row it
+describes — searching the audit log by entity id of a created row finds
+nothing, only the subsequent Update/Delete rows are findable.
+
+Note that `InstanceService.CreateInstanceAsync` uses the opposite order
+(`AddInstanceAsync` → `SaveChangesAsync` → `LogAsync(... instance.Id ...)`,
+lines 90–94) and gets the real Id. The asymmetry is the smoking gun: half the
+module audits Create correctly, half does not.
+
+A separate consideration: a single `SaveChangesAsync` for entity and audit row
+is atomic, but the real key does not exist until that save runs. The fix is to
+save the entity first, then log, then save the audit row (two-phase, as
+`InstanceService` and `TemplateService.DeleteTemplateAsync` already do).
+
+**Recommendation**
+
+For every `Create*` path in `TemplateService` and `SharedScriptService`, swap
+the order to `AddXxxAsync` → `SaveChangesAsync` → `LogAsync(... newId
+.ToString() ...)` → `SaveChangesAsync`, matching `InstanceService
+.CreateInstanceAsync` and `TemplateService.DeleteTemplateAsync`. Add regression
+tests that assert the `EntityId` recorded on the audit row matches the
+created row's Id.
+
+**Resolution**
+
+_Unresolved._
+
+### TemplateEngine-021 — `MoveTemplateAsync` skips folder cycle and sibling-name-collision validation
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:173` |
+
+**Description**
+
+`TemplateService.MoveTemplateAsync` validates only that the target folder
+exists, then unconditionally assigns `template.FolderId = newFolderId`.
+`TemplateFolderService.MoveFolderAsync` (the sibling for folder-to-folder
+moves) by contrast validates:
+
+- the target folder is not the folder being moved (self-parent);
+- the target folder is not a descendant of the folder being moved (cycle);
+- no sibling at the destination has the same name (case-insensitive).
+
+The first two are folder-graph concerns and don't apply to template moves, but
+the third does — two templates with the same name in the same folder is the
+authoring-time scenario the design's "naming collisions are design-time
+errors" rule was meant to cover. Today, two templates named "Pump" can be
+moved into the same folder with no error, breaking any UI that locates a
+template by `(FolderId, Name)` and producing a worse user experience than the
+folder-rename path, which does check.
+
+Separately, the design doc states folders carry "no semantic meaning for
+template resolution, flattening, validation, or inheritance" — so this is
+strictly a UI-organization invariant, but it is documented elsewhere
+(`TemplateFolderService` enforces it for folders) and the asymmetry is
+surprising.
+
+**Recommendation**
+
+After resolving the target folder, run a sibling-name-uniqueness check across
+templates with the same `FolderId == newFolderId` and the same `Name`
+(case-insensitive), mirroring `TemplateFolderService.MoveFolderAsync` lines
+130–142. Add a regression test `MoveTemplate_NameCollisionAtDestination_Fails`.
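+
+A minimal sketch of the sibling-name check, assuming templates can be
+enumerated with their folder and name. `TemplateRow` and
+`MoveTemplateGuardSketch` are hypothetical shapes for illustration; the real
+check would query through the module's repository and return its `Result<T>`.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+public sealed record TemplateRow(int Id, int? FolderId, string Name); // hypothetical shape
+
+public static class MoveTemplateGuardSketch
+{
+    // Returns an error message when the move would collide, otherwise null.
+    public static string? CheckSiblingName(
+        IEnumerable<TemplateRow> allTemplates, TemplateRow moving, int? newFolderId)
+    {
+        var clash = allTemplates.Any(t =>
+            t.Id != moving.Id &&                 // the template itself is not a sibling
+            t.FolderId == newFolderId &&
+            string.Equals(t.Name, moving.Name, StringComparison.OrdinalIgnoreCase));
+
+        return clash
+            ? $"A template named '{moving.Name}' already exists in the target folder."
+            : null;
+    }
+}
+```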
+
+**Resolution**
+
+_Unresolved._
+
+### TemplateEngine-022 — `LockEnforcer.ValidateLockChange` enforces "once-locked-stays-locked" for `IsLocked` but not for `LockedInDerived`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.TemplateEngine/LockEnforcer.cs:109`, `src/ScadaLink.TemplateEngine/TemplateService.cs:323`, `src/ScadaLink.TemplateEngine/TemplateService.cs:476`, `src/ScadaLink.TemplateEngine/TemplateService.cs:623` |
+
+**Description**
+
+`LockEnforcer.ValidateLockChange` documents and enforces the rule that an
+already-locked member cannot be unlocked downstream (`originalIsLocked &&
+!proposedIsLocked` → error). The class-level XML doc describes locking as
+covering both fields:
+
+> Locking rules: ... Once locked, a member stays locked — it cannot be
+> unlocked downstream.
+
+But the `LockedInDerived` field has no equivalent guard. `UpdateAttributeAsync`,
+`UpdateAlarmAsync`, and `UpdateScriptAsync` all let the proposed
+`LockedInDerived` flag flip in either direction on a base-template member.
+This is a subtle correctness gap with two failure modes:
+
+1. A base template originally marked an attribute `LockedInDerived = true` to
+   protect derived templates from overriding it. A subsequent edit can clear
+   the flag while leaving existing derived-template overrides intact — those
+   overrides become legal retroactively even though the design intent was
+   that they were always blocked.
+2. The XML doc on `LockEnforcer` and the class summary on `TemplateService`
+   describe a one-way ratchet that the code does not implement for one of the
+   two lock flags. A reader of the documentation cannot tell which rules are
+   actually enforced.
+
+The defect is "Low" because the design doc for the Template Engine itself
+does not explicitly call out a once-locked-stays-locked rule for
+`LockedInDerived`. The most likely fix is therefore to (a) correct the
+`LockEnforcer` XML doc to describe only `IsLocked`, or (b) add the equivalent
+guard for `LockedInDerived` and a regression test. The choice is a design
+question — pick one and align the code and docs.
+
+**Recommendation**
+
+Decide the policy. If `LockedInDerived` is intended to be once-set-stays-set
+like `IsLocked`, extend `ValidateLockChange` (or add a sibling
+`ValidateLockedInDerivedChange`) and reject the downgrade in
+`UpdateAttributeAsync` / `UpdateAlarmAsync` / `UpdateScriptAsync`. If it is
+intended to be mutable, update the `LockEnforcer` summary to scope the rule
+to `IsLocked` only. Either way, add a test pinning the chosen behaviour.
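+
+If the once-set-stays-set policy is chosen, the sibling guard could be as small
+as the sketch below. The method name follows the suggestion above, but the
+signature, the string-or-null return convention, and the error text are
+assumptions rather than the existing `LockEnforcer` API.
+
+```csharp
+public static class LockedInDerivedGuardSketch
+{
+    // Returns an error message for a rejected change, or null when the change is allowed.
+    public static string? ValidateLockedInDerivedChange(
+        bool originalLockedInDerived, bool proposedLockedInDerived, string memberName)
+    {
+        // Only the downgrade (true -> false) is rejected; setting the flag stays legal.
+        if (originalLockedInDerived && !proposedLockedInDerived)
+        {
+            return $"Member '{memberName}' is locked in derived templates and cannot be unlocked.";
+        }
+
+        return null;
+    }
+}
+```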
+
+**Resolution**
+
+_Unresolved._
+
@@ -0,0 +1,456 @@
+# Code Review — Transport
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.Transport` |
+| Design doc | `docs/requirements/Component-Transport.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-28 |
+| Reviewer | claude-agent |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 12 |
+
+## Summary
+
+The Transport module is structurally clean, follows the design doc's pipeline
+layout (Encryption → Serialization → Export / Import), and has solid lower-tier
+coverage (encryptor round-trips, manifest validator, dependency resolver,
+session store, diff engine). The big surface-area concerns cluster around two
+themes. First, the `Overwrite` resolution path is structurally incomplete: it
+updates only the parent entity's scalar fields (e.g. `Template.Description /
+FolderId`, `ExternalSystem.EndpointUrl / AuthType / AuthConfiguration`) and
+never replaces child collections (attributes, alarms, scripts, external-system
+methods), silently diverging from both the design doc's audit-row table and
+operator intent. Second, the 3-strike / per-IP unlock-rate-limit story declared
+in `TransportOptions` and the design doc isn't wired into the import service —
+the only counter is a local field on `TransportImport.razor.cs`, and
+`MaxUnlockAttemptsPerIpPerHour` is referenced nowhere. There are also some
+smaller integrity-and-resource issues (manifest fields outside `ContentHash`
+aren't bound to the encryption envelope, decrypted plaintext lives in the
+in-memory session for the full TTL on the failure path, and ZIP reads have no
+entry-count / per-entry decompression cap).
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | Yes | Overwrite paths miss child sync (Transport-001, Transport-002); composition Overwrite intentionally clears (good). |
+| 2 | Akka.NET conventions | Yes | No issues found — Transport is service-only, no actors / messages. |
+| 3 | Concurrency & thread safety | Yes | `IAuditCorrelationContext` mutation is documented as not thread-safe (Transport-009); singleton `BundleSessionStore` with `ConcurrentDictionary` is fine. |
+| 4 | Error handling & resilience | Yes | Rollback-failure path is well-considered, but failed sessions are never evicted (Transport-007). |
+| 5 | Security | Yes | Unlock lockout + per-IP cap not enforced server-side (Transport-003, Transport-004); manifest fields outside ContentHash are unauthenticated (Transport-005); zip-bomb / per-entry decompression cap missing (Transport-006); secrets travel in plaintext in unencrypted bundles by design, flagged only by a UI warning (acceptable per doc). |
+| 6 | Performance & resource management | Yes | `BundleSession.DecryptedContent` retained in memory for 30 min even on failure (Transport-007); `PreviewAsync` issues N+1 calls to `GetTemplateWithChildrenAsync` (Transport-008); `BundleSerializer.Pack` serializes content twice. |
+| 7 | Design-document adherence | Yes | Overwrite-doesn't-sync-children contradicts the design doc's audit row table (Transport-001); per-IP-per-hour lockout in §11 not implemented (Transport-004); design says "bundles are not retained server-side after ApplyAsync commits" — but failed bundles are retained until TTL (Transport-007). |
+| 8 | Code organization & conventions | Yes | No major issues found — clean separation, POCO DTOs in `Serialization/`, scoped vs singleton service lifetimes appropriate. |
+| 9 | Testing coverage | Yes | Critical gap: no Overwrite-with-modified-children test for Templates or ExternalSystems (Transport-010); no test exercising failed-bundle session retention or per-IP lockout. |
+| 10 | Documentation & comments | Yes | XML comments are extensive and accurate; design doc has some staleness (Transport-011, Transport-012). |
+
+## Findings
+
+### Transport-001 — Template Overwrite never syncs attributes / alarms / scripts
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:844-851` |
+
+**Description**
+
+The `ResolutionAction.Overwrite` branch in `ApplyTemplatesAsync` only writes
+`Description` and `FolderId` on the existing template and calls
+`UpdateTemplateAsync(ex, …)`. The bundle DTO's `Attributes`, `Alarms`, and
+`Scripts` collections are never copied onto the existing entity, so an Overwrite
+of a template whose child collections changed silently leaves the target's
+existing children in place. `ResolveAlarmScriptLinksAsync` then runs against the
+unmodified existing alarms/scripts and does nothing useful for the Overwrite
+case. This contradicts the design doc's Configuration Audit Trail table
+("Template overwritten → `TemplateUpdated` + per-field rows
+(`TemplateAttributeAdded`, `TemplateScriptUpdated`, …)") and the operator's
+mental model — an Overwrite that produces no diff is a footgun. The only
+integration test (`ConflictResolutionTests.Overwrite_replaces_existing_template_description`)
+asserts only on `Description`, so the regression is not caught.
+
+**Recommendation**
+
+For the Overwrite branch, replace the existing template's children to match the
+bundle DTO (delete-then-add or diff-and-merge), then re-run the alarm-script and
+composition rewire passes against the post-merge state. Emit the per-field audit
+rows the design doc enumerates. Add an integration test that overwrites a
+template whose Scripts / Attributes / Alarms differ.
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-002 — ExternalSystem Overwrite never syncs methods
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:1213-1221` |
+
+**Description**
+
+`ApplyExternalSystemsAsync` Overwrite path writes `EndpointUrl`, `AuthType`, and
+`AuthConfiguration` on the existing `ExternalSystemDefinition` and calls
+`UpdateExternalSystemAsync`. The DTO's `Methods` collection is never written —
+any added, removed, or modified method on the incoming bundle silently does
+not land. Same shape of bug as Transport-001 but on a different entity. The
+design doc's audit-row table says
+"External system overwritten → `ExternalSystemDefinitionUpdated` + per-method
+rows", confirming methods are expected to round-trip.
+
+**Recommendation**
+
+Sync `Methods` on Overwrite via add / update / delete by name (mirroring the
+diff classification in `ArtifactDiff.CompareExternalSystem`) and emit the
+per-method audit rows. Add a test that overwrites an external system whose
+methods differ.
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-003 — Unlock lockout is enforced only client-side; server session is never marked Locked
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:184-203`, `src/ScadaLink.CentralUI/Components/Pages/Design/TransportImport.razor.cs:267-309`, `src/ScadaLink.Commons/Types/Transport/BundleSession.cs:14-16` |
+
+**Description**
+
+`BundleSession` exposes `FailedUnlockAttempts` and a `Locked` computed property,
+and `PreviewAsync` / `ApplyAsync` correctly refuse to proceed when
+`session.Locked == true`. But for an encrypted bundle, `LoadAsync` throws
+`CryptographicException` before any session is opened, so no session ever holds
+a non-zero `FailedUnlockAttempts`. The 3-strike counter lives only in the
+Blazor page's local `_failedUnlockAttempts` field; a second tab / circuit / CLI
+caller bypassing the UI can retry the same uploaded bytes indefinitely
+because the importer accepts a passphrase against a stream and runs PBKDF2 each
+call (600 000 iterations / call). The Locked invariant on `BundleSession` is
+effectively unreachable, so `FailedUnlockAttempts` and `Locked` are dead code.
+
+**Recommendation**
+
+Move the lockout into `IBundleImporter`. Two viable shapes:
+(a) open a session on the first `LoadAsync` call (skip the decryption step until
+a separate `UnlockAsync` is called) and increment / lock there;
+(b) keep a per-content-hash counter in the session store, scoped by bundle SHA,
+so retries against the same bundle bytes are throttled regardless of the UI
+client. Either way, emit `BundleImportUnlockFailed` from the service, not from
+the Razor page. Test that a second concurrent caller cannot side-step the
+lockout.
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-004 — `MaxUnlockAttemptsPerIpPerHour` option is declared but never enforced
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.Transport/TransportOptions.cs:12`, `docs/requirements/Component-Transport.md` §11 |
+
+**Description**
+
+`TransportOptions.MaxUnlockAttemptsPerIpPerHour` defaults to 10 and is
+documented in the design doc (§11, "Failed-unlock rate limit: per-session
+3-strike lockout; per-IP-per-hour cap (default 10, configurable) to deter brute
+force against a stolen bundle"). A repo-wide grep finds zero readers of the
+field. There is no IP-keyed rate limiter, no `IHttpContextAccessor` in the
+importer, no middleware in Central UI guarding the import endpoints. The
+documented brute-force defence does not exist in code.
+
+**Recommendation**
+
+Either implement the per-IP cap (e.g. via `Microsoft.AspNetCore.RateLimiting`
+on the `TransportImport` page and the `ManagementActor` import command path,
+keyed on remote-IP for the UI and on authenticated principal for the CLI), or
+drop the option and the design-doc paragraph if the project is intentionally
+deferring this. Don't leave a dead option that promises a security
+control that isn't there.
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-005 — Manifest fields outside `ContentHash` are not bound to the encrypted payload
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.Transport/Encryption/BundleSecretEncryptor.cs:31-49`, `src/ScadaLink.Transport/Serialization/ManifestValidator.cs:29-53` |
+
+**Description**
+
+AES-GCM is called with no additional authenticated data (AAD). The `manifest`
+fields — `SourceEnvironment`, `ExportedBy`, `ScadaLinkVersion`, `Summary`,
+`Contents`, `CreatedAtUtc`, etc. — are plaintext and only the `ContentHash`
+field is checked against the content bytes. An attacker who obtains a bundle
+can edit any non-`ContentHash` manifest field (e.g. rewrite the
+`SourceEnvironment` displayed in the Step-4 typo-resistant confirmation gate,
+forge a more recent `CreatedAtUtc`, lie about `ExportedBy`) without breaking
+decryption. The Step-4 confirmation gate the design doc relies on
+("User types the source environment name to confirm — typo-resistant gate at
+the prod boundary") is therefore tamperable.
+
+**Recommendation**
+
+Pass the SHA-256 of the manifest's canonical bytes, computed with the
+`ContentHash` and `Encryption` fields excluded, as the `associatedData`
+argument to `AesGcm.Encrypt` / `AesGcm.Decrypt`. Any tampering with the
+manifest's other fields then yields an authentication-tag mismatch on decrypt.
+In the plaintext path, the same protection can be approximated by extending the
+hash domain (compute a manifest-and-content hash, or sign the manifest,
+depending on how far you want to go).
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-006 — Bundle ZIP read has no per-entry size cap or entry-count cap (zip-bomb / decompression-bomb)
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.Transport/Serialization/BundleSerializer.cs:121-156`, `src/ScadaLink.Transport/Import/BundleImporter.cs:132-143` |
+
+**Description**
+
+`LoadAsync` caps the raw bundle bytes at `MaxBundleSizeMb` (default 100 MB)
+before opening the ZIP. But `ReadContentBytes` calls `entry.Open()` and
+`CopyTo(MemoryStream)` with no per-entry size limit and no defence against
+compression ratios — a 100 MB DEFLATE-compressed bundle can decompress to
+gigabytes. There is also no cap on the number of entries iterated; only two
+known entries are read (`manifest.json` + `content.json`/`content.enc`), but
+`ReadContentBytes` does not validate that no extra entries exist or that the
+expected entry's `Length` is bounded. A malicious importer who passes RequireAdmin
+(or a stolen bundle delivered to an admin) can exhaust memory on the central node.
+
+**Recommendation**
+
+Cap each entry's decompressed length explicitly (compare `ZipArchiveEntry.Length`
+against a configurable max, or copy into a length-limited stream). Reject
+bundles whose entry list contains anything other than the known manifest +
+content entries. Consider also rejecting any compression ratio over ~50x as a
+defence-in-depth measure.
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-007 — Failed import sessions retain decrypted plaintext for the full 30-minute TTL
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:614-696`, `src/ScadaLink.Transport/Import/BundleSessionStore.cs:67-93` |
+
+**Description**
+
+`ApplyAsync` calls `_sessionStore.Remove(sessionId)` only on the success path
+(line 614). The catch block re-throws without removing the session, so a failed
+apply leaves the `BundleSession` (with `DecryptedContent` up to ~100 MB) in the
+in-memory dictionary until the TTL elapses 30 min later (or `Get` lazily evicts
+on a separate lookup). Decrypted secrets — DB connection strings, SMTP
+credentials, external-system auth configs — sit in process memory for that
+window, accessible to anyone holding the session id. Multiplied across repeated
+import attempts on the same circuit, this can produce significant memory
+pressure (10 failed 100 MB imports = 1 GB) and contradicts the design doc's
+"Bundles are not retained server-side after ApplyAsync commits" statement.
+
+**Recommendation**
+
+In the `ApplyAsync` catch block, call `_sessionStore.Remove(sessionId)` (or at
+least zero out `session.DecryptedContent`) before re-throwing. Also clear
+`DecryptedContent` from the session on the success path before removing — the
+buffer is potentially still rooted by a caller-held reference. Consider
+shortening the TTL when a session is in a known-stuck state. The session
+store's `EvictExpired` exists but is only called on demand — wire it to a
+periodic timer so abandoned sessions clear even without traffic.
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-008 — `PreviewAsync` issues an N+1 `GetTemplateWithChildrenAsync` per matching template name
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:252-272` |
+
+**Description**
+
+Building the per-template diff loops over every existing stub returned by
+`GetAllTemplatesAsync` and, for any name that matches an incoming DTO, calls
+`GetTemplateWithChildrenAsync(stub.Id)` to re-fetch with children. On a target
+DB with many templates that overlap the bundle this is one round-trip per
+matching template (often the whole bundle), each query carrying the full
+attributes/alarms/scripts/compositions joins. The diff itself is read-only and
+fits a single eager-loaded `GetAllTemplatesWithChildrenAsync` query.
+
+**Recommendation**
+
+Add a `GetAllTemplatesWithChildrenAsync` (or extend `GetAllTemplatesAsync` with
+an `includeChildren` flag) on `ITemplateEngineRepository` and use it here. The
+same N+1 appears in `ResolveCompositionEdgesAsync` (line 1093) for the
+just-imported templates, but that loop is bounded by the bundle's size and is
+less of a concern.
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-009 — `IAuditCorrelationContext.BundleImportId` is mutated on the same scoped instance the AuditService reads
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.Transport/Import/BundleImporter.cs:528, 668, 703`, `src/ScadaLink.ConfigurationDatabase/Services/AuditCorrelationContext.cs` |
+
+**Description**
+
+The XML doc on `IAuditCorrelationContext` correctly notes that mutating
+`BundleImportId` is not thread-safe and that concurrent imports inside a single
+scope would cross-contaminate audit rows. The contract is "Blazor circuit / API
+request — sequential await chain — single writer". The risk is that this
+invariant is documentation-only — there is no enforcement (e.g. a mutex on set,
+or an `AsyncLocal<Guid?>` impl) and no test exercising a concurrent-callers
+scenario. A future change that schedules audit writes on a different
+synchronization context inside the apply transaction (e.g. `Task.WhenAll` over
+the Apply helpers) would silently start leaking the id across rows.
+
+**Recommendation**
+
+Either (a) back `BundleImportId` with an `AsyncLocal<Guid?>` so each logical
+call chain inherits the value and concurrent chains can't trample it, or
+(b) wrap the apply in a try/finally that snapshots the previous value and restores it. (b) is closer
+to the current design. Either way, add an integration test that fires two
+overlapping `ApplyAsync` calls and asserts each bundle's rows carry only that
+bundle's id.
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-010 — Critical Overwrite + cross-cutting paths uncovered by tests
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.Transport.IntegrationTests/ConflictResolutionTests.cs`, `tests/ScadaLink.Transport.IntegrationTests/Import/BundleImporterApplyTests.cs` |
+
+**Description**
+
+The existing tests cover the happy path well (round-trip, semantic-validator
+gating, rollback even when `RollbackAsync` itself throws, composition imports),
+but the per-entity Overwrite resolutions are only spot-tested:
+`ConflictResolutionTests.Overwrite_replaces_existing_template_description`
+asserts on `Description` only. Specifically missing:
+- Overwrite of a `Template` whose `Attributes` / `Alarms` / `Scripts` /
+  `Compositions` diverged from the existing row (would catch Transport-001).
+- Overwrite of an `ExternalSystem` whose `Methods` diverged (would catch
+  Transport-002).
+- Overwrite of a `NotificationList` whose `Recipients` collection diverged
+  (NotificationList Overwrite does sync recipients via clear+add — needs an
+  asserting test).
+- Concurrent `ApplyAsync` calls on a shared scope to exercise the
+  `IAuditCorrelationContext` mutation contract (would catch Transport-009).
+- Per-IP unlock-throttle behaviour (would catch Transport-004).
+- A session that survives a failed Apply (would catch Transport-007).
+
+**Recommendation**
+
+Add the missing integration tests above. Most can be modelled after
+`ConflictResolutionTests`' export-then-mutate-target-then-apply pattern.
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-011 — Design doc's Step-1 manifest preview promises decryption-free preview, but `LoadAsync` reads and validates content before passphrase
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `docs/requirements/Component-Transport.md` Import Flow Step 1, `src/ScadaLink.Transport/Import/BundleImporter.cs:124-203` |
+
+**Description**
+
+The design doc says: "The manifest is plaintext so the import wizard can
+preview bundle contents and source provenance before the user supplies a
+passphrase." `LoadAsync` honours that — but does so by ALWAYS reading and
+hashing the content blob (encrypted or not) on the first call, regardless of
+whether the caller has a passphrase. For an encrypted bundle with no
+passphrase, the code path that surfaces the encrypted-bundle prompt is the
+`ArgumentException` thrown at line 195, which has already performed the full
+manifest parse + content-hash check + read of the encrypted blob. That's fine,
+but it means there is no cheap "manifest peek" — the UI's "let the user see
+the manifest before deciding whether to type a passphrase" is at least
+O(bundle-size) and consumes the full upload buffer each call. The design doc
+gives a misleading impression of cost.
+
+**Recommendation**
+
+Either (a) add an explicit `ReadManifestAsync(Stream)` interface method that
+skips the content read for the pure preview case, or (b) update the design
+doc to clarify the full envelope is read on every `LoadAsync` and the cheap
+"peek" is conceptual rather than runtime.
+
+**Resolution**
+
+_Unresolved._
+
+### Transport-012 — "Bundle Import" filter promised in design doc not surfaced in Configuration Audit Log Viewer UI
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `docs/requirements/Component-Transport.md` §Audit Trail, `src/ScadaLink.ConfigurationDatabase/Repositories/CentralUiRepository.cs:148` |
+
+**Description**
+
+The design doc says: "The existing Configuration Audit Log Viewer gains a
+**Bundle Import** filter that surfaces all rows for a given import. The
+`BundleImported` summary row links to the filtered view." A repository filter
+on `BundleImportId` is wired into `CentralUiRepository` (line 148), but no UI
+filter control surfaces it and the `BundleImported` summary row does not carry
+a hyperlink in the Configuration Audit Log Viewer. This is a documentation-vs-code
+gap, not a bug in Transport itself, but the spec lives in the Transport doc so
+it's reasonable to flag.
+
+**Recommendation**
+
+Either implement the filter dropdown + summary-row link in the Configuration
+Audit Log Viewer, or note the deferral in the design doc.
+
+**Resolution**
+
+_Unresolved._
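
The per-entry cap and entry allow-list recommended under Transport-006 can be illustrated with a minimal sketch. It is written in Python (the language of `regen-readme.py`) rather than the module's C#, and it is not the project's code. The limits, the `BundleRejected` exception, and the `read_entry_capped` / `read_bundle` helpers are hypothetical names chosen for illustration; only the entry names come from the finding.

```python
import io
import zipfile

# Hypothetical limits; a real deployment would read these from configuration.
MAX_ENTRY_BYTES = 200 * 1024 * 1024
MAX_RATIO = 50
# Entry names as listed in Transport-006.
ALLOWED_ENTRIES = frozenset({"manifest.json", "content.json", "content.enc"})


class BundleRejected(Exception):
    """Raised when a bundle violates a size or shape limit (hypothetical type)."""


def read_entry_capped(archive, name, max_bytes=MAX_ENTRY_BYTES, max_ratio=MAX_RATIO):
    """Return one entry's bytes, refusing oversized or over-compressed entries."""
    info = archive.getinfo(name)
    # Cheap early rejection based on the sizes declared in the ZIP headers.
    if info.file_size > max_bytes:
        raise BundleRejected(f"{name}: declared size exceeds {max_bytes} bytes")
    if info.compress_size > 0 and info.file_size / info.compress_size > max_ratio:
        raise BundleRejected(f"{name}: compression ratio exceeds {max_ratio}x")
    # Count the bytes actually produced while streaming, so memory use is
    # bounded even if the early checks are loosened or bypassed.
    chunks, total = [], 0
    with archive.open(info) as stream:
        while True:
            chunk = stream.read(64 * 1024)
            if not chunk:
                break
            total += len(chunk)
            if total > max_bytes:
                raise BundleRejected(f"{name}: decompressed size exceeds {max_bytes} bytes")
            chunks.append(chunk)
    return b"".join(chunks)


def read_bundle(raw, max_bytes=MAX_ENTRY_BYTES, max_ratio=MAX_RATIO):
    """Read every entry of an in-memory bundle after validating the entry list."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        names = archive.namelist()
        # Reject duplicates and anything outside the known manifest/content set.
        if len(names) != len(set(names)) or not set(names) <= ALLOWED_ENTRIES:
            raise BundleRejected("unexpected or duplicate entries in bundle")
        return {n: read_entry_capped(archive, n, max_bytes, max_ratio) for n in names}
```

The declared-size and ratio checks turn away obviously hostile entries before any decompression happens, and the streaming count keeps the buffer bounded by the bytes actually read. In .NET the header value corresponding to `file_size` is `ZipArchiveEntry.Length`, which the recommendation already names.
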
@@ -33,9 +33,21 @@ def discover_modules():
    return modules


+def parse_header(module, text):
+    """Extract (last_reviewed, commit) from the module's header table.
+    Falls back to the historical baseline when the field is absent or templated."""
+    last = re.search(r"\|\s*Last reviewed\s*\|\s*([0-9]{4}-[0-9]{2}-[0-9]{2})", text)
+    commit = re.search(r"\|\s*Commit reviewed\s*\|\s*`([^`]+)`", text)
+    return (
+        last.group(1) if last else "2026-05-16",
+        commit.group(1) if commit else "9c60592",
+    )
+
+
 def parse_findings(module):
-    """Parse one module's findings.md into (module, id, severity, title, status) tuples."""
+    """Parse one module's findings.md into ((last_reviewed, commit), [(module, id, severity, title, status), ...])."""
    text = open(os.path.join(BASE, module, "findings.md")).read()
+    header = parse_header(module, text)
    findings = []
    for block in re.split(r"^### ", text, flags=re.M)[1:]:
        head = block.splitlines()[0].strip()
@@ -49,7 +61,7 @@ def parse_findings(module):
        if not sev or not status:
            raise SystemExit(f"{module}/findings.md: {fid} is missing a Severity or Status field")
        findings.append((module, fid, sev.group(1), title, status.group(1).strip()))
-    return findings
+    return header, findings


 def finding_number(finding):
@@ -58,7 +70,7 @@ def finding_number(finding):

 def build_readme(modules, per_module):
    pending = sorted(
-        (f for fs in per_module.values() for f in fs if f[4] in PENDING_STATUSES),
+        (f for fs in per_module.values() for f in fs[1] if f[4] in PENDING_STATUSES),
        key=lambda f: (SEVERITY_ORDER.get(f[2], 9), f[0], finding_number(f)),
    )

@@ -66,7 +78,7 @@ def build_readme(modules, per_module):
        return sum(1 for f in pending if f[2] == sev)

    def open_count(module, sev):
-        return sum(1 for f in per_module[module]
+        return sum(1 for f in per_module[module][1]
                   if f[2] == sev and f[4] in PENDING_STATUSES)

    lines = []
@@ -123,9 +135,10 @@ def build_readme(modules, per_module):
    add("|--------|---------------|--------|----------------|------|-------|")
    for module in modules:
        counts = [open_count(module, s) for s in ("Critical", "High", "Medium", "Low")]
-        add(f"| [{module}]({module}/findings.md) | 2026-05-16 | `9c60592` "
+        last_reviewed, commit = per_module[module][0]
+        add(f"| [{module}]({module}/findings.md) | {last_reviewed} | `{commit}` "
            f"| {counts[0]}/{counts[1]}/{counts[2]}/{counts[3]} "
-            f"| {sum(counts)} | {len(per_module[module])} |")
+            f"| {sum(counts)} | {len(per_module[module][1])} |")
    add("")
    add("## Pending Findings")
    add("")
@@ -159,8 +172,8 @@ def main():

    readme_path = os.path.join(BASE, "README.md")
    pending = sum(1 for fs in per_module.values()
-                  for f in fs if f[4] in PENDING_STATUSES)
-    total = sum(len(fs) for fs in per_module.values())
+                  for f in fs[1] if f[4] in PENDING_STATUSES)
+    total = sum(len(fs[1]) for fs in per_module.values())

    if check:
        current = open(readme_path).read() if os.path.exists(readme_path) else ""