docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Joseph Doherty
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
+227 -2
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.AuditLog` |
| Design doc | `docs/requirements/Component-AuditLog.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 → 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` → `4307c381` |
| Open findings | 0 |
## Summary
@@ -59,6 +59,231 @@ chain doesn't reject a central composition root that mistakenly calls the site b
## Findings
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Since `1eb6e97` the module was renamed (`ScadaLink → ZB.MOM.WW.ScadaBridge`) and substantially
re-architected: the site SQLite store is now a two-table canonical design (`audit_event`
append-only + `audit_forward_state` sidecar with a precomputed `IsCachedKind`); the payload
filter became the `IAuditRedactor` seam (`ScadaBridgeAuditRedactor` + `SafeDefaultAuditRedactor`
+ stateless `AuditRedactionPrimitives`); and milestone work landed (M5.3 T7 inbound response
capture, M5.5 T3 per-channel retention, K8 KPI source, central health snapshot). The module
remains one of the best-engineered in the tree: the AuditLog-001..011 fixes from the last pass
are all present and intact (dual read connection, async scopes, per-EventId reconciliation
retry escape valve, thread-pool-hop `Dispose`, lifecycle CTS), the best-effort "never abort the
action" contract is honoured throughout, and redaction over-redacts on fault. This pass found
**no Critical/High** issues. It did find two design-adherence drifts where documented
per-target/purge config knobs are silently inert, one error-handling asymmetry that lets a
transient DI fault restart the ingest singleton, and two Low items. Five new Open findings (AuditLog-012..016).
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `IsErrorRow` case-sensitive parse is correct (Status is `enum.ToString()` PascalCase). `IsCachedKind` parse falls back to `InboundRequest` (non-cached) on bad JSON — safe. `OnIngestAsync` scope-resolution throw is uncaught (AuditLog-014). |
| 2 | Akka.NET conventions | Yes | Sender/EventStream captured before first await across all three central actors; Tell for hot path, Ask only at the gRPC/ClusterClient boundary; supervisor-strategy comments now correctly credit per-row catch (AuditLog-002 fix intact). No findings. |
| 3 | Concurrency & thread safety | Yes | WAL dual-connection + `_readLock`/`_writeLock` split intact; `Interlocked` counters; `_drainGate` serialises ring drain. `WriteAsync` fast path ignores `ct` when `TryWrite` succeeds (AuditLog-015). |
| 4 | Error handling & resilience | Yes | Best-effort contract honoured everywhere; gRPC pull collapses tolerable faults to empty; reconciliation retry escape valve (AuditLog-004) intact. `OnIngestAsync` lacks the scope-creation try/catch its sibling `OnCachedTelemetryAsync` has (AuditLog-014). |
| 5 | Security | Yes | Header/body/SQL-param redaction + over-redact-on-fault safety net solid; `SafeDefaultAuditRedactor` fallback wired at all three writer sites (AuditLog-008 fix intact); SQL built with bound parameters only. No findings. |
| 6 | Performance & resource management | Yes | Hot path batched + back-pressured; backlog scan off the write lock; partition switch metadata-only; per-channel DELETE batched + clamped; gRPC channel cache race-safe. No findings. |
| 7 | Design-document adherence | Yes | Per-target `CapBytes` override is documented + has a config property but is read NOWHERE (AuditLog-012); purge config section/key drift from the doc's `AuditLogPurge`/`ChannelPurgeBatchSize` (AuditLog-013). Combined-telemetry transport (AuditLog-001) now wired. `SkipBodyCapture` honoured in InboundAPI middleware (out of module). |
| 8 | Code organization & conventions | Yes | Composition root well-segmented with "safe from any root" invariant; `INodeIdentityProvider` standardised on `GetRequiredService` (AuditLog-007 fix intact); idempotency sentinel on the health bridge (AuditLog-011 fix intact). No new findings. |
| 9 | Testing coverage | Yes | Broad (~13k lines). Gaps: no test asserts per-target `CapBytes` is applied (because it isn't — AuditLog-012), and no test binds `AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc (would catch AuditLog-013). |
| 10 | Documentation & comments | Yes | Mostly excellent and accurate. `AuditLogIngestActor` class XML has a duplicated "the central-side" phrase and several stale "Bundle X" milestone tags that mislead (AuditLog-016). |
### AuditLog-012 — Per-target `CapBytes` override is documented and bindable but never read by the redactor
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Redaction/ScadaBridgeAuditRedactor.cs:185-196`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/PerTargetRedactionOverride.cs:13` |
**Description**
The design (Component-AuditLog.md §"Payload Capture Policy" / §"Configuration") documents a
per-target payload-cap override and gives it as a worked example:
`"PerTargetOverrides": { "Weather/GetForecast": { "CapBytes": 4096 } }`. The
`PerTargetRedactionOverride.CapBytes` property exists and its XML doc promises "Optional payload
cap override (bytes); null inherits the global cap." But `ScadaBridgeAuditRedactor.SelectCap`
chooses among `InboundMaxBytes` / `ErrorCapBytes` / `DefaultCapBytes` only — it never consults
`opts.PerTargetOverrides[target].CapBytes`. A repo-wide grep confirms `CapBytes` (the per-target
property) is read in zero production code paths. An operator who sets a per-target cap to bound a
chatty target's payloads gets the global `DefaultCapBytes`/`ErrorCapBytes` instead, with no error
and no log — the knob is silently inert. This is a code-vs-spec drift: either the feature was
dropped in the redactor re-architecture (C2/C3) without updating the doc + override XML, or it
was never implemented. The `AdditionalBodyRedactors` and `RedactSqlParamsMatching` per-target
fields ARE honoured, which makes the silent `CapBytes` omission especially easy to miss.
**Recommendation**
Either (a) implement it: in `SelectCap`, when `category != ApiInbound` and the resolved
`PerTargetOverrides` entry for `rawEvent.Target` has a non-null `CapBytes`, use
`Math.Min`/`Max(over.CapBytes.Value, …)` per the intended semantics (the doc says it *overrides*
the global cap — clarify whether error rows still get the larger of the two), and add a
`ScadaBridgeAuditRedactorTests` case asserting a per-target cap shortens a payload the global cap
would not; or (b) if the feature is intentionally dropped, delete `PerTargetRedactionOverride.CapBytes`,
remove the `CapBytes` example from the Component-AuditLog.md Configuration block, and note the
removal in the design doc so the override surface stops advertising a no-op.
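As an illustration of option (a), the cap selection could be sketched roughly as follows. This is a minimal sketch, not the module's actual `SelectCap`: the `*Sketch` types, the default values, and the "override replaces the global cap for every non-inbound row" semantics are all assumptions, and the inbound (`InboundMaxBytes`) branch is left out.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the module's option types. Only the names CapBytes,
// PerTargetOverrides, DefaultCapBytes and ErrorCapBytes come from the finding.
public sealed record PerTargetOverrideSketch(int? CapBytes);

public sealed class RedactionOptionsSketch
{
    public int DefaultCapBytes { get; init; } = 8192;   // assumed default
    public int ErrorCapBytes { get; init; } = 65536;    // assumed default
    public Dictionary<string, PerTargetOverrideSketch> PerTargetOverrides { get; init; } = new();
}

public static class CapSelectionSketch
{
    // Assumed semantics: a non-null per-target CapBytes replaces the global cap
    // for both normal and error rows. Whether error rows should instead keep the
    // larger of the two caps is the open question the finding raises.
    public static int SelectCap(RedactionOptionsSketch opts, string? target, bool isErrorRow)
    {
        int globalCap = isErrorRow ? opts.ErrorCapBytes : opts.DefaultCapBytes;

        if (target is not null
            && opts.PerTargetOverrides.TryGetValue(target, out var over)
            && over.CapBytes is int perTargetCap)
        {
            return perTargetCap;
        }

        // No override entry, or the entry leaves CapBytes null: inherit the global cap.
        return globalCap;
    }
}
```

A test along the lines the recommendation asks for would configure a per-target cap smaller than the global cap and assert that the smaller value is the one selected for that target only.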
**Resolution**
Deferred 2026-06-20: whether to implement per-target `CapBytes` in `SelectCap` or delete the unused property + doc example is a design-owner decision (was it intentionally dropped in the redactor re-architecture, or lost?). Recorded; no change this pass.
### AuditLog-013 — Purge config section + key drift: documented `AuditLogPurge`/`ChannelPurgeBatchSize` does not bind
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/ServiceCollectionExtensions.cs:56,370-371`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeOptions.cs:40,47` |
**Description**
Component-AuditLog.md §"Retention & Purge" / §"Configuration" documents the purge tuning as a
top-level section named `AuditLogPurge` with keys `IntervalHours` and `ChannelPurgeBatchSize`:
```jsonc
"AuditLogPurge": { "IntervalHours": 24, "ChannelPurgeBatchSize": 5000 }
```
The code does not match on either axis. (1) `ServiceCollectionExtensions.PurgeSectionName` is
`"AuditLog:Purge"` — a nested subsection, not the documented top-level `AuditLogPurge`. (2) The
batch-size property is `AuditLogPurgeOptions.ChannelPurgeBatchSizeConfigured`, so its bind key is
`ChannelPurgeBatchSizeConfigured`, not the documented `ChannelPurgeBatchSize` (which is a
computed read-only property the binder ignores). There is no `[ConfigurationKeyName]` attribute
to reconcile the names. The net effect: an operator who follows the design doc and writes
`"AuditLogPurge": { "ChannelPurgeBatchSize": 1000 }` has BOTH values silently ignored — the
wrong section path means `IntervalHours` falls back to the 24 h default, and the property-name
mismatch means the channel batch size falls back to 5000. The actor still functions on defaults,
so nothing crashes, but the documented operator-facing tuning surface is inert. No test binds
`AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc, so the drift is unguarded.
(The §407 doc text says `AuditLogPurge:ChannelPurgeBatchSize`, agreeing with the JSON block and
disagreeing with the code on both section and key.)
**Recommendation**
Pick one canonical shape and make code, doc, and a binding test agree. Recommended: keep the
nested `AuditLog:Purge` section (consistent with `AuditLog:SiteWriter` / `AuditLog:SiteTelemetry`
/ `AuditLog:Reconciliation`), but then fix the doc's JSON block + §407 to read
`"AuditLog": { "Purge": { "IntervalHours": …, "ChannelPurgeBatchSize": … } }`, AND add
`[ConfigurationKeyName("ChannelPurgeBatchSize")]` to `ChannelPurgeBatchSizeConfigured` (or rename
it and move the clamp into the `Interval`-style computed getter pattern) so the documented key
binds. Add an `AuditLogOptionsBindingTests`-style test that binds the purge section from an
in-memory config and asserts both `IntervalHours` and the channel batch size take effect.
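The recommended shape might look roughly like the sketch below. It assumes the `Microsoft.Extensions.Configuration` and `Microsoft.Extensions.Configuration.Binder` packages; `PurgeOptionsSketch`, `PurgeBindingSketch`, the clamp bounds, and the sample values are hypothetical stand-ins rather than the module's real `AuditLogPurgeOptions`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

// Hypothetical stand-in for the purge options class described in the finding.
public sealed class PurgeOptionsSketch
{
    public int IntervalHours { get; set; } = 24;

    // Binds from the documented key while keeping the existing property name.
    [ConfigurationKeyName("ChannelPurgeBatchSize")]
    public int ChannelPurgeBatchSizeConfigured { get; set; } = 5000;

    // Computed, read-only view; the binder cannot set it. Clamp bounds are assumptions.
    public int ChannelPurgeBatchSize => Math.Clamp(ChannelPurgeBatchSizeConfigured, 1, 50_000);
}

public static class PurgeBindingSketch
{
    // Builds an in-memory configuration shaped like the nested "AuditLog:Purge"
    // section and binds it, the way a binding test could.
    public static PurgeOptionsSketch BindFromNestedSection()
    {
        IConfiguration config = new ConfigurationBuilder()
            .AddInMemoryCollection(new Dictionary<string, string?>
            {
                ["AuditLog:Purge:IntervalHours"] = "12",
                ["AuditLog:Purge:ChannelPurgeBatchSize"] = "1000",
            })
            .Build();

        var options = new PurgeOptionsSketch();
        config.GetSection("AuditLog:Purge").Bind(options);
        return options;
    }
}
```

A binding test can then assert that both `IntervalHours` and the configured batch size differ from their defaults after binding, which is the guard the finding says is missing.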
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added `[ConfigurationKeyName("ChannelPurgeBatchSize")]` so the documented key binds, reconciled the design doc to the real nested `AuditLog:Purge` section, and added a binding test proving the purge options bind from the documented keys.
### AuditLog-014 — `AuditLogIngestActor.OnIngestAsync` does not guard scope/repository resolution, so a transient DI fault restarts the singleton and drops the reply
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:135-148` |
**Description**
`OnIngestAsync` opens `await using var scope = _serviceProvider!.CreateAsyncScope();` and
`scope.ServiceProvider.GetRequiredService<IAuditLogRepository>()` with NO surrounding try/catch.
The per-row try/catch lives inside `IngestWithRepositoryAsync`, so it only protects the insert
loop — a throw from scope creation or from `GetRequiredService` (a `DbContext` factory throw on
transient SQL-connection exhaustion, a pooled-context init fault, a DI resolution race during
host churn) propagates straight out of the message handler. When a `ReceiveActor`'s async handler
faults, Akka applies the parent's supervision and restarts the singleton; the captured
`replyTo.Tell(...)` at line 147 never runs, so the site's `Ask<IngestAuditEventsReply>` times out
and (correctly) retries — but the singleton has been bounced over a transient fault. The class
XML even claims "that per-row catch is what keeps this actor alive across handler throws" — which
is true for insert throws but false for the scope-resolution throw this handler leaves uncaught.
The sibling `OnCachedTelemetryAsync` (line 216) wraps its entire scope-creation-plus-loop body in
a try/catch and logs + replies-with-partial precisely to avoid this; `OnIngestAsync` is the
asymmetric one. `AuditLogPurgeActor.OnTickAsync` and `SiteAuditReconciliationActor.OnTickAsync`
both guard their `GetRequiredService` with try/catch too — `OnIngestAsync` is the outlier.
**Recommendation**
Wrap the scope-creation + repository-resolution in `OnIngestAsync` in a try/catch that mirrors
`OnCachedTelemetryAsync`: on failure, log the resolution fault and `replyTo.Tell(new
IngestAuditEventsReply(accepted))` (accepted will be whatever was processed before the throw,
typically empty) so the site keeps its rows `Pending` and retries on the next drain without
bouncing the singleton. Optionally bump the `ICentralAuditWriteFailureCounter` on that path too,
so a sustained DI/connection fault is visible on the dashboard.
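The guard pattern described above can be sketched without Akka or DI types. In this minimal sketch, `resolveAndIngest` is a hypothetical delegate standing in for "create the scope, resolve the repository, run the per-row insert loop", and `reply` stands in for `replyTo.Tell(...)`; none of these names are the actor's real members.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class GuardedIngestSketch
{
    // resolveAndIngest appends each accepted id to the list it is given, so that
    // anything accepted before a fault is still reported back to the caller.
    public static async Task HandleAsync(
        Func<List<Guid>, Task> resolveAndIngest,
        Action<IReadOnlyList<Guid>> reply,
        Action<Exception> logFault)
    {
        var accepted = new List<Guid>();
        try
        {
            await resolveAndIngest(accepted);
        }
        catch (Exception ex)
        {
            // Swallow the fault so the handler completes normally instead of
            // escalating to supervision; the caller retries unaccepted rows.
            logFault(ex);
        }

        // Reply on both the success and the fault path.
        reply(accepted);
    }
}
```

The point of the shape is that the reply is sent on every path, so a resolution fault costs one retry on the site side rather than a restart of the singleton.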
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `OnIngestAsync` now wraps the scope-creation + repo resolution + ingest in a try/catch that logs, best-effort bumps the write-failure counter, and still calls `replyTo.Tell(IngestAuditEventsReply(accepted))`, so a transient DI/DbContext fault no longer bounces the central ingest singleton or drops the reply (mirrors `OnCachedTelemetryAsync`).
### AuditLog-015 — `SqliteAuditWriter.WriteAsync` ignores the cancellation token on the fast (`TryWrite`) path
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:254-275` |
**Description**
`WriteAsync(AuditEvent, CancellationToken ct)` accepts a `ct` but, on the common fast path where
`_writeQueue.Writer.TryWrite(pending)` succeeds, it returns `pending.Completion.Task` without ever
observing `ct`. The token is only honoured on the slow path (`WriteSlowPathAsync`, which passes it
to `WriteAsync`). So a caller that passes an already-cancelled (or soon-cancelled) token still
enqueues the row and then awaits an uncancellable completion. In practice the hot-path callers
(`FallbackAuditWriter`, `CachedCallTelemetryForwarder`) thread a real `ct` through, and the audit
contract is best-effort durable-in-microseconds, so the impact is minor — the row simply gets
written instead of cancelled. But the signature advertises cancellation it does not deliver on the
dominant path, which is a latent surprise for any future caller that relies on it (e.g. to bound a
shutdown drain). The completion `Task` returned is not linked to `ct`, so an awaiter cannot abandon
the wait either.
**Recommendation**
Either (a) honour the token on the fast path — `ct.ThrowIfCancellationRequested()` before
`TryWrite`, and return a `ct`-linked wait (e.g. `pending.Completion.Task.WaitAsync(ct)`) so an
awaiter can abandon a stuck completion; or (b) if cancellation is intentionally a no-op for this
best-effort hot path, drop the `ct` parameter (or document on the method that it is observed only
on the back-pressured slow path) so the surface stops advertising behaviour it does not provide.
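Option (a) could look roughly like the sketch below, which assumes .NET 6 or later for `Task.WaitAsync(CancellationToken)`. `WriterSketch`, `PendingWriteSketch`, and the queue capacity are hypothetical; only the `TryWrite` fast path and the slow-path fallback mirror what the finding describes.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for the queued write item and its completion signal.
public sealed class PendingWriteSketch
{
    public TaskCompletionSource Completion { get; } =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
}

public sealed class WriterSketch
{
    private readonly Channel<PendingWriteSketch> _writeQueue =
        Channel.CreateBounded<PendingWriteSketch>(1024); // assumed capacity

    public Task WriteAsync(CancellationToken ct)
    {
        // Throws synchronously if the token is already cancelled, before enqueueing.
        ct.ThrowIfCancellationRequested();

        var pending = new PendingWriteSketch();
        if (_writeQueue.Writer.TryWrite(pending))
        {
            // Fast path: the returned task is linked to ct, so an awaiter can
            // abandon the wait. The row itself stays queued and may still be written.
            return pending.Completion.Task.WaitAsync(ct);
        }

        return WriteSlowPathAsync(pending, ct);
    }

    private async Task WriteSlowPathAsync(PendingWriteSketch pending, CancellationToken ct)
    {
        // Back-pressured path: waits for queue space, observing ct.
        await _writeQueue.Writer.WriteAsync(pending, ct);
        await pending.Completion.Task.WaitAsync(ct);
    }
}
```

Note the trade-off this makes: cancelling after a successful `TryWrite` only releases the awaiter, it does not retract the queued row, which may be the right behaviour for a best-effort audit path but should be stated on the method.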
**Resolution**
Deferred 2026-06-20: whether cancellation should be honoured on the best-effort `TryWrite` hot path (vs. intentionally a no-op) is a design decision. Recorded; no change this pass.
### AuditLog-016 — `AuditLogIngestActor` class XML has a duplicated phrase and stale "Bundle"/`IngestedAtUtc`-column wording
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:13-46` |
**Description**
The `AuditLogIngestActor` class summary reads "Each row is stamped with the central-side **the
central-side IngestedAtUtc**" — the phrase "the central-side" is duplicated, a copy/paste slip
that survived the rename. The same summary describes `IngestedAtUtc` as if it were a promoted
column, but post-C3 it is a `DetailsJson` field stamped via `AuditRowProjection.WithIngestedAtUtc`
(as the handler bodies and their own inline comments correctly note) — the class-level prose is
now stale relative to the canonical-record shim. The remarks also still narrate the design in
"Bundle A / Bundle D / Bundle E" milestone terms (and the two-constructor rationale references
"Bundle D's tests" / "Bundle E's host wiring"), which no longer map to any current artifact and
read as archaeology to a new maintainer. None of this is wrong behaviourally, but the duplicated
phrase plus the column-vs-DetailsJson mismatch is exactly the kind of doc rot that misleads.
**Recommendation**
Fix the duplicated "the central-side the central-side" to a single phrase; reword the
`IngestedAtUtc` sentence to say it is stamped into `DetailsJson` (matching the handler comments);
and either drop the "Bundle X" tags in favour of the feature/milestone names used elsewhere in
the module, or add a one-line legend. A quick sweep of the other central actors' class XML for
the same "Bundle" archaeology would be worthwhile while in there.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): class XML doc deduped ('the central-side' duplicate removed), `IngestedAtUtc` reworded to 'stamped into DetailsJson (no promoted column)', and the stale Bundle milestone tags replaced with role-neutral wording.
### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production
| | |
+98 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure` |
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 → 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` → `4307c381` |
| Open findings | 1 → 0 |
## Summary
@@ -82,6 +82,32 @@ either did not surface or that have aged into the file:
but the method now offers nothing — keeping it is API-surface noise that an
IDE will still suggest via auto-complete.
#### Re-review 2026-06-20 (commit `4307c381`) — full review
The module remains small and clean — `ClusterOptions`, `ClusterOptionsValidator`,
and `ServiceCollectionExtensions`, all well-documented and well-tested. The
split-brain-resolver, seed-node, and failure-detector configuration contract
matches the design doc, and all fourteen prior findings (CI-001..014) still hold
as resolved: `DownIfAlone` is now consumed by the Host's HOCON builder, the dead
`AddClusterInfrastructureActors` surface is gone, and the validator enforces every
catastrophic value. The only new observation is one Low code-organization item — the
validator hand-rolls every rule via raw `RequireThat` rather than the base's
`MinCount`/`OneOf`/`PositiveTimeSpan` primitives (CI-015), a deliberate trade-off
flagged as a judgment call rather than a clear defect.
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Validator rules and DI registration are correct. No new defects. |
| 2 | Akka.NET conventions | ✓ | No actors in this module (legitimate, per CI-001 resolution). Nothing actor-shaped to evaluate. |
| 3 | Concurrency & thread safety | ✓ | Validator and DI extension remain stateless. No issues. |
| 4 | Error handling & resilience | ✓ | Validator rejects every catastrophic value the design doc enumerates, now including `DownIfAlone = false` (CI-010) and `SeedNodes.Count < 2` (CI-012). No issues. |
| 5 | Security | ✓ | No authn/authz surface, no secret handling, no remoting transport configured here. No issues. |
| 6 | Performance & resource management | ✓ | No resources held; validator allocates a small failure list per call only. No issues. |
| 7 | Design-document adherence | ✓ | `ClusterOptions` contract complete; SBR/seed/failure-detector config matches the design doc. No new drift. |
| 8 | Code organization & conventions | ✓ | Options/validator placement and Options pattern correct; CI-014 dead surface removed. New — validator hand-rolls every rule via raw `RequireThat` instead of the `OptionsValidatorBase` primitives `MinCount`/`OneOf`/`PositiveTimeSpan` (CI-015, a judgment call). |
| 9 | Testing coverage | ✓ | 19 tests across three classes covering options, validator (incl. single-seed and `DownIfAlone=false`), and DI registration. No gaps. |
| 10 | Documentation & comments | ✓ | XML docs accurate across all source files; CI-013 intent comment present. No issues. |
## Checklist coverage
Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
@@ -857,3 +883,72 @@ explicitly stating that this project exposes no actor-registration extension
(actor wiring lives in `ZB.MOM.WW.ScadaBridge.Host`). If the user prefers to keep the
"fail-fast" trap, mark the method `[Obsolete("<message>", error: true)]` so the compiler —
not the runtime — rejects the call.
### ClusterInfrastructure-015 — Validator hand-rolls every rule via raw `RequireThat` instead of the base primitives
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/ClusterOptionsValidator.cs:29-57` |
**Description**
`ClusterOptionsValidator` derives from `OptionsValidatorBase<ClusterOptions>`
(`ZB.MOM.WW.Configuration`), whose `ValidationBuilder` exposes a set of typed
validation primitives — `MinCount<T>` ("requires a collection with at least N
items"), `OneOf` ("requires the value to be one of `allowed`, case-insensitive"),
`PositiveTimeSpan` ("requires a strictly positive duration"), plus `Required`,
`Port`, and `HostPort` — alongside the lower-level `RequireThat(bool, message)`
escape hatch for "custom or cross-field rules". `ClusterOptionsValidator.Validate`
expresses *every* one of its rules through raw `RequireThat`, even where a
first-class primitive exists for the exact check:
- the `SeedNodes.Count >= 2` rule (29-33) is what `MinCount<string>(options.SeedNodes, 2, …)` does;
- the `keep-oldest`-only strategy rule (35-39) is what `OneOf(options.SplitBrainResolverStrategy, AllowedStrategies, …)` does;
- the three `> TimeSpan.Zero` timing rules (45-52) are each what `PositiveTimeSpan(…)` does.
Only the `MinNrOfMembers == 1` equality check, the cross-field
`HeartbeatInterval < FailureDetectionThreshold` comparison, and the `DownIfAlone`
boolean (59-63) genuinely require `RequireThat` (no primitive covers them). Routing
the three primitive-eligible rules through `RequireThat` makes this validator
inconsistent with the *intent* of the shared base (the primitives exist precisely so
common rules read uniformly across the codebase's validators) and risks subtle
wording drift between modules for identical checks.
This is **explicitly a deliberate trade-off, not a clear win**, and is filed as a
judgment call rather than a defect. Adopting the primitives would **lose** what makes
the current messages valuable: each hand-rolled message cites the governing design-doc
section (`Component-ClusterInfrastructure.md → Node Configuration` / `→ Split-Brain
Resolution`) and spells out the catastrophic consequence of the misconfiguration
("would risk a total cluster shutdown on a partition", "blocks the cluster singleton
after failover and halts all data collection", "the oldest node can run as an isolated
single-node cluster"). The shared primitives emit terse, generic messages
("requires a collection with at least 2 items") that cannot carry that
safety-critical context. The peer `AuditLogOptionsValidator` likewise prefers raw
`RequireThat` for the same reason, so the current style is at least internally
consistent.
**Recommendation**
Treat this as a conscious choice between two competing goods rather than assuming a
fix is warranted:
- **Adopt the primitives** (`MinCount`/`OneOf`/`PositiveTimeSpan` for the three
eligible rules; keep `RequireThat` for the equality, cross-field, and boolean
checks) if codebase-wide *wording consistency* across validators is the priority —
accepting that the design-doc citations and consequence explanations move out of the
failure message (e.g. into XML docs or an appended suffix).
- **Keep as-is** if the *richer, safety-critical failure messages* are the priority
for a module whose misconfiguration is cluster-fatal — accepting the minor
inconsistency with the base's primitive surface.
Do not assume either direction; decide deliberately. If kept as-is, a one-line code
comment recording the choice (richer messages over primitive uniformity) would stop a
future reader from "tidying" the validator into the primitives and silently dropping
the consequence text.
**Resolution**
Deferred 2026-06-20: this is a deliberate trade-off, not a clear win — the hand-rolled `RequireThat` messages carry design-doc citations and cluster-fatal consequence text that the shared base primitives (`MinCount`/`OneOf`/`PositiveTimeSpan`) cannot, and the peer `AuditLogOptionsValidator` follows the same style. Adopting the primitives would prioritize cross-validator wording consistency over those richer safety-critical messages — a deliberate decision left to the owner. No change this pass.
+137 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.Communication` |
| Design doc | `docs/requirements/Component-Communication.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 → 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` → `4307c381` |
| Open findings | 2 → 0 |
## Summary
@@ -68,6 +68,41 @@ keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
(Communication-022). Seven new findings, all Open: one High, one Medium, five
Low.
#### Re-review 2026-06-20 (commit `4307c381`) — full review
All prior findings (Communication-001..022) remain closed (Resolved, or Resolved-by-Comm-016).
This module was renamed `ScadaLink → ZB.MOM.WW.ScadaBridge` since `1eb6e97`, so the
`git diff 1eb6e97 HEAD` shows the whole module as added; the review re-examined the
full current state across all 10 categories, with focus on the surface added since the
last review — the four new gRPC RPCs on `SiteStreamService` (`IngestAuditEvents`,
`IngestCachedTelemetry`, `PullAuditEvents`, `PullSiteCalls`) and the M7 command/control
forwards (`SearchAddressSpaceCommand`, `WriteTagRequest`, `VerifyEndpointCommand`,
cert-trust commands). The module is healthy: actor state stays on the actor thread,
`PipeTo`/`Forward`/Sender-capture are clean, gRPC channel/stream/CTS lifecycles are
disciplined, the duplicate-stream and Subscribe-throws cleanup paths are correct, and
the gRPC ingest/pull handlers degrade best-effort (empty ack / empty response, rows stay
Pending) on fault. The only new findings are **design-doc drift**: the doc still claims
`SiteStreamService` has "a single RPC `SubscribeInstance`" and that audit/telemetry
reconciliation rides ClusterClient, when in fact five RPCs now exist (four of them new) and the
reconciliation *pull* runs over gRPC (Communication-023); and the AlarmStateUpdate field
table stops at field 21 while the proto/code carry fields 22-23 (Communication-024). Two
new findings, both Open: zero Critical/High, zero Medium-code, one Medium-doc, one Low.
## Checklist coverage 2026-06-20
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `SubscribeInstance` duplicate-removal + Subscribe-throws cleanup (Comm-021) correct; telemetry open/close gauge is balanced (counted only after Subscribe succeeds, decremented in finally). `DebugStreamBridgeActor` stream-first buffer/dedup sound. No new issues. |
| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies present on both actors; `Forward` preserves Sender on every site relay; `SiteAddressCacheLoaded` now read-only-typed (Comm-020). gRPC ingest handlers capture Sender before `PipeTo`. No new issues. |
| 3 | Concurrency & thread safety | ✓ | `_activeStreams`/`_subscriptions` use ownership-preserving `TryRemove(KeyValuePair)`; `GetOrCreate` uses `AddOrUpdate` atomically; bridge state mutated only on the actor thread via `Self.Tell`. No new issues. |
| 4 | Error handling & resilience | ✓ | gRPC `Ingest*`/`Pull*` degrade to empty ack/response on fault (rows stay Pending) — divergent from the ClusterClient path's `Status.Failure` pipe but behaviourally equivalent (both → Pending+retry), and explicitly documented. `LoadSiteAddressesFromDb` lifecycle-CT bounded (Comm-019). No new issues. |
| 5 | Security | ✓ | `correlation_id` validated against `ActorPath.IsValidPathElement` before actor naming (Comm-014). `Pull*` reject `batch_size <= 0` with `InvalidArgument`. Secrets ride encrypted blocks, not this layer. No new issues. |
| 6 | Performance & resource management | ✓ | gRPC channel disposed synchronously (Comm-007); per-stream CTS `CancelAfter(GrpcMaxStreamLifetime)`; bounded `Channel` (1000, DropOldest). `RemoveSiteAsync` still test-only but Comm-013 documents that endpoint change is handled by `GetOrCreate`. No new issues. |
| 7 | Design-document adherence | ✓ | Doc claims a *single* RPC and ClusterClient-only reconciliation; reality is 5 RPCs (4 new) + gRPC reconciliation pulls (Communication-023). AlarmStateUpdate field table stops at 21; proto/code carry 22-23 (Communication-024). |
| 8 | Code organization & conventions | ✓ | DTO mappers correctly hosted in Communication (owns generated DTOs); proto evolution is additive (field numbers 22-23 appended, never reused); Options pattern intact. No new issues. |
| 9 | Testing coverage | ✓ | `ProtoContractTests` guards oneof variants; `StreamRelayActorTests` + `ConvertToDomainEvent` cover enriched fields 8-23 end-to-end; dedicated proto tests cover the audit/telemetry/pull DTOs. `ProtoRoundtripTests.AlarmStateUpdate_RoundTrip` only asserts fields 1-5, but the gap is covered elsewhere, so no separate finding. |
| 10 | Documentation & comments | ✓ | In-code XML docs are thorough and accurate (the prior-finding comments are precise). The drift is in the design doc only (Communication-023/024), not the code comments. |
## Checklist coverage 2026-05-28
| # | Category | Examined | Notes |
@@ -1120,3 +1155,102 @@ The `_debugSubscriptions` dictionary, `TrackMessageForCleanup` helper, and the
`HandleConnectionStateChanged` handler that consumed them were all deleted as
part of Comm-016's dead-code purge. There is no longer any caller-supplied
correlation-id keyed map to overwrite — the orphan-on-reuse hazard is gone.
---
### Communication-023 — Design doc describes `SiteStreamService` as a single RPC and reconciliation as ClusterClient-only; both are now stale
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `docs/requirements/Component-Communication.md:85`, `docs/requirements/Component-Communication.md:160-166`, `src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto:8-14` |
**Description**
The design doc's "gRPC Proto Definition" section states: *"**Service**: `SiteStreamService`
with **a single RPC** `SubscribeInstance(InstanceStreamRequest) returns (stream
SiteStreamEvent)`."* (line 85). The proto and the generated service now expose **five**
RPCs — `SubscribeInstance`, `IngestAuditEvents`, `IngestCachedTelemetry`,
`PullAuditEvents`, and `PullSiteCalls` (`sitestream.proto:8-14`,
`SiteStreamGrpcServer.cs:211/330/386/436/519`). The four new RPCs are not mentioned
anywhere in the component doc — neither the proto section, the responsibilities, the
topology diagram, nor the Dependencies (the server now also depends on `ISiteAuditQueue`
and `IOperationTrackingStore`).
This drift also contradicts two transport statements elsewhere in the same doc. The
"Cached Call Telemetry" section (line 166) and the "Notification Submission" section
(line 158) both assert **"Transport: ClusterClient"**, and the reconciliation-pull
paragraph (line 164) describes `CachedCallReconcileRequest`/`CachedCallReconcileResponse`.
In the shipped code the reconciliation **pull** runs over **gRPC**, not ClusterClient:
`GrpcPullAuditEventsClient` / `GrpcPullSiteCallsClient`
(`src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/`) call the `PullAuditEvents` / `PullSiteCalls`
RPCs against `SiteStreamGrpcServer`. So the gRPC channel — documented as carrying
**real-time data only** ("`SiteCommunicationActor` and `CentralCommunicationActor` have no
role in streaming event delivery", "gRPC server-streaming for real-time data") — now also
carries site→central audit/telemetry **ingest** and central→site **reconciliation pulls**.
A reader relying on the doc to understand the two-transport split (the central CLAUDE.md
"Data & Communication" invariant) would be materially misled about what flows over which
transport.
**Recommendation**
Update `Component-Communication.md`: (1) replace "a single RPC" with the full RPC list
and document each new RPC's direction, payload, and idempotency; (2) reconcile the
"Transport: ClusterClient" / `CachedCallReconcile*` wording in §10 (and the Site Call
Audit / Audit Log interaction notes) with the gRPC `PullAuditEvents` / `PullSiteCalls`
reality; (3) add `ISiteAuditQueue` / `IOperationTrackingStore` to the SiteStreamGrpcServer
description and Dependencies. If the doc's "gRPC = real-time data only" framing is meant
to be a hard invariant, state explicitly that audit/telemetry ingest and reconciliation
are the documented exceptions.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the design doc (Component-Communication.md) now
enumerates all five `SiteStreamService` RPCs (the one server-streaming `SubscribeInstance`
plus the unary `IngestAuditEvents`/`IngestCachedTelemetry`/`PullAuditEvents`/`PullSiteCalls`),
reconciles the §10 reconciliation wording with the gRPC-pull reality (push telemetry stays
ClusterClient; reconciliation pulls are gRPC request/response), and adds `ISiteAuditQueue`/`IOperationTrackingStore`
to Dependencies. Doc-only.
---
### Communication-024 — AlarmStateUpdate proto carries fields 22-23 but the design-doc field table stops at 21
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `docs/requirements/Component-Communication.md:92-114`, `src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto:87-88` |
**Description**
The "Enriched AlarmStateUpdate" section states the message "was extended **additively**:
existing fields 1-7 are unchanged, and **fields 8-21** carry the enriched native-alarm
state" (line 92), and the field table enumerates fields 8-21 (lines 94-109). The proto
now also defines **field 22 `native_source_canonical_name`** and **field 23
`is_configured_placeholder`** (`sitestream.proto:87-88`), both populated on the server
(`StreamRelayActor.HandleAlarmStateChanged`, lines 97-98) and read back on the client
(`SiteStreamGrpcClient.ConvertToDomainEvent`, lines 257-258). The doc's field table and
its "fields 8-21" / "repopulated from fields 8-21" prose (lines 92, 112) are stale — they
omit fields 22-23. `is_configured_placeholder` in particular is load-bearing: it is the
discriminator `StreamRelayActor` uses to drop snapshot-only placeholder rows before
packing the live stream (lines 63-66), a behaviour the doc does not capture.
The proto evolution itself is correct (additive, field numbers appended and never reused),
so this is purely doc staleness, not a contract regression.
**Recommendation**
Extend the AlarmStateUpdate field table to fields 22-23, update the "fields 8-21" prose to
"8-23" in both the intro and the client-mapping bullet, and add a one-line note that
`is_configured_placeholder` rows are dropped at the relay (never relayed to the live gRPC
stream).
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the `AlarmStateUpdate` field table was extended to
fields 22-23 (`native_source_canonical_name`, `is_configured_placeholder`) with a note that
placeholder rows are dropped at the relay. Doc-only.
+240 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer` | | Module | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer` |
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` | | Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 5 | | Open findings | 0 |
## Summary
@@ -116,6 +116,37 @@ DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
## Findings
#### Re-review 2026-06-20 (commit `4307c381`) — full review
The 2026-06-20 full re-review walked all 10 checklist categories against the current
source, focused on the M7 native-alarm subscribe path, the OPC UA node-browser
(BrowseNext/search/type-info), and the verify-endpoint + site-local cert-trust surface.
The M7 surface is well-built overall — the verify-endpoint probe correctly *captures
but never trusts* an untrusted server certificate, the per-node `CertStoreActor`
broadcast keeps both site nodes' PKI stores consistent across failover, and the
node-browser paging/search/type-info additions are clean. All 22 prior findings remain
`Resolved` and their fixes were verified in place. The review found **4 new findings**,
all clustering on the native-alarm subscribe path, which did **not** inherit the
guards the tag-subscribe path earned across DCL-018 / DCL-021 / DCL-022: the alarm
path leaks its adapter feed on a mid-flight unsubscribe (no DCL-021-style obsolete-
completion guard), has no alarm-resolution retry to match the tag path's
`tag-resolution-retry` timer, leaks `_alarmCts` on `DisposeAsync` (only
`DisconnectAsync` tears it down), and carries a stale "first subscriber wins" comment
on a last-wins filter assignment.
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | No new issues; DCL-016/020/021 counter/response fixes verified. Alarm-path snapshot atomic-swap + per-source prefix routing are correct. |
| 2 | Akka.NET conventions | x | No new issues; DCL-022 `IsTimerActive` gate verified. Alarm subscribe uses `ContinueWith(...).PipeTo(Self)` with `Self`/generation captured (no `Sender`/`this` in closures). |
| 3 | Concurrency & thread safety | x | Finding 023 — native-alarm mid-flight unsubscribe leaks the adapter alarm feed (the tag path's DCL-021 obsolete-completion guard was not mirrored on the alarm path). |
| 4 | Error handling & resilience | x | Finding 024 — no alarm-resolution retry: a failed initial alarm subscribe strands the subscriber with no feed until the next full reconnect (the tag path retries on `tag-resolution-retry`). |
| 5 | Security | x | No new issues. The M7 verify-endpoint probe builds a temporary `RealOpcUaClient`, captures the untrusted server cert, and NEVER trusts it; cert trust is gated through `CertStoreActor` broadcast to both nodes. DCL-012/014 auto-accept warning + Commons default remain out-of-scope follow-ups. |
| 6 | Performance & resource management | x | Finding 025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts` (alarm task + CTS leak on failover/stop); `DisconnectAsync` already cancels+disposes it under `_alarmLock`. |
| 7 | Design-document adherence | x | No new issues; M7 native-alarm read-only mirror, `AlarmKind` discriminator, and per-source `IAlarmSubscribableConnection` feed match the design. DCL-009 doc action still open at doc level (out of scope). |
| 8 | Code organization & conventions | x | No issues — alarm messages/POCOs in Commons, `IAlarmSubscribableConnection` capability seam in the module, options class owned by component. |
| 9 | Testing coverage | x | DCL-001..022 regression tests present. Gaps for findings 023 (alarm unsubscribe mid-flight), 024 (failed alarm subscribe → no retry), 025 (`DisposeAsync` alarm-CTS leak). |
| 10 | Documentation & comments | x | Finding 028 — the `_alarmSourceFilter` "first subscriber wins" XML comment contradicts the last-wins assignment in `HandleSubscribeAlarms`. |
### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread
| | |
@@ -1267,3 +1298,209 @@ calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply
same gate at both call sites. Add a regression test that fires 5 subscribes with
unresolved tags within one retry interval and asserts the retry fires at most one
interval after the first failure (not after the fifth subscribe).
### DataConnectionLayer-023 — Native-alarm mid-flight unsubscribe leaks the adapter alarm feed
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1773,1790-1809,1864-1882` |
**Description**
The native-alarm subscribe path mirrors the tag-subscribe path's
dispatch-I/O-then-`PipeTo(Self)` shape but never inherited the obsolete-completion
guard the tag path earned in DCL-021. `HandleSubscribeAlarms` adds the source to
`_alarmSubscribesInFlight` (line 1773) and dispatches
`alarmable.SubscribeAlarmsAsync(...)` whose `ContinueWith` pipes an
`AlarmSubscribeCompleted` back to `Self`. If an `UnsubscribeAlarmsRequest` for the
last (or only) subscriber is processed on the actor thread between that dispatch and
the completion, `HandleUnsubscribeAlarms` (lines 1864-1882) removes the subscriber,
empties `_alarmSourceSubscribers`, and removes the filter entries — but it tries to
tear down the adapter feed via `_alarmSubscriptionIds.Remove(...)`, which **fails**
because the subscription id has not been stored yet (it is still in flight). It also
leaves the stale `_alarmSubscribesInFlight` entry in place. The late
`AlarmSubscribeCompleted` then reaches `HandleAlarmSubscribeCompleted` (lines
1790-1809), which **unconditionally** stores `_alarmSubscriptionIds[msg.SourceReference]
= msg.SubscriptionId` (line 1796) without re-checking whether any subscriber still
exists in `_alarmSourceSubscribers`.
The net result is an orphaned adapter alarm feed: the OPC UA monitored-item set / the
MxGateway gateway-wide alarm stream stays alive for a source with **zero** subscribers,
streaming transitions into `HandleAlarmTransitionReceived` that match no subscriber set
and are silently dropped — the feed is never torn down for the lifetime of the adapter
(only a full `ReSubscribeAllAlarms` on reconnect clears `_alarmSubscriptionIds`, and
even that re-subscribes from `_alarmSourceSubscribers`, which no longer contains the
orphaned source, so the stale device-side feed is simply abandoned, not closed). The
tag path closes exactly this race in `HandleSubscribeCompleted` (lines 834-867) by
releasing the just-created adapter handle and clearing `_subscribesInFlight` when the
instance entry has gone; the alarm path does neither. This matters because native-alarm
sources are created/destroyed on every deploy/undeploy and instance stop, so each
mid-flight unsubscribe permanently leaks one device-side alarm subscription and its
publish traffic.
**Recommendation**
Mirror the closed DCL-021 fix on the alarm path. In `HandleAlarmSubscribeCompleted`,
before storing the id, guard on the live subscriber set: `if
(!_alarmSourceSubscribers.ContainsKey(msg.SourceReference)) {
if (msg.Success && msg.SubscriptionId != null && _adapter is
IAlarmSubscribableConnection alarmable) _ =
alarmable.UnsubscribeAlarmsAsync(msg.SubscriptionId); return; }` so a feed that
completed after its last subscriber left is released at the adapter rather than stored.
Additionally clear the `_alarmSubscribesInFlight` entry in `HandleUnsubscribeAlarms`
when the last subscriber leaves, so the in-flight marker does not linger. Add a
regression test that subscribes a source, sends `UnsubscribeAlarmsRequest` while the
alarm subscribe I/O is in flight, completes the subscribe, and asserts
`UnsubscribeAlarmsAsync` is called and `_alarmSubscriptionIds` /
`_alarmSubscribesInFlight` are clean.
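The recommended guard can be modelled without Akka.NET. The sketch below is an assumption-laden illustration, not the actor's implementation: `IAlarmFeed`, `AlarmSubscriptionState`, `RecordingAlarmFeed`, and the message shape are hypothetical, and all methods are assumed to run on a single thread, as they would inside an actor's message loop.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical capability seam; the real adapter interface may differ.
public interface IAlarmFeed
{
    Task UnsubscribeAlarmsAsync(string subscriptionId);
}

// Hypothetical message modelled on the completion described in this finding.
public sealed record AlarmSubscribeCompleted(string SourceReference, bool Success, string? SubscriptionId);

// Models only the bookkeeping; assumed to be driven from one thread.
public sealed class AlarmSubscriptionState
{
    private readonly IAlarmFeed _feed;
    private readonly Dictionary<string, HashSet<string>> _subscribers = new();
    private readonly Dictionary<string, string> _subscriptionIds = new();
    private readonly HashSet<string> _inFlight = new();

    public AlarmSubscriptionState(IAlarmFeed feed) => _feed = feed;

    public bool HasSubscriptionId(string source) => _subscriptionIds.ContainsKey(source);
    public bool IsInFlight(string source) => _inFlight.Contains(source);

    // Registers the subscriber and marks the adapter subscribe as in flight.
    public void BeginSubscribe(string source, string subscriber)
    {
        if (!_subscribers.TryGetValue(source, out var set))
            _subscribers[source] = set = new HashSet<string>();
        set.Add(subscriber);
        _inFlight.Add(source);
    }

    public void Unsubscribe(string source, string subscriber)
    {
        if (!_subscribers.TryGetValue(source, out var set)) return;
        set.Remove(subscriber);
        if (set.Count > 0) return;

        _subscribers.Remove(source);
        _inFlight.Remove(source); // last subscriber left: do not let the marker linger
        if (_subscriptionIds.Remove(source, out var id))
            _ = _feed.UnsubscribeAlarmsAsync(id);
    }

    public void OnCompleted(AlarmSubscribeCompleted msg)
    {
        _inFlight.Remove(msg.SourceReference);

        if (!_subscribers.ContainsKey(msg.SourceReference))
        {
            // Obsolete completion: release the just-created feed instead of storing it.
            if (msg.Success && msg.SubscriptionId != null)
                _ = _feed.UnsubscribeAlarmsAsync(msg.SubscriptionId);
            return;
        }

        if (msg.Success && msg.SubscriptionId != null)
            _subscriptionIds[msg.SourceReference] = msg.SubscriptionId;
    }
}

// Test double that records which subscription ids were released.
public sealed class RecordingAlarmFeed : IAlarmFeed
{
    public List<string> Released { get; } = new();

    public Task UnsubscribeAlarmsAsync(string subscriptionId)
    {
        Released.Add(subscriptionId);
        return Task.CompletedTask;
    }
}
```

Driving the model with subscribe, unsubscribe, then a late successful completion should leave no stored id and no in-flight marker, with exactly one release recorded on the feed. That is the shape of the regression test the recommendation asks for.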
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `HandleAlarmSubscribeCompleted` now guards against an orphaned completion (if no live subscriber remains for the source, it tears down the just-created adapter subscription and returns) and `HandleUnsubscribeAlarms` clears the in-flight marker when the last subscriber leaves — mirroring the DCL-021 tag-path fix. Regression test added (verified failing pre-fix).
### DataConnectionLayer-024 — No alarm-resolution retry: a failed initial alarm subscribe strands the subscriber until the next full reconnect
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1752-1757,1790-1809,1885-1908` |
**Description**
When an initial native-alarm subscribe fails, the subscriber is left registered but
with no feed and no path to recovery short of a full connection reconnect.
`HandleSubscribeAlarms` registers the subscriber in `_alarmSourceSubscribers` (lines
1752-1757) **before** issuing the adapter subscribe, so a transition arriving
mid-subscribe is still routed. If `alarmable.SubscribeAlarmsAsync` fails, the piped
`AlarmSubscribeCompleted` reaches `HandleAlarmSubscribeCompleted` (lines 1790-1809),
which on the failure branch only logs a warning (line 1801) and replies a
`SubscribeAlarmsResponse(success: false, ...)`. It does **not** re-arm any retry, does
**not** push `NativeAlarmSourceUnavailable` to the subscriber, and does **not** drop
the source from `_alarmSourceSubscribers`. The subscriber therefore stays in the
routing map with the adapter feed never established.
Recovery only happens via `ReSubscribeAllAlarms` (lines 1885-1908), which is invoked
solely from `BecomeConnected` after a full connection reconnect (line 583). So a
transient device-side failure on the *initial* alarm subscribe — the alarm server
still booting, a momentary fault — leaves the source dark until the entire connection
happens to cycle through `Reconnecting`. This diverges from the tag path, which arms a
periodic `tag-resolution-retry` timer (lines 989-993, 1678-1682; fired by
`HandleRetryTagResolution` at line 1491) so a tag that fails to resolve at subscribe
time is retried every `TagResolutionRetryInterval` (10 s default) without needing a
reconnect. The native-alarm source has no analogous self-healing.
This is partly a **design judgment call** rather than a clear-cut defect: the M7 design
models native alarms as a read-only mirror, and "drop the source and signal the
subscriber" is an equally valid contract to "retry the alarm subscribe periodically."
The defect is that the code currently does **neither** — it silently leaves a
registered subscriber with a permanently-dark feed and no signal — which is the worst
of both.
**Recommendation**
Pick one of the two valid contracts and implement it (this is a judgment call — do not
assume which is wanted):
- **Retry:** add an `alarm-resolution-retry` periodic timer (mirroring
`tag-resolution-retry`): track sources whose subscribe failed, re-issue
`SubscribeAlarmsAsync` on the timer, and gate the timer start with
`IsTimerActive` (per the DCL-022 pattern) so a burst of failed subscribes does not
reset it.
- **Fail fast:** on the failure branch of `HandleAlarmSubscribeCompleted`, push
`NativeAlarmSourceUnavailable` to the subscriber(s) for that source and remove the
source from `_alarmSourceSubscribers` / `_alarmSubscribesInFlight`, so the
`NativeAlarmActor` learns the feed is unavailable rather than waiting forever.
Either way, add a regression test for "initial alarm subscribe fails → subscriber is
not silently left dark."
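For the **Retry** option only, the timer gate can be sketched as below. This does not imply that retry is the contract the design owner wants. `IRetryTimer`, `AlarmRetryTracker`, and `CountingTimer` are hypothetical names; `IRetryTimer.IsActive` stands in for an "is the timer already running" query such as the `IsTimerActive` check mentioned above.

```csharp
using System.Collections.Generic;

// Hypothetical timer seam; Start() is assumed to (re)start the period from zero.
public interface IRetryTimer
{
    bool IsActive { get; }
    void Start();
    void Cancel();
}

// Tracks sources whose alarm subscribe failed and owns the retry timer gate.
public sealed class AlarmRetryTracker
{
    private readonly IRetryTimer _timer;
    private readonly HashSet<string> _failed = new();

    public AlarmRetryTracker(IRetryTimer timer) => _timer = timer;

    // Called on the failure branch of a subscribe completion.
    public void OnSubscribeFailed(string source)
    {
        _failed.Add(source);
        // Gate: a burst of failures must not keep resetting the timer.
        if (!_timer.IsActive) _timer.Start();
    }

    // Called when a (re)subscribe succeeds; stops the timer once nothing is pending.
    public void OnSubscribeSucceeded(string source)
    {
        _failed.Remove(source);
        if (_failed.Count == 0) _timer.Cancel();
    }

    // Called on each timer tick; returns a snapshot of the sources to re-subscribe.
    public IReadOnlyList<string> SourcesToRetry() => new List<string>(_failed);
}

// Test double that counts how often the timer was started.
public sealed class CountingTimer : IRetryTimer
{
    public bool IsActive { get; private set; }
    public int StartCount { get; private set; }
    public void Start() { IsActive = true; StartCount++; }
    public void Cancel() => IsActive = false;
}
```

Without the `IsActive` check, each failed subscribe in a burst would restart the period, so the first retry could be pushed back indefinitely.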
**Resolution**
Deferred 2026-06-20: the fix (add an alarm-resolution retry timer vs. push `NativeAlarmSourceUnavailable` and drop the source on initial-subscribe failure) is a design-owner decision; the current code does neither. Awaiting that decision before implementing.
### DataConnectionLayer-025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts`, leaking the alarm task on failover/stop
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/MxGatewayDataConnection.cs:277-283,121-134` |
**Description**
`MxGatewayDataConnection.DisposeAsync` (lines 277-283) cancels `_eventLoopCts` and
disposes the underlying `_client`, but never touches `_alarmCts`. The alarm feed is
established in `SubscribeAlarmsAsync` (lines 153-175): under `_alarmLock` it lazily
creates `_alarmCts = new CancellationTokenSource()` and launches a long-running
`Task.Run(() => client.RunAlarmStreamAsync(null, ..., token))`. That background alarm
stream is bound only to `_alarmCts.Token`. `DisconnectAsync` (lines 121-134) already
tears it down correctly — under `_alarmLock` it cancels and disposes `_alarmCts`,
nulls it, and resets `_alarmSubCount` — but `DisposeAsync` does not, so the
`CancellationTokenSource` and the alarm-stream `Task` are both leaked whenever the
adapter is disposed without a prior `DisconnectAsync`.
The `DataConnectionActor` disposes adapters fire-and-forget on failover (and on
actor/connection stop) via `_adapter.DisposeAsync()` without necessarily calling
`DisconnectAsync` first, so every MxGateway failover or connection teardown that goes
through `DisposeAsync` leaks one running alarm-stream task plus its CTS. Severity is
Low because the leaked task is cancellation-bound to a CTS that will be GC-reclaimed
eventually and the gRPC stream it holds will fault when `_client` is disposed, but a
still-running alarm-stream loop racing a disposed client is precisely the class of
dangling background work the lock-guarded `DisconnectAsync` block was written to avoid.
**Recommendation**
In `DisposeAsync`, cancel and dispose `_alarmCts` under `_alarmLock` before disposing
the client — copy the block already present in `DisconnectAsync` (lines 124-130):
`lock (_alarmLock) { _alarmCts?.Cancel(); _alarmCts?.Dispose(); _alarmCts = null;
_alarmSubCount = 0; }`. This guarantees the alarm-stream task is cancelled
deterministically on every teardown path, not just the `DisconnectAsync` one.
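A minimal sketch of the teardown shape, assuming both `DisconnectAsync` and `DisposeAsync` funnel into one lock-guarded helper. `AlarmStreamOwner` is a hypothetical class, the `Task.Delay` call only stands in for the real alarm-stream loop, and disposal of the underlying client is left as a comment.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical owner of a cancellable background alarm stream.
public sealed class AlarmStreamOwner : IAsyncDisposable
{
    private readonly object _alarmLock = new();
    private CancellationTokenSource? _alarmCts;
    private int _alarmSubCount;

    public Task? AlarmTask { get; private set; }

    public void SubscribeAlarms()
    {
        lock (_alarmLock)
        {
            _alarmSubCount++;
            if (_alarmCts != null) return; // stream already running

            _alarmCts = new CancellationTokenSource();
            var token = _alarmCts.Token;
            // Stand-in for the long-running alarm stream loop.
            AlarmTask = Task.Run(() => Task.Delay(Timeout.Infinite, token));
        }
    }

    public Task DisconnectAsync()
    {
        StopAlarmStream();
        return Task.CompletedTask;
    }

    public ValueTask DisposeAsync()
    {
        // Same teardown as DisconnectAsync, so no path leaves the stream running.
        StopAlarmStream();
        // ... dispose the underlying client here ...
        return ValueTask.CompletedTask;
    }

    private void StopAlarmStream()
    {
        lock (_alarmLock)
        {
            _alarmCts?.Cancel();
            _alarmCts?.Dispose();
            _alarmCts = null;
            _alarmSubCount = 0;
        }
    }
}
```

Sharing one helper, rather than copying the block, is a design choice of this sketch; it keeps the two teardown paths from drifting apart again.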
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `MxGatewayDataConnection.DisposeAsync` now cancels and disposes `_alarmCts` under `_alarmLock`, matching `DisconnectAsync` — no more alarm task/CTS leak on failover/stop.
### DataConnectionLayer-028 — `_alarmSourceFilter` "first subscriber wins" comment contradicts the last-wins code
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:102-103,1758-1761` |
**Description**
The XML doc on the `_alarmSourceFilter` field (lines 102-103) reads "sourceReference →
raw condition filter string passed to the adapter (**first subscriber wins**)." The
code does the opposite: `HandleSubscribeAlarms` **unconditionally** overwrites the
filter on every subscribe — `_alarmSourceFilter[request.SourceReference] =
request.ConditionFilter;` (line 1758) and likewise re-parses
`_alarmSourceFilterPredicate[request.SourceReference] =
AlarmConditionFilter.Parse(request.ConditionFilter)` (line 1761) — with no
"already present" guard. So a **second** subscriber to the same source reference
re-filters the shared feed with its own condition filter, i.e. **last subscriber
wins**, not first. The comment will mislead a maintainer reasoning about which filter
governs a shared alarm feed when two instances subscribe to the same source with
different condition filters, and it understates a real behavioural subtlety (the second
subscriber silently changes the gate applied to the first subscriber's transitions in
`HandleAlarmTransitionReceived`).
**Recommendation**
Correct the comment to "last subscriber wins" (and note that the shared feed carries a
single filter, so co-subscribers to one source reference must agree on the condition
filter). If per-subscriber filtering is actually intended, make the predicate
per-subscriber rather than per-source so each subscriber's own filter governs only its
own deliveries — but that is a behaviour change, not a comment fix, and should be
decided explicitly.
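The last-wins behaviour is easy to reproduce in a small model. `SharedFilterRouter` is hypothetical and uses plain predicates in place of the parsed condition filter; it only mirrors the unconditional per-source overwrite described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model: one filter per source reference, shared by all subscribers.
public sealed class SharedFilterRouter
{
    private readonly Dictionary<string, Func<string, bool>> _filterBySource = new();
    private readonly Dictionary<string, List<string>> _subscribersBySource = new();

    public void Subscribe(string source, string subscriber, Func<string, bool> filter)
    {
        if (!_subscribersBySource.TryGetValue(source, out var list))
            _subscribersBySource[source] = list = new List<string>();
        list.Add(subscriber);

        // Unconditional overwrite: the most recent subscriber's filter governs the feed.
        _filterBySource[source] = filter;
    }

    // Returns the subscribers that a transition with this condition would reach.
    public IReadOnlyList<string> Route(string source, string condition)
    {
        if (!_filterBySource.TryGetValue(source, out var filter) || !filter(condition))
            return Array.Empty<string>();
        return _subscribersBySource[source];
    }
}
```

If subscriber A registers a filter for `"HiHi"` and subscriber B then registers one for `"Lo"`, a `"HiHi"` transition reaches nobody and a `"Lo"` transition reaches both, including A, whose own filter would have rejected it.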
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the `_alarmSourceFilter` comment corrected from 'first subscriber wins' to 'last subscriber wins' (and notes co-subscribers share the single filter).
+206 -2
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager` | | Module | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager` |
| Design doc | `docs/requirements/Component-DeploymentManager.md` | | Design doc | `docs/requirements/Component-DeploymentManager.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 0 | | Open findings | 0 |
## Summary
@@ -83,8 +83,51 @@ The remaining six findings are medium/low: lifecycle-timeout audit gap
(DeploymentManager-023), and shared static state across `*ProbeActor` tests
(DeploymentManager-024).
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Re-reviewed the whole current module at HEAD after the rename, the
cert-broadcast / Transport `IStaleInstanceProbe` work, and milestone changes.
DeploymentManager-001..024 all remain `Resolved` and verified against source —
the ref-counted `OperationLockManager`, the broadened/non-cancellable failure
writes, the `ApplyPostSuccessSideEffectsAsync` shared helper with
`forceEnabledState` (Disabled-preservation), the lifecycle-timeout audit helper,
the structured diff with List-value normalization, the hoisted global artifact
fetch, and the instance-state-aware reconciliation are all present and correct.
The two flagged cross-module/architectural seams the prompt called out — the
`TrustServerCert`/`RemoveServerCert` broadcast-to-both-nodes and the
`DeploymentManagerActor` deploy-state query handler — live in **SiteRuntime /
Communication / CentralUI**, not this module, so they are out of scope here.
This review found **3 new findings**. The material one is DeploymentManager-025:
the system-wide artifact path still fetches and broadcasts **notification lists
and SMTP configurations (including SMTP credentials)** to every site, in direct
contradiction of the now-explicit design decision that these are central-only
and "no SMTP credential is ever distributed to sites" (Component-DeploymentManager.md
lines 142-146; CLAUDE.md notification-central-only decision). This supersedes
the earlier accepted-deployable-artifact framing of the closed
DeploymentManager-013. DeploymentManager-026 (deployment records are insert-only
— a new row per deploy accumulates per instance, contradicting "only current
status stored, no history table", and the same-tick `OrderByDescending(DeployedAt)`
read has no tiebreaker) and DeploymentManager-027 (artifact tests *assert* the
forbidden notif/SMTP shipping, cementing the DeploymentManager-025 violation)
are the remaining two.
## Checklist coverage
#### Re-review 2026-06-20 (commit `4307c381`)
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | New: deployment records are insert-only — `DeployInstanceAsync` Adds a new row per deploy; reconciliation's `GetCurrentDeploymentStatusAsync` orders by `DeployedAt` with no tiebreaker (DeploymentManager-026). |
| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. The deploy-state-query/cert-broadcast actors live in SiteRuntime, out of scope. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting + gate re-verified; `DeployToAllSitesAsync` prebuilds per-site commands before the parallel phase (no shared DbContext under `Task.WhenAll`). No issues. |
| 4 | Error handling & resilience | ✓ | Failure-status writes use `CancellationToken.None`; lifecycle timeouts now audit; delete-orphan path surfaced. No new issues. |
| 5 | Security | ✓ | New: SMTP credentials are still serialized into the per-site artifact command and broadcast to every site, which the current design forbids outright (DeploymentManager-025). |
| 6 | Performance & resource management | ✓ | Global artifact queries hoisted (DM-023 resolved). Deployment-record row growth is unbounded per instance (part of DeploymentManager-026). |
| 7 | Design-document adherence | ✓ | New: notification lists + SMTP configs are still treated as deployable artifacts, contradicting the "central-only, never distributed to sites" design (DeploymentManager-025). |
| 8 | Code organization & conventions | ✓ | Options bound via Host; `OptionsSection` constant correct. No new issues. |
| 9 | Testing coverage | ✓ | Broad and current. New: artifact tests assert the forbidden notif/SMTP shipping (DeploymentManager-027). |
| 10 | Documentation & comments | ✓ | `ArtifactDeploymentService` class XML doc still lists notification lists + SMTP as broadcast artifacts (stale vs design — folded into DeploymentManager-025). |
#### Re-review 2026-05-28 (commit `1eb6e97`)
| # | Category | Examined | Notes |
@@ -1234,3 +1277,164 @@ recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg`
in each test. Where the simpler counter shape is preferred, pass a
shared-state object into the actor's constructor so each test owns its own
instance — never reach for `static` mutable test state.
### DeploymentManager-025 — Notification lists and SMTP configurations (with credentials) are still broadcast to every site, contradicting the central-only design
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:13-22,128,130,150-151,181-188` |
**Description**
The design now states explicitly, in two authoritative places, that notification
lists and SMTP configuration are **not** deployable artifacts:
- `docs/requirements/Component-DeploymentManager.md:142-146`: *"Notification lists
and SMTP configuration are **not** deployable artifacts — they are central-only
definitions managed by the Notification Service ... Notification delivery happens
on the central cluster, so **no notification artifact or SMTP credential is ever
distributed to sites**."*
- CLAUDE.md "External Integrations" decision: *"Notification delivery is
central-only ... Notification lists and SMTP config are no longer deployed to
sites; recipient resolution happens at central, at delivery time."*
`ArtifactDeploymentService` still does the opposite. `FetchGlobalArtifactsAsync`
queries `_notificationRepo.GetAllNotificationListsAsync` and
`GetAllSmtpConfigurationsAsync` (lines 150-151), maps them into
`NotificationListArtifact`s and `SmtpConfigurationArtifact`s — the SMTP artifact
carrying `smtp.Credentials` verbatim (line 188) — and `BuildDeployArtifactsCommandAsync`
places both into the `DeployArtifactsCommand` sent to **every site** (lines 128,
130). The site side persists them: `SiteReplicationActor` (lines 192-201) and
`DeploymentManagerActor` (lines 1383-1419) loop over `command.NotificationLists`
and `command.SmtpConfigurations` and write them to site SQLite via
`SiteNotificationRepository`.
This is the precise scenario the design says must never happen: SMTP credentials
travel across the inter-cluster transport and land on every site's SQLite. It
supersedes the framing of the now-closed DeploymentManager-013, which accepted
SMTP-as-deployable-artifact as a documented design decision — the design has since
flipped to forbid distribution entirely, so this is a fresh divergence, not the
same finding. The class-level XML doc (lines 13-22) is correspondingly stale: it
still advertises "notification lists ... and SMTP configurations" as artifacts the
service "broadcasts ... to all sites."
Secondary defect in the same mapping: `NotificationListArtifact` is built from
`nl.Recipients.Where(r => r.EmailAddress is not null)` (line 182), which silently
drops every SMS-only recipient (`PhoneNumber` set, `EmailAddress` null) of a
`NotificationType.Sms` list. Even if list distribution were intended, the SMS
recipient set would be lost — but since lists must not be distributed at all, this
is subsumed by the primary fix.
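The secondary defect is easy to see on a two-recipient list. The sketch below uses a hypothetical `Recipient` record (only the `EmailAddress` / `PhoneNumber` property names come from the finding) and contrasts the quoted filter with an any-channel alternative. The alternative is shown purely for comparison, since the primary fix is to stop distributing lists at all.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Hypothetical recipient shape; only the two contact properties are taken from the finding.
public sealed record Recipient(string Name, string? EmailAddress, string? PhoneNumber);

public static class RecipientMapping
{
    // Mirrors the filter quoted above: SMS-only recipients are silently dropped.
    public static List<Recipient> EmailOnly(IEnumerable<Recipient> recipients) =>
        recipients.Where(r => r.EmailAddress is not null).ToList();

    // Comparison only: keeps any recipient that has at least one contact channel.
    public static List<Recipient> AnyChannel(IEnumerable<Recipient> recipients) =>
        recipients.Where(r => r.EmailAddress is not null || r.PhoneNumber is not null).ToList();
}
```

Given one email recipient and one SMS-only recipient, `EmailOnly` returns a single entry while `AnyChannel` returns both.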
**Recommendation**
Stop fetching, mapping, and shipping notification lists and SMTP configurations
from the artifact deployment path. Drop the `_notificationRepo` queries from
`FetchGlobalArtifactsAsync`, pass `null` (or empty) for the `NotificationLists`
and `SmtpConfigurations` fields of `DeployArtifactsCommand`, and update the class
XML doc to remove both from the artifact list. The message fields can remain on
`DeployArtifactsCommand` for additive compatibility but must never be populated
from central. Coordinate removal of the consuming code in SiteRuntime
(`SiteReplicationActor`, `DeploymentManagerActor`, `SiteNotificationRepository`)
in the same session per the project's "design + code + tests travel together"
rule. Update the contradicting tests (see DeploymentManager-027).
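A minimal sketch of the central-only shape the recommendation above describes, assuming simplified stand-in records. `ArtifactCommandBuilder`, `BuildCentralOnly`, and the field lists shown here are hypothetical; the real `DeployArtifactsCommand` carries more data. The point is only that the two fields stay on the contract for additive compatibility while central never populates them.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical, simplified stand-ins for the real artifact records.
public sealed record NotificationListArtifact(string Name, IReadOnlyList<string> Recipients);
public sealed record SmtpConfigurationArtifact(string Host, string Credentials);

// Simplified stand-in for the real message: the two nullable fields remain
// on the contract so older sites can still deserialize the command.
public sealed record DeployArtifactsCommand(
    IReadOnlyList<string> SharedScripts,
    IReadOnlyList<NotificationListArtifact>? NotificationLists,
    IReadOnlyList<SmtpConfigurationArtifact>? SmtpConfigurations);

public static class ArtifactCommandBuilder
{
    // Central never fetches or ships notification lists or SMTP configs,
    // so both fields are always null on the outgoing command.
    public static DeployArtifactsCommand BuildCentralOnly(IReadOnlyList<string> sharedScripts) =>
        new(sharedScripts, NotificationLists: null, SmtpConfigurations: null);
}
```

Keeping the fields but hard-wiring them to null at the single construction site makes the central-only rule easy to assert in a test.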
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): central `FetchGlobalArtifactsAsync` no longer queries or ships notification lists / SMTP configs (passes null; `DeployArtifactsCommand` fields kept for contract compatibility), and the site purges any already-persisted notification_lists / smtp_configurations rows (clearing the plaintext SMTP password) on both apply paths — enforcing the central-only delivery design. Verified no site runtime/delivery path reads this config.
### DeploymentManager-026 — `DeployInstanceAsync` inserts a new deployment record every deploy; per-instance rows accumulate and the "current status" read has no tiebreaker
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:215-225`, `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:55-61` |
**Description**
`DeployInstanceAsync` always creates a brand-new `DeploymentRecord` and calls
`_repository.AddDeploymentRecordAsync(record, …)` (line 223) — it never reuses or
updates the instance's existing record. There is no update-in-place or
delete-prior path on the deploy flow, so every successful or failed deployment of
an instance leaves its predecessor row behind. Over the life of a central process
— amplified by the bulk "deploy all out-of-date instances" workflow and by repeated
redeploys after timeouts — the `DeploymentRecords` table grows without bound, one
row per deploy attempt per instance.
This contradicts the design's "Deployment Status Persistence" section
(`Component-DeploymentManager.md:106-109`): *"Only the **current deployment
status** per instance is stored in the configuration database ... No deployment
history table — the audit log (via IAuditService) already captures every
deployment action."* The audit log is the history; the deployment-record table is
supposed to hold only the current status. The implementation instead keeps an
ad-hoc, unindexed history there.
The accumulation also makes the reconciliation read order-sensitive.
`TryReconcileWithSiteAsync` reads the "prior" record via
`GetCurrentDeploymentStatusAsync`, which is
`OrderByDescending(d => d.DeployedAt).FirstOrDefault()` with **no secondary sort
key** (e.g. `ThenByDescending(d => d.Id)`). `DeployedAt` is a `DateTimeOffset`
stamped with `DateTimeOffset.UtcNow` at record creation; two records inserted
within the same clock tick (rapid redeploy, or a redeploy immediately after a
timed-out attempt) tie on `DeployedAt`, and SQL Server's choice between equal
sort keys is undefined. Reconciliation could then read the *wrong* prior record
(e.g. an older Success instead of the latest stuck InProgress), skipping the
intended site query, or vice-versa.
**Recommendation**
Either (a) make the deploy path upsert the instance's single current record
(update-in-place when one exists, insert only on first deploy) so the table holds
exactly one row per instance per the design, or (b) if multiple rows are
deliberately retained, add a deterministic tiebreaker to
`GetCurrentDeploymentStatusAsync` (`OrderByDescending(d => d.DeployedAt)
.ThenByDescending(d => d.Id)`) and document the retention/cleanup story so the
table does not grow unbounded. Option (a) aligns with the design and is preferred.
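Option (b)'s tiebreaker can be sketched against an in-memory list. This is an illustration only: the `DeploymentRecord` shape and the `DeploymentStatusQuery.GetCurrent` helper are hypothetical, and the sketch assumes `Id` increases with insertion order (for example an identity column), which is what makes it a usable secondary key.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified record; the real entity has more columns.
public sealed record DeploymentRecord(int Id, string InstanceId, DateTimeOffset DeployedAt, string Status);

public static class DeploymentStatusQuery
{
    // Newest record for an instance. DeployedAt is the primary key of the sort;
    // Id breaks ties when two records share the same timestamp.
    public static DeploymentRecord? GetCurrent(IEnumerable<DeploymentRecord> records, string instanceId) =>
        records
            .Where(d => d.InstanceId == instanceId)
            .OrderByDescending(d => d.DeployedAt)
            .ThenByDescending(d => d.Id)
            .FirstOrDefault();
}
```

With the secondary key in place, two rows stamped in the same clock tick resolve to the higher `Id` regardless of the order in which the store returns them.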
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added a deterministic `.ThenByDescending(d => d.Id)` tiebreaker to `GetCurrentDeploymentStatusAsync` so same-tick deployment records resolve to the newest row. Insert-per-deploy behaviour unchanged (history-vs-upsert remains a separate design question).
### DeploymentManager-027 — Artifact tests assert that notification lists and SMTP configs are shipped, cementing the DeploymentManager-025 design violation
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:173-174,200-201` |
**Description**
`ArtifactDeploymentServiceTests` asserts, via NSubstitute, that the artifact
deployment path queries the forbidden artifact sets exactly once per deployment:
```csharp
await _notificationRepo.Received(1).GetAllNotificationListsAsync(Arg.Any<CancellationToken>());
await _notificationRepo.Received(1).GetAllSmtpConfigurationsAsync(Arg.Any<CancellationToken>());
```
(both in the `DeployToAllSitesAsync` global-query-hoisting test at 173-174 and the
`RetryForSiteAsync` test at 200-201). These assertions pin the exact behaviour the
current design forbids (DeploymentManager-025): they will keep the service shipping
notification lists and SMTP configs to sites and will actively block the fix —
removing the queries makes these `Received(1)` assertions fail. Tests that lock in
a design violation are worse than no test, because they make the correct change
look like a regression.
**Recommendation**
When DeploymentManager-025 is fixed, change these to
`DidNotReceive().GetAllNotificationListsAsync(...)` /
`DidNotReceive().GetAllSmtpConfigurationsAsync(...)` (and assert the
`DeployArtifactsCommand`'s `NotificationLists` / `SmtpConfigurations` fields are
null/empty) so the tests enforce the central-only design instead of contradicting
it.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the artifact tests no longer assert the forbidden notification/SMTP shipping; the `Received(1)` assertions were flipped to `DidNotReceive()`, and assertions were added that the shipped command's `NotificationLists`/`SmtpConfigurations` are null.
+197 -2
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway` | | Module | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway` |
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` | | Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 0 | | Open findings | 0 |
## Summary ## Summary
@@ -81,6 +81,42 @@ design-doc drift). Theme: every new finding is in a code path that was added or
touched by the earlier fix bundle but whose error-propagation contract was not touched by the earlier fix bundle but whose error-propagation contract was not
verified end-to-end against the S&F engine or the design doc. verified end-to-end against the S&F engine or the design doc.
#### Re-review 2026-06-20 (commit `4307c381`) — full review
All twenty-three prior findings (001–023) remain `Resolved`; spot-checks against the
current source confirm the fixes still hold. Since `1eb6e97` the module gained a new
SQL error-classification layer (`SqlErrorClassifier` + `TransientDatabaseException` /
`PermanentDatabaseException`, commits `d0527064` / `de375ff7`) and `DatabaseGateway.CachedWriteAsync`
was reshaped to attempt the write immediately and classify the outcome (transient →
buffer, permanent → synchronous `Failed`, mirroring `CachedCallAsync`); the `ScadaLink → ZB.MOM.WW.ScadaBridge`
rename also landed. The full 10-category checklist was re-walked and surfaced **three new
findings**, none Critical/High. The most serious (`ExternalSystemGateway-024`, Medium) is
that `SqlErrorClassifier.IsTransient(Exception)` classifies *every* `InvalidOperationException`
(and `TimeoutException`) as transient, so a DB-layer authoring/driver-misuse bug that surfaces
as `InvalidOperationException` is silently buffered and retried instead of propagating —
contradicting the classifier's own stated "authoring bugs must propagate" contract and the
HTTP-path symmetry it claims. `-025` (Low): a caller cancellation that the SQL driver
raises as a `SqlException` (mid-flight cancel) is reclassified as a *permanent* DB error
rather than propagating as a cancellation (a narrow asymmetry with the `-008` contract; untested).
`-026` (Low) is a mis-ordered numbered-label comment block in `ExecuteWriteAsync`. Theme:
the new SQL-classification seam faithfully mirrors the HTTP path's *shape* but is more
permissive on the non-typed-exception branch than the HTTP path it cites as its model.
_Re-review (2026-06-20, `4307c381`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `BuildUrl` / `JsonElementToParameterValue` precision cascade / verb validation all still correct (006/017/020/022). New: caller-cancel surfacing as `SqlException` is misclassified permanent — finding 025. |
| 2 | Akka.NET conventions | ☑ | Still no actors; `AddExternalSystemGatewayActors` remains a no-op. Cached-call/write lifecycle + audit emission live in SiteRuntime/AuditLog, correct boundary. No issues. |
| 3 | Concurrency & thread safety | ☑ | Services stateless and DI-scoped; new `ExecuteWriteAsync`/`RunSqlAsync`/`SqlErrorClassifier` seams introduce no shared mutable state (static `HashSet` is read-only). No findings. |
| 4 | Error handling & resilience | ☑ | New SQL transient/permanent split is the headline area. `SqlErrorClassifier.IsTransient(Exception)` is over-broad — all `InvalidOperationException`/`TimeoutException` are transient, sweeping authoring bugs into retry — finding 024. Number-based set (024 aside) is sound; unknown→permanent fail-fast is correct. |
| 5 | Security | ☑ | `ExecuteWriteAsync`/`SqlErrorClassifier` use the connection NAME, never the connection string, in error text (verified — credentials not leaked). `ApplyAuth` secrets still never logged. SQL `CommandText` is script-authored by design (parameterised values bound separately). No new findings. |
| 6 | Performance & resource management | ☑ | `RunSqlAsync` uses `await using`/`using` for connection + command. `EmptyParameters` reused. `client.Timeout` set per rented client (correct — `IHttpClientFactory` clients are per-call). No new findings. |
| 7 | Design-document adherence | ☑ | Updated doc (lines 65–66) now describes the immediate-attempt-then-buffer cached-write model the code implements, so doc and code are in sync. PATCH reconciled (023). No new drift. |
| 8 | Code organization & conventions | ☑ | `Transient`/`PermanentDatabaseException` are component-local exception types (parallel to the HTTP ones); acceptable. `SqlErrorClassifier` is a static policy class in the component. No new findings. |
| 9 | Testing coverage | ☑ | New SQL-classification paths well covered (immediate + retry, transient/permanent/non-Sql/cancel/unexpected). Gaps: caller-cancel-as-`SqlException` (025) and the `InvalidOperationException`-authoring-bug case (024) are untested — recorded under those findings. |
| 10 | Documentation & comments | ☑ | XML docs accurate and thorough. The numbered case labels in `ExecuteWriteAsync`'s comment block are mis-ordered vs the catch sequence — finding 026. |
## Checklist coverage ## Checklist coverage
| # | Category | Examined | Notes | | # | Category | Examined | Notes |
@@ -1312,3 +1348,162 @@ code travel together" rule:
- If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the - If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the
body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it. body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it.
The design-doc list then remains the single source of truth. The design-doc list then remains the single source of truth.
### ExternalSystemGateway-024 — `SqlErrorClassifier.IsTransient(Exception)` classifies all `InvalidOperationException`/`TimeoutException` as transient, so DB authoring/driver-misuse bugs are buffered and retried instead of propagating
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:123-142`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:346-355` |
**Description**
`SqlErrorClassifier.IsTransient(Exception)` is a pure exception-type test:
```csharp
return exception is InvalidOperationException
or IOException
or SocketException
or TimeoutException
or TaskCanceledException
or DbException;
```
`DatabaseGateway.ExecuteWriteAsync` uses it as the third ordered catch
(`catch (Exception ex) when (SqlErrorClassifier.IsTransient(ex))`) to decide whether a
write that did *not* surface as a `SqlException` should be buffered and retried. The
type's own XML doc justifies the set as "a live DB outage does not always surface as a
`SqlException`" and explicitly contrasts it with authoring bugs: *"Authoring bugs
(`ArgumentException`, `NullReferenceException`, etc.) are loud, fixable failures —
silently buffering and retrying them forever would hide the bug."*
The problem is that `InvalidOperationException` and `TimeoutException` are exactly the
two most common shapes of an authoring/driver-misuse bug on the ADO.NET write path, and
both are now swept into the transient set:
1. `Microsoft.Data.SqlClient` throws `InvalidOperationException` for a long list of
*programming* errors that are NOT outages and will fail identically on every retry —
e.g. a `SqlParameter` already belongs to another `SqlParameterCollection`, the
command's `Connection` property is not set, an open `SqlDataReader` already exists on
the connection, or `CommandText` is invalid for the call. The doc treats
`InvalidOperationException` as "the connection is not open / pooled connection
broken", but the type carries no such discrimination — any of those authoring
conditions is classified transient and buffered.
2. A bare `TimeoutException` (or any of these) thrown by a custom/wrapped provider for a
non-outage reason is likewise classified transient.
Consequence: a `CachedWrite` whose payload triggers one of these deterministic
programming errors is returned to the script as `WasBuffered: true` (a lie — it will
never succeed), then re-attempted on every S&F sweep, failing identically each time,
until `RetryCount >= DefaultMaxRetries` and it is finally parked — a noisy retry loop
hiding a fixable bug, which is the exact outcome the doc says the design avoids. It is
also asymmetric with the HTTP path it cites as its model: `ErrorClassifier.IsTransient(Exception)`
(ExternalSystemClient) deliberately does NOT catch `InvalidOperationException`, so the
same class of bug propagates on the API side.
**Recommendation**
Narrow the non-`SqlException` transient set to genuine transport/outage shapes. Drop
the broad `InvalidOperationException` arm — or, if connection-state errors must stay
transient, narrow it (e.g. only when the message indicates a broken/closed connection,
or only the `DbException`/`SocketException`/`IOException`/timeout shapes) so that
ADO.NET programming `InvalidOperationException`s propagate as the design intends.
Mirror the HTTP path's narrower `ErrorClassifier.IsTransient(Exception)` set unless
there is a documented reason the SQL path must be broader. Add a regression test that
throws a programming-shaped `InvalidOperationException` (e.g. "The SqlParameter is
already contained by another SqlParameterCollection.") from `RunSqlAsync` and asserts
it propagates (not buffered), companion to the existing
`CachedWrite_UnexpectedException_Propagates_NotClassifiedTransient`.
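One possible narrowing, sketched under assumptions: it simply drops the `InvalidOperationException` arm from the set quoted in the description and keeps the rest. The class name `NarrowedSqlTransientPolicy` is hypothetical, and whether `TimeoutException`/`TaskCanceledException` should also be tightened is a policy choice this sketch does not settle.

```csharp
using System;
using System.Data.Common;
using System.IO;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical narrowed policy for exceptions that are NOT SqlException.
public static class NarrowedSqlTransientPolicy
{
    // Transport/outage shapes only. InvalidOperationException is deliberately
    // absent, so ADO.NET programming errors propagate instead of being buffered.
    public static bool IsTransient(Exception exception) =>
        exception is IOException
            or SocketException
            or TimeoutException
            or TaskCanceledException
            or DbException;
}
```

Under this set, a programming-shaped `InvalidOperationException` such as "The SqlParameter is already contained by another SqlParameterCollection." is no longer classified transient, while an `IOException` from a dropped connection still is.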
**Resolution**
Deferred 2026-06-20: narrowing the over-broad SQL transient set (currently all `InvalidOperationException`/`TimeoutException` are treated transient) to match the HTTP path is a classification-policy decision for the owner; recommended narrowing recorded. No data loss today (bounded noisy retry returning a false WasBuffered), so deferred rather than forced.
### ExternalSystemGateway-025 — Caller-token cancellation surfacing from the SQL driver as a `SqlException` is reclassified as a permanent DB error instead of propagating
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:329-345`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:158-174` |
**Description**
`ExecuteWriteAsync`'s ordered catches handle caller cancellation only via
`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`.
That covers the common case where the driver observes the token before/around the
network call and raises `OperationCanceledException`/`TaskCanceledException`. However,
when `Microsoft.Data.SqlClient` is cancelled *mid-statement* by the caller's token it
can instead raise a **`SqlException`** (historically error number `0` / message
"Operation cancelled by user", and on some paths `3980` "the request failed to run
because the batch is aborted"). Such a `SqlException` does not derive from
`OperationCanceledException`, so the first filter is skipped; it falls into
`catch (SqlException ex) { throw SqlErrorClassifier.Throw(connectionName, ex); }`, and
because `0`/`3980` are not in the transient set, `SqlErrorClassifier.Throw` raises a
`PermanentDatabaseException`.
The caller-requested cancellation is therefore reported to the script as a *permanent
database failure* (`ExternalCallResult.Success == false`, "Permanent database error: …")
rather than propagating as an `OperationCanceledException`. This contradicts the
`ExternalSystemGateway-008` contract ("the caller asked to abandon the work — do not
reclassify") that the comment block at `DatabaseGateway.cs:333-337` explicitly cites,
and it is the SQL-path analogue of the cancellation/timeout conflation that `-008`
fixed for HTTP. The existing `CachedWrite_CancellationRequested_PropagatesOperationCanceled_NotReclassified`
test only exercises the `OperationCanceledException` shape, so the `SqlException`
cancellation path is uncovered.
**Recommendation**
Before classifying a `SqlException`, re-check `cancellationToken.IsCancellationRequested`
and, if set, rethrow as `OperationCanceledException(cancellationToken)` (or call
`cancellationToken.ThrowIfCancellationRequested()`), so a caller-initiated cancel
propagates regardless of whether the driver surfaced it as `OperationCanceledException`
or as a cancellation-shaped `SqlException`. Add a regression test that cancels the
caller token and throws a cancellation-shaped `SqlException` (or, since `SqlException`
has no public constructor, drive it through `RunSqlAsync` with a token-cancelled
`OperationCanceledException` wrapped to assert the guard) and asserts an
`OperationCanceledException` propagates and nothing is buffered.
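The guard described above can be sketched without the SQL driver. Assumptions, stated loudly: `FakeDriverException` stands in for `SqlException` (both derive from `DbException`), the `PermanentDatabaseException` below is a simplified local stand-in, `CancellationAwareWrite` is a hypothetical name, and the number-based transient/permanent classification is collapsed to "always permanent" to keep the example short.

```csharp
using System;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for SqlException, which has no public constructor.
public sealed class FakeDriverException : DbException
{
    public FakeDriverException(string message) : base(message) { }
}

// Simplified local stand-in for the component's permanent-failure type.
public sealed class PermanentDatabaseException : Exception
{
    public PermanentDatabaseException(string message, Exception inner) : base(message, inner) { }
}

public static class CancellationAwareWrite
{
    public static async Task ExecuteAsync(
        Func<CancellationToken, Task> runSql,
        string connectionName,
        CancellationToken cancellationToken)
    {
        try
        {
            await runSql(cancellationToken);
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // Caller asked to abandon the work: propagate unchanged.
            throw;
        }
        catch (DbException ex)
        {
            // A cancel observed mid-statement can surface as a driver exception.
            // Re-check the caller's token before classifying it as a DB failure.
            cancellationToken.ThrowIfCancellationRequested();
            throw new PermanentDatabaseException(
                $"Permanent database error on connection '{connectionName}'.", ex);
        }
    }
}
```

Checking the token rather than the exception's number or message keeps the guard independent of how a given driver version happens to report the cancel.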
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `ExecuteWriteAsync` now calls `cancellationToken.ThrowIfCancellationRequested()` at the top of the `catch (SqlException)` block (before `SqlErrorClassifier.Throw`), so a caller-token cancel that surfaces as a SqlException propagates as `OperationCanceledException` regardless of the driver's exception shape (version-independent). Test added (verified failing pre-fix).
### ExternalSystemGateway-026 — `ExecuteWriteAsync` comment-block case labels are mis-ordered relative to the catch sequence
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:316-355` |
**Description**
The header comment for `ExecuteWriteAsync` (lines 316-328) numbers the classification
cases 1–4 in catch order: `1.` caller-requested cancellation propagates unchanged; `2.`
a `SqlException` is classified by number; `3.` a non-`SqlException` transport/connection
failure is transient; `4.` unexpected exceptions propagate. The inline labels on the
catch bodies, however, are inverted: the cancellation catch (line 333) is annotated
`// [2]` and the non-`SqlException` transient catch (line 348) is annotated `// [1]`.
The labels also do not match the parallel labelling in the test file, where `[1]` is
used for the non-Sql-outage case and `[2]` for cancellation — so the source comment and
the tests disagree with the source's own header list.
This is cosmetic but actively misleading: a reader cross-referencing the numbered
rationale in the header against the `[n]`-tagged catch bodies will map each case to the
wrong rationale, and the discrepancy undermines confidence in an otherwise carefully
documented classification seam.
**Recommendation**
Renumber the inline `// [n]` labels to match the header list (cancellation = `[1]`,
non-`SqlException` transient = `[3]`), or drop the bracketed numbers entirely and keep
the prose, so the comment is self-consistent. Align the test-file annotations to
whatever scheme is chosen.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): renumbered the inline `// [n]` case labels in `ExecuteWriteAsync` to match the method header's rationale list (cancellation=[1], SqlException=[2], non-Sql=[3]), and realigned the three corresponding labels in the test file. Comment-only.
+149 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring` | | Module | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` | | Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 2 | | Open findings | 0 |
## Summary ## Summary
@@ -80,6 +80,45 @@ covers (HealthMonitoring-023). All sequence-number and offline-detection
arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
monotonic mismatch was observed. monotonic mismatch was observed.
#### Re-review 2026-06-20 (commit `4307c381`) — full review
All twenty-three prior findings (HealthMonitoring-001..023) remain `Resolved`
and were spot-verified against the current source: the HealthMonitoring-017/018
"snapshot-then-restore-on-failure" counter logic is in place
(`HealthReportSender.cs:158-176`, `CentralHealthReportLoop.cs:118-134`,
`SiteHealthCollector.AddIntervalCounters`), the HealthMonitoring-020 offline→online
`max(receivedAt, now)` anchor is correct (`CentralHealthAggregator.cs:138-149`),
and the HealthMonitoring-021 `$central` sentinel is honoured by the aggregator and
the Central UI. The full 10-category checklist produced **2 new findings, both Low,
none crash-class** — both confined to the new M6 `SiteHealthKpiSampleSource`
(KPI History sampling) added since the last baseline: the synthetic `$central`
self-report is sampled as a real `KpiScopes.Site` series with meaningless
zero connection/instance/S&F values (HealthMonitoring-024), and 8 of the 12
emitted Site Health metrics are persisted every minute per site but never read by
any chart while the design doc claims all 12 are "rendered in the dashboard's
per-site KpiTrendChart panel" (HealthMonitoring-025). The KPI-recorder
scope/captive-dependency reasoning is sound (recorder opens a per-pass scope and
the scoped source over a singleton aggregator is the normal lifetime relationship),
the per-source fault isolation is correct, and the `AddIntervalCounters` /
`SetSiteEventLogWriteFailures` default-interface no-ops are an appropriate
test-fake-compatibility seam. No concurrency, security, or sequence-arithmetic
regressions were found.
_Re-review (2026-06-20, `4307c381`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | Re-verified HealthMonitoring-017/018/020 fixes in place and correct. `SiteHealthKpiSampleSource.CollectAsync` arithmetic (connectionsUp/Down split, summed `sfBufferDepth`, null-`SiteAuditBacklog`→0) is correct. No new logic bug. |
| 2 | Akka.NET conventions | x | Module still contains no actors. `IHealthReportTransport.Send` is fire-and-forget (Tell semantics). KPI source is a plain `IKpiSampleSource` consumed by the KpiHistory recorder singleton (out of scope). No issues found. |
| 3 | Concurrency & thread safety | x | `SiteHealthCollector` counters use `Interlocked`; `AddIntervalCounters` restore is per-field `Interlocked.Add` and correctly sums with concurrent increments against the zero left by `CollectReport`'s `Exchange`. Aggregator CAS pattern unchanged. `SiteEventLogFailureCountReporter` uses a linked CTS + `Task.Run` loop with catch-all probe isolation — sound. No issues found. |
| 4 | Error handling & resilience | x | Counter-restore-on-failure (HM-017/018) confirmed. KPI source is exception-isolated by the recorder per-source. `SiteEventLogFailureCountReporter.SafeProbe` logs and continues. No issues found. |
| 5 | Security | x | No issues found. Numeric/string operational metrics only; no secrets, auth surface, or untrusted-input parsing. |
| 6 | Performance & resource management | x | `PeriodicTimer`/CTS disposed correctly. New: `SiteHealthKpiSampleSource` writes 12 Site-scoped rows per site per minute, of which 8 are never charted (HealthMonitoring-025) — bounded but wasteful; also emits a full row-set for the synthetic `$central` (HealthMonitoring-024). |
| 7 | Design-document adherence | x | Design doc line 119 claims all 12 emitted Site Health metrics are "rendered in the dashboard's per-site KpiTrendChart panel" — only 4 (`connectionsDown`/`scriptErrors`/`sfBufferDepth`/`deadLetters`) are actually charted (HealthMonitoring-025). Synthetic `$central` is sampled as a real Site scope (HealthMonitoring-024). |
| 8 | Code organization & conventions | x | KPI source's metric-name catalog is split: 4 charted metrics share `KpiMetrics.SiteHealth` (Commons), 8 are private literals — noted under HealthMonitoring-025. Options/validator ownership and idempotent registration correct. `AddCentralHealthAggregation` registers the scoped KPI source via `TryAddEnumerable` — correct. |
| 9 | Testing coverage | x | `SiteHealthKpiSampleSourceTests` covers the populated/heartbeat-only/null-backlog/empty paths and exact (metric,value) tuples well. No test asserts `$central` is excluded (it currently is not — HealthMonitoring-024). Otherwise coverage is strong (73 tests). |
| 10 | Documentation & comments | x | XML docs accurate. The only doc drift is design-doc line 119's "all 12 rendered" claim (HealthMonitoring-025). |
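The snapshot-then-restore counter behaviour verified in rows 1, 3 and 4 can be sketched as follows. This is an illustration under assumptions, not the module's code: `IntervalCounter` and its member names are hypothetical, and a single counter stands in for the collector's several fields.

```csharp
using System.Threading;

// Hypothetical single-field version of an interval counter.
public sealed class IntervalCounter
{
    private long _value;

    // Called concurrently by producers.
    public void Increment() => Interlocked.Increment(ref _value);

    // Snapshot: atomically read the current count and leave zero behind.
    public long Collect() => Interlocked.Exchange(ref _value, 0);

    // Restore after a failed send: add the snapshot back, so it sums with any
    // increments that arrived after Collect() reset the field to zero.
    public void Restore(long snapshot) => Interlocked.Add(ref _value, snapshot);
}
```

Because the restore is an add rather than a set, increments that land between the snapshot and the failed send are preserved instead of being overwritten.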
## Checklist coverage ## Checklist coverage
| # | Category | Examined | Notes | | # | Category | Examined | Notes |
@@ -1153,3 +1192,110 @@ a non-bug.
Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled` Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
(or similar) and update the test body's intent — purely a documentation / (or similar) and update the test body's intent — purely a documentation /
maintainability fix; no behaviour change. maintainability fix; no behaviour change.
### HealthMonitoring-024 — `SiteHealthKpiSampleSource` samples the synthetic `$central` self-report as a real `KpiScopes.Site` series
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:69-86`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:115-120` |
**Description**
`SiteHealthKpiSampleSource.CollectAsync` iterates
`ICentralHealthAggregator.GetAllSiteStates()` and emits the 12-metric catalog
for **every** entry whose `LatestReport` is non-null. The aggregator keyspace
includes the synthetic central self-report under `CentralHealthReportLoop.CentralSiteId`
(`"$central"`), which `CentralHealthReportLoop` feeds via `ProcessReport` every
30s — so `$central` always has a non-null `LatestReport` once the loop has run.
The source therefore writes 12 `KpiSample` rows per minute with
`Scope = KpiScopes.Site`, `ScopeKey = "$central"` for the central cluster.
The class XML doc, the design doc (`Component-HealthMonitoring.md:119`,
"emits `IKpiSampleSource` (`SiteHealthKpiSampleSource`, per-Site)"), and the
`KpiScopes.Site` scope all describe this as a **per-real-site** series.
`$central` is not a site — its `DataConnectionStatuses`, `StoreAndForwardBufferDepths`,
`ParkedMessageCount`, and instance counts are structurally empty/zero
(the central node runs no DCL, no S&F engine, no instances), so the persisted
`connectionsUp`/`connectionsDown`/`sfBufferDepth`/`parkedMessages`/`deployedInstances`
trend rows for `$central` are meaningless constant-zero noise. The Central UI
trend selector (`Health.razor:570-572`) deliberately pins `$central` first and
queries `KpiScopes.Site`/`scopeKey="$central"`, so the dashboard will plot these
flat-zero "Central Cluster" trends.
This is consistent (the UI renders central as a card by design) so it is not a
behaviour bug — but it conflates a synthetic non-site with real-site KPI history,
permanently stores meaningless zero series, and contradicts the "per-Site"
contract. None of the other M6 sample sources sample a synthetic scope key.
**Recommendation**
Decide whether central-cluster trends belong in the per-Site KPI store. If not,
skip the `CentralHealthReportLoop.CentralSiteId` entry in `CollectAsync` (a single
`if (siteId == CentralHealthReportLoop.CentralSiteId) continue;` guard) and remove
the `$central` option from the trend selector. If central trends are intended,
give them a distinct scope (e.g. a `KpiScopes.Central` scope or a central-specific
metric subset) so the data is not labelled as a real site, and update the design
doc + class XML doc to say so. Either way add a test asserting the chosen
behaviour for `$central`.
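For the "skip `$central`" branch of this recommendation, a minimal sketch of the guard, assuming a simplified `SiteState` shape. `SiteSampleFilter`, `SitesToSample`, and `HasLatestReport` are hypothetical names; the constant is assumed to mirror `CentralHealthReportLoop.CentralSiteId`.

```csharp
using System.Collections.Generic;

// Hypothetical, simplified view of an aggregator entry.
public sealed record SiteState(string SiteId, bool HasLatestReport);

public static class SiteSampleFilter
{
    // Assumed to match CentralHealthReportLoop.CentralSiteId.
    public const string CentralSiteId = "$central";

    // Yields only real sites that have reported at least once.
    public static IEnumerable<string> SitesToSample(IEnumerable<SiteState> states)
    {
        foreach (var state in states)
        {
            if (!state.HasLatestReport) continue;           // nothing to sample yet
            if (state.SiteId == CentralSiteId) continue;     // synthetic self-report, not a site
            yield return state.SiteId;
        }
    }
}
```

A test for the chosen behaviour would feed an entry keyed `$central` through the source and assert whether it is sampled or skipped.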
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the design doc now documents that the synthetic `$central` self-report is intentionally sampled into the per-site KPI store (the Central UI deliberately pins `$central` first in the trend selector as 'Central Cluster'), and that its zero connection/instance/S&F values reflect the self-report carrying no site-runtime data — not a defect. Doc-only.
### HealthMonitoring-025 — 8 of the 12 emitted Site Health KPI metrics are never charted; design doc claims all 12 are rendered
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:40-51`, `docs/requirements/Component-HealthMonitoring.md:119` |
**Description**
`SiteHealthKpiSampleSource` emits 12 metrics per site per sampling pass:
`connectionsUp`, `connectionsDown`, `scriptErrors`, `alarmEvalErrors`,
`sfBufferDepth`, `deadLetters`, `parkedMessages`, `deployedInstances`,
`enabledInstances`, `disabledInstances`, `auditBacklogPending`,
`eventLogWriteFailures`. Only **four** are ever read back: the Central UI
per-site trend panel calls `LoadTrendSeriesAsync` for exactly
`KpiMetrics.SiteHealth.ConnectionsDown`, `.ScriptErrors`, `.SfBufferDepth`, and
`.DeadLetters` (`Health.razor:596-603`) — and only those four are in the public
`KpiMetrics.SiteHealth` catalog (`Commons/Types/Kpi/KpiMetrics.cs:84-97`). The
other 8 metric names are private string literals in the source and have no
reader anywhere in the codebase.
Two consequences:
1. **Stale design doc.** `Component-HealthMonitoring.md:119` lists all 12 metric
names and states they are "rendered in the dashboard's per-site
`KpiTrendChart` panel." Eight of them are never rendered. A reader / future
maintainer is misled about what the trend panel shows.
2. **Persisted-but-dead samples.** The `KpiHistoryRecorderActor` writes all 12
rows to the central `KpiSample` table every minute for every reporting site;
8 are pure write-amplification that nothing queries and that the 90-day
purge eventually drops. Bounded, but wasteful — roughly two-thirds of the
Site Health KPI rows are never read.
A related maintainability smell: the split catalog (4 metrics shared via
`KpiMetrics.SiteHealth`, 8 private literals) means a future chart added for, say,
`alarmEvalErrors` would have to re-type the literal and risk a silent typo
mismatch against the source.
**Recommendation**
Reconcile intent in one of two directions: (a) if only the four charted metrics
are wanted, stop emitting the other eight (and correct the design doc's list to
the four actually rendered); or (b) if the full 12-metric history is intended for
future charts / ad-hoc query, promote all 12 names into the public
`KpiMetrics.SiteHealth` catalog so source and any future UI binding key off one
symbol, and either add the missing charts or soften the doc's "rendered … panel"
wording to "recorded for trend history." Add the design-doc fix in the same
session per the repo's doc-and-code-travel-together rule.
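
As an illustration of option (b), the sketch below keeps the twelve names from this finding in one place and flags any emitter or chart binding that uses a name the catalog does not know. It is a minimal sketch, not the project's code: `siteHealthCatalog` and `NamesMissingFromCatalog` are hypothetical, and the real catalog would more likely be constants on `KpiMetrics.SiteHealth`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The twelve metric names exactly as listed in this finding. A set with an
// ordinal comparer keeps lookups case-sensitive, like the string literals today.
var siteHealthCatalog = new HashSet<string>(StringComparer.Ordinal)
{
    "connectionsUp", "connectionsDown", "scriptErrors", "alarmEvalErrors",
    "sfBufferDepth", "deadLetters", "parkedMessages", "deployedInstances",
    "enabledInstances", "disabledInstances", "auditBacklogPending",
    "eventLogWriteFailures",
};

// Hypothetical drift check: returns every name an emitter or chart uses
// that is absent from the shared catalog (for example, a typo).
List<string> NamesMissingFromCatalog(IEnumerable<string> usedNames) =>
    usedNames.Where(name => !siteHealthCatalog.Contains(name)).ToList();
```

A unit test that feeds the emitter's names and the UI's names through such a check would turn the silent typo mismatch described above into a test failure.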
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): corrected the design doc — of the 12 sampled Site Health KPI metrics only four are charted (connectionsDown, scriptErrors, sfBufferDepth, deadLetters); the other eight are now documented as sampled-but-not-charted (retained for future surfaces / ad-hoc query). Doc-only.
+204 -2
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.Host` | | Module | `src/ZB.MOM.WW.ScadaBridge.Host` |
| Design doc | `docs/requirements/Component-Host.md` | | Design doc | `docs/requirements/Component-Host.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 0 | | Open findings | 0 |
## Summary ## Summary
@@ -80,6 +80,33 @@ Serilog is the only logger provider and the section is dead config (Host-021);
and `ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g. and `ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g.
a typo) and falls back to `Information` with no warning (Host-022). a typo) and falls back to `Information` with no warning (Host-022).
#### Re-review 2026-06-20 (commit `4307c381`) — full review
All twenty-two prior findings (Host-001..022) remain `Resolved` in the current tree
(Host-008 was deliberately *reverted* in M2.9 — see Host-024). This is the first
review since the `ScadaBridge` rename, so a `git diff 1eb6e97 HEAD` shows the whole
directory as additions and is not a useful delta; the review walked the full current
state plus the per-file commit history. Major changes since `1eb6e97`: REQ-HOST-4a was
re-architected (the leader-only `active-node` check left `/health/ready`; a new
leadership-agnostic `RequiredSingletonsHealthCheck` probes the five required central
singletons via `Identify` — clean), shared `ZB.MOM.WW.Health`/`Telemetry` adoption
(`/healthz`, `/metrics`, OTel), a site HTTP/1.1 `MetricsPort` (8084) listener, the
KpiHistory + Transport central registrations, the disable-login dev flag, and the
library-backed inbound API-key auth. Four new findings, none crash/data-loss class.
Host-023 (Medium) is a regression-of-resolution: the shipped `appsettings.Site.json`
second seed-node now targets `localhost:8084` — the new HTTP/1.1 MetricsPort, not an
Akka remoting endpoint — the same defect class as the resolved Host-004 (which chose
8084 as a "remoting port" before MetricsPort claimed it); `StartupValidator` rejects a
seed on the GrpcPort but not on the MetricsPort. Host-024 (Medium) flags the M2.9
reversal of Host-008: `MachineDataDb` is again *required* for Central yet is consumed
by no DbContext/repository AND is absent from the shipped `appsettings.Central.json`,
so a developer running the shipped Central binary fails startup for a value nothing
uses. Host-025 (Low): the Host directly consumes KpiHistory (`AddKpiHistory`,
`KpiHistoryRecorderActor`, `KpiHistoryOptions`) but has no direct `ProjectReference`;
it relies on a transitive pull through CentralUI. Host-026 (Low): the Component
Registration Matrix in `Component-Host.md` is stale — KpiHistory and Transport are
registered central-only in `Program.cs` but absent from the matrix.
## Checklist coverage ## Checklist coverage
| # | Category | Examined | Notes | | # | Category | Examined | Notes |
@@ -110,6 +137,21 @@ _Re-review (2026-05-28, `1eb6e97`):_
| 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the unused `CancellationToken` parameter (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). | | 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the unused `CancellationToken` parameter (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). |
| 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaBridge:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). | | 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaBridge:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). |
_Re-review (2026-06-20, `4307c381`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Re-review: shipped `appsettings.Site.json` second seed-node targets MetricsPort 8084 (HTTP/1.1 Kestrel listener), not an Akka remoting endpoint — regression of the resolved Host-004 fix (Host-023). |
| 2 | Akka.NET conventions | ☑ | All five central singletons + the four site singletons reviewed: ClusterSingletonManager/Proxy pairing, role scoping (site singletons `WithRole(site-*)`, central on base "Central"), `PhaseClusterLeave` graceful-stop drain tasks, receptionist registrations, `RequiredSingletonsHealthCheck` `Identify` probe — all sound. No new issues. |
| 3 | Concurrency & thread safety | ☑ | `GetOrCreateActorSystem` double-checked lock (HOST-021) is correct; `_trackedDisposables` snapshot-under-lock; `RequiredSingletonsHealthCheck` probes concurrently and never throws. No new issues. |
| 4 | Error handling & resilience | ☑ | `StartupRetry` now threads `ApplicationStopping` + transient-only classifier (Host-019/015 resolved); migration retry sound. No new issues. |
| 5 | Security | ☑ | Shipped configs keep `${...}` secret placeholders; pepper fail-fast validated (≥16 chars). Note: docker per-node `appsettings.Central.json` still carry plaintext SQL passwords / JWT key — out of this module's edit scope (Host-003 follow-up), not re-filed. |
| 6 | Performance & resource management | ☑ | ActorSystem singleton-not-transient avoids per-probe disposal; gRPC stream cancel-on-shutdown wired. No new issues. |
| 7 | Design-document adherence | ☑ | Re-review: Component Registration Matrix omits KpiHistory + Transport (both central-only in Program.cs) (Host-026); `MachineDataDb` required-but-unconsumed and absent from shipped Central config (Host-024). |
| 8 | Code organization & conventions | ☑ | Re-review: Host directly consumes KpiHistory types but has no direct `ProjectReference` (transitive via CentralUI) (Host-025). |
| 9 | Testing coverage | ☑ | Strong suite (RequiredSingletons, Hocon, Serilog, StartupRetry covered). Gaps: no test asserts a seed on the MetricsPort is rejected (Host-023 — the existing 8084-seed test only avoids the grpc rule), and no shipped-config test catches the missing `MachineDataDb` key (Host-024). |
| 10 | Documentation & comments | ☑ | Re-review: `Component-Host.md` matrix stale vs `Program.cs` (Host-026). Inline XML/comments otherwise accurate and unusually thorough. |
## Findings ## Findings
### Host-001 — `/health/ready` includes the leader-only `active-node` check ### Host-001 — `/health/ready` includes the leader-only `active-node` check
@@ -1118,3 +1160,163 @@ behaviour you choose.
**Resolution** **Resolution**
_Open._ _Open._
### Host-023 — Shipped Site `SeedNodes` second entry targets the MetricsPort, not a remoting port
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:13-17`, `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:117-127` |
**Description**
The shipped site config sets `Node:RemotingPort = 8082`, `Node:GrpcPort = 8083`,
`Node:MetricsPort = 8084`, but `Cluster:SeedNodes` is
`["akka.tcp://scadabridge@localhost:8082", "akka.tcp://scadabridge@localhost:8084"]`.
The second seed targets `8084` — which `Program.cs:419-422` binds as the **HTTP/1.1
Kestrel `/metrics` listener** (`HttpProtocols.Http1AndHttp2`), not an Akka.Remote
endpoint. A node joining via that seed attempts an Akka.Remote TCP association against
the Prometheus scrape listener and fails. This is the same defect class as the
**resolved Host-004** (where the second seed targeted the gRPC port 8083): Host-004's
fix corrected the seed to `8084` and called it a "remoting port", but the later
introduction of `MetricsPort` (commits `bbc9f092`/`c41cb41c`) claimed `8084`, silently
re-breaking the example. `StartupValidator` (`StartupValidator.cs:121-127`) was
extended by Host-004 to reject a seed whose port equals `GrpcPort`, but it does **not**
reject a seed whose port equals `MetricsPort`, so this misconfiguration passes
validation silently. For the single-node dev loopback the first seed (`8082`) succeeds
and the bug is masked, but it is an incorrect example that copies into multi-node
configs (the docker site configs correctly use distinct hostnames both on `8082`).
**Recommendation**
The loopback dev site cannot host two distinct remoting endpoints, so drop the
second seed (mirror Host-016's single-entry `CentralContactPoints` template) or use
distinct hostname placeholders, both on `8082`, with a comment that multi-node sites
list each node's *remoting* port. Extend the Site block of `StartupValidator` to also
reject any seed whose port equals the resolved `MetricsPort` (and ideally any port
the node binds for non-remoting use). Add a `StartupValidatorTests` case
`Site_SeedNodeOnMetricsPort_FailsValidation` mirroring
`Site_SeedNodeOnGrpcPort_FailsValidation`.
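
The extended validator rule could look roughly like the sketch below. It assumes seed nodes are plain `akka.tcp://system@host:port` strings that `System.Uri` can parse; `FindSeedsOnNonRemotingPorts` is a hypothetical helper, not an existing `StartupValidator` member.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns the seed-node URIs whose port collides with a
// port this node binds for non-remoting traffic (gRPC or metrics).
List<string> FindSeedsOnNonRemotingPorts(
    IEnumerable<string> seedNodes, int grpcPort, int metricsPort)
{
    var reserved = new HashSet<int> { grpcPort, metricsPort };
    return seedNodes
        .Where(seed => Uri.TryCreate(seed, UriKind.Absolute, out var uri)
                       && reserved.Contains(uri.Port))
        .ToList();
}
```

Run against the shipped values in this finding (`GrpcPort = 8083`, `MetricsPort = 8084`), the second seed would be reported, which is the case a `Site_SeedNodeOnMetricsPort_FailsValidation` test would pin.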
**Resolution**
Deferred 2026-06-20: the shipped `appsettings.Site.json` second seed-node points at the MetricsPort and the validator lacks a `seedPort != metricsPort` rule; fixing touches shipped dev config and a single-node-loopback assumption. Recorded for a follow-up (no production node ships this dev config).
### Host-024 — `MachineDataDb` re-required for Central but consumed by nothing and absent from the shipped config
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:63-65`, `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json:22-24` |
**Description**
Commit `76198b36` ("reverts Host-008") re-added a fail-fast `.Require` for
`ScadaBridge:Database:MachineDataDb` to the Central `.When` block, citing
REQ-HOST-3/REQ-HOST-4 and that the docker per-node configs carry the key. Two problems
remain after the reversal:
1. **Still consumed by nothing.** A repo-wide search shows `MachineDataDb` is read only
in `StartupValidator.cs` and declared on `DatabaseOptions.cs:11` — no DbContext, no
repository, no component reads it. Only `ConfigurationDb` is wired into
`AddConfigurationDatabase` (`Program.cs:200-202`). The original Host-008 rationale
("dead configuration that fails startup for a value nothing uses") is therefore
still entirely true; the reversal re-instated the fail-fast without adding a
consumer.
2. **Absent from the shipped Central config.** `appsettings.Central.json` (the template
a developer running the binary directly uses) contains `Database:ConfigurationDb`
(a `${...}` placeholder) but **no `MachineDataDb` key at all**. So a Central node
started from the shipped config fails `StartupValidator` immediately with
"ScadaBridge:Database:MachineDataDb connection string required for Central" — for a
value nothing consumes. The docker configs happen to set it; the shipped default
does not. (This is *not* a re-file of Host-008, which was Resolved-then-reverted;
the new, distinct defect is the validation breaking the shipped dev config.)
This is also a code-vs-doc inconsistency: the design intent (validate a value the
system needs) is unmet because the value is needed by nothing.
**Recommendation**
Pick one and make code, shipped config, and doc agree: either (a) wire a machine-data
store so the value is actually consumed and add a `MachineDataDb` placeholder to the
shipped `appsettings.Central.json` (matching the `_secrets` env-var pattern), or (b)
remove the `.Require`, the `DatabaseOptions.MachineDataDb` property, and the docker
keys again. Add a config-shape regression test asserting the shipped
`appsettings.Central.json` satisfies `StartupValidator` for the Central role (the
absence of such a test is why this slipped through).
**Resolution**
Deferred 2026-06-20: `MachineDataDb` is re-required for Central but consumed by nothing and absent from the shipped Central config (the shipped Central binary would fail `StartupValidator`). The requirement reverses the earlier Host-008 resolution and needs a product decision: either a machine-data store is actually coming (add a config placeholder) or the requirement should be removed again. Recorded for that decision.
### Host-025 — Host directly consumes KpiHistory but has no direct project reference
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:42-63`, `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:118`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:717-757` |
**Description**
`Program.cs` calls `builder.Services.AddKpiHistory(builder.Configuration)` and
`AkkaHostedService.RegisterCentralActors` constructs the `KpiHistoryRecorderActor`
cluster singleton from `KpiHistoryOptions` / `KpiHistoryRecorderActor`
(`ZB.MOM.WW.ScadaBridge.KpiHistory`), yet the Host `.csproj` carries **no
`ProjectReference` to `ZB.MOM.WW.ScadaBridge.KpiHistory`**. It compiles only because
CentralUI (`ZB.MOM.WW.ScadaBridge.CentralUI.csproj`) transitively re-exports it. The
Host is the composition root for every component and the project deliberately declares
direct dependencies elsewhere — the csproj even comments that `ZB.MOM.WW.Theme` is
declared directly "rather than leaning on CentralUI re-exporting it transitively"
(lines 37-39). KpiHistory is the same situation but was not given the same treatment.
If CentralUI ever drops or restructures its KpiHistory reference, the Host fails to
compile for a dependency it directly uses — a latent build fragility.
**Recommendation**
Add `<ProjectReference Include="../ZB.MOM.WW.ScadaBridge.KpiHistory/ZB.MOM.WW.ScadaBridge.KpiHistory.csproj" />`
to the Host csproj's `ProjectReference` group, mirroring the explicit Theme reference
rationale already documented there.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added an explicit `<ProjectReference>` to the KpiHistory project in the Host csproj, so Host's direct use of KpiHistory types no longer relies on the transitive pull through CentralUI.
### Host-026 — Component Registration Matrix omits KpiHistory and Transport
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `docs/requirements/Component-Host.md` (Component Registration Matrix, lines 177-198), `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:92,118` |
**Description**
`Program.cs` registers two central-only components that have shipped since the matrix
was last updated: **Transport** (`builder.Services.AddTransport()`, line 92, #24) and
**KpiHistory** (`builder.Services.AddKpiHistory(...)`, line 118, plus the
`kpi-history-recorder` cluster singleton in `AkkaHostedService`, #26). Neither appears
in the Component Registration Matrix in `Component-Host.md` (lines 177-198), which is
the documented source of which components register on which role. The matrix is
described in CLAUDE.md as something that "must stay in sync with actual component
documents"; it is now stale against the actual composition root. A maintainer auditing
role-based registration against the doc would miss both components (and miss that
KpiHistory's recorder is deliberately *not* readiness-gated — worth a matrix note).
**Recommendation**
Add `Transport` (Central: Yes / Site: No / DI: Yes / Actors: No / Endpoints: No) and
`KpiHistory` (Central: Yes / Site: No / DI: Yes / Actors: Yes / Endpoints: No) rows to
the matrix, and note KpiHistory's recorder singleton is intentionally absent from
`RequiredSingletonsHealthCheck` (not readiness-gated). Re-check the Dependencies list
("All 19 component libraries") against the actual csproj reference count while editing.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added `KpiHistory` and `Transport` rows to the Component Registration Matrix in Component-Host.md (both Central=Yes/Site=No/DI=Yes), reflecting their actual central-only registration.
+229 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.InboundAPI` | | Module | `src/ZB.MOM.WW.ScadaBridge.InboundAPI` |
| Design doc | `docs/requirements/Component-InboundAPI.md` | | Design doc | `docs/requirements/Component-InboundAPI.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 0 | | Open findings | 4 |
## Summary ## Summary
@@ -124,6 +124,36 @@ configuration database, but the invariant is undocumented.)
| 9 | Testing coverage | ☑ | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test (InboundAPI-023); middleware/filter/validator/executor/route are individually covered. | | 9 | Testing coverage | ☑ | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test (InboundAPI-023); middleware/filter/validator/executor/route are individually covered. |
| 10 | Documentation & comments | ☑ | No new issues. | | 10 | Documentation & comments | ☑ | No new issues. |
#### Re-review 2026-06-20 (commit `4307c381`) — full review
All 25 prior findings remain `Resolved` (InboundAPI-022's `IActiveNodeGate` Host
registration is now present at `Program.cs:260`; InboundAPI-025's POST-only audit
`UseWhen` predicate is live at `Program.cs:337-344`). The module changed materially:
the whole project was renamed (ScadaLink → ZB.MOM.WW.ScadaBridge), inbound auth was
re-architected to the shared `ZB.MOM.WW.Auth.ApiKeys` Bearer/`sbk_` verifier (the SQL
`ApiKey` store, `ApiKeyValidator`, and peppered `IApiKeyHasher` were retired),
`ForbiddenApiChecker` became a thin shim over Script Analysis #25's `ScriptTrustValidator`,
and JSON-Schema `$ref` runtime resolution (`SchemaRefResolver`) plus recursive
Object/List validation+materialization landed. The dominant new finding is that a
**read-only `InboundDatabaseHelper`** was added to `InboundScriptContext.Database`,
reintroducing the very raw-DB-access capability the design doc still explicitly
forbids (regressing the InboundAPI-007 closure), with no read-only enforcement and a
sync-over-async, token-ignoring data path. **4 new findings**: 1 High, 1 Medium,
2 Low — no Critical.
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Recursive Materialize/Validate look sound; `WaitForAttribute` deadline-vs-wait-timeout tension (InboundAPI-029). |
| 2 | Akka.NET conventions | ☑ | ASP.NET-hosted; routes via `IInstanceRouter` → `CommunicationService`. Correlation/parent ids stamped on all four `Route.To()` verbs. No issues. |
| 3 | Concurrency & thread safety | ☑ | `ConcurrentDictionary` caches; per-request DI scope created from root in singleton executor (valid). New DB helper blocks a pool thread (InboundAPI-027). |
| 4 | Error handling & resilience | ☑ | DB helper's `ExecuteScalar`/`ExecuteReader` are synchronous and ignore `_ct` — a slow query is unbounded by the method timeout (InboundAPI-027). |
| 5 | Security | ☑ | `InboundDatabaseHelper` runs arbitrary script-supplied SQL (not read-only despite the name) on central with host privileges (InboundAPI-026). Bearer verifier delegates constant-time compare to shared lib. Forbidden-API check delegates to #25. |
| 6 | Performance & resource management | ☑ | Capped capture/known-bad caches; `ArrayPool` rent. Sync-over-async DB calls (InboundAPI-027). |
| 7 | Design-document adherence | ☑ | `Database` helper contradicts the doc's "No direct database access" decision (InboundAPI-026, regresses InboundAPI-007); `WaitForAttribute` timeout binding diverges from spec §6 line 192 (InboundAPI-029). |
| 8 | Code organization & conventions | ☑ | `InboundDatabaseHelper` is a component-local helper (fine); schema types in Commons. No issues. |
| 9 | Testing coverage | ☑ | `InboundDatabaseHelper` tests cover SELECT happy-path only — no write/DDL-rejection test (because none is enforced); `WaitForAttribute`/`Database` not end-to-end via the endpoint (InboundAPI-028). |
| 10 | Documentation & comments | ☑ | Helper XML doc claims "read-only" which is unenforced (folded into InboundAPI-026). |
## Findings ## Findings
### InboundAPI-001 — Singleton script handler cache mutated without synchronization ### InboundAPI-001 — Singleton script handler cache mutated without synchronization
@@ -1299,3 +1329,199 @@ route, not on the `/api/` prefix. Options:
The endpoint-filter form is the recommended fix — it co-locates the audit-emission The endpoint-filter form is the recommended fix — it co-locates the audit-emission
scope with the route definition and matches how InboundAPI-006/008 gating is scope with the route definition and matches how InboundAPI-006/008 gating is
already wired. already wired.
### InboundAPI-026 — `InboundDatabaseHelper` gives inbound scripts arbitrary raw SQL (not read-only); contradicts the design doc's "No direct database access" decision and regresses InboundAPI-007
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundDatabaseHelper.cs:13-70`, `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundScriptExecutor.cs:267-281`, `:387`; design doc `docs/requirements/Component-InboundAPI.md:202-215` |
**Description**
A new `InboundDatabaseHelper` is exposed to every inbound API script as
`InboundScriptContext.Database` (`InboundScriptExecutor.cs:387`), backed by the
production `IDatabaseGateway` (`AddExternalSystemGateway()` in the central-role branch
of `Program.cs:81`). Its `QuerySingle<T>`/`Query` methods take a script-supplied `sql`
string and run it verbatim — `cmd.CommandText = sql; cmd.ExecuteScalar()/ExecuteReader()`
(`InboundDatabaseHelper.cs:27-29`, `:41-44`) — against a raw ADO.NET `DbConnection`
obtained from `IDatabaseGateway.GetConnectionAsync` (`IDatabaseGateway.cs:20`, the same
gateway the External System Gateway uses for SCADA machine-data DBs). Two problems:
1. **It directly contradicts the design doc.** `Component-InboundAPI.md:202-215` — which
is the *resolution* of finding InboundAPI-007 — states verbatim: *"**No direct
database access.** Inbound API scripts are not given a raw database client. Handing a
script a raw `SqlConnection` is in direct tension with the ScadaBridge script trust
model … Scripts interact with the system only through the curated `Route` and
`Parameters` surfaces … If a method needs data from the configuration or machine-data
databases, that access belongs behind a dedicated, scoped helper — not a
general-purpose connection — and would be added here as an explicit design change."*
The code now ships exactly the capability the doc forbids, and the doc was **not**
updated. This is a regression of the closed InboundAPI-007 (whose chosen resolution
was to remove the DB API and document its absence). Inbound API scripts are authored
by the less-trusted-than-Admin Design role and execute on the central node with the
host process's privileges, so this materially widens the trust boundary the design
deliberately drew.
2. **"Read-only" is unenforced — it is a comment, not a constraint.** The class summary
says *"Read-only database access"* but nothing restricts the SQL: a script may run
`Database.Query("conn", "DELETE FROM Machine")`, `UPDATE`, `INSERT`, `DROP`, or any
DDL/DML, and `Query`/`QuerySingle` will execute it (`ExecuteReader`/`ExecuteScalar`
happily run non-SELECT statements). The gateway opens a normal read-write connection
(no `ApplicationIntent=ReadOnly`, no statement allow-listing). The "read-only" claim
in the XML doc and the call-site comment (`InboundScriptExecutor.cs:240`,
"read-only Database helper") is therefore false and misleading.
Parameter *values* are correctly bound (`AddParameters`, `:59-68`) so value-injection is
not the issue — the issue is that arbitrary statement text is script-controlled and the
"read-only" containment does not exist.
**Recommendation**
Reconcile code and design, and make the safety claim real. Either (a) if scoped DB
access is now intended, update `Component-InboundAPI.md:202-215` to authorize it as the
explicit design change the prior text demanded, AND enforce read-only — restrict to
`SELECT` (reject any statement whose first significant token is not `SELECT`/`WITH`, or
better, route through a read-intent connection / a curated query API that cannot mutate),
fix the XML summary to describe what is actually enforced, and gate which connection
names a script may reach; or (b) if direct DB access is still out of scope, remove
`InboundDatabaseHelper` and the `Database` member (reverting to the InboundAPI-007
posture). Add a regression test asserting a non-SELECT statement is rejected once
read-only is enforced (see InboundAPI-028).
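
For reference, the first-significant-token check mentioned in option (a) might be sketched as follows. `StartsWithSelectOrWith` is a hypothetical name, and this is deliberately the weak form of the guard: it does not stop stacked statements such as `SELECT 1; DELETE ...` or a data-modifying statement behind a `WITH`, which is why a read-intent connection or a least-privilege login remains the stronger control.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical guard: skips leading whitespace and SQL comments, then accepts
// only statements whose first keyword is SELECT or WITH (case-insensitive).
bool StartsWithSelectOrWith(string sql)
{
    // Strip any run of whitespace, "-- line" comments and "/* block */" comments
    // at the very start of the text.
    var stripped = Regex.Replace(
        sql, @"\A(\s+|--[^\n]*(\n|\z)|/\*.*?\*/)+", "", RegexOptions.Singleline);

    var keyword = Regex.Match(stripped, @"\A[A-Za-z]+");
    return keyword.Success &&
           (keyword.Value.Equals("SELECT", StringComparison.OrdinalIgnoreCase) ||
            keyword.Value.Equals("WITH", StringComparison.OrdinalIgnoreCase));
}
```

A rejection test for InboundAPI-028 could assert that `DELETE FROM Machine` fails this check while a plain `SELECT` passes.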
**Resolution**
_Unresolved._
### InboundAPI-027 — `InboundDatabaseHelper` is sync-over-async and ignores the method-deadline token — thread-pool starvation and an unbounded slow query
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundDatabaseHelper.cs:25-33`, `:40-44` |
**Description**
`QuerySingle`/`Query` execute on the ASP.NET request thread (the script runs inside
`CSharpScript.RunAsync` on a thread-pool thread). Both methods do
`_gateway.GetConnectionAsync(connectionName, _ct).GetAwaiter().GetResult()`
(`:25`, `:40`) — a synchronous block on an async DB-open — and then call the **synchronous**
`cmd.ExecuteScalar()` (`:29`) / `cmd.ExecuteReader()` (`:44`). Two consequences:
1. **Thread-pool starvation.** The inbound API has no rate limiting (a deliberate design
choice). Each DB-backed method invocation blocks a pool thread for the full duration
of the DB round-trip via `.GetAwaiter().GetResult()`. Under concurrent load this
consumes pool threads that cannot be released until the DB responds — exactly the
pattern ASP.NET guidance warns against (it won't deadlock without a sync context, but
it does starve the pool and degrade the whole central node).
2. **The slow query is not bounded by the method timeout.** The cancellation token
`_ct` (the method-deadline `cts.Token` passed at `InboundScriptExecutor.cs:268`) is
forwarded to `GetConnectionAsync` but **not** to the synchronous `ExecuteScalar()` /
`ExecuteReader()` — those overloads ignore cancellation entirely. So once a connection
is open, a hung or slow `SELECT` runs to completion (or the provider's own command
timeout) regardless of the method timeout, holding the blocked thread the whole time.
This contradicts the InboundAPI-016 guarantee that the method timeout bounds the
method's work.
**Recommendation**
Use the async ADO.NET path end-to-end and honour the token: `await
_gateway.GetConnectionAsync(...)`, then `await cmd.ExecuteReaderAsync(_ct)` /
`await cmd.ExecuteScalarAsync(_ct)`, and make `QuerySingle`/`Query` async (`Task<T?>` /
`Task<IReadOnlyList<...>>`) so the script `await`s them rather than blocking a pool
thread. Set a `CommandTimeout` derived from the remaining method deadline as a backstop.
If a synchronous script-facing signature must be kept, at minimum register a token
callback that calls `cmd.Cancel()` and set a command timeout so a slow query cannot run unbounded.
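
A minimal sketch of the async path, assuming the caller already holds an open `DbConnection` from the gateway and knows how much of the method deadline remains. `QuerySingleAsync` and `CommandTimeoutSeconds` are hypothetical names; only the `System.Data.Common` calls are real APIs.

```csharp
using System;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;

// CommandTimeout is expressed in whole seconds, and providers such as SqlClient
// treat 0 as "no limit", so round up and clamp to at least 1 second.
int CommandTimeoutSeconds(TimeSpan remaining) =>
    Math.Max(1, (int)Math.Ceiling(remaining.TotalSeconds));

// Hypothetical async counterpart of QuerySingle: no blocking wait, and the
// method-deadline token is passed to the command itself.
async Task<object?> QuerySingleAsync(
    DbConnection connection, string sql, TimeSpan remaining, CancellationToken ct)
{
    await using var cmd = connection.CreateCommand();
    cmd.CommandText = sql;
    cmd.CommandTimeout = CommandTimeoutSeconds(remaining); // backstop
    var value = await cmd.ExecuteScalarAsync(ct);          // honours the token
    return value is DBNull ? null : value;
}
```

How promptly a running query stops on cancellation still depends on the ADO.NET provider, which is why the command timeout is kept as a second bound.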
**Resolution**
_Unresolved._
### InboundAPI-028 — `InboundDatabaseHelper` has no negative-path tests; `Database`/`WaitForAttribute` are not exercised end-to-end through the endpoint
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/InboundDatabaseHelperTests.cs:34-67`, `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/EndpointExtensionsTests.cs` |
**Description**
`InboundDatabaseHelperTests` covers only the SELECT happy path (single value, no-rows
default, multi-column row, empty result). There is no test that:
- a non-SELECT statement (`DELETE`/`UPDATE`/`DROP`) is rejected — because nothing rejects
it today (InboundAPI-026); a test would currently *prove* the helper is not read-only;
- the `InvalidOperationException("Database is not available …")` path fires when the
gateway is null (`InboundDatabaseHelper.cs:24`, `:39`);
- the method-deadline token cancels a slow query (InboundAPI-027).
Separately, `InboundScriptContext.Database` and `RouteTarget.WaitForAttribute` — both
new script-facing capabilities — are unit-tested at the helper/route level but never
driven end-to-end through `HandleInboundApiRequest` (the `EndpointExtensionsTests`
TestServer suite does not invoke a script that touches `Database` or `WaitForAttribute`),
so a wiring regression that fails to construct the `Database` helper or thread the
deadline into a wait would be silent.
**Recommendation**
Add negative-path tests to `InboundDatabaseHelperTests` (null-gateway throw; once
InboundAPI-026 enforces read-only, a write/DDL-rejection test; an InboundAPI-027
cancellation test). Add an `EndpointExtensionsTests` case driving a method script that
calls `Database.Query(...)` and one that calls `Route.To(...).WaitForAttribute(...)` so
the executor→context→helper wiring is covered through the real endpoint.
**Resolution**
_Unresolved._
### InboundAPI-029 — Routed `WaitForAttribute` is cancelled by the method-level deadline, contradicting spec §6 (wait bounded by the wait timeout, not the method timeout)
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/RouteHelper.cs:220-247`, `:302-303`; design doc `docs/requirements/Component-InboundAPI.md:192` |
**Description**
`Component-InboundAPI.md:192` says of `WaitForAttribute`: *"The cluster call is bounded
by the **wait timeout** rather than the generic integration timeout."* But
`RouteTarget.WaitForAttribute` routes with `Effective(cancellationToken)` (`:226`),
which — when the script passes no explicit token, the normal case — resolves to
`_deadlineToken` (the method-level timeout CTS bound by `WithDeadline` at
`InboundScriptExecutor.cs:279`). So a method whose timeout (default 30 s) is shorter than
the requested wait timeout (e.g. `WaitForAttribute("Flag", true, TimeSpan.FromMinutes(5))`)
has its wait cancelled at 30 s by the method deadline, never reaching the 5-minute wait
timeout the spec says governs it. The routed wait is thus bounded by the *generic method
timeout* — exactly what spec line 192 says it should not be.
This is in genuine tension with the InboundAPI-016 decision ("the method timeout covers
routed calls, including waits"). The two design statements conflict for the
`WaitForAttribute` case: §6 line 192 wants the wait timeout to win; InboundAPI-016 wants
the method deadline to be the hard ceiling. The current code silently resolves it in
favour of the method deadline, which makes the documented `WaitForAttribute` timeout
semantics (and its `false`-on-timeout return contract) unreliable whenever the method
timeout is the smaller of the two.
**Recommendation**
Reconcile the two design statements explicitly. Options: (a) treat the method deadline as
the hard ceiling and update `Component-InboundAPI.md:192` to say the wait is bounded by
`min(waitTimeout, remaining method deadline)` and that an undersized method timeout will
cut a wait short; or (b) honour spec §6 by deriving the wait's effective token from the
wait timeout (a per-wait CTS) rather than the method deadline, accepting that a long wait
can outlive the generic method timeout. Either way, add a `RouteHelperTests` case pinning
the chosen semantics when method-timeout < wait-timeout (the current suite only tests
deadline-token *inheritance*, not the timeout *ordering* conflict).
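The difference between the two options can be shown with a small, language-neutral sketch. This is written in Python for brevity and is not the module's C# code; the function names are hypothetical, and the 30 s / 5 min values are the ones quoted in the description above.

```python
from datetime import timedelta


def wait_bound_method_ceiling(wait_timeout, method_remaining):
    """Option (a): the method deadline is the hard ceiling for a routed wait."""
    return min(wait_timeout, method_remaining)


def wait_bound_per_wait(wait_timeout, method_remaining):
    """Option (b): the wait timeout governs, independent of the method deadline."""
    return wait_timeout


wait = timedelta(minutes=5)         # requested WaitForAttribute timeout
remaining = timedelta(seconds=30)   # remaining method-level deadline

print(wait_bound_method_ceiling(wait, remaining))  # 0:00:30
print(wait_bound_per_wait(wait, remaining))        # 0:05:00
```

A pinning test for whichever option is chosen only needs the case where the method timeout is smaller than the wait timeout, since that is the only case where the two options disagree.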
**Resolution**
_Unresolved._
+307
@@ -0,0 +1,307 @@
# Code Review — KpiHistory
| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.KpiHistory` |
| Design doc | `docs/requirements/Component-KpiHistory.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
## Summary
KpiHistory is a small, well-built observability module: a single ~305-line recorder
singleton (`KpiHistoryRecorderActor`), a strongly-typed options class with a fail-fast
validator, and a thin DI composition root. The owned code is clean — the actor is
textbook best-effort: every sample pass and purge sweep runs off the actor thread via
`PipeTo`, per-source faults are isolated so one throwing `IKpiSampleSource` never aborts
a tick or the other sources, the repository write/purge is guarded, no exception escapes
either tick handler, and a lifecycle `CancellationTokenSource` is cancelled in `PostStop`
so an in-flight pass observes shutdown promptly. The singleton is wired correctly in the
Host (ClusterSingletonManager + proxy + `PhaseClusterLeave` drain) and is deliberately
absent from the readiness barrier, exactly as the design requires. The options validator,
the EAV table mapping, both named indexes, and all four `IKpiSampleSource` implementations
exist and are registered on the central host as designed.
The dominant theme is **unbounded work under load**, in two places. First, the recorder
has **no in-flight guard** on its sample timer — directly contradicting its own XML-doc
claim to "mirror the NotificationOutboxActor timer + scope-per-tick + PipeTo pattern,"
because the NotificationOutbox dispatcher *does* hold an in-flight guard and the recorder
does not. When a sample pass runs longer than `SampleInterval` (slow/recovering DB), Akka
periodic timers keep firing rather than coalescing missed ticks, so overlapping `RunSamplePass` tasks pile up,
multiplying DB load at exactly the moment the store is struggling and double-writing
samples for overlapping windows. Second, `GetRawSeriesAsync` has **no server-side row
cap**: the design's `DefaultMaxSeriesPoints` ceiling is applied by `KpiSeriesBucketer`
only *after* the full raw window is materialised into memory — a 7-day window at the
default 60 s cadence is ~10 080 rows per series pulled to the Central UI before
downsampling, defeating the stated intent of the cap. A secondary theme is **bucketer
contract drift** (the "largest-timestamp-wins for unsorted input" doc claim is not what
the code does, and the short-series early-return emits raw capture timestamps where the
downsample path emits bucket-boundary timestamps) — both live in Commons but are core to
this module's query reducer. No Critical findings; one High, two Medium, three Low.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `capturedAt`/cut-off captured on the actor thread (correct). `KpiSeriesBucketer` short-series early-return emits raw capture timestamps vs bucket-boundary timestamps on the downsample path (KpiHistory-005). Bucketer "largest-timestamp-wins for unsorted input" doc claim is false — it is last-in-iteration (KpiHistory-006). |
| 2 | Akka.NET conventions | Yes | `PipeTo` + scope-per-tick + off-thread I/O + `IWithTimers` all correct; sender not captured across awaits; messages immutable singletons. But no in-flight guard despite the XML claiming to mirror NotificationOutbox (KpiHistory-001). |
| 3 | Concurrency & thread safety | Yes | The actor holds effectively no mutable state; `_shutdownCts` lifecycle is correct. The missing in-flight guard allows overlapping sample passes under DB latency (KpiHistory-001). |
| 4 | Error handling & resilience | Yes | Per-source isolation, write/purge guards, and `OperationCanceledException` shutdown handling are all correct and tested. No issues beyond KpiHistory-001's load amplification. |
| 5 | Security | Yes | No injection surface — all queries are parameterised LINQ; `Source`/`Metric`/`Scope`/`ScopeKey` are equality predicates, never interpolated SQL. Scope isolation via `ScopeKey == scopeKey` (incl. `IS NULL` for Global) is correct. No issues found. |
| 6 | Performance & resource management | Yes | `GetRawSeriesAsync` returns the entire window with no server-side cap before bucketing (KpiHistory-002). `RecordSamplesAsync` short-circuits empty batches (good). Purge is set-based `ExecuteDeleteAsync` but is a single unbatched statement (KpiHistory-004). |
| 7 | Design-document adherence | Yes | The recorder XML claims to mirror NotificationOutbox's pattern but omits its in-flight guard (KpiHistory-001). `DefaultMaxSeriesPoints`-as-a-true-cap intent is undermined by KpiHistory-002. Otherwise faithful: singleton, not-readiness-gated, daily purge, EAV schema, indexes. |
| 8 | Code organization & conventions | Yes | Options class owned by the component (correct); validator co-located; `IKpiHistoryRepository`/`IKpiSampleSource`/`KpiSample`/bucketer in Commons, impl in ConfigurationDatabase (correct); singleton Props built in Host. No issues found. |
| 9 | Testing coverage | Yes | Recorder isolation, faulted-tick recovery, and purge cut-off are tested; validator bounds fully covered; bucketer has strong sorted-input coverage. Gaps: no overlapping-tick test, no unsorted-input bucketer test, no short-series timestamp-semantics assertion (KpiHistory-003). |
| 10 | Documentation & comments | Yes | XML is generally excellent. Two drift points: the "mirror NotificationOutbox" claim (KpiHistory-001) and the bucketer unsorted-input claim (KpiHistory-006). `SampleComplete`/`PurgeComplete` no-op handlers are documented. |
## Findings
### KpiHistory-001 — Recorder has no in-flight guard; overlapping sample passes pile up under DB latency
| | |
|--|--|
| Severity | High |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.KpiHistory/KpiHistoryRecorderActor.cs:89-92`, `:103-107`, `:143-159` |
**Description**
The sample timer is a plain Akka periodic timer
(`Timers.StartPeriodicTimer(SampleTimerKey, SampleTick.Instance, …, interval: _options.SampleInterval)`).
`HandleSampleTick` launches `RunSamplePass(...)` off-thread via `PipeTo` and returns
immediately; the piped-back `SampleComplete` is a deliberate no-op
(`Receive<SampleComplete>(_ => { })`). There is **no in-flight guard** — nothing prevents
a second `SampleTick` from starting a second `RunSamplePass` while the first is still
awaiting its DB round-trip.
Akka periodic timers do not coalesce missed ticks — they enqueue. So when a sample pass
takes longer than `SampleInterval` (a slow, contended, or recovering central MS SQL —
exactly the regime where observability matters most), each subsequent tick spawns *another*
concurrent pass. Each pass opens its own DI scope + `DbContext` and issues its own
`AddRange` + `SaveChangesAsync`. The result is a self-amplifying load spiral against the
struggling store, plus duplicate sample rows whose `CapturedAtUtc` values straddle the same
real-time window (the design's "one shared tick timestamp" invariant assumes one pass per
tick).
The actor's own XML-doc states it "mirrors the `NotificationOutboxActor` timer +
scope-per-tick + PipeTo pattern" (lines 37-39) — but `NotificationOutboxActor` holds an
explicit in-flight boolean cleared on `DispatchComplete` precisely to serialise sweeps.
This recorder dropped that half of the pattern. The piped-back `SampleComplete` message is
the natural place the guard would be lowered, and its current empty body is the tell that
the guard was intended but omitted.
**Recommendation**
Add an in-flight guard: set a `_sampleInFlight` flag at the top of `HandleSampleTick`, skip
(and log at debug) the tick if already set, and clear it in the `SampleComplete` handler
(both success and failure projections already route there). This matches the
NotificationOutbox pattern the doc claims to follow and bounds the recorder to at most one
sample pass in flight at a time. Add a regression test that drives two `SampleTick`s before the first pass completes
(e.g. a repository whose `RecordSamplesAsync` blocks on a gate) and asserts only one pass
ran. The purge tick is daily + idempotent so a guard there is optional, but consider the
same treatment for symmetry.
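The guard logic itself is small. The following is a minimal, language-neutral sketch in Python, not the actor's C# code: the class and method names are hypothetical stand-ins for `HandleSampleTick` and the `SampleComplete` handler, and the off-thread sample pass is reduced to a counter.

```python
class RecorderGuardModel:
    """Models only the in-flight flag, not Akka timers or PipeTo."""

    def __init__(self):
        self.sample_in_flight = False
        self.passes_started = 0
        self.ticks_skipped = 0

    def on_sample_tick(self):
        # Skip the tick if the previous pass has not reported completion yet.
        if self.sample_in_flight:
            self.ticks_skipped += 1
            return False
        self.sample_in_flight = True
        self.passes_started += 1
        return True

    def on_sample_complete(self):
        # Must be reached on both success and failure, or the guard wedges.
        self.sample_in_flight = False


recorder = RecorderGuardModel()
recorder.on_sample_tick()      # starts pass 1
recorder.on_sample_tick()      # pass 1 still running: skipped
recorder.on_sample_complete()  # pass 1 finished (or faulted)
recorder.on_sample_tick()      # starts pass 2
print(recorder.passes_started, recorder.ticks_skipped)  # 2 1
```

The regression test described above has the same shape: two ticks before completion, then an assertion that only one pass started.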
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added a `_sampleInFlight` guard to the recorder — `HandleSampleTick` skips (debug-logs) if a pass is in flight; the flag is cleared via a `SampleComplete` message piped on BOTH success and failure, so a faulted source can't wedge the guard. Overlapping-tick regression test added.
### KpiHistory-002 — `GetRawSeriesAsync` materialises the entire window with no server-side cap; `DefaultMaxSeriesPoints` is applied only after the fact
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IKpiHistoryRepository.cs:39-41` (contract); `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:45-62` (impl); consumed by `src/ZB.MOM.WW.ScadaBridge.CentralUI/Services/KpiHistoryQueryService.cs:77-93` |
**Description**
`GetRawSeriesAsync` has no `limit`/`maxPoints`/`Take` parameter. The implementation issues a
`Where(...).OrderBy(CapturedAtUtc).Select(...).ToListAsync()` that pulls **every** sample in
`[fromUtc, toUtc]` for the series into memory. The `DefaultMaxSeriesPoints` ceiling
(default 200, design-described as the cap that prevents "a single trend query [from
streaming] an arbitrarily large series to the Central UI" — `KpiHistoryOptionsValidator`
XML) is enforced by `KpiSeriesBucketer.Bucket` **after** the full raw set has already been
fetched across the wire and allocated as a `List<KpiSeriesPoint>` in the query service.
At the default 60 s sample cadence, a 24 h window is ~1 440 rows/series, a 7 d window is
~10 080 rows/series, and a 30 d window ~43 200 rows/series — per chart, with up to four
trend panels on a page each issuing its own query. The cap that the design relies on for
back-pressure provides none at the data tier; it only trims the in-memory result the UI
binds to. The `IX_KpiSample_Series` index makes the *range scan* efficient, but the row
count returned is still unbounded by anything except the window the parent page chooses.
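The row counts quoted above follow directly from window length divided by cadence. A short Python check, using the 60 s cadence and the default cap of 200 stated in this finding (the helper name is hypothetical):

```python
SECONDS_PER_DAY = 24 * 60 * 60
DEFAULT_MAX_SERIES_POINTS = 200  # default cap quoted in this finding


def rows_per_series(window_days, cadence_seconds=60):
    """Rows fetched for one series over the window at a fixed sample cadence."""
    return (window_days * SECONDS_PER_DAY) // cadence_seconds


for days in (1, 7, 30):
    rows = rows_per_series(days)
    # Ratio of rows transferred to points actually kept after bucketing.
    print(days, rows, rows / DEFAULT_MAX_SERIES_POINTS)
```

At 30 days the fetch is 216 times larger than the 200 points the chart ends up binding to.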
**Recommendation**
Push the downsampling toward the store, or at least bound the fetch. Cheapest correct fix:
add an optional `int? maxRows` to `GetRawSeriesAsync` and have the query service pass a
generous multiple of `effectiveMax` (e.g. `effectiveMax * k`) so the bucketer still has
enough density to pick representative last-values while the DB-side `Take` caps the
transfer. A more thorough fix is a server-side bucketed aggregation (GROUP BY a computed
bucket index, MAX(CapturedAtUtc) per bucket), but that is a larger change the design
explicitly deferred ("v1 ships exactly one aggregation"). At minimum, document the
unbounded-fetch behaviour and the practical window ceiling so an operator does not point a
30 d chart at a busy multi-site deployment.
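One detail worth checking if the `maxRows` option is taken: a plain `Take` on an ascending `OrderBy(CapturedAtUtc)` keeps the oldest rows and drops the newest, which is usually the wrong end for a trend chart. The Python sketch below models that with a list of `(timestamp, value)` tuples; it is an illustration of the ordering concern only, and the function names are hypothetical.

```python
def capped_oldest_first(rows, max_rows):
    """Models OrderBy(timestamp).Take(max_rows): truncates the newest samples."""
    return sorted(rows)[:max_rows]


def capped_newest_first(rows, max_rows):
    """Keeps the newest max_rows samples, returned in ascending order."""
    ordered = sorted(rows)
    start = max(len(ordered) - max_rows, 0)
    return ordered[start:]


rows = [(t, float(t)) for t in range(10)]  # timestamps 0..9
print([t for t, _ in capped_oldest_first(rows, 4)])  # [0, 1, 2, 3]
print([t for t, _ in capped_newest_first(rows, 4)])  # [6, 7, 8, 9]
```

Either variant still biases the chart toward one end of the window once the cap bites, which is one reason the server-side bucketed aggregation is described above as the more thorough fix.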
**Resolution**
Deferred 2026-06-20: bounding the raw window fetch (cheap `Take`-cap vs. server-side bucketed aggregation) is a design decision and the code lives in ConfigurationDatabase/Commons, not this project. Recorded as a query-path enhancement; the practical window ceiling should be documented when addressed.
### KpiHistory-003 — Missing tests: overlapping ticks, unsorted-input bucketing, short-series timestamp semantics
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.KpiHistory.Tests/KpiHistoryRecorderActorTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.Commons.Tests/Kpi/KpiSeriesBucketerTests.cs` |
**Description**
The recorder and bucketer tests are otherwise strong (per-source isolation, faulted-tick
recovery, purge cut-off, validator bounds, sorted-input bucketing, right-edge handling,
empty-bucket omission, out-of-window filtering). Three behaviour gaps remain, each tied to
a finding here:
1. **No overlapping-tick test** (KpiHistory-001). Every recorder test sends a single tick
and awaits its effect; nothing exercises a second `SampleTick` arriving while the first
pass is still in flight, so the missing in-flight guard is invisible to the suite. A
gated-repository test (block `RecordSamplesAsync`, send two ticks, count passes) would
pin the intended one-pass-per-tick behaviour.
2. **No unsorted-input bucketer test** (KpiHistory-006). All bucketer tests pass ascending
input, so the doc's "largest-timestamp-wins for unsorted input" claim is never checked —
and it is in fact wrong.
3. **No short-series timestamp-semantics assertion** (KpiHistory-005).
`Bucket_BucketStartUtc_IsSetToBucketStart…` covers only the downsample path; no test
asserts what `BucketStartUtc` the early-return (`raw.Count <= maxPoints`) path emits, so
the inconsistency between the two paths is untested.
**Recommendation**
Add the three tests above. The overlapping-tick test belongs in
`KpiHistoryRecorderActorTests` (it is a recorder behaviour); the two bucketer tests belong
in `KpiSeriesBucketerTests`.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added the overlapping-tick gated-repository test plus unsorted-input and short-series bucketer tests (the latter pin the documented short-series behaviour).
### KpiHistory-004 — Retention purge is a single unbatched `ExecuteDeleteAsync`; a large backlog deletes in one transaction
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:65-71` |
**Description**
`PurgeOlderThanAsync` runs one set-based `Where(s => s.CapturedAtUtc < before).ExecuteDeleteAsync(...)`.
For the steady-state daily cadence this deletes one day of expired rows and is fine. But if
the purge has not run for an extended period (singleton down across a long failover window,
`PurgeInterval` mis-set and later corrected, or `RetentionDays` shortened), a single
unbounded `DELETE` can touch a very large row count in one transaction — lock escalation on
`KpiSample`, a long-running transaction, and transaction-log growth, which on the shared
central MS SQL can affect operational tables. The Audit Log purge path, by contrast, uses a
bounded batched delete for exactly this reason.
This is observability data on a non-partitioned `[PRIMARY]` table, so the blast radius is
bounded and the severity is Low — but the unbatched delete is a latent operational hazard
on the shared store.
**Recommendation**
Loop a bounded delete (`DELETE TOP (N) … WHERE CapturedAtUtc < @before`, or EF Core's
batching) until zero rows are affected, mirroring the Audit Log purge shape. Keep the
returned total for the existing log line.
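The loop shape is sketched below in Python against an in-memory SQLite table, purely to show the control flow. It is not the repository's EF Core code: SQLite has no `DELETE TOP (N)`, so the batch is bounded with a `rowid IN (SELECT ... LIMIT ?)` subquery, and the table layout is a simplified assumption.

```python
import sqlite3


def purge_older_than(conn, before, batch_size):
    """Delete expired rows in bounded batches; return the total deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM KpiSample WHERE rowid IN ("
            "SELECT rowid FROM KpiSample WHERE CapturedAtUtc < ? LIMIT ?)",
            (before, batch_size),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE KpiSample (CapturedAtUtc INTEGER, Value REAL)")
conn.executemany(
    "INSERT INTO KpiSample VALUES (?, ?)", [(t, 1.0) for t in range(1000)]
)
conn.commit()

deleted = purge_older_than(conn, before=750, batch_size=100)
remaining = conn.execute("SELECT COUNT(*) FROM KpiSample").fetchone()[0]
print(deleted, remaining)  # 750 250
```

Committing per batch is what keeps each transaction short; the returned total still feeds the existing log line.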
**Resolution**
Deferred 2026-06-20: batching the retention purge is a shared-MS-SQL-store tradeoff (vs. the Audit Log's batched-delete precedent); low severity for non-partitioned observability data. Recorded as a future enhancement.
### KpiHistory-005 — `KpiSeriesBucketer` short-series early-return emits raw capture timestamps where the downsample path emits bucket-boundary timestamps
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:57-58` (early return) vs `:93-94` (downsample) |
**Description**
`KpiSeriesBucketer.Bucket` has two return paths that disagree on what `BucketStartUtc`
means. When `raw.Count <= maxPoints` it returns `raw` unchanged — those points carry the
raw `CapturedAtUtc` as their `BucketStartUtc` (the repository builds
`new KpiSeriesPoint(s.CapturedAtUtc, s.Value)`). When `raw.Count > maxPoints` it returns
points whose `BucketStartUtc` is the **bucket boundary**
(`fromUtc + bucketIndex * bucketWidthTicks`), as the dedicated test
`Bucket_BucketStartUtc_IsSetToBucketStartNotRawPointTimestamp` asserts.
So the same series, charted at a density below vs above `maxPoints`, plots its points at
different x-positions: actual capture instants in the sparse case, evenly-spaced bucket
starts in the dense case. The downstream `KpiTrendChart` normalises X across
`[min(BucketStartUtc), max(BucketStartUtc)]`, so the visual impact is minor (the time range
is essentially the same), but the contract is inconsistent and the x-axis "tick spacing"
subtly changes as a window crosses the cap. This is the bucketer that the KpiHistory design
defines as the module's query reducer, so the inconsistency is in-scope even though the file
lives in Commons.
**Recommendation**
Make the two paths agree. Either document the difference explicitly on `Bucket` (the
short-series path returns raw capture instants; the downsample path returns bucket starts),
or — cleaner — have the short-series path also project onto a consistent timestamp basis if
exact bucket-start semantics are part of the contract. Add the short-series timestamp test
from KpiHistory-003.
**Resolution**
Deferred 2026-06-20: whether the bucketer's short-series early-return and the downsample path must agree on `BucketStartUtc` semantics (vs. documenting the difference) is a contract decision for the design owner. The doc was corrected (KpiHistory-006) to describe current behaviour; the contract change is deferred.
### KpiHistory-006 — `KpiSeriesBucketer` doc claims "largest timestamp within each bucket is selected" for unsorted input; the code selects last-in-iteration
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:20-21` (doc), `:88-97` (code) |
**Description**
The `raw` param doc states: *"If not sorted, the point with the largest timestamp within
each bucket is selected."* The code does not do this. When a point is stored into a bucket,
`best[bucketIndex].BucketStartUtc` is set to the **bucket-start** timestamp
(`fromUtc + bucketIndex * bucketWidthTicks`), not the raw point's timestamp. The
last-value comparison for a subsequent point in the same bucket is then
`point.BucketStartUtc > best[bucketIndex].BucketStartUtc` — i.e. it compares the new raw
point's capture time against the *bucket start*, which (for any in-bucket point) is almost
always true. The effect is that each later-in-iteration point overwrites the previous one
regardless of their relative timestamps, so the **last point in iteration order** wins, not
the point with the largest timestamp.
For the production caller this is harmless: `KpiHistoryRepository.GetRawSeriesAsync` always
`OrderBy(CapturedAtUtc)`, so iteration order equals time order and last-in-iteration is the
largest timestamp. But the documented contract for unsorted input is simply false, and the
"if ties, keep first encountered — stable" comment (line 89) is also inaccurate — the
overwrite triggers on equal-as-well-as-greater for any in-bucket point. A future caller that
trusts the unsorted-input guarantee will get wrong results silently.
**Recommendation**
Either (a) fix the comparison to track the selected raw point's actual timestamp (store the
raw `point.BucketStartUtc` alongside the emitted value and compare against that), making the
"largest timestamp wins" claim true for unsorted input; or (b) tighten the doc to state the
method requires ascending-sorted input and selects last-in-iteration otherwise, and drop the
inaccurate "largest timestamp" / "stable ties" language. Pair with the unsorted-input test
from KpiHistory-003.
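The behaviour described in this finding can be reproduced with a small model. The Python sketch below is a reconstruction from the description above, not a port of `KpiSeriesBucketer`; the function and parameter names are hypothetical. With `track_raw_timestamp=False` it stores the bucket start as the comparison key (the behaviour reported here); with `True` it stores the raw capture time, which is option (a).

```python
def bucket_last_values(points, from_ts, width, track_raw_timestamp):
    """points: iterable of (timestamp, value). Returns [(bucket_start, value)]."""
    best = {}  # bucket index -> (comparison key, bucket start, value)
    for ts, value in points:
        idx = (ts - from_ts) // width
        bucket_start = from_ts + idx * width
        # Reported behaviour compares against the stored bucket start,
        # so nearly every in-bucket point overwrites the previous one.
        key = ts if track_raw_timestamp else bucket_start
        if idx not in best or ts > best[idx][0]:
            best[idx] = (key, bucket_start, value)
    return [(best[i][1], best[i][2]) for i in sorted(best)]


unsorted_points = [(50, "late"), (20, "early")]  # same bucket, descending order
print(bucket_last_values(unsorted_points, 0, 100, track_raw_timestamp=False))
# [(0, 'early')]  last-in-iteration wins
print(bucket_last_values(unsorted_points, 0, 100, track_raw_timestamp=True))
# [(0, 'late')]   largest timestamp wins
```

For ascending input both variants return the same result, which matches the observation that the production caller is unaffected.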
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): corrected the `KpiSeriesBucketer` param XML doc — dropped the false 'largest-timestamp-wins / stable ties' claim; now states it requires ascending-sorted input and selects last-in-iteration otherwise. Doc-only.
+150 -2
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.NotificationService` | | Module | `src/ZB.MOM.WW.ScadaBridge.NotificationService` |
| Design doc | `docs/requirements/Component-NotificationService.md` | | Design doc | `docs/requirements/Component-NotificationService.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 0 | | Open findings | 0 |
## Summary
@@ -100,6 +100,39 @@ and `CredentialRedactor` masks any component of the credential string that is
≥ 4 characters long; a 4-character user name like `root` or a 4-char tenant
prefix could be aggressively scrubbed out of unrelated log text (NS-025).
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Re-reviewed the whole current state. Since `1eb6e97` the module was renamed
(`ScadaLink → ZB.MOM.WW.ScadaBridge`, so the diff shows every file as "added" at
the new path), gained complete XML-doc coverage, and — most importantly — **NS-021
is now RESOLVED**: `ISmtpClientWrapper.AuthenticateAsync` threads an `oauth2UserName`
and the OAuth2 branch builds `new SaslMechanismOAuth2(oauth2UserName, token)`,
throwing `SmtpPermanentException` if the identity is empty; the real production
caller `EmailNotificationDeliveryAdapter.SendAsync` passes `config.FromAddress`
(`src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Delivery/EmailNotificationDeliveryAdapter.cs:196`),
and `MailKitSmtpClientWrapperTests` pins both the throw and the XOAUTH2 framing.
All seven prior NS-019..NS-025 findings hold; the module is now cleanly scoped to the
shared central SMTP primitives (no orphaned site-delivery code). The re-review
surfaced **three new findings**, none above Medium: the OAuth2 token endpoint and
scope are hardcoded to Microsoft 365 so no other OAuth2 provider can authenticate
despite the design doc promising "other modern SMTP providers" (NS-026), one test
file is misplaced in this test project (its system-under-test lives in
NotificationOutbox — NS-027), and one test's XML doc still names the
NS-019-deleted `NotificationDeliveryService` (NS-028).
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | NS-021 now resolved (oauth2 user identity threaded + tested). Classification/redaction correct. No new correctness defect. |
| 2 | Akka.NET conventions | ☑ | No actors. `AddNotificationServiceActors` is a documented no-op. |
| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` per-credential `ConcurrentDictionary` + per-entry `SemaphoreSlim` double-check is correct; concurrent-fetch test passes. No issues. |
| 4 | Error handling & resilience | ☑ | Delivery error path now lives entirely in NotificationOutbox; the shared `SmtpErrorClassifier` is sound (typed exceptions + numeric SMTP code, cancellation excluded). No issues here. |
| 5 | Security | ☑ | OAuth2 secret/token never logged (tenant only); `CredentialRedactor` tightened (NS-025). At-rest encryption of `SmtpConfiguration.Credentials` still NS-013 Deferred (Commons-scoped). OAuth2 endpoint hardcoded to M365 (NS-026). |
| 6 | Performance & resource management | ☑ | Single `SmtpClient` per wrapper, disposed by caller's `using`; per-delivery factory documented (NS-022). No leak. |
| 7 | Design-document adherence | ☑ | Central-only redesign clean. New gap: OAuth2 hardcoded to M365 token URL + outlook.office365.com scope contradicts "other modern SMTP providers" in the doc (NS-026). |
| 8 | Code organization & conventions | ☑ | One-type-per-file holds. `NotificationOutboxKpiSampleSourceTests.cs` is in this test project but its SUT is in NotificationOutbox (NS-027). |
| 9 | Testing coverage | ☑ | Primitives well covered (classifier, TLS parser, redactor, OAuth2 token cache/refresh/concurrency, auth-wrapper negative paths incl. NS-021). No false-coverage; orphaned tests gone (NS-024). |
| 10 | Documentation & comments | ☑ | XML docs accurate after NS-019/023, EXCEPT `SmtpErrorClassifierTests.cs:11` still references the deleted `NotificationDeliveryService` (NS-028). |
## Checklist coverage
| # | Category | Examined | Notes |
@@ -860,3 +893,118 @@ The threshold is a defence-in-depth choice; the existing tests assert that `Hunt
**Recommendation**
Tighten the redaction policy: mask only the obviously secret components (the password for Basic, the client secret for OAuth2, and the whole packed string), not the user name / tenant / client id. The simplest implementation is to redact only the **last** colon-separated component (the secret) plus the full packed string. Bump the per-component minimum length to something high enough that a typical short user name does not match (≥ 12 chars is the usual heuristic for a password). Add a test asserting `Scrub("/root/.config", "root:hunter2")` does not mask `/root/.config`'s `root`.
### NotificationService-026 — OAuth2 token endpoint and scope are hardcoded to Microsoft 365; no other OAuth2 SMTP provider can authenticate
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.NotificationService/OAuth2TokenService.cs:73`, `src/ZB.MOM.WW.ScadaBridge.NotificationService/OAuth2TokenService.cs:80` |
**Description**
`OAuth2TokenService.GetTokenAsync` constructs the token URL as
`$"https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token"` (`:73`) and
requests the fixed scope `"https://outlook.office365.com/.default"` (`:80`). Both
are Microsoft-365-specific: the authority host is Azure AD's, and the scope is the
Outlook/Office-365 SMTP-send resource. The design doc (`Component-NotificationService.md`,
"Email Server Configuration") states the OAuth2 Client Credentials mode is
"For Microsoft 365 **and other modern SMTP providers** that require OAuth2", and
CLAUDE.md's External Integrations decision likewise says "OAuth2 Client Credentials
(Microsoft 365) **or** Basic Auth". As implemented, only Microsoft 365 can ever
work: a Google Workspace, on-prem Keycloak/ADFS, or any non-Azure OAuth2 SMTP relay
would need a different authority host and a different scope, neither of which is
configurable. The credential triple carries only `tenantId:clientId:clientSecret`
— there is no authority-host or scope field to override either constant. An operator
who configures an `oauth2` SMTP config pointing at a non-M365 provider gets a token
request sent to `login.microsoftonline.com` with `{tenantId}` interpolated as a path
segment, which fails (404/400) and surfaces as a permanent delivery failure with no
hint that the provider is simply unsupported.
This is latent rather than actively broken — the only documented/tested target is
M365 — but it contradicts the design doc's stated breadth and is the kind of gap that
silently blocks a future deployment against a non-Microsoft tenant.
**Recommendation**
Either (a) make the authority host and scope configurable — add optional
`OAuth2Authority`/`OAuth2Scope` fields to `SmtpConfiguration` (defaulting to the M365
values) and thread them into `GetTokenAsync` — and update the cache key to include
them; or (b) narrow the design doc and CLAUDE.md to say OAuth2 Client Credentials is
**Microsoft 365 only** and document the M365-only constraint on `OAuth2TokenService`,
so an operator is not misled into configuring an unsupported provider. Option (b) is
the smaller change if M365-only is acceptable for v1.
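Option (a) amounts to two defaulted fields plus a wider cache key. The Python sketch below illustrates that shape only; it is not the `SmtpConfiguration` or `OAuth2TokenService` code, the class and field names are hypothetical, and the non-Microsoft authority is a placeholder host. The two default constants are the values quoted in the description.

```python
from dataclasses import dataclass

M365_AUTHORITY = "https://login.microsoftonline.com"
M365_SCOPE = "https://outlook.office365.com/.default"


@dataclass(frozen=True)
class OAuth2Settings:
    tenant_id: str
    client_id: str
    authority: str = M365_AUTHORITY  # assumed override field
    scope: str = M365_SCOPE          # assumed override field

    def token_url(self) -> str:
        # Path shape is the Microsoft one quoted above; other providers
        # may need a full endpoint override rather than a host swap.
        return f"{self.authority.rstrip('/')}/{self.tenant_id}/oauth2/v2.0/token"

    def cache_key(self) -> tuple:
        # Authority and scope join the key so tokens are never shared
        # across providers or scopes.
        return (self.authority, self.scope, self.tenant_id, self.client_id)


default = OAuth2Settings("tenant-a", "client-a")
other = OAuth2Settings(
    "tenant-a", "client-a",
    authority="https://idp.example.test", scope="smtp.send",
)
print(default.token_url())
print(default.cache_key() != other.cache_key())  # True
```

Existing Microsoft 365 configurations keep working unchanged because both new fields default to the current constants.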
**Resolution**
Deferred 2026-06-20: the OAuth2 token endpoint + scope are hardcoded to Microsoft 365 while the doc implies multi-provider support. Whether to make authority/scope configurable or narrow the doc to 'Microsoft 365 only' for v1 is a product/scope decision. Latent today (only M365 is deployed/tested). Recorded for that decision.
### NotificationService-027 — `NotificationOutboxKpiSampleSourceTests` is in the wrong test project (its system-under-test lives in NotificationOutbox)
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/Kpi/NotificationOutboxKpiSampleSourceTests.cs:1`, `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Kpi/NotificationOutboxKpiSampleSource.cs` |
**Description**
`tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/Kpi/NotificationOutboxKpiSampleSourceTests.cs`
was added (commit `0d6c026d`, "K6 — NotificationOutbox sample source") inside the
**NotificationService** test project, but the class it exercises —
`NotificationOutboxKpiSampleSource` — lives in the **NotificationOutbox** project
(`src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Kpi/NotificationOutboxKpiSampleSource.cs`),
confirmed by a repo-wide grep showing the type defined only there. A module's test
project should test that module: this test belongs in
`tests/ZB.MOM.WW.ScadaBridge.NotificationOutbox.Tests`. As placed, it forces the
NotificationService test project to reference the NotificationOutbox project, blurs
the module boundary the review process treats as the unit of review, and means a
reviewer scoping "the NotificationService tests" picks up coverage for a different
component. It also inflates this module's apparent test count with tests that assert
nothing about NotificationService.
**Recommendation**
Move `Kpi/NotificationOutboxKpiSampleSourceTests.cs` to
`tests/ZB.MOM.WW.ScadaBridge.NotificationOutbox.Tests/Kpi/`, and drop the
NotificationOutbox project reference from `ZB.MOM.WW.ScadaBridge.NotificationService.Tests.csproj`
if nothing else in that test project needs it.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): moved `NotificationOutboxKpiSampleSourceTests` from NotificationService.Tests into NotificationOutbox.Tests (where its SUT lives), with the namespace adjusted to that project's convention. Passes in its new home.
### NotificationService-028 — Stale XML cross-reference: `SmtpErrorClassifierTests` still names the NS-019-deleted `NotificationDeliveryService`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/SmtpErrorClassifierTests.cs:11` |
**Description**
The class-summary XML doc on `SmtpErrorClassifierTests` says the classification
policy "is shared between `<see cref="NotificationDeliveryService"/>` and the central
outbox's `EmailNotificationDeliveryAdapter`". `NotificationDeliveryService` was
deleted as part of the NS-019 resolution (the orphaned site-side sender), so this
cross-reference points at a type that no longer exists. The `<see cref=...>` will not
resolve and the sentence describes a sharing relationship that ended with the
central-only redesign — the classifier is now consumed only by
`EmailNotificationDeliveryAdapter`. This is the same stale-doc class NS-023 cleaned up
in the production source, missed in this one test file. Minor, but it misleads a
maintainer and leaves a dangling `cref`.
**Recommendation**
Drop the `NotificationDeliveryService` reference from the summary; state that the
classifier is shared by the central `EmailNotificationDeliveryAdapter` (and is a
reusable primitive of this module), consistent with the NS-019/NS-023 cleanup.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): removed the dangling `<see cref="NotificationDeliveryService"/>` (type deleted by NS-019) from the `SmtpErrorClassifierTests` XML doc, repointing the reference to the actual production caller. Doc-only.
+35 -26
@@ -40,37 +40,39 @@ module file and counted in **Total**.
| Severity | Open findings | | Severity | Open findings |
|----------|---------------| |----------|---------------|
| Critical | 0 | | Critical | 0 |
| High | 0 | | High | 1 |
| Medium | 0 | | Medium | 1 |
| Low | 0 | | Low | 2 |
| **Total** | **0** | | **Total** | **4** |
## Module Status ## Module Status
| Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total | | Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
|--------|---------------|--------|----------------|------|-------| |--------|---------------|--------|----------------|------|-------|
| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 11 | | [AuditLog](AuditLog/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 16 |
| [CLI](CLI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 24 | | [CLI](CLI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 24 |
| [CentralUI](CentralUI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 36 | | [CentralUI](CentralUI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 36 |
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 14 | | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 15 |
| [Commons](Commons/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 | | [Commons](Commons/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 |
| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 | | [Communication](Communication/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 24 |
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 27 | | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 27 |
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 | | [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 |
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 24 | | [DeploymentManager](DeploymentManager/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 |
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 | | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 |
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 | | [HealthMonitoring](HealthMonitoring/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 |
| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 | | [Host](Host/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 |
| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 25 | | [InboundAPI](InboundAPI/findings.md) | 2026-06-20 | `4307c381` | 0/1/1/2 | 4 | 29 |
| [KpiHistory](KpiHistory/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 6 |
| [ManagementService](ManagementService/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 | | [ManagementService](ManagementService/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 |
| [NotificationOutbox](NotificationOutbox/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 13 | | [NotificationOutbox](NotificationOutbox/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 13 |
| [NotificationService](NotificationService/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 25 | | [NotificationService](NotificationService/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 28 |
| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 21 | | [ScriptAnalysis](ScriptAnalysis/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 8 |
| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 6 | | [Security](Security/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 |
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 | | [SiteCallAudit](SiteCallAudit/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 9 |
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 26 | | [SiteEventLogging](SiteEventLogging/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 24 | | [SiteRuntime](SiteRuntime/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 31 |
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 | | [StoreAndForward](StoreAndForward/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 |
| [TemplateEngine](TemplateEngine/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 |
| [Transport](Transport/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 15 | | [Transport](Transport/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 15 |
## Pending Findings ## Pending Findings
@@ -84,14 +86,21 @@ description, location, recommendation — lives in the module's `findings.md`.
_None open._ _None open._
### High (0) ### High (1)
_None open._ | ID | Module | Title |
|----|--------|-------|
| InboundAPI-026 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` gives inbound scripts arbitrary raw SQL (not read-only); contradicts the design doc's "No direct database access" decision and regresses InboundAPI-007 |
### Medium (0) ### Medium (1)
_None open._ | ID | Module | Title |
|----|--------|-------|
| InboundAPI-027 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` is sync-over-async and ignores the method-deadline token — thread-pool starvation and an unbounded slow query |
### Low (0) ### Low (2)
_None open._ | ID | Module | Title |
|----|--------|-------|
| InboundAPI-028 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` has no negative-path tests; `Database`/`WaitForAttribute` are not exercised end-to-end through the endpoint |
| InboundAPI-029 | [InboundAPI](InboundAPI/findings.md) | Routed `WaitForAttribute` is cancelled by the method-level deadline, contradicting spec §6 (wait bounded by the wait timeout, not the method timeout) |
+407
@@ -0,0 +1,407 @@
# Code Review — ScriptAnalysis
| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis` |
| Design doc | `docs/requirements/Component-ScriptAnalysis.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
## Summary
This is the first formal review of the ScriptAnalysis component (#25), the single
authoritative source of truth for the ScadaBridge script trust boundary. The module
is small (5 source files, ~700 lines) and well-documented, and its core design is
**sound**: the trust verdict is genuinely **semantic** (Roslyn symbol resolution in
Pass 1), not a string/regex deny-list, so the classic spelling-based evasions —
aliases, `using static`, `global::`, partial-namespace qualification — are correctly
defeated by resolving the symbol rather than the text. The reflection-gateway
hardening (Pass 2) closes the `typeof(x).Assembly.GetType("System.IO.File")` class of
bypass that a namespace-only deny-list cannot see. All four documented consumers
(Template Engine, Site Runtime, Inbound API, Central UI) genuinely delegate to
`ScriptTrustValidator.FindViolations` and **hard-reject** on a non-empty result; none
re-implements or relaxes the policy. A reflection-based parity guard for the
compile-only surfaces exists in `SiteRuntime.Tests`. **No working trust-boundary
bypass was found at HEAD against the production (cluster-hosted) configuration.**
That said, the review surfaced several real issues. The highest-impact one is a
**silent degradation of the semantic pass when `TRUSTED_PLATFORM_ASSEMBLIES` is
unavailable** (single-file/AOT/trimmed host): `AnalysisReferences` falls back to the
5-assembly minimal set, after which a *bare* forbidden type inside an *allowed*
namespace (the documented `Process` via `using System.Diagnostics;` case) no longer
resolves and slips Pass 1 entirely — and Pass 2 never flags bare identifiers, only
dotted spellings (ScriptAnalysis-001). This is the only finding that can weaken the
verdict, and it is narrow (the production hosts carry a TPA list, and the downstream
real `Compile` gate catches it for two of the four consumers), so it is rated High
rather than Critical. The remaining findings are robustness/correctness/doc issues:
the syntactic pass (Pass 2) cannot catch a bare forbidden identifier on its own, so the
unresolved-symbol fallback is spelling-dependent (Medium); `nameof(...)` over-flags legitimate non-forbidden uses
inconsistently (Low, documented-intentional but worth pinning); `ParseDiagnostics`/
`Compile` drop warnings despite the doc promising "errors and warnings"; there is no
caching/throttling of the Roslyn compilations (performance); the `extraReferences`
XML-doc on `FindViolations` is contradicted by the Central UI consumer; and the
adversarial test corpus, though good, has gaps (no extension-method, no
TPA-fallback, no `file`-scoped/`unsafe`, no verbatim/Unicode-escape identifier
cases). The trust boundary is **semantically sound in production**; the open items
harden the edges and align the spec.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `nameof` over-flags inconsistently (ScriptAnalysis-005); `GetSyntacticScopes` only returns text for dotted names so a bare unresolved forbidden identifier yields no scope (root of ScriptAnalysis-001/002). |
| 2 | Akka.NET conventions | ☑ | No issues found — module has no actors, no Akka dependency (by design; csproj references only Roslyn + Commons). Stateless static APIs, thread-safe. |
| 3 | Concurrency & thread safety | ☑ | No issues found — all public surfaces are static/stateless; the shared `static readonly` policy collections are read-only after init; `HardeningWalker` is per-call. `AnalysisReferences` eager-built once at type init. |
| 4 | Error handling & resilience | ☑ | `RoslynScriptCompiler.Compile` converts exceptions into a fail-the-gate error entry (fail-safe). `BuildAnalysisReferences` swallows per-assembly read errors and silently falls back to the minimal set — the silent fallback is the resilience gap (ScriptAnalysis-001). |
| 5 | Security | ☑ | Semantic (symbol-based) verdict — no spelling bypass found in production. One conditional degradation (TPA-absent host) weakens Pass 1 to spelling-dependent (ScriptAnalysis-001); `extraReferences` can never cause a false allow (verified). Defence-in-depth only, as documented — not a runtime sandbox (ScriptAnalysis-008). |
| 6 | Performance & resource management | ☑ | No compilation caching/throttling; every gate call re-parses, rebuilds a compilation, and re-runs Roslyn — `FindViolations` builds a fresh `CSharpCompilation` against the *full framework* reference set on every call (ScriptAnalysis-006). |
| 7 | Design-document adherence | ☑ | `ParseDiagnostics`/`Compile` drop warnings though doc says "errors and warnings" (ScriptAnalysis-004); `AnalysisReferences` (full-framework set) is not described in REQ-SA-2, which names `DefaultReferences` instead (ScriptAnalysis-007). |
| 8 | Code organization & conventions | ☑ | No issues found — POCO/static layout is clean, namespace correct, depends only on Commons + Roslyn. `ScriptCompileSurface` lives in ScriptAnalysis (not Commons) — acceptable as it mirrors a runtime surface and is consumed here. |
| 9 | Testing coverage | ☑ | Good adversarial reject/allow corpus, but missing: TPA-fallback degradation, extension methods, verbatim/Unicode-escape identifiers, `unsafe`/pointer, comment/trivia, and `AnalysisReferences`/exception paths (ScriptAnalysis-003). |
| 10 | Documentation & comments | ☑ | `FindViolations` XML-doc claim "Forbidden references are NOT added here" is contradicted by the Central UI consumer forwarding `compilation.References` (ScriptAnalysis-007, doc half). Otherwise thorough and accurate. |
## Findings
### ScriptAnalysis-001 — Semantic trust pass silently degrades to spelling-dependent when `TRUSTED_PLATFORM_ASSEMBLIES` is unavailable
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustPolicy.cs:146-184`; interacts with `ScriptTrustValidator.cs:111-136`, `:195-213` |
**Description**
`AnalysisReferences` is the reference set that lets Pass 1 (the semantic symbol pass —
the *authoritative* guard) resolve every type a script names to its true namespace.
It is built from `AppContext.GetData("TRUSTED_PLATFORM_ASSEMBLIES")`. When that value
is absent — a single-file deployment, an AOT/trimmed host, or any runtime that does
not publish the TPA list — `BuildAnalysisReferences` falls back to the 5-assembly
`DefaultReferences` set (`ScriptTrustPolicy.cs:183`) **silently** (the per-assembly
`catch { }` at `:165`/`:178` and the `Count > 0` fallback at `:183` emit nothing).
After that fallback, a *bare* forbidden type that lives inside an *allowed* namespace
no longer resolves. The policy's own documented example is exactly this case:
`using System.Diagnostics; var p = Process.Start("x");`. `System.Diagnostics` is an
allowed namespace (so `VisitUsingDirective` does not flag the import), and bare
`Process` is an `IdentifierName`. With `Process.dll` absent from the minimal
reference set, `model.GetSymbolInfo` returns no symbol, so Pass 1 falls to
`GetSyntacticScopes`, which returns an empty scope list for a bare (dotless)
identifier (`ScriptTrustValidator.cs:210-212`). Pass 2's `HardeningWalker` only flags
forbidden *dotted* names and forbidden `using` imports — never a bare identifier
token (`VisitIdentifierName`, `:373-395`, only checks `dynamic`/`Activator`). The
forbidden `Process.Start` therefore slips the trust validator entirely. The same
degradation reopens the broader "forbidden type in an allowed namespace reached as a
bare identifier" class that the full-framework reference set was specifically added
(commit `06975720`, M3.6) to close.
This is rated High not Critical because: (a) the production hosts (Akka.NET cluster,
docker topology) run on a normal framework-dependent runtime that publishes the TPA
list, so the full set is loaded; and (b) for two of the four consumers (Template
Engine, Site Runtime) the downstream real `Compile` gate against a curated reference
set independently rejects an undefined `Process` symbol. But Inbound API and the
Central UI run gate rely on `FindViolations` for the reflection/namespace verdict,
and a deployment-mode change that drops the TPA list would silently weaken the
boundary with no diagnostic.
**Recommendation**
Make the degradation loud and fail-safe: when the TPA list is unavailable, log a
warning (or expose a diagnostic flag the consumers can surface), and — more
importantly — add the BCL assemblies that host the forbidden-type-in-allowed-namespace
cases (at minimum `System.Diagnostics.Process`) to `DefaultAssemblies` so they resolve
even in the minimal fallback. Alternatively, extend Pass 2 to flag bare identifiers
whose name matches the simple name of a known forbidden type (e.g. `Process`,
`WebClient`), independent of symbol resolution, so the verdict never depends solely on
the reference set being complete.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the TPA-absent fallback now folds the
assemblies hosting the forbidden anchor types (Process, Socket, File, Thread, Assembly,
Marshal) into the minimal validation reference set so a bare forbidden type still
resolves and is flagged; an `AnalysisReferencesDegraded` flag + a `Trace.TraceWarning`
make the degraded mode observable. CRITICAL: anchors are added ONLY to the validation
`AnalysisReferences`, never to `DefaultReferences` — keeping the compile gate from being
able to bind `Process`. Verified by a TPA-fallback test that a bare `Process.Start` is
still rejected.
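The fallback condition described above can be sketched as follows. This is a minimal illustration, not the module's code: `TpaProbe` and its members are hypothetical names, and the only real API relied on is `AppContext.GetData("TRUSTED_PLATFORM_ASSEMBLIES")`, which returns a path-separator-delimited string on hosts that publish the list and `null` otherwise.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper (not the reviewed module's code): reports whether the host
// publishes the trusted-platform-assemblies (TPA) list.
public static class TpaProbe
{
    // Splits a TPA-style string into non-empty assembly paths.
    public static string[] SplitPaths(string? tpa) =>
        string.IsNullOrWhiteSpace(tpa)
            ? Array.Empty<string>()
            : tpa.Split(Path.PathSeparator, StringSplitOptions.RemoveEmptyEntries);

    // True when no TPA entries are available, i.e. a fallback reference set is needed.
    public static bool IsDegraded(string? tpa) => SplitPaths(tpa).Length == 0;

    // Reads the real value from the running host; null when the host does not set it.
    public static string? ReadFromHost() =>
        AppContext.GetData("TRUSTED_PLATFORM_ASSEMBLIES") as string;

    // Emits a trace warning instead of degrading silently, and returns the flag
    // so a caller can surface it as a diagnostic.
    public static bool WarnIfDegraded()
    {
        bool degraded = IsDegraded(ReadFromHost());
        if (degraded)
            Trace.TraceWarning("TPA list unavailable; analysis uses a fallback reference set.");
        return degraded;
    }
}
```

Keeping the split and the decision separate from the host read makes the degraded branch testable without changing the deployment mode, which is the kind of seam the TPA-fallback test needs.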
### ScriptAnalysis-002 — Pass 2 (syntactic hardening) is spelling-dependent and cannot catch a bare forbidden identifier on its own
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:289-395`; `:195-213` |
**Description**
Pass 2 (`HardeningWalker`) is purely textual: `VisitUsingDirective` flags forbidden
`using` imports (`:289-295`), `VisitQualifiedName`/`VisitMemberAccessExpression` flag
forbidden *dotted* names (`:298-369`), and `VisitIdentifierName` flags only the bare
identifiers `dynamic`/`Activator` (`:373-395`). It cannot independently flag any other
bare forbidden identifier — it relies entirely on Pass 1's semantic resolution for
that. The two passes are therefore not the "two independent guards" the design implies
for the bare-identifier case: for an unresolved bare identifier (see
ScriptAnalysis-001) **neither** pass fires. `GetSyntacticScopes` deliberately returns
nothing for dotless names (`:210-212`, "bare local identifiers cannot reach a forbidden
namespace") — which is true only while Pass 1 resolves them; once resolution fails the
assumption breaks.
This is the structural root of ScriptAnalysis-001 and is recorded separately because
it is a design property (the syntactic pass is not self-sufficient against bare names)
that should be acknowledged even after the TPA-fallback hole is closed: the defence is
single-layered, not double-layered, for the bare-identifier vector.
**Recommendation**
Document explicitly that Pass 1 is the sole guard for bare (single-identifier)
forbidden references and that Pass 2 only backstops *spelled* forbidden namespaces. If
genuine defence-in-depth is wanted for bare names, add a simple-name deny-set to Pass 2
(the simple names of the forbidden root types — `Process`, `WebClient`, `Marshal`,
`Registry`, `Thread`, `File`, `Directory`, etc.), accepting the small false-positive
risk on a script's own identically-named local, which a security gate should tolerate.
**Resolution**
Deferred 2026-06-20: strengthening Pass-2 (syntactic) so a bare forbidden identifier is
caught without Pass-1 symbol resolution is a defence-in-depth enhancement; the
production path (TPA present, two-layer compile gate on most call sites) is sound.
Recorded for a follow-up.
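For reference, the simple-name deny-set from the recommendation could look roughly like the sketch below. It is an assumption-laden illustration: `BareNameWalker` and its `Denied` set are hypothetical (the real `HardeningWalker` is not reproduced in this review), and it needs the `Microsoft.CodeAnalysis.CSharp` package. It flags by spelling only, so it would also flag a script's own local or member that happens to share one of these names.

```csharp
using System.Collections.Generic;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Illustrative only: a syntactic walker that flags bare identifiers by simple name,
// independent of symbol resolution.
public sealed class BareNameWalker : CSharpSyntaxWalker
{
    // Hypothetical deny-set built from the simple names listed in the recommendation.
    private static readonly HashSet<string> Denied = new()
    {
        "Process", "WebClient", "Marshal", "Registry", "Thread", "File", "Directory"
    };

    public List<string> Hits { get; } = new();

    public override void VisitIdentifierName(IdentifierNameSyntax node)
    {
        // ValueText is the decoded identifier text, so @File and Unicode-escaped
        // spellings compare equal to the plain name.
        if (Denied.Contains(node.Identifier.ValueText))
            Hits.Add(node.Identifier.ValueText);
        base.VisitIdentifierName(node);
    }

    // Parses the source as a C# script and returns every denied simple name found.
    public static IReadOnlyList<string> Scan(string code)
    {
        var options = new CSharpParseOptions(kind: SourceCodeKind.Script);
        var tree = CSharpSyntaxTree.ParseText(code, options);
        var walker = new BareNameWalker();
        walker.Visit(tree.GetRoot());
        return walker.Hits;
    }
}
```

Because the check never consults the semantic model, its result does not change with the reference set, which is the property Pass 2 currently lacks for bare names.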
### ScriptAnalysis-003 — Adversarial test corpus has bypass-relevant gaps
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.ScriptAnalysis.Tests/ScriptTrustValidatorTests.cs`; `RoslynScriptCompilerTests.cs` |
**Description**
The reject/allow corpus is solid for the spelled-namespace, alias, `using static`,
`global::`, reflection-gateway, generic-argument, and nested-lambda vectors. For a
security-critical trust boundary, however, several bypass-relevant cases are untested:
- **TPA-fallback degradation** (ScriptAnalysis-001) — no test forces
`AnalysisReferences` onto the minimal set and asserts that bare `Process`/`WebClient`
is still rejected. This is the highest-value missing test.
- **Extension methods** — no test that a forbidden-namespace extension method invoked
in receiver-position (`x.SomeForbiddenExt()`) is resolved and flagged.
- **Verbatim / Unicode-escaped identifiers** (`@dynamic`, and Unicode-escaped spellings
of `dynamic`, `Activator`, or `GetType`) are not exercised; `VisitIdentifierName` compares
`Identifier.ValueText`, which decodes Unicode escapes (so it likely holds), but the
case is not pinned, and `dynamic` is matched via `text == "dynamic"` (`:380`) which is
a separate path from the `ForbiddenIdentifiers` set.
- **`unsafe`/pointer/`stackalloc`** — relies implicitly on `ScriptOptions` defaulting
`AllowUnsafe=false`; not asserted.
- **Comment/trivia and string-literal payloads** — no test that a forbidden namespace
inside a comment or string is *not* flagged (false-positive guard) and that a
forbidden token reconstructed only at runtime from strings is correctly *not* the
validator's concern (documents the sandbox caveat).
**Recommendation**
Add the cases above, prioritising the TPA-fallback test (inject a minimal-reference
build of `AnalysisReferences` or factor the reference set so a test can force the
fallback) and the extension-method test. Add a verbatim/Unicode-escape identifier
reject test for `dynamic`/`Activator`/`GetType`.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added 18 adversarial tests — TPA-fallback
bare-Process/Socket, extension-method invocation, verbatim/Unicode-escaped identifiers,
`unsafe` blocks (benign vs forbidden-API-inside), and comment/string-literal
false-positive guards.
### ScriptAnalysis-004 — `ParseDiagnostics`/`Compile` drop warnings though the design doc promises "errors and warnings"
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/RoslynScriptCompiler.cs:30-33`, `:73-76` |
**Description**
REQ-SA-3 states `Compile` returns the diagnostics "filtered to errors and warnings"
and `ParseDiagnostics` returns "syntax-level diagnostics (errors and warnings)". The
code filters `d.Severity == DiagnosticSeverity.Error` only in both methods
(`RoslynScriptCompiler.cs:31` and `:74`), so warnings are silently dropped. Filtering
to errors-only is arguably the safer choice for a compile *gate* (a warning should not
block a deploy), but the code and the spec disagree, and a future maintainer reading
the doc will expect warnings in the returned list.
**Recommendation**
Reconcile: either change the doc (REQ-SA-3) to say "errors only", or include warnings
and let callers decide. Given the gate semantics, updating the doc to "error-severity
diagnostics" is the cleaner fix.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): corrected the design doc (REQ-SA-3) to
'error-severity only', matching the compile wrapper's existing behaviour (a warning must
not block a deploy gate). Code unchanged.
### ScriptAnalysis-005 — `nameof(forbiddenType)` flagging is inconsistent between passes and across resolution states
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:42-48`, `:99-136`, `:298-325` |
**Description**
The class comment (`:42-48`) states that a forbidden type inside `nameof(...)` is
"deliberately flagged" as a conservative fail-safe, and `Rejects_NameOf_ForbiddenType`
pins `nameof(System.IO.File)`. However the flagging is incidental, not deliberate, and
inconsistent: `nameof(System.IO.File)` is caught because the inner `System.IO.File` is
a `QualifiedNameSyntax`/member access that both passes inspect directly — the
`nameof` wrapper is irrelevant. A bare `nameof(Process)` (after `using
System.Diagnostics;`, with full TPA loaded) resolves `Process` to a real symbol and
*is* flagged by Pass 1; but the same `nameof(Process)` under the minimal-reference
fallback resolves to nothing and is *not* flagged. So the "deliberate fail-safe"
behaviour actually varies with the reference set, and `nameof` — which produces only a
compile-time string and reaches no API — is a genuine false positive that will reject a
legitimate script doing `nameof(System.Net.IPAddress)` for a log message.
**Recommendation**
Decide the intended semantics explicitly. If `nameof` should be allowed (it reaches no
runtime API), special-case an `InvocationExpression` whose callee is the `nameof` identifier (C# has no dedicated `nameof` syntax node) to
suppress the inner-reference report. If it should be rejected, make that uniform
regardless of resolution state (tie to a simple-name check), and keep the test. Either
way, align the comment with the actual, deterministic behaviour.
**Resolution**
Deferred 2026-06-20: the `nameof(forbiddenType)` over-flagging is a documented
deliberate fail-safe; whether `nameof` (which reaches no API) should be allowed vs
uniformly rejected is a design-owner decision. Recorded; no behaviour change this pass.
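If the design owner opts to allow `nameof`, the suppression check could be sketched as below. `NameOfProbe` and both of its methods are hypothetical names, and the sketch assumes the `Microsoft.CodeAnalysis.CSharp` package. In C#, `nameof(x)` parses as an ordinary invocation whose callee is the identifier `nameof`, so the check is an ancestor walk.

```csharp
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Hypothetical helper, not part of the reviewed module.
public static class NameOfProbe
{
    // True when the node sits anywhere inside an invocation spelled nameof(...).
    // Purely syntactic: a user-defined method called "nameof" would also match, so a
    // real gate would confirm with the semantic model before suppressing a report.
    public static bool IsInsideNameOf(SyntaxNode node) =>
        node.Ancestors()
            .OfType<InvocationExpressionSyntax>()
            .Any(inv => inv.Expression is IdentifierNameSyntax id
                        && id.Identifier.ValueText == "nameof");

    // Counts identifier occurrences of simpleName that are NOT wrapped in nameof(...).
    public static int CountOutsideNameOf(string code, string simpleName)
    {
        var options = new CSharpParseOptions(kind: SourceCodeKind.Script);
        var root = CSharpSyntaxTree.ParseText(code, options).GetRoot();
        return root.DescendantNodes()
            .OfType<IdentifierNameSyntax>()
            .Count(id => id.Identifier.ValueText == simpleName && !IsInsideNameOf(id));
    }
}
```

Because the test is syntactic, the outcome is the same whether or not the inner type resolves, which removes the dependence on the reference set that this finding describes.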
### ScriptAnalysis-006 — No caching or throttling of Roslyn compilations; full-framework compilation rebuilt per call
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:73-97`; `RoslynScriptCompiler.cs:50-71` |
**Description**
`FindViolations` parses the script and builds a fresh `CSharpCompilation` against the
**entire framework reference set** (`AnalysisReferences`, potentially hundreds of
`MetadataReference`s) on every call (`:87-95`). `RoslynScriptCompiler.Compile`
similarly creates a new `CSharpScript` and calls `.Compile()` per call. Roslyn
compilation is the expensive part of these calls that still recurs. Loading the metadata
references, which is the main cost, is already cached (built once in
`AnalysisReferences`), so this is Low rather than Medium. Even so, for the Central UI
editor path (which can call the validator on every keystroke-debounced analysis) and for
batch template validation (many scripts per deploy), the absence of any per-source
memoisation or a bounded compilation cache means the same work is repeated. There is no
back-pressure or concurrency cap either; the Inbound API
review (InboundAPI-009) already noted uncapped recompilation downstream.
**Recommendation**
Consider a small bounded cache keyed on a hash of `(code, globalsType)` for `Compile`
results and on `code` for `FindViolations` verdicts (the verdict is a pure function of
the source and the static policy). At minimum, document that callers on hot paths
(editor live-analysis) should debounce and cache, and confirm the Central UI debounces
before calling.
**Resolution**
Deferred 2026-06-20: compilation caching/throttling (the full-framework
`CSharpCompilation` is rebuilt per `FindViolations` call) is a performance enhancement,
not a correctness defect. Recorded for a follow-up.
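A bounded verdict cache along the lines of the recommendation might look like the sketch below. It is only an illustration under stated assumptions: `BoundedVerdictCache<TVerdict>` is a hypothetical type, eviction is simple FIFO rather than LRU, and the key is a SHA-256 hash of the source alone, which holds only if the verdict is a pure function of the source (a caller that varies `extraReferences` or the globals type would need to fold those into the key).

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch: a small FIFO-bounded memo for pure functions of script source.
public sealed class BoundedVerdictCache<TVerdict>
{
    private readonly int _capacity;
    private readonly Dictionary<string, TVerdict> _entries = new();
    private readonly Queue<string> _order = new();
    private readonly object _gate = new();

    public BoundedVerdictCache(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    public int Count { get { lock (_gate) return _entries.Count; } }

    public TVerdict GetOrAdd(string code, Func<string, TVerdict> analyze)
    {
        // Hash the source so the cache never retains whole scripts as keys.
        string key = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(code)));
        lock (_gate)
        {
            if (_entries.TryGetValue(key, out var cached)) return cached;
        }

        // Analysis runs outside the lock so one slow script does not block others;
        // two concurrent callers may both analyse the same source once.
        TVerdict verdict = analyze(code);

        lock (_gate)
        {
            if (!_entries.ContainsKey(key))
            {
                // Evict the oldest entry when full (FIFO).
                if (_entries.Count >= _capacity) _entries.Remove(_order.Dequeue());
                _entries[key] = verdict;
                _order.Enqueue(key);
            }
        }
        return verdict;
    }
}
```

A caller would wrap the expensive call, for example `cache.GetOrAdd(code, c => ScriptTrustValidator.FindViolations(c))`, assuming that overload is the one in use. The cache bounds memory but does not add back-pressure, so the concurrency cap noted above would still be a separate change.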
### ScriptAnalysis-007 — `FindViolations` `extraReferences` XML-doc contradicts the Central UI consumer; `AnalysisReferences` undocumented in REQ-SA-2
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:57-64`; `ScriptTrustPolicy.cs:125-144`; `docs/requirements/Component-ScriptAnalysis.md:74-80` |
**Description**
Two doc/code mismatches, neither a vulnerability:
1. The `extraReferences` XML-doc on `FindViolations` (`:57-64`) says "Forbidden
references are NOT added here". But the Central UI run gate calls
`FindViolations(tree.ToString(), compilation.References)`
(`CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:1319-1320`), forwarding the full
BCL surface as `extraReferences`. This is in fact *safe* (the deny-list is by
namespace/type, so adding references can only improve resolution, never produce a
false allow — and the policy comment at `ScriptTrustPolicy.cs:138-142` says exactly
that), but the `FindViolations` doc as written implies the caller must not pass
forbidden assemblies, which is both unnecessary and contradicted in practice.
2. REQ-SA-2 in the design doc describes Pass 1 as using "the full trusted-platform
reference set from `ScriptTrustPolicy.DefaultReferences`". The code uses a separate
`ScriptTrustPolicy.AnalysisReferences` (the full framework), and `DefaultReferences`
is explicitly the *minimal* set used by `RoslynScriptCompiler`. The doc conflates the
two and never mentions `AnalysisReferences` or the TPA mechanism — material because
that mechanism is the heart of the security verdict (and of ScriptAnalysis-001).
**Recommendation**
Rewrite the `extraReferences` doc to state that extra references only widen symbol
resolution and never whitelist a forbidden API. Update REQ-SA-2 to describe
`AnalysisReferences` (full-framework, TPA-sourced) versus `DefaultReferences` (minimal,
runtime-fidelity), and note the TPA-fallback behaviour.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): corrected the `FindViolations`
`extraReferences` XML doc (extra references only widen resolution, never whitelist) and
documented `AnalysisReferences` vs `DefaultReferences` in REQ-SA-2.
### ScriptAnalysis-008 — Static trust gate is defence-in-depth, not a sandbox; correctly documented but worth recording as a standing risk
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Won't Fix |
| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:34-40`; `docs/requirements/Component-ScriptAnalysis.md:151` |
**Description**
The component is explicit (both in code comments and REQ-SA-5) that it is a
compile-time deny-list, not a runtime sandbox: scripts execute in-process, and a
sufficiently determined script could reconstruct a forbidden API at runtime via paths
the static walker cannot see (e.g. building type names from runtime strings, or any
gap in the deny-list). The reflection-gateway hardening narrows this, and the
`dynamic`/`Activator` bans close the obvious late-binding routes, but the residual risk
is real and inherent to in-process execution of partially-trusted C#. This is recorded
(not as a defect of this module, which behaves as designed) so the standing risk is
visible in the audit trail and is revisited if/when the deferred runtime boundary
(restricted `AssemblyLoadContext` / curated reference set / out-of-process sandbox) is
scoped.
**Recommendation**
Keep the caveat prominent. Track the runtime-boundary work as a roadmap item; in the
meantime ensure the curated *execution* reference sets (SiteRuntime `ScriptAssemblies`,
`RoslynScriptCompiler` `DefaultReferences`) stay minimal so that even a deny-list miss
cannot bind a forbidden type at execution time (this is the real second layer of
defence for the compile-running consumers).
**Resolution**
Won't Fix 2026-06-20: the static gate is intentionally defence-in-depth, not a runtime
sandbox — this is the explicitly-documented design posture, not a defect. Recorded as a
standing risk for visibility.
+222 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.Security` | | Module | `src/ZB.MOM.WW.ScadaBridge.Security` |
| Design doc | `docs/requirements/Component-Security.md` | | Design doc | `docs/requirements/Component-Security.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 0 (1 deferred — Security-008) | | Open findings | 0 (2 deferred — Security-008, Security-022) |
## Summary ## Summary
@@ -78,6 +78,50 @@ values silently surface only on first login (Security-020); and the
`RequireHttpsCookie=false` dev opt-out emits no warning, so an HTTP production `RequireHttpsCookie=false` dev opt-out emits no warning, so an HTTP production
deployment silently transmits the JWT bearer credential in cleartext (Security-021). deployment silently transmits the JWT bearer credential in cleartext (Security-021).
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Major architectural change since `1eb6e97`: the project was renamed (ScadaLink →
ZB.MOM.WW.ScadaBridge) and the bespoke `LdapAuthService`/`LdapTransport`/`LdapAuthResult`
were **cut over to the external `ZB.MOM.WW.Auth.Ldap` library** (commit `ac34dac4`); the
prior `SiteScopeAuthorizationHandler` was deleted (Security-017 resolution), and the role
model was canonicalised + collapsed (`Admin→Administrator`, `Design→Designer`,
`Deployment→Deployer`, `Audit→Administrator`, `AuditReadOnly→Viewer`; commit `b104760b`)
with M7 `Operator`/`Verifier` added (`a0ce8b6c`). The interactive cookie session now signs
in with **bare claims** via the shared `SessionClaimBuilder` and is policed by
`CookieSessionValidator` (idle-timeout + LDAP-free DB role-refresh, fail-closed, well
tested). All prior fixes that survived the cutover remain in place (key-length guard,
issuer/audience binding, `RoleMapper` union semantics, `Roles` constants, cookie hardening
via `ZbCookieDefaults`, the `RequireHttpsCookie` warning). Security-008 (N+1 scope-rule
query) is still correctly **Deferred**. This pass surfaced **4 new findings**
(Security-022..025): three Medium and one Low. The most consequential is **Security-022**:
the documented "cookie + JWT hybrid" no longer holds at HEAD: `/auth/token` still *mints* a
bearer JWT but **no code anywhere validates it** (CLI uses HTTP Basic Auth, Inbound API uses
the `sbk_` ApiKeys verifier, the cookie path uses bare claims), so `JwtTokenService`'s entire
`ValidateToken`/`RefreshToken`/`RecordActivity`/`ShouldRefresh`/`IsIdleTimedOut` surface is
dead code minting an orphaned credential. **Security-023**: `LdapGroupMapping.Role` accepts
arbitrary free strings via CLI/API with no canonical-set validation, and the Central UI
`RequireClaim` policies are case-sensitive while the ManagementActor authz check is
case-insensitive, so a mis-cased/typo'd mapping silently grants inconsistent access across
the two surfaces with no operator feedback. **Security-024** — the M7 `RequireOperator`/
`RequireVerifier` SoD policies have no functional `AuthorizeAsync` grant/deny test (only
constant-name assertions), unlike every other policy. **Security-025** (Low) — a unit test
makes a live network connect to `nonexistent.invalid:9999`.
_Re-review (2026-06-20, `4307c381`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `RoleMapper` union semantics (Security-016 fix) hold and read correctly. `LdapGroupMapping.Role` is an unvalidated free string + case-sensitive UI policy vs case-insensitive ManagementActor authz (Security-023). |
| 2 | Akka.NET conventions | ☑ | No actors. `AddSecurityActors` is still a placeholder. No issues. |
| 3 | Concurrency & thread safety | ☑ | Services stateless/DI-scoped; `CookieSessionValidator` reads injected `TimeProvider`. LDAP now in the external library. No shared mutable state. No issues. |
| 4 | Error handling & resilience | ☑ | `CookieSessionValidator` fails closed on idle (missing anchor = timed out) and swallows refresh faults keeping current roles (mirrors "LDAP failure: active sessions continue") — correct and well tested. `LdapAuthFailureMessages` maps the library's structured failure enum to stable user text. No issues. |
| 5 | Security | ☑ | Cookie hardening (HttpOnly/SameSite=Strict/Secure-when-required/sliding) comes from `ZbCookieDefaults` — verified correct. JWT mint binds issuer/audience, pins `MapInbound/OutboundClaims=false`, fails fast on short key. BUT the minted bearer JWT is validated by no consumer (Security-022). DisableLogin auto-login is gated at registration + loud warning + doc'd (acceptable). |
| 6 | Performance & resource management | ☑ | Security-008 N+1 remains correctly Deferred (gated on `ISecurityRepository`). `CookieSessionValidator` deliberately avoids re-minting on the no-refresh path. No new perf issues. |
| 7 | Design-document adherence | ☑ | The cookie+JWT-hybrid design is now cookie-with-bare-claims; the JWT half is orphaned/unvalidated (Security-022) — the M2.19 doc note acknowledges bare-claim cookies but the doc still presents JWT-in-cookie as the model and does not flag that `/auth/token`'s JWT is unconsumed. Operator/Verifier policies match the design. |
| 8 | Code organization & conventions | ☑ | `Roles`/`SecurityOptions` owned by the component; repository interface in Commons; additive `IGroupRoleMapper<string>` adapter. Dead JWT methods (Security-022) are the main org smell. |
| 9 | Testing coverage | ☑ | `CookieSessionValidator`, `SessionClaimBuilder`, `RoleMapper` union, cookie wiring, `SecurityOptionsValidator`, DisableLogin are well covered. Gaps: no functional authz test for `RequireOperator`/`RequireVerifier` (Security-024); a unit test does live network I/O (Security-025); no test for a non-canonical/mis-cased mapping role (Security-023). |
| 10 | Documentation & comments | ☑ | XML docs are thorough and current. `SecurityOptions` and `JwtTokenService` docs still describe an active JWT-refresh lifecycle that no longer has a caller (Security-022). |
## Checklist coverage ## Checklist coverage
| # | Category | Examined | Notes | | # | Category | Examined | Notes |
@@ -1008,3 +1052,178 @@ SecurityOptions:RequireHttpsCookie=true in production.")`. Optionally, also fail
startup when `RequireHttpsCookie=false` AND `ASPNETCORE_ENVIRONMENT=Production`. Add a startup when `RequireHttpsCookie=false` AND `ASPNETCORE_ENVIRONMENT=Production`. Add a
regression test that asserts the warning is emitted when the flag is disabled and not regression test that asserts the warning is emitted when the flag is disabled and not
when it is enabled. when it is enabled.
### Security-022 — `/auth/token` mints a bearer JWT that no consumer validates; the entire `JwtTokenService` validate/refresh surface is dead code
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Security/JwtTokenService.cs:149`, `:194`, `:209`, `:239`, `:272`; `src/ZB.MOM.WW.ScadaBridge.CentralUI/Auth/AuthEndpoints.cs:99-148` |
**Description**
The design (Component-Security.md "Cookie + JWT Hybrid", CLAUDE.md "cookie+JWT hybrid …
HMAC-SHA256, 15-minute expiry w/ sliding refresh") presents a cookie-embedded JWT as the
session model. At HEAD this is no longer how the interactive session works (the M2.19 note
acknowledges the cookie now carries **bare claims** policed by `CookieSessionValidator`),
and — more importantly — the JWT half is now an **orphaned, unconsumed credential**. A grep
across `src/` shows `JwtTokenService.ValidateToken`, `RefreshToken`, `RecordActivity`, and
`ShouldRefresh` have **zero production callers**; only `GenerateToken` is called, solely by
`POST /auth/token` (`AuthEndpoints.cs:134`). That endpoint returns an `access_token` to the
caller, but **no endpoint in the codebase accepts or validates that JWT** as a credential:
the CLI authenticates to the Management API with HTTP Basic Auth
(`CLI/ManagementHttpClient.cs:36`), the Inbound API authenticates with the external
`ZB.MOM.WW.Auth.ApiKeys` `sbk_<keyId>_<secret>` verifier, and the cookie pipeline uses
`SessionClaimBuilder` bare claims (never `ValidateToken`). The result: `/auth/token` issues
HMAC-signed bearer tokens that grant access to nothing, and ~120 lines of carefully-hardened
issuer/audience/clock-skew/idle/refresh logic (the subject of resolved findings
Security-006/007/014) are dead code. The hazard is twofold: (a) the design doc materially
overstates the live session mechanism, misleading any future reader/auditor; and (b) an
orphaned token mint is a latent footgun — a future endpoint added under the assumption that
`/auth/token` produces a *usable* credential would wire a second, parallel auth path with
no test coverage proving it actually validates.
**Recommendation**
Decide the intent explicitly. Either (a) **wire the bearer path**: add
`AddAuthentication().AddJwtBearer(...)` (or a custom handler delegating to
`JwtTokenService.ValidateToken`) on whatever surface is meant to accept the token, and add
an end-to-end test that a `/auth/token`-issued JWT authorizes a protected request — then the
hybrid model the doc describes is real; or (b) **remove the dead code**: delete
`/auth/token` and `JwtTokenService.ValidateToken`/`RefreshToken`/`RecordActivity`/
`ShouldRefresh`/`IsIdleTimedOut` (keep only what the bare-claim cookie path needs, which is
none of `JwtTokenService`), and rewrite Component-Security.md's "Cookie + JWT Hybrid" /
"Token Lifecycle" sections to describe the actual bare-claim cookie + `CookieSessionValidator`
mechanism as the primary model rather than a footnote. Whichever is chosen, the design doc
and the code must agree on whether a usable JWT exists.
**Resolution**
Deferred 2026-06-20: this needs a product decision — either wire a JWT-bearer auth handler so the documented cookie+JWT hybrid is real, or delete `/auth/token` + the orphaned `JwtTokenService` validate/refresh surface and rewrite the design doc to describe the actual bare-claim cookie model. Recorded for that decision.
### Security-023 — `LdapGroupMapping.Role` is an unvalidated free string; case-sensitive UI policy vs case-insensitive server authz grants inconsistent access
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Security/AuthorizationPolicies.cs:144-157`; `src/ZB.MOM.WW.ScadaBridge.Security/RoleMapper.cs:40-42`; `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Security/LdapGroupMapping.cs:10`; `src/ZB.MOM.WW.ScadaBridge.ManagementService/ManagementActor.cs:106`, `:1893` |
**Description**
The role string an operator maps an LDAP group to (`LdapGroupMapping.Role`) is a free
`string` with no validation against the canonical `Roles.*` set anywhere on the write path:
`ManagementActor` (`:1893`) does `new LdapGroupMapping(cmd.LdapGroupName, cmd.Role)` verbatim,
and the CLI `security role-mapping create --role <value>`
(`CLI/Commands/SecurityCommands.cs:157`) accepts any string. The Central UI dropdown
(`LdapMappingForm.razor`) constrains the *UI* path to the canonical constants, but the CLI/API
path does not. Two enforcement surfaces then disagree on casing:
- **Central UI authorization** uses `policy.RequireClaim(JwtTokenService.RoleClaimType, Roles.Deployer)`
(`AuthorizationPolicies.cs:150`). ASP.NET Core's `RequireClaim` compares claim *values*
with ordinal (case-**sensitive**) equality.
- **ManagementActor authorization** uses `user.Roles.Contains(requiredRole, StringComparer.OrdinalIgnoreCase)`
(`ManagementActor.cs:106`) — case-**insensitive**.
- **`RoleMapper`** matches the Deployer scope branch with `StringComparison.OrdinalIgnoreCase`
(`RoleMapper.cs:42`) but stamps the **DB row's verbatim casing** into the role claim.
So a mapping created via CLI/API as `--role deployer` (lowercase)
produces a principal whose role claim is `"deployer"`: the **ManagementActor/CLI surface
authorizes it** (case-insensitive), the **Central UI `RequireDeployment` policy denies it**
(case-sensitive), and for the Deployer case `RoleMapper` still resolves site scope (matching
case-insensitively). The result is a user who can deploy via CLI but is locked out of the
equivalent UI pages, with no error explaining why. A pure typo (`"Deploer"`) silently grants
a role claim that *no* policy and no ManagementActor check matches: the user appears to hold
a role yet is denied everywhere, with no validation feedback at mapping-creation time.
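The casing disagreement described above can be reproduced with nothing but BCL types. The following is a minimal sketch, not the module's code: the claim type `"role"` and the two inline checks are stand-ins (assumptions) for `JwtTokenService.RoleClaimType`, the `RequireClaim` policy, and the ManagementActor check.

```csharp
using System;
using System.Linq;
using System.Security.Claims;

// Hypothetical stand-ins for JwtTokenService.RoleClaimType and Roles.Deployer.
const string RoleClaimType = "role";
const string CanonicalDeployer = "Deployer";

// A principal whose role claim carries the mapping row's verbatim (lowercase) casing.
var principal = new ClaimsPrincipal(
    new ClaimsIdentity(new[] { new Claim(RoleClaimType, "deployer") }, "Sketch"));

// UI-style check: claim values compared with ordinal (case-sensitive) equality.
bool uiAllows = principal.Claims.Any(c =>
    c.Type == RoleClaimType &&
    string.Equals(c.Value, CanonicalDeployer, StringComparison.Ordinal));

// Server-style check: the role list is searched case-insensitively.
var roles = principal.FindAll(RoleClaimType).Select(c => c.Value);
bool serverAllows = roles.Contains(CanonicalDeployer, StringComparer.OrdinalIgnoreCase);

Console.WriteLine($"UI allows: {uiAllows}, server allows: {serverAllows}");
```

Under these assumptions the same principal is denied by the case-sensitive check and admitted by the case-insensitive one, which is the split this finding records.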
**Recommendation**
Validate `cmd.Role` against the canonical `Roles.All` set (case-insensitively, normalising to
the canonical casing) when creating/updating a mapping — in `ManagementActor` (the single
server-side write path) and reject non-canonical values with a clear error; optionally also
guard in the CLI for a fast local failure. Separately, make the two authorization surfaces
agree on casing: either normalise the role claim to canonical casing in `SessionClaimBuilder`/
`RoleMapper` before it is stamped, or make the ManagementActor check case-sensitive to match
`RequireClaim` (the safer direction once values are validated). Add a regression test for a
mis-cased mapping role.
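A minimal sketch of the validate-and-normalise step recommended above, assuming a six-entry canonical set inferred from the role model described in this review. The real list is `Roles.All`, and `TryNormalizeRole` is a hypothetical helper name, not an existing API.

```csharp
using System;
using System.Linq;

// Assumed canonical role names; the authoritative list is Roles.All.
string[] canonicalRoles =
    { "Administrator", "Designer", "Deployer", "Viewer", "Operator", "Verifier" };

// Returns true and the canonical casing when the input names a known role;
// returns false (and an empty string) for anything else, including typos.
bool TryNormalizeRole(string input, out string canonical)
{
    canonical = canonicalRoles.FirstOrDefault(r =>
        string.Equals(r, input?.Trim(), StringComparison.OrdinalIgnoreCase)) ?? string.Empty;
    return canonical.Length > 0;
}

Console.WriteLine(TryNormalizeRole("deployer", out var fixedCasing) ? fixedCasing : "rejected");
Console.WriteLine(TryNormalizeRole("Deploer", out var typo) ? typo : "rejected");
```

Storing the normalised value rather than the raw input would mean both authorization surfaces see the same string regardless of which comparison they use.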
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added membership validation rejecting any `cmd.Role` not in `Roles.All` (canonical set, case-insensitive) at LDAP-group-mapping create AND update, before any DB write — closing the unvalidated-free-string gap. The casing/comparison-asymmetry half is a larger change and remains a recorded follow-up (membership check is safe: a non-canonical role never worked). No existing test regressed.
### Security-024 — `RequireOperator` / `RequireVerifier` policies have no functional authorization test
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Security/AuthorizationPolicies.cs:153-157`; `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/SecurityTests.cs:1045-1186` (AuthorizationPolicyTests); `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/RolesTests.cs:44-45` |
**Description**
The M7 two-person Secured Write feature added the `RequireOperator` and `RequireVerifier`
policies (`AuthorizationPolicies.cs:153-157`), each a single-role `RequireClaim`. These gate a
SCADA control-surface write workflow whose entire safety argument is separation of duties, so
correct grant/deny behaviour is security-critical. But `AuthorizationPolicyTests` — which
exercises every *other* policy with real `IAuthorizationService.AuthorizeAsync` evaluations
(Admin/Design/Deployment/OperationalAudit/AuditExport, including the load-bearing
"Viewer reads but cannot export" SoD case) — has **no** Operator/Verifier test. The only
coverage is `RolesTests.cs:44-45`, which asserts the *constant string values*
(`"RequireOperator"`, `"RequireVerifier"`) — not that an `Operator` principal satisfies
`RequireOperator`, that a `Verifier` does not, or that the two policies are mutually distinct.
A regression that, say, mapped `RequireOperator` to `Roles.Verifier` (or to
`OperationalAuditRoles`) would compile and pass the existing tests while silently collapsing
the separation of duties.
**Recommendation**
Add `AuthorizeAsync`-based `[Theory]` cases to `AuthorizationPolicyTests` mirroring the
existing policy tests: `RequireOperator` succeeds for `Roles.Operator` and fails for
`Roles.Verifier`/`Administrator`/empty; `RequireVerifier` succeeds for `Roles.Verifier` and
fails for `Roles.Operator`; and a combined case asserting an Operator-only principal cannot
satisfy `RequireVerifier` (the SoD invariant at the policy layer).
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added functional `AuthorizeAsync` grant/deny tests for `RequireOperator`/`RequireVerifier` (and a combined SoD-distinctness test), guarding the separation-of-duties policy wiring against regression.
### Security-025 — Unit test performs a live network connection to `nonexistent.invalid:9999`
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/SecurityTests.cs:71-89` (`AuthenticateAsync_ConnectionFailure_FailsClosed_NeverThrows`) |
**Description**
`LdapAuthServiceTests.AuthenticateAsync_ConnectionFailure_FailsClosed_NeverThrows` constructs
the external library's `LdapAuthService` pointed at `Server = "nonexistent.invalid"`,
`Port = 9999` with a 2-second `ConnectionTimeoutMs`, then asserts the result fails closed.
Although the test relies on the connection *failing*, it still performs a real DNS resolution
and TCP connect attempt from the unit-test process. In an offline or network-sandboxed CI
environment the resolution/connect can behave differently (immediate failure vs. the full
2-second timeout vs. a captive-portal redirect), making the test both **slow** (up to ~2s of
wall-clock dead time) and **environment-dependent**. Unit tests in this suite are otherwise
network-free (the sibling cases are explicitly named `_NoNetwork`). The behaviour under test
(fail-closed mapping of an unreachable directory) is really the external library's
responsibility and is, per the file's own header comment, already covered by the library's own
test suite through its internal connection seam.
**Recommendation**
Drop this case from the ScadaBridge unit suite (the library owns the fail-closed contract via
its internal `ILdapConnection` seam, as the region comment notes), or move it to an explicitly
opt-in integration-test category so it does not run in the default network-free unit pass. If
kept as a smoke test, point it at a loopback address with no listener (e.g.
`127.0.0.1` on a closed port) so the connect is refused locally and deterministically, rather than relying on
DNS resolution of `.invalid`.
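The loopback variant can be illustrated at the socket level. This sketch uses only `System.Net.Sockets` and does not touch the external library's `LdapAuthService`; the "reserve a port, then release it" trick is an assumption about how a test might obtain a closed port, not a description of the committed fix.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Reserve a free loopback port, then release it so nothing is listening there.
var probe = new TcpListener(IPAddress.Loopback, 0);
probe.Start();
int closedPort = ((IPEndPoint)probe.LocalEndpoint).Port;
probe.Stop();

bool refused = false;
try
{
    using var client = new TcpClient();
    client.Connect(IPAddress.Loopback, closedPort); // no DNS lookup involved
}
catch (SocketException)
{
    // Expected: the local stack rejects the connect because no one listens.
    refused = true;
}

Console.WriteLine($"Connect to 127.0.0.1:{closedPort} refused: {refused}");
```

The failure here comes from the local network stack, so it does not depend on a resolver or an external route being present.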
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the fail-closed connection test now targets `127.0.0.1` on a closed port instead of `nonexistent.invalid:9999` — deterministic loopback connection-refused, no external DNS/timeout dependency. Same assertion preserved.
+238 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` | | Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` |
| Design doc | `docs/requirements/Component-SiteCallAudit.md` | | Design doc | `docs/requirements/Component-SiteCallAudit.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 2 | | Open findings | 0 |
## Summary ## Summary
@@ -42,6 +42,23 @@ tests using a shared `MsSqlMigrationFixture`.
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). | | 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
| 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). | | 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Since the `1eb6e97` baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A: `OnReconciliationTickAsync` → `ReconcileSiteAsync`, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B: `OnPurgeTickAsync` → `PurgeTerminalAsync`, continue-on-error). All of the new code is clean, idiomatic, and well-tested (dedicated `SiteCallAuditReconciliationTests`, `SiteCallAuditPurgeTests`, `SiteCallRelayTests`, KPI-sample and options suites); the M6 `SiteCallAuditKpiSampleSource` correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one. The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present; currently masked because `Program.cs` co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores `MoreAvailable` so a single-timestamp batch saturation re-pulls without progress). No correctness, concurrency, security, or resilience defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps every edge case safe-by-construction (wasted work, never corruption).
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores `MoreAvailable`; a single-`UpdatedAtUtc` batch saturation re-pulls without progress (Finding 009 — wasted work only, idempotent upsert prevents corruption). Finding 003 (`IngestedAtUtc`) resolved — now stamped in `OnUpsertAsync` and `ReconcileSiteAsync`. |
| 2 | Akka.NET conventions | Yes | `Sender` captured before first await on every async handler; `PipeTo` used for all read/relay replies; self-ticks scheduled via `ScheduleTellRepeatedlyCancelable` with `Self` sender, cancelled in `PostStop`. No issues found. |
| 3 | Concurrency & thread safety | Yes | `_reconciliationCursors`, `_centralCommunication`, timers all mutated only on the actor thread. Per-tick/per-message DI scope (`CreateAsyncScope` for the async tick paths, `CreateScope` for sync read paths) disposed in `finally` / `await using`. No issues found. |
| 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies `Accepted=false`. Deliberate coarse-retry divergence from `SiteAuditReconciliationActor` (no per-row abandon) is documented and safe given idempotent upsert. No issues found. |
| 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond `SourceSite` (a site id). No issues found. |
| 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; `MaxPageSize=200` clamp; async DbContext disposal off the dispatcher. `MoreAvailable`-ignoring re-pull is bounded wasted work (Finding 009). |
| 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). |
| 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository — a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). `KpiSampleSource` registered via `TryAddEnumerable`; options owned by the component. |
| 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (`MoreAvailable: true`) or the single-timestamp no-progress edge — every reconciliation test uses `MoreAvailable: false` (relates to Finding 009). |
| 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. Reconciliation `ReconcileSiteAsync` XML claims parity with `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling reads it for stalled detection; this actor ignores it) — Finding 009 recommendation includes a doc correction. |
## Findings ## Findings
### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced ### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
@@ -323,3 +340,221 @@ Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly` stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
page-size-1 query; (c) asserts each page returns exactly the stuck row, with page-size-1 query; (c) asserts each page returns exactly the stuck row, with
no overlap and all 3 stuck rows visited. no overlap and all 3 stuck rows visited.
### SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227` |
**Description**
`StartPurgeTimer` (lines 307-321) gates the daily terminal-row purge tick on
the **reconciliation** collaborators being present:
```csharp
private void StartPurgeTimer()
{
if (_pullClient is null || _siteEnumerator is null)
{
return;
}
// ... schedule PurgeTick ...
}
```
But the purge pass (`OnPurgeTickAsync` → `PurgeWithRepositoryAsync` →
`ISiteCallAuditRepository.PurgeTerminalAsync`) needs only the repository — it
has no dependency on `IPullSiteCallsClient` / `ISiteEnumerator`. The two
collaborators are resolved by the production constructor (lines 202-227) via
`serviceProvider.GetService<IPullSiteCallsClient>()` /
`GetService<ISiteEnumerator>()` — both registered by
`AddAuditLogCentralReconciliationClient`
(`AuditLog/ServiceCollectionExtensions.cs:473,523`, registered as `ISiteEnumerator`
and `IPullSiteCallsClient`). `GetService` (not `GetRequiredService`) returns
`null` if that helper was never called, so a host that registers
`AddSiteCallAudit()` **without** also calling
`AddAuditLogCentralReconciliationClient(...)` constructs the actor with both
collaborators null. In `PreStart` both `StartReconciliationTimer` and
`StartPurgeTimer` early-return, so the actor runs forever with **no purge timer
at all** → unbounded growth of the central `SiteCalls` table, with no log line
to say the purge was skipped.
This is **currently latent, not live**: `Program.cs` co-registers
`AddAuditLogCentralReconciliationClient(builder.Configuration)` (line 107)
immediately before `AddSiteCallAudit()` (line 113), so on the real central node
both collaborators resolve and both timers run today. The risk is a future host
(or a refactor that splits the reconciliation client out of the central
composition root, or a test/embedded host that wants only ingest + purge)
silently losing the purge with no diagnostic. The gate exists for a legitimate
reason — keeping the repo-only test ctor free of *both* background timers so the
MSSQL read/upsert tests see no scheduled side effects — but it conflates "no
reconciliation route" with "no purge", and the actor's own XML
(lines 36-38) documents the coupling as deliberate rather than flagging it as a
hazard.
**Recommendation**
Decouple `StartPurgeTimer` from the reconciliation collaborators — purge needs
only a repository, which both the production and the repo-only test ctors
always have. Two viable shapes:
- Preferred: gate the purge timer on its own real precondition (a repository is
always available, so arm it unconditionally in the production + reconciliation
ctors; keep it off only in the repo-only test ctor via an explicit
"background timers off" flag rather than by proxy of the reconciliation
collaborators). This keeps the MSSQL test isolation while removing the
accidental coupling.
- At minimum: log a `Warning` in `StartPurgeTimer` (and `StartReconciliationTimer`)
when the timer is **not** armed on the production path, so a misconfigured host
surfaces "SiteCallAudit purge timer not started — `PurgeTerminalAsync` will
never run" instead of growing the table silently.
Severity is a judgment call: Medium because the consequence (unbounded central
table growth) is real and silent, but it is latent today (the only production
composition root co-registers the reconciliation client), so an argument for Low
is defensible. Flagged Medium to weight the silent-failure + data-growth aspect.
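The preferred shape in the recommendation above can be sketched as a pure decision function. `PlanTimers` and its parameter names are hypothetical, chosen for illustration; the actor's real fields and logging calls are not shown here.

```csharp
using System;

// Hypothetical helper: decides which timers to arm and what to warn about.
(bool ArmPurge, bool ArmReconciliation, string Warning) PlanTimers(
    bool backgroundTimersEnabled, bool hasPullClient, bool hasSiteEnumerator)
{
    // Repo-only test construction: no background timers at all.
    if (!backgroundTimersEnabled)
        return (false, false, null);

    // Purge needs only the repository, so it is armed whenever timers are on.
    bool canReconcile = hasPullClient && hasSiteEnumerator;
    string warning = canReconcile
        ? null
        : "Reconciliation collaborators missing; reconciliation timer not started.";
    return (true, canReconcile, warning);
}

var plan = PlanTimers(backgroundTimersEnabled: true, hasPullClient: false, hasSiteEnumerator: false);
Console.WriteLine($"purge={plan.ArmPurge}, reconcile={plan.ArmReconciliation}, warning={plan.Warning}");
```

Keeping the gate explicit means the test constructor opts out by flag, while a production host that lacks the reconciliation client still purges and leaves a diagnostic behind.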
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): decoupled the daily purge timer from the reconciliation collaborators (a new `_backgroundTimersEnabled` flag) so a central node that omits the reconciliation client still purges — no more silent unbounded `SiteCalls` growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client).
### SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `docs/requirements/Component-SiteCallAudit.md:149-154`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62` |
**Description**
The design doc (`Component-SiteCallAudit.md` lines 149-154, the KPI History
interaction) states:
> "… the resulting `buffered` / `parked` / `failedLastInterval` /
> `deliveredLastInterval` / `stuck` / `oldestPendingAgeSeconds` series render as
> trends on the Site Calls page via `KpiTrendChart`."
That lists **six** series as charted. In the code only **three** are charted:
- The public charted catalog `KpiMetrics.SiteCallAudit`
(`KpiMetrics.cs:52-62`) exposes exactly `Buffered`, `Parked`, and
`FailedLastInterval` — and its own XML says "Charted Site Call Audit (#22)
metrics … Rendered by the Central UI Site Calls report trend panel."
- `SiteCallAuditKpiSampleSource` (lines 42-47) keys those three off the public
Commons catalog (`KpiMetrics.SiteCallAudit.*`) and keeps the other three —
`deliveredLastInterval`, `stuck`, `oldestPendingAgeSeconds` — as **private**
string literals, with the comment "Charted metrics share the public Commons
catalog … the uncharted internal metrics stay private here (#178)."
- The UI confirms it: `SiteCalls/SiteCallsReport.razor.cs` calls
`LoadSeriesAsync` for exactly `KpiMetrics.SiteCallAudit.Buffered`, `.Parked`,
and `.FailedLastInterval` — three series, no more.
So all six metrics are *sampled* into the `KpiSample` history store, but only
three are *charted*. The doc reads as if all six are rendered, which is a small
design-doc drift (over-claiming the UI surface). The code is internally
consistent and correctly commented; the doc is the stale party.
**Recommendation**
One-line doc edit on `Component-SiteCallAudit.md` lines 149-154: state that the
three series `buffered` / `parked` / `failedLastInterval` render as trends on
the Site Calls page via `KpiTrendChart`, and that `deliveredLastInterval` /
`stuck` / `oldestPendingAgeSeconds` are **sampled into the KPI-history store but
not charted** (available for future trend panels / ad-hoc query). No code change
— the code is already the intended state.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the `KpiMetrics` catalog. Doc-only.
### SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores `MoreAvailable`; a single-timestamp batch saturation re-pulls without progress
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535` |
**Description**
`ReconcileSiteAsync` (lines 505-535) pulls rows since the per-site cursor,
upserts each, and advances the cursor to the maximum `UpdatedAtUtc` observed:
```csharp
var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue;
var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...);
var maxUpdated = since;
foreach (var row in response.SiteCalls)
{
await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc });
if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
}
_reconciliationCursors[site.SiteId] = maxUpdated;
```
Two interacting properties:
1. The pull is **inclusive**: `PullAsync(site, since, …)` asks for rows with
`UpdatedAtUtc >= since` (documented in the method's XML, lines 497-503), and
the cursor advances to the **max** `UpdatedAtUtc` seen, which is itself one
of the rows just pulled. So the boundary row is re-pulled every tick and
deduped by the idempotent monotonic upsert — intended, harmless.
2. `response.MoreAvailable` is **never read** at all. The `PullSiteCallsResponse`
carries `MoreAvailable` (`PullSiteCallsResponse.cs:14-17`: "True when the
site saturated the requested batch size — the caller should advance the
cursor and pull again"), but `ReconcileSiteAsync` ignores it and relies on
the natural tick cadence to drain the backlog over successive ticks.
The edge case: if a site has **more rows than the batch size all sharing one
exact `UpdatedAtUtc`** (e.g. a burst written in the same tick / same clock
value), the saturated batch returns rows whose max `UpdatedAtUtc` equals
`since`. `maxUpdated` therefore stays at `since`, the cursor does **not**
advance, and because the pull is inclusive the next tick re-pulls the identical
window — and again, and again — making no forward progress on that site's
backlog. Because the upsert is idempotent and the `SiteCalls` table is an
eventually-consistent mirror (not the source of truth), this is **wasted work,
never corruption** — but it is an unbounded re-pull loop on a pathological
input, and any rows in that backlog beyond the batch ceiling never get
reconciled.
The sibling `SiteAuditReconciliationActor` shares the inclusive-cursor /
max-timestamp shape, so the same single-timestamp-saturation no-progress edge
applies to it — but that sibling **does read `MoreAvailable`** (it feeds its
stalled-detection state machine, `SiteAuditReconciliationActor.cs:324-325`,
publishing `SiteAuditTelemetryStalledChanged` so a non-draining site surfaces a
health signal). `ReconcileSiteAsync`'s XML (lines 530-534) claims the
no-immediate-re-pull behaviour "match[es] `SiteAuditReconciliationActor`", which
is only half true: it matches the cursor cadence but diverges by dropping
`MoreAvailable` entirely, so this actor has neither a continuation pull nor a
stalled signal — a saturated site lags silently with no observability.
**Recommendation**
- Consume `MoreAvailable`: either continue draining within the same tick while
`MoreAvailable` is true (bounded by a max-iterations guard), or — matching the
sibling — surface a stalled/non-draining signal when a batch comes back
saturated so a stuck site is observable rather than silent.
- Defend the single-timestamp no-progress edge with a tiebreaker beyond the
timestamp (e.g. advance on `(UpdatedAtUtc, TrackedOperationId)` as a composite
keyset cursor, and ask the pull for rows strictly after that composite), so a
burst sharing one `UpdatedAtUtc` cannot pin the cursor.
- Correct the `ReconcileSiteAsync` XML (lines 530-534): it claims parity with
`SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling
reads it for stalled detection; this actor ignores it).
Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data
corruption, and the saturated-single-timestamp input is pathological; the cost
is wasted re-pulls and an un-drained backlog tail on that one input, plus the
missing observability.
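The two recommendations above (consume `MoreAvailable`, and advance on a composite keyset cursor) can be sketched together as follows. This is a minimal illustration against an in-memory list, not the actor's code: `Pull`, `Drain`, the `(UpdatedAtUtc, Id)` tuple shape, and the `maxPages` parameter are all hypothetical names introduced here, and `Id` stands in for whatever tiebreaker (e.g. `TrackedOperationId`) the real rows carry.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for one site's rows: a burst of seven rows
// that all share a single UpdatedAtUtc value (the pathological input).
var t0 = new DateTime(2026, 6, 20, 12, 0, 0, DateTimeKind.Utc);
var siteRows = Enumerable.Range(1, 7)
    .Select(i => (UpdatedAtUtc: t0, Id: i))
    .ToList();

// Hypothetical pull: rows strictly after the composite cursor, ordered by
// (UpdatedAtUtc, Id). One extra row is fetched to compute MoreAvailable.
(List<(DateTime UpdatedAtUtc, int Id)> Rows, bool MoreAvailable) Pull(
    (DateTime UpdatedAtUtc, int Id) cursor, int batchSize)
{
    var page = siteRows
        .Where(r => r.UpdatedAtUtc > cursor.UpdatedAtUtc
                 || (r.UpdatedAtUtc == cursor.UpdatedAtUtc && r.Id > cursor.Id))
        .OrderBy(r => r.UpdatedAtUtc).ThenBy(r => r.Id)
        .Take(batchSize + 1)
        .ToList();
    bool more = page.Count > batchSize;
    return (page.Take(batchSize).ToList(), more);
}

// Bounded within-tick drain: keep pulling while the batch comes back saturated,
// stop at maxPages, and stop early if the cursor fails to move.
int Drain(int batchSize, int maxPages)
{
    var cursor = (UpdatedAtUtc: DateTime.MinValue, Id: 0);
    int reconciled = 0;
    for (int pageNo = 0; pageNo < maxPages; pageNo++)
    {
        var (rows, more) = Pull(cursor, batchSize);
        if (rows.Count == 0) break;
        reconciled += rows.Count;            // the real code would upsert each row here
        var last = rows[^1];
        var next = (last.UpdatedAtUtc, last.Id);
        if (next == cursor) break;           // defensive no-progress guard
        cursor = next;
        if (!more) break;
    }
    return reconciled;
}

Console.WriteLine(Drain(batchSize: 3, maxPages: 10)); // prints 7
```

With a timestamp-only inclusive cursor the same input would return the first three rows on every pull. The `Id` tiebreaker is what lets the cursor move inside a single-timestamp burst, and `maxPages` keeps one tick from draining indefinitely.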
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `ReconcileSiteAsync` now consumes `response.MoreAvailable` via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added.
+295 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging` | | Module | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` | | Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 2 | | Open findings | 0 |
## Summary ## Summary
@@ -101,6 +101,36 @@ _Re-review (2026-05-28, `1eb6e97`):_
| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. | | 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). | | 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Re-reviewed the module at commit `4307c381`. All twenty-three prior findings are still
recorded; twenty-two of their resolutions hold up under inspection — **but one does not**.
SiteEventLogging-016 (**High**, the `From`/`To` UTC-normalisation defect) is marked
`Resolved` and its Resolution text claims `EventLogQueryService.ExecuteQuery` now calls
`.ToUniversalTime()` before `ToString("o")` and that a regression test
`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added — **neither is true at this
commit**. `EventLogQueryService.cs:77,83` still stringifies `request.From`/`request.To`
verbatim with `.ToString("o")` and no UTC normalisation, and a repo-wide search finds no
test by that name. The claimed -016 fix was never committed; the High-severity defect is
live. This is re-opened as SiteEventLogging-024 (cross-referencing -016) so the audit trail
shows the resolution was asserted but never landed. Four new findings were recorded: -024
(the never-landed -016 fix, High), -025 (synchronous severity validation faults
fire-and-forget callers, Medium), -026 (purge active-node gate diverges from the query
singleton's placement, Medium), and -027 (the time-range test masks -024, Low).
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | **-024**: `From`/`To` filters still compare non-UTC-normalised ISO 8601 strings against UTC-stored timestamps — the SiteEventLogging-016 fix was never committed (code at `EventLogQueryService.cs:77,83` is unchanged; the claimed regression test does not exist). |
| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` remains a simple `Receive`/`Tell` bridge; no new findings. |
| 3 | Concurrency & thread safety | ☑ | Lock-guarded `WithConnection` pattern is correct; no new findings. |
| 4 | Error handling & resilience | ☑ | **-025**: `LogEventAsync` validates severity/args by throwing synchronously, bypassing the documented "faults the returned Task" contract for fire-and-forget callers. |
| 5 | Security | ☑ | Queries fully parameterised; `PageSize` now clamped (-017). No new findings. |
| 6 | Performance & resource management | ☑ | Bounded write queue (-015) and `PageSize` clamp (-017) in place; no new findings. |
| 7 | Design-document adherence | ☑ | **-026**: the purge active-node gate (`SelfIsPrimary` = cluster leader) can diverge from the query singleton's placement (oldest member of role), leaving the queried node's DB unpurged / over-cap. |
| 8 | Code organization & conventions | ☑ | Concrete-recorder DI wiring correct; no new findings. |
| 9 | Testing coverage | ☑ | **-027**: `Query_FiltersByTimeRange` asserts count only with UTC-only inputs, masking -024; no non-UTC-offset regression test exists. |
| 10 | Documentation & comments | ☑ | No new documentation findings beyond those above. |
## Findings ## Findings
### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space ### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -1144,3 +1174,265 @@ Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or cha
`stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`. `stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`.
`CancellationTokenSource` is the canonical .NET pattern and also lets the test `CancellationTokenSource` is the canonical .NET pattern and also lets the test
cooperate with xUnit's `Task.WhenAll` timeout. cooperate with xUnit's `Task.WhenAll` timeout.
### SiteEventLogging-024 — `From`/`To` filters still compare non-normalised ISO 8601 strings against UTC-stored timestamps (the SiteEventLogging-016 fix was never committed)
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/EventLogQueryService.cs:77,83` |
**Description**
This is a re-open of SiteEventLogging-016. That finding was marked **Resolved**
(2026-05-28) with a Resolution stating that `EventLogQueryService.ExecuteQuery` now calls
`.ToUniversalTime()` on `request.From`/`request.To` before `ToString("o")`, that
`EventLogPurgeService.PurgeByRetention` was made defensive with an explicit
`.ToUniversalTime()`, and that a regression test
`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added. **None of that is present at
commit `4307c381`.** The code at `EventLogQueryService.cs` is byte-for-byte the original
defective form:
```csharp
if (request.From.HasValue)
{
whereClauses.Add("timestamp >= $from");
parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o"))); // line 77
}
if (request.To.HasValue)
{
whereClauses.Add("timestamp <= $to");
parameters.Add(new SqliteParameter("$to", request.To.Value.ToString("o"))); // line 83
}
```
There is no `.ToUniversalTime()` (or `.UtcDateTime`) on either bound, and a repo-wide
search for `Query_FiltersByTimeRange_HandlesNonUtcOffset` returns zero matches — the
claimed regression test does not exist. The -016 resolution was asserted but never landed
in source, so the original High-severity defect is **live**.
The mechanism is exactly as -016 described. Event rows are persisted with
`timestamp = DateTimeOffset.UtcNow.ToString("o")` (`SiteEventLogger.cs:202`), which always
emits the round-trip ISO 8601 form ending in the literal offset `+00:00`
(e.g. `2026-06-20T12:34:56.7890123+00:00`). `request.From`/`request.To` are
`DateTimeOffset?`, and `ToString("o")` preserves whatever offset the caller passed. A
central client in, say, `UTC+05:00` that filters with `DateTimeOffset.Now` produces
`"2026-06-20T17:34:56.0000000+05:00"`, which under SQLite's default `BINARY` collation
sorts lexicographically *greater* than the equivalent UTC instant string
`"2026-06-20T12:34:56.0000000+00:00"`. The `timestamp >= $from` / `timestamp <= $to`
comparison is then a byte-by-byte string compare, so the query silently includes events
from the wrong hour or excludes events that genuinely fall in the window. The design
states "All timestamps are UTC throughout the system", but the central-`DateTimeOffset`
→ SQLite boundary does not enforce it. A central UI rendered in a non-UTC timezone is the
most likely trigger, and it corrupts precisely the "show me what happened around the
failover" time-range query that operators most often run. The retention purge
(`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")`) is UTC by construction and remains
safe; only the central query path is vulnerable.
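The ordering hazard described above can be reproduced outside SQLite with plain string comparison. The sketch below assumes that ordinal comparison is an adequate stand-in for `BINARY` collation on these ASCII-only strings; the variable names are illustrative and nothing here is taken from `EventLogQueryService`.

```csharp
using System;

// A lower bound as a UTC+05:00 client would send it, and the same instant normalised to UTC.
var from = new DateTimeOffset(2026, 6, 20, 17, 34, 56, TimeSpan.FromHours(5));
string rawBound = from.ToString("o");                    // 2026-06-20T17:34:56.0000000+05:00
string utcBound = from.ToUniversalTime().ToString("o");  // 2026-06-20T12:34:56.0000000+00:00

// An event stored in UTC about 25 minutes after that instant, so it belongs in the window.
string stored = new DateTimeOffset(2026, 6, 20, 13, 0, 0, TimeSpan.Zero).ToString("o");

// "timestamp >= $from" evaluated as a byte-wise string comparison.
bool matchesRaw = string.CompareOrdinal(stored, rawBound) >= 0;
bool matchesUtc = string.CompareOrdinal(stored, utcBound) >= 0;

Console.WriteLine($"raw bound {rawBound}: included = {matchesRaw}");  // False (wrongly excluded)
Console.WriteLine($"UTC bound {utcBound}: included = {matchesUtc}");  // True
```

The comparison diverges at the hour digits (`13` vs `17`), long before the offset suffix is reached, which is why the filter result depends on the caller's offset.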
**Recommendation**
Apply the one-line fix on each bound that -016 specified but that was never committed:
```csharp
parameters.Add(new SqliteParameter("$from", request.From.Value.ToUniversalTime().ToString("o")));
parameters.Add(new SqliteParameter("$to", request.To.Value.ToUniversalTime().ToString("o")));
```
so the produced offset is always `+00:00` and the comparison is lexicographically sound
against the UTC-stored strings. Add the regression test that -016 claimed but never
delivered (see SiteEventLogging-027): construct a `From`/`To` carrying a non-zero offset
(e.g. `+05:00`) and assert the matching UTC-stored events are returned and out-of-range
ones excluded, asserting on returned row identities rather than just count. Optionally
store `timestamp` as Unix-epoch `INTEGER` to eliminate the lexicographic-comparison hazard
structurally. When fixing, also reconcile SiteEventLogging-016's Resolution text with
reality (its commit reference was `<pending>` and the fix never merged).
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `EventLogQueryService` now calls `.ToUniversalTime()` on the `From`/`To` `DateTimeOffset` bounds before `.ToString("o")`, so the comparison strings are UTC and match the UTC-stored values for any non-UTC client offset. This is the fix that the closed -016 claimed but never committed. Regression test added (verified failing pre-fix).
### SiteEventLogging-025 — `LogEventAsync` severity/argument validation throws synchronously, bypassing the "faults the returned Task" contract for fire-and-forget callers
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/SiteEventLogger.cs:187-199` |
**Description**
`LogEventAsync` is documented as a fire-and-forget enqueue whose returned `Task`
"completes once the event is durably persisted and faults if the write fails"
(`SiteEventLogger.cs:24-29`), and the disposed-drop paths were deliberately reworked under
SiteEventLogging-012 to *fault the Task* rather than throw. But the argument and severity
validation at the top of the method throws **synchronously** on the caller's thread before
any `Task` is returned:
```csharp
ArgumentException.ThrowIfNullOrWhiteSpace(eventType);
ArgumentException.ThrowIfNullOrWhiteSpace(severity);
ArgumentException.ThrowIfNullOrWhiteSpace(source);
ArgumentException.ThrowIfNullOrWhiteSpace(message);
if (!AllowedSeverities.Contains(severity)) // SiteEventLogging-020 closed set
{
throw new ArgumentException(
$"Severity '{severity}' is not one of the allowed values: Info, Warning, Error.",
nameof(severity));
}
```
Every recording caller invokes this method fire-and-forget, as `_ = _eventLogger.LogEventAsync(...)`
(no `await`, return discarded), so a thrown `ArgumentException` does **not** flow into the
discarded `Task` — it propagates straight up the calling actor's message-handling stack.
For the recorder hot-path callers (the computed `AlarmActor`, the `DataConnectionActor`,
the `ScriptActor`/`ScriptExecutionActor`) an unhandled synchronous throw would fault the
actor and trip its supervision strategy, when the documented and intended behaviour for a
best-effort audit write is that a bad event is dropped/faulted, never that it crashes the
subsystem doing the logging.
This is **currently latent**: every production call site passes a hard-coded canonical
`severity` literal (`"Info"`/`"Warning"`/`"Error"`) and non-empty `eventType`/`source`/
`message`, so the throw path is never reached today. The exposure is a future caller that
passes a dynamic, computed, or typo'd severity (e.g. `"error"`, `"ERR"`, a value derived
from an alarm payload) — the SiteEventLogging-020 closed-set check, added defensively, then
becomes a crash vector for the very actor it is meant to protect. The mismatch is that
-012 chose Task-faulting for one drop reason (disposal) while validation kept synchronous
throwing for another (bad input), so the method's failure contract is now inconsistent.
**Recommendation**
Make the failure mode uniform with the rest of the method: return a faulted `Task` for
validation failures instead of throwing synchronously — e.g.
`return Task.FromException(new ArgumentException(...))` for the severity check (and,
consistently, for the null/whitespace guards) — so a fire-and-forget caller is never
crashed by a bad audit-event payload and an `await`-ing caller still observes the failure.
Alternatively, accept a `severity` enum on `ISiteEventLogger.LogEventAsync` and convert to
the canonical string at the boundary, eliminating the runtime-validation throw entirely.
The exact fix shape is a judgment call — faulted-Task is the minimal-diff change that
matches the documented contract; the enum is the stronger design but a wider
interface/caller change. Whichever is chosen, update the XML doc to state the failure
semantics for invalid input.
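A minimal sketch of the faulted-Task option, under stated assumptions: `LogEventSketchAsync` is a hypothetical stand-in for `LogEventAsync` with a reduced parameter list, and the successful path returns `Task.CompletedTask` in place of the real enqueue.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var allowed = new HashSet<string>(StringComparer.Ordinal) { "Info", "Warning", "Error" };

// Hypothetical stand-in: invalid input is reported through the returned Task
// instead of being thrown on the caller's thread.
Task LogEventSketchAsync(string severity, string message)
{
    if (string.IsNullOrWhiteSpace(message))
        return Task.FromException(
            new ArgumentException("Message is required.", nameof(message)));
    if (!allowed.Contains(severity))
        return Task.FromException(
            new ArgumentException($"Severity '{severity}' is not allowed.", nameof(severity)));
    return Task.CompletedTask; // placeholder for the real enqueue
}

// Fire-and-forget style call with a typo'd severity: nothing is thrown here.
Task discarded = LogEventSketchAsync("error", "typo'd severity");
Console.WriteLine(discarded.IsFaulted); // True
```

A caller that does `await` the result still observes the `ArgumentException`, so the failure is not hidden from callers that care about it.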
**Resolution**
Deferred 2026-06-20: the fix shape (return a faulted Task vs. validate against a severity enum) is a design decision, and the defect is latent (every current caller passes a canonical severity literal). Recorded; no change this pass.
### SiteEventLogging-026 — Purge active-node gate (cluster leader) can diverge from the query singleton's placement (oldest member of role)
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs:116-120`, `src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterNodeProvider.cs:29-39`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:889-895` |
**Description**
SiteEventLogging-019 closed the "purge runs on every node" finding by gating
`EventLogPurgeService` on an optional `SiteEventLogActiveNodeCheck`. The Host wires that
check to `IClusterNodeProvider.SelfIsPrimary` (`SiteServiceRegistration.cs:116-120`), which
returns true only when "this node is `Up` AND is the **cluster leader**"
(`AkkaClusterNodeProvider.cs:29-39`: `cluster.State.Leader.Equals(cluster.SelfAddress)`).
The query path, however, is served by the `event-log-handler` **cluster singleton**
(`AkkaHostedService.cs:889-895`), which Akka.NET pins to the **oldest member of the role**,
not the cluster leader. These are two different election concepts:
- The Akka **cluster leader** is the lowest-address member in the `Up`/`Leaving` state set
for the *whole cluster* (role-agnostic) — it is what `SelfIsPrimary` tests.
- A **cluster singleton** is hosted on the *oldest* member of its configured role
(`siteRole`) — it is what owns the event-log query DB.
In a steady-state 2-node site cluster these usually coincide, so the gate works in
practice. But they are not guaranteed equal — most plausibly during membership churn /
after a failover, when the oldest-member computation and the leader election can settle on
different nodes for a window. When they diverge, the query singleton reads node X's
SQLite DB while the purge runs on node Y. The result is that the node whose DB is actually
queried (the populated, active one) is left **unpurged** — its 30-day retention sweep and
1 GB cap-purge do not run — so the queried database can drift over the documented 1 GB cap
and retain rows past the 30-day window, which is exactly the invariant the storage section
of Component-SiteEventLogging requires. The SiteEventLogging-004 re-triage note already
established the correct co-location principle: the *query* singleton and the
*deployment-manager* singleton share `siteRole` and so co-locate on the oldest member; the
purge gate keyed on a *different* signal (`SelfIsPrimary`/leader) silently breaks that
co-location for the purge.
**Recommendation**
Re-gate the purge on the **same** ownership signal as the query singleton so the two never
diverge — i.e. gate on "this node hosts the `event-log-handler` singleton" (host the purge
inside that same cluster singleton, or test oldest-member-of-`siteRole` rather than cluster
leader). Alternatively, drop the purge gate entirely and let every node purge its own local
DB: the SiteEventLogging-019 and -004 analysis already showed standby purge is harmless
(the standby receives no writes), so running it on both nodes guarantees the queried DB is
always purged at the cost of a no-op sweep on the standby. The direction is a **judgment
call** — re-gating on singleton ownership keeps the "active node only" intent but adds
coupling to the singleton's lifecycle; dropping the gate is simpler and provably correct
for the cap/retention invariant but reverts the -019 "active node only" optimisation.
Whichever is chosen, the gate (`SelfIsPrimary` = leader) and the singleton (oldest member)
must not be left keyed on two different elections.
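To make the two elections concrete, the toy model below computes both over the same two-node membership. It deliberately uses plain tuples rather than Akka.NET types: `JoinOrder`, the addresses, the `"site"` role string, and the ordinal address ordering are simplifying assumptions made for illustration only.

```csharp
using System;
using System.Linq;

// Toy membership view (not Akka.NET types). Node A has the lower address but
// restarted and rejoined second, so node B is now the oldest member.
var members = new[]
{
    (Address: "site-node-a:8081", JoinOrder: 2, Role: "site"),
    (Address: "site-node-b:8081", JoinOrder: 1, Role: "site"),
};

// Leader-style election: lowest address across the whole cluster.
string leader = members
    .OrderBy(m => m.Address, StringComparer.Ordinal)
    .First().Address;

// Singleton-style placement: oldest member carrying the role.
string singletonHost = members
    .Where(m => m.Role == "site")
    .OrderBy(m => m.JoinOrder)
    .First().Address;

Console.WriteLine($"purge gate passes on: {leader}");
Console.WriteLine($"query singleton on:   {singletonHost}");
Console.WriteLine($"diverged: {leader != singletonHost}"); // diverged: True
```

Keying both the purge and the query on the same selector (either one) makes the divergence impossible by construction, which is the point of the recommendation.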
**Resolution**
Deferred 2026-06-20: the direction (re-gate purge on singleton ownership vs. drop the gate per the -019 analysis) is a design-owner decision. Divergence between the Akka leader and the oldest-member singleton is a churn-window condition. Recorded; no change this pass.
### SiteEventLogging-027 — Time-range test asserts count only with UTC-only inputs, masking SiteEventLogging-024
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteEventLogging.Tests/EventLogQueryServiceTests.cs:197-214` |
**Description**
`Query_FiltersByTimeRange` is the only test covering the `From`/`To` time-range filter, and
it cannot catch SiteEventLogging-024 by construction:
```csharp
var now = DateTimeOffset.UtcNow; // UTC-only input
InsertEventAt(now.AddHours(-2), "script", "Info", null, "S1", "Old event");
InsertEventAt(now.AddMinutes(-30),"script", "Info", null, "S2", "Recent event");
InsertEventAt(now, "script", "Info", null, "S3", "Now event");
var response = _queryService.ExecuteQuery(MakeRequest(
from: now.AddHours(-1),
to: now.AddMinutes(1)));
Assert.True(response.Success);
Assert.Equal(2, response.Entries.Count); // count only
```
Both shortcomings hide the live -024 defect:
1. **UTC-only inputs.** `now` is `DateTimeOffset.UtcNow`, so `from`/`to` already carry a
`+00:00` offset. `ToString("o")` then produces a correctly-comparable string and the
missing `.ToUniversalTime()` is never exercised. The defect only manifests for a
`DateTimeOffset` with a *non-zero* offset, which this test never constructs.
2. **Count-only assertion.** It asserts `response.Entries.Count == 2` but never checks
*which* rows came back. A boundary off-by-an-hour bug that swapped one in-range row for
an out-of-range row (the -024 symptom) could still yield a count of 2 and pass.
The net effect is a green test suite that gives false confidence the time-range filter is
correct — and almost certainly contributed to SiteEventLogging-016 being marked Resolved
when its fix was never committed.
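The count-masking effect in point 2 can be shown with four stored events and a window sent with a `+05:00` offset. This is a self-contained sketch using ordinal string comparison as a stand-in for the SQLite filter; the `S-12` style labels and the `Filter` helper are hypothetical, not the test suite's helpers.

```csharp
using System;
using System.Linq;

// Four UTC-stored events at 12:00, 14:00, 17:00 and 19:00.
var day = new DateTimeOffset(2026, 6, 20, 0, 0, 0, TimeSpan.Zero);
var stored = new[] { ("S-12", 12), ("S-14", 14), ("S-17", 17), ("S-19", 19) }
    .Select(e => (Source: e.Item1, Timestamp: day.AddHours(e.Item2).ToString("o")))
    .ToArray();

// The window 11:00Z..15:00Z as a UTC+05:00 client would express it (16:00..20:00 local).
var offset = TimeSpan.FromHours(5);
var from = new DateTimeOffset(2026, 6, 20, 16, 0, 0, offset);
var to = new DateTimeOffset(2026, 6, 20, 20, 0, 0, offset);

// Byte-wise string filter, mirroring "timestamp >= $from AND timestamp <= $to".
string[] Filter(string lo, string hi) => stored
    .Where(e => string.CompareOrdinal(e.Timestamp, lo) >= 0
             && string.CompareOrdinal(e.Timestamp, hi) <= 0)
    .Select(e => e.Source)
    .ToArray();

var buggy = Filter(from.ToString("o"), to.ToString("o"));
var fixedRows = Filter(from.ToUniversalTime().ToString("o"),
                       to.ToUniversalTime().ToString("o"));

Console.WriteLine($"raw bounds: {string.Join(",", buggy)} (count {buggy.Length})");         // S-17,S-19 (count 2)
Console.WriteLine($"UTC bounds: {string.Join(",", fixedRows)} (count {fixedRows.Length})"); // S-12,S-14 (count 2)
```

Both filters return two rows, so an assertion on `Count` alone passes in either case. Only an assertion on the returned identities distinguishes the shifted window from the correct one.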
**Recommendation**
Add the non-UTC-offset regression test that SiteEventLogging-016 claimed but never
delivered (`Query_FiltersByTimeRange_HandlesNonUtcOffset`): construct `From`/`To` as a
`DateTimeOffset` with a non-zero offset (e.g. `+05:00`) bracketing UTC-stored events, and
assert on the **identities** of the returned rows (e.g. by `Source`/`Message` or `Id`), not
just `Count`, so a window shifted by the offset is detected. Strengthen the existing
`Query_FiltersByTimeRange` likewise to assert which rows are returned. These tests should
fail against the current code and pass once SiteEventLogging-024 is fixed.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added `Query_FiltersByTimeRange_HandlesNonUtcOffset` — seeds known UTC instants, queries with a +05:00 offset, and asserts the returned row identities (not just count). Verified failing pre-fix, passing post-fix.
+271 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` | | Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` |
| Design doc | `docs/requirements/Component-SiteRuntime.md` | | Design doc | `docs/requirements/Component-SiteRuntime.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 3 | | Open findings | 0 |
## Summary ## Summary
@@ -109,6 +109,40 @@ _Re-review (2026-05-28, `1eb6e97`):_
## Findings ## Findings
#### Re-review 2026-06-20 (commit `4307c381`) — full review
The module was re-reviewed in full at `4307c381` (current HEAD). The diff against the
prior baseline `1eb6e97` shows as 100% additions because of the ScadaBridge rename
(the project moved paths), so the whole module was re-read at its current state rather
than relying on the diff. Health is good: all prior findings 001–026 remain
Resolved/Deferred with no regressions observed in their fixed call sites
(SetAttribute DCL routing, the watch+buffer redeploy, the SiteRuntime-020 terminating
shadow, the OperationTrackingStore read/write split + safe Dispose, the
AuditingDb `Inner` accessor, invariant-culture numeric parsing, the
`_attributes` child-snapshot isolation). The new M7 surface (NativeAlarmActor,
CertStoreActor) and the WaitForAttribute/batch-write additions are generally
well-built. Five new findings were recorded, all in the native-alarm and
deployment-lifecycle bookkeeping: an unbounded `_latestAlarmEvents` map on the
Instance Actor (027, Medium), a phantom-active alarm left behind by
`NativeAlarmActor.EnforceCap` (028, Medium), a delete-during-pending-redeploy
that both over-decrements the deployed counter and resurrects the deleted
instance (029, Medium), missing tests for the cap/retention/replication paths
(030, Low), and stale site-side persistence of notification-list + SMTP config
that the design decision moved to central-only (031, Low).
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | EnforceCap leaves phantom-active alarm (028); delete-during-redeploy over-decrements + resurrects (029). HiLo/RoC/edge logic otherwise sound. |
| 2 | Akka.NET conventions | ✓ | Supervision (Resume coordinators / Stop short-lived) correct; execution actors on ScriptExecutionScheduler; cert broadcast uses PipeTo, captures `sender` before async. Trigger-eval mailbox-block stays Deferred (014). |
| 3 | Concurrency & thread safety | ✓ | Child-snapshot isolation (017) intact; `_createdConnections` actor-confined via ApplyArtifact… dispatch (021); OperationTrackingStore read/write split (024). No new actor-thread violations. |
| 4 | Error handling & resilience | ✓ | Best-effort persistence/replication paths log-on-fault; deploy reports Success only post-persist (005). Native source-unavailable retains+marks-uncertain. |
| 5 | Security | ✓ | Trust verdict delegated to shared ScriptTrustValidator; CertStoreActor path-traversal-guarded; no reflection in scripts; SQL params captured per M4 design (redaction at write time). |
| 6 | Performance & resource management | ✓ | `_latestAlarmEvents` unbounded growth on native-alarm churn (027). ScriptExecutionScheduler background threads OK; per-call SQLite connections acceptable. |
| 7 | Design-document adherence | ✓ | Actor hierarchy / native-alarm wiring conform. Stale site persistence of notification-list + SMTP config vs. "central-only, not deployed to sites" (031). |
| 8 | Code organization & conventions | ✓ | Reflection anti-pattern eliminated (006/022); Options owned by component; additive message evolution honoured. |
| 9 | Testing coverage | ✓ | Broad new suites (NativeAlarmActor, WaitForAttribute, SetAttribute, ExecutionActor, scheduler). EnforceCap/retention-drop + SiteReplicationActor still untested (030). |
| 10 | Documentation & comments | ✓ | XML docs accurate and extensive; ReplicationMessages now documented (026). No new stale comments found. |
### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer ### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
| | | | | |
@@ -1370,3 +1404,237 @@ the direction (outbound to peer / inbound from peer) and what is replicated.
The two pre-existing group-header XML blocks were converted to plain `//` The two pre-existing group-header XML blocks were converted to plain `//`
comments to avoid orphaned doc-summaries above the first record in each group. comments to avoid orphaned doc-summaries above the first record in each group.
Marker-base-type idea left out of scope. Marker-base-type idea left out of scope.
### SiteRuntime-027 — `InstanceActor._latestAlarmEvents` grows without bound as native conditions churn
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:67`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:1007` |
**Description**
`HandleAlarmStateChanged` stores the latest enriched `AlarmStateChanged` per
`changed.AlarmName` into `_latestAlarmEvents` (line 1013) and **never removes
anything from that dictionary**. For computed alarms the key set is bounded (one
entry per configured alarm). For **native** alarms, however, the key is the
per-condition `SourceReference` (every `NativeAlarmActor.Emit` stamps
`AlarmName = t.SourceReference`). When a native condition fully runs its course —
`NativeAlarmActor.ApplyLiveTransition` drops it from the mirror once
`!Active && Acknowledged` (NativeAlarmActor.cs:243), or `EnforceCap` evicts the
oldest — the Instance Actor still holds the final `AlarmStateChanged` for that
`SourceReference` forever as a (Normal) entry. There is no message back to the
Instance Actor to evict it.
On an OPC UA A&C source that mints a fresh `SourceReference` per occurrence
(conditionId/branchId style references are common), the map accumulates one
permanently-retained entry per distinct condition the instance has *ever* seen.
The `MirroredAlarmCapPerSource` cap (default 1000) bounds the
`NativeAlarmActor._alarms` set but does **not** bound `_latestAlarmEvents`, which
is on the long-lived Instance Actor. Over weeks of uptime on a busy site this is a
steadily-growing per-instance memory leak, and `BuildAlarmStatesSnapshot`
(line 1055, `_latestAlarmEvents.Values.ToList()`) makes every DebugView snapshot
proportionally larger and slower, returning thousands of stale Normal rows.
**Recommendation**
Evict from `_latestAlarmEvents` when a native condition reaches its terminal
return-to-normal+dropped state. The cleanest signal is the `AlarmStateChanged`
the `NativeAlarmActor` already emits on drop-out: have the `NativeAlarmActor`
stamp a "final/dropped" marker (an additive bool on `AlarmStateChanged`, or a
dedicated `NativeAlarmDropped(sourceReference)` Tell to the parent), and have
`HandleAlarmStateChanged` `_latestAlarmEvents.Remove(...)` on it. Alternatively
cap/prune `_latestAlarmEvents` for native keys mirroring
`MirroredAlarmCapPerSource`. Computed-alarm keys must stay (they are
configuration-bounded).
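The eviction idea can be sketched with a plain dictionary standing in for `_latestAlarmEvents`. The handler names below mirror the recommendation, but the bodies, the string-valued state, and the `cond-N` keys are assumptions for illustration and do not reflect the actor's real message types.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for InstanceActor._latestAlarmEvents: latest state per alarm key.
var latestAlarmEvents = new Dictionary<string, string>();

void HandleAlarmStateChanged(string alarmName, string state) =>
    latestAlarmEvents[alarmName] = state;

// Hypothetical handler for a "dropped" notification from the native-alarm mirror.
void HandleNativeAlarmDropped(string sourceReference) =>
    latestAlarmEvents.Remove(sourceReference);

// A source that mints a fresh reference per occurrence: each condition goes
// Active, returns to Normal, and is then dropped from the mirror.
for (int i = 0; i < 1000; i++)
{
    string key = $"cond-{i}";
    HandleAlarmStateChanged(key, "Active");
    HandleAlarmStateChanged(key, "Normal");
    HandleNativeAlarmDropped(key);
}

// A computed alarm is configuration-bounded and is never dropped.
HandleAlarmStateChanged("HighTemp", "Normal");

Console.WriteLine(latestAlarmEvents.Count); // prints 1
```

Without the drop handler the same loop would leave 1001 entries in the map, one per condition ever seen.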
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added an additive `NativeAlarmDropped(SourceReference)` Tell emitted by `NativeAlarmActor` at all terminal-drop sites (snapshot-swap removal, retention drop, cap eviction) after the return-to-normal emit; `InstanceActor.HandleNativeAlarmDropped` removes the native key from `_latestAlarmEvents` (and `_alarmStates`/`_alarmTimestamps`). Computed-alarm keys are never dropped.
### SiteRuntime-028 — `NativeAlarmActor.EnforceCap` drops a condition without a return-to-normal, leaving a phantom Active alarm
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/NativeAlarmActor.cs:281` |
**Description**
`EnforceCap` evicts the oldest mirrored conditions once the per-source cap is
exceeded: it removes them from `_alarms`, calls `PersistDelete`, and logs the
eviction — but it does **not** `Emit` a return-to-normal `AlarmStateChanged` for
the dropped condition. If the evicted condition was still `Active`, the Instance
Actor's `_latestAlarmEvents` (and, downstream, central's gRPC stream and the
operator Alarm Summary page) keep showing it as **Active** indefinitely. The
mirror has silently forgotten the condition, so no later transition can ever clear
it — a phantom stuck-active alarm.
This is inconsistent with the sibling drop path `ApplySnapshotSwap`
(NativeAlarmActor.cs:205), which correctly emits `prior.Condition with { Active =
false }` for every condition that falls out of the new snapshot. The retention
drop in `ApplyLiveTransition` (line 243) is safe only because it drops *after*
emitting the condition's own already-inactive state; `EnforceCap` drops a
condition whose last-emitted state may still be Active, with no compensating emit.
The design doc explicitly states the cap eviction is "logged — there is no silent
truncation," but logging alone does not reconcile the operator-visible state: the
alarm view is silently wrong.
**Recommendation**
In `EnforceCap`, before removing each overflow condition, `Emit(a, a.Condition
with { Active = false })` (mirroring `ApplySnapshotSwap`) so the eviction produces
a return-to-normal on the stream and clears the phantom. Add a test asserting that
exceeding `MirroredAlarmCapPerSource` with an active oldest condition emits a
Normal `AlarmStateChanged` for the evicted `SourceReference`.
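The compensating emit can be sketched as below. This is an illustration under stated assumptions, not the actor's code: `Condition`, `MirroredAlarmSet`, and the `LastChanged` ordering key are hypothetical, persistence and logging are omitted, and the `Active` guard (emit only for a still-active condition) is one possible reading of the recommendation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical condition shape; LastChanged is an assumed "oldest first" key.
public sealed record Condition(string SourceReference, bool Active, DateTimeOffset LastChanged);

public sealed class MirroredAlarmSet
{
    private readonly Dictionary<string, Condition> _alarms = new();
    private readonly int _cap;
    private readonly Action<Condition> _emit;

    public MirroredAlarmSet(int cap, Action<Condition> emit)
    {
        _cap = cap;
        _emit = emit;
    }

    public int Count => _alarms.Count;

    public void Upsert(Condition condition)
    {
        _alarms[condition.SourceReference] = condition;
        EnforceCap();
    }

    private void EnforceCap()
    {
        var overflow = _alarms.Count - _cap;
        if (overflow <= 0) return;

        // Materialise the eviction list before mutating the dictionary.
        var evicted = _alarms.Values.OrderBy(c => c.LastChanged).Take(overflow).ToList();
        foreach (var condition in evicted)
        {
            // Compensating return-to-normal so downstream views do not keep
            // a phantom Active alarm for a condition the mirror has forgotten.
            if (condition.Active)
                _emit(condition with { Active = false });
            _alarms.Remove(condition.SourceReference);
        }
    }
}
```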
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `NativeAlarmActor.EnforceCap` now emits `Emit(evicted, evicted.Condition with { Active = false })` for a still-active evicted condition before removing it, clearing the phantom stuck-Active alarm on the stream/UI. Cap-eviction test added.
### SiteRuntime-029 — A delete (or disable) arriving during a pending redeploy over-decrements the counter and is undone by `HandleTerminated`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:627`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:402` |
**Description**
`HandleDelete` and `HandleDisable` only consult `_instanceActors` to find the live
actor; neither consults the SiteRuntime-020 `_terminatingActorsByName` /
`_pendingRedeploys` bookkeeping. When a redeployment is in flight, `HandleDeploy`
has already removed the instance from `_instanceActors`, stopped the predecessor,
and buffered a `PendingRedeploy` keyed by the terminating ref. If a
`DeleteInstanceCommand` then arrives for that same instance *before* the
`Terminated` signal fires:
1. `_instanceActors.TryGetValue` misses (the entry was removed at redeploy time),
so no actor is stopped.
2. `_totalDeployedCount = Math.Max(0, _totalDeployedCount - 1)` runs anyway —
over-decrementing, because the redeploy path did not increment for an update
(the count had already been adjusted), so the deployed/disabled counts reported
to the health collector drift.
3. `RemoveDeployedConfigAsync` deletes the SQLite row and the deleter is told
success.
4. The buffered `_pendingRedeploys` entry is **untouched**, so when `Terminated`
fires, `HandleTerminated` calls `ApplyDeployment(..., isRedeploy: true)`, which
re-creates the Instance Actor and re-writes the deployed config to SQLite —
**resurrecting the instance the operator just deleted**, with the counter now
inconsistent.
`HandleDisable` has the milder version: the disable persists `is_enabled = false`,
but `HandleTerminated` then re-stores the config with `isEnabled: true` and
re-creates the actor, so a disable issued mid-redeploy is silently reverted to
enabled.
The window is the redeploy-termination interval — small, but reliably hit when
central issues a delete/disable immediately after a deploy (e.g. an operator
correcting a mistaken deploy, or an automated teardown), exactly the kind of rapid
command sequence SiteRuntime-020 was filed to harden.
**Recommendation**
Make `HandleDelete`/`HandleDisable` authoritative over the mid-redeploy state:
before falling through, check `_terminatingActorsByName`. On a hit, drop the
buffered `_pendingRedeploys` entry (so `HandleTerminated` does not resurrect),
clear the shadow, and for delete tell the buffered redeploy's `OriginalSender` it
was superseded — mirroring the last-write-wins handling already in `HandleDeploy`.
Only decrement `_totalDeployedCount` when an instance was actually present
(in `_instanceActors` **or** terminating). Add a regression test:
deploy → redeploy → delete-before-Terminated asserts the instance stays deleted
and the counter is correct.
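The bookkeeping this recommendation describes can be sketched without Akka. Everything below is a hypothetical simplification: string ids stand in for actor refs, `TerminatingRefFor` is an invented helper, and SQLite persistence and sender notification are left out.

```csharp
using System;
using System.Collections.Generic;

public sealed class DeploymentBookkeeping
{
    private readonly HashSet<string> _instanceActors = new();
    private readonly Dictionary<string, string> _terminatingActorsByName = new(); // name -> terminating ref id
    private readonly Dictionary<string, string> _pendingRedeploys = new();        // terminating ref id -> name

    public int TotalDeployedCount { get; private set; }

    // Hypothetical id scheme standing in for the predecessor's actor ref.
    public static string TerminatingRefFor(string name) => name + "#old";

    public bool IsRunning(string name) => _instanceActors.Contains(name);

    public void Deploy(string name)
    {
        if (_terminatingActorsByName.TryGetValue(name, out var pendingRef))
        {
            _pendingRedeploys[pendingRef] = name; // last write wins
            return;
        }
        if (_instanceActors.Remove(name))
        {
            // Redeploy: the predecessor is stopping; buffer until Terminated.
            var refId = TerminatingRefFor(name);
            _terminatingActorsByName[name] = refId;
            _pendingRedeploys[refId] = name;
            return;
        }
        _instanceActors.Add(name);
        TotalDeployedCount++;
    }

    public void Delete(string name)
    {
        var wasPresent = _instanceActors.Remove(name);
        if (_terminatingActorsByName.Remove(name, out var refId))
        {
            // Drop the buffered redeploy so Terminated cannot resurrect it.
            _pendingRedeploys.Remove(refId);
            wasPresent = true;
        }
        // Decrement only when an instance was live or terminating.
        if (wasPresent)
            TotalDeployedCount = Math.Max(0, TotalDeployedCount - 1);
    }

    // Returns true when a buffered redeploy was applied.
    public bool Terminated(string refId)
    {
        if (!_pendingRedeploys.Remove(refId, out var name)) return false;
        _terminatingActorsByName.Remove(name);
        _instanceActors.Add(name);
        return true;
    }
}
```

The deploy, redeploy, delete-before-`Terminated` sequence then leaves the instance absent and the counter at zero, which is the property the proposed regression test would assert.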
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `HandleDelete`/`HandleDisable` now check `_terminatingActorsByName` before the `_instanceActors` fall-through; on a hit they drop the buffered `_pendingRedeploys` entry and clear the shadow so `HandleTerminated` can't resurrect the instance, and `_totalDeployedCount` is decremented only when an instance was actually present. Delete-during-redeploy race test added (verified failing pre-fix).
### SiteRuntime-030 — Native-alarm cap/retention-drop and `SiteReplicationActor` paths are untested
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/NativeAlarmActorTests.cs`, `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/` |
**Description**
`NativeAlarmActorTests` covers subscribe, raise, snapshot-swap return-to-normal,
out-of-order rejection, ack, and site-event emission, but there is **no** test for
`EnforceCap` (the per-source cap eviction) or for the `ApplyLiveTransition`
retention drop (`!Active && Acknowledged`). Both are non-trivial state
transitions, and the cap path harbours an observable defect (SiteRuntime-028) that
a targeted test would have caught. `SiteReplicationActor` remains entirely
untested — the carried-forward gap that SiteRuntime-016 explicitly deferred to a
clustered-ActorSystem harness, still outstanding at this commit (the actor calls
`Cluster.Get(Context.System)` in its constructor, so it needs a clustered HOCON
test host). The outbound forward, inbound apply, and `SendToPeer` no-peer-drop
behaviour are unverified.
**Recommendation**
Add `NativeAlarmActor` tests for (a) cap eviction emits a return-to-normal for an
evicted active condition (pairs with SiteRuntime-028) and (b) a resolved condition
(inactive+acked) drops out and deletes its SQLite row. Stand up the clustered
test harness SiteRuntime-016 called for and cover `SiteReplicationActor`'s
outbound/inbound/peer-tracking paths.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): native-alarm cap-eviction and retention-drop tests added (pairs with -027/-028). The clustered `SiteReplicationActor` test harness remains deferred (needs a clustered ActorSystem host, consistent with the prior -016 deferral).
### SiteRuntime-031 — Site still persists notification-list and SMTP config that the design moved to central-only
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:90`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:105`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1383`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1417` |
**Description**
The design decision (CLAUDE.md "External Integrations" / Component-NotificationService;
echoed in Component-SiteRuntime.md "System-Wide Artifact Handling": *"Notification
lists and SMTP configuration are central-only and are not deployed to sites"*) makes
notification delivery central-only — sites store-and-forward to central and never
talk to SMTP. Yet SiteRuntime still:
- creates the `notification_lists` and `smtp_configurations` SQLite tables
(`SiteStorageService.InitializeAsync`),
- writes them from `HandleDeployArtifacts` (`StoreNotificationListAsync` /
`StoreSmtpConfigurationAsync`, DeploymentManagerActor.cs:1383/1417) and the
replication path `SiteReplicationActor.HandleApplyArtifacts`, and
- the `DeployArtifactsCommand` contract still carries `NotificationLists` /
`SmtpConfigurations`.
The SMTP path persists the SMTP `password` field (SiteStorageService.cs:693) into
plaintext site SQLite. If central no longer populates these (per the design), this
is dead code carrying a latent secret-at-rest footprint; if central still does, the
site is storing SMTP credentials it must never use — both contradict the
central-only delivery decision. Either the code/tables are stale and should be
removed, or the design doc is stale and the decision needs re-stating. This straddles
the SiteRuntime/NotificationService boundary and the shared `DeployArtifactsCommand`
contract, so the direction is a design-owner call rather than a clean in-module fix.
**Recommendation**
Confirm with the design owner whether central still ships notification-list / SMTP
artifacts to sites. If not (the stated decision), remove the `notification_lists`
and `smtp_configurations` tables, the two `Store…Async` writers, the replication
branches, and the corresponding `DeployArtifactsCommand` fields (a coordinated
cross-module change). If the decision has changed, update Component-SiteRuntime.md
and Component-NotificationService.md and treat the plaintext SMTP password as a
secret-handling finding in its own right.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the site no longer persists notification-list / SMTP config and purges any already-persisted rows (incl. the plaintext SMTP password) on both apply paths (`HandleDeployArtifacts`, replication `HandleApplyArtifacts`), inside the all-or-nothing apply. Paired with DeploymentManager-025 (central stops shipping). Tables retained but kept empty.
+219 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward` | | Module | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` | | Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 0 (3 Deferred: 002, 011, 012; all 5 Open from Re-review 2026-05-28 resolved 2026-05-28) | | Open findings | 0 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) |
## Summary ## Summary
@@ -110,6 +110,39 @@ mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
`_replication` after `StopAsync` returns; downstream resources disposed by the host `_replication` after `StopAsync` returns; downstream resources disposed by the host
shutdown sequence (the DI container) can then NRE through the still-running sweep. shutdown sequence (the DI container) can then NRE through the still-running sweep.
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Full re-review against commit `4307c381` with the same 10-category checklist. All prior
fixes are intact: the StoreAndForward-024 stop-wait (the captured `_sweepTask` +
bounded `StopAsync` wait), the StoreAndForward-020 capture-before-requeue, the
StoreAndForward-005 conditional `UpdateMessageIfStatusAsync` writes, and the
StoreAndForward-023 site-id sentinel all remain present and correct, and the new WP-14
in-process `_bufferedCount` queue-depth counter shows **no drift** — every Pending-population
transition (`BufferAsync` +1, retry-remove −1, conditional Pending→Parked −1, operator
requeue +1) adjusts it exactly once and only when the underlying conditional storage write
wins. The three findings this pass surfaces are not core-delivery defects: a **gauge-stop
resource leak** (025 — the process-global queue-depth provider slot is never cleared on
`StopAsync`, so a stopped service's frozen depth is reported indefinitely and the dead
instance is pinned via the closure), plus **two test/doc items** — the gauge stop/lifecycle
and the `EmitSiteEvent` retry / retry-delivered branches are untested (026), and the
per-retry-per-sweep `"Retried"` site event is a spec-conformant design-vs-scale tension
worth flagging (027).
## Checklist coverage — Re-review 2026-06-20
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | No new defects — `_bufferedCount` adjusts once per Pending transition, gated on the winning conditional write; retry-count/park semantics (003/005) intact. |
| 2 | Akka.NET conventions | ☑ | No issues found — `ParkedMessageHandlerActor` `PipeTo` success/failure projections preserved (007). |
| 3 | Concurrency & thread safety | ☑ | No new findings — StopAsync now awaits the captured `_sweepTask` (024); requeue captures the row up front (020); conditional writes guard sweep-vs-operator (005). |
| 4 | Error handling & resilience | ☑ | No new findings — replication wired on all delivery + operator paths (001/016); notification parking documented (019). |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; payload JSON opaque; no secret material handled. |
| 6 | Performance & resource management | ☑ | Queue-depth gauge provider registered into a process-global slot is never cleared on `StopAsync` — frozen reading + dead-instance pin for the process lifetime (025). |
| 7 | Design-document adherence | ☑ | Per-retry `"Retried"` site event per-row-per-sweep is spec-conformant (`Component-StoreAndForward.md:139`) but a flood risk against the unbounded buffer + capped site log — flagged as a judgment call (027). |
| 8 | Code organization & conventions | ☑ | No new findings — options/POCO placement unchanged from prior passes (011/012 still Deferred to Commons-owning changes). |
| 9 | Testing coverage | ☑ | Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested (026). |
| 10 | Documentation & comments | ☑ | No new findings — the `_bufferedCount` and StopAsync XML docs accurately describe the implemented behaviour. |
## Checklist coverage — Re-review 2026-05-28 ## Checklist coverage — Re-review 2026-05-28
| # | Category | Examined | Notes | | # | Category | Examined | Notes |
@@ -1520,3 +1553,186 @@ mid-sweep and asserts no further storage activity occurs after `StopAsync` retur
_Unresolved._ _Unresolved._
### StoreAndForward-025 — Queue-depth gauge provider is never cleared on `StopAsync`, leaking a frozen reading and pinning the dead service
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:347`–`:351`, `:385`–`:425`; `src/ZB.MOM.WW.ScadaBridge.Commons/Observability/ScadaBridgeTelemetry.cs:55`–`:60`, `:85`–`:93` |
**Description**
The WP-14 queue-depth gauge is backed by a **process-global** provider slot.
`StoreAndForwardService.StartAsync` seeds the in-process `_bufferedCount`, sets a
one-time instance guard (`_queueDepthProviderRegistered`, line 347), and registers a
closure that reads this instance's counter into the global slot
(`ScadaBridgeTelemetry.SetQueueDepthProvider(() => Interlocked.Read(ref _bufferedCount))`,
line 350). The `scadabridge.store_and_forward.queue.depth` `ObservableGauge`
(`ScadaBridgeTelemetry.cs:56`–`:60`) invokes that provider on every collector scrape:
`Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L`.
`SetQueueDepthProvider` (`ScadaBridgeTelemetry.cs:85`–`:93`) is the only mutator of the
slot — there is **no** clear/reset method anywhere in the codebase (a repository-wide
search for `ClearQueueDepthProvider` / writes to `_queueDepthProvider` finds only the
registration). `StopAsync` (`StoreAndForwardService.cs:385`–`:425`) disposes the retry
timer and (StoreAndForward-024) awaits the in-flight sweep, but it never touches the
gauge: the global slot still points at the stopped instance's closure, and
`_queueDepthProviderRegistered` is left at 1.
Two consequences after a graceful stop:
1. **Frozen reading.** The gauge keeps reporting the dead service's last `_bufferedCount`
value forever — a stopped (or failed-over-away) site node reports a stale, non-zero
queue depth that no longer corresponds to any live buffer, which is misleading on a
dashboard / alert built on the metric.
2. **Instance pin (resource leak).** The captured closure holds a reference to the dead
`StoreAndForwardService`, so the stopped instance (and everything it transitively
roots) cannot be garbage-collected for the process lifetime. This is the classic
"static event / static delegate slot holding a finished object" leak. It also means a
later instance's `StartAsync` silently stomps the global slot (last-writer-wins), so
the gauge's identity is unmanaged.
The instance guard's own XML doc (lines 156–164) already acknowledges the slot is
process-global and "the last `StartAsync` wins the global slot" — but nothing on the stop
path releases it, so a clean stop without an immediately-following start leaves a dangling
provider.
**Recommendation**
Give `ScadaBridgeTelemetry` an identity-checked clear and call it from `StopAsync`. Add a
compare-and-clear `ClearQueueDepthProvider(Func<long> expected)` that only nulls the slot
when the current provider is reference-equal to `expected` (so a later instance that
already re-registered is not stomped):
```csharp
public static void ClearQueueDepthProvider(Func<long> expected)
{
    // Null the slot only if it still holds the provider the caller registered.
    Interlocked.CompareExchange(ref _queueDepthProvider, null, expected);
}
```
In `StoreAndForwardService`, hold the registered provider delegate in a field at
`StartAsync` time, and in `StopAsync` (only when this instance actually registered —
`_queueDepthProviderRegistered == 1`) call
`ScadaBridgeTelemetry.ClearQueueDepthProvider(_registeredProvider)` and reset
`_queueDepthProviderRegistered` to 0 via `Interlocked.Exchange` so a subsequent restart
re-seeds and re-registers cleanly. Add a regression test (extend `QueueDepthGaugeTests`)
asserting that after `StopAsync` the gauge reports 0 (provider cleared) and that a second
instance's registration is not clobbered by the first instance's stop.
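The service-side half of this recommendation might look like the sketch below. It is self-contained for illustration only: `TelemetrySketch` and `ServiceSketch` are hypothetical stand-ins for `ScadaBridgeTelemetry` and `StoreAndForwardService`, the `_registeredProvider` field follows the naming used above, and the synchronous `Start`/`Stop` methods stand in for the async lifecycle.

```csharp
using System;
using System.Threading;

public static class TelemetrySketch
{
    private static Func<long> _queueDepthProvider;

    public static void SetQueueDepthProvider(Func<long> provider) =>
        Volatile.Write(ref _queueDepthProvider, provider);

    // Compare-and-clear: a newer instance's provider is left untouched.
    public static void ClearQueueDepthProvider(Func<long> expected) =>
        Interlocked.CompareExchange(ref _queueDepthProvider, null, expected);

    // What the gauge callback would report on a scrape.
    public static long ReadQueueDepth() =>
        Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L;
}

public sealed class ServiceSketch
{
    private long _bufferedCount;
    private int _queueDepthProviderRegistered;
    private Func<long> _registeredProvider;

    public void Buffer() => Interlocked.Increment(ref _bufferedCount);

    public void Start()
    {
        // One-time guard per instance.
        if (Interlocked.Exchange(ref _queueDepthProviderRegistered, 1) == 1) return;

        // Keep the exact delegate instance so Stop can identity-check it.
        _registeredProvider = () => Interlocked.Read(ref _bufferedCount);
        TelemetrySketch.SetQueueDepthProvider(_registeredProvider);
    }

    public void Stop()
    {
        // Only an instance that actually registered releases the slot.
        if (Interlocked.Exchange(ref _queueDepthProviderRegistered, 0) == 0) return;

        TelemetrySketch.ClearQueueDepthProvider(_registeredProvider);
        _registeredProvider = null; // drop the closure so the instance can be collected
    }
}
```

Holding the delegate in a field matters because the compare-and-clear is a reference comparison: a fresh lambda built in `Stop` would never match the registered one.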
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added an identity-checked `ScadaBridgeTelemetry.ClearQueueDepthProvider(provider)` (compare-and-clear via `Interlocked.CompareExchange`, so a newer instance's provider is never stomped); `StoreAndForwardService.StopAsync` now calls it and resets `_queueDepthProviderRegistered`. The gauge no longer reports a frozen depth or pins the dead service after a graceful stop. Tests added (incl. the late-stop-doesn't-clobber-successor case).
### StoreAndForward-026 — Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Deferred |
| Location | `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/QueueDepthGaugeTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/StoreAndForwardSiteEventTests.cs`; `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:295`–`:308`, `:747` |
**Description**
Two behaviours exercised by the current code have no test coverage:
1. **Gauge stop / lifecycle.** `QueueDepthGaugeTests` covers the live counter across
enqueue / drain / park / requeue (`Gauge_TracksBufferedDepth_AcrossEnqueueDrainAndPark`),
the start-up seed from existing Pending rows (`Gauge_SeedsFromExistingPendingRows_OnStart`),
and the additive-seed race (`Gauge_SeedAddsToConcurrentPreSeedIncrement_NotClobber`).
It calls `StopAsync` only as test teardown (`DisposeAsync`) with no assertion — so the
gauge's stop behaviour is unverified. This is the same coverage gap that would have
caught StoreAndForward-025: no test asserts what the gauge reports after a service stops
(today: a stale frozen value; after the 025 fix: 0). The gauge's behaviour across a
stop, and across a second instance starting after a first stops, is entirely untested.
2. **`EmitSiteEvent` retry / retry-delivered branches.** `StoreAndForwardSiteEventTests`
covers buffer-for-retry (`"Queued"`), park (`"Parked"` for both notification and
cached-call categories), routine-enqueue-no-event, and immediate-delivered-no-event.
But two `EmitSiteEvent` branches that the retry sweep drives are uncovered:
- the per-retry `"Retried"` action (`StoreAndForwardService.cs:747`), which maps to a
`store_and_forward` site event at `Warning` severity (the `_ => "Warning"` arm of the
severity switch, lines 298–303);
- the retry-loop `"Delivered"` action whose detail is `"Delivered to … after N retries"`
(lines 644–645), which, because the detail does **not** start with `"Immediate"`,
is **not** short-circuited at line 268 and instead emits a `store_and_forward` event
at `Info` severity (the `"Delivered" => "Info"` arm, line 301). This "recovery
recorded" path is the only one that produces an `Info`-severity S&F site event, and
nothing asserts it fires (or that the immediate-delivered path is correctly suppressed
by contrast).
**Recommendation**
Add a gauge-stop test to `QueueDepthGaugeTests`: enqueue to a known non-zero depth, call
`StopAsync`, and assert the gauge reports the cleared value (0 after the StoreAndForward-025
fix) — and, ideally, that a second `StoreAndForwardService` started afterward owns the slot
without being clobbered by the first instance's stop. Add two `StoreAndForwardSiteEventTests`
cases: (a) a transient handler with `MaxRetries > 1` driven through one sweep, asserting a
`store_and_forward` / `Warning` event whose message contains `"retried"`; and (b) a handler
that fails on the immediate attempt then succeeds on the sweep, asserting a `store_and_forward`
/ `Info` event whose message contains `"delivered"` (recording the recovery), distinct from
the suppressed immediate-delivery path.
**Resolution**
Deferred 2026-06-20: additional gauge-lifecycle and `EmitSiteEvent` retry-branch tests are low-value coverage gaps; recorded for a follow-up.
### StoreAndForward-027 — Per-message `"Retried"` site event fires per-row-per-sweep, risking a flood of the capped site event log
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:734`–`:759`, `:295`–`:308`; `docs/requirements/Component-StoreAndForward.md:72`, `:139` |
**Description**
When `_siteEventLogger` is wired, `EmitSiteEvent` is subscribed to `OnActivity`, so every
`RaiseActivity("Retried", …)` call in `RetryMessageAsync`'s transient-not-yet-max branch
(`StoreAndForwardService.cs:747`–`:748`) produces one `store_and_forward` site event at
`Warning` severity (lines 295–308). That branch fires **once per still-pending row per
retry sweep**. Combined with the design's deliberate "**no maximum buffer size**"
(`Component-StoreAndForward.md:72`), a sustained outage can buffer a large Pending
population, and each fixed-interval sweep then emits a `"Retried"` event for every one of
those rows — `(pending rows) × (sweeps)` events. The site event log is **capped** (default
1 GB, 30-day retention, with oldest-first purge when the cap is hit before the retention
window; see `Component-SiteEventLogging.md:45`–`:46`), so a large, long-lived backlog can
churn the log and evict genuinely useful operational history (deployments, connection
events, alarms) well before the 30-day window.
**This is a judgment call, not a clear bug.** The behaviour is **spec-conformant**:
`Component-StoreAndForward.md:139` explicitly lists "retried" as a logged store-and-forward
activity ("Logs store-and-forward activity (queued, delivered, retried, parked)"), so the
code is doing exactly what the design doc says. The tension is between that
per-retry-logging contract and the combination of an unbounded buffer + a capped site log:
at scale the "retried" stream can dominate the log. The site-event-log cap bounds the
absolute disk risk (it cannot exhaust disk), so the residual concern is **eviction of
other history**, not unbounded growth — hence Low and flagged as a design-vs-scale
decision rather than a defect.
**Recommendation**
Reconcile the design-vs-scale tension deliberately; this is a judgment call for the owners,
not a forced fix. Options, in rough order of preference:
1. **Drop or coarsen the per-retry event.** Stop emitting a `store_and_forward` event for
the routine `"Retried"` action (keep `"Queued"`, `"Parked"`, and the retry-recovery
`"Delivered"`), since the per-attempt detail is already available via the
`OnActivity`/health/telemetry surfaces and the central audit log. The first-buffer and
the terminal park/deliver events are the operationally interesting ones.
2. **Sample / threshold it.** Emit a `"Retried"` event only on the first retry, on a count
boundary (e.g. every Nth retry), or once per sweep as a rolled-up summary
("retried N messages this sweep") rather than one per row.
3. **Document the volume risk.** If per-retry logging is intentionally retained, add a note
to `Component-StoreAndForward.md` / `Component-SiteEventLogging.md` that a large buffered
backlog can dominate the capped site event log and evict other history, so operators
size the cap / retention accordingly.
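Option 2's rolled-up variant could be as small as the sketch below. `RetrySweepSummary` and its message text are hypothetical, and the sketch assumes a single sweep runs at a time, so the counter needs no synchronisation.

```csharp
using System;

public sealed class RetrySweepSummary
{
    private int _retried;

    // Called where the per-row "Retried" site event is raised today.
    public void RecordRetry() => _retried++;

    // Called once when the sweep finishes; emits at most one event per sweep.
    public void Flush(Action<string> emitSiteEvent)
    {
        if (_retried == 0) return;
        emitSiteEvent($"Retried {_retried} messages this sweep");
        _retried = 0;
    }
}
```

This bounds the site-event volume at one `"Retried"` row per sweep regardless of backlog size, at the cost of losing per-message detail in the site log.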
**Resolution**
Deferred 2026-06-20: the per-retry `"Retried"` site-event volume is spec-conformant (Component-StoreAndForward.md lists 'retried' as a logged activity) and the site log is capped, so the residual concern is history eviction, not disk exhaustion. Whether to drop/sample/threshold the event vs. document the volume is a design-vs-scale decision for the owner.
+200 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine` | | Module | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine` |
| Design doc | `docs/requirements/Component-TemplateEngine.md` | | Design doc | `docs/requirements/Component-TemplateEngine.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-28 | | Last reviewed | 2026-06-20 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` | | Commit reviewed | `4307c381` |
| Open findings | 1 | | Open findings | 0 |
## Summary ## Summary
@@ -87,6 +87,45 @@ the duplicate-Id / null-sentinel fix from the last batch, and audit-write
ordering inconsistency between `TemplateService` (logs then saves) and ordering inconsistency between `TemplateService` (logs then saves) and
`InstanceService` (saves then logs). `InstanceService` (saves then logs).
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Re-reviewed the whole module against all ten checklist categories at commit
`4307c381`. The large delta since `1eb6e97` (722 commits, including the
`ScadaLink → ZB.MOM.WW.ScadaBridge` rename, the Script Analysis #25 extraction,
native-alarm-source authoring/flatten, the read-only inheritance-resolve service,
List-attribute semantics, and Transport line-diff) holds up well. All 22 prior
findings remain closed and verified: the deploy gate now delegates to the
authoritative `ScriptTrustValidator` + `RoslynScriptCompiler` (real semantic
compile, no residual substring scan or hand-rolled brace counter in the security
path — `ScriptCompiler` and `ValidationService.CheckExpressionSyntax` both call
the shared analyzer); the recursive composition walk, per-slot alarm override,
duplicate-Id-tolerant `BuildLookup`, `int?`-parent walk, and the revision-hash /
diff `Description`+`Connections`+`ElementDataType` coverage are all in place and
test-guarded. The new `TemplateInheritanceResolver` (M9-T26a) mirrors the
flattener's inheritance precedence exactly (including HiLo per-setpoint merge),
verified by `Resolve_AgreesWithFlatteningService_*` tests. Three new findings
surfaced — all lock/audit asymmetries where a newer feature did not pick up an
invariant an older sibling already enforces: native-alarm-source instance
overrides bypass the lock at flatten time (the unfixed-for-native sibling of the
closed TemplateEngine-008; `ResolvedNativeAlarmSource` carries no `IsLocked`
flag), the four `TemplateFolderService` mutators stage their audit row *after*
the final `SaveChangesAsync` so the row is never persisted, and a dead/misnamed
`SemanticValidator.ParseParameterDefinitions` legacy branch returns types not
names.
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | New: native-source instance override applies with no lock check (TemplateEngine-023). Flattening inherit/compose/override precedence + recursion verified correct for attrs/alarms/scripts/native-sources; HiLo per-setpoint merge consistent with resolver. |
| 2 | Akka.NET conventions | ✓ | No actors. `AddTemplateEngineActors` is still an empty placeholder. Nothing to assess. |
| 3 | Concurrency & thread safety | ✓ | Services stateless/scoped; static helpers hold no mutable state. No new findings. |
| 4 | Error handling & resilience | ✓ | `Result<T>` used consistently; `Flatten` wraps in try/catch; JSON parse failures fall back safely. No store-and-forward/failover surface. |
| 5 | Security | ✓ | Deploy gate is now authoritative via Script Analysis #25 (`ScriptTrustValidator.FindViolations` + `RoslynScriptCompiler.Compile`) — alias/`using static`/`global::` bypasses caught by symbol resolution; no residual substring scan in the security path. No new findings. |
| 6 | Performance & resource management | ✓ | TemplateEngine-009 N+1 fix holds (`Compositions` navigation reused). `JsonDocument`/streams disposed. No new leaks. |
| 7 | Design-document adherence | ✓ | Native-alarm-source lock rule ("Lock applies to the entire source") not enforced in flattening (TemplateEngine-023). Otherwise the doc matches: revision-hash/diff payload coverage, semantic validation set, acyclicity, derived-naming all align. |
| 8 | Code organization & conventions | ✓ | New: `TemplateFolderService` Create/Rename/Move/Reorder log after the final save → audit row never persisted (TemplateEngine-024), asymmetric with `TemplateService` (which always saves after `LogAsync`). |
| 9 | Testing coverage | ✓ | Strong overall (inheritance-resolver agrees-with-flattener tests, hash determinism test). Gaps: no test for a locked native source rejecting an instance override; folder tests mock `IAuditService` so the never-saved audit row is invisible; `ParseParameterDefinitions` legacy branch untested. |
| 10 | Documentation & comments | ✓ | New: `ParseParameterDefinitions` is documented as returning parameter NAMES but its legacy-array branch returns the type field (TemplateEngine-025). XML docs otherwise accurate and thorough. |
## Checklist coverage ## Checklist coverage
_Re-review (2026-05-17, `39d737e`):_ _Re-review (2026-05-17, `39d737e`):_
@@ -1191,3 +1230,161 @@ false rejected, false→true / true→true / false→false accepted) and
`TemplateServiceTests.UpdateAttribute_LockedInDerivedDowngrade_OnBase_Rejected` `TemplateServiceTests.UpdateAttribute_LockedInDerivedDowngrade_OnBase_Rejected`
(end-to-end on the attribute update path). (end-to-end on the attribute update path).
### TemplateEngine-023 — Instance native-alarm-source override bypasses the source lock during flattening
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs:775`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Flattening/FlattenedConfiguration.cs:137` |
**Description**
`ApplyInstanceNativeAlarmSourceOverrides` applies a per-instance native-alarm-source
override unconditionally — it looks the binding up by canonical name and, if found,
overlays the override's non-null fields with **no lock check**:
```csharp
foreach (var ovr in overrides)
{
if (!sources.TryGetValue(ovr.SourceCanonicalName, out var existing))
continue; // Cannot add new bindings via overrides.
sources[ovr.SourceCanonicalName] = existing with { ConnectionName = ..., ... };
}
```
Contrast `ApplyInstanceOverrides` (attributes, `:309`: `if (existing.IsLocked) continue;`)
and `ApplyInstanceAlarmOverrides` (alarms, `:337`: `if (existing.IsLocked) continue;`).
The design (`Component-TemplateEngine.md` Override Granularity, Native alarm sources:
*"Lock applies to the entire source"*) requires the same treatment. The root cause is
in Commons: `ResolvedNativeAlarmSource` (`FlattenedConfiguration.cs:137`) carries **no
`IsLocked` field**, so the flattener has nothing to gate on even if it wanted to —
unlike `ResolvedAttribute`/`ResolvedAlarm`, which both expose `IsLocked`. The
inheritance-side `ResolveInheritedNativeAlarmSources` does track a base lock (via the
`lockedNames` set) so a *derived template* cannot override a base-locked source, but
that lock state is dropped on the way out and the *instance* layer is unguarded.
This is the same defect class as the closed **TemplateEngine-008** (which fixed the
identical instance-override lock-bypass for alarms), never applied to native alarm
sources because they were introduced after that finding. The companion gap lives in
`ManagementService.HandleSetInstanceNativeAlarmSourceOverride`
(`ManagementActor.cs:868`), which — unlike `InstanceService.SetAlarmOverrideAsync` /
`SetAttributeOverrideAsync` — performs neither an existence check (an override for an
unknown source name is silently persisted as a dead record) nor a lock check before
writing the override row straight through the repository, bypassing `InstanceService`
entirely. Blast radius is limited (native alarms are a read-only mirror — no write-back
to the source), but the documented lock invariant is not enforced anywhere on this path.
**Recommendation**
Add an `IsLocked` field to `ResolvedNativeAlarmSource` (Commons) and carry it through
`ResolveInheritedNativeAlarmSources` / `ResolveComposedNativeAlarmSources`; then gate
`ApplyInstanceNativeAlarmSourceOverrides` on `if (existing.IsLocked) continue;`,
mirroring the attribute/alarm paths. Add the matching existence + lock guard to
`HandleSetInstanceNativeAlarmSourceOverride` (resolve the effective native-source set
via `TemplateResolver`/flattening as `SetAlarmOverrideAsync` does). Add a flattening
regression test (`Flatten_LockedNativeSource_InstanceOverrideIgnored`) and a management
test rejecting an override of a locked / unknown source.
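A minimal sketch of the lock gate recommended above, under stated assumptions: the two record shapes are hypothetical stand-ins (the real `ResolvedNativeAlarmSource` has no `IsLocked` field today, and the real override type may carry more fields), and `NativeAlarmSourceOverrideSketch.Apply` is an illustrative name, not the actual flattener method.

```csharp
using System.Collections.Generic;

// Hypothetical shape: IsLocked is the field the recommendation proposes adding.
public sealed record ResolvedNativeAlarmSource(
    string CanonicalName, string ConnectionName, bool IsLocked);

// Hypothetical override shape; a null field means "keep the inherited value".
public sealed record NativeAlarmSourceOverride(
    string SourceCanonicalName, string? ConnectionName);

public static class NativeAlarmSourceOverrideSketch
{
    public static void Apply(
        IDictionary<string, ResolvedNativeAlarmSource> sources,
        IEnumerable<NativeAlarmSourceOverride> overrides)
    {
        foreach (var ovr in overrides)
        {
            if (!sources.TryGetValue(ovr.SourceCanonicalName, out var existing))
                continue; // Overrides cannot add new bindings.

            if (existing.IsLocked)
                continue; // Lock applies to the entire source.

            // Overlay only the non-null override fields.
            sources[ovr.SourceCanonicalName] = existing with
            {
                ConnectionName = ovr.ConnectionName ?? existing.ConnectionName
            };
        }
    }
}
```

With this gate, an override aimed at a locked or unknown source leaves the resolved set unchanged, which is the behaviour the attribute and alarm paths already have.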
**Resolution**
Deferred 2026-06-20: the native-alarm-source override lock-bypass fix spans Commons (`ResolvedNativeAlarmSource` has no `IsLocked` field to gate on) and ManagementService (the override handler skips the existence/lock checks), and the blast radius is limited (native alarms are read-only mirrors, no write-back). Recorded as a cross-module follow-up.
### TemplateEngine-024 — `TemplateFolderService` mutators stage their audit row after the final save, so it is never persisted
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:58`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:91`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:154`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:204` |
**Description**
`IAuditService.LogAsync` only *stages* the audit row in the EF change tracker — the
implementation (`ConfigurationDatabase/Services/AuditService.cs:61`) ends with
`AddAsync(entry)` and explicitly documents *"caller is responsible for calling
SaveChangesAsync to ensure atomicity."* Four of the five `TemplateFolderService`
mutators call `LogAsync` **after** their final `SaveChangesAsync` and never save again:
- `CreateFolderAsync` (`:56-58`): `AddFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save).
- `RenameFolderAsync` (`:89-91`): `UpdateFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save).
- `MoveFolderAsync` (`:152-154`): `UpdateFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save).
- `ReorderFolderAsync` (`:201-204`): `Update`×2 → `SaveChangesAsync` → `LogAsync` (no save).
`ManagementActor.ProcessCommand` (`ManagementActor.cs:130`) wraps each command in a DI
scope, dispatches, and disposes the scope with no outer `SaveChangesAsync`; EF Core does
**not** flush on dispose. The folder operation itself persists (it was saved *before* the
`LogAsync`), but the staged `TemplateFolder` audit row is silently discarded when the
scoped `DbContext` is disposed. Every folder Create / Rename / Move / Reorder therefore
leaves **no audit trail**, even though `_auditService.LogAsync(...)` is called and reads
as if it does. (`DeleteFolderAsync` has the same shape but the lost row is the *Delete*
record specifically.) This is asymmetric with `TemplateService`, where every mutator
ends with `LogAsync` → `SaveChangesAsync` (e.g. `:225-226`). The folder tests mock
`IAuditService` and the repository, so they never exercise the real change-tracker
lifecycle and the gap is invisible to them.
**Recommendation**
Add a final `await _repository.SaveChangesAsync(cancellationToken);` after the `LogAsync`
in all five folder mutators (matching `TemplateService`), or — cleaner — reorder each to
mutate → log → single `SaveChangesAsync` so the entity change and its audit row commit
atomically. Add a test that uses a real (in-memory) `AuditService` + `DbContext` and
asserts an `AuditLogEntry` row exists after a folder create/rename/move/reorder.
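The ordering problem can be illustrated with a toy model. This is a sketch only: `ToyUnitOfWork` is a hypothetical stand-in for a scoped change tracker, not EF Core and not the project's repository, and the row names are placeholders. It shows why a row staged after the final save is lost on dispose, while mutate → log → single save commits both rows together.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-in for a scoped unit of work (hypothetical; NOT EF Core).
public sealed class ToyUnitOfWork : IDisposable
{
    private readonly List<string> _staged = new();
    public List<string> Committed { get; } = new();

    // Staging only records intent; nothing is durable yet.
    public void Stage(string row) => _staged.Add(row);

    // Saving moves every staged row into the committed set.
    public void SaveChanges()
    {
        Committed.AddRange(_staged);
        _staged.Clear();
    }

    // Disposing drops whatever is still staged; it does not flush.
    public void Dispose() => _staged.Clear();
}

public static class FolderAuditOrderingSketch
{
    // The ordering the finding describes: save, then log, no further save.
    public static List<string> LogAfterFinalSave()
    {
        var uow = new ToyUnitOfWork();
        using (uow)
        {
            uow.Stage("folder");
            uow.SaveChanges();
            uow.Stage("audit"); // staged, never saved
        }
        return uow.Committed;
    }

    // The cleaner ordering: mutate, log, then one save for both rows.
    public static List<string> LogThenSingleSave()
    {
        var uow = new ToyUnitOfWork();
        using (uow)
        {
            uow.Stage("folder");
            uow.Stage("audit");
            uow.SaveChanges();
        }
        return uow.Committed;
    }
}
```

In the toy model `LogAfterFinalSave` commits only the folder row, while `LogThenSingleSave` commits both. A test against a real in-memory `DbContext`, as recommended above, would be the authoritative check.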
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added `await _repository.SaveChangesAsync(...)` after `LogAsync` in all five `TemplateFolderService` mutators (Create/Rename/Move/Reorder/Delete), so folder-mutation audit rows are now persisted (previously staged then discarded with the scope). Matches the `TemplateService` two-save pattern. Tests added.
### TemplateEngine-025 — `SemanticValidator.ParseParameterDefinitions` legacy-array branch returns types, not names; method is dead production code
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Validation/SemanticValidator.cs:573` |
**Description**
`ParseParameterDefinitions` is documented and named as returning the declared parameter
**names**:
```csharp
/// ... returns the declared parameter names.
internal static List<string> ParseParameterDefinitions(string? parameterDefinitionsJson)
```
Its JSON-Schema branch correctly returns `props.EnumerateObject().Select(p => p.Name)`,
but its legacy flat-array branch returns the `type` field:
```csharp
else if (doc.RootElement.ValueKind == JsonValueKind.Array)
{
return doc.RootElement.EnumerateArray()
.Select(e => e.TryGetProperty("type", out var t) ? t.GetString() ?? "unknown" : "unknown")
.ToList();
}
```
This is a copy-paste from the sibling `ParseParameterTypes` (`:491`, which legitimately
returns types) — so for legacy `[{name,type}]` definitions this method returns the wrong
field (types instead of names). The method has **no non-test caller** in `src/` (only its
own `SemanticValidatorTests`), and the test exercises only the JSON-Schema branch, so the
buggy legacy branch is both dead and untested. The live param-count/type validation path
uses `ParseParameterTypes`, not this method, so there is no behavioural impact today — but
the method is `internal` (callable from the rest of the module) and silently returns
mislabelled data for one input shape.
**Recommendation**
Delete `ParseParameterDefinitions` (and its tests) as dead code, or — if it is meant to
back a future feature — fix the legacy branch to return `e.TryGetProperty("name", ...)`
and add a legacy-array test asserting names are returned.
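If the fix-rather-than-delete option were chosen, the corrected legacy branch might look like the sketch below. The assumptions are labelled: `ParseParameterNames` and its containing class are hypothetical names, the JSON-Schema shape is assumed to keep parameters under a top-level `properties` object, and the legacy shape is assumed to be `[{name, type}]` as described above. Only `System.Text.Json` calls are used.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public static class ParameterNameParsingSketch
{
    // Returns declared parameter NAMES for both input shapes.
    // Invalid JSON still throws JsonException from JsonDocument.Parse.
    public static List<string> ParseParameterNames(string? parameterDefinitionsJson)
    {
        if (string.IsNullOrWhiteSpace(parameterDefinitionsJson))
            return new List<string>();

        using var doc = JsonDocument.Parse(parameterDefinitionsJson);
        var root = doc.RootElement;

        // Assumed JSON-Schema shape: { "properties": { "<name>": { ... } } }
        if (root.ValueKind == JsonValueKind.Object
            && root.TryGetProperty("properties", out var props)
            && props.ValueKind == JsonValueKind.Object)
        {
            return props.EnumerateObject().Select(p => p.Name).ToList();
        }

        // Legacy flat array: read "name", not "type".
        if (root.ValueKind == JsonValueKind.Array)
        {
            return root.EnumerateArray()
                .Select(e => e.ValueKind == JsonValueKind.Object
                             && e.TryGetProperty("name", out var n)
                             && n.ValueKind == JsonValueKind.String
                    ? n.GetString() ?? "unknown"
                    : "unknown")
                .ToList();
        }

        return new List<string>();
    }
}
```

The `ValueKind` checks before `TryGetProperty` and `GetString` matter because both throw `InvalidOperationException` when called on an element of the wrong kind.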
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): deleted the dead, misnamed `SemanticValidator.ParseParameterDefinitions` (no production caller per grep; its only references were two tests, which asserted the buggy type-not-name output). Tests removed.