From d39089f4eda6a6c1d499355f3d78e64baadac377 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Sat, 20 Jun 2026 18:02:32 -0400 Subject: [PATCH] =?UTF-8?q?docs(code-review):=20full=20review=20at=204307c?= =?UTF-8?q?381=20=E2=80=94=2018=20modules,=2067=20findings=20recorded=20+?= =?UTF-8?q?=20remediation=20tracked?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total). --- code-reviews/AuditLog/findings.md | 229 +++++++++- .../ClusterInfrastructure/findings.md | 101 ++++- code-reviews/Communication/findings.md | 140 +++++- code-reviews/DataConnectionLayer/findings.md | 243 ++++++++++- code-reviews/DeploymentManager/findings.md | 208 ++++++++- .../ExternalSystemGateway/findings.md | 199 ++++++++- code-reviews/HealthMonitoring/findings.md | 152 ++++++- code-reviews/Host/findings.md | 206 ++++++++- code-reviews/InboundAPI/findings.md | 232 +++++++++- code-reviews/KpiHistory/findings.md | 307 +++++++++++++ code-reviews/NotificationService/findings.md | 152 ++++++- code-reviews/README.md | 61 +-- code-reviews/ScriptAnalysis/findings.md | 407 ++++++++++++++++++ code-reviews/Security/findings.md | 225 +++++++++- code-reviews/SiteCallAudit/findings.md | 241 ++++++++++- code-reviews/SiteEventLogging/findings.md | 298 ++++++++++++- code-reviews/SiteRuntime/findings.md | 274 +++++++++++- code-reviews/StoreAndForward/findings.md | 222 +++++++++- code-reviews/TemplateEngine/findings.md | 203 ++++++++- 19 files changed, 4031 insertions(+), 69 deletions(-) create mode 100644 code-reviews/KpiHistory/findings.md create mode 100644 code-reviews/ScriptAnalysis/findings.md diff --git a/code-reviews/AuditLog/findings.md b/code-reviews/AuditLog/findings.md index a871b435..8a86b5f0 100644 --- a/code-reviews/AuditLog/findings.md +++ b/code-reviews/AuditLog/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.AuditLog` | | Design doc | `docs/requirements/Component-AuditLog.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | +| Commit reviewed | `4307c381` | | Open findings | 0 | ## Summary @@ -59,6 +59,231 @@ chain doesn't reject a central composition root that mistakenly calls the site b ## Findings +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +Since `1eb6e97` the module was renamed (`ScadaLink → ZB.MOM.WW.ScadaBridge`) and substantially +re-architected: the site SQLite store is now a two-table canonical design (`audit_event` +append-only + `audit_forward_state` sidecar with a precomputed `IsCachedKind`); the payload +filter became the `IAuditRedactor` seam (`ScadaBridgeAuditRedactor` + `SafeDefaultAuditRedactor` ++ stateless `AuditRedactionPrimitives`); and milestone work landed (M5.3 T7 inbound response +capture, M5.5 T3 per-channel retention, K8 KPI source, central health snapshot). The module +remains one of the best-engineered in the tree: the AuditLog-001..011 fixes from the last pass +are all present and intact (dual read connection, async scopes, per-EventId reconciliation +retry escape valve, thread-pool-hop `Dispose`, lifecycle CTS), the best-effort "never abort the +action" contract is honoured throughout, and redaction over-redacts on fault. This pass found +**no Critical/High** issues; two design-adherence drifts where documented per-target/purge +config knobs are silently inert, one error-handling asymmetry that lets a transient DI fault +restart the ingest singleton, and two Low items. Five new Open findings (AuditLog-012..016). + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | Yes | `IsErrorRow` case-sensitive parse is correct (Status is `enum.ToString()` PascalCase). `IsCachedKind` parse falls back to `InboundRequest` (non-cached) on bad JSON — safe. `OnIngestAsync` scope-resolution throw is uncaught (AuditLog-014). | +| 2 | Akka.NET conventions | Yes | Sender/EventStream captured before first await across all three central actors; Tell for hot path, Ask only at the gRPC/ClusterClient boundary; supervisor-strategy comments now correctly credit per-row catch (AuditLog-002 fix intact). No findings. | +| 3 | Concurrency & thread safety | Yes | WAL dual-connection + `_readLock`/`_writeLock` split intact; `Interlocked` counters; `_drainGate` serialises ring drain. `WriteAsync` fast path ignores `ct` when `TryWrite` succeeds (AuditLog-015). | +| 4 | Error handling & resilience | Yes | Best-effort contract honoured everywhere; gRPC pull collapses tolerable faults to empty; reconciliation retry escape valve (AuditLog-004) intact. `OnIngestAsync` lacks the scope-creation try/catch its sibling `OnCachedTelemetryAsync` has (AuditLog-014). | +| 5 | Security | Yes | Header/body/SQL-param redaction + over-redact-on-fault safety net solid; `SafeDefaultAuditRedactor` fallback wired at all three writer sites (AuditLog-008 fix intact); SQL built with bound parameters only. No findings. | +| 6 | Performance & resource management | Yes | Hot path batched + back-pressured; backlog scan off the write lock; partition switch metadata-only; per-channel DELETE batched + clamped; gRPC channel cache race-safe. No findings. | +| 7 | Design-document adherence | Yes | Per-target `CapBytes` override is documented + has a config property but is read NOWHERE (AuditLog-012); purge config section/key drift from the doc's `AuditLogPurge`/`ChannelPurgeBatchSize` (AuditLog-013). Combined-telemetry transport (AuditLog-001) now wired. `SkipBodyCapture` honoured in InboundAPI middleware (out of module). | +| 8 | Code organization & conventions | Yes | Composition root well-segmented with "safe from any root" invariant; `INodeIdentityProvider` standardised on `GetRequiredService` (AuditLog-007 fix intact); idempotency sentinel on the health bridge (AuditLog-011 fix intact). No new findings. | +| 9 | Testing coverage | Yes | Broad (~13k lines). Gaps: no test asserts per-target `CapBytes` is applied (because it isn't — AuditLog-012), and no test binds `AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc (would catch AuditLog-013). | +| 10 | Documentation & comments | Yes | Mostly excellent and accurate. `AuditLogIngestActor` class XML has a duplicated "the central-side" phrase and several stale "Bundle X" milestone tags that mislead (AuditLog-016). | + +### AuditLog-012 — Per-target `CapBytes` override is documented and bindable but never read by the redactor + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Redaction/ScadaBridgeAuditRedactor.cs:185-196`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/PerTargetRedactionOverride.cs:13` | + +**Description** + +The design (Component-AuditLog.md §"Payload Capture Policy" / §"Configuration") documents a +per-target payload-cap override and gives it as a worked example: +`"PerTargetOverrides": { "Weather/GetForecast": { "CapBytes": 4096 } }`. The +`PerTargetRedactionOverride.CapBytes` property exists and its XML doc promises "Optional payload +cap override (bytes); null inherits the global cap." But `ScadaBridgeAuditRedactor.SelectCap` +chooses among `InboundMaxBytes` / `ErrorCapBytes` / `DefaultCapBytes` only — it never consults +`opts.PerTargetOverrides[target].CapBytes`. A repo-wide grep confirms `CapBytes` (the per-target +property) is read in zero production code paths. An operator who sets a per-target cap to bound a +chatty target's payloads gets the global `DefaultCapBytes`/`ErrorCapBytes` instead, with no error +and no log — the knob is silently inert. This is a code-vs-spec drift: either the feature was +dropped in the redactor re-architecture (C2/C3) without updating the doc + override XML, or it +was never implemented. The `AdditionalBodyRedactors` and `RedactSqlParamsMatching` per-target +fields ARE honoured, which makes the silent `CapBytes` omission especially easy to miss. + +**Recommendation** + +Either (a) implement it: in `SelectCap`, when `category != ApiInbound` and the resolved +`PerTargetOverrides` entry for `rawEvent.Target` has a non-null `CapBytes`, use +`Math.Min`/`Max(over.CapBytes.Value, …)` per the intended semantics (the doc says it *overrides* +the global cap — clarify whether error rows still get the larger of the two), and add a +`ScadaBridgeAuditRedactorTests` case asserting a per-target cap shortens a payload the global cap +would not; or (b) if the feature is intentionally dropped, delete `PerTargetRedactionOverride.CapBytes`, +remove the `CapBytes` example from the Component-AuditLog.md Configuration block, and note the +removal in the design doc so the override surface stops advertising a no-op. + +**Resolution** + +Deferred 2026-06-20: whether to implement per-target `CapBytes` in `SelectCap` or delete the unused property + doc example is a design-owner decision (was it intentionally dropped in the redactor re-architecture, or lost?). Recorded; no change this pass. + +### AuditLog-013 — Purge config section + key drift: documented `AuditLogPurge`/`ChannelPurgeBatchSize` does not bind + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/ServiceCollectionExtensions.cs:56,370-371`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeOptions.cs:40,47` | + +**Description** + +Component-AuditLog.md §"Retention & Purge" / §"Configuration" documents the purge tuning as a +top-level section named `AuditLogPurge` with keys `IntervalHours` and `ChannelPurgeBatchSize`: + +```jsonc +"AuditLogPurge": { "IntervalHours": 24, "ChannelPurgeBatchSize": 5000 } +``` + +The code does not match on either axis. (1) `ServiceCollectionExtensions.PurgeSectionName` is +`"AuditLog:Purge"` — a nested subsection, not the documented top-level `AuditLogPurge`. (2) The +batch-size property is `AuditLogPurgeOptions.ChannelPurgeBatchSizeConfigured`, so its bind key is +`ChannelPurgeBatchSizeConfigured`, not the documented `ChannelPurgeBatchSize` (which is a +computed read-only property the binder ignores). There is no `[ConfigurationKeyName]` attribute +to reconcile the names. The net effect: an operator who follows the design doc and writes +`"AuditLogPurge": { "ChannelPurgeBatchSize": 1000 }` has BOTH values silently ignored — the +wrong section path means `IntervalHours` falls back to the 24 h default, and the property-name +mismatch means the channel batch size falls back to 5000. The actor still functions on defaults, +so nothing crashes, but the documented operator-facing tuning surface is inert. No test binds +`AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc, so the drift is unguarded. +(The §407 doc text says `AuditLogPurge:ChannelPurgeBatchSize`, agreeing with the JSON block and +disagreeing with the code on both section and key.) + +**Recommendation** + +Pick one canonical shape and make code, doc, and a binding test agree. Recommended: keep the +nested `AuditLog:Purge` section (consistent with `AuditLog:SiteWriter` / `AuditLog:SiteTelemetry` +/ `AuditLog:Reconciliation`), but then fix the doc's JSON block + §407 to read +`"AuditLog": { "Purge": { "IntervalHours": …, "ChannelPurgeBatchSize": … } }`, AND add +`[ConfigurationKeyName("ChannelPurgeBatchSize")]` to `ChannelPurgeBatchSizeConfigured` (or rename +it and move the clamp into the `Interval`-style computed getter pattern) so the documented key +binds. Add an `AuditLogOptionsBindingTests`-style test that binds the purge section from an +in-memory config and asserts both `IntervalHours` and the channel batch size take effect. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added `[ConfigurationKeyName("ChannelPurgeBatchSize")]` so the documented key binds, reconciled the design doc to the real nested `AuditLog:Purge` section, and added a binding test proving the purge options bind from the documented keys. + +### AuditLog-014 — `AuditLogIngestActor.OnIngestAsync` does not guard scope/repository resolution, so a transient DI fault restarts the singleton and drops the reply + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:135-148` | + +**Description** + +`OnIngestAsync` opens `await using var scope = _serviceProvider!.CreateAsyncScope();` and +`scope.ServiceProvider.GetRequiredService()` with NO surrounding try/catch. +The per-row try/catch lives inside `IngestWithRepositoryAsync`, so it only protects the insert +loop — a throw from scope creation or from `GetRequiredService` (a `DbContext` factory throw on +transient SQL-connection exhaustion, a pooled-context init fault, a DI resolution race during +host churn) propagates straight out of the message handler. When a `ReceiveActor`'s async handler +faults, Akka applies the parent's supervision and restarts the singleton; the captured +`replyTo.Tell(...)` at line 147 never runs, so the site's `Ask` times out +and (correctly) retries — but the singleton has been bounced over a transient fault. The class +XML even claims "that per-row catch is what keeps this actor alive across handler throws" — which +is true for insert throws but false for the scope-resolution throw this handler leaves uncaught. +The sibling `OnCachedTelemetryAsync` (line 216) wraps its entire scope-creation-plus-loop body in +a try/catch and logs + replies-with-partial precisely to avoid this; `OnIngestAsync` is the +asymmetric one. `AuditLogPurgeActor.OnTickAsync` and `SiteAuditReconciliationActor.OnTickAsync` +both guard their `GetRequiredService` with try/catch too — `OnIngestAsync` is the outlier. + +**Recommendation** + +Wrap the scope-creation + repository-resolution in `OnIngestAsync` in a try/catch that mirrors +`OnCachedTelemetryAsync`: on failure, log the resolution fault and `replyTo.Tell(new +IngestAuditEventsReply(accepted))` (accepted will be whatever was processed before the throw, +typically empty) so the site keeps its rows `Pending` and retries on the next drain without +bouncing the singleton. Optionally bump the `ICentralAuditWriteFailureCounter` on that path too, +so a sustained DI/connection fault is visible on the dashboard. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): `OnIngestAsync` now wraps the scope-creation + repo resolution + ingest in a try/catch that logs, best-effort bumps the write-failure counter, and still `replyTo.Tell(IngestAuditEventsReply(accepted))` — a transient DI/DbContext fault no longer bounces the central ingest singleton or drops the reply (mirrors `OnCachedTelemetryAsync`). + +### AuditLog-015 — `SqliteAuditWriter.WriteAsync` ignores the cancellation token on the fast (`TryWrite`) path + +| | | +|--|--| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:254-275` | +| Status | Deferred | + +**Description** + +`WriteAsync(AuditEvent, CancellationToken ct)` accepts a `ct` but, on the common fast path where +`_writeQueue.Writer.TryWrite(pending)` succeeds, it returns `pending.Completion.Task` without ever +observing `ct`. The token is only honoured on the slow path (`WriteSlowPathAsync`, which passes it +to `WriteAsync`). So a caller that passes an already-cancelled (or soon-cancelled) token still +enqueues the row and then awaits an uncancellable completion. In practice the hot-path callers +(`FallbackAuditWriter`, `CachedCallTelemetryForwarder`) thread a real `ct` through, and the audit +contract is best-effort durable-in-microseconds, so the impact is minor — the row simply gets +written instead of cancelled. But the signature advertises cancellation it does not deliver on the +dominant path, which is a latent surprise for any future caller that relies on it (e.g. to bound a +shutdown drain). The completion `Task` returned is not linked to `ct`, so an awaiter cannot abandon +the wait either. + +**Recommendation** + +Either (a) honour the token on the fast path — `ct.ThrowIfCancellationRequested()` before +`TryWrite`, and return a `ct`-linked wait (e.g. `pending.Completion.Task.WaitAsync(ct)`) so an +awaiter can abandon a stuck completion; or (b) if cancellation is intentionally a no-op for this +best-effort hot path, drop the `ct` parameter (or document on the method that it is observed only +on the back-pressured slow path) so the surface stops advertising behaviour it does not provide. + +**Resolution** + +Deferred 2026-06-20: whether cancellation should be honoured on the best-effort `TryWrite` hot path (vs. intentionally a no-op) is a design decision. Recorded; no change this pass. + +### AuditLog-016 — `AuditLogIngestActor` class XML has a duplicated phrase and stale "Bundle"/`IngestedAtUtc`-column wording + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:13-46` | + +**Description** + +The `AuditLogIngestActor` class summary reads "Each row is stamped with the central-side **the +central-side IngestedAtUtc**" — the phrase "the central-side" is duplicated, a copy/paste slip +that survived the rename. The same summary describes `IngestedAtUtc` as if it were a promoted +column, but post-C3 it is a `DetailsJson` field stamped via `AuditRowProjection.WithIngestedAtUtc` +(as the handler bodies and their own inline comments correctly note) — the class-level prose is +now stale relative to the canonical-record shim. The remarks also still narrate the design in +"Bundle A / Bundle D / Bundle E" milestone terms (and the two-constructor rationale references +"Bundle D's tests" / "Bundle E's host wiring"), which no longer map to any current artifact and +read as archaeology to a new maintainer. None of this is wrong behaviourally, but the duplicated +phrase plus the column-vs-DetailsJson mismatch is exactly the kind of doc rot that misleads. + +**Recommendation** + +Fix the duplicated "the central-side the central-side" to a single phrase; reword the +`IngestedAtUtc` sentence to say it is stamped into `DetailsJson` (matching the handler comments); +and either drop the "Bundle X" tags in favour of the feature/milestone names used elsewhere in +the module, or add a one-line legend. A quick sweep of the other central actors' class XML for +the same "Bundle" archaeology would be worthwhile while in there. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): class XML doc deduped ('the central-side' duplicate removed), `IngestedAtUtc` reworded to 'stamped into DetailsJson (no promoted column)', and the stale Bundle milestone tags replaced with role-neutral wording. + ### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production | | | diff --git a/code-reviews/ClusterInfrastructure/findings.md b/code-reviews/ClusterInfrastructure/findings.md index ee22058a..e1f343d0 100644 --- a/code-reviews/ClusterInfrastructure/findings.md +++ b/code-reviews/ClusterInfrastructure/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure` | | Design doc | `docs/requirements/Component-ClusterInfrastructure.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 1 | +| Commit reviewed | `4307c381` | +| Open findings | 0 | ## Summary @@ -82,6 +82,32 @@ either did not surface or that have aged into the file: but the method now offers nothing — keeping it is API-surface noise that an IDE will still suggest via auto-complete. +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +The module remains small and clean — `ClusterOptions`, `ClusterOptionsValidator`, +and `ServiceCollectionExtensions`, all well-documented and well-tested. The +split-brain-resolver, seed-node, and failure-detector configuration contract +matches the design doc, and all fourteen prior findings (CI-001..014) still hold +as resolved: `DownIfAlone` is now consumed by the Host's HOCON builder, the dead +`AddClusterInfrastructureActors` surface is gone, and the validator enforces every +catastrophic value. The only new observation is one Low code-organization item — the +validator hand-rolls every rule via raw `RequireThat` rather than the base's +`MinCount`/`OneOf`/`PositiveTimeSpan` primitives (CI-015), a deliberate trade-off +flagged as a judgment call rather than a clear defect. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | Validator rules and DI registration are correct. No new defects. | +| 2 | Akka.NET conventions | ✓ | No actors in this module (legitimate, per CI-001 resolution). Nothing actor-shaped to evaluate. | +| 3 | Concurrency & thread safety | ✓ | Validator and DI extension remain stateless. No issues. | +| 4 | Error handling & resilience | ✓ | Validator rejects every catastrophic value the design doc enumerates, now including `DownIfAlone = false` (CI-010) and `SeedNodes.Count < 2` (CI-012). No issues. | +| 5 | Security | ✓ | No authn/authz surface, no secret handling, no remoting transport configured here. No issues. | +| 6 | Performance & resource management | ✓ | No resources held; validator allocates a small failure list per call only. No issues. | +| 7 | Design-document adherence | ✓ | `ClusterOptions` contract complete; SBR/seed/failure-detector config matches the design doc. No new drift. | +| 8 | Code organization & conventions | ✓ | Options/validator placement and Options pattern correct; CI-014 dead surface removed. New — validator hand-rolls every rule via raw `RequireThat` instead of the `OptionsValidatorBase` primitives `MinCount`/`OneOf`/`PositiveTimeSpan` (CI-015, a judgment call). | +| 9 | Testing coverage | ✓ | 19 tests across three classes covering options, validator (incl. single-seed and `DownIfAlone=false`), and DI registration. No gaps. | +| 10 | Documentation & comments | ✓ | XML docs accurate across all source files; CI-013 intent comment present. No issues. | + ## Checklist coverage Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17, @@ -857,3 +883,72 @@ explicitly stating that this project exposes no actor-registration extension (actor wiring lives in `ZB.MOM.WW.ScadaBridge.Host`). If the user prefers to keep the "fail-fast" trap, mark the method `[Obsolete(true, error: true)]` so the compiler — not the runtime — rejects the call. + +### ClusterInfrastructure-015 — Validator hand-rolls every rule via raw `RequireThat` instead of the base primitives + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/ClusterOptionsValidator.cs:29-57` | + +**Description** + +`ClusterOptionsValidator` derives from `OptionsValidatorBase` +(`ZB.MOM.WW.Configuration`), whose `ValidationBuilder` exposes a set of typed +validation primitives — `MinCount` ("requires a collection with at least N +items"), `OneOf` ("requires the value to be one of `allowed`, case-insensitive"), +`PositiveTimeSpan` ("requires a strictly positive duration"), plus `Required`, +`Port`, and `HostPort` — alongside the lower-level `RequireThat(bool, message)` +escape hatch for "custom or cross-field rules". `ClusterOptionsValidator.Validate` +expresses *every* one of its rules through raw `RequireThat`, even where a +first-class primitive exists for the exact check: + +- the `SeedNodes.Count >= 2` rule (29-33) is what `MinCount(options.SeedNodes, 2, …)` does; +- the `keep-oldest`-only strategy rule (35-39) is what `OneOf(options.SplitBrainResolverStrategy, AllowedStrategies, …)` does; +- the three `> TimeSpan.Zero` timing rules (45-52) are each what `PositiveTimeSpan(…)` does. + +Only the `MinNrOfMembers == 1` equality check, the cross-field +`HeartbeatInterval < FailureDetectionThreshold` comparison, and the `DownIfAlone` +boolean (59-63) genuinely require `RequireThat` (no primitive covers them). Routing +the three primitive-eligible rules through `RequireThat` makes this validator +inconsistent with the *intent* of the shared base (the primitives exist precisely so +common rules read uniformly across the codebase's validators) and risks subtle +wording drift between modules for identical checks. + +This is **explicitly a deliberate trade-off, not a clear win**, and is filed as a +judgment call rather than a defect. Adopting the primitives would **lose** what makes +the current messages valuable: each hand-rolled message cites the governing design-doc +section (`Component-ClusterInfrastructure.md → Node Configuration` / `→ Split-Brain +Resolution`) and spells out the catastrophic consequence of the misconfiguration +("would risk a total cluster shutdown on a partition", "blocks the cluster singleton +after failover and halts all data collection", "the oldest node can run as an isolated +single-node cluster"). The shared primitives emit terse, generic messages +("requires a collection with at least 2 items") that cannot carry that +safety-critical context. The peer `AuditLogOptionsValidator` likewise prefers raw +`RequireThat` for the same reason, so the current style is at least internally +consistent. + +**Recommendation** + +Treat this as a conscious choice between two competing goods rather than assuming a +fix is warranted: + +- **Adopt the primitives** (`MinCount`/`OneOf`/`PositiveTimeSpan` for the three + eligible rules; keep `RequireThat` for the equality, cross-field, and boolean + checks) if codebase-wide *wording consistency* across validators is the priority — + accepting that the design-doc citations and consequence explanations move out of the + failure message (e.g. into XML docs or an appended suffix). +- **Keep as-is** if the *richer, safety-critical failure messages* are the priority + for a module whose misconfiguration is cluster-fatal — accepting the minor + inconsistency with the base's primitive surface. + +Do not assume either direction; decide deliberately. If kept as-is, a one-line code +comment recording the choice (richer messages over primitive uniformity) would stop a +future reader from "tidying" the validator into the primitives and silently dropping +the consequence text. + +**Resolution** + +Deferred 2026-06-20: this is a deliberate trade-off, not a clear win — the hand-rolled `RequireThat` messages carry design-doc citations and cluster-fatal consequence text that the shared base primitives (`MinCount`/`OneOf`/`PositiveTimeSpan`) cannot, and the peer `AuditLogOptionsValidator` follows the same style. Adopting the primitives would prioritize cross-validator wording consistency over those richer safety-critical messages — a deliberate decision left to the owner. No change this pass. diff --git a/code-reviews/Communication/findings.md b/code-reviews/Communication/findings.md index 361cbd1b..3b866bc3 100644 --- a/code-reviews/Communication/findings.md +++ b/code-reviews/Communication/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.Communication` | | Design doc | `docs/requirements/Component-Communication.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 2 | +| Commit reviewed | `4307c381` | +| Open findings | 0 | ## Summary @@ -68,6 +68,41 @@ keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse (Communication-022). Seven new findings, all Open: one High, one Medium, five Low. +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +All prior findings (Communication-001..022) remain closed (Resolved, or Resolved-by-Comm-016). +This module was renamed `ScadaLink → ZB.MOM.WW.ScadaBridge` since `1eb6e97`, so the +`git diff 1eb6e97 HEAD` shows the whole module as added; the review re-examined the +full current state across all 10 categories, with focus on the surface added since the +last review — the four new gRPC RPCs on `SiteStreamService` (`IngestAuditEvents`, +`IngestCachedTelemetry`, `PullAuditEvents`, `PullSiteCalls`) and the M7 command/control +forwards (`SearchAddressSpaceCommand`, `WriteTagRequest`, `VerifyEndpointCommand`, +cert-trust commands). The module is healthy: actor state stays on the actor thread, +`PipeTo`/`Forward`/Sender-capture are clean, gRPC channel/stream/CTS lifecycles are +disciplined, the duplicate-stream and Subscribe-throws cleanup paths are correct, and +the gRPC ingest/pull handlers degrade best-effort (empty ack / empty response, rows stay +Pending) on fault. The only new findings are **design-doc drift**: the doc still claims +`SiteStreamService` has "a single RPC `SubscribeInstance`" and that audit/telemetry +reconciliation rides ClusterClient, when in fact four RPCs now exist and the +reconciliation *pull* runs over gRPC (Communication-023); and the AlarmStateUpdate field +table stops at field 21 while the proto/code carry fields 22–23 (Communication-024). Two +new findings, both Open: zero Critical/High, zero Medium-code, one Medium-doc, one Low. + +## Checklist coverage 2026-06-20 + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | `SubscribeInstance` duplicate-removal + Subscribe-throws cleanup (Comm-021) correct; telemetry open/close gauge is balanced (counted only after Subscribe succeeds, decremented in finally). `DebugStreamBridgeActor` stream-first buffer/dedup sound. No new issues. | +| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies present on both actors; `Forward` preserves Sender on every site relay; `SiteAddressCacheLoaded` now read-only-typed (Comm-020). gRPC ingest handlers capture Sender before `PipeTo`. No new issues. | +| 3 | Concurrency & thread safety | ✓ | `_activeStreams`/`_subscriptions` use ownership-preserving `TryRemove(KeyValuePair)`; `GetOrCreate` uses `AddOrUpdate` atomically; bridge state mutated only on the actor thread via `Self.Tell`. No new issues. | +| 4 | Error handling & resilience | ✓ | gRPC `Ingest*`/`Pull*` degrade to empty ack/response on fault (rows stay Pending) — divergent from the ClusterClient path's `Status.Failure` pipe but behaviourally equivalent (both → Pending+retry), and explicitly documented. `LoadSiteAddressesFromDb` lifecycle-CT bounded (Comm-019). No new issues. | +| 5 | Security | ✓ | `correlation_id` validated against `ActorPath.IsValidPathElement` before actor naming (Comm-014). `Pull*` reject `batch_size <= 0` with `InvalidArgument`. Secrets ride encrypted blocks, not this layer. No new issues. | +| 6 | Performance & resource management | ✓ | gRPC channel disposed synchronously (Comm-007); per-stream CTS `CancelAfter(GrpcMaxStreamLifetime)`; bounded `Channel` (1000, DropOldest). `RemoveSiteAsync` still test-only but Comm-013 documents that endpoint change is handled by `GetOrCreate`. No new issues. | +| 7 | Design-document adherence | ✓ | Doc claims a *single* RPC and ClusterClient-only reconciliation; reality is 4 RPCs + gRPC reconciliation pulls (Communication-023). AlarmStateUpdate field table stops at 21; proto/code carry 22–23 (Communication-024). | +| 8 | Code organization & conventions | ✓ | DTO mappers correctly hosted in Communication (owns generated DTOs); proto evolution is additive (field numbers 22–23 appended, never reused); Options pattern intact. No new issues. | +| 9 | Testing coverage | ✓ | `ProtoContractTests` guards oneof variants; `StreamRelayActorTests` + `ConvertToDomainEvent` cover enriched fields 8–23 end-to-end; dedicated proto tests cover the audit/telemetry/pull DTOs. `ProtoRoundtripTests.AlarmStateUpdate_RoundTrip` only asserts fields 1–5, but the gap is covered elsewhere — no separate finding. | +| 10 | Documentation & comments | ✓ | In-code XML docs are thorough and accurate (the prior-finding comments are precise). The drift is in the design doc only (Communication-023/024), not the code comments. | + ## Checklist coverage 2026-05-28 | # | Category | Examined | Notes | @@ -1120,3 +1155,102 @@ The `_debugSubscriptions` dictionary, `TrackMessageForCleanup` helper, and the `HandleConnectionStateChanged` handler that consumed them were all deleted as part of Comm-016's dead-code purge. There is no longer any caller-supplied correlation-id keyed map to overwrite — the orphan-on-reuse hazard is gone. + +--- + +### Communication-023 — Design doc describes `SiteStreamService` as a single RPC and reconciliation as ClusterClient-only; both are now stale + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Resolved | +| Location | `docs/requirements/Component-Communication.md:85`, `docs/requirements/Component-Communication.md:160-166`, `src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto:8-14` | + +**Description** + +The design doc's "gRPC Proto Definition" section states: *"**Service**: `SiteStreamService` +with **a single RPC** `SubscribeInstance(InstanceStreamRequest) returns (stream +SiteStreamEvent)`."* (line 85). The proto and the generated service now expose **five** +RPCs — `SubscribeInstance`, `IngestAuditEvents`, `IngestCachedTelemetry`, +`PullAuditEvents`, and `PullSiteCalls` (`sitestream.proto:8-14`, +`SiteStreamGrpcServer.cs:211/330/386/436/519`). The four new RPCs are not mentioned +anywhere in the component doc — neither the proto section, the responsibilities, the +topology diagram, nor the Dependencies (the server now also depends on `ISiteAuditQueue` +and `IOperationTrackingStore`). + +This drift also contradicts two transport statements elsewhere in the same doc. The +"Cached Call Telemetry" section (line 166) and the "Notification Submission" section +(line 158) both assert **"Transport: ClusterClient"**, and the reconciliation-pull +paragraph (line 164) describes `CachedCallReconcileRequest`/`CachedCallReconcileResponse`. +In the shipped code the reconciliation **pull** runs over **gRPC**, not ClusterClient: +`GrpcPullAuditEventsClient` / `GrpcPullSiteCallsClient` +(`src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/`) call the `PullAuditEvents` / `PullSiteCalls` +RPCs against `SiteStreamGrpcServer`. So the gRPC channel — documented as carrying +**real-time data only** ("`SiteCommunicationActor` and `CentralCommunicationActor` have no +role in streaming event delivery", "gRPC server-streaming for real-time data") — now also +carries site→central audit/telemetry **ingest** and central→site **reconciliation pulls**. +A reader relying on the doc to understand the two-transport split (the central CLAUDE.md +"Data & Communication" invariant) would be materially misled about what flows over which +transport. + +**Recommendation** + +Update `Component-Communication.md`: (1) replace "a single RPC" with the full RPC list +and document each new RPC's direction, payload, and idempotency; (2) reconcile the +"Transport: ClusterClient" / `CachedCallReconcile*` wording in §10 (and the Site Call +Audit / Audit Log interaction notes) with the gRPC `PullAuditEvents` / `PullSiteCalls` +reality; (3) add `ISiteAuditQueue` / `IOperationTrackingStore` to the SiteStreamGrpcServer +description and Dependencies. If the doc's "gRPC = real-time data only" framing is meant +to be a hard invariant, state explicitly that audit/telemetry ingest and reconciliation +are the documented exceptions. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): the design doc (Component-Communication.md) now +enumerates all five `SiteStreamService` RPCs (the one server-streaming `SubscribeInstance` +plus the unary `IngestAuditEvents`/`IngestCachedTelemetry`/`PullAuditEvents`/`PullSiteCalls`), +reconciles the §10 reconciliation wording with the gRPC-pull reality (push telemetry stays +ClusterClient; reconciliation pulls are gRPC request/response), and adds `ISiteAuditQueue`/`IOperationTrackingStore` +to Dependencies. Doc-only. + +--- + +### Communication-024 — AlarmStateUpdate proto carries fields 22–23 but the design-doc field table stops at 21 + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Resolved | +| Location | `docs/requirements/Component-Communication.md:92-114`, `src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto:87-88` | + +**Description** + +The "Enriched AlarmStateUpdate" section states the message "was extended **additively**: +existing fields 1–7 are unchanged, and **fields 8–21** carry the enriched native-alarm +state" (line 92), and the field table enumerates fields 8–21 (lines 94-109). The proto +now also defines **field 22 `native_source_canonical_name`** and **field 23 +`is_configured_placeholder`** (`sitestream.proto:87-88`), both populated on the server +(`StreamRelayActor.HandleAlarmStateChanged`, lines 97-98) and read back on the client +(`SiteStreamGrpcClient.ConvertToDomainEvent`, lines 257-258). The doc's field table and +its "fields 8–21" / "repopulated from fields 8–21" prose (lines 92, 112) are stale — they +omit fields 22–23. `is_configured_placeholder` in particular is load-bearing: it is the +discriminator `StreamRelayActor` uses to drop snapshot-only placeholder rows before +packing the live stream (lines 63-66), a behaviour the doc does not capture. + +The proto evolution itself is correct (additive, field numbers appended and never reused), +so this is purely doc staleness, not a contract regression. + +**Recommendation** + +Extend the AlarmStateUpdate field table to fields 22–23, update the "fields 8–21" prose to +"8–23" in both the intro and the client-mapping bullet, and add a one-line note that +`is_configured_placeholder` rows are dropped at the relay (never relayed to the live gRPC +stream). + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): the `AlarmStateUpdate` field table was extended to +fields 22–23 (`native_source_canonical_name`, `is_configured_placeholder`) with a note that +placeholder rows are dropped at the relay. Doc-only. diff --git a/code-reviews/DataConnectionLayer/findings.md b/code-reviews/DataConnectionLayer/findings.md index 60785634..9eddae7f 100644 --- a/code-reviews/DataConnectionLayer/findings.md +++ b/code-reviews/DataConnectionLayer/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer` | | Design doc | `docs/requirements/Component-DataConnectionLayer.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 5 | +| Commit reviewed | `4307c381` | +| Open findings | 0 | ## Summary @@ -116,6 +116,37 @@ DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from ## Findings +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +The 2026-06-20 full re-review walked all 10 checklist categories against the current +source, focused on the M7 native-alarm subscribe path, the OPC UA node-browser +(BrowseNext/search/type-info), and the verify-endpoint + site-local cert-trust surface. +The M7 surface is well-built overall — the verify-endpoint probe correctly *captures +but never trusts* an untrusted server certificate, the per-node `CertStoreActor` +broadcast keeps both site nodes' PKI stores consistent across failover, and the +node-browser paging/search/type-info additions are clean. All 22 prior findings remain +`Resolved` and their fixes were verified in place. The review found **4 new findings**, +all clustering on the native-alarm subscribe path, which did **not** inherit the +guards the tag-subscribe path earned across DCL-018 / DCL-021 / DCL-022: the alarm +path leaks its adapter feed on a mid-flight unsubscribe (no DCL-021-style obsolete- +completion guard), has no alarm-resolution retry to match the tag path's +`tag-resolution-retry` timer, leaks `_alarmCts` on `DisposeAsync` (only +`DisconnectAsync` tears it down), and carries a stale "first subscriber wins" comment +on a last-wins filter assignment. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | x | No new issues; DCL-016/020/021 counter/response fixes verified. Alarm-path snapshot atomic-swap + per-source prefix routing are correct. | +| 2 | Akka.NET conventions | x | No new issues; DCL-022 `IsTimerActive` gate verified. Alarm subscribe uses `ContinueWith(...).PipeTo(Self)` with `Self`/generation captured (no `Sender`/`this` in closures). | +| 3 | Concurrency & thread safety | x | Finding 023 — native-alarm mid-flight unsubscribe leaks the adapter alarm feed (the tag path's DCL-021 obsolete-completion guard was not mirrored on the alarm path). | +| 4 | Error handling & resilience | x | Finding 024 — no alarm-resolution retry: a failed initial alarm subscribe strands the subscriber with no feed until the next full reconnect (the tag path retries on `tag-resolution-retry`). | +| 5 | Security | x | No new issues. The M7 verify-endpoint probe builds a temporary `RealOpcUaClient`, captures the untrusted server cert, and NEVER trusts it; cert trust is gated through `CertStoreActor` broadcast to both nodes. DCL-012/014 auto-accept warning + Commons default remain out-of-scope follow-ups. | +| 6 | Performance & resource management | x | Finding 025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts` (alarm task + CTS leak on failover/stop); `DisconnectAsync` already cancels+disposes it under `_alarmLock`. | +| 7 | Design-document adherence | x | No new issues; M7 native-alarm read-only mirror, `AlarmKind` discriminator, and per-source `IAlarmSubscribableConnection` feed match the design. DCL-009 doc action still open at doc level (out of scope). | +| 8 | Code organization & conventions | x | No issues — alarm messages/POCOs in Commons, `IAlarmSubscribableConnection` capability seam in the module, options class owned by component. | +| 9 | Testing coverage | x | DCL001–022 regression tests present. Gaps for findings 023 (alarm unsubscribe mid-flight), 024 (failed alarm subscribe → no retry), 025 (`DisposeAsync` alarm-CTS leak). | +| 10 | Documentation & comments | x | Finding 028 — the `_alarmSourceFilter` "first subscriber wins" XML comment contradicts the last-wins assignment in `HandleSubscribeAlarms`. | + ### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread | | | @@ -1267,3 +1298,209 @@ calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply same gate at both call sites. Add a regression test that fires 5 subscribes with unresolved tags within one retry interval and asserts the retry fires at most one interval after the first failure (not after the fifth subscribe). + +### DataConnectionLayer-023 — Native-alarm mid-flight unsubscribe leaks the adapter alarm feed + +| | | +|--|--| +| Severity | High | +| Category | Concurrency & thread safety | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1773,1790-1809,1864-1882` | + +**Description** + +The native-alarm subscribe path mirrors the tag-subscribe path's +dispatch-I/O-then-`PipeTo(Self)` shape but never inherited the obsolete-completion +guard the tag path earned in DCL-021. `HandleSubscribeAlarms` adds the source to +`_alarmSubscribesInFlight` (line 1773) and dispatches +`alarmable.SubscribeAlarmsAsync(...)` whose `ContinueWith` pipes an +`AlarmSubscribeCompleted` back to `Self`. If an `UnsubscribeAlarmsRequest` for the +last (or only) subscriber is processed on the actor thread between that dispatch and +the completion, `HandleUnsubscribeAlarms` (lines 1864-1882) removes the subscriber, +empties `_alarmSourceSubscribers`, and removes the filter entries — but it tries to +tear down the adapter feed via `_alarmSubscriptionIds.Remove(...)`, which **fails** +because the subscription id has not been stored yet (it is still in flight). It also +leaves the stale `_alarmSubscribesInFlight` entry in place. The late +`AlarmSubscribeCompleted` then reaches `HandleAlarmSubscribeCompleted` (lines +1790-1809), which **unconditionally** stores `_alarmSubscriptionIds[msg.SourceReference] += msg.SubscriptionId` (line 1796) without re-checking whether any subscriber still +exists in `_alarmSourceSubscribers`. + +The net result is an orphaned adapter alarm feed: the OPC UA monitored-item set / the +MxGateway gateway-wide alarm stream stays alive for a source with **zero** subscribers, +streaming transitions into `HandleAlarmTransitionReceived` that match no subscriber set +and are silently dropped — the feed is never torn down for the lifetime of the adapter +(only a full `ReSubscribeAllAlarms` on reconnect clears `_alarmSubscriptionIds`, and +even that re-subscribes from `_alarmSourceSubscribers`, which no longer contains the +orphaned source, so the stale device-side feed is simply abandoned, not closed). The +tag path closes exactly this race in `HandleSubscribeCompleted` (lines 834-867) by +releasing the just-created adapter handle and clearing `_subscribesInFlight` when the +instance entry has gone; the alarm path does neither. This matters because native-alarm +sources are created/destroyed on every deploy/undeploy and instance stop, so each +mid-flight unsubscribe permanently leaks one device-side alarm subscription and its +publish traffic. + +**Recommendation** + +Mirror the closed DCL-021 fix on the alarm path. In `HandleAlarmSubscribeCompleted`, +before storing the id, guard on the live subscriber set: `if +(!_alarmSourceSubscribers.ContainsKey(msg.SourceReference)) { +if (msg.Success && msg.SubscriptionId != null && _adapter is +IAlarmSubscribableConnection alarmable) _ = +alarmable.UnsubscribeAlarmsAsync(msg.SubscriptionId); return; }` so a feed that +completed after its last subscriber left is released at the adapter rather than stored. +Additionally clear the `_alarmSubscribesInFlight` entry in `HandleUnsubscribeAlarms` +when the last subscriber leaves, so the in-flight marker does not linger. Add a +regression test that subscribes a source, sends `UnsubscribeAlarmsRequest` while the +alarm subscribe I/O is in flight, completes the subscribe, and asserts +`UnsubscribeAlarmsAsync` is called and `_alarmSubscriptionIds` / +`_alarmSubscribesInFlight` are clean. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): `HandleAlarmSubscribeCompleted` now guards against an orphaned completion (if no live subscriber remains for the source, it tears down the just-created adapter subscription and returns) and `HandleUnsubscribeAlarms` clears the in-flight marker when the last subscriber leaves — mirroring the DCL-021 tag-path fix. Regression test added (verified failing pre-fix). + +### DataConnectionLayer-024 — No alarm-resolution retry: a failed initial alarm subscribe strands the subscriber until the next full reconnect + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1752-1757,1790-1809,1885-1908` | + +**Description** + +When an initial native-alarm subscribe fails, the subscriber is left registered but +with no feed and no path to recovery short of a full connection reconnect. +`HandleSubscribeAlarms` registers the subscriber in `_alarmSourceSubscribers` (lines +1752-1757) **before** issuing the adapter subscribe, so a transition arriving +mid-subscribe is still routed. If `alarmable.SubscribeAlarmsAsync` fails, the piped +`AlarmSubscribeCompleted` reaches `HandleAlarmSubscribeCompleted` (lines 1790-1809), +which on the failure branch only logs a warning (line 1801) and replies a +`SubscribeAlarmsResponse(success: false, ...)`. It does **not** re-arm any retry, does +**not** push `NativeAlarmSourceUnavailable` to the subscriber, and does **not** drop +the source from `_alarmSourceSubscribers`. The subscriber therefore stays in the +routing map with the adapter feed never established. + +Recovery only happens via `ReSubscribeAllAlarms` (lines 1885-1908), which is invoked +solely from `BecomeConnected` after a full connection reconnect (line 583). So a +transient device-side failure on the *initial* alarm subscribe — the alarm server +still booting, a momentary fault — leaves the source dark until the entire connection +happens to cycle through `Reconnecting`. This diverges from the tag path, which arms a +periodic `tag-resolution-retry` timer (lines 989-993, 1678-1682; fired by +`HandleRetryTagResolution` at line 1491) so a tag that fails to resolve at subscribe +time is retried every `TagResolutionRetryInterval` (10 s default) without needing a +reconnect. The native-alarm source has no analogous self-healing. + +This is partly a **design judgment call** rather than a clear-cut defect: the M7 design +models native alarms as a read-only mirror, and "drop the source and signal the +subscriber" is an equally valid contract to "retry the alarm subscribe periodically." +The defect is that the code currently does **neither** — it silently leaves a +registered subscriber with a permanently-dark feed and no signal — which is the worst +of both. + +**Recommendation** + +Pick one of the two valid contracts and implement it (this is a judgment call — do not +assume which is wanted): + +- **Retry:** add an `alarm-resolution-retry` periodic timer (mirroring + `tag-resolution-retry`): track sources whose subscribe failed, re-issue + `SubscribeAlarmsAsync` on the timer, and gate the timer start with + `IsTimerActive` (per the DCL-022 pattern) so a burst of failed subscribes does not + reset it. +- **Fail fast:** on the failure branch of `HandleAlarmSubscribeCompleted`, push + `NativeAlarmSourceUnavailable` to the subscriber(s) for that source and remove the + source from `_alarmSourceSubscribers` / `_alarmSubscribesInFlight`, so the + `NativeAlarmActor` learns the feed is unavailable rather than waiting forever. + +Either way, add a regression test for "initial alarm subscribe fails → subscriber is +not silently left dark." + +**Resolution** + +Deferred 2026-06-20: the fix (add an alarm-resolution retry timer vs. push `NativeAlarmSourceUnavailable` and drop the source on initial-subscribe failure) is a design-owner decision; the current code does neither. Awaiting that decision before implementing. + +### DataConnectionLayer-025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts`, leaking the alarm task on failover/stop + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/MxGatewayDataConnection.cs:277-283,121-134` | + +**Description** + +`MxGatewayDataConnection.DisposeAsync` (lines 277-283) cancels `_eventLoopCts` and +disposes the underlying `_client`, but never touches `_alarmCts`. The alarm feed is +established in `SubscribeAlarmsAsync` (lines 153-175): under `_alarmLock` it lazily +creates `_alarmCts = new CancellationTokenSource()` and launches a long-running +`Task.Run(() => client.RunAlarmStreamAsync(null, ..., token))`. That background alarm +stream is bound only to `_alarmCts.Token`. `DisconnectAsync` (lines 121-134) already +tears it down correctly — under `_alarmLock` it cancels and disposes `_alarmCts`, +nulls it, and resets `_alarmSubCount` — but `DisposeAsync` does not, so the +`CancellationTokenSource` and the alarm-stream `Task` are both leaked whenever the +adapter is disposed without a prior `DisconnectAsync`. + +The `DataConnectionActor` disposes adapters fire-and-forget on failover (and on +actor/connection stop) via `_adapter.DisposeAsync()` without necessarily calling +`DisconnectAsync` first, so every MxGateway failover or connection teardown that goes +through `DisposeAsync` leaks one running alarm-stream task plus its CTS. Severity is +Low because the leaked task is cancellation-bound to a CTS that will be GC-reclaimed +eventually and the gRPC stream it holds will fault when `_client` is disposed, but a +still-running alarm-stream loop racing a disposed client is precisely the class of +dangling background work the lock-guarded `DisconnectAsync` block was written to avoid. + +**Recommendation** + +In `DisposeAsync`, cancel and dispose `_alarmCts` under `_alarmLock` before disposing +the client — copy the block already present in `DisconnectAsync` (lines 124-130): +`lock (_alarmLock) { _alarmCts?.Cancel(); _alarmCts?.Dispose(); _alarmCts = null; +_alarmSubCount = 0; }`. This guarantees the alarm-stream task is cancelled +deterministically on every teardown path, not just the `DisconnectAsync` one. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): `MxGatewayDataConnection.DisposeAsync` now cancels and disposes `_alarmCts` under `_alarmLock`, matching `DisconnectAsync` — no more alarm task/CTS leak on failover/stop. + +### DataConnectionLayer-028 — `_alarmSourceFilter` "first subscriber wins" comment contradicts the last-wins code + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:102-103,1758-1761` | + +**Description** + +The XML doc on the `_alarmSourceFilter` field (lines 102-103) reads "sourceReference → +raw condition filter string passed to the adapter (**first subscriber wins**)." The +code does the opposite: `HandleSubscribeAlarms` **unconditionally** overwrites the +filter on every subscribe — `_alarmSourceFilter[request.SourceReference] = +request.ConditionFilter;` (line 1758) and likewise re-parses +`_alarmSourceFilterPredicate[request.SourceReference] = +AlarmConditionFilter.Parse(request.ConditionFilter)` (line 1761) — with no +"already present" guard. So a **second** subscriber to the same source reference +re-filters the shared feed with its own condition filter, i.e. **last subscriber +wins**, not first. The comment will mislead a maintainer reasoning about which filter +governs a shared alarm feed when two instances subscribe to the same source with +different condition filters, and it understates a real behavioural subtlety (the second +subscriber silently changes the gate applied to the first subscriber's transitions in +`HandleAlarmTransitionReceived`). + +**Recommendation** + +Correct the comment to "last subscriber wins" (and note that the shared feed carries a +single filter, so co-subscribers to one source reference must agree on the condition +filter). If per-subscriber filtering is actually intended, make the predicate +per-subscriber rather than per-source so each subscriber's own filter governs only its +own deliveries — but that is a behaviour change, not a comment fix, and should be +decided explicitly. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): the `_alarmSourceFilter` comment corrected from 'first subscriber wins' to 'last subscriber wins' (and notes co-subscribers share the single filter). diff --git a/code-reviews/DeploymentManager/findings.md b/code-reviews/DeploymentManager/findings.md index 8b4e5c96..f2ce0738 100644 --- a/code-reviews/DeploymentManager/findings.md +++ b/code-reviews/DeploymentManager/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager` | | Design doc | `docs/requirements/Component-DeploymentManager.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | +| Commit reviewed | `4307c381` | | Open findings | 0 | ## Summary @@ -83,8 +83,51 @@ The remaining six findings are medium/low: lifecycle-timeout audit gap (DeploymentManager-023), and shared static state across `*ProbeActor` tests (DeploymentManager-024). +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +Re-reviewed the whole current module at HEAD after the rename, the +cert-broadcast / Transport `IStaleInstanceProbe` work, and milestone changes. +DeploymentManager-001..024 all remain `Resolved` and verified against source — +the ref-counted `OperationLockManager`, the broadened/non-cancellable failure +writes, the `ApplyPostSuccessSideEffectsAsync` shared helper with +`forceEnabledState` (Disabled-preservation), the lifecycle-timeout audit helper, +the structured diff with List-value normalization, the hoisted global artifact +fetch, and the instance-state-aware reconciliation are all present and correct. +The two flagged cross-module/architectural seams the prompt called out — the +`TrustServerCert`/`RemoveServerCert` broadcast-to-both-nodes and the +`DeploymentManagerActor` deploy-state query handler — live in **SiteRuntime / +Communication / CentralUI**, not this module, so they are out of scope here. +This review found **3 new findings**. The material one is DeploymentManager-025: +the system-wide artifact path still fetches and broadcasts **notification lists +and SMTP configurations (including SMTP credentials)** to every site, in direct +contradiction of the now-explicit design decision that these are central-only +and "no SMTP credential is ever distributed to sites" (Component-DeploymentManager.md +lines 142-146; CLAUDE.md notification-central-only decision). This supersedes +the earlier accepted-deployable-artifact framing of the closed +DeploymentManager-013. DeploymentManager-026 (deployment records are insert-only +— a new row per deploy accumulates per instance, contradicting "only current +status stored, no history table", and the same-tick `OrderByDescending(DeployedAt)` +read has no tiebreaker) and DeploymentManager-027 (artifact tests *assert* the +forbidden notif/SMTP shipping, cementing the DeploymentManager-025 violation) +are the remaining two. + ## Checklist coverage +#### Re-review 2026-06-20 (commit `4307c381`) + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | New: deployment records are insert-only — `DeployInstanceAsync` Adds a new row per deploy; reconciliation's `GetCurrentDeploymentStatusAsync` orders by `DeployedAt` with no tiebreaker (DeploymentManager-026). | +| 2 | Akka.NET conventions | ✓ | Module remains a plain service layer; no actors. The deploy-state-query/cert-broadcast actors live in SiteRuntime, out of scope. No issues. | +| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counting + gate re-verified; `DeployToAllSitesAsync` prebuilds per-site commands before the parallel phase (no shared DbContext under `Task.WhenAll`). No issues. | +| 4 | Error handling & resilience | ✓ | Failure-status writes use `CancellationToken.None`; lifecycle timeouts now audit; delete-orphan path surfaced. No new issues. | +| 5 | Security | ✓ | New: SMTP credentials are still serialized into the per-site artifact command and broadcast to every site, which the current design forbids outright (DeploymentManager-025). | +| 6 | Performance & resource management | ✓ | Global artifact queries hoisted (DM-023 resolved). Deployment-record row growth is unbounded per instance (part of DeploymentManager-026). | +| 7 | Design-document adherence | ✓ | New: notification lists + SMTP configs are still treated as deployable artifacts, contradicting the "central-only, never distributed to sites" design (DeploymentManager-025). | +| 8 | Code organization & conventions | ✓ | Options bound via Host; `OptionsSection` constant correct. No new issues. | +| 9 | Testing coverage | ✓ | Broad and current. New: artifact tests assert the forbidden notif/SMTP shipping (DeploymentManager-027). | +| 10 | Documentation & comments | ✓ | `ArtifactDeploymentService` class XML doc still lists notification lists + SMTP as broadcast artifacts (stale vs design — folded into DeploymentManager-025). | + #### Re-review 2026-05-28 (commit `1eb6e97`) | # | Category | Examined | Notes | @@ -1234,3 +1277,164 @@ recipient (an `IActorRef` to a TestKit probe), and assert via `ExpectMsg` in each test. Where the simpler counter shape is preferred, pass a shared-state object into the actor's constructor so each test owns its own instance — never reach for `static` mutable test state. + +### DeploymentManager-025 — Notification lists and SMTP configurations (with credentials) are still broadcast to every site, contradicting the central-only design + +| | | +|--|--| +| Severity | High | +| Category | Design-document adherence | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/ArtifactDeploymentService.cs:13-22,128,130,150-151,181-188` | + +**Description** + +The design now states explicitly, in two authoritative places, that notification +lists and SMTP configuration are **not** deployable artifacts: + +- `docs/requirements/Component-DeploymentManager.md:142-146`: *"Notification lists + and SMTP configuration are **not** deployable artifacts — they are central-only + definitions managed by the Notification Service ... Notification delivery happens + on the central cluster, so **no notification artifact or SMTP credential is ever + distributed to sites**."* +- CLAUDE.md "External Integrations" decision: *"Notification delivery is + central-only ... Notification lists and SMTP config are no longer deployed to + sites; recipient resolution happens at central, at delivery time."* + +`ArtifactDeploymentService` still does the opposite. `FetchGlobalArtifactsAsync` +queries `_notificationRepo.GetAllNotificationListsAsync` and +`GetAllSmtpConfigurationsAsync` (lines 150-151), maps them into +`NotificationListArtifact`s and `SmtpConfigurationArtifact`s — the SMTP artifact +carrying `smtp.Credentials` verbatim (line 188) — and `BuildDeployArtifactsCommandAsync` +places both into the `DeployArtifactsCommand` sent to **every site** (lines 128, +130). The site side persists them: `SiteReplicationActor` (lines 192-201) and +`DeploymentManagerActor` (lines 1383-1419) loop over `command.NotificationLists` +and `command.SmtpConfigurations` and write them to site SQLite via +`SiteNotificationRepository`. + +This is the precise scenario the design says must never happen: SMTP credentials +travel across the inter-cluster transport and land on every site's SQLite. It +supersedes the framing of the now-closed DeploymentManager-013, which accepted +SMTP-as-deployable-artifact as a documented design decision — the design has since +flipped to forbid distribution entirely, so this is a fresh divergence, not the +same finding. The class-level XML doc (lines 13-22) is correspondingly stale: it +still advertises "notification lists ... and SMTP configurations" as artifacts the +service "broadcasts ... to all sites." + +Secondary defect in the same mapping: `NotificationListArtifact` is built from +`nl.Recipients.Where(r => r.EmailAddress is not null)` (line 182), which silently +drops every SMS-only recipient (`PhoneNumber` set, `EmailAddress` null) of a +`NotificationType.Sms` list. Even if list distribution were intended, the SMS +recipient set would be lost — but since lists must not be distributed at all, this +is subsumed by the primary fix. + +**Recommendation** + +Stop fetching, mapping, and shipping notification lists and SMTP configurations +from the artifact deployment path. Drop the `_notificationRepo` queries from +`FetchGlobalArtifactsAsync`, pass `null` (or empty) for the `NotificationLists` +and `SmtpConfigurations` fields of `DeployArtifactsCommand`, and update the class +XML doc to remove both from the artifact list. The message fields can remain on +`DeployArtifactsCommand` for additive compatibility but must never be populated +from central. Coordinate removal of the consuming code in SiteRuntime +(`SiteReplicationActor`, `DeploymentManagerActor`, `SiteNotificationRepository`) +in the same session per the project's "design + code + tests travel together" +rule. Update the contradicting tests (see DeploymentManager-027). + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): central `FetchGlobalArtifactsAsync` no longer queries or ships notification lists / SMTP configs (passes null; `DeployArtifactsCommand` fields kept for contract compatibility), and the site purges any already-persisted notification_lists / smtp_configurations rows (clearing the plaintext SMTP password) on both apply paths — enforcing the central-only delivery design. Verified no site runtime/delivery path reads this config. + +### DeploymentManager-026 — `DeployInstanceAsync` inserts a new deployment record every deploy; per-instance rows accumulate and the "current status" read has no tiebreaker + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager/DeploymentService.cs:215-225`, `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:55-61` | + +**Description** + +`DeployInstanceAsync` always creates a brand-new `DeploymentRecord` and calls +`_repository.AddDeploymentRecordAsync(record, …)` (line 223) — it never reuses or +updates the instance's existing record. There is no update-in-place or +delete-prior path on the deploy flow, so every successful or failed deployment of +an instance leaves its predecessor row behind. Over the life of a central process +— amplified by the bulk "deploy all out-of-date instances" workflow and by repeated +redeploys after timeouts — the `DeploymentRecords` table grows without bound, one +row per deploy attempt per instance. + +This contradicts the design's "Deployment Status Persistence" section +(`Component-DeploymentManager.md:106-109`): *"Only the **current deployment +status** per instance is stored in the configuration database ... No deployment +history table — the audit log (via IAuditService) already captures every +deployment action."* The audit log is the history; the deployment-record table is +supposed to hold only the current status. The implementation instead keeps an +ad-hoc, unindexed history there. + +The accumulation also makes the reconciliation read order-sensitive. +`TryReconcileWithSiteAsync` reads the "prior" record via +`GetCurrentDeploymentStatusAsync`, which is +`OrderByDescending(d => d.DeployedAt).FirstOrDefault()` with **no secondary sort +key** (e.g. `ThenByDescending(d => d.Id)`). `DeployedAt` is a `DateTimeOffset` +stamped with `DateTimeOffset.UtcNow` at record creation; two records inserted +within the same clock tick (rapid redeploy, or a redeploy immediately after a +timed-out attempt) tie on `DeployedAt`, and SQL Server's choice between equal +sort keys is undefined. Reconciliation could then read the *wrong* prior record +(e.g. an older Success instead of the latest stuck InProgress), skipping the +intended site query, or vice-versa. + +**Recommendation** + +Either (a) make the deploy path upsert the instance's single current record +(update-in-place when one exists, insert only on first deploy) so the table holds +exactly one row per instance per the design, or (b) if multiple rows are +deliberately retained, add a deterministic tiebreaker to +`GetCurrentDeploymentStatusAsync` (`OrderByDescending(d => d.DeployedAt) +.ThenByDescending(d => d.Id)`) and document the retention/cleanup story so the +table does not grow unbounded. Option (a) aligns with the design and is preferred. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added a deterministic `.ThenByDescending(d => d.Id)` tiebreaker to `GetCurrentDeploymentStatusAsync` so same-tick deployment records resolve to the newest row. Insert-per-deploy behaviour unchanged (history-vs-upsert remains a separate design question). + +### DeploymentManager-027 — Artifact tests assert that notification lists and SMTP configs are shipped, cementing the DeploymentManager-025 design violation + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Resolved | +| Location | `tests/ZB.MOM.WW.ScadaBridge.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:173-174,200-201` | + +**Description** + +`ArtifactDeploymentServiceTests` asserts, via NSubstitute, that the artifact +deployment path queries the forbidden artifact sets exactly once per deployment: + +``` +await _notificationRepo.Received(1).GetAllNotificationListsAsync(Arg.Any()); +await _notificationRepo.Received(1).GetAllSmtpConfigurationsAsync(Arg.Any()); +``` + +(both in the `DeployToAllSitesAsync` global-query-hoisting test at 173-174 and the +`RetryForSiteAsync` test at 200-201). These assertions pin the exact behaviour the +current design forbids (DeploymentManager-025): they will keep the service shipping +notification lists and SMTP configs to sites and will actively block the fix — +removing the queries makes these `Received(1)` assertions fail. Tests that lock in +a design violation are worse than no test, because they make the correct change +look like a regression. + +**Recommendation** + +When DeploymentManager-025 is fixed, change these to +`DidNotReceive().GetAllNotificationListsAsync(...)` / +`DidNotReceive().GetAllSmtpConfigurationsAsync(...)` (and assert the +`DeployArtifactsCommand`'s `NotificationLists` / `SmtpConfigurations` fields are +null/empty) so the tests enforce the central-only design instead of contradicting +it. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): the artifact tests no longer assert the forbidden notification/SMTP shipping — flipped `Received(1)` → `DidNotReceive()` and added assertions that the shipped command's NotificationLists/SmtpConfigurations are null. diff --git a/code-reviews/ExternalSystemGateway/findings.md b/code-reviews/ExternalSystemGateway/findings.md index 5d2a80af..6bcf263f 100644 --- a/code-reviews/ExternalSystemGateway/findings.md +++ b/code-reviews/ExternalSystemGateway/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway` | | Design doc | `docs/requirements/Component-ExternalSystemGateway.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | +| Commit reviewed | `4307c381` | | Open findings | 0 | ## Summary @@ -81,6 +81,42 @@ design-doc drift). Theme: every new finding is in a code path that was added or touched by the earlier fix bundle but whose error-propagation contract was not verified end-to-end against the S&F engine or the design doc. +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +All twenty-three prior findings (001–023) remain `Resolved`; spot-checks against the +current source confirm the fixes still hold. Since `1eb6e97` the module gained a new +SQL error-classification layer (`SqlErrorClassifier` + `TransientDatabaseException` / +`PermanentDatabaseException`, commits `d0527064` / `de375ff7`) and `DatabaseGateway.CachedWriteAsync` +was reshaped to attempt the write immediately and classify the outcome (transient → +buffer, permanent → synchronous `Failed`, mirroring `CachedCallAsync`); the `ScadaLink → ZB.MOM.WW.ScadaBridge` +rename also landed. The full 10-category checklist was re-walked and surfaced **three new +findings**, none Critical/High. The most serious (`ExternalSystemGateway-024`, Medium) is +that `SqlErrorClassifier.IsTransient(Exception)` classifies *every* `InvalidOperationException` +(and `TimeoutException`) as transient, so a DB-layer authoring/driver-misuse bug that surfaces +as `InvalidOperationException` is silently buffered and retried instead of propagating — +contradicting the classifier's own stated "authoring bugs must propagate" contract and the +HTTP-path symmetry it claims. `-025` (Low) is a caller-cancellation that the SQL driver +raises as a `SqlException` (mid-flight cancel) being reclassified as a *permanent* DB error +rather than propagating the cancellation (a narrow asymmetry with the `-008` contract; untested). +`-026` (Low) is a mis-ordered numbered-label comment block in `ExecuteWriteAsync`. Theme: +the new SQL-classification seam faithfully mirrors the HTTP path's *shape* but is more +permissive on the non-typed-exception branch than the HTTP path it cites as its model. + +_Re-review (2026-06-20, `4307c381`):_ + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | `BuildUrl` / `JsonElementToParameterValue` precision cascade / verb validation all still correct (006/017/020/022). New: caller-cancel surfacing as `SqlException` is misclassified permanent — finding 025. | +| 2 | Akka.NET conventions | ☑ | Still no actors; `AddExternalSystemGatewayActors` remains a no-op. Cached-call/write lifecycle + audit emission live in SiteRuntime/AuditLog, correct boundary. No issues. | +| 3 | Concurrency & thread safety | ☑ | Services stateless and DI-scoped; new `ExecuteWriteAsync`/`RunSqlAsync`/`SqlErrorClassifier` seams introduce no shared mutable state (static `HashSet` is read-only). No findings. | +| 4 | Error handling & resilience | ☑ | New SQL transient/permanent split is the headline area. `SqlErrorClassifier.IsTransient(Exception)` is over-broad — all `InvalidOperationException`/`TimeoutException` are transient, sweeping authoring bugs into retry — finding 024. Number-based set (024 aside) is sound; unknown→permanent fail-fast is correct. | +| 5 | Security | ☑ | `ExecuteWriteAsync`/`SqlErrorClassifier` use the connection NAME, never the connection string, in error text (verified — credentials not leaked). `ApplyAuth` secrets still never logged. SQL `CommandText` is script-authored by design (parameterised values bound separately). No new findings. | +| 6 | Performance & resource management | ☑ | `RunSqlAsync` uses `await using`/`using` for connection + command. `EmptyParameters` reused. `client.Timeout` set per rented client (correct — `IHttpClientFactory` clients are per-call). No new findings. | +| 7 | Design-document adherence | ☑ | Updated doc (lines 65–66) now describes the immediate-attempt-then-buffer cached-write model the code implements — in sync. PATCH reconciled (023). No new drift. | +| 8 | Code organization & conventions | ☑ | `Transient`/`PermanentDatabaseException` are component-local exception types (parallel to the HTTP ones); acceptable. `SqlErrorClassifier` is a static policy class in the component. No new findings. | +| 9 | Testing coverage | ☑ | New SQL-classification paths well covered (immediate + retry, transient/permanent/non-Sql/cancel/unexpected). Gaps: caller-cancel-as-`SqlException` (025) and the `InvalidOperationException`-authoring-bug case (024) are untested — recorded under those findings. | +| 10 | Documentation & comments | ☑ | XML docs accurate and thorough. The numbered case labels in `ExecuteWriteAsync`'s comment block are mis-ordered vs the catch sequence — finding 026. | + ## Checklist coverage | # | Category | Examined | Notes | @@ -1312,3 +1348,162 @@ code travel together" rule: - If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it. The design-doc list then remains the single source of truth. + +### ExternalSystemGateway-024 — `SqlErrorClassifier.IsTransient(Exception)` classifies all `InvalidOperationException`/`TimeoutException` as transient, so DB authoring/driver-misuse bugs are buffered and retried instead of propagating + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:123-142`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:346-355` | + +**Description** + +`SqlErrorClassifier.IsTransient(Exception)` is a pure exception-type test: + +```csharp +return exception is InvalidOperationException + or IOException + or SocketException + or TimeoutException + or TaskCanceledException + or DbException; +``` + +`DatabaseGateway.ExecuteWriteAsync` uses it as the third ordered catch +(`catch (Exception ex) when (SqlErrorClassifier.IsTransient(ex))`) to decide whether a +write that did *not* surface as a `SqlException` should be buffered and retried. The +type's own XML doc justifies the set as "a live DB outage does not always surface as a +`SqlException`" and explicitly contrasts it with authoring bugs: *"Authoring bugs +(`ArgumentException`, `NullReferenceException`, etc.) are loud, fixable failures — +silently buffering and retrying them forever would hide the bug."* + +The problem is that `InvalidOperationException` and `TimeoutException` are exactly the +two most common shapes of an authoring/driver-misuse bug on the ADO.NET write path, and +both are now swept into the transient set: + +1. `Microsoft.Data.SqlClient` throws `InvalidOperationException` for a long list of + *programming* errors that are NOT outages and will fail identically on every retry — + e.g. a `SqlParameter` already belongs to another `SqlParameterCollection`, the + command's `Connection` property is not set, an open `SqlDataReader` already exists on + the connection, or `CommandText` is invalid for the call. The doc treats + `InvalidOperationException` as "the connection is not open / pooled connection + broken", but the type carries no such discrimination — any of those authoring + conditions is classified transient and buffered. +2. A bare `TimeoutException` (or any of these) thrown by a custom/wrapped provider for a + non-outage reason is likewise classified transient. + +Consequence: a `CachedWrite` whose payload triggers one of these deterministic +programming errors is returned to the script as `WasBuffered: true` (a lie — it will +never succeed), then re-attempted on every S&F sweep, failing identically each time, +until `RetryCount >= DefaultMaxRetries` and it is finally parked — a noisy retry loop +hiding a fixable bug, which is the exact outcome the doc says the design avoids. It is +also asymmetric with the HTTP path it cites as its model: `ErrorClassifier.IsTransient(Exception)` +(ExternalSystemClient) deliberately does NOT catch `InvalidOperationException`, so the +same class of bug propagates on the API side. + +**Recommendation** + +Narrow the non-`SqlException` transient set to genuine transport/outage shapes. Drop +the broad `InvalidOperationException` arm — or, if connection-state errors must stay +transient, narrow it (e.g. only when the message indicates a broken/closed connection, +or only the `DbException`/`SocketException`/`IOException`/timeout shapes) so that +ADO.NET programming `InvalidOperationException`s propagate as the design intends. +Mirror the HTTP path's narrower `ErrorClassifier.IsTransient(Exception)` set unless +there is a documented reason the SQL path must be broader. Add a regression test that +throws a programming-shaped `InvalidOperationException` (e.g. "The SqlParameter is +already contained by another SqlParameterCollection.") from `RunSqlAsync` and asserts +it propagates (not buffered), companion to the existing +`CachedWrite_UnexpectedException_Propagates_NotClassifiedTransient`. + +**Resolution** + +Deferred 2026-06-20: narrowing the over-broad SQL transient set (currently all `InvalidOperationException`/`TimeoutException` are treated transient) to match the HTTP path is a classification-policy decision for the owner; recommended narrowing recorded. No data loss today (bounded noisy retry returning a false WasBuffered), so deferred rather than forced. + +### ExternalSystemGateway-025 — Caller-token cancellation surfacing from the SQL driver as a `SqlException` is reclassified as a permanent DB error instead of propagating + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:329-345`, `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/SqlErrorClassifier.cs:158-174` | + +**Description** + +`ExecuteWriteAsync`'s ordered catches handle caller cancellation only via +`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`. +That covers the common case where the driver observes the token before/around the +network call and raises `OperationCanceledException`/`TaskCanceledException`. However, +when `Microsoft.Data.SqlClient` is cancelled *mid-statement* by the caller's token it +can instead raise a **`SqlException`** (historically error number `0` / message +"Operation cancelled by user", and on some paths `3980` "the request failed to run +because the batch is aborted"). Such a `SqlException` does not derive from +`OperationCanceledException`, so the first filter is skipped; it falls into +`catch (SqlException ex) { throw SqlErrorClassifier.Throw(connectionName, ex); }`, and +because `0`/`3980` are not in the transient set, `SqlErrorClassifier.Throw` raises a +`PermanentDatabaseException`. + +The caller-requested cancellation is therefore reported to the script as a *permanent +database failure* (`ExternalCallResult.Success == false`, "Permanent database error: …") +rather than propagating as an `OperationCanceledException`. This contradicts the +`ExternalSystemGateway-008` contract ("the caller asked to abandon the work — do not +reclassify") that the comment block at `DatabaseGateway.cs:333-337` explicitly cites, +and it is the SQL-path analogue of the cancellation/timeout conflation that `-008` +fixed for HTTP. The existing `CachedWrite_CancellationRequested_PropagatesOperationCanceled_NotReclassified` +test only exercises the `OperationCanceledException` shape, so the `SqlException` +cancellation path is uncovered. + +**Recommendation** + +Before classifying a `SqlException`, re-check `cancellationToken.IsCancellationRequested` +and, if set, rethrow as `OperationCanceledException(cancellationToken)` (or call +`cancellationToken.ThrowIfCancellationRequested()`), so a caller-initiated cancel +propagates regardless of whether the driver surfaced it as `OperationCanceledException` +or as a cancellation-shaped `SqlException`. Add a regression test that cancels the +caller token and throws a cancellation-shaped `SqlException` (or, since `SqlException` +has no public constructor, drive it through `RunSqlAsync` with a token-cancelled +`OperationCanceledException` wrapped to assert the guard) and asserts an +`OperationCanceledException` propagates and nothing is buffered. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): `ExecuteWriteAsync` now calls `cancellationToken.ThrowIfCancellationRequested()` at the top of the `catch (SqlException)` block (before `SqlErrorClassifier.Throw`), so a caller-token cancel that surfaces as a SqlException propagates as `OperationCanceledException` regardless of the driver's exception shape (version-independent). Test added (verified failing pre-fix). + +### ExternalSystemGateway-026 — `ExecuteWriteAsync` comment-block case labels are mis-ordered relative to the catch sequence + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway/DatabaseGateway.cs:316-355` | + +**Description** + +The header comment for `ExecuteWriteAsync` (lines 316-328) numbers the classification +cases 1–4 in catch order: `1.` caller-requested cancellation propagates unchanged; `2.` +a `SqlException` is classified by number; `3.` a non-`SqlException` transport/connection +failure is transient; `4.` unexpected exceptions propagate. The inline labels on the +catch bodies, however, are inverted: the cancellation catch (line 333) is annotated +`// [2]` and the non-`SqlException` transient catch (line 348) is annotated `// [1]`. +The labels also do not match the parallel labelling in the test file, where `[1]` is +used for the non-Sql-outage case and `[2]` for cancellation — so the source comment and +the tests disagree with the source's own header list. + +This is cosmetic but actively misleading: a reader cross-referencing the numbered +rationale in the header against the `[n]`-tagged catch bodies will map each case to the +wrong rationale, and the discrepancy undermines confidence in an otherwise carefully +documented classification seam. + +**Recommendation** + +Renumber the inline `// [n]` labels to match the header list (cancellation = `[1]`, +non-`SqlException` transient = `[3]`), or drop the bracketed numbers entirely and keep +the prose, so the comment is self-consistent. Align the test-file annotations to +whatever scheme is chosen. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): renumbered the inline `// [n]` case labels in `ExecuteWriteAsync` to match the method header's rationale list (cancellation=[1], SqlException=[2], non-Sql=[3]), and realigned the three corresponding labels in the test file. Comment-only. diff --git a/code-reviews/HealthMonitoring/findings.md b/code-reviews/HealthMonitoring/findings.md index 7a492167..076f1173 100644 --- a/code-reviews/HealthMonitoring/findings.md +++ b/code-reviews/HealthMonitoring/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring` | | Design doc | `docs/requirements/Component-HealthMonitoring.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 2 | +| Commit reviewed | `4307c381` | +| Open findings | 0 | ## Summary @@ -80,6 +80,45 @@ covers (HealthMonitoring-023). All sequence-number and offline-detection arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs monotonic mismatch was observed. +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +All twenty-three prior findings (HealthMonitoring-001..023) remain `Resolved` +and were spot-verified against the current source: the HealthMonitoring-017/018 +"snapshot-then-restore-on-failure" counter logic is in place +(`HealthReportSender.cs:158-176`, `CentralHealthReportLoop.cs:118-134`, +`SiteHealthCollector.AddIntervalCounters`), the HealthMonitoring-020 offline→online +`max(receivedAt, now)` anchor is correct (`CentralHealthAggregator.cs:138-149`), +and the HealthMonitoring-021 `$central` sentinel is honoured by the aggregator and +the Central UI. The full 10-category checklist produced **2 new findings, both Low, +none crash-class** — both confined to the new M6 `SiteHealthKpiSampleSource` +(KPI History sampling) added since the last baseline: the synthetic `$central` +self-report is sampled as a real `KpiScopes.Site` series with meaningless +zero connection/instance/S&F values (HealthMonitoring-024), and 8 of the 12 +emitted Site Health metrics are persisted every minute per site but never read by +any chart while the design doc claims all 12 are "rendered in the dashboard's +per-site KpiTrendChart panel" (HealthMonitoring-025). The KPI-recorder +scope/captive-dependency reasoning is sound (recorder opens a per-pass scope and +the scoped source over a singleton aggregator is the normal lifetime relationship), +the per-source fault isolation is correct, and the `AddIntervalCounters` / +`SetSiteEventLogWriteFailures` default-interface no-ops are an appropriate +test-fake-compatibility seam. No concurrency, security, or sequence-arithmetic +regressions were found. + +_Re-review (2026-06-20, `4307c381`):_ + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | x | Re-verified HealthMonitoring-017/018/020 fixes in place and correct. `SiteHealthKpiSampleSource.CollectAsync` arithmetic (connectionsUp/Down split, summed `sfBufferDepth`, null-`SiteAuditBacklog`→0) is correct. No new logic bug. | +| 2 | Akka.NET conventions | x | Module still contains no actors. `IHealthReportTransport.Send` is fire-and-forget (Tell semantics). KPI source is a plain `IKpiSampleSource` consumed by the KpiHistory recorder singleton (out of scope). No issues found. | +| 3 | Concurrency & thread safety | x | `SiteHealthCollector` counters use `Interlocked`; `AddIntervalCounters` restore is per-field `Interlocked.Add` and correctly sums with concurrent increments against the zero left by `CollectReport`'s `Exchange`. Aggregator CAS pattern unchanged. `SiteEventLogFailureCountReporter` uses a linked CTS + `Task.Run` loop with catch-all probe isolation — sound. No issues found. | +| 4 | Error handling & resilience | x | Counter-restore-on-failure (HM-017/018) confirmed. KPI source is exception-isolated by the recorder per-source. `SiteEventLogFailureCountReporter.SafeProbe` logs and continues. No issues found. | +| 5 | Security | x | No issues found. Numeric/string operational metrics only; no secrets, auth surface, or untrusted-input parsing. | +| 6 | Performance & resource management | x | `PeriodicTimer`/CTS disposed correctly. New: `SiteHealthKpiSampleSource` writes 12 Site-scoped rows per site per minute, of which 8 are never charted (HealthMonitoring-025) — bounded but wasteful; also emits a full row-set for the synthetic `$central` (HealthMonitoring-024). | +| 7 | Design-document adherence | x | Design doc line 119 claims all 12 emitted Site Health metrics are "rendered in the dashboard's per-site KpiTrendChart panel" — only 4 (`connectionsDown`/`scriptErrors`/`sfBufferDepth`/`deadLetters`) are actually charted (HealthMonitoring-025). Synthetic `$central` is sampled as a real Site scope (HealthMonitoring-024). | +| 8 | Code organization & conventions | x | KPI source's metric-name catalog is split: 4 charted metrics share `KpiMetrics.SiteHealth` (Commons), 8 are private literals — noted under HealthMonitoring-025. Options/validator ownership and idempotent registration correct. `AddCentralHealthAggregation` registers the scoped KPI source via `TryAddEnumerable` — correct. | +| 9 | Testing coverage | x | `SiteHealthKpiSampleSourceTests` covers the populated/heartbeat-only/null-backlog/empty paths and exact (metric,value) tuples well. No test asserts `$central` is excluded (it currently is not — HealthMonitoring-024). Otherwise coverage is strong (73 tests). | +| 10 | Documentation & comments | x | XML docs accurate. The only doc drift is design-doc line 119's "all 12 rendered" claim (HealthMonitoring-025). | + ## Checklist coverage | # | Category | Examined | Notes | @@ -1153,3 +1192,110 @@ a non-bug. Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled` (or similar) and update the test body's intent — purely a documentation / maintainability fix; no behaviour change. + +### HealthMonitoring-024 — `SiteHealthKpiSampleSource` samples the synthetic `$central` self-report as a real `KpiScopes.Site` series + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:69-86`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:115-120` | + +**Description** + +`SiteHealthKpiSampleSource.CollectAsync` iterates +`ICentralHealthAggregator.GetAllSiteStates()` and emits the 12-metric catalog +for **every** entry whose `LatestReport` is non-null. The aggregator keyspace +includes the synthetic central self-report under `CentralHealthReportLoop.CentralSiteId` +(`"$central"`), which `CentralHealthReportLoop` feeds via `ProcessReport` every +30s — so `$central` always has a non-null `LatestReport` once the loop has run. +The source therefore writes 12 `KpiSample` rows per minute with +`Scope = KpiScopes.Site`, `ScopeKey = "$central"` for the central cluster. + +The class XML doc, the design doc (`Component-HealthMonitoring.md:119`, +"emits `IKpiSampleSource` (`SiteHealthKpiSampleSource`, per-Site)"), and the +`KpiScopes.Site` scope all describe this as a **per-real-site** series. +`$central` is not a site — its `DataConnectionStatuses`, `StoreAndForwardBufferDepths`, +`ParkedMessageCount`, and instance counts are structurally empty/zero +(the central node runs no DCL, no S&F engine, no instances), so the persisted +`connectionsUp`/`connectionsDown`/`sfBufferDepth`/`parkedMessages`/`deployedInstances` +trend rows for `$central` are meaningless constant-zero noise. The Central UI +trend selector (`Health.razor:570-572`) deliberately pins `$central` first and +queries `KpiScopes.Site`/`scopeKey="$central"`, so the dashboard will plot these +flat-zero "Central Cluster" trends. + +This is consistent (the UI renders central as a card by design) so it is not a +behaviour bug — but it conflates a synthetic non-site with real-site KPI history, +permanently stores meaningless zero series, and contradicts the "per-Site" +contract. None of the other M6 sample sources sample a synthetic scope key. + +**Recommendation** + +Decide whether central-cluster trends belong in the per-Site KPI store. If not, +skip the `CentralHealthReportLoop.CentralSiteId` entry in `CollectAsync` (a single +`if (siteId == CentralHealthReportLoop.CentralSiteId) continue;` guard) and remove +the `$central` option from the trend selector. If central trends are intended, +give them a distinct scope (e.g. a `KpiScopes.Central` scope or a central-specific +metric subset) so the data is not labelled as a real site, and update the design +doc + class XML doc to say so. Either way add a test asserting the chosen +behaviour for `$central`. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): the design doc now documents that the synthetic `$central` self-report is intentionally sampled into the per-site KPI store (the Central UI deliberately pins `$central` first in the trend selector as 'Central Cluster'), and that its zero connection/instance/S&F values reflect the self-report carrying no site-runtime data — not a defect. Doc-only. + +### HealthMonitoring-025 — 8 of the 12 emitted Site Health KPI metrics are never charted; design doc claims all 12 are rendered + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:40-51`, `docs/requirements/Component-HealthMonitoring.md:119` | + +**Description** + +`SiteHealthKpiSampleSource` emits 12 metrics per site per sampling pass: +`connectionsUp`, `connectionsDown`, `scriptErrors`, `alarmEvalErrors`, +`sfBufferDepth`, `deadLetters`, `parkedMessages`, `deployedInstances`, +`enabledInstances`, `disabledInstances`, `auditBacklogPending`, +`eventLogWriteFailures`. Only **four** are ever read back: the Central UI +per-site trend panel calls `LoadTrendSeriesAsync` for exactly +`KpiMetrics.SiteHealth.ConnectionsDown`, `.ScriptErrors`, `.SfBufferDepth`, and +`.DeadLetters` (`Health.razor:596-603`) — and only those four are in the public +`KpiMetrics.SiteHealth` catalog (`Commons/Types/Kpi/KpiMetrics.cs:84-97`). The +other 8 metric names are private string literals in the source and have no +reader anywhere in the codebase. + +Two consequences: + +1. **Stale design doc.** `Component-HealthMonitoring.md:119` lists all 12 metric + names and states they are "rendered in the dashboard's per-site + `KpiTrendChart` panel." Eight of them are never rendered. A reader / future + maintainer is misled about what the trend panel shows. +2. **Persisted-but-dead samples.** The `KpiHistoryRecorderActor` writes all 12 + rows to the central `KpiSample` table every minute for every reporting site; + 8 are pure write-amplification that nothing queries and that the 90-day + purge eventually drops. Bounded, but wasteful — roughly two-thirds of the + Site Health KPI rows are never read. + +A related maintainability smell: the split catalog (4 metrics shared via +`KpiMetrics.SiteHealth`, 8 private literals) means a future chart added for, say, +`alarmEvalErrors` would have to re-type the literal and risk a silent typo +mismatch against the source. + +**Recommendation** + +Reconcile intent in one of two directions: (a) if only the four charted metrics +are wanted, stop emitting the other eight (and correct the design doc's list to +the four actually rendered); or (b) if the full 12-metric history is intended for +future charts / ad-hoc query, promote all 12 names into the public +`KpiMetrics.SiteHealth` catalog so source and any future UI binding key off one +symbol, and either add the missing charts or soften the doc's "rendered … panel" +wording to "recorded for trend history." Add the design-doc fix in the same +session per the repo's doc-and-code-travel-together rule. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): corrected the design doc — of the 12 sampled Site Health KPI metrics only four are charted (connectionsDown, scriptErrors, sfBufferDepth, deadLetters); the other eight are now documented as sampled-but-not-charted (retained for future surfaces / ad-hoc query). Doc-only. diff --git a/code-reviews/Host/findings.md b/code-reviews/Host/findings.md index d0ed36ac..fccbb7f1 100644 --- a/code-reviews/Host/findings.md +++ b/code-reviews/Host/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.Host` | | Design doc | `docs/requirements/Component-Host.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | +| Commit reviewed | `4307c381` | | Open findings | 0 | ## Summary @@ -80,6 +80,33 @@ Serilog is the only logger provider and the section is dead config (Host-021); and `ParseLevel` silently swallows an unrecognised `MinimumLevel` value (e.g. a typo) and falls back to `Information` with no warning (Host-022). +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +All twenty-two prior findings (Host-001..022) remain `Resolved` in the current tree +(Host-008 was deliberately *reverted* in M2.9 — see Host-024). This is the first +review since the `ScadaBridge` rename, so a `git diff 1eb6e97 HEAD` shows the whole +directory as additions and is not a useful delta; the review walked the full current +state plus the per-file commit history. Major changes since `1eb6e97`: REQ-HOST-4a was +re-architected (the leader-only `active-node` check left `/health/ready`; a new +leadership-agnostic `RequiredSingletonsHealthCheck` probes the five required central +singletons via `Identify` — clean), shared `ZB.MOM.WW.Health`/`Telemetry` adoption +(`/healthz`, `/metrics`, OTel), a site HTTP/1.1 `MetricsPort` (8084) listener, the +KpiHistory + Transport central registrations, the disable-login dev flag, and the +library-backed inbound API-key auth. Four new findings, none crash/data-loss class. +Host-023 (Medium) is a regression-of-resolution: the shipped `appsettings.Site.json` +second seed-node now targets `localhost:8084` — the new HTTP/1.1 MetricsPort, not an +Akka remoting endpoint — the same defect class as the resolved Host-004 (which chose +8084 as a "remoting port" before MetricsPort claimed it); `StartupValidator` rejects a +seed on the GrpcPort but not on the MetricsPort. Host-024 (Medium) flags the M2.9 +reversal of Host-008: `MachineDataDb` is again *required* for Central yet is consumed +by no DbContext/repository AND is absent from the shipped `appsettings.Central.json`, +so a developer running the shipped Central binary fails startup for a value nothing +uses. Host-025 (Low): the Host directly consumes KpiHistory (`AddKpiHistory`, +`KpiHistoryRecorderActor`, `KpiHistoryOptions`) but has no direct `ProjectReference` — +it relies on a transitive pull through CentralUI. Host-026 (Low): the Component +Registration Matrix in `Component-Host.md` is stale — KpiHistory and Transport are +registered central-only in `Program.cs` but absent from the matrix. + ## Checklist coverage | # | Category | Examined | Notes | @@ -110,6 +137,21 @@ _Re-review (2026-05-28, `1eb6e97`):_ | 9 | Testing coverage | ☑ | Strong existing suite. No coverage for the Site `CentralContactPoints` second-entry rule (Host-016), the site-shutdown ordering (Host-017), the `NodeName`-absent shipped config (Host-018), the unused `CancellationToken` parameter (Host-019), the `MinimumLevel.Is` override semantics (Host-020) or the `ParseLevel` silent fallback (Host-022). | | 10 | Documentation & comments | ☑ | Re-review: layered `MinimumLevel.Is` / `ReadFrom.Configuration` semantics are not surfaced — an operator-set `Serilog:MinimumLevel` is silently overridden by `ScadaBridge:Logging:MinimumLevel` (Host-020); `ParseLevel` silently coerces a misspelled level to `Information` with no warning (Host-022). | +_Re-review (2026-06-20, `4307c381`):_ + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Re-review: shipped `appsettings.Site.json` second seed-node targets MetricsPort 8084 (HTTP/1.1 Kestrel listener), not an Akka remoting endpoint — regression of the resolved Host-004 fix (Host-023). | +| 2 | Akka.NET conventions | ☑ | All five central singletons + the four site singletons reviewed: ClusterSingletonManager/Proxy pairing, role scoping (site singletons `WithRole(site-*)`, central on base "Central"), `PhaseClusterLeave` graceful-stop drain tasks, receptionist registrations, `RequiredSingletonsHealthCheck` `Identify` probe — all sound. No new issues. | +| 3 | Concurrency & thread safety | ☑ | `GetOrCreateActorSystem` double-checked lock (HOST-021) is correct; `_trackedDisposables` snapshot-under-lock; `RequiredSingletonsHealthCheck` probes concurrently and never throws. No new issues. | +| 4 | Error handling & resilience | ☑ | `StartupRetry` now threads `ApplicationStopping` + transient-only classifier (Host-019/015 resolved); migration retry sound. No new issues. | +| 5 | Security | ☑ | Shipped configs keep `${...}` secret placeholders; pepper fail-fast validated (≥16 chars). Note: docker per-node `appsettings.Central.json` still carry plaintext SQL passwords / JWT key — out of this module's edit scope (Host-003 follow-up), not re-filed. | +| 6 | Performance & resource management | ☑ | ActorSystem singleton-not-transient avoids per-probe disposal; gRPC stream cancel-on-shutdown wired. No new issues. | +| 7 | Design-document adherence | ☑ | Re-review: Component Registration Matrix omits KpiHistory + Transport (both central-only in Program.cs) (Host-026); `MachineDataDb` required-but-unconsumed and absent from shipped Central config (Host-024). | +| 8 | Code organization & conventions | ☑ | Re-review: Host directly consumes KpiHistory types but has no direct `ProjectReference` (transitive via CentralUI) (Host-025). | +| 9 | Testing coverage | ☑ | Strong suite (RequiredSingletons, Hocon, Serilog, StartupRetry covered). Gaps: no test asserts a seed on the MetricsPort is rejected (Host-023 — the existing 8084-seed test only avoids the grpc rule), and no shipped-config test catches the missing `MachineDataDb` key (Host-024). | +| 10 | Documentation & comments | ☑ | Re-review: `Component-Host.md` matrix stale vs `Program.cs` (Host-026). Inline XML/comments otherwise accurate and unusually thorough. | + ## Findings ### Host-001 — `/health/ready` includes the leader-only `active-node` check @@ -1118,3 +1160,163 @@ behaviour you choose. **Resolution** _Open._ + +### Host-023 — Shipped Site `SeedNodes` second entry targets the MetricsPort, not a remoting port + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Site.json:13-17`, `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:117-127` | + +**Description** + +The shipped site config sets `Node:RemotingPort = 8082`, `Node:GrpcPort = 8083`, +`Node:MetricsPort = 8084`, but `Cluster:SeedNodes` is +`["akka.tcp://scadabridge@localhost:8082", "akka.tcp://scadabridge@localhost:8084"]`. +The second seed targets `8084` — which `Program.cs:419-422` binds as the **HTTP/1.1 +Kestrel `/metrics` listener** (`HttpProtocols.Http1AndHttp2`), not an Akka.Remote +endpoint. A node joining via that seed attempts an Akka.Remote TCP association against +the Prometheus scrape listener and fails. This is the exact defect class as the +**resolved Host-004** (where the second seed targeted the gRPC port 8083): Host-004's +fix corrected the seed to `8084` and called it a "remoting port", but the later +introduction of `MetricsPort` (commits `bbc9f092`/`c41cb41c`) claimed `8084`, silently +re-breaking the example. `StartupValidator` (`StartupValidator.cs:121-127`) was +extended by Host-004 to reject a seed whose port equals `GrpcPort`, but it does **not** +reject a seed whose port equals `MetricsPort`, so this misconfiguration passes +validation silently. For the single-node dev loopback the first seed (`8082`) succeeds +and the bug is masked, but it is an incorrect example that copies into multi-node +configs (the docker site configs correctly use distinct hostnames both on `8082`). + +**Recommendation** + +The loopback dev site cannot host two distinct remoting endpoints, so drop the +second seed (mirror Host-016's single-entry `CentralContactPoints` template) or use a +distinct hostname placeholder both on `8082`, with a comment that multi-node sites +list each node's *remoting* port. Extend the Site block of `StartupValidator` to also +reject any seed whose port equals the resolved `MetricsPort` (and ideally any port +the node binds for non-remoting use). Add a `StartupValidatorTests` case +`Site_SeedNodeOnMetricsPort_FailsValidation` mirroring +`Site_SeedNodeOnGrpcPort_FailsValidation`. + +**Resolution** + +Deferred 2026-06-20: the shipped `appsettings.Site.json` second seed-node points at the MetricsPort and the validator lacks a `seedPort != metricsPort` rule; fixing touches shipped dev config and a single-node-loopback assumption. Recorded for a follow-up (no production node ships this dev config). + +### Host-024 — `MachineDataDb` re-required for Central but consumed by nothing and absent from the shipped config + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.Host/StartupValidator.cs:63-65`, `src/ZB.MOM.WW.ScadaBridge.Host/appsettings.Central.json:22-24` | + +**Description** + +Commit `76198b36` ("reverts Host-008") re-added a fail-fast `.Require` for +`ScadaBridge:Database:MachineDataDb` to the Central `.When` block, citing +REQ-HOST-3/REQ-HOST-4 and that the docker per-node configs carry the key. Two problems +remain after the reversal: + +1. **Still consumed by nothing.** A repo-wide search shows `MachineDataDb` is read only + in `StartupValidator.cs` and declared on `DatabaseOptions.cs:11` — no DbContext, no + repository, no component reads it. Only `ConfigurationDb` is wired into + `AddConfigurationDatabase` (`Program.cs:200-202`). The original Host-008 rationale + ("dead configuration that fails startup for a value nothing uses") is therefore + still entirely true; the reversal re-instated the fail-fast without adding a + consumer. +2. **Absent from the shipped Central config.** `appsettings.Central.json` (the template + a developer running the binary directly uses) contains `Database:ConfigurationDb` + (a `${...}` placeholder) but **no `MachineDataDb` key at all**. So a Central node + started from the shipped config fails `StartupValidator` immediately with + "ScadaBridge:Database:MachineDataDb connection string required for Central" — for a + value nothing consumes. The docker configs happen to set it; the shipped default + does not. (This is *not* a re-file of Host-008, which was Resolved-then-reverted; + the new, distinct defect is the validation breaking the shipped dev config.) + +This is also a code-vs-doc inconsistency: the design intent (validate a value the +system needs) is unmet because the value is needed by nothing. + +**Recommendation** + +Pick one and make code, shipped config, and doc agree: either (a) wire a machine-data +store so the value is actually consumed and add a `MachineDataDb` placeholder to the +shipped `appsettings.Central.json` (matching the `_secrets` env-var pattern), or (b) +remove the `.Require`, the `DatabaseOptions.MachineDataDb` property, and the docker +keys again. Add a config-shape regression test asserting the shipped +`appsettings.Central.json` satisfies `StartupValidator` for the Central role (the +absence of such a test is why this slipped through). + +**Resolution** + +Deferred 2026-06-20: `MachineDataDb` is re-required for Central but consumed by nothing and absent from shipped Central config (the shipped Central binary would fail StartupValidator). This is a reversal of the reverted Host-008 and needs a product decision — whether a machine-data store is actually coming (add a config placeholder) or the requirement should be removed again. Recorded for that decision. + +### Host-025 — Host directly consumes KpiHistory but has no direct project reference + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:42-63`, `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:118`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:717-757` | + +**Description** + +`Program.cs` calls `builder.Services.AddKpiHistory(builder.Configuration)` and +`AkkaHostedService.RegisterCentralActors` constructs the `KpiHistoryRecorderActor` +cluster singleton from `KpiHistoryOptions` / `KpiHistoryRecorderActor` +(`ZB.MOM.WW.ScadaBridge.KpiHistory`), yet the Host `.csproj` carries **no +`ProjectReference` to `ZB.MOM.WW.ScadaBridge.KpiHistory`**. It compiles only because +CentralUI (`ZB.MOM.WW.ScadaBridge.CentralUI.csproj`) transitively re-exports it. The +Host is the composition root for every component and the project deliberately declares +direct dependencies elsewhere — the csproj even comments that `ZB.MOM.WW.Theme` is +declared directly "rather than leaning on CentralUI re-exporting it transitively" +(lines 37-39). KpiHistory is the same situation but was not given the same treatment. +If CentralUI ever drops or restructures its KpiHistory reference, the Host fails to +compile for a dependency it directly uses — a latent build fragility. + +**Recommendation** + +Add `` +to the Host csproj's `ProjectReference` group, mirroring the explicit Theme reference +rationale already documented there. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added an explicit `` to the KpiHistory project in the Host csproj, so Host's direct use of KpiHistory types no longer relies on the transitive pull through CentralUI. + +### Host-026 — Component Registration Matrix omits KpiHistory and Transport + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Resolved | +| Location | `docs/requirements/Component-Host.md` (Component Registration Matrix, lines 177-198), `src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:92,118` | + +**Description** + +`Program.cs` registers two central-only components that have shipped since the matrix +was last updated: **Transport** (`builder.Services.AddTransport()`, line 92, #24) and +**KpiHistory** (`builder.Services.AddKpiHistory(...)`, line 118, plus the +`kpi-history-recorder` cluster singleton in `AkkaHostedService`, #26). Neither appears +in the Component Registration Matrix in `Component-Host.md` (lines 177-198), which is +the documented source of which components register on which role. The matrix is +described in CLAUDE.md as something that "must stay in sync with actual component +documents"; it is now stale against the actual composition root. A maintainer auditing +role-based registration against the doc would miss both components (and miss that +KpiHistory's recorder is deliberately *not* readiness-gated — worth a matrix note). + +**Recommendation** + +Add `Transport` (Central: Yes / Site: No / DI: Yes / Actors: No / Endpoints: No) and +`KpiHistory` (Central: Yes / Site: No / DI: Yes / Actors: Yes / Endpoints: No) rows to +the matrix, and note KpiHistory's recorder singleton is intentionally absent from +`RequiredSingletonsHealthCheck` (not readiness-gated). Re-check the Dependencies list +("All 19 component libraries") against the actual csproj reference count while editing. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added `KpiHistory` and `Transport` rows to the Component Registration Matrix in Component-Host.md (both Central=Yes/Site=No/DI=Yes), reflecting their actual central-only registration. diff --git a/code-reviews/InboundAPI/findings.md b/code-reviews/InboundAPI/findings.md index 291b60a5..d7109c81 100644 --- a/code-reviews/InboundAPI/findings.md +++ b/code-reviews/InboundAPI/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.InboundAPI` | | Design doc | `docs/requirements/Component-InboundAPI.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 0 | +| Commit reviewed | `4307c381` | +| Open findings | 4 | ## Summary @@ -124,6 +124,36 @@ configuration database, but the invariant is undocumented.) | 9 | Testing coverage | ☑ | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test (InboundAPI-023); middleware/filter/validator/executor/route are individually covered. | | 10 | Documentation & comments | ☑ | No new issues. | +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +All 25 prior findings remain `Resolved` (InboundAPI-022's `IActiveNodeGate` Host +registration is now present at `Program.cs:260`; InboundAPI-025's POST-only audit +`UseWhen` predicate is live at `Program.cs:337-344`). The module changed materially: +the whole project was renamed (ScadaLink → ZB.MOM.WW.ScadaBridge), inbound auth was +re-architected to the shared `ZB.MOM.WW.Auth.ApiKeys` Bearer/`sbk_` verifier (the SQL +`ApiKey` store, `ApiKeyValidator`, and peppered `IApiKeyHasher` were retired), +`ForbiddenApiChecker` became a thin shim over Script Analysis #25's `ScriptTrustValidator`, +and JSON-Schema `$ref` runtime resolution (`SchemaRefResolver`) plus recursive +Object/List validation+materialization landed. The dominant new finding is that a +**read-only `InboundDatabaseHelper`** was added to `InboundScriptContext.Database` — +reintroducing the very raw-DB-access capability the design doc still explicitly +forbids (regressing the InboundAPI-007 closure), with no read-only enforcement and a +sync-over-async, token-ignoring data path. **4 new findings**: 1 High, 1 Medium, +2 Low — no Critical. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Recursive Materialize/Validate look sound; `WaitForAttribute` deadline-vs-wait-timeout tension (InboundAPI-029). | +| 2 | Akka.NET conventions | ☑ | ASP.NET-hosted; routes via `IInstanceRouter` → `CommunicationService`. Correlation/parent ids stamped on all four `Route.To()` verbs. No issues. | +| 3 | Concurrency & thread safety | ☑ | `ConcurrentDictionary` caches; per-request DI scope created from root in singleton executor (valid). New DB helper blocks a pool thread (InboundAPI-027). | +| 4 | Error handling & resilience | ☑ | DB helper's `ExecuteScalar`/`ExecuteReader` are synchronous and ignore `_ct` — a slow query is unbounded by the method timeout (InboundAPI-027). | +| 5 | Security | ☑ | `InboundDatabaseHelper` runs arbitrary script-supplied SQL (not read-only despite the name) on central with host privileges (InboundAPI-026). Bearer verifier delegates constant-time compare to shared lib. Forbidden-API check delegates to #25. | +| 6 | Performance & resource management | ☑ | Capped capture/known-bad caches; `ArrayPool` rent. Sync-over-async DB calls (InboundAPI-027). | +| 7 | Design-document adherence | ☑ | `Database` helper contradicts the doc's "No direct database access" decision (InboundAPI-026, regresses InboundAPI-007); `WaitForAttribute` timeout binding diverges from spec §6 line 192 (InboundAPI-029). | +| 8 | Code organization & conventions | ☑ | `InboundDatabaseHelper` is a component-local helper (fine); schema types in Commons. No issues. | +| 9 | Testing coverage | ☑ | `InboundDatabaseHelper` tests cover SELECT happy-path only — no write/DDL-rejection test (because none is enforced); `WaitForAttribute`/`Database` not end-to-end via the endpoint (InboundAPI-028). | +| 10 | Documentation & comments | ☑ | Helper XML doc claims "read-only" which is unenforced (folded into InboundAPI-026). | + ## Findings ### InboundAPI-001 — Singleton script handler cache mutated without synchronization @@ -1299,3 +1329,199 @@ route, not on the `/api/` prefix. Options: The endpoint-filter form is the recommended fix — it co-locates the audit-emission scope with the route definition and matches how InboundAPI-006/008 gating is already wired. + +### InboundAPI-026 — `InboundDatabaseHelper` gives inbound scripts arbitrary raw SQL (not read-only); contradicts the design doc's "No direct database access" decision and regresses InboundAPI-007 + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundDatabaseHelper.cs:13-70`, `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundScriptExecutor.cs:267-281`, `:387`; design doc `docs/requirements/Component-InboundAPI.md:202-215` | + +**Description** + +A new `InboundDatabaseHelper` is exposed to every inbound API script as +`InboundScriptContext.Database` (`InboundScriptExecutor.cs:387`), backed by the +production `IDatabaseGateway` (`AddExternalSystemGateway()` in the central-role branch +of `Program.cs:81`). Its `QuerySingle`/`Query` methods take a script-supplied `sql` +string and run it verbatim — `cmd.CommandText = sql; cmd.ExecuteScalar()/ExecuteReader()` +(`InboundDatabaseHelper.cs:27-29`, `:41-44`) — against a raw ADO.NET `DbConnection` +obtained from `IDatabaseGateway.GetConnectionAsync` (`IDatabaseGateway.cs:20`, the same +gateway the External System Gateway uses for SCADA machine-data DBs). Two problems: + +1. **It directly contradicts the design doc.** `Component-InboundAPI.md:202-215` — which + is the *resolution* of finding InboundAPI-007 — states verbatim: *"**No direct + database access.** Inbound API scripts are not given a raw database client. Handing a + script a raw `SqlConnection` is in direct tension with the ScadaBridge script trust + model … Scripts interact with the system only through the curated `Route` and + `Parameters` surfaces … If a method needs data from the configuration or machine-data + databases, that access belongs behind a dedicated, scoped helper — not a + general-purpose connection — and would be added here as an explicit design change."* + The code now ships exactly the capability the doc forbids, and the doc was **not** + updated. This is a regression of the closed InboundAPI-007 (whose chosen resolution + was to remove the DB API and document its absence). Inbound API scripts are authored + by the less-trusted-than-Admin Design role and execute on the central node with the + host process's privileges, so this materially widens the trust boundary the design + deliberately drew. + +2. **"Read-only" is unenforced — it is a comment, not a constraint.** The class summary + says *"Read-only database access"* but nothing restricts the SQL: a script may run + `Database.Query("conn", "DELETE FROM Machine")`, `UPDATE`, `INSERT`, `DROP`, or any + DDL/DML, and `Query`/`QuerySingle` will execute it (`ExecuteReader`/`ExecuteScalar` + happily run non-SELECT statements). The gateway opens a normal read-write connection + (no `ApplicationIntent=ReadOnly`, no statement allow-listing). The "read-only" claim + in the XML doc and the call-site comment (`InboundScriptExecutor.cs:240`, + "read-only Database helper") is therefore false and misleading. + +Parameter *values* are correctly bound (`AddParameters`, `:59-68`) so value-injection is +not the issue — the issue is that arbitrary statement text is script-controlled and the +"read-only" containment does not exist. + +**Recommendation** + +Reconcile code and design, and make the safety claim real. Either (a) if scoped DB +access is now intended, update `Component-InboundAPI.md:202-215` to authorize it as the +explicit design change the prior text demanded, AND enforce read-only — restrict to +`SELECT` (reject any statement whose first significant token is not `SELECT`/`WITH`, or +better, route through a read-intent connection / a curated query API that cannot mutate), +fix the XML summary to describe what is actually enforced, and gate which connection +names a script may reach; or (b) if direct DB access is still out of scope, remove +`InboundDatabaseHelper` and the `Database` member (reverting to the InboundAPI-007 +posture). Add a regression test asserting a non-SELECT statement is rejected once +read-only is enforced (see InboundAPI-028). + +**Resolution** + +_Unresolved._ + +### InboundAPI-027 — `InboundDatabaseHelper` is sync-over-async and ignores the method-deadline token — thread-pool starvation and an unbounded slow query + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/InboundDatabaseHelper.cs:25-33`, `:40-44` | + +**Description** + +`QuerySingle`/`Query` execute on the ASP.NET request thread (the script runs inside +`CSharpScript.RunAsync` on a thread-pool thread). Both methods do +`_gateway.GetConnectionAsync(connectionName, _ct).GetAwaiter().GetResult()` +(`:25`, `:40`) — a synchronous block on an async DB-open — and then call the **synchronous** +`cmd.ExecuteScalar()` (`:29`) / `cmd.ExecuteReader()` (`:44`). Two consequences: + +1. **Thread-pool starvation.** The inbound API has no rate limiting (a deliberate design + choice). Each DB-backed method invocation blocks a pool thread for the full duration + of the DB round-trip via `.GetAwaiter().GetResult()`. Under concurrent load this + consumes pool threads that cannot be released until the DB responds — exactly the + pattern ASP.NET guidance warns against (it won't deadlock without a sync context, but + it does starve the pool and degrade the whole central node). + +2. **The slow query is not bounded by the method timeout.** The cancellation token + `_ct` (the method-deadline `cts.Token` passed at `InboundScriptExecutor.cs:268`) is + forwarded to `GetConnectionAsync` but **not** to the synchronous `ExecuteScalar()` / + `ExecuteReader()` — those overloads ignore cancellation entirely. So once a connection + is open, a hung or slow `SELECT` runs to completion (or the provider's own command + timeout) regardless of the method timeout, holding the blocked thread the whole time. + This contradicts the InboundAPI-016 guarantee that the method timeout bounds the + method's work. + +**Recommendation** + +Use the async ADO.NET path end-to-end and honour the token: `await +_gateway.GetConnectionAsync(...)`, then `await cmd.ExecuteReaderAsync(_ct)` / +`await cmd.ExecuteScalarAsync(_ct)`, and make `QuerySingle`/`Query` async (`Task` / +`Task>`) so the script `await`s them rather than blocking a pool +thread. Set a `CommandTimeout` derived from the remaining method deadline as a backstop. +If a synchronous script-facing signature must be kept, at minimum register the command +with the token and a command timeout so a slow query cannot run unbounded. + +**Resolution** + +_Unresolved._ + +### InboundAPI-028 — `InboundDatabaseHelper` has no negative-path tests; `Database`/`WaitForAttribute` are not exercised end-to-end through the endpoint + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/InboundDatabaseHelperTests.cs:34-67`, `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/EndpointExtensionsTests.cs` | + +**Description** + +`InboundDatabaseHelperTests` covers only the SELECT happy path (single value, no-rows +default, multi-column row, empty result). There is no test that: + +- a non-SELECT statement (`DELETE`/`UPDATE`/`DROP`) is rejected — because nothing rejects + it today (InboundAPI-026); a test would currently *prove* the helper is not read-only; +- the `InvalidOperationException("Database is not available …")` path fires when the + gateway is null (`InboundDatabaseHelper.cs:24`, `:39`); +- the method-deadline token cancels a slow query (InboundAPI-027). + +Separately, `InboundScriptContext.Database` and `RouteTarget.WaitForAttribute` — both +new script-facing capabilities — are unit-tested at the helper/route level but never +driven end-to-end through `HandleInboundApiRequest` (the `EndpointExtensionsTests` +TestServer suite does not invoke a script that touches `Database` or `WaitForAttribute`), +so a wiring regression that fails to construct the `Database` helper or thread the +deadline into a wait would be silent. + +**Recommendation** + +Add negative-path tests to `InboundDatabaseHelperTests` (null-gateway throw; once +InboundAPI-026 enforces read-only, a write/DDL-rejection test; an InboundAPI-027 +cancellation test). Add an `EndpointExtensionsTests` case driving a method script that +calls `Database.Query(...)` and one that calls `Route.To(...).WaitForAttribute(...)` so +the executor→context→helper wiring is covered through the real endpoint. + +**Resolution** + +_Unresolved._ + +### InboundAPI-029 — Routed `WaitForAttribute` is cancelled by the method-level deadline, contradicting spec §6 (wait bounded by the wait timeout, not the method timeout) + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/RouteHelper.cs:220-247`, `:302-303`; design doc `docs/requirements/Component-InboundAPI.md:192` | + +**Description** + +`Component-InboundAPI.md:192` says of `WaitForAttribute`: *"The cluster call is bounded +by the **wait timeout** rather than the generic integration timeout."* But +`RouteTarget.WaitForAttribute` routes with `Effective(cancellationToken)` (`:226`), +which — when the script passes no explicit token, the normal case — resolves to +`_deadlineToken` (the method-level timeout CTS bound by `WithDeadline` at +`InboundScriptExecutor.cs:279`). So a method whose timeout (default 30 s) is shorter than +the requested wait timeout (e.g. `WaitForAttribute("Flag", true, TimeSpan.FromMinutes(5))`) +has its wait cancelled at 30 s by the method deadline, never reaching the 5-minute wait +timeout the spec says governs it. The routed wait is thus bounded by the *generic method +timeout* — exactly what spec line 192 says it should not be. + +This is in genuine tension with the InboundAPI-016 decision ("the method timeout covers +routed calls, including waits"). The two design statements conflict for the +`WaitForAttribute` case: §6 line 192 wants the wait timeout to win; InboundAPI-016 wants +the method deadline to be the hard ceiling. The current code silently resolves it in +favour of the method deadline, which makes the documented `WaitForAttribute` timeout +semantics (and its `false`-on-timeout return contract) unreliable whenever the method +timeout is the smaller of the two. + +**Recommendation** + +Reconcile the two design statements explicitly. Options: (a) treat the method deadline as +the hard ceiling and update `Component-InboundAPI.md:192` to say the wait is bounded by +`min(waitTimeout, remaining method deadline)` and that an undersized method timeout will +cut a wait short; or (b) honour spec §6 by deriving the wait's effective token from the +wait timeout (a per-wait CTS) rather than the method deadline, accepting that a long wait +can outlive the generic method timeout. Either way, add a `RouteHelperTests` case pinning +the chosen semantics when method-timeout < wait-timeout (the current suite only tests +deadline-token *inheritance*, not the timeout *ordering* conflict). + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/KpiHistory/findings.md b/code-reviews/KpiHistory/findings.md new file mode 100644 index 00000000..8582a58c --- /dev/null +++ b/code-reviews/KpiHistory/findings.md @@ -0,0 +1,307 @@ +# Code Review — KpiHistory + +| Field | Value | +|-------|-------| +| Module | `src/ZB.MOM.WW.ScadaBridge.KpiHistory` | +| Design doc | `docs/requirements/Component-KpiHistory.md` | +| Status | Reviewed | +| Last reviewed | 2026-06-20 | +| Reviewer | claude-agent | +| Commit reviewed | `4307c381` | +| Open findings | 0 | + +## Summary + +KpiHistory is a small, well-built observability module: a single ~305-line recorder +singleton (`KpiHistoryRecorderActor`), a strongly-typed options class with a fail-fast +validator, and a thin DI composition root. The owned code is clean — the actor is +textbook best-effort: every sample pass and purge sweep runs off the actor thread via +`PipeTo`, per-source faults are isolated so one throwing `IKpiSampleSource` never aborts +a tick or the other sources, the repository write/purge is guarded, no exception escapes +either tick handler, and a lifecycle `CancellationTokenSource` is cancelled in `PostStop` +so an in-flight pass observes shutdown promptly. The singleton is wired correctly in the +Host (ClusterSingletonManager + proxy + `PhaseClusterLeave` drain) and is deliberately +absent from the readiness barrier, exactly as the design requires. The options validator, +the EAV table mapping, both named indexes, and all four `IKpiSampleSource` implementations +exist and are registered on the central host as designed. + +The dominant theme is **unbounded work under load**, in two places. First, the recorder +has **no in-flight guard** on its sample timer — directly contradicting its own XML-doc +claim to "mirror the NotificationOutboxActor timer + scope-per-tick + PipeTo pattern," +because the NotificationOutbox dispatcher *does* hold an in-flight guard and the recorder +does not. When a sample pass runs longer than `SampleInterval` (slow/recovering DB), Akka +periodic timers enqueue, not coalesce, so overlapping `RunSamplePass` tasks pile up, +multiplying DB load at exactly the moment the store is struggling and double-writing +samples for overlapping windows. Second, `GetRawSeriesAsync` has **no server-side row +cap**: the design's `DefaultMaxSeriesPoints` ceiling is applied by `KpiSeriesBucketer` +only *after* the full raw window is materialised into memory — a 7-day window at the +default 60 s cadence is ~10 080 rows per series pulled to the Central UI before +downsampling, defeating the stated intent of the cap. A secondary theme is **bucketer +contract drift** (the "largest-timestamp-wins for unsorted input" doc claim is not what +the code does, and the short-series early-return emits raw capture timestamps where the +downsample path emits bucket-boundary timestamps) — both live in Commons but are core to +this module's query reducer. No Critical findings; one High, three Medium, two Low. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | Yes | `capturedAt`/cut-off captured on the actor thread (correct). `KpiSeriesBucketer` short-series early-return emits raw capture timestamps vs bucket-boundary timestamps on the downsample path (KpiHistory-005). Bucketer "largest-timestamp-wins for unsorted input" doc claim is false — it is last-in-iteration (KpiHistory-006). | +| 2 | Akka.NET conventions | Yes | `PipeTo` + scope-per-tick + off-thread I/O + `IWithTimers` all correct; sender not captured across awaits; messages immutable singletons. But no in-flight guard despite the XML claiming to mirror NotificationOutbox (KpiHistory-001). | +| 3 | Concurrency & thread safety | Yes | Actor state is effectively stateless; `_shutdownCts` lifecycle is correct. The missing in-flight guard allows overlapping sample passes under DB latency (KpiHistory-001). | +| 4 | Error handling & resilience | Yes | Per-source isolation, write/purge guards, and `OperationCanceledException` shutdown handling are all correct and tested. No issues beyond KpiHistory-001's load amplification. | +| 5 | Security | Yes | No injection surface — all queries are parameterised LINQ; `Source`/`Metric`/`Scope`/`ScopeKey` are equality predicates, never interpolated SQL. Scope isolation via `ScopeKey == scopeKey` (incl. `IS NULL` for Global) is correct. No issues found. | +| 6 | Performance & resource management | Yes | `GetRawSeriesAsync` returns the entire window with no server-side cap before bucketing (KpiHistory-002). `RecordSamplesAsync` short-circuits empty batches (good). Purge is set-based `ExecuteDeleteAsync` but is a single unbatched statement (KpiHistory-004). | +| 7 | Design-document adherence | Yes | The recorder XML claims to mirror NotificationOutbox's pattern but omits its in-flight guard (KpiHistory-001). `DefaultMaxSeriesPoints`-as-a-true-cap intent is undermined by KpiHistory-002. Otherwise faithful: singleton, not-readiness-gated, daily purge, EAV schema, indexes. | +| 8 | Code organization & conventions | Yes | Options class owned by the component (correct); validator co-located; `IKpiHistoryRepository`/`IKpiSampleSource`/`KpiSample`/bucketer in Commons, impl in ConfigurationDatabase (correct); singleton Props built in Host. No issues found. | +| 9 | Testing coverage | Yes | Recorder isolation, faulted-tick recovery, and purge cut-off are tested; validator bounds fully covered; bucketer has strong sorted-input coverage. Gaps: no overlapping-tick test, no unsorted-input bucketer test, no short-series timestamp-semantics assertion (KpiHistory-003). | +| 10 | Documentation & comments | Yes | XML is generally excellent. Two drift points: the "mirror NotificationOutbox" claim (KpiHistory-001) and the bucketer unsorted-input claim (KpiHistory-006). `SampleComplete`/`PurgeComplete` no-op handlers are documented. | + +## Findings + +### KpiHistory-001 — Recorder has no in-flight guard; overlapping sample passes pile up under DB latency + +| | | +|--|--| +| Severity | High | +| Category | Akka.NET conventions | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.KpiHistory/KpiHistoryRecorderActor.cs:89-92`, `:103-107`, `:143-159` | + +**Description** + +The sample timer is a plain Akka periodic timer +(`Timers.StartPeriodicTimer(SampleTimerKey, SampleTick.Instance, …, interval: _options.SampleInterval)`). +`HandleSampleTick` launches `RunSamplePass(...)` off-thread via `PipeTo` and returns +immediately; the piped-back `SampleComplete` is a deliberate no-op +(`Receive(_ => { })`). There is **no in-flight guard** — nothing prevents +a second `SampleTick` from starting a second `RunSamplePass` while the first is still +awaiting its DB round-trip. + +Akka periodic timers do not coalesce missed ticks — they enqueue. So when a sample pass +takes longer than `SampleInterval` (a slow, contended, or recovering central MS SQL — +exactly the regime where observability matters most), each subsequent tick spawns *another* +concurrent pass. Each pass opens its own DI scope + `DbContext` and issues its own +`AddRange` + `SaveChangesAsync`. The result is a self-amplifying load spiral against the +struggling store, plus duplicate sample rows whose `CapturedAtUtc` values straddle the same +real-time window (the design's "one shared tick timestamp" invariant assumes one pass per +tick). + +The actor's own XML-doc states it "mirrors the `NotificationOutboxActor` timer + +scope-per-tick + PipeTo pattern" (lines 37-39) — but `NotificationOutboxActor` holds an +explicit in-flight boolean cleared on `DispatchComplete` precisely to serialise sweeps. +This recorder dropped that half of the pattern. The piped-back `SampleComplete` message is +the natural place the guard would be lowered, and its current empty body is the tell that +the guard was intended but omitted. + +**Recommendation** + +Add an in-flight guard: set a `_sampleInFlight` flag at the top of `HandleSampleTick`, skip +(and log at debug) the tick if already set, and clear it in the `SampleComplete` handler +(both success and failure projections already route there). This matches the +NotificationOutbox pattern the doc claims to follow and bounds the recorder to one pass per +tick. Add a regression test that drives two `SampleTick`s before the first pass completes +(e.g. a repository whose `RecordSamplesAsync` blocks on a gate) and asserts only one pass +ran. The purge tick is daily + idempotent so a guard there is optional, but consider the +same treatment for symmetry. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added a `_sampleInFlight` guard to the recorder — `HandleSampleTick` skips (debug-logs) if a pass is in flight; the flag is cleared via a `SampleComplete` message piped on BOTH success and failure, so a faulted source can't wedge the guard. Overlapping-tick regression test added. + +### KpiHistory-002 — `GetRawSeriesAsync` materialises the entire window with no server-side cap; `DefaultMaxSeriesPoints` is applied only after the fact + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IKpiHistoryRepository.cs:39-41` (contract); `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:45-62` (impl); consumed by `src/ZB.MOM.WW.ScadaBridge.CentralUI/Services/KpiHistoryQueryService.cs:77-93` | + +**Description** + +`GetRawSeriesAsync` has no `limit`/`maxPoints`/`Take` parameter. The implementation issues a +`Where(...).OrderBy(CapturedAtUtc).Select(...).ToListAsync()` that pulls **every** sample in +`[fromUtc, toUtc]` for the series into memory. The `DefaultMaxSeriesPoints` ceiling +(default 200, design-described as the cap that prevents "a single trend query [from +streaming] an arbitrarily large series to the Central UI" — `KpiHistoryOptionsValidator` +XML) is enforced by `KpiSeriesBucketer.Bucket` **after** the full raw set has already been +fetched across the wire and allocated as a `List` in the query service. + +At the default 60 s sample cadence, a 24 h window is ~1 440 rows/series, a 7 d window is +~10 080 rows/series, and a 30 d window ~43 200 rows/series — per chart, with up to four +trend panels on a page each issuing its own query. The cap that the design relies on for +back-pressure provides none at the data tier; it only trims the in-memory result the UI +binds to. The `IX_KpiSample_Series` index makes the *range scan* efficient, but the row +count returned is still unbounded by anything except the window the parent page chooses. + +**Recommendation** + +Push the downsampling toward the store, or at least bound the fetch. Cheapest correct fix: +add an optional `int? maxRows` to `GetRawSeriesAsync` and have the query service pass a +generous multiple of `effectiveMax` (e.g. `effectiveMax * k`) so the bucketer still has +enough density to pick representative last-values while the DB-side `Take` caps the +transfer. A more thorough fix is a server-side bucketed aggregation (GROUP BY a computed +bucket index, MAX(CapturedAtUtc) per bucket), but that is a larger change the design +explicitly deferred ("v1 ships exactly one aggregation"). At minimum, document the +unbounded-fetch behaviour and the practical window ceiling so an operator does not point a +30 d chart at a busy multi-site deployment. + +**Resolution** + +Deferred 2026-06-20: bounding the raw window fetch (cheap `Take`-cap vs. server-side bucketed aggregation) is a design decision and the code lives in ConfigurationDatabase/Commons, not this project. Recorded as a query-path enhancement; the practical window ceiling should be documented when addressed. + +### KpiHistory-003 — Missing tests: overlapping ticks, unsorted-input bucketing, short-series timestamp semantics + +| | | +|--|--| +| Severity | Medium | +| Category | Testing coverage | +| Status | Resolved | +| Location | `tests/ZB.MOM.WW.ScadaBridge.KpiHistory.Tests/KpiHistoryRecorderActorTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.Commons.Tests/Kpi/KpiSeriesBucketerTests.cs` | + +**Description** + +The recorder and bucketer tests are otherwise strong (per-source isolation, faulted-tick +recovery, purge cut-off, validator bounds, sorted-input bucketing, right-edge handling, +empty-bucket omission, out-of-window filtering). Three behaviour gaps remain, each tied to +a finding here: + +1. **No overlapping-tick test** (KpiHistory-001). Every recorder test sends a single tick + and awaits its effect; nothing exercises a second `SampleTick` arriving while the first + pass is still in flight, so the missing in-flight guard is invisible to the suite. A + gated-repository test (block `RecordSamplesAsync`, send two ticks, count passes) would + pin the intended one-pass-per-tick behaviour. +2. **No unsorted-input bucketer test** (KpiHistory-006). All bucketer tests pass ascending + input, so the doc's "largest-timestamp-wins for unsorted input" claim is never checked — + and it is in fact wrong. +3. **No short-series timestamp-semantics assertion** (KpiHistory-005). + `Bucket_BucketStartUtc_IsSetToBucketStart…` covers only the downsample path; no test + asserts what `BucketStartUtc` the early-return (`raw.Count <= maxPoints`) path emits, so + the inconsistency between the two paths is untested. + +**Recommendation** + +Add the three tests above. The overlapping-tick test belongs in +`KpiHistoryRecorderActorTests` (it is a recorder behaviour); the two bucketer tests belong +in `KpiSeriesBucketerTests`. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added the overlapping-tick gated-repository test plus unsorted-input and short-series bucketer tests (the latter pin the documented short-series behaviour). + +### KpiHistory-004 — Retention purge is a single unbatched `ExecuteDeleteAsync`; a large backlog deletes in one transaction + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:65-71` | + +**Description** + +`PurgeOlderThanAsync` runs one set-based `Where(s => s.CapturedAtUtc < before).ExecuteDeleteAsync(...)`. +For the steady-state daily cadence this deletes one day of expired rows and is fine. But if +the purge has not run for an extended period (singleton down across a long failover window, +`PurgeInterval` mis-set and later corrected, or `RetentionDays` shortened), a single +unbounded `DELETE` can touch a very large row count in one transaction — lock escalation on +`KpiSample`, a long-running transaction, and transaction-log growth, which on the shared +central MS SQL can affect operational tables. The Audit Log purge path, by contrast, uses a +bounded batched delete for exactly this reason. + +This is observability data on a non-partitioned `[PRIMARY]` table, so the blast radius is +bounded and the severity is Low — but the unbatched delete is a latent operational hazard +on the shared store. + +**Recommendation** + +Loop a bounded delete (`DELETE TOP (N) … WHERE CapturedAtUtc < @before`, or EF Core's +batching) until zero rows are affected, mirroring the Audit Log purge shape. Keep the +returned total for the existing log line. + +**Resolution** + +Deferred 2026-06-20: batching the retention purge is a shared-MS-SQL-store tradeoff (vs. the Audit Log's batched-delete precedent); low severity for non-partitioned observability data. Recorded as a future enhancement. + +### KpiHistory-005 — `KpiSeriesBucketer` short-series early-return emits raw capture timestamps where the downsample path emits bucket-boundary timestamps + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:57-58` (early return) vs `:93-94` (downsample) | + +**Description** + +`KpiSeriesBucketer.Bucket` has two return paths that disagree on what `BucketStartUtc` +means. When `raw.Count <= maxPoints` it returns `raw` unchanged — those points carry the +raw `CapturedAtUtc` as their `BucketStartUtc` (the repository builds +`new KpiSeriesPoint(s.CapturedAtUtc, s.Value)`). When `raw.Count > maxPoints` it returns +points whose `BucketStartUtc` is the **bucket boundary** +(`fromUtc + bucketIndex * bucketWidthTicks`), as the dedicated test +`Bucket_BucketStartUtc_IsSetToBucketStartNotRawPointTimestamp` asserts. + +So the same series, charted at a density below vs above `maxPoints`, plots its points at +different x-positions: actual capture instants in the sparse case, evenly-spaced bucket +starts in the dense case. The downstream `KpiTrendChart` normalises X across +`[min(BucketStartUtc), max(BucketStartUtc)]`, so the visual impact is minor (the time range +is essentially the same), but the contract is inconsistent and the x-axis "tick spacing" +subtly changes as a window crosses the cap. This is the bucketer that the KpiHistory design +defines as the module's query reducer, so the inconsistency is in-scope even though the file +lives in Commons. + +**Recommendation** + +Make the two paths agree. Either document the difference explicitly on `Bucket` (the +short-series path returns raw capture instants; the downsample path returns bucket starts), +or — cleaner — have the short-series path also project onto a consistent timestamp basis if +exact bucket-start semantics are part of the contract. Add the short-series timestamp test +from KpiHistory-003. + +**Resolution** + +Deferred 2026-06-20: whether the bucketer's short-series early-return and the downsample path must agree on `BucketStartUtc` semantics (vs. documenting the difference) is a contract decision for the design owner. The doc was corrected (KpiHistory-006) to describe current behaviour; the contract change is deferred. + +### KpiHistory-006 — `KpiSeriesBucketer` doc claims "largest timestamp within each bucket is selected" for unsorted input; the code selects last-in-iteration + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:20-21` (doc), `:88-97` (code) | + +**Description** + +The `raw` param doc states: *"If not sorted, the point with the largest timestamp within +each bucket is selected."* The code does not do this. When a point is stored into a bucket, +`best[bucketIndex].BucketStartUtc` is set to the **bucket-start** timestamp +(`fromUtc + bucketIndex * bucketWidthTicks`), not the raw point's timestamp. The +last-value comparison for a subsequent point in the same bucket is then +`point.BucketStartUtc > best[bucketIndex].BucketStartUtc` — i.e. it compares the new raw +point's capture time against the *bucket start*, which (for any in-bucket point) is almost +always true. The effect is that each later-in-iteration point overwrites the previous one +regardless of their relative timestamps, so the **last point in iteration order** wins, not +the point with the largest timestamp. + +For the production caller this is harmless: `KpiHistoryRepository.GetRawSeriesAsync` always +`OrderBy(CapturedAtUtc)`, so iteration order equals time order and last-in-iteration is the +largest timestamp. But the documented contract for unsorted input is simply false, and the +"if ties, keep first encountered — stable" comment (line 89) is also inaccurate — the +overwrite triggers on equal-as-well-as-greater for any in-bucket point. A future caller that +trusts the unsorted-input guarantee will get wrong results silently. + +**Recommendation** + +Either (a) fix the comparison to track the selected raw point's actual timestamp (store the +raw `point.BucketStartUtc` alongside the emitted value and compare against that), making the +"largest timestamp wins" claim true for unsorted input; or (b) tighten the doc to state the +method requires ascending-sorted input and selects last-in-iteration otherwise, and drop the +inaccurate "largest timestamp" / "stable ties" language. Pair with the unsorted-input test +from KpiHistory-003. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): corrected the `KpiSeriesBucketer` param XML doc — dropped the false 'largest-timestamp-wins / stable ties' claim; now states it requires ascending-sorted input and selects last-in-iteration otherwise. Doc-only. diff --git a/code-reviews/NotificationService/findings.md b/code-reviews/NotificationService/findings.md index 401b94a8..fddec49e 100644 --- a/code-reviews/NotificationService/findings.md +++ b/code-reviews/NotificationService/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.NotificationService` | | Design doc | `docs/requirements/Component-NotificationService.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | +| Commit reviewed | `4307c381` | | Open findings | 0 | ## Summary @@ -100,6 +100,39 @@ and `CredentialRedactor` masks any component of the credential string that is ≥ 4 characters long — a 4-character user name like `root` or a 4-char tenant prefix could be aggressively scrubbed out of unrelated log text (NS-025). +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +Re-reviewed the whole current state. Since `1eb6e97` the module was renamed +(`ScadaLink → ZB.MOM.WW.ScadaBridge`, so the diff shows every file as "added" at +the new path), gained complete XML-doc coverage, and — most importantly — **NS-021 +is now RESOLVED**: `ISmtpClientWrapper.AuthenticateAsync` threads an `oauth2UserName` +and the OAuth2 branch builds `new SaslMechanismOAuth2(oauth2UserName, token)`, +throwing `SmtpPermanentException` if the identity is empty; the real production +caller `EmailNotificationDeliveryAdapter.SendAsync` passes `config.FromAddress` +(`src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Delivery/EmailNotificationDeliveryAdapter.cs:196`), +and `MailKitSmtpClientWrapperTests` pins both the throw and the XOAUTH2 framing. +All ten prior NS-019..NS-025 findings hold; the module is now cleanly scoped to the +shared central SMTP primitives (no orphaned site-delivery code). The re-review +surfaced **three new findings**, none above Medium: the OAuth2 token endpoint and +scope are hardcoded to Microsoft 365 so no other OAuth2 provider can authenticate +despite the design doc promising "other modern SMTP providers" (NS-026), one test +file is misplaced in this test project (its system-under-test lives in +NotificationOutbox — NS-027), and one test's XML doc still names the +NS-019-deleted `NotificationDeliveryService` (NS-028). + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | NS-021 now resolved (oauth2 user identity threaded + tested). Classification/redaction correct. No new correctness defect. | +| 2 | Akka.NET conventions | ☑ | No actors. `AddNotificationServiceActors` is a documented no-op. | +| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` per-credential `ConcurrentDictionary` + per-entry `SemaphoreSlim` double-check is correct; concurrent-fetch test passes. No issues. | +| 4 | Error handling & resilience | ☑ | Delivery error path now lives entirely in NotificationOutbox; the shared `SmtpErrorClassifier` is sound (typed exceptions + numeric SMTP code, cancellation excluded). No issues here. | +| 5 | Security | ☑ | OAuth2 secret/token never logged (tenant only); `CredentialRedactor` tightened (NS-025). At-rest encryption of `SmtpConfiguration.Credentials` still NS-013 Deferred (Commons-scoped). OAuth2 endpoint hardcoded to M365 (NS-026). | +| 6 | Performance & resource management | ☑ | Single `SmtpClient` per wrapper, disposed by caller's `using`; per-delivery factory documented (NS-022). No leak. | +| 7 | Design-document adherence | ☑ | Central-only redesign clean. New gap: OAuth2 hardcoded to M365 token URL + outlook.office365.com scope contradicts "other modern SMTP providers" in the doc (NS-026). | +| 8 | Code organization & conventions | ☑ | One-type-per-file holds. `NotificationOutboxKpiSampleSourceTests.cs` is in this test project but its SUT is in NotificationOutbox (NS-027). | +| 9 | Testing coverage | ☑ | Primitives well covered (classifier, TLS parser, redactor, OAuth2 token cache/refresh/concurrency, auth-wrapper negative paths incl. NS-021). No false-coverage; orphaned tests gone (NS-024). | +| 10 | Documentation & comments | ☑ | XML docs accurate after NS-019/023, EXCEPT `SmtpErrorClassifierTests.cs:11` still references the deleted `NotificationDeliveryService` (NS-028). | + ## Checklist coverage | # | Category | Examined | Notes | @@ -860,3 +893,118 @@ The threshold is a defence-in-depth choice; the existing tests assert that `Hunt **Recommendation** Tighten the redaction policy: mask only the obviously-secret components — the password (Basic), the client secret (OAuth2), and the whole packed string — not the user name / tenant / client id. The simplest implementation is to redact only the **last** colon-separated component (the secret) plus the full packed string. Bump the per-component minimum length to something high enough that a typical short user name does not match (≥ 12 chars is the usual heuristic for a password). Add a test asserting `Scrub("/root/.config", "root:hunter2")` does not mask `/root/.config`'s `root`. + +### NotificationService-026 — OAuth2 token endpoint and scope are hardcoded to Microsoft 365; no other OAuth2 SMTP provider can authenticate + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.NotificationService/OAuth2TokenService.cs:73`, `src/ZB.MOM.WW.ScadaBridge.NotificationService/OAuth2TokenService.cs:80` | + +**Description** + +`OAuth2TokenService.GetTokenAsync` constructs the token URL as +`$"https://login.microsoftonline.com/{tenantId}/oauth2/v2.0/token"` (`:73`) and +requests the fixed scope `"https://outlook.office365.com/.default"` (`:80`). Both +are Microsoft-365-specific: the authority host is Azure AD's, and the scope is the +Outlook/Office-365 SMTP-send resource. The design doc (`Component-NotificationService.md`, +"Email Server Configuration") states the OAuth2 Client Credentials mode is +"For Microsoft 365 **and other modern SMTP providers** that require OAuth2", and +CLAUDE.md's External Integrations decision likewise says "OAuth2 Client Credentials +(Microsoft 365) **or** Basic Auth". As implemented, only Microsoft 365 can ever +work: a Google Workspace, on-prem Keycloak/ADFS, or any non-Azure OAuth2 SMTP relay +would need a different authority host and a different scope, neither of which is +configurable. The credential triple carries only `tenantId:clientId:clientSecret` +— there is no authority-host or scope field to override either constant. An operator +who configures an `oauth2` SMTP config pointing at a non-M365 provider gets a token +request sent to `login.microsoftonline.com` with `{tenantId}` interpolated as a path +segment, which fails (404/400) and surfaces as a permanent delivery failure with no +hint that the provider is simply unsupported. + +This is latent rather than actively broken — the only documented/tested target is +M365 — but it contradicts the design doc's stated breadth and is the kind of gap that +silently blocks a future deployment against a non-Microsoft tenant. + +**Recommendation** + +Either (a) make the authority host and scope configurable — add optional +`OAuth2Authority`/`OAuth2Scope` fields to `SmtpConfiguration` (defaulting to the M365 +values) and thread them into `GetTokenAsync` — and update the cache key to include +them; or (b) narrow the design doc and CLAUDE.md to say OAuth2 Client Credentials is +**Microsoft 365 only** and document the M365-only constraint on `OAuth2TokenService`, +so an operator is not misled into configuring an unsupported provider. Option (b) is +the smaller change if M365-only is acceptable for v1. + +**Resolution** + +Deferred 2026-06-20: the OAuth2 token endpoint + scope are hardcoded to Microsoft 365 while the doc implies multi-provider support. Whether to make authority/scope configurable or narrow the doc to 'Microsoft 365 only' for v1 is a product/scope decision. Latent today (only M365 is deployed/tested). Recorded for that decision. + +### NotificationService-027 — `NotificationOutboxKpiSampleSourceTests` is in the wrong test project (its system-under-test lives in NotificationOutbox) + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Resolved | +| Location | `tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/Kpi/NotificationOutboxKpiSampleSourceTests.cs:1`, `src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Kpi/NotificationOutboxKpiSampleSource.cs` | + +**Description** + +`tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/Kpi/NotificationOutboxKpiSampleSourceTests.cs` +was added (commit `0d6c026d`, "K6 — NotificationOutbox sample source") inside the +**NotificationService** test project, but the class it exercises — +`NotificationOutboxKpiSampleSource` — lives in the **NotificationOutbox** project +(`src/ZB.MOM.WW.ScadaBridge.NotificationOutbox/Kpi/NotificationOutboxKpiSampleSource.cs`), +confirmed by a repo-wide grep showing the type defined only there. A module's test +project should test that module: this test belongs in +`tests/ZB.MOM.WW.ScadaBridge.NotificationOutbox.Tests`. As placed, it forces the +NotificationService test project to reference the NotificationOutbox project, blurs +the module boundary the review process treats as the unit of review, and means a +reviewer scoping "the NotificationService tests" picks up coverage for a different +component. It also inflates this module's apparent test count with tests that assert +nothing about NotificationService. + +**Recommendation** + +Move `Kpi/NotificationOutboxKpiSampleSourceTests.cs` to +`tests/ZB.MOM.WW.ScadaBridge.NotificationOutbox.Tests/Kpi/`, and drop the +NotificationOutbox project reference from `ZB.MOM.WW.ScadaBridge.NotificationService.Tests.csproj` +if nothing else in that test project needs it. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): moved `NotificationOutboxKpiSampleSourceTests` from NotificationService.Tests into NotificationOutbox.Tests (where its SUT lives), with the namespace adjusted to that project's convention. Passes in its new home. + +### NotificationService-028 — Stale XML cross-reference: `SmtpErrorClassifierTests` still names the NS-019-deleted `NotificationDeliveryService` + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Resolved | +| Location | `tests/ZB.MOM.WW.ScadaBridge.NotificationService.Tests/SmtpErrorClassifierTests.cs:11` | + +**Description** + +The class-summary XML doc on `SmtpErrorClassifierTests` says the classification +policy "is shared between `` and the central +outbox's `EmailNotificationDeliveryAdapter`". `NotificationDeliveryService` was +deleted as part of the NS-019 resolution (the orphaned site-side sender), so this +cross-reference points at a type that no longer exists. The `` will not +resolve and the sentence describes a sharing relationship that ended with the +central-only redesign — the classifier is now consumed only by +`EmailNotificationDeliveryAdapter`. This is the same stale-doc class NS-023 cleaned up +in the production source, missed in this one test file. Minor, but misleading to a +maintainer and a dangling `cref`. + +**Recommendation** + +Drop the `NotificationDeliveryService` reference from the summary; state that the +classifier is shared by the central `EmailNotificationDeliveryAdapter` (and is a +reusable primitive of this module), consistent with the NS-019/NS-023 cleanup. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): removed the dangling `` (type deleted by NS-019) from the `SmtpErrorClassifierTests` XML doc, repointing the reference to the actual production caller. Doc-only. diff --git a/code-reviews/README.md b/code-reviews/README.md index fb22104b..7869ab63 100644 --- a/code-reviews/README.md +++ b/code-reviews/README.md @@ -40,37 +40,39 @@ module file and counted in **Total**. | Severity | Open findings | |----------|---------------| | Critical | 0 | -| High | 0 | -| Medium | 0 | -| Low | 0 | -| **Total** | **0** | +| High | 1 | +| Medium | 1 | +| Low | 2 | +| **Total** | **4** | ## Module Status | Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total | |--------|---------------|--------|----------------|------|-------| -| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 11 | +| [AuditLog](AuditLog/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 16 | | [CLI](CLI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 24 | | [CentralUI](CentralUI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 36 | -| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 14 | +| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 15 | | [Commons](Commons/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 | -| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 | +| [Communication](Communication/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 24 | | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 27 | -| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 | -| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 24 | -| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 | -| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 | -| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 | -| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 25 | +| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 | +| [DeploymentManager](DeploymentManager/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 | +| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 | +| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 | +| [Host](Host/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 | +| [InboundAPI](InboundAPI/findings.md) | 2026-06-20 | `4307c381` | 0/1/1/2 | 4 | 29 | +| [KpiHistory](KpiHistory/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 6 | | [ManagementService](ManagementService/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 | | [NotificationOutbox](NotificationOutbox/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 13 | -| [NotificationService](NotificationService/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 25 | -| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 21 | -| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 6 | -| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 | -| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 26 | -| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 24 | -| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 | +| [NotificationService](NotificationService/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 28 | +| [ScriptAnalysis](ScriptAnalysis/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 8 | +| [Security](Security/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 | +| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 9 | +| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 | +| [SiteRuntime](SiteRuntime/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 31 | +| [StoreAndForward](StoreAndForward/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 | +| [TemplateEngine](TemplateEngine/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 | | [Transport](Transport/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 15 | ## Pending Findings @@ -84,14 +86,21 @@ description, location, recommendation — lives in the module's `findings.md`. _None open._ -### High (0) +### High (1) -_None open._ +| ID | Module | Title | +|----|--------|-------| +| InboundAPI-026 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` gives inbound scripts arbitrary raw SQL (not read-only); contradicts the design doc's "No direct database access" decision and regresses InboundAPI-007 | -### Medium (0) +### Medium (1) -_None open._ +| ID | Module | Title | +|----|--------|-------| +| InboundAPI-027 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` is sync-over-async and ignores the method-deadline token — thread-pool starvation and an unbounded slow query | -### Low (0) +### Low (2) -_None open._ +| ID | Module | Title | +|----|--------|-------| +| InboundAPI-028 | [InboundAPI](InboundAPI/findings.md) | `InboundDatabaseHelper` has no negative-path tests; `Database`/`WaitForAttribute` are not exercised end-to-end through the endpoint | +| InboundAPI-029 | [InboundAPI](InboundAPI/findings.md) | Routed `WaitForAttribute` is cancelled by the method-level deadline, contradicting spec §6 (wait bounded by the wait timeout, not the method timeout) | diff --git a/code-reviews/ScriptAnalysis/findings.md b/code-reviews/ScriptAnalysis/findings.md new file mode 100644 index 00000000..8a6a94ea --- /dev/null +++ b/code-reviews/ScriptAnalysis/findings.md @@ -0,0 +1,407 @@ +# Code Review — ScriptAnalysis + +| Field | Value | +|-------|-------| +| Module | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis` | +| Design doc | `docs/requirements/Component-ScriptAnalysis.md` | +| Status | Reviewed | +| Last reviewed | 2026-06-20 | +| Reviewer | claude-agent | +| Commit reviewed | `4307c381` | +| Open findings | 0 | + +## Summary + +This is the first formal review of the ScriptAnalysis component (#25), the single +authoritative source of truth for the ScadaBridge script trust boundary. The module +is small (5 source files, ~700 lines) and well-documented, and its core design is +**sound**: the trust verdict is genuinely **semantic** (Roslyn symbol resolution in +Pass 1), not a string/regex deny-list, so the classic spelling-based evasions — +aliases, `using static`, `global::`, partial-namespace qualification — are correctly +defeated by resolving the symbol rather than the text. The reflection-gateway +hardening (Pass 2) closes the `typeof(x).Assembly.GetType("System.IO.File")` class of +bypass that a namespace-only deny-list cannot see. All four documented consumers +(Template Engine, Site Runtime, Inbound API, Central UI) genuinely delegate to +`ScriptTrustValidator.FindViolations` and **hard-reject** on a non-empty result; none +re-implements or relaxes the policy. A reflection-based parity guard for the +compile-only surfaces exists in `SiteRuntime.Tests`. **No working trust-boundary +bypass was found at HEAD against the production (cluster-hosted) configuration.** + +That said, the review surfaced several real issues. The highest-impact one is a +**silent degradation of the semantic pass when `TRUSTED_PLATFORM_ASSEMBLIES` is +unavailable** (single-file/AOT/trimmed host): `AnalysisReferences` falls back to the +5-assembly minimal set, after which a *bare* forbidden type inside an *allowed* +namespace (the documented `Process` via `using System.Diagnostics;` case) no longer +resolves and slips Pass 1 entirely — and Pass 2 never flags bare identifiers, only +dotted spellings (ScriptAnalysis-001). This is the only finding that can weaken the +verdict, and it is narrow (the production hosts carry a TPA list, and the downstream +real `Compile` gate catches it for two of the four consumers), so it is rated High +rather than Critical. The remaining findings are robustness/correctness/doc issues: +the semantic pass swallows nothing but the unresolved-symbol fallback is namespace- +spelling-dependent (Medium); `nameof(...)` over-flags legitimate non-forbidden uses +inconsistently (Low, documented-intentional but worth pinning); `ParseDiagnostics`/ +`Compile` drop warnings despite the doc promising "errors and warnings"; there is no +caching/throttling of the Roslyn compilations (performance); the `extraReferences` +XML-doc on `FindViolations` is contradicted by the Central UI consumer; and the +adversarial test corpus, though good, has gaps (no extension-method, no +TPA-fallback, no `file`-scoped/`unsafe`, no verbatim/Unicode-escape identifier +cases). The trust boundary is **semantically sound in production**; the open items +harden the edges and align the spec. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | `nameof` over-flags inconsistently (ScriptAnalysis-005); `GetSyntacticScopes` only returns text for dotted names so a bare unresolved forbidden identifier yields no scope (root of ScriptAnalysis-001/002). | +| 2 | Akka.NET conventions | ☑ | No issues found — module has no actors, no Akka dependency (by design; csproj references only Roslyn + Commons). Stateless static APIs, thread-safe. | +| 3 | Concurrency & thread safety | ☑ | No issues found — all public surfaces are static/stateless; the shared `static readonly` policy collections are read-only after init; `HardeningWalker` is per-call. `AnalysisReferences` eager-built once at type init. | +| 4 | Error handling & resilience | ☑ | `RoslynScriptCompiler.Compile` converts exceptions into a fail-the-gate error entry (fail-safe). `BuildAnalysisReferences` swallows per-assembly read errors and silently falls back to the minimal set — the silent fallback is the resilience gap (ScriptAnalysis-001). | +| 5 | Security | ☑ | Semantic (symbol-based) verdict — no spelling bypass found in production. One conditional degradation (TPA-absent host) weakens Pass 1 to spelling-dependent (ScriptAnalysis-001); `extraReferences` can never cause a false allow (verified). Defence-in-depth only, as documented — not a runtime sandbox (ScriptAnalysis-008). | +| 6 | Performance & resource management | ☑ | No compilation caching/throttling; every gate call re-parses, rebuilds a compilation, and re-runs Roslyn — `FindViolations` builds a fresh `CSharpCompilation` against the *full framework* reference set on every call (ScriptAnalysis-006). | +| 7 | Design-document adherence | ☑ | `ParseDiagnostics`/`Compile` drop warnings though doc says "errors and warnings" (ScriptAnalysis-004); `AnalysisReferences` (full-framework set) is not described in REQ-SA-2, which says DefaultReferences (ScriptAnalysis-007). | +| 8 | Code organization & conventions | ☑ | No issues found — POCO/static layout is clean, namespace correct, depends only on Commons + Roslyn. `ScriptCompileSurface` lives in ScriptAnalysis (not Commons) — acceptable as it mirrors a runtime surface and is consumed here. | +| 9 | Testing coverage | ☑ | Good adversarial reject/allow corpus, but missing: TPA-fallback degradation, extension methods, verbatim/Unicode-escape identifiers, `unsafe`/pointer, comment/trivia, and `AnalysisReferences`/exception paths (ScriptAnalysis-003). | +| 10 | Documentation & comments | ☑ | `FindViolations` XML-doc claim "Forbidden references are NOT added here" is contradicted by the Central UI consumer forwarding `compilation.References` (ScriptAnalysis-007, doc half). Otherwise thorough and accurate. | + +## Findings + +### ScriptAnalysis-001 — Semantic trust pass silently degrades to spelling-dependent when `TRUSTED_PLATFORM_ASSEMBLIES` is unavailable + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustPolicy.cs:146-184`; interacts with `ScriptTrustValidator.cs:111-136`, `:195-213` | + +**Description** + +`AnalysisReferences` is the reference set that lets Pass 1 (the semantic symbol pass — +the *authoritative* guard) resolve every type a script names to its true namespace. +It is built from `AppContext.GetData("TRUSTED_PLATFORM_ASSEMBLIES")`. When that value +is absent — a single-file deployment, an AOT/trimmed host, or any runtime that does +not publish the TPA list — `BuildAnalysisReferences` falls back to the 5-assembly +`DefaultReferences` set (`ScriptTrustPolicy.cs:183`) **silently** (the per-assembly +`catch { }` at `:165`/`:178` and the `Count > 0` fallback at `:183` emit nothing). + +After that fallback, a *bare* forbidden type that lives inside an *allowed* namespace +no longer resolves. The policy's own documented example is exactly this case: +`using System.Diagnostics; var p = Process.Start("x");`. `System.Diagnostics` is an +allowed namespace (so `VisitUsingDirective` does not flag the import), and bare +`Process` is an `IdentifierName`. With `Process.dll` absent from the minimal +reference set, `model.GetSymbolInfo` returns no symbol, so Pass 1 falls to +`GetSyntacticScopes`, which returns an empty scope list for a bare (dotless) +identifier (`ScriptTrustValidator.cs:210-212`). Pass 2's `HardeningWalker` only flags +forbidden *dotted* names and forbidden `using` imports — never a bare identifier +token (`VisitIdentifierName`, `:373-395`, only checks `dynamic`/`Activator`). The +forbidden `Process.Start` therefore slips the trust validator entirely. The same +degradation reopens the broader "forbidden type in an allowed namespace reached as a +bare identifier" class that the full-framework reference set was specifically added +(commit `06975720`, M3.6) to close. + +This is rated High not Critical because: (a) the production hosts (Akka.NET cluster, +docker topology) run on a normal framework-dependent runtime that publishes the TPA +list, so the full set is loaded; and (b) for two of the four consumers (Template +Engine, Site Runtime) the downstream real `Compile` gate against a curated reference +set independently rejects an undefined `Process` symbol. But Inbound API and the +Central UI run gate rely on `FindViolations` for the reflection/namespace verdict, +and a deployment-mode change that drops the TPA list would silently weaken the +boundary with no diagnostic. + +**Recommendation** + +Make the degradation loud and fail-safe: when the TPA list is unavailable, log a +warning (or expose a diagnostic flag the consumers can surface), and — more +importantly — add the BCL assemblies that host the forbidden-type-in-allowed-namespace +cases (at minimum `System.Diagnostics.Process`) to `DefaultAssemblies` so they resolve +even in the minimal fallback. Alternatively, extend Pass 2 to flag bare identifiers +whose name matches the simple name of a known forbidden type (e.g. `Process`, +`WebClient`), independent of symbol resolution, so the verdict never depends solely on +the reference set being complete. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): the TPA-absent fallback now folds the +assemblies hosting the forbidden anchor types (Process, Socket, File, Thread, Assembly, +Marshal) into the minimal validation reference set so a bare forbidden type still +resolves and is flagged; an `AnalysisReferencesDegraded` flag + a `Trace.TraceWarning` +make the degraded mode observable. CRITICAL: anchors are added ONLY to the validation +`AnalysisReferences`, never to `DefaultReferences` — keeping the compile gate from being +able to bind `Process`. Verified by a TPA-fallback test that a bare `Process.Start` is +still rejected. + +### ScriptAnalysis-002 — Pass 2 (syntactic hardening) is spelling-dependent and cannot catch a bare forbidden identifier on its own + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:289-395`; `:195-213` | + +**Description** + +Pass 2 (`HardeningWalker`) is purely textual: `VisitUsingDirective` flags forbidden +`using` imports (`:289-295`), `VisitQualifiedName`/`VisitMemberAccessExpression` flag +forbidden *dotted* names (`:298-369`), and `VisitIdentifierName` flags only the bare +identifiers `dynamic`/`Activator` (`:373-395`). It cannot independently flag any other +bare forbidden identifier — it relies entirely on Pass 1's semantic resolution for +that. The two passes are therefore not the "two independent guards" the design implies +for the bare-identifier case: for an unresolved bare identifier (see +ScriptAnalysis-001) **neither** pass fires. `GetSyntacticScopes` deliberately returns +nothing for dotless names (`:210-212`, "bare local identifiers cannot reach a forbidden +namespace") — which is true only while Pass 1 resolves them; once resolution fails the +assumption breaks. + +This is the structural root of ScriptAnalysis-001 and is recorded separately because +it is a design property (the syntactic pass is not self-sufficient against bare names) +that should be acknowledged even after the TPA-fallback hole is closed: the defence is +single-layered, not double-layered, for the bare-identifier vector. + +**Recommendation** + +Document explicitly that Pass 1 is the sole guard for bare (single-identifier) +forbidden references and that Pass 2 only backstops *spelled* forbidden namespaces. If +genuine defence-in-depth is wanted for bare names, add a simple-name deny-set to Pass 2 +(the simple names of the forbidden root types — `Process`, `WebClient`, `Marshal`, +`Registry`, `Thread`, `File`, `Directory`, etc.), accepting the small false-positive +risk on a script's own identically-named local, which a security gate should tolerate. + +**Resolution** + +Deferred 2026-06-20: strengthening Pass-2 (syntactic) so a bare forbidden identifier is +caught without Pass-1 symbol resolution is a defence-in-depth enhancement; the +production path (TPA present, two-layer compile gate on most call sites) is sound. +Recorded for a follow-up. + +### ScriptAnalysis-003 — Adversarial test corpus has bypass-relevant gaps + +| | | +|--|--| +| Severity | Medium | +| Category | Testing coverage | +| Status | Resolved | +| Location | `tests/ZB.MOM.WW.ScadaBridge.ScriptAnalysis.Tests/ScriptTrustValidatorTests.cs`; `RoslynScriptCompilerTests.cs` | + +**Description** + +The reject/allow corpus is solid for the spelled-namespace, alias, `using static`, +`global::`, reflection-gateway, generic-argument, and nested-lambda vectors. For a +security-critical trust boundary, however, several bypass-relevant cases are untested: + +- **TPA-fallback degradation** (ScriptAnalysis-001) — no test forces + `AnalysisReferences` onto the minimal set and asserts that bare `Process`/`WebClient` + is still rejected. This is the highest-value missing test. +- **Extension methods** — no test that a forbidden-namespace extension method invoked + in receiver-position (`x.SomeForbiddenExt()`) is resolved and flagged. +- **Verbatim / Unicode-escaped identifiers** — `@dynamic`, `dynamic`, + `Activator`, or `GetType` are not exercised; `VisitIdentifierName` compares + `Identifier.ValueText`, which decodes Unicode escapes (so it likely holds), but the + case is not pinned, and `dynamic` is matched via `text == "dynamic"` (`:380`) which is + a separate path from the `ForbiddenIdentifiers` set. +- **`unsafe`/pointer/`stackalloc`** — relies implicitly on `ScriptOptions` defaulting + `AllowUnsafe=false`; not asserted. +- **Comment/trivia and string-literal payloads** — no test that a forbidden namespace + inside a comment or string is *not* flagged (false-positive guard) and that a + forbidden token reconstructed only at runtime from strings is correctly *not* the + validator's concern (documents the sandbox caveat). + +**Recommendation** + +Add the cases above, prioritising the TPA-fallback test (inject a minimal-reference +build of `AnalysisReferences` or factor the reference set so a test can force the +fallback) and the extension-method test. Add a verbatim/Unicode-escape identifier +reject test for `dynamic`/`Activator`/`GetType`. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added 18 adversarial tests — TPA-fallback +bare-Process/Socket, extension-method invocation, verbatim/Unicode-escaped identifiers, +`unsafe` blocks (benign vs forbidden-API-inside), and comment/string-literal +false-positive guards. + +### ScriptAnalysis-004 — `ParseDiagnostics`/`Compile` drop warnings though the design doc promises "errors and warnings" + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/RoslynScriptCompiler.cs:30-33`, `:73-76` | + +**Description** + +REQ-SA-3 states `Compile` returns the diagnostics "filtered to errors and warnings" +and `ParseDiagnostics` returns "syntax-level diagnostics (errors and warnings)". The +code filters `d.Severity == DiagnosticSeverity.Error` only in both methods +(`RoslynScriptCompiler.cs:31` and `:74`), so warnings are silently dropped. Filtering +to errors-only is arguably the safer choice for a compile *gate* (a warning should not +block a deploy), but the code and the spec disagree, and a future maintainer reading +the doc will expect warnings in the returned list. + +**Recommendation** + +Reconcile: either change the doc (REQ-SA-3) to say "errors only", or include warnings +and let callers decide. Given the gate semantics, updating the doc to "error-severity +diagnostics" is the cleaner fix. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): corrected the design doc (REQ-SA-3) to +'error-severity only', matching the compile wrapper's existing behaviour (a warning must +not block a deploy gate). Code unchanged. + +### ScriptAnalysis-005 — `nameof(forbiddenType)` flagging is inconsistent between passes and across resolution states + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:42-48`, `:99-136`, `:298-325` | + +**Description** + +The class comment (`:42-48`) states that a forbidden type inside `nameof(...)` is +"deliberately flagged" as a conservative fail-safe, and `Rejects_NameOf_ForbiddenType` +pins `nameof(System.IO.File)`. However the flagging is incidental, not deliberate, and +inconsistent: `nameof(System.IO.File)` is caught because the inner `System.IO.File` is +a `QualifiedNameSyntax`/member access that both passes inspect directly — the +`nameof` wrapper is irrelevant. A bare `nameof(Process)` (after `using +System.Diagnostics;`, with full TPA loaded) resolves `Process` to a real symbol and +*is* flagged by Pass 1; but the same `nameof(Process)` under the minimal-reference +fallback resolves to nothing and is *not* flagged. So the "deliberate fail-safe" +behaviour actually varies with the reference set, and `nameof` — which produces only a +compile-time string and reaches no API — is a genuine false positive that will reject a +legitimate script doing `nameof(System.Net.IPAddress)` for a log message. + +**Recommendation** + +Decide the intended semantics explicitly. If `nameof` should be allowed (it reaches no +runtime API), special-case `NameOfExpression`/`InvocationExpression` named `nameof` to +suppress the inner-reference report. If it should be rejected, make that uniform +regardless of resolution state (tie to a simple-name check), and keep the test. Either +way, align the comment with the actual, deterministic behaviour. + +**Resolution** + +Deferred 2026-06-20: the `nameof(forbiddenType)` over-flagging is a documented +deliberate fail-safe; whether `nameof` (which reaches no API) should be allowed vs +uniformly rejected is a design-owner decision. Recorded; no behaviour change this pass. + +### ScriptAnalysis-006 — No caching or throttling of Roslyn compilations; full-framework compilation rebuilt per call + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:73-97`; `RoslynScriptCompiler.cs:50-71` | + +**Description** + +`FindViolations` parses the script and builds a fresh `CSharpCompilation` against the +**entire framework reference set** (`AnalysisReferences`, potentially hundreds of +`MetadataReference`s) on every call (`:87-95`). `RoslynScriptCompiler.Compile` +similarly creates a new `CSharpScript` and calls `.Compile()` per call. Roslyn +compilation is the expensive part of these calls. The metadata references are cached +(built once in `AnalysisReferences`), which is the main cost, so this is Low rather +than Medium — but for the Central UI editor path (which can call the validator on every +keystroke-debounced analysis) and for batch template validation (many scripts per +deploy), the absence of any per-source memoisation or a bounded compilation cache means +repeated work. There is no back-pressure or concurrency cap either; the Inbound API +review (InboundAPI-009) already noted uncapped recompilation downstream. + +**Recommendation** + +Consider a small bounded cache keyed on a hash of `(code, globalsType)` for `Compile` +results and on `code` for `FindViolations` verdicts (the verdict is a pure function of +the source and the static policy). At minimum, document that callers on hot paths +(editor live-analysis) should debounce and cache, and confirm the Central UI debounces +before calling. + +**Resolution** + +Deferred 2026-06-20: compilation caching/throttling (the full-framework +`CSharpCompilation` is rebuilt per `FindViolations` call) is a performance enhancement, +not a correctness defect. Recorded for a follow-up. + +### ScriptAnalysis-007 — `FindViolations` `extraReferences` XML-doc contradicts the Central UI consumer; `AnalysisReferences` undocumented in REQ-SA-2 + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:57-64`; `ScriptTrustPolicy.cs:125-144`; `docs/requirements/Component-ScriptAnalysis.md:74-80` | + +**Description** + +Two doc/code mismatches, neither a vulnerability: + +1. The `extraReferences` XML-doc on `FindViolations` (`:57-64`) says "Forbidden + references are NOT added here". But the Central UI run gate calls + `FindViolations(tree.ToString(), compilation.References)` + (`CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:1319-1320`), forwarding the full + BCL surface as `extraReferences`. This is in fact *safe* (the deny-list is by + namespace/type, so adding references can only improve resolution, never produce a + false allow — and the policy comment at `ScriptTrustPolicy.cs:138-142` says exactly + that), but the `FindViolations` doc as written implies the caller must not pass + forbidden assemblies, which is both unnecessary and contradicted in practice. + +2. REQ-SA-2 in the design doc describes Pass 1 as using "the full trusted-platform + reference set from `ScriptTrustPolicy.DefaultReferences`". The code uses a separate + `ScriptTrustPolicy.AnalysisReferences` (the full framework), and `DefaultReferences` + is explicitly the *minimal* set used by `RoslynScriptCompiler`. The doc conflates the + two and never mentions `AnalysisReferences` or the TPA mechanism — material because + that mechanism is the heart of the security verdict (and of ScriptAnalysis-001). + +**Recommendation** + +Rewrite the `extraReferences` doc to state that extra references only widen symbol +resolution and never whitelist a forbidden API. Update REQ-SA-2 to describe +`AnalysisReferences` (full-framework, TPA-sourced) versus `DefaultReferences` (minimal, +runtime-fidelity), and note the TPA-fallback behaviour. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): corrected the `FindViolations` +`extraReferences` XML doc (extra references only widen resolution, never whitelist) and +documented `AnalysisReferences` vs `DefaultReferences` in REQ-SA-2. + +### ScriptAnalysis-008 — Static trust gate is defence-in-depth, not a sandbox; correctly documented but worth recording as a standing risk + +| | | +|--|--| +| Severity | Low | +| Category | Security | +| Status | Won't Fix | +| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:34-40`; `docs/requirements/Component-ScriptAnalysis.md:151` | + +**Description** + +The component is explicit (both in code comments and REQ-SA-5) that it is a +compile-time deny-list, not a runtime sandbox: scripts execute in-process, and a +sufficiently determined script could reconstruct a forbidden API at runtime via paths +the static walker cannot see (e.g. building type names from runtime strings, or any +gap in the deny-list). The reflection-gateway hardening narrows this, and the +`dynamic`/`Activator` bans close the obvious late-binding routes, but the residual risk +is real and inherent to in-process execution of partially-trusted C#. This is recorded +(not as a defect of this module, which behaves as designed) so the standing risk is +visible in the audit trail and is revisited if/when the deferred runtime boundary +(restricted `AssemblyLoadContext` / curated reference set / out-of-process sandbox) is +scoped. + +**Recommendation** + +Keep the caveat prominent. Track the runtime-boundary work as a roadmap item; in the +meantime ensure the curated *execution* reference sets (SiteRuntime `ScriptAssemblies`, +`RoslynScriptCompiler` `DefaultReferences`) stay minimal so that even a deny-list miss +cannot bind a forbidden type at execution time (this is the real second layer of +defence for the compile-running consumers). + +**Resolution** + +Won't Fix 2026-06-20: the static gate is intentionally defence-in-depth, not a runtime +sandbox — this is the explicitly-documented design posture, not a defect. Recorded as a +standing risk for visibility. diff --git a/code-reviews/Security/findings.md b/code-reviews/Security/findings.md index 4e84ef39..65998a0a 100644 --- a/code-reviews/Security/findings.md +++ b/code-reviews/Security/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.Security` | | Design doc | `docs/requirements/Component-Security.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 0 (1 deferred — Security-008) | +| Commit reviewed | `4307c381` | +| Open findings | 0 (2 deferred — Security-008, Security-022) | ## Summary @@ -78,6 +78,50 @@ values silently surface only on first login (Security-020); and the `RequireHttpsCookie=false` dev opt-out emits no warning, so an HTTP production deployment silently transmits the JWT bearer credential in cleartext (Security-021). +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +Major architectural change since `1eb6e97`: the project was renamed (ScadaLink → +ZB.MOM.WW.ScadaBridge) and the bespoke `LdapAuthService`/`LdapTransport`/`LdapAuthResult` +were **cut over to the external `ZB.MOM.WW.Auth.Ldap` library** (commit `ac34dac4`); the +prior `SiteScopeAuthorizationHandler` was deleted (Security-017 resolution), and the role +model was canonicalised + collapsed (`Admin→Administrator`, `Design→Designer`, +`Deployment→Deployer`, `Audit→Administrator`, `AuditReadOnly→Viewer`; commit `b104760b`) +with M7 `Operator`/`Verifier` added (`a0ce8b6c`). The interactive cookie session now signs +in with **bare claims** via the shared `SessionClaimBuilder` and is policed by +`CookieSessionValidator` (idle-timeout + LDAP-free DB role-refresh, fail-closed, well +tested). All prior fixes that survived the cutover remain in place (key-length guard, +issuer/audience binding, `RoleMapper` union semantics, `Roles` constants, cookie hardening +via `ZbCookieDefaults`, the `RequireHttpsCookie` warning). Security-008 (N+1 scope-rule +query) is still correctly **Deferred**. This pass surfaced **4 new findings** +(Security-022..025): three Medium and one Low. The most consequential is **Security-022** — +the documented "cookie + JWT hybrid" no longer holds at HEAD: `/auth/token` still *mints* a +bearer JWT but **no code anywhere validates it** (CLI uses HTTP Basic Auth, Inbound API uses +the `sbk_` ApiKeys verifier, the cookie path uses bare claims), so `JwtTokenService`'s entire +`ValidateToken`/`RefreshToken`/`RecordActivity`/`ShouldRefresh`/`IsIdleTimedOut` surface is +dead code minting an orphaned credential. **Security-023** — `LdapGroupMapping.Role` accepts +arbitrary free strings via CLI/API with no canonical-set validation, and the Central UI +`RequireClaim` policies are case-sensitive while the ManagementActor authz check is +case-insensitive, so a mis-cased/typo'd mapping silently grants inconsistent access across +the two surfaces with no operator feedback. **Security-024** — the M7 `RequireOperator`/ +`RequireVerifier` SoD policies have no functional `AuthorizeAsync` grant/deny test (only +constant-name assertions), unlike every other policy. **Security-025** (Low) — a unit test +makes a live network connect to `nonexistent.invalid:9999`. + +_Re-review (2026-06-20, `4307c381`):_ + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | `RoleMapper` union semantics (Security-016 fix) hold and read correctly. `LdapGroupMapping.Role` is an unvalidated free string + case-sensitive UI policy vs case-insensitive ManagementActor authz (Security-023). | +| 2 | Akka.NET conventions | ☑ | No actors. `AddSecurityActors` is still a placeholder. No issues. | +| 3 | Concurrency & thread safety | ☑ | Services stateless/DI-scoped; `CookieSessionValidator` reads injected `TimeProvider`. LDAP now in the external library. No shared mutable state. No issues. | +| 4 | Error handling & resilience | ☑ | `CookieSessionValidator` fails closed on idle (missing anchor = timed out) and swallows refresh faults keeping current roles (mirrors "LDAP failure: active sessions continue") — correct and well tested. `LdapAuthFailureMessages` maps the library's structured failure enum to stable user text. No issues. | +| 5 | Security | ☑ | Cookie hardening (HttpOnly/SameSite=Strict/Secure-when-required/sliding) comes from `ZbCookieDefaults` — verified correct. JWT mint binds issuer/audience, pins `MapInbound/OutboundClaims=false`, fails fast on short key. BUT the minted bearer JWT is validated by no consumer (Security-022). DisableLogin auto-login is gated at registration + loud warning + doc'd (acceptable). | +| 6 | Performance & resource management | ☑ | Security-008 N+1 remains correctly Deferred (gated on `ISecurityRepository`). `CookieSessionValidator` deliberately avoids re-minting on the no-refresh path. No new perf issues. | +| 7 | Design-document adherence | ☑ | The cookie+JWT-hybrid design is now cookie-with-bare-claims; the JWT half is orphaned/unvalidated (Security-022) — the M2.19 doc note acknowledges bare-claim cookies but the doc still presents JWT-in-cookie as the model and does not flag that `/auth/token`'s JWT is unconsumed. Operator/Verifier policies match the design. | +| 8 | Code organization & conventions | ☑ | `Roles`/`SecurityOptions` owned by the component; repository interface in Commons; additive `IGroupRoleMapper` adapter. Dead JWT methods (Security-022) are the main org smell. | +| 9 | Testing coverage | ☑ | `CookieSessionValidator`, `SessionClaimBuilder`, `RoleMapper` union, cookie wiring, `SecurityOptionsValidator`, DisableLogin are well covered. Gaps: no functional authz test for `RequireOperator`/`RequireVerifier` (Security-024); a unit test does live network I/O (Security-025); no test for a non-canonical/mis-cased mapping role (Security-023). | +| 10 | Documentation & comments | ☑ | XML docs are thorough and current. `SecurityOptions` and `JwtTokenService` docs still describe an active JWT-refresh lifecycle that no longer has a caller (Security-022). | + ## Checklist coverage | # | Category | Examined | Notes | @@ -1008,3 +1052,178 @@ SecurityOptions:RequireHttpsCookie=true in production.")`. Optionally, also fail startup when `RequireHttpsCookie=false` AND `ASPNETCORE_ENVIRONMENT=Production`. Add a regression test that asserts the warning is emitted when the flag is disabled and not when it is enabled. + +### Security-022 — `/auth/token` mints a bearer JWT that no consumer validates; the entire `JwtTokenService` validate/refresh surface is dead code + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.Security/JwtTokenService.cs:149`, `:194`, `:209`, `:239`, `:272`; `src/ZB.MOM.WW.ScadaBridge.CentralUI/Auth/AuthEndpoints.cs:99-148` | + +**Description** + +The design (Component-Security.md "Cookie + JWT Hybrid", CLAUDE.md "cookie+JWT hybrid … +HMAC-SHA256, 15-minute expiry w/ sliding refresh") presents a cookie-embedded JWT as the +session model. At HEAD this is no longer how the interactive session works (the M2.19 note +acknowledges the cookie now carries **bare claims** policed by `CookieSessionValidator`), +and — more importantly — the JWT half is now an **orphaned, unconsumed credential**. A grep +across `src/` shows `JwtTokenService.ValidateToken`, `RefreshToken`, `RecordActivity`, and +`ShouldRefresh` have **zero production callers**; only `GenerateToken` is called, solely by +`POST /auth/token` (`AuthEndpoints.cs:134`). That endpoint returns an `access_token` to the +caller, but **no endpoint in the codebase accepts or validates that JWT** as a credential: +the CLI authenticates to the Management API with HTTP Basic Auth +(`CLI/ManagementHttpClient.cs:36`), the Inbound API authenticates with the external +`ZB.MOM.WW.Auth.ApiKeys` `sbk__` verifier, and the cookie pipeline uses +`SessionClaimBuilder` bare claims (never `ValidateToken`). The result: `/auth/token` issues +HMAC-signed bearer tokens that grant access to nothing, and ~120 lines of carefully-hardened +issuer/audience/clock-skew/idle/refresh logic (the subject of resolved findings +Security-006/007/014) are dead code. The hazard is twofold: (a) the design doc materially +overstates the live session mechanism, misleading any future reader/auditor; and (b) an +orphaned token mint is a latent footgun — a future endpoint added under the assumption that +`/auth/token` produces a *usable* credential would wire a second, parallel auth path with +no test coverage proving it actually validates. + +**Recommendation** + +Decide the intent explicitly. Either (a) **wire the bearer path**: add +`AddAuthentication().AddJwtBearer(...)` (or a custom handler delegating to +`JwtTokenService.ValidateToken`) on whatever surface is meant to accept the token, and add +an end-to-end test that a `/auth/token`-issued JWT authorizes a protected request — then the +hybrid model the doc describes is real; or (b) **remove the dead code**: delete +`/auth/token` and `JwtTokenService.ValidateToken`/`RefreshToken`/`RecordActivity`/ +`ShouldRefresh`/`IsIdleTimedOut` (keep only what the bare-claim cookie path needs, which is +none of `JwtTokenService`), and rewrite Component-Security.md's "Cookie + JWT Hybrid" / +"Token Lifecycle" sections to describe the actual bare-claim cookie + `CookieSessionValidator` +mechanism as the primary model rather than a footnote. Whichever is chosen, the design doc +and the code must agree on whether a usable JWT exists. + +**Resolution** + +Deferred 2026-06-20: this needs a product decision — either wire a JWT-bearer auth handler so the documented cookie+JWT hybrid is real, or delete `/auth/token` + the orphaned `JwtTokenService` validate/refresh surface and rewrite the design doc to describe the actual bare-claim cookie model. Recorded for that decision. + +### Security-023 — `LdapGroupMapping.Role` is an unvalidated free string; case-sensitive UI policy vs case-insensitive server authz grants inconsistent access + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.Security/AuthorizationPolicies.cs:144-157`; `src/ZB.MOM.WW.ScadaBridge.Security/RoleMapper.cs:40-42`; `src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Security/LdapGroupMapping.cs:10`; `src/ZB.MOM.WW.ScadaBridge.ManagementService/ManagementActor.cs:106`, `:1893` | + +**Description** + +The role string an operator maps an LDAP group to (`LdapGroupMapping.Role`) is a free +`string` with no validation against the canonical `Roles.*` set anywhere on the write path: +`ManagementActor` (`:1893`) does `new LdapGroupMapping(cmd.LdapGroupName, cmd.Role)` verbatim, +and the CLI `security role-mapping create --role ` +(`CLI/Commands/SecurityCommands.cs:157`) accepts any string. The Central UI dropdown +(`LdapMappingForm.razor`) constrains the *UI* path to the canonical constants, but the CLI/API +path does not. Two enforcement surfaces then disagree on casing: + +- **Central UI authorization** uses `policy.RequireClaim(JwtTokenService.RoleClaimType, Roles.Deployer)` + (`AuthorizationPolicies.cs:150`). ASP.NET Core's `RequireClaim` compares claim *values* + with ordinal (case-**sensitive**) equality. +- **ManagementActor authorization** uses `user.Roles.Contains(requiredRole, StringComparer.OrdinalIgnoreCase)` + (`ManagementActor.cs:106`) — case-**insensitive**. +- **`RoleMapper`** matches the Deployer scope branch with `StringComparison.OrdinalIgnoreCase` + (`RoleMapper.cs:42`) but stamps the **DB row's verbatim casing** into the role claim. + +So a mapping created via CLI/API as `--role deployer` (lowercase) or a typo like `Deploy` +produces a principal whose role claim is `"deployer"`: the **ManagementActor/CLI surface +authorizes it** (case-insensitive), the **Central UI `RequireDeployment` policy denies it** +(case-sensitive), and for the Deployer case `RoleMapper` still resolves site scope (matching +case-insensitively) — a user who can deploy via CLI but is locked out of the equivalent UI +pages, or vice versa, with no error explaining why. A pure typo (`"Deploer"`) silently grants +a role claim that *no* policy and no ManagementActor check matches — the user appears to hold +a role yet is denied everywhere, with no validation feedback at mapping-creation time. + +**Recommendation** + +Validate `cmd.Role` against the canonical `Roles.All` set (case-insensitively, normalising to +the canonical casing) when creating/updating a mapping — in `ManagementActor` (the single +server-side write path) and reject non-canonical values with a clear error; optionally also +guard in the CLI for a fast local failure. Separately, make the two authorization surfaces +agree on casing: either normalise the role claim to canonical casing in `SessionClaimBuilder`/ +`RoleMapper` before it is stamped, or make the ManagementActor check case-sensitive to match +`RequireClaim` (the safer direction once values are validated). Add a regression test for a +mis-cased mapping role. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added membership validation rejecting any `cmd.Role` not in `Roles.All` (canonical set, case-insensitive) at LDAP-group-mapping create AND update, before any DB write — closing the unvalidated-free-string gap. The casing/comparison-asymmetry half is a larger change and remains a recorded follow-up (membership check is safe: a non-canonical role never worked). No existing test regressed. + +### Security-024 — `RequireOperator` / `RequireVerifier` policies have no functional authorization test + +| | | +|--|--| +| Severity | Medium | +| Category | Testing coverage | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.Security/AuthorizationPolicies.cs:153-157`; `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/SecurityTests.cs:1045-1186` (AuthorizationPolicyTests); `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/RolesTests.cs:44-45` | + +**Description** + +The M7 two-person Secured Write feature added the `RequireOperator` and `RequireVerifier` +policies (`AuthorizationPolicies.cs:153-157`), each a single-role `RequireClaim`. These gate a +SCADA control-surface write workflow whose entire safety argument is separation of duties, so +correct grant/deny behaviour is security-critical. But `AuthorizationPolicyTests` — which +exercises every *other* policy with real `IAuthorizationService.AuthorizeAsync` evaluations +(Admin/Design/Deployment/OperationalAudit/AuditExport, including the load-bearing +"Viewer reads but cannot export" SoD case) — has **no** Operator/Verifier test. The only +coverage is `RolesTests.cs:44-45`, which asserts the *constant string values* +(`"RequireOperator"`, `"RequireVerifier"`) — not that an `Operator` principal satisfies +`RequireOperator`, that a `Verifier` does not, or that the two policies are mutually distinct. +A regression that, say, mapped `RequireOperator` to `Roles.Verifier` (or to +`OperationalAuditRoles`) would compile and pass the existing tests while silently collapsing +the separation of duties. + +**Recommendation** + +Add `AuthorizeAsync`-based `[Theory]` cases to `AuthorizationPolicyTests` mirroring the +existing policy tests: `RequireOperator` succeeds for `Roles.Operator` and fails for +`Roles.Verifier`/`Administrator`/empty; `RequireVerifier` succeeds for `Roles.Verifier` and +fails for `Roles.Operator`; and a combined case asserting an Operator-only principal cannot +satisfy `RequireVerifier` (the SoD invariant at the policy layer). + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added functional `AuthorizeAsync` grant/deny tests for `RequireOperator`/`RequireVerifier` (and a combined SoD-distinctness test), guarding the separation-of-duties policy wiring against regression. + +### Security-025 — Unit test performs a live network connection to `nonexistent.invalid:9999` + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Resolved | +| Location | `tests/ZB.MOM.WW.ScadaBridge.Security.Tests/SecurityTests.cs:71-89` (`AuthenticateAsync_ConnectionFailure_FailsClosed_NeverThrows`) | + +**Description** + +`LdapAuthServiceTests.AuthenticateAsync_ConnectionFailure_FailsClosed_NeverThrows` constructs +the external library's `LdapAuthService` pointed at `Server = "nonexistent.invalid"`, +`Port = 9999` with a 2-second `ConnectionTimeoutMs`, then asserts the result fails closed. +Although the test relies on the connection *failing*, it still performs a real DNS resolution +and TCP connect attempt from the unit-test process. In an offline or network-sandboxed CI +environment the resolution/connect can behave differently (immediate failure vs. the full +2-second timeout vs. a captive-portal redirect), making the test both **slow** (up to ~2s of +wall-clock dead time) and **environment-dependent**. Unit tests in this suite are otherwise +network-free (the sibling cases are explicitly named `_NoNetwork`). The behaviour under test +(fail-closed mapping of an unreachable directory) is really the external library's +responsibility and is, per the file's own header comment, already covered by the library's own +test suite through its internal connection seam. + +**Recommendation** + +Drop this case from the ScadaBridge unit suite (the library owns the fail-closed contract via +its internal `ILdapConnection` seam, as the region comment notes), or move it to an explicitly +opt-in integration-test category so it does not run in the default network-free unit pass. If +kept as a smoke test, point it at a guaranteed-unroutable loopback address (e.g. +`127.0.0.1` on a closed port) to bound the failure deterministically rather than relying on +DNS resolution of `.invalid`. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): the fail-closed connection test now targets `127.0.0.1` on a closed port instead of `nonexistent.invalid:9999` — deterministic loopback connection-refused, no external DNS/timeout dependency. Same assertion preserved. diff --git a/code-reviews/SiteCallAudit/findings.md b/code-reviews/SiteCallAudit/findings.md index f656afb3..83059753 100644 --- a/code-reviews/SiteCallAudit/findings.md +++ b/code-reviews/SiteCallAudit/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` | | Design doc | `docs/requirements/Component-SiteCallAudit.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 2 | +| Commit reviewed | `4307c381` | +| Open findings | 0 | ## Summary @@ -42,6 +42,23 @@ tests using a shared `MsSqlMigrationFixture`. | 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). | | 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). | +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +Since the `1eb6e97` baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A — `OnReconciliationTickAsync` → `ReconcileSiteAsync`, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B — `OnPurgeTickAsync` → `PurgeTerminalAsync`, continue-on-error). All of the new code is clean, idiomatic, and well-tested (dedicated `SiteCallAuditReconciliationTests`, `SiteCallAuditPurgeTests`, `SiteCallRelayTests`, KPI-sample and options suites); the M6 `SiteCallAuditKpiSampleSource` correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one. The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present — currently masked because `Program.cs` co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores `MoreAvailable` so a single-timestamp batch saturation re-pulls without progress). No correctness, concurrency, security, or resilience defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps every edge case safe-by-construction (wasted work, never corruption). + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores `MoreAvailable`; a single-`UpdatedAtUtc` batch saturation re-pulls without progress (Finding 009 — wasted work only, idempotent upsert prevents corruption). Finding 003 (`IngestedAtUtc`) resolved — now stamped in `OnUpsertAsync` and `ReconcileSiteAsync`. | +| 2 | Akka.NET conventions | Yes | `Sender` captured before first await on every async handler; `PipeTo` used for all read/relay replies; self-ticks scheduled via `ScheduleTellRepeatedlyCancelable` with `Self` sender, cancelled in `PostStop`. No issues found. | +| 3 | Concurrency & thread safety | Yes | `_reconciliationCursors`, `_centralCommunication`, timers all mutated only on the actor thread. Per-tick/per-message DI scope (`CreateAsyncScope` for the async tick paths, `CreateScope` for sync read paths) disposed in `finally` / `await using`. No issues found. | +| 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies `Accepted=false`. Deliberate coarse-retry divergence from `SiteAuditReconciliationActor` (no per-row abandon) is documented and safe given idempotent upsert. No issues found. | +| 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond `SourceSite` (a site id). No issues found. | +| 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; `MaxPageSize=200` clamp; async DbContext disposal off the dispatcher. `MoreAvailable`-ignoring re-pull is bounded wasted work (Finding 009). | +| 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). | +| 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository — a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). `KpiSampleSource` registered via `TryAddEnumerable`; options owned by the component. | +| 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (`MoreAvailable: true`) or the single-timestamp no-progress edge — every reconciliation test uses `MoreAvailable: false` (relates to Finding 009). | +| 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. Reconciliation `ReconcileSiteAsync` XML claims parity with `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling reads it for stalled detection; this actor ignores it) — Finding 009 recommendation includes a doc correction. | + ## Findings ### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced @@ -323,3 +340,221 @@ Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck, stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly` page-size-1 query; (c) asserts each page returns exactly the stuck row, with no overlap and all 3 stuck rows visited. + +### SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges + +| | | +|--|--| +| Severity | Medium | +| Category | Code organization & conventions | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227` | + +**Description** + +`StartPurgeTimer` (lines 307-321) gates the daily terminal-row purge tick on +the **reconciliation** collaborators being present: + +```csharp +private void StartPurgeTimer() +{ + if (_pullClient is null || _siteEnumerator is null) + { + return; + } + // ... schedule PurgeTick ... +} +``` + +But the purge pass (`OnPurgeTickAsync` → `PurgeWithRepositoryAsync` → +`ISiteCallAuditRepository.PurgeTerminalAsync`) needs only the repository — it +has no dependency on `IPullSiteCallsClient` / `ISiteEnumerator`. The two +collaborators are resolved by the production constructor (lines 202-227) via +`serviceProvider.GetService()` / +`GetService()` — both registered by +`AddAuditLogCentralReconciliationClient` +(`AuditLog/ServiceCollectionExtensions.cs:473,523`, registered as `ISiteEnumerator` +and `IPullSiteCallsClient`). `GetService` (not `GetRequiredService`) returns +`null` if that helper was never called, so a host that registers +`AddSiteCallAudit()` **without** also calling +`AddAuditLogCentralReconciliationClient(...)` constructs the actor with both +collaborators null. In `PreStart` both `StartReconciliationTimer` and +`StartPurgeTimer` early-return, so the actor runs forever with **no purge timer +at all** → unbounded growth of the central `SiteCalls` table, with no log line +to say the purge was skipped. + +This is **currently latent, not live**: `Program.cs` co-registers +`AddAuditLogCentralReconciliationClient(builder.Configuration)` (line 107) +immediately before `AddSiteCallAudit()` (line 113), so on the real central node +both collaborators resolve and both timers run today. The risk is a future host +(or a refactor that splits the reconciliation client out of the central +composition root, or a test/embedded host that wants only ingest + purge) +silently losing the purge with no diagnostic. The gate exists for a legitimate +reason — keeping the repo-only test ctor free of *both* background timers so the +MSSQL read/upsert tests see no scheduled side effects — but it conflates "no +reconciliation route" with "no purge", and the actor's own XML +(lines 36-38) documents the coupling as deliberate rather than flagging it as a +hazard. + +**Recommendation** + +Decouple `StartPurgeTimer` from the reconciliation collaborators — purge needs +only a repository, which both the production and the repo-only test ctors +always have. Two viable shapes: + +- Preferred: gate the purge timer on its own real precondition (a repository is + always available, so arm it unconditionally in the production + reconciliation + ctors; keep it off only in the repo-only test ctor via an explicit + "background timers off" flag rather than by proxy of the reconciliation + collaborators). This keeps the MSSQL test isolation while removing the + accidental coupling. +- At minimum: log a `Warning` in `StartPurgeTimer` (and `StartReconciliationTimer`) + when the timer is **not** armed on the production path, so a misconfigured host + surfaces "SiteCallAudit purge timer not started — `PurgeTerminalAsync` will + never run" instead of growing the table silently. + +Severity is a judgment call: Medium because the consequence (unbounded central +table growth) is real and silent, but it is latent today (the only production +composition root co-registers the reconciliation client), so an argument for Low +is defensible. Flagged Medium to weight the silent-failure + data-growth aspect. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): decoupled the daily purge timer from the reconciliation collaborators (a new `_backgroundTimersEnabled` flag) so a central node that omits the reconciliation client still purges — no more silent unbounded `SiteCalls` growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client). + +### SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Resolved | +| Location | `docs/requirements/Component-SiteCallAudit.md:149-154`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62` | + +**Description** + +The design doc (`Component-SiteCallAudit.md` lines 149-154, the KPI History +interaction) states: + +> "… the resulting `buffered` / `parked` / `failedLastInterval` / +> `deliveredLastInterval` / `stuck` / `oldestPendingAgeSeconds` series render as +> trends on the Site Calls page via `KpiTrendChart`." + +That lists **six** series as charted. In the code only **three** are charted: + +- The public charted catalog `KpiMetrics.SiteCallAudit` + (`KpiMetrics.cs:52-62`) exposes exactly `Buffered`, `Parked`, and + `FailedLastInterval` — and its own XML says "Charted Site Call Audit (#22) + metrics … Rendered by the Central UI Site Calls report trend panel." +- `SiteCallAuditKpiSampleSource` (lines 42-47) keys those three off the public + Commons catalog (`KpiMetrics.SiteCallAudit.*`) and keeps the other three — + `deliveredLastInterval`, `stuck`, `oldestPendingAgeSeconds` — as **private** + string literals, with the comment "Charted metrics share the public Commons + catalog … the uncharted internal metrics stay private here (#178)." +- The UI confirms it: `SiteCalls/SiteCallsReport.razor.cs` calls + `LoadSeriesAsync` for exactly `KpiMetrics.SiteCallAudit.Buffered`, `.Parked`, + and `.FailedLastInterval` — three series, no more. + +So all six metrics are *sampled* into the `KpiSample` history store, but only +three are *charted*. The doc reads as if all six are rendered, which is a small +design-doc drift (over-claiming the UI surface). The code is internally +consistent and correctly commented; the doc is the stale party. + +**Recommendation** + +One-line doc edit on `Component-SiteCallAudit.md` lines 149-154: state that the +three series `buffered` / `parked` / `failedLastInterval` render as trends on +the Site Calls page via `KpiTrendChart`, and that `deliveredLastInterval` / +`stuck` / `oldestPendingAgeSeconds` are **sampled into the KPI-history store but +not charted** (available for future trend panels / ad-hoc query). No code change +— the code is already the intended state. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the `KpiMetrics` catalog. Doc-only. + +### SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores `MoreAvailable`; a single-timestamp batch saturation re-pulls without progress + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535` | + +**Description** + +`ReconcileSiteAsync` (lines 505-535) pulls rows since the per-site cursor, +upserts each, and advances the cursor to the maximum `UpdatedAtUtc` observed: + +```csharp +var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue; +var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...); + +var maxUpdated = since; +foreach (var row in response.SiteCalls) +{ + await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc }); + if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc; +} +_reconciliationCursors[site.SiteId] = maxUpdated; +``` + +Two interacting properties: + +1. The pull is **inclusive** — `PullAsync(site, since, …)` asks for rows with + `UpdatedAtUtc >= since` (documented in the method's XML, lines 497-503), and + the cursor advances to the **max** `UpdatedAtUtc` seen, which is itself one + of the rows just pulled. So the boundary row is re-pulled every tick and + deduped by the idempotent monotonic upsert — intended, harmless. +2. `response.MoreAvailable` is **never read** at all. The `PullSiteCallsResponse` + carries `MoreAvailable` (`PullSiteCallsResponse.cs:14-17`: "True when the + site saturated the requested batch size — the caller should advance the + cursor and pull again"), but `ReconcileSiteAsync` ignores it and relies on + the natural tick cadence to drain the backlog over successive ticks. + +The edge case: if a site has **more rows than the batch size all sharing one +exact `UpdatedAtUtc`** (e.g. a burst written in the same tick / same clock +value), the saturated batch returns rows whose max `UpdatedAtUtc` equals +`since`. `maxUpdated` therefore stays at `since`, the cursor does **not** +advance, and because the pull is inclusive the next tick re-pulls the identical +window — and again, and again — making no forward progress on that site's +backlog. Because the upsert is idempotent and the `SiteCalls` table is an +eventually-consistent mirror (not the source of truth), this is **wasted work, +never corruption** — but it is an unbounded re-pull loop on a pathological +input, and any rows in that backlog beyond the batch ceiling never get +reconciled. + +The sibling `SiteAuditReconciliationActor` shares the inclusive-cursor / +max-timestamp shape, so the same single-timestamp-saturation no-progress edge +applies to it — but that sibling **does read `MoreAvailable`** (it feeds its +stalled-detection state machine, `SiteAuditReconciliationActor.cs:324-325`, +publishing `SiteAuditTelemetryStalledChanged` so a non-draining site surfaces a +health signal). `ReconcileSiteAsync`'s XML (lines 530-534) claims the +no-immediate-re-pull behaviour "match[es] `SiteAuditReconciliationActor`", which +is only half true: it matches the cursor cadence but diverges by dropping +`MoreAvailable` entirely, so this actor has neither a continuation pull nor a +stalled signal — a saturated site lags silently with no observability. + +**Recommendation** + +- Consume `MoreAvailable`: either continue draining within the same tick while + `MoreAvailable` is true (bounded by a max-iterations guard), or — matching the + sibling — surface a stalled/non-draining signal when a batch comes back + saturated so a stuck site is observable rather than silent. +- Defend the single-timestamp no-progress edge with a tiebreaker beyond the + timestamp (e.g. advance on `(UpdatedAtUtc, TrackedOperationId)` as a composite + keyset cursor, and ask the pull for rows strictly after that composite), so a + burst sharing one `UpdatedAtUtc` cannot pin the cursor. +- Correct the `ReconcileSiteAsync` XML (lines 530-534): it claims parity with + `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling + reads it for stalled detection; this actor ignores it). + +Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data +corruption, and the saturated-single-timestamp input is pathological; the cost +is wasted re-pulls and an un-drained backlog tail on that one input, plus the +missing observability. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): `ReconcileSiteAsync` now consumes `response.MoreAvailable` via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added. diff --git a/code-reviews/SiteEventLogging/findings.md b/code-reviews/SiteEventLogging/findings.md index 972a1efd..b663b7cb 100644 --- a/code-reviews/SiteEventLogging/findings.md +++ b/code-reviews/SiteEventLogging/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging` | | Design doc | `docs/requirements/Component-SiteEventLogging.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 2 | +| Commit reviewed | `4307c381` | +| Open findings | 0 | ## Summary @@ -101,6 +101,36 @@ _Re-review (2026-05-28, `1eb6e97`):_ | 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. | | 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). | +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +Re-reviewed the module at commit `4307c381`. All twenty-three prior findings are still +recorded; twenty-two of their resolutions hold up under inspection — **but one does not**. +SiteEventLogging-016 (**High**, the `From`/`To` UTC-normalisation defect) is marked +`Resolved` and its Resolution text claims `EventLogQueryService.ExecuteQuery` now calls +`.ToUniversalTime()` before `ToString("o")` and that a regression test +`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added — **neither is true at this +commit**. `EventLogQueryService.cs:77,83` still stringifies `request.From`/`request.To` +verbatim with `.ToString("o")` and no UTC normalisation, and a repo-wide search finds no +test by that name. The claimed -016 fix was never committed; the High-severity defect is +live. This is re-opened as SiteEventLogging-024 (cross-referencing -016) so the audit trail +shows the resolution was asserted but never landed. Four new findings were recorded: -024 +(the never-landed -016 fix, High), -025 (synchronous severity validation faults +fire-and-forget callers, Medium), -026 (purge active-node gate diverges from the query +singleton's placement, Medium), and -027 (the time-range test masks -024, Low). + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | **-024**: `From`/`To` filters still compare non-UTC-normalised ISO 8601 strings against UTC-stored timestamps — the SiteEventLogging-016 fix was never committed (code at `EventLogQueryService.cs:77,83` is unchanged; the claimed regression test does not exist). | +| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` remains a simple `Receive`/`Tell` bridge; no new findings. | +| 3 | Concurrency & thread safety | ☑ | Lock-guarded `WithConnection` pattern is correct; no new findings. | +| 4 | Error handling & resilience | ☑ | **-025**: `LogEventAsync` validates severity/args by throwing synchronously, bypassing the documented "faults the returned Task" contract for fire-and-forget callers. | +| 5 | Security | ☑ | Queries fully parameterised; `PageSize` now clamped (-017). No new findings. | +| 6 | Performance & resource management | ☑ | Bounded write queue (-015) and `PageSize` clamp (-017) in place; no new findings. | +| 7 | Design-document adherence | ☑ | **-026**: the purge active-node gate (`SelfIsPrimary` = cluster leader) can diverge from the query singleton's placement (oldest member of role), leaving the queried node's DB unpurged / over-cap. | +| 8 | Code organization & conventions | ☑ | Concrete-recorder DI wiring correct; no new findings. | +| 9 | Testing coverage | ☑ | **-027**: `Query_FiltersByTimeRange` asserts count only with UTC-only inputs, masking -024; no non-UTC-offset regression test exists. | +| 10 | Documentation & comments | ☑ | No new documentation findings beyond those above. | + ## Findings ### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space @@ -1144,3 +1174,265 @@ Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or cha `stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`. `CancellationTokenSource` is the canonical .NET pattern and also lets the test cooperate with xUnit's `Task.WhenAll` timeout. + +### SiteEventLogging-024 — `From`/`To` filters still compare non-normalised ISO 8601 strings against UTC-stored timestamps (the SiteEventLogging-016 fix was never committed) + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/EventLogQueryService.cs:77,83` | + +**Description** + +This is a re-open of SiteEventLogging-016. That finding was marked **Resolved** +(2026-05-28) with a Resolution stating that `EventLogQueryService.ExecuteQuery` now calls +`.ToUniversalTime()` on `request.From`/`request.To` before `ToString("o")`, that +`EventLogPurgeService.PurgeByRetention` was made defensive with an explicit +`.ToUniversalTime()`, and that a regression test +`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added. **None of that is present at +commit `4307c381`.** The code at `EventLogQueryService.cs` is byte-for-byte the original +defective form: + +``` +if (request.From.HasValue) +{ + whereClauses.Add("timestamp >= $from"); + parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o"))); // line 77 +} + +if (request.To.HasValue) +{ + whereClauses.Add("timestamp <= $to"); + parameters.Add(new SqliteParameter("$to", request.To.Value.ToString("o"))); // line 83 +} +``` + +There is no `.ToUniversalTime()` (or `.UtcDateTime`) on either bound, and a repo-wide +search for `Query_FiltersByTimeRange_HandlesNonUtcOffset` returns zero matches — the +claimed regression test does not exist. The -016 resolution was asserted but never landed +in source, so the original High-severity defect is **live**. + +The mechanism is exactly as -016 described. Event rows are persisted with +`timestamp = DateTimeOffset.UtcNow.ToString("o")` (`SiteEventLogger.cs:202`), which always +emits the round-trip ISO 8601 form ending in the literal offset `+00:00` +(e.g. `2026-06-20T12:34:56.7890123+00:00`). `request.From`/`request.To` are +`DateTimeOffset?`, and `ToString("o")` preserves whatever offset the caller passed. A +central client in, say, `UTC+05:00` that filters with `DateTimeOffset.Now` produces +`"2026-06-20T17:34:56.0000000+05:00"`, which under SQLite's default `BINARY` collation +sorts lexicographically *greater* than the equivalent UTC instant string +`"2026-06-20T12:34:56.0000000+00:00"`. The `timestamp >= $from` / `timestamp <= $to` +comparison is then a byte-by-byte string compare, so the query silently includes events +from the wrong hour or excludes events that genuinely fall in the window. The design +states "All timestamps are UTC throughout the system", but the central-`DateTimeOffset` +→ SQLite boundary does not enforce it. A central UI rendered in a non-UTC timezone is the +most likely trigger, and it corrupts precisely the "show me what happened around the +failover" time-range query that operators most often run. The retention purge +(`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")`) is UTC by construction and remains +safe; only the central query path is vulnerable. + +**Recommendation** + +Apply the one-line fix on each bound that -016 specified but that was never committed: + +``` +parameters.Add(new SqliteParameter("$from", request.From.Value.ToUniversalTime().ToString("o"))); +parameters.Add(new SqliteParameter("$to", request.To.Value.ToUniversalTime().ToString("o"))); +``` + +so the produced offset is always `+00:00` and the comparison is lexicographically sound +against the UTC-stored strings. Add the regression test that -016 claimed but never +delivered (see SiteEventLogging-027): construct a `From`/`To` carrying a non-zero offset +(e.g. `+05:00`) and assert the matching UTC-stored events are returned and out-of-range +ones excluded, asserting on returned row identities rather than just count. Optionally +store `timestamp` as Unix-epoch `INTEGER` to eliminate the lexicographic-comparison hazard +structurally. When fixing, also reconcile SiteEventLogging-016's Resolution text with +reality (its commit reference was `` and the fix never merged). + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): `EventLogQueryService` now calls `.ToUniversalTime()` on the `From`/`To` `DateTimeOffset` bounds before `.ToString("o")`, so the comparison strings are UTC and match the UTC-stored values for any non-UTC client offset. This is the fix that the closed -016 claimed but never committed. Regression test added (verified failing pre-fix). + +### SiteEventLogging-025 — `LogEventAsync` severity/argument validation throws synchronously, bypassing the "faults the returned Task" contract for fire-and-forget callers + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/SiteEventLogger.cs:187-199` | + +**Description** + +`LogEventAsync` is documented as a fire-and-forget enqueue whose returned `Task` +"completes once the event is durably persisted and faults if the write fails" +(`SiteEventLogger.cs:24-29`), and the disposed-drop paths were deliberately reworked under +SiteEventLogging-012 to *fault the Task* rather than throw. But the argument and severity +validation at the top of the method throws **synchronously** on the caller's thread before +any `Task` is returned: + +``` +ArgumentException.ThrowIfNullOrWhiteSpace(eventType); +ArgumentException.ThrowIfNullOrWhiteSpace(severity); +ArgumentException.ThrowIfNullOrWhiteSpace(source); +ArgumentException.ThrowIfNullOrWhiteSpace(message); + +if (!AllowedSeverities.Contains(severity)) // SiteEventLogging-020 closed set +{ + throw new ArgumentException( + $"Severity '{severity}' is not one of the allowed values: Info, Warning, Error.", + nameof(severity)); +} +``` + +Every recording caller invokes this fire-and-forget as `_ = _eventLogger.LogEventAsync(...)` +(no `await`, return discarded), so a thrown `ArgumentException` does **not** flow into the +discarded `Task` — it propagates straight up the calling actor's message-handling stack. +For the recorder hot-path callers (the computed `AlarmActor`, the `DataConnectionActor`, +the `ScriptActor`/`ScriptExecutionActor`) an unhandled synchronous throw would fault the +actor and trip its supervision strategy, when the documented and intended behaviour for a +best-effort audit write is that a bad event is dropped/faulted, never that it crashes the +subsystem doing the logging. + +This is **currently latent**: every production call site passes a hard-coded canonical +`severity` literal (`"Info"`/`"Warning"`/`"Error"`) and non-empty `eventType`/`source`/ +`message`, so the throw path is never reached today. The exposure is a future caller that +passes a dynamic, computed, or typo'd severity (e.g. `"error"`, `"ERR"`, a value derived +from an alarm payload) — the SiteEventLogging-020 closed-set check, added defensively, then +becomes a crash vector for the very actor it is meant to protect. The mismatch is that +-012 chose Task-faulting for one drop reason (disposal) while validation kept synchronous +throwing for another (bad input), so the method's failure contract is now inconsistent. + +**Recommendation** + +Make the failure mode uniform with the rest of the method: return a faulted `Task` for +validation failures instead of throwing synchronously — e.g. +`return Task.FromException(new ArgumentException(...))` for the severity check (and, +consistently, for the null/whitespace guards) — so a fire-and-forget caller is never +crashed by a bad audit-event payload and an `await`-ing caller still observes the failure. +Alternatively, accept a `severity` enum on `ISiteEventLogger.LogEventAsync` and convert to +the canonical string at the boundary, eliminating the runtime-validation throw entirely. +The exact fix shape is a judgment call — faulted-Task is the minimal-diff change that +matches the documented contract; the enum is the stronger design but a wider +interface/caller change. Whichever is chosen, update the XML doc to state the failure +semantics for invalid input. + +**Resolution** + +Deferred 2026-06-20: the fix shape (return a faulted Task vs. validate against a severity enum) is a design decision, and the defect is latent (every current caller passes a canonical severity literal). Recorded; no change this pass. + +### SiteEventLogging-026 — Purge active-node gate (cluster leader) can diverge from the query singleton's placement (oldest member of role) + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs:116-120`, `src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterNodeProvider.cs:29-39`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:889-895` | + +**Description** + +SiteEventLogging-019 closed the "purge runs on every node" finding by gating +`EventLogPurgeService` on an optional `SiteEventLogActiveNodeCheck`. The Host wires that +check to `IClusterNodeProvider.SelfIsPrimary` (`SiteServiceRegistration.cs:116-120`), which +returns true only when "this node is `Up` AND is the **cluster leader**" +(`AkkaClusterNodeProvider.cs:29-39`: `cluster.State.Leader.Equals(cluster.SelfAddress)`). + +The query path, however, is served by the `event-log-handler` **cluster singleton** +(`AkkaHostedService.cs:889-895`), which Akka.NET pins to the **oldest member of the role**, +not the cluster leader. These are two different election concepts: + +- The Akka **cluster leader** is the lowest-address member in the `Up`/`Leaving` state set + for the *whole cluster* (role-agnostic) — it is what `SelfIsPrimary` tests. +- A **cluster singleton** is hosted on the *oldest* member of its configured role + (`siteRole`) — it is what owns the event-log query DB. + +In a steady-state 2-node site cluster these usually coincide, so the gate works in +practice. But they are not guaranteed equal — most plausibly during membership churn / +after a failover, when the oldest-member computation and the leader election can settle on +different nodes for a window. When they diverge, the query singleton reads node X's +SQLite DB while the purge runs on node Y. The result is that the node whose DB is actually +queried (the populated, active one) is left **unpurged** — its 30-day retention sweep and +1 GB cap-purge do not run — so the queried database can drift over the documented 1 GB cap +and retain rows past the 30-day window, which is exactly the invariant the storage section +of Component-SiteEventLogging requires. The SiteEventLogging-004 re-triage note already +established the correct co-location principle: the *query* singleton and the +*deployment-manager* singleton share `siteRole` and so co-locate on the oldest member; the +purge gate keyed on a *different* signal (`SelfIsPrimary`/leader) silently breaks that +co-location for the purge. + +**Recommendation** + +Re-gate the purge on the **same** ownership signal as the query singleton so the two never +diverge — i.e. gate on "this node hosts the `event-log-handler` singleton" (host the purge +inside that same cluster singleton, or test oldest-member-of-`siteRole` rather than cluster +leader). Alternatively, drop the purge gate entirely and let every node purge its own local +DB: the SiteEventLogging-019 and -004 analysis already showed standby purge is harmless +(the standby receives no writes), so running it on both nodes guarantees the queried DB is +always purged at the cost of a no-op sweep on the standby. The direction is a **judgment +call** — re-gating on singleton ownership keeps the "active node only" intent but adds +coupling to the singleton's lifecycle; dropping the gate is simpler and provably correct +for the cap/retention invariant but reverts the -019 "active node only" optimisation. +Whichever is chosen, the gate (`SelfIsPrimary` = leader) and the singleton (oldest member) +must not be left keyed on two different elections. + +**Resolution** + +Deferred 2026-06-20: the direction (re-gate purge on singleton ownership vs. drop the gate per the -019 analysis) is a design-owner decision. Divergence between the Akka leader and the oldest-member singleton is a churn-window condition. Recorded; no change this pass. + +### SiteEventLogging-027 — Time-range test asserts count only with UTC-only inputs, masking SiteEventLogging-024 + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Resolved | +| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteEventLogging.Tests/EventLogQueryServiceTests.cs:197-214` | + +**Description** + +`Query_FiltersByTimeRange` is the only test covering the `From`/`To` time-range filter, and +it cannot catch SiteEventLogging-024 by construction: + +``` +var now = DateTimeOffset.UtcNow; // UTC-only input +InsertEventAt(now.AddHours(-2), "script", "Info", null, "S1", "Old event"); +InsertEventAt(now.AddMinutes(-30),"script", "Info", null, "S2", "Recent event"); +InsertEventAt(now, "script", "Info", null, "S3", "Now event"); + +var response = _queryService.ExecuteQuery(MakeRequest( + from: now.AddHours(-1), + to: now.AddMinutes(1))); + +Assert.True(response.Success); +Assert.Equal(2, response.Entries.Count); // count only +``` + +Both shortcomings hide the live -024 defect: + +1. **UTC-only inputs.** `now` is `DateTimeOffset.UtcNow`, so `from`/`to` already carry a + `+00:00` offset. `ToString("o")` then produces a correctly-comparable string and the + missing `.ToUniversalTime()` is never exercised. The defect only manifests for a + `DateTimeOffset` with a *non-zero* offset, which this test never constructs. +2. **Count-only assertion.** It asserts `response.Entries.Count == 2` but never checks + *which* rows came back. A boundary off-by-an-hour bug that swapped one in-range row for + an out-of-range row (the -024 symptom) could still yield a count of 2 and pass. + +The net effect is a green test suite that gives false confidence the time-range filter is +correct — and almost certainly contributed to SiteEventLogging-016 being marked Resolved +when its fix was never committed. + +**Recommendation** + +Add the non-UTC-offset regression test that SiteEventLogging-016 claimed but never +delivered (`Query_FiltersByTimeRange_HandlesNonUtcOffset`): construct `From`/`To` as a +`DateTimeOffset` with a non-zero offset (e.g. `+05:00`) bracketing UTC-stored events, and +assert on the **identities** of the returned rows (e.g. by `Source`/`Message` or `Id`), not +just `Count`, so a window shifted by the offset is detected. Strengthen the existing +`Query_FiltersByTimeRange` likewise to assert which rows are returned. These tests should +fail against the current code and pass once SiteEventLogging-024 is fixed. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added `Query_FiltersByTimeRange_HandlesNonUtcOffset` — seeds known UTC instants, queries with a +05:00 offset, and asserts the returned row identities (not just count). Verified failing pre-fix, passing post-fix. diff --git a/code-reviews/SiteRuntime/findings.md b/code-reviews/SiteRuntime/findings.md index 78087f94..0565a1b7 100644 --- a/code-reviews/SiteRuntime/findings.md +++ b/code-reviews/SiteRuntime/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` | | Design doc | `docs/requirements/Component-SiteRuntime.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 3 | +| Commit reviewed | `4307c381` | +| Open findings | 0 | ## Summary @@ -109,6 +109,40 @@ _Re-review (2026-05-28, `1eb6e97`):_ ## Findings +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +The module was re-reviewed in full at `4307c381` (current HEAD). The diff against the +prior baseline `1eb6e97` shows as 100% additions because of the ScadaBridge rename +(the project moved paths), so the whole module was re-read at its current state rather +than relying on the diff. Health is good: all prior findings 001–026 remain +Resolved/Deferred with no regressions observed in their fixed call sites +(SetAttribute DCL routing, the watch+buffer redeploy, the SiteRuntime-020 terminating +shadow, the OperationTrackingStore read/write split + safe Dispose, the +AuditingDb `Inner` accessor, invariant-culture numeric parsing, the +`_attributes` child-snapshot isolation). The new M7 surface (NativeAlarmActor, +CertStoreActor) and the WaitForAttribute/batch-write additions are generally +well-built. Five new findings were recorded, all in the native-alarm and +deployment-lifecycle bookkeeping: an unbounded `_latestAlarmEvents` map on the +Instance Actor (027, Medium), a phantom-active alarm left behind by +`NativeAlarmActor.EnforceCap` (028, Medium), a delete-during-pending-redeploy +that both over-decrements the deployed counter and resurrects the deleted +instance (029, Medium), missing tests for the cap/retention/replication paths +(030, Low), and stale site-side persistence of notification-list + SMTP config +that the design decision moved to central-only (031, Low). + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | EnforceCap leaves phantom-active alarm (028); delete-during-redeploy over-decrements + resurrects (029). HiLo/RoC/edge logic otherwise sound. | +| 2 | Akka.NET conventions | ✓ | Supervision (Resume coordinators / Stop short-lived) correct; execution actors on ScriptExecutionScheduler; cert broadcast uses PipeTo, captures `sender` before async. Trigger-eval mailbox-block stays Deferred (014). | +| 3 | Concurrency & thread safety | ✓ | Child-snapshot isolation (017) intact; `_createdConnections` actor-confined via ApplyArtifact… dispatch (021); OperationTrackingStore read/write split (024). No new actor-thread violations. | +| 4 | Error handling & resilience | ✓ | Best-effort persistence/replication paths log-on-fault; deploy reports Success only post-persist (005). Native source-unavailable retains+marks-uncertain. | +| 5 | Security | ✓ | Trust verdict delegated to shared ScriptTrustValidator; CertStoreActor path-traversal-guarded; no reflection in scripts; SQL params captured per M4 design (redaction at write time). | +| 6 | Performance & resource management | ✓ | `_latestAlarmEvents` unbounded growth on native-alarm churn (027). ScriptExecutionScheduler background threads OK; per-call SQLite connections acceptable. | +| 7 | Design-document adherence | ✓ | Actor hierarchy / native-alarm wiring conform. Stale site persistence of notification-list + SMTP config vs. "central-only, not deployed to sites" (031). | +| 8 | Code organization & conventions | ✓ | Reflection anti-pattern eliminated (006/022); Options owned by component; additive message evolution honoured. | +| 9 | Testing coverage | ✓ | Broad new suites (NativeAlarmActor, WaitForAttribute, SetAttribute, ExecutionActor, scheduler). EnforceCap/retention-drop + SiteReplicationActor still untested (030). | +| 10 | Documentation & comments | ✓ | XML docs accurate and extensive; ReplicationMessages now documented (026). No new stale comments found. | + ### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer | | | @@ -1370,3 +1404,237 @@ the direction (outbound to peer / inbound from peer) and what is replicated. The two pre-existing group-header XML blocks were converted to plain `//` comments to avoid orphaned doc-summaries above the first record in each group. Marker-base-type idea left out of scope. + +### SiteRuntime-027 — `InstanceActor._latestAlarmEvents` grows without bound as native conditions churn + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:67`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:1007` | + +**Description** + +`HandleAlarmStateChanged` stores the latest enriched `AlarmStateChanged` per +`changed.AlarmName` into `_latestAlarmEvents` (line 1013) and **never removes +anything from that dictionary**. For computed alarms the key set is bounded (one +entry per configured alarm). For **native** alarms, however, the key is the +per-condition `SourceReference` (every `NativeAlarmActor.Emit` stamps +`AlarmName = t.SourceReference`). When a native condition fully runs its course — +`NativeAlarmActor.ApplyLiveTransition` drops it from the mirror once +`!Active && Acknowledged` (NativeAlarmActor.cs:243), or `EnforceCap` evicts the +oldest — the Instance Actor still holds the final `AlarmStateChanged` for that +`SourceReference` forever as a (Normal) entry. There is no message back to the +Instance Actor to evict it. + +On an OPC UA A&C source that mints a fresh `SourceReference` per occurrence +(conditionId/branchId style references are common), the map accumulates one +permanently-retained entry per distinct condition the instance has *ever* seen. +The `MirroredAlarmCapPerSource` cap (default 1000) bounds the +`NativeAlarmActor._alarms` set but does **not** bound `_latestAlarmEvents`, which +is on the long-lived Instance Actor. Over weeks of uptime on a busy site this is a +steadily-growing per-instance memory leak, and `BuildAlarmStatesSnapshot` +(line 1055, `_latestAlarmEvents.Values.ToList()`) makes every DebugView snapshot +proportionally larger and slower, returning thousands of stale Normal rows. + +**Recommendation** + +Evict from `_latestAlarmEvents` when a native condition reaches its terminal +return-to-normal+dropped state. The cleanest signal is the `AlarmStateChanged` +the `NativeAlarmActor` already emits on drop-out: have the `NativeAlarmActor` +stamp a "final/dropped" marker (an additive bool on `AlarmStateChanged`, or a +dedicated `NativeAlarmDropped(sourceReference)` Tell to the parent), and have +`HandleAlarmStateChanged` `_latestAlarmEvents.Remove(...)` on it. Alternatively +cap/prune `_latestAlarmEvents` for native keys mirroring +`MirroredAlarmCapPerSource`. Computed-alarm keys must stay (they are +configuration-bounded). + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added an additive `NativeAlarmDropped(SourceReference)` Tell emitted by `NativeAlarmActor` at all terminal-drop sites (snapshot-swap removal, retention drop, cap eviction) after the return-to-normal emit; `InstanceActor.HandleNativeAlarmDropped` removes the native key from `_latestAlarmEvents` (and `_alarmStates`/`_alarmTimestamps`). Computed-alarm keys are never dropped. + +### SiteRuntime-028 — `NativeAlarmActor.EnforceCap` drops a condition without a return-to-normal, leaving a phantom Active alarm + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/NativeAlarmActor.cs:281` | + +**Description** + +`EnforceCap` evicts the oldest mirrored conditions once the per-source cap is +exceeded: it removes them from `_alarms`, calls `PersistDelete`, and logs the +eviction — but it does **not** `Emit` a return-to-normal `AlarmStateChanged` for +the dropped condition. If the evicted condition was still `Active`, the Instance +Actor's `_latestAlarmEvents` (and, downstream, central's gRPC stream and the +operator Alarm Summary page) keep showing it as **Active** indefinitely. The +mirror has silently forgotten the condition, so no later transition can ever clear +it — a phantom stuck-active alarm. + +This is inconsistent with the sibling drop path `ApplySnapshotSwap` +(NativeAlarmActor.cs:205), which correctly emits `prior.Condition with { Active = +false }` for every condition that falls out of the new snapshot. The retention +drop in `ApplyLiveTransition` (line 243) is safe only because it drops *after* +emitting the condition's own already-inactive state; `EnforceCap` drops a +condition whose last-emitted state may still be Active, with no compensating emit. + +The design doc explicitly states the cap eviction is "logged — there is no silent +truncation," but logging alone does not reconcile the operator-visible state: the +alarm view is silently wrong. + +**Recommendation** + +In `EnforceCap`, before removing each overflow condition, `Emit(a, a.Condition +with { Active = false })` (mirroring `ApplySnapshotSwap`) so the eviction produces +a return-to-normal on the stream and clears the phantom. Add a test asserting that +exceeding `MirroredAlarmCapPerSource` with an active oldest condition emits a +Normal `AlarmStateChanged` for the evicted `SourceReference`. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): `NativeAlarmActor.EnforceCap` now emits `Emit(evicted, evicted.Condition with { Active = false })` for a still-active evicted condition before removing it, clearing the phantom stuck-Active alarm on the stream/UI. Cap-eviction test added. + +### SiteRuntime-029 — A delete (or disable) arriving during a pending redeploy over-decrements the counter and is undone by `HandleTerminated` + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:627`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:402` | + +**Description** + +`HandleDelete` and `HandleDisable` only consult `_instanceActors` to find the live +actor; neither consults the SiteRuntime-020 `_terminatingActorsByName` / +`_pendingRedeploys` bookkeeping. When a redeployment is in flight, `HandleDeploy` +has already removed the instance from `_instanceActors`, stopped the predecessor, +and buffered a `PendingRedeploy` keyed by the terminating ref. If a +`DeleteInstanceCommand` then arrives for that same instance *before* the +`Terminated` signal fires: + +1. `_instanceActors.TryGetValue` misses (the entry was removed at redeploy time), + so no actor is stopped. +2. `_totalDeployedCount = Math.Max(0, _totalDeployedCount - 1)` runs anyway — + over-decrementing, because the redeploy path did not increment for an update + (the count had already been adjusted), so the deployed/disabled counts reported + to the health collector drift. +3. `RemoveDeployedConfigAsync` deletes the SQLite row and the deleter is told + success. +4. The buffered `_pendingRedeploys` entry is **untouched**, so when `Terminated` + fires, `HandleTerminated` calls `ApplyDeployment(..., isRedeploy: true)`, which + re-creates the Instance Actor and re-writes the deployed config to SQLite — + **resurrecting the instance the operator just deleted**, with the counter now + inconsistent. + +`HandleDisable` has the milder version: the disable persists `is_enabled = false`, +but `HandleTerminated` then re-stores the config with `isEnabled: true` and +re-creates the actor, so a disable issued mid-redeploy is silently reverted to +enabled. + +The window is the redeploy-termination interval — small, but reliably hit when +central issues a delete/disable immediately after a deploy (e.g. an operator +correcting a mistaken deploy, or an automated teardown), exactly the kind of rapid +command sequence SiteRuntime-020 was filed to harden. + +**Recommendation** + +Make `HandleDelete`/`HandleDisable` authoritative over the mid-redeploy state: +before falling through, check `_terminatingActorsByName`. On a hit, drop the +buffered `_pendingRedeploys` entry (so `HandleTerminated` does not resurrect), +clear the shadow, and for delete tell the buffered redeploy's `OriginalSender` it +was superseded — mirroring the last-write-wins handling already in `HandleDeploy`. +Only decrement `_totalDeployedCount` when an instance was actually present +(in `_instanceActors` **or** terminating). Add a regression test: +deploy → redeploy → delete-before-Terminated asserts the instance stays deleted +and the counter is correct. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): `HandleDelete`/`HandleDisable` now check `_terminatingActorsByName` before the `_instanceActors` fall-through; on a hit they drop the buffered `_pendingRedeploys` entry and clear the shadow so `HandleTerminated` can't resurrect the instance, and `_totalDeployedCount` is decremented only when an instance was actually present. Delete-during-redeploy race test added (verified failing pre-fix). + +### SiteRuntime-030 — Native-alarm cap/retention-drop and `SiteReplicationActor` paths are untested + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Resolved | +| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/NativeAlarmActorTests.cs`, `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/` | + +**Description** + +`NativeAlarmActorTests` covers subscribe, raise, snapshot-swap return-to-normal, +out-of-order rejection, ack, and site-event emission, but there is **no** test for +`EnforceCap` (the per-source cap eviction) or for the `ApplyLiveTransition` +retention drop (`!Active && Acknowledged`). Both are non-trivial state +transitions, and the cap path harbours an observable defect (SiteRuntime-028) that +a targeted test would have caught. `SiteReplicationActor` remains entirely +untested — the carried-forward gap that SiteRuntime-016 explicitly deferred to a +clustered-ActorSystem harness, still outstanding at this commit (the actor calls +`Cluster.Get(Context.System)` in its constructor, so it needs a clustered HOCON +test host). The outbound forward, inbound apply, and `SendToPeer` no-peer-drop +behaviour are unverified. + +**Recommendation** + +Add `NativeAlarmActor` tests for (a) cap eviction emits a return-to-normal for an +evicted active condition (pairs with SiteRuntime-028) and (b) a resolved condition +(inactive+acked) drops out and deletes its SQLite row. Stand up the clustered +test harness SiteRuntime-016 called for and cover `SiteReplicationActor`'s +outbound/inbound/peer-tracking paths. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): native-alarm cap-eviction and retention-drop tests added (pairs with -027/-028). The clustered `SiteReplicationActor` test harness remains deferred (needs a clustered ActorSystem host, consistent with the prior -016 deferral). + +### SiteRuntime-031 — Site still persists notification-list and SMTP config that the design moved to central-only + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:90`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:105`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1383`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1417` | + +**Description** + +The design decision (CLAUDE.md "External Integrations" / Component-NotificationService; +echoed in Component-SiteRuntime.md "System-Wide Artifact Handling": *"Notification +lists and SMTP configuration are central-only and are not deployed to sites"*) makes +notification delivery central-only — sites store-and-forward to central and never +talk to SMTP. Yet SiteRuntime still: + +- creates the `notification_lists` and `smtp_configurations` SQLite tables + (`SiteStorageService.InitializeAsync`), +- writes them from `HandleDeployArtifacts` (`StoreNotificationListAsync` / + `StoreSmtpConfigurationAsync`, DeploymentManagerActor.cs:1383/1417) and the + replication path `SiteReplicationActor.HandleApplyArtifacts`, and +- the `DeployArtifactsCommand` contract still carries `NotificationLists` / + `SmtpConfigurations`. + +The SMTP path persists the SMTP `password` field (SiteStorageService.cs:693) into +plaintext site SQLite. If central no longer populates these (per the design), this +is dead code carrying a latent secret-at-rest footprint; if central still does, the +site is storing SMTP credentials it must never use — both contradict the +central-only delivery decision. Either the code/tables are stale and should be +removed, or the design doc is stale and the decision needs re-stating. This straddles +the SiteRuntime/NotificationService boundary and the shared `DeployArtifactsCommand` +contract, so the direction is a design-owner call rather than a clean in-module fix. + +**Recommendation** + +Confirm with the design owner whether central still ships notification-list / SMTP +artifacts to sites. If not (the stated decision), remove the `notification_lists` +and `smtp_configurations` tables, the two `Store…Async` writers, the replication +branches, and the corresponding `DeployArtifactsCommand` fields (a coordinated +cross-module change). If the decision has changed, update Component-SiteRuntime.md +and Component-NotificationService.md and treat the plaintext SMTP password as a +secret-handling finding in its own right. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): the site no longer persists notification-list / SMTP config and purges any already-persisted rows (incl. the plaintext SMTP password) on both apply paths (`HandleDeployArtifacts`, replication `HandleApplyArtifacts`), inside the all-or-nothing apply. Paired with DeploymentManager-025 (central stops shipping). Tables retained but kept empty. diff --git a/code-reviews/StoreAndForward/findings.md b/code-reviews/StoreAndForward/findings.md index 53728126..8f6f8f6f 100644 --- a/code-reviews/StoreAndForward/findings.md +++ b/code-reviews/StoreAndForward/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward` | | Design doc | `docs/requirements/Component-StoreAndForward.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 0 (3 Deferred: 002, 011, 012; all 5 Open from Re-review 2026-05-28 resolved 2026-05-28) | +| Commit reviewed | `4307c381` | +| Open findings | 0 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) | ## Summary @@ -110,6 +110,39 @@ mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and `_replication` after `StopAsync` returns; downstream resources disposed by the host shutdown sequence (the DI container) can then NRE through the still-running sweep. +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +Full re-review against commit `4307c381` with the same 10-category checklist. All prior +fixes are intact: the StoreAndForward-024 stop-wait (the captured `_sweepTask` + +bounded `StopAsync` wait), the StoreAndForward-020 capture-before-requeue, the +StoreAndForward-005 conditional `UpdateMessageIfStatusAsync` writes, and the +StoreAndForward-023 site-id sentinel all remain present and correct, and the new WP-14 +in-process `_bufferedCount` queue-depth counter shows **no drift** — every Pending-population +transition (`BufferAsync` +1, retry-remove −1, conditional Pending→Parked −1, operator +requeue +1) adjusts it exactly once and only when the underlying conditional storage write +wins. The three findings this pass surfaces are not core-delivery defects: a **gauge-stop +resource leak** (025 — the process-global queue-depth provider slot is never cleared on +`StopAsync`, so a stopped service's frozen depth is reported indefinitely and the dead +instance is pinned via the closure), plus **two test/doc items** — the gauge stop/lifecycle +and the `EmitSiteEvent` retry / retry-delivered branches are untested (026), and the +per-retry-per-sweep `"Retried"` site event is a spec-conformant design-vs-scale tension +worth flagging (027). + +## Checklist coverage — Re-review 2026-06-20 + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | No new defects — `_bufferedCount` adjusts once per Pending transition, gated on the winning conditional write; retry-count/park semantics (003/005) intact. | +| 2 | Akka.NET conventions | ☑ | No issues found — `ParkedMessageHandlerActor` `PipeTo` success/failure projections preserved (007). | +| 3 | Concurrency & thread safety | ☑ | No new findings — StopAsync now awaits the captured `_sweepTask` (024); requeue captures the row up front (020); conditional writes guard sweep-vs-operator (005). | +| 4 | Error handling & resilience | ☑ | No new findings — replication wired on all delivery + operator paths (001/016); notification parking documented (019). | +| 5 | Security | ☑ | No issues found — parameterised SQL throughout; payload JSON opaque; no secret material handled. | +| 6 | Performance & resource management | ☑ | Queue-depth gauge provider registered into a process-global slot is never cleared on `StopAsync` — frozen reading + dead-instance pin for the process lifetime (025). | +| 7 | Design-document adherence | ☑ | Per-retry `"Retried"` site event per-row-per-sweep is spec-conformant (`Component-StoreAndForward.md:139`) but a flood risk against the unbounded buffer + capped site log — flagged as a judgment call (027). | +| 8 | Code organization & conventions | ☑ | No new findings — options/POCO placement unchanged from prior passes (011/012 still Deferred to Commons-owning changes). | +| 9 | Testing coverage | ☑ | Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested (026). | +| 10 | Documentation & comments | ☑ | No new findings — the `_bufferedCount` and StopAsync XML docs accurately describe the implemented behaviour. | + ## Checklist coverage — Re-review 2026-05-28 | # | Category | Examined | Notes | @@ -1520,3 +1553,186 @@ mid-sweep and asserts no further storage activity occurs after `StopAsync` retur _Unresolved._ +### StoreAndForward-025 — Queue-depth gauge provider is never cleared on `StopAsync`, leaking a frozen reading and pinning the dead service + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:347`–`:351`, `:385`–`:425`; `src/ZB.MOM.WW.ScadaBridge.Commons/Observability/ScadaBridgeTelemetry.cs:55`–`:60`, `:85`–`:93` | + +**Description** + +The WP-14 queue-depth gauge is backed by a **process-global** provider slot. +`StoreAndForwardService.StartAsync` seeds the in-process `_bufferedCount`, sets a +one-time instance guard (`_queueDepthProviderRegistered`, line 347), and registers a +closure that reads this instance's counter into the global slot +(`ScadaBridgeTelemetry.SetQueueDepthProvider(() => Interlocked.Read(ref _bufferedCount))`, +line 350). The `scadabridge.store_and_forward.queue.depth` `ObservableGauge` +(`ScadaBridgeTelemetry.cs:56`–`:60`) invokes that provider on every collector scrape: +`Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L`. + +`SetQueueDepthProvider` (`ScadaBridgeTelemetry.cs:85`–`:93`) is the only mutator of the +slot — there is **no** clear/reset method anywhere in the codebase (a repository-wide +search for `ClearQueueDepthProvider` / writes to `_queueDepthProvider` finds only the +registration). `StopAsync` (`StoreAndForwardService.cs:385`–`:425`) disposes the retry +timer and (StoreAndForward-024) awaits the in-flight sweep, but it never touches the +gauge: the global slot still points at the stopped instance's closure, and +`_queueDepthProviderRegistered` is left at 1. + +Two consequences after a graceful stop: + +1. **Frozen reading.** The gauge keeps reporting the dead service's last `_bufferedCount` + value forever — a stopped (or failed-over-away) site node reports a stale, non-zero + queue depth that no longer corresponds to any live buffer, which is misleading on a + dashboard / alert built on the metric. +2. **Instance pin (resource leak).** The captured closure holds a reference to the dead + `StoreAndForwardService`, so the stopped instance (and everything it transitively + roots) cannot be garbage-collected for the process lifetime. This is the classic + "static event / static delegate slot holding a finished object" leak. It also means a + later instance's `StartAsync` silently stomps the global slot (last-writer-wins), so + the gauge's identity is unmanaged. + +The instance guard's own XML doc (lines 156–164) already acknowledges the slot is +process-global and "the last `StartAsync` wins the global slot" — but nothing on the stop +path releases it, so a clean stop without an immediately-following start leaves a dangling +provider. + +**Recommendation** + +Give `ScadaBridgeTelemetry` an identity-checked clear and call it from `StopAsync`. Add a +compare-and-clear `ClearQueueDepthProvider(Func expected)` that only nulls the slot +when the current provider is reference-equal to `expected` (so a later instance that +already re-registered is not stomped): + +```csharp +public static void ClearQueueDepthProvider(Func expected) +{ + Interlocked.CompareExchange(ref _queueDepthProvider, null, expected); +} +``` + +In `StoreAndForwardService`, hold the registered provider delegate in a field at +`StartAsync` time, and in `StopAsync` (only when this instance actually registered — +`_queueDepthProviderRegistered == 1`) call +`ScadaBridgeTelemetry.ClearQueueDepthProvider(_registeredProvider)` and reset +`_queueDepthProviderRegistered` to 0 via `Interlocked.Exchange` so a subsequent restart +re-seeds and re-registers cleanly. Add a regression test (extend `QueueDepthGaugeTests`) +asserting that after `StopAsync` the gauge reports 0 (provider cleared) and that a second +instance's registration is not clobbered by the first instance's stop. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added an identity-checked `ScadaBridgeTelemetry.ClearQueueDepthProvider(provider)` (compare-and-clear via `Interlocked.CompareExchange`, so a newer instance's provider is never stomped) and call it in `StoreAndForwardService.StopAsync` (with `_queueDepthProviderRegistered` reset). The gauge no longer reports a frozen depth or pin the dead service after a graceful stop. Tests added (incl. the late-stop-doesn't-clobber-successor case). + +### StoreAndForward-026 — Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Deferred | +| Location | `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/QueueDepthGaugeTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/StoreAndForwardSiteEventTests.cs`; `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:295`–`:308`, `:747` | + +**Description** + +Two behaviours exercised by the current code have no test coverage: + +1. **Gauge stop / lifecycle.** `QueueDepthGaugeTests` covers the live counter across + enqueue / drain / park / requeue (`Gauge_TracksBufferedDepth_AcrossEnqueueDrainAndPark`), + the start-up seed from existing Pending rows (`Gauge_SeedsFromExistingPendingRows_OnStart`), + and the additive-seed race (`Gauge_SeedAddsToConcurrentPreSeedIncrement_NotClobber`). + It calls `StopAsync` only as test teardown (`DisposeAsync`) with no assertion — so the + gauge's stop behaviour is unverified. This is the same coverage gap that would have + caught StoreAndForward-025: no test asserts what the gauge reports after a service stops + (today: a stale frozen value; after the 025 fix: 0). The gauge's behaviour across a + stop, and across a second instance starting after a first stops, is entirely untested. + +2. **`EmitSiteEvent` retry / retry-delivered branches.** `StoreAndForwardSiteEventTests` + covers buffer-for-retry (`"Queued"`), park (`"Parked"` for both notification and + cached-call categories), routine-enqueue-no-event, and immediate-delivered-no-event. + But two `EmitSiteEvent` branches that the retry sweep drives are uncovered: + - the per-retry `"Retried"` action (`StoreAndForwardService.cs:747`), which maps to a + `store_and_forward` site event at `Warning` severity (the `_ => "Warning"` arm of the + severity switch, lines 298–303); + - the retry-loop `"Delivered"` action whose detail is `"Delivered to … after N retries"` + (line 644–645), which — because the detail does **not** start with `"Immediate"` — + is **not** short-circuited at line 268 and instead emits a `store_and_forward` event + at `Info` severity (the `"Delivered" => "Info"` arm, line 301). This "recovery + recorded" path is the only one that produces an `Info`-severity S&F site event, and + nothing asserts it fires (or that the immediate-delivered path is correctly suppressed + by contrast). + +**Recommendation** + +Add a gauge-stop test to `QueueDepthGaugeTests`: enqueue to a known non-zero depth, call +`StopAsync`, and assert the gauge reports the cleared value (0 after the StoreAndForward-025 +fix) — and, ideally, that a second `StoreAndForwardService` started afterward owns the slot +without being clobbered by the first instance's stop. Add two `StoreAndForwardSiteEventTests` +cases: (a) a transient handler with `MaxRetries > 1` driven through one sweep, asserting a +`store_and_forward` / `Warning` event whose message contains `"retried"`; and (b) a handler +that fails on the immediate attempt then succeeds on the sweep, asserting a `store_and_forward` +/ `Info` event whose message contains `"delivered"` (recording the recovery), distinct from +the suppressed immediate-delivery path. + +**Resolution** + +Deferred 2026-06-20: additional gauge-lifecycle and `EmitSiteEvent` retry-branch tests are low-value coverage gaps; recorded for a follow-up. + +### StoreAndForward-027 — Per-message `"Retried"` site event fires per-row-per-sweep, risking a flood of the capped site event log + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:734`–`:759`, `:295`–`:308`; `docs/requirements/Component-StoreAndForward.md:72`, `:139` | + +**Description** + +When `_siteEventLogger` is wired, `EmitSiteEvent` is subscribed to `OnActivity`, so every +`RaiseActivity("Retried", …)` call in `RetryMessageAsync`'s transient-not-yet-max branch +(`StoreAndForwardService.cs:747`–`:748`) produces one `store_and_forward` site event at +`Warning` severity (lines 295–308). That branch fires **once per still-pending row per +retry sweep**. Combined with the design's deliberate "**no maximum buffer size**" +(`Component-StoreAndForward.md:72`), a sustained outage can buffer a large Pending +population, and each fixed-interval sweep then emits a `"Retried"` event for every one of +those rows — `(pending rows) × (sweeps)` events. The site event log is **capped** (default +1 GB, 30-day retention, with oldest-first purge when the cap is hit before the retention +window — `Component-SiteEventLogging.md:45`–`:46`), so a large, long-lived backlog can +churn the log and evict genuinely useful operational history (deployments, connection +events, alarms) well before the 30-day window. + +**This is a judgment call, not a clear bug.** The behaviour is **spec-conformant**: +`Component-StoreAndForward.md:139` explicitly lists "retried" as a logged store-and-forward +activity ("Logs store-and-forward activity (queued, delivered, retried, parked)"), so the +code is doing exactly what the design doc says. The tension is between that +per-retry-logging contract and the combination of an unbounded buffer + a capped site log: +at scale the "retried" stream can dominate the log. The site-event-log cap bounds the +absolute disk risk (it cannot exhaust disk), so the residual concern is **eviction of +other history**, not unbounded growth — hence Low and flagged as a design-vs-scale +decision rather than a defect. + +**Recommendation** + +Reconcile the design-vs-scale tension deliberately; this is a judgment call for the owners, +not a forced fix. Options, in rough order of preference: + +1. **Drop or coarsen the per-retry event.** Stop emitting a `store_and_forward` event for + the routine `"Retried"` action (keep `"Queued"`, `"Parked"`, and the retry-recovery + `"Delivered"`), since the per-attempt detail is already available via the + `OnActivity`/health/telemetry surfaces and the central audit log. The first-buffer and + the terminal park/deliver events are the operationally interesting ones. +2. **Sample / threshold it.** Emit a `"Retried"` event only on the first retry, on a count + boundary (e.g. every Nth retry), or once per sweep as a rolled-up summary + ("retried N messages this sweep") rather than one per row. +3. **Document the volume risk.** If per-retry logging is intentionally retained, add a note + to `Component-StoreAndForward.md` / `Component-SiteEventLogging.md` that a large buffered + backlog can dominate the capped site event log and evict other history, so operators + size the cap / retention accordingly. + +**Resolution** + +Deferred 2026-06-20: the per-retry `"Retried"` site-event volume is spec-conformant (Component-StoreAndForward.md lists 'retried' as a logged activity) and the site log is capped, so the residual concern is history eviction, not disk exhaustion. Whether to drop/sample/threshold the event vs. document the volume is a design-vs-scale decision for the owner. + diff --git a/code-reviews/TemplateEngine/findings.md b/code-reviews/TemplateEngine/findings.md index 786f36b4..de52e5c4 100644 --- a/code-reviews/TemplateEngine/findings.md +++ b/code-reviews/TemplateEngine/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine` | | Design doc | `docs/requirements/Component-TemplateEngine.md` | | Status | Reviewed | -| Last reviewed | 2026-05-28 | +| Last reviewed | 2026-06-20 | | Reviewer | claude-agent | -| Commit reviewed | `1eb6e97` | -| Open findings | 1 | +| Commit reviewed | `4307c381` | +| Open findings | 0 | ## Summary @@ -87,6 +87,45 @@ the duplicate-Id / null-sentinel fix from the last batch, and audit-write ordering inconsistency between `TemplateService` (logs then saves) and `InstanceService` (saves then logs). +#### Re-review 2026-06-20 (commit `4307c381`) — full review + +Re-reviewed the whole module against all ten checklist categories at commit +`4307c381`. The large delta since `1eb6e97` (722 commits, including the +`ScadaLink → ZB.MOM.WW.ScadaBridge` rename, the Script Analysis #25 extraction, +native-alarm-source authoring/flatten, the read-only inheritance-resolve service, +List-attribute semantics, and Transport line-diff) holds up well. All 22 prior +findings remain closed and verified: the deploy gate now delegates to the +authoritative `ScriptTrustValidator` + `RoslynScriptCompiler` (real semantic +compile, no residual substring scan or hand-rolled brace counter in the security +path — `ScriptCompiler` and `ValidationService.CheckExpressionSyntax` both call +the shared analyzer); the recursive composition walk, per-slot alarm override, +duplicate-Id-tolerant `BuildLookup`, `int?`-parent walk, and the revision-hash / +diff `Description`+`Connections`+`ElementDataType` coverage are all in place and +test-guarded. The new `TemplateInheritanceResolver` (M9-T26a) mirrors the +flattener's inheritance precedence exactly (including HiLo per-setpoint merge), +verified by `Resolve_AgreesWithFlatteningService_*` tests. Three new findings +surfaced — all lock/audit asymmetries where a newer feature did not pick up an +invariant an older sibling already enforces: native-alarm-source instance +overrides bypass the lock at flatten time (the unfixed-for-native sibling of the +closed TemplateEngine-008; `ResolvedNativeAlarmSource` carries no `IsLocked` +flag), the four `TemplateFolderService` mutators stage their audit row *after* +the final `SaveChangesAsync` so the row is never persisted, and a dead/misnamed +`SemanticValidator.ParseParameterDefinitions` legacy branch returns types not +names. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | New: native-source instance override applies with no lock check (TemplateEngine-023). Flattening inherit/compose/override precedence + recursion verified correct for attrs/alarms/scripts/native-sources; HiLo per-setpoint merge consistent with resolver. | +| 2 | Akka.NET conventions | ✓ | No actors. `AddTemplateEngineActors` is still an empty placeholder. Nothing to assess. | +| 3 | Concurrency & thread safety | ✓ | Services stateless/scoped; static helpers hold no mutable state. No new findings. | +| 4 | Error handling & resilience | ✓ | `Result` used consistently; `Flatten` wraps in try/catch; JSON parse failures fall back safely. No store-and-forward/failover surface. | +| 5 | Security | ✓ | Deploy gate is now authoritative via Script Analysis #25 (`ScriptTrustValidator.FindViolations` + `RoslynScriptCompiler.Compile`) — alias/`using static`/`global::` bypasses caught by symbol resolution; no residual substring scan in the security path. No new findings. | +| 6 | Performance & resource management | ✓ | TemplateEngine-009 N+1 fix holds (`Compositions` navigation reused). `JsonDocument`/streams disposed. No new leaks. | +| 7 | Design-document adherence | ✓ | Native-alarm-source lock rule ("Lock applies to the entire source") not enforced in flattening (TemplateEngine-023). Otherwise the doc matches: revision-hash/diff payload coverage, semantic validation set, acyclicity, derived-naming all align. | +| 8 | Code organization & conventions | ✓ | New: `TemplateFolderService` Create/Rename/Move/Reorder log after the final save → audit row never persisted (TemplateEngine-024), asymmetric with `TemplateService` (which always saves after `LogAsync`). | +| 9 | Testing coverage | ✓ | Strong overall (inheritance-resolver agrees-with-flattener tests, hash determinism test). Gaps: no test for a locked native source rejecting an instance override; folder tests mock `IAuditService` so the never-saved audit row is invisible; `ParseParameterDefinitions` legacy branch untested. | +| 10 | Documentation & comments | ✓ | New: `ParseParameterDefinitions` is documented as returning parameter NAMES but its legacy-array branch returns the type field (TemplateEngine-025). XML docs otherwise accurate and thorough. | + ## Checklist coverage _Re-review (2026-05-17, `39d737e`):_ @@ -1191,3 +1230,161 @@ false rejected, false→true / true→true / false→false accepted) and `TemplateServiceTests.UpdateAttribute_LockedInDerivedDowngrade_OnBase_Rejected` (end-to-end on the attribute update path). +### TemplateEngine-023 — Instance native-alarm-source override bypasses the source lock during flattening + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Deferred | +| Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Flattening/FlatteningService.cs:775`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Flattening/FlattenedConfiguration.cs:137` | + +**Description** + +`ApplyInstanceNativeAlarmSourceOverrides` applies a per-instance native-alarm-source +override unconditionally — it looks the binding up by canonical name and, if found, +overlays the override's non-null fields with **no lock check**: + +```csharp +foreach (var ovr in overrides) +{ + if (!sources.TryGetValue(ovr.SourceCanonicalName, out var existing)) + continue; // Cannot add new bindings via overrides. + + sources[ovr.SourceCanonicalName] = existing with { ConnectionName = ..., ... }; +} +``` + +Contrast `ApplyInstanceOverrides` (attributes, `:309` — `if (existing.IsLocked) continue;`) +and `ApplyInstanceAlarmOverrides` (alarms, `:337` — `if (existing.IsLocked) continue;`). +The design (`Component-TemplateEngine.md` Override Granularity, Native alarm sources: +*"Lock applies to the entire source"*) requires the same treatment. The root cause is +in Commons: `ResolvedNativeAlarmSource` (`FlattenedConfiguration.cs:137`) carries **no +`IsLocked` field**, so the flattener has nothing to gate on even if it wanted to — +unlike `ResolvedAttribute`/`ResolvedAlarm`, which both expose `IsLocked`. The +inheritance-side `ResolveInheritedNativeAlarmSources` does track a base lock (via the +`lockedNames` set) so a *derived template* cannot override a base-locked source, but +that lock state is dropped on the way out and the *instance* layer is unguarded. + +This is the same defect class as the closed **TemplateEngine-008** (which fixed the +identical instance-override lock-bypass for alarms), never applied to native alarm +sources because they were introduced after that finding. The companion gap lives in +`ManagementService.HandleSetInstanceNativeAlarmSourceOverride` +(`ManagementActor.cs:868`), which — unlike `InstanceService.SetAlarmOverrideAsync` / +`SetAttributeOverrideAsync` — performs neither an existence check (an override for an +unknown source name is silently persisted as a dead record) nor a lock check before +writing the override row straight through the repository, bypassing `InstanceService` +entirely. Blast radius is limited (native alarms are a read-only mirror — no write-back +to the source), but the documented lock invariant is not enforced anywhere on this path. + +**Recommendation** + +Add an `IsLocked` field to `ResolvedNativeAlarmSource` (Commons) and carry it through +`ResolveInheritedNativeAlarmSources` / `ResolveComposedNativeAlarmSources`; then gate +`ApplyInstanceNativeAlarmSourceOverrides` on `if (existing.IsLocked) continue;`, +mirroring the attribute/alarm paths. Add the matching existence + lock guard to +`HandleSetInstanceNativeAlarmSourceOverride` (resolve the effective native-source set +via `TemplateResolver`/flattening as `SetAlarmOverrideAsync` does). Add a flattening +regression test (`Flatten_LockedNativeSource_InstanceOverrideIgnored`) and a management +test rejecting an override of a locked / unknown source. + +**Resolution** + +Deferred 2026-06-20: the native-alarm-source override lock-bypass fix spans Commons (`ResolvedNativeAlarmSource` has no `IsLocked` field to gate on) and ManagementService (the override handler skips the existence/lock checks), and the blast radius is limited (native alarms are read-only mirrors, no write-back). Recorded as a cross-module follow-up. + +### TemplateEngine-024 — `TemplateFolderService` mutators stage their audit row after the final save, so it is never persisted + +| | | +|--|--| +| Severity | Medium | +| Category | Code organization & conventions | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:58`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:91`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:154`, `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Services/TemplateFolderService.cs:204` | + +**Description** + +`IAuditService.LogAsync` only *stages* the audit row in the EF change tracker — the +implementation (`ConfigurationDatabase/Services/AuditService.cs:61`) ends with +`AddAsync(entry)` and explicitly documents *"caller is responsible for calling +SaveChangesAsync to ensure atomicity."* Four of the five `TemplateFolderService` +mutators call `LogAsync` **after** their final `SaveChangesAsync` and never save again: + +- `CreateFolderAsync` (`:56-58`): `AddFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save). +- `RenameFolderAsync` (`:89-91`): `UpdateFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save). +- `MoveFolderAsync` (`:152-154`): `UpdateFolderAsync` → `SaveChangesAsync` → `LogAsync` (no save). +- `ReorderFolderAsync` (`:201-204`): `Update`×2 → `SaveChangesAsync` → `LogAsync` (no save). + +`ManagementActor.ProcessCommand` (`ManagementActor.cs:130`) wraps each command in a DI +scope, dispatches, and disposes the scope with no outer `SaveChangesAsync`; EF Core does +**not** flush on dispose. The folder operation itself persists (it was saved *before* the +`LogAsync`), but the staged `TemplateFolder` audit row is silently discarded when the +scoped `DbContext` is disposed. Every folder Create / Rename / Move / Reorder therefore +leaves **no audit trail**, even though `_auditService.LogAsync(...)` is called and reads +as if it does. (`DeleteFolderAsync` has the same shape but the lost row is the *Delete* +record specifically.) This is asymmetric with `TemplateService`, where every mutator +ends with `LogAsync` → `SaveChangesAsync` (e.g. `:225-226`). The folder tests mock +`IAuditService` and the repository, so they never exercise the real change-tracker +lifecycle and the gap is invisible to them. + +**Recommendation** + +Add a final `await _repository.SaveChangesAsync(cancellationToken);` after the `LogAsync` +in all five folder mutators (matching `TemplateService`), or — cleaner — reorder each to +mutate → log → single `SaveChangesAsync` so the entity change and its audit row commit +atomically. Add a test that uses a real (in-memory) `AuditService` + `DbContext` and +asserts an `AuditLogEntry` row exists after a folder create/rename/move/reorder. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): added `await _repository.SaveChangesAsync(...)` after `LogAsync` in all five `TemplateFolderService` mutators (Create/Rename/Move/Reorder/Delete), so folder-mutation audit rows are now persisted (previously staged then discarded with the scope). Matches the `TemplateService` two-save pattern. Tests added. + +### TemplateEngine-025 — `SemanticValidator.ParseParameterDefinitions` legacy-array branch returns types, not names; method is dead production code + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Resolved | +| Location | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine/Validation/SemanticValidator.cs:573` | + +**Description** + +`ParseParameterDefinitions` is documented and named as returning the declared parameter +**names**: + +```csharp +/// ... returns the declared parameter names. +internal static List ParseParameterDefinitions(string? parameterDefinitionsJson) +``` + +Its JSON-Schema branch correctly returns `props.EnumerateObject().Select(p => p.Name)`, +but its legacy flat-array branch returns the `type` field: + +```csharp +else if (doc.RootElement.ValueKind == JsonValueKind.Array) +{ + return doc.RootElement.EnumerateArray() + .Select(e => e.TryGetProperty("type", out var t) ? t.GetString() ?? "unknown" : "unknown") + .ToList(); +} +``` + +This is a copy-paste from the sibling `ParseParameterTypes` (`:491`, which legitimately +returns types) — so for legacy `[{name,type}]` definitions this method returns the wrong +field (types instead of names). The method has **no non-test caller** in `src/` (only its +own `SemanticValidatorTests`), and the test exercises only the JSON-Schema branch, so the +buggy legacy branch is both dead and untested. The live param-count/type validation path +uses `ParseParameterTypes`, not this method, so there is no behavioural impact today — but +the method is `internal` (callable from the rest of the module) and silently returns +mislabelled data for one input shape. + +**Recommendation** + +Delete `ParseParameterDefinitions` (and its tests) as dead code, or — if it is meant to +back a future feature — fix the legacy branch to return `e.TryGetProperty("name", ...)` +and add a legacy-array test asserting names are returned. + +**Resolution** + +Resolved 2026-06-20 (commit `fd618cf1`): deleted the dead, misnamed `SemanticValidator.ParseParameterDefinitions` (no production caller per grep; its only references were two tests, which asserted the buggy type-not-name output). Tests removed. +