From c42bb48585a42a77846bca0b1f88a1dd8a9191aa Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Wed, 24 Jun 2026 09:20:03 -0400 Subject: [PATCH] =?UTF-8?q?docs(code-review):=20re-review=2017=20changed?= =?UTF-8?q?=20modules=20at=201f9de8a2=20=E2=80=94=208=20new=20findings?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Re-reviewed the modules whose source changed since the last review baseline (full-review remediation fd618cf1 + InboundAPI Database-helper fixes b3c90143), focused on whether the fixes are sound and regression-free. 9 of 17 modules clean; 8 new findings (0 Critical, 0 High, 4 Medium, 4 Low), all code-verified by the orchestrator before recording: - DataConnectionLayer-029 (Med): DCL-023's unsubscribe-clears-in-flight reopens a double-subscribe window that leaks an orphaned alarm feed; the alarm completion handler overwrites the subscription id without the tag-path guard at line 908. - InboundAPI-031 (Med): WaitForAttribute's 5s grace backstop is tighter than the CommunicationService Ask's timeout+IntegrationTimeout (30s) round-trip slack, so a slow-but-valid timed-out 'false' arriving in the 5-30s window is cancelled into an unhandled OperationCanceledException/500 (contradicts spec 6 + its own comment). - SiteRuntime-032 (Med): SiteRuntime-029's wasPresent guard skips the deployed-count decrement when deleting a DISABLED instance (absent from both maps), drifting the health-dashboard tally; self-heals on singleton restart (observational, hence Med). - StoreAndForward-028 (Med): StoreAndForward-025 resets the register-guard but not _bufferedCount, so a same-instance Stop->Start re-seeds the depth gauge to ~2N. - AuditLog-017, CentralUI-037, ScriptAnalysis-009, SiteRuntime-033 (Low): a test-coverage gap plus stale doc-comments/spec following the remediation. Header commit/date bumped to 1f9de8a2 / 2026-06-24 on all 17 modules; README regenerated (8 pending / 576 total). --- code-reviews/AuditLog/findings.md | 50 ++++++++++++- code-reviews/CentralUI/findings.md | 50 ++++++++++++- code-reviews/Commons/findings.md | 29 +++++++- .../ConfigurationDatabase/findings.md | 27 ++++++- code-reviews/DataConnectionLayer/findings.md | 50 ++++++++++++- code-reviews/DeploymentManager/findings.md | 27 ++++++- .../ExternalSystemGateway/findings.md | 27 ++++++- code-reviews/Host/findings.md | 27 ++++++- code-reviews/InboundAPI/findings.md | 50 ++++++++++++- code-reviews/KpiHistory/findings.md | 27 ++++++- code-reviews/ManagementService/findings.md | 27 ++++++- code-reviews/README.md | 58 +++++++++------ code-reviews/ScriptAnalysis/findings.md | 50 ++++++++++++- code-reviews/SiteCallAudit/findings.md | 27 ++++++- code-reviews/SiteEventLogging/findings.md | 27 ++++++- code-reviews/SiteRuntime/findings.md | 73 ++++++++++++++++++- code-reviews/StoreAndForward/findings.md | 49 ++++++++++++- code-reviews/TemplateEngine/findings.md | 26 ++++++- 18 files changed, 635 insertions(+), 66 deletions(-) diff --git a/code-reviews/AuditLog/findings.md b/code-reviews/AuditLog/findings.md index 8a86b5f0..e3c94b13 100644 --- a/code-reviews/AuditLog/findings.md +++ b/code-reviews/AuditLog/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.AuditLog` | | Design doc | `docs/requirements/Component-AuditLog.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | -| Open findings | 0 | +| Commit reviewed | `1f9de8a2` | +| Open findings | 1 | ## Summary @@ -875,3 +875,47 @@ test in `AddAuditLogTests` calls the helper twice and asserts a single `AddAuditLogCentralMaintenance` helper is left for a follow-up — it is only ever called from the central composition root and the unit/integration fixtures use disposable IServiceCollections, so the foot-gun is narrower. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** Two files changed since 4307c381. AuditLogPurgeOptions.cs (AuditLog-013) adds [ConfigurationKeyName("ChannelPurgeBatchSize")] so the documented operator config key binds onto the ChannelPurgeBatchSizeConfigured backing property instead of being silently ignored. AuditLogIngestActor.cs (AuditLog-014) wraps OnIngestAsync's scope-creation + repository resolution in an outer try/catch (mirroring the pre-existing OnCachedTelemetryAsync guard) so a transient DI/DbContext fault cannot escape the handler and restart the central singleton; plus doc-comment cleanup removing stale Bundle A/C/D/E jargon. + +**Verdict:** The two changes are sound, targeted remediations and are regression-free. AuditLog-013 correctly aligns the binder with the documented AuditLog:Purge:ChannelPurgeBatchSize key (verified against Component-AuditLog.md lines 407/508); the project builds clean and the new binding test pins the exact documented section/key shape. AuditLog-014 correctly hardens OnIngestAsync against scope/resolution faults restarting the singleton, preserving the captured-Sender-before-await pattern and the unconditional reply so the site's Ask never times out; per-row failures stay contained in the inner loop and the failure counter is defensively guarded. All 28 ingest/purge/binding unit tests pass. The only gap is that the new AuditLog-014 guard path itself is not exercised by a dedicated test, but it is correct by inspection and matches an already-covered sibling path. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | AuditLog-014 outer catch only fires on scope-create/resolve/dispose faults (per-row Exception catch contains row failures); replyTo.Tell runs unconditionally after if-else. AuditLog-013 ConfigurationKeyName maps documented key onto backing prop. No logic defects. | +| 2 | Akka.NET conventions | ☑ | Sender captured before first await (line 116); singleton no longer restarts on transient DI fault (the fix's intent); SupervisorStrategy override unchanged. Ask-at-boundary reply preserved. No issues found. | +| 3 | Concurrency & thread safety | ☑ | All mutation (accepted list, nowUtc) is actor-thread local; fresh DI scope per message; no captured this/Sender in post-await closures. No shared mutable state introduced. No issues found. | +| 4 | Error handling & resilience | ☑ | New outer try/catch makes best-effort audit truly non-fatal; counter.Increment wrapped in defensive try/catch; logs with accepted count. Brings OnIngestAsync into parity with OnCachedTelemetryAsync. Sound. | +| 5 | Security | ☑ | No secret/auth exposure in new log messages; redaction path (SafeDefault fallback) unchanged. Config key change is non-sensitive. No issues found. | +| 6 | Performance & resource management | ☑ | await using on the async scope unchanged; at most one extra counter increment on a scope-dispose fault (best-effort gauge). No new allocation or leak. No issues found. | +| 7 | Design-document adherence | ☑ | ChannelPurgeBatchSize key now matches Component-AuditLog.md (lines 407/508) exactly; audit-write-failure-never-aborts invariant reinforced. No drift in either direction. | +| 8 | Code organization & conventions | ☑ | Stale Bundle A/C/D/E jargon removed from doc comments; all targets (AuditRowProjection.WithIngestedAtUtc, ConfigurationKeyNameAttribute, SafeDefaultAuditRedactor) resolve. ConfigurationKeyName is the only repo usage but compiles fine. Clean. | +| 9 | Testing coverage | ☑ | New PurgeOptions_Bind_FromDocumentedSectionAndKeys pins AuditLog-013; 28 ingest/purge/binding tests pass. AuditLog-014's scope-throw guard path has no dedicated test (Low finding). | +| 10 | Documentation & comments | ☑ | Class/remarks doc comments updated accurately; AuditLog-014 inline comment precisely explains the guard's rationale (singleton restart / dropped reply / site retry). No stale references remain. | + +**New findings from this re-review (1):** + +### AuditLog-017 — AuditLog-014 guard path not exercised by a test + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:149` | + +**Description** + +The new outer try/catch in OnIngestAsync (the AuditLog-014 fix) is the regression guard that keeps the central singleton alive and still replies (with whatever was accepted) when CreateAsyncScope() or GetRequiredService() throws. No unit test injects a throwing IServiceProvider/scope to assert that the actor survives, increments the failure counter, and returns an IngestAuditEventsReply. The fix is correct by inspection and mirrors the already-present OnCachedTelemetryAsync guard, but the specific behavior the fix exists to protect is not pinned, so a future refactor could silently reintroduce the restart-on-transient-fault regression. + +**Recommendation** + +Add a TestKit test using a service provider whose IAuditLogRepository resolution throws (or whose scope-create throws), assert the actor does not restart and replies with an empty accepted list, and assert the ICentralAuditWriteFailureCounter is incremented once. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/CentralUI/findings.md b/code-reviews/CentralUI/findings.md index 76abd643..afe96771 100644 --- a/code-reviews/CentralUI/findings.md +++ b/code-reviews/CentralUI/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.CentralUI` | | Design doc | `docs/requirements/Component-CentralUI.md` | | Status | Reviewed | -| Last reviewed | 2026-06-19 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `d6ead8ae` | -| Open findings | 0 | +| Commit reviewed | `1f9de8a2` | +| Open findings | 1 | ## Summary @@ -1685,3 +1685,47 @@ Add a test that saving succeeds with a Messaging Service SID and no From Number. Resolved 2026-06-19 (commit `33e1802e`): added `SavingNewConfig_MessagingServiceSidOnly_NoFromNumber_Saves`, which asserts the config persists with a null FromNumber. Complemented by ManagementActor + CLI either-or tests in the same commit. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** Added a `Database` accessor to both inbound script-analysis globals types. `InboundScriptHost.DatabaseAccessor` (editor/diagnostics mirror) exposes `QuerySingleAsync`/`QueryAsync`/`ExecuteAsync` as never-invoked signature stubs that return default/empty; `SandboxInboundScriptHost.DatabaseAccessor` (Test Run mirror) exposes the same three methods, each throwing `ScriptSandboxException` because a central Test Run has no configured DB connection — symmetric with the pre-existing throwing `RouteAccessor`. + +**Verdict:** The change is clean, minimal, and faithful to the runtime it mirrors. Both new `DatabaseAccessor` types replicate the exact signatures of the shipped runtime `InboundAPI.InboundDatabaseHelper` (verified method-by-method), so inbound scripts using `Database.*` now type-check in the editor and recompile identically in a Test Run while remaining safely unreachable (sandbox throws). It aligns the editor surface with the Component-InboundAPI.md spec (lines 202-222), closing a prior drift where `Database.*` would have falsely flagged as a compile error. The CentralUI project builds with 0 warnings/0 errors and the additions are well-XML-documented. No correctness, security, concurrency, or contract regressions found. One Low-severity doc-comment staleness exists just outside the changed files in ScriptAnalysisService.cs. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Editor-mirror stubs are never invoked (Roslyn reads signatures only); default/empty return bodies are appropriate. Sandbox-mirror throws on every call, symmetric with RouteAccessor. No logic defects. | +| 2 | Akka.NET conventions | ☑ | No issues found — these are plain POCO globals types for Roslyn analysis; no actors, messages, Tell/Ask, or supervision involved. | +| 3 | Concurrency & thread safety | ☑ | No issues found — accessors are stateless (sandbox RouteTarget holds only an immutable readonly instanceCode); no shared mutable state, no captured sender/this. | +| 4 | Error handling & resilience | ☑ | Sandbox DatabaseAccessor throws ScriptSandboxException with a clear operation-named message; consistent with the established Route Unavailable() pattern. No swallowed exceptions. | +| 5 | Security | ☑ | Mirror methods never execute SQL — no injection surface here; runtime InboundDatabaseHelper enforces parameter binding (InboundAPI-026). No secrets logged; connection names embedded in sandbox messages are non-sensitive. | +| 6 | Performance & resource management | ☑ | No issues found — no connections, streams, or disposables opened in either mirror; the sandbox path throws before touching any resource. | +| 7 | Design-document adherence | ☑ | Mirror signatures exactly match Component-InboundAPI.md (lines 215-217) and runtime InboundDatabaseHelper. Change correctly closes prior editor drift; no spec violation. | +| 8 | Code organization & conventions | ☑ | Nested accessor classes match the existing RouteHelper/RouteAccessor structure; consistent naming, full XML docs. Builds clean under ImplicitUsings + Nullable. | +| 9 | Testing coverage | ☑ | New InboundScript_Database_DiagnosesClean covers the editor mirror via Diagnose, matching the established inbound-Route test pattern. No dedicated RunInSandbox test asserts the inbound Database throws, but this mirrors the existing (also-untested) inbound-Route sandbox-throw behavior — consistent, not a regression. | +| 10 | Documentation & comments | ☑ | Both new files are thoroughly XML-documented. ScriptAnalysisService.cs doc comment (lines 163-164, outside the changed files) now under-describes the inbound sandbox surface — reported as Low. | + +**New findings from this re-review (1):** + +### CentralUI-037 — RunInSandboxAsync doc comment omits new Database-throws behavior + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ZB.MOM.WW.ScadaBridge.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:163` | + +**Description** + +The XML summary on RunInSandboxAsync states only "For the SandboxInboundScriptHost surface, every Route call throws because cross-site routing needs a deployed site." The reviewed change added a Database accessor to SandboxInboundScriptHost whose three methods also throw ScriptSandboxException in a Test Run. The comment is now incomplete relative to the code it describes — a reader would not learn that Database.* also fails in an inbound Test Run. This is documentation-only; the throwing behavior itself is correct and intended. + +**Recommendation** + +Extend the line-163 sentence to note that the inbound sandbox surface also throws on every Database call (no configured central DB connection in a Test Run), mirroring the existing Route wording. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/Commons/findings.md b/code-reviews/Commons/findings.md index 4eb36196..cbe03880 100644 --- a/code-reviews/Commons/findings.md +++ b/code-reviews/Commons/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.Commons` | | Design doc | `docs/requirements/Component-Commons.md` | | Status | Reviewed | -| Last reviewed | 2026-06-19 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `d6ead8ae` | -| Open findings | 2 | +| Commit reviewed | `1f9de8a2` | +| Open findings | 0 | ## Summary @@ -1239,3 +1239,26 @@ Optionally default `RecipientEmails` to an empty list, or normalize null→empty Won't Fix (2026-06-19): not a correctness defect — the handler already ignores `RecipientEmails` for SMS lists. Changing a required positional parameter to defaulted/nullable is a breaking contract change for existing callers for a pure-ergonomics gain; deferred indefinitely. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** Two changes since d6ead8ae: (1) ScadaBridgeTelemetry gained a new ClearQueueDepthProvider(Func) method that does an identity-checked Interlocked.CompareExchange compare-and-clear of the process-global queue-depth provider slot, so a StoreAndForwardService can deregister its own provider on StopAsync without clobbering a newer instance that already re-registered. (2) KpiSeriesBucketer received a documentation-only change: the XML doc and an inline comment now state that input MUST be sorted ascending by BucketStartUtc and clarify that for unsorted input the method selects the last point in iteration order (not the largest-timestamp point). No executable logic in the bucketer changed. + +**Verdict:** The delta is small, well-targeted, and clean. ClearQueueDepthProvider is a correct fix for a real concern (a stopped service otherwise pins a dead closure and reports a frozen gauge depth forever); the compare-and-clear is sound under the actual memory model, and the StoreAndForwardService caller correctly retains the exact delegate reference (_queueDepthProvider) and passes it back, so the identity check works. The convergence behaviour between a new instance's Volatile.Write register and an old instance's CAS clear is correct in both interleavings (newer provider always wins). The KpiSeriesBucketer change is documentation only and accurately tightens an already-satisfied precondition: the sole production caller (KpiHistoryRepository.GetRawSeriesAsync) returns rows OrderBy(CapturedAtUtc) ascending with BucketStartUtc populated from CapturedAtUtc, so the new sorted-input requirement holds in production. Both changes are pinned by new regression tests (QueueDepthGaugeTests: StopAsync clears + does-not-clobber-newer; KpiSeriesBucketerTests: unsorted-last-in-iteration + short-series-raw-timestamps). No new issues found. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | CompareExchange compare-and-clear is correct; only nulls slot when it still holds the caller's delegate. Bucketer doc matches actual last-in-iteration behaviour; for the sorted production input last-in-iteration equals largest-timestamp. No issues found. | +| 2 | Akka.NET conventions | ☑ | Not actor code — static telemetry helper and a pure downsampling function. No supervision/Tell-vs-Ask/correlation-id surface touched. Not applicable to the delta. | +| 3 | Concurrency & thread safety | ☑ | Interlocked.CompareExchange (full fence) interoperates correctly with Set's Volatile.Write and the gauge's Volatile.Read on the same atomic reference field; reference assignment is atomic, no torn reads. New-vs-old register/clear interleavings both converge to the newer provider. No issues found. | +| 4 | Error handling & resilience | ☑ | ClearQueueDepthProvider null-guards and returns early; gauge falls back to 0 after clear (matches existing null-handling in Set). No exception paths introduced. No issues found. | +| 5 | Security | ☑ | No secrets, no SQL/LDAP, no external input. Queue depth is a numeric count; no sensitive data in telemetry. No issues found. | +| 6 | Performance & resource management | ☑ | Fixes a real resource concern: clearing the provider lets the dead StoreAndForwardService instance's closure be collected instead of being pinned for process lifetime. Bucketer unchanged at runtime. No issues found. | +| 7 | Design-document adherence | ☑ | Bucketer last-value-per-bucket semantics still match Component-KpiHistory.md; the tightened sort precondition is met by GetRawSeriesAsync's OrderBy. Telemetry gauge is an internal observability detail not specified by design docs — no drift. No issues found. | +| 8 | Code organization & conventions | ☑ | ClearQueueDepthProvider mirrors SetQueueDepthProvider's signature and access style and lives in the right file. Minor pre-existing asymmetry (Set uses Volatile.Write, Clear uses Interlocked) is intentional and harmless. No issues found. | +| 9 | Testing coverage | ☑ | Both changes have dedicated regression tests: QueueDepthGaugeTests covers clear-on-stop and the no-clobber-newer-instance race; KpiSeriesBucketerTests pins unsorted last-in-iteration and short-series raw-timestamp paths. Coverage is adequate. No issues found. | +| 10 | Documentation & comments | ☑ | XML docs are thorough and accurate; the bucketer comment's 'essentially any in-bucket point' hedge correctly accounts for the exact-bucket-start equality edge. No misleading or stale comments. No issues found. | + +_No new findings — the changes in this module are clean._ diff --git a/code-reviews/ConfigurationDatabase/findings.md b/code-reviews/ConfigurationDatabase/findings.md index 837941b2..041391ca 100644 --- a/code-reviews/ConfigurationDatabase/findings.md +++ b/code-reviews/ConfigurationDatabase/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase` | | Design doc | `docs/requirements/Component-ConfigurationDatabase.md` | | Status | Reviewed | -| Last reviewed | 2026-06-19 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `d6ead8ae` | +| Commit reviewed | `1f9de8a2` | | Open findings | 0 | ## Summary @@ -1461,3 +1461,26 @@ Deferred (2026-06-19): the migrations are standard idempotent `ALTER`/guarded-`C verified by the model snapshot and the build, and `AddSmsNotifications` was applied to the real docker MS SQL during the feature's integration pass. A live-SQL migration test is tracked for a future ConfigurationDatabase test-stability pass. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** A single change to GetCurrentDeploymentStatusAsync in DeploymentManagerRepository.cs: added .ThenByDescending(d => d.Id) as a secondary sort key after .OrderByDescending(d => d.DeployedAt), plus an explanatory DeploymentManager-026 comment. This makes the "current" deployment-record read deterministic when two insert-only records for the same instance tie on DeployedAt within one clock tick. + +**Verdict:** The change is a small, correct, well-targeted fix that removes a real non-determinism in the read path. Because deployments are insert-only and DeploymentRecord.Id is the EF auto-increment identity PK (HasKey(d => d.Id), no ValueGeneratedNever), a later insert always carries a strictly higher Id, so ThenByDescending(d => d.Id) correctly selects the most recently inserted record on a DeployedAt tie. The fix matters for correctness: the sole caller TryReconcileWithSiteAsync branches on prior.Status (InProgress / timed-out Failed) in ShouldQuerySiteBeforeRedeploy, so reading the wrong tied record could alter reconciliation behavior. The change is regression-free (a read-only ordering refinement that is a no-op when DeployedAt differs), the InstanceId filter is index-backed, and both a repository-level test and a SiteRuntime regression test cover the exact same-tick scenario. No new issues introduced. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | ThenByDescending(d => d.Id) is a valid deterministic tiebreaker; Id is auto-increment identity PK so highest Id == most recent insert. Resolves the same-DeployedAt ambiguity correctly. No issues found. | +| 2 | Akka.NET conventions | ☑ | Pure EF repository read; no actor/messaging code involved. N/A to this change. No issues found. | +| 3 | Concurrency & thread safety | ☑ | Stateless repository read over scoped DbContext; the change addresses a concurrency-induced tie (rapid redeploy / timed-out attempt) deterministically. No shared mutable state. No issues found. | +| 4 | Error handling & resilience | ☑ | No new error paths; FirstOrDefaultAsync nullable result preserved and handled by caller (prior == null guard). No issues found. | +| 5 | Security | ☑ | No secrets, no user input, fully parameterized LINQ-to-SQL. No injection or redaction surface. No issues found. | +| 6 | Performance & resource management | ☑ | InstanceId is indexed (HasIndex line 41); secondary integer sort on an already-filtered set is negligible. No extra round-trips or allocations. No issues found. | +| 7 | Design-document adherence | ☑ | Supports optimistic-concurrency/insert-only deployment-record model; DeploymentManager-026 is an inline decision tag consistent with existing CD-017/WP-4/WP-8 convention. No spec drift. No issues found. | +| 8 | Code organization & conventions | ☑ | Matches surrounding style; sibling list methods (GetAll/GetByInstance) lack the same tiebreaker but their consumers do not depend on single-row determinism — not a defect. No issues found. | +| 9 | Testing coverage | ☑ | RepositoryCoverageTests adds GetCurrentDeploymentStatus_SameTickRecords_DeterministicTiebreakerPicksHighestId; SiteRuntime adds SR029 regression test. Exact scenario covered. No issues found. | +| 10 | Documentation & comments | ☑ | Inline comment accurately explains the tie scenario, the SQL Server undefined-ordering risk, and the chosen resolution. Accurate and proportionate. No issues found. | + +_No new findings — the changes in this module are clean._ diff --git a/code-reviews/DataConnectionLayer/findings.md b/code-reviews/DataConnectionLayer/findings.md index 9eddae7f..7c7647bc 100644 --- a/code-reviews/DataConnectionLayer/findings.md +++ b/code-reviews/DataConnectionLayer/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer` | | Design doc | `docs/requirements/Component-DataConnectionLayer.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | -| Open findings | 0 | +| Commit reviewed | `1f9de8a2` | +| Open findings | 1 | ## Summary @@ -1504,3 +1504,47 @@ decided explicitly. **Resolution** Resolved 2026-06-20 (commit `fd618cf1`): the `_alarmSourceFilter` comment corrected from 'first subscriber wins' to 'last subscriber wins' (and notes co-subscribers share the single filter). + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** Two remediation fixes. (1) DCL-023 in DataConnectionActor.cs: the native-alarm subscribe path now inherits the DCL-021 obsolete-completion guard — HandleAlarmSubscribeCompleted releases the just-created adapter feed when no subscriber remains (last subscriber unsubscribed mid-flight), and HandleUnsubscribeAlarms now clears the _alarmSubscribesInFlight marker so a late completion is recognised as orphaned. (2) DCL-025 in MxGatewayDataConnection.cs: DisposeAsync now tears down the alarm stream (cancel/dispose _alarmCts, reset _alarmSubCount) under _alarmLock, mirroring DisconnectAsync, because the actor disposes adapters fire-and-forget without first calling DisconnectAsync. A focused regression test (DCL023_...) and a comment-only doc tweak on _alarmSourceFilter were added. + +**Verdict:** Both fixes target genuine resource-leak bugs and are largely well-constructed: DCL-025 correctly closes a real MxGateway alarm-stream/CTS leak on the fire-and-forget DisposeAsync path (lock-guarded, idempotent, and OPC UA correctly does not share the leak since its client owns the alarm subscription), and DCL-023 correctly handles the documented "last subscriber unsubscribed while subscribe in flight" case with good test coverage. However, the DCL-023 change introduces a new, narrower race: clearing _alarmSubscribesInFlight on unsubscribe re-opens the in-flight gate, so a re-subscribe to the same source arriving before the late completion fires causes a second concurrent adapter subscribe whose first subscription id is then silently overwritten and leaked. The tag-path completion handler guards against exactly this overwrite (line 908); the alarm completion handler does not. Project builds clean with no warnings. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | DCL-023 guard logic is correct for the documented case, but newly enables a double-subscribe leak on resubscribe-during-orphan because HandleAlarmSubscribeCompleted overwrites _alarmSubscriptionIds with no already-present guard. See finding. | +| 2 | Akka.NET conventions | ☑ | State mutated only on the actor thread; adapter subscribe uses ContinueWith+PipeTo (no captured Self/Sender misuse); subscriber/Self/generation captured into locals before the async continuation. Compliant. | +| 3 | Concurrency & thread safety | ☑ | DCL-025 teardown is correctly serialized via _alarmLock against concurrent Subscribe/UnsubscribeAlarmsAsync; cancel-after-dispose-and-null is safe. The DCL-023 ordering hazard is logical (single-threaded actor), not a data race; covered under Correctness. | +| 4 | Error handling & resilience | ☑ | Orphaned-feed release is fire-and-forget (acceptable, best-effort cleanup); failed/late completions handled; warning logged. No issues. | +| 5 | Security | ☑ | No secrets, SQL/LDAP, or untrusted input in the delta; no script-trust-forbidden APIs introduced. No issues found. | +| 6 | Performance & resource management | ☑ | DCL-025 fixes a real CTS/Task leak on MxGateway failover/stop. DCL-023 fixes the original orphan leak but leaves a residual leak path (the resubscribe-overwrite case in the finding). | +| 7 | Design-document adherence | ☑ | Component-DataConnectionLayer.md ref-counted 'open once, tear down on last unsubscribe' invariant is honoured; fixes reinforce it. No doc drift introduced. | +| 8 | Code organization & conventions | ☑ | Fixes are localized, follow the established DCL-0xx tagged-comment convention, and mirror the existing DCL-021 / DisconnectAsync patterns. Consistent. | +| 9 | Testing coverage | ☑ | New DCL023 test exercises the fixed path well. Gaps: no test for the resubscribe-during-orphan double-subscribe race (the new finding), and no test for DCL-025 DisposeAsync alarm teardown. | +| 10 | Documentation & comments | ☑ | Comments on both fixes are accurate and explain intent/cross-references. The _alarmSourceFilter doc update (first→last subscriber wins) correctly matches the unconditional overwrite in HandleSubscribeAlarms. | + +**New findings from this re-review (1):** + +### DataConnectionLayer-029 — Resubscribe during orphaned in-flight subscribe leaks an alarm feed + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1823` | + +**Description** + +The DCL-023 fix clears _alarmSubscribesInFlight in HandleUnsubscribeAlarms (line 1910). This re-opens the in-flight gate, so the following interleaving is now reachable: (1) Subscribe A for source S starts adapter subscribe #1, in-flight={S}; (2) the last subscriber unsubscribes — HandleUnsubscribeAlarms removes the subscriber set and clears in-flight (no subscription id stored yet, so no teardown); (3) a new Subscribe B for the same S arrives before completion #1 fires — HandleSubscribeAlarms sees neither _alarmSubscriptionIds.ContainsKey(S) nor _alarmSubscribesInFlight.Contains(S) (both cleared), so it issues a SECOND concurrent adapter subscribe #2; (4) completion #1 fires: _alarmSourceSubscribers now contains S again (B re-added it), so the orphan guard at line 1806 does NOT trigger and line 1823-1825 stores _alarmSubscriptionIds[S]=subId#1; (5) completion #2 fires and overwrites _alarmSubscriptionIds[S]=subId#2 — subId#1 is now leaked (never unsubscribed) and, since the adapter generation is unchanged, feed #1 keeps streaming routable transitions (duplicate delivery + resource leak). Pre-DCL-023, step 3 short-circuited because the in-flight marker survived, so the second subscribe was never issued — this leak path is newly introduced by the delta. Note the analogous tag-path completion handler guards against exactly this overwrite at line 908 (`if (result.AlreadySubscribed || _subscriptionIds.ContainsKey(result.TagPath)) continue;`); the alarm completion handler has no equivalent guard. + +**Recommendation** + +In HandleAlarmSubscribeCompleted, before storing the subscription id, guard against an already-present feed: if _alarmSubscriptionIds.ContainsKey(msg.SourceReference) (or a concurrent in-flight subscribe re-added the marker), release the just-created subId via UnsubscribeAlarmsAsync instead of overwriting — mirroring the tag-path re-check at line 908. Add a regression test that fires unsubscribe then a fresh subscribe for the same source while the first adapter subscribe is still parked, then releases both, asserting exactly one feed survives and the extra one is released. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/DeploymentManager/findings.md b/code-reviews/DeploymentManager/findings.md index f2ce0738..35bc523b 100644 --- a/code-reviews/DeploymentManager/findings.md +++ b/code-reviews/DeploymentManager/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.DeploymentManager` | | Design doc | `docs/requirements/Component-DeploymentManager.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | +| Commit reviewed | `1f9de8a2` | | Open findings | 0 | ## Summary @@ -1438,3 +1438,26 @@ it. **Resolution** Resolved 2026-06-20 (commit `fd618cf1`): the artifact tests no longer assert the forbidden notification/SMTP shipping — flipped `Received(1)` → `DidNotReceive()` and added assertions that the shipped command's NotificationLists/SmtpConfigurations are null. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** The diff removes the notification-list/SMTP artifact path from ArtifactDeploymentService (DeploymentManager-025): the INotificationRepository is no longer stored or queried (discarded via `_ = notificationRepo`), FetchGlobalArtifactsAsync no longer fetches notification lists or SMTP configs (and drops their mapping plus the two GlobalArtifactSnapshot record fields), and BuildDeployArtifactsCommandAsync now always sends `NotificationLists: null`/`SmtpConfigurations: null` on the per-site DeployArtifactsCommand. XML docs and inline comments were rewritten to document the central-only invariant. + +**Verdict:** The change is sound, regression-free, and brings the code into alignment with the design spec rather than away from it. Component-DeploymentManager.md (lines 142-145) explicitly states notification lists and SMTP configuration are NOT deployable artifacts and that no SMTP credential is ever distributed to a site — the prior code violated this; the fix corrects it. The end-to-end invariant holds: the DeployArtifactsCommand fields are retained for additive message-contract compatibility, and the SiteRuntime apply path (SiteReplicationActor / DeploymentManagerActor) ignores the now-null fields and actively purges any pre-fix plaintext-SMTP rows. The DeploymentManager project builds with 0 warnings/errors and all 8 ArtifactDeploymentService tests pass, including new assertions that the notification repo is never called and that every per-site command carries null notification/SMTP fields. No new issues introduced. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Mixed positional/named constructor args bind correctly (build verified); fields correctly forced null. No issues found. | +| 2 | Akka.NET conventions | ☑ | No actor code touched; this is a plain async service. Site-side apply path confirmed to ignore null fields. No issues found. | +| 3 | Concurrency & thread safety | ☑ | Per-site commands still built sequentially before parallel dispatch (DbContext not thread-safe comment intact); change does not alter concurrency model. No issues found. | +| 4 | Error handling & resilience | ☑ | Removed code carried no error handling; surrounding try/catch + per-site timeout unchanged. No issues found. | +| 5 | Security | ☑ | Positive security improvement: plaintext SMTP credentials are no longer mapped into artifacts or distributed to sites; site side purges pre-fix rows. No issues found. | +| 6 | Performance & resource management | ☑ | Drops two repository round-trips from the global fetch; comment math (≈ N + M·N) updated accordingly. No leaks. No issues found. | +| 7 | Design-document adherence | ☑ | Now matches Component-DeploymentManager.md:142-145 and CLAUDE.md 'notification delivery is central-only'; corrects a prior spec violation. No drift. | +| 8 | Code organization & conventions | ☑ | Unused ctor param retained for DI/source compat and explicitly discarded with a documenting comment; GlobalArtifactSnapshot trimmed cleanly. No issues found. | +| 9 | Testing coverage | ☑ | Tests updated to DidNotReceive() on both notification queries plus Assert.All null-field checks on recorded commands; 8/8 pass. Good coverage of the delta. | +| 10 | Documentation & comments | ☑ | XML/inline comments thoroughly updated to explain the central-only invariant and additive-compat field retention; design doc consistent. No stale references. | + +_No new findings — the changes in this module are clean._ diff --git a/code-reviews/ExternalSystemGateway/findings.md b/code-reviews/ExternalSystemGateway/findings.md index 6bcf263f..f80ed25f 100644 --- a/code-reviews/ExternalSystemGateway/findings.md +++ b/code-reviews/ExternalSystemGateway/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.ExternalSystemGateway` | | Design doc | `docs/requirements/Component-ExternalSystemGateway.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | +| Commit reviewed | `1f9de8a2` | | Open findings | 0 | ## Summary @@ -1507,3 +1507,26 @@ whatever scheme is chosen. **Resolution** Resolved 2026-06-20 (commit `fd618cf1`): renumbered the inline `// [n]` case labels in `ExecuteWriteAsync` to match the method header's rationale list (cancellation=[1], SqlException=[2], non-Sql=[3]), and realigned the three corresponding labels in the test file. Comment-only. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** In DatabaseGateway.ExecuteWriteAsync, the catch (SqlException) block now calls cancellationToken.ThrowIfCancellationRequested() before SqlErrorClassifier.Throw(...), so a caller-token cancel that surfaces from the SQL driver as a SqlException (mid-flight cancel, e.g. error number 0/3980) propagates as OperationCanceledException instead of being reclassified as a permanent DB error. The three inline case labels were also renumbered ([2]->[1] cancellation, [1]->[3] non-Sql transient, new [2] for the SqlException branch) to match the method header list and the test labels. A new regression test (CachedWrite_CancellationSurfacingAsSqlException_PropagatesCanceled_NotReclassifiedPermanent) fabricates a cancel-shaped SqlException via reflection and asserts OperationCanceledException propagates with nothing buffered. + +**Verdict:** The change is small, correct, and regression-free. It implements exactly the fix prescribed for ExternalSystemGateway-025 (and its sibling comment-ordering finding -026), matching the recommendation verbatim and mirroring the HTTP path's cancel-not-reclassified contract. Catch ordering is sound: the new ThrowIfCancellationRequested sits in the dedicated SqlException branch ahead of SqlErrorClassifier.Throw, so a cancel-shaped SqlException now propagates as OperationCanceledException while the non-cancel path is a no-op and classifies exactly as before. The project builds clean (0 warnings) and the targeted cancellation/outage tests pass (7/7). No new issues found. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | ThrowIfCancellationRequested correctly placed before SqlErrorClassifier.Throw; no-op when token not cancelled, so the classify path is unchanged. No issues found. | +| 2 | Akka.NET conventions | ☑ | DatabaseGateway is a plain service, not an actor; no actor-state/Tell/Ask/closure-capture surface touched by the change. No issues found. | +| 3 | Concurrency & thread safety | ☑ | No shared mutable state introduced; the change reads the caller's CancellationToken only. No issues found. | +| 4 | Error handling & resilience | ☑ | This IS the change: caller cancel now wins over permanent-error reclassification regardless of driver exception shape; transient-vs-permanent classification for S&F is otherwise preserved. No issues found. | +| 5 | Security | ☑ | Error context still uses the connection NAME, never the connection string; no credential leakage. SQL still parameterized in RunSqlAsync. No issues found. | +| 6 | Performance & resource management | ☑ | RunSqlAsync still disposes connection/command via await using/using; the added token check adds no allocation or resource cost. No issues found. | +| 7 | Design-document adherence | ☑ | Component-ExternalSystemGateway.md transient/permanent contract intact; doc need not enumerate driver exception shapes, so no doc drift. No issues found. | +| 8 | Code organization & conventions | ☑ | Single classification seam (ExecuteWriteAsync) preserved; catch order mirrors ExternalSystemClient.InvokeHttpAsync. No issues found. | +| 9 | Testing coverage | ☑ | New regression test drives a fabricated cancel-shaped SqlException through the production classification via the RawExecuteStubGateway seam and asserts OperationCanceledException with zero buffered rows; 7/7 targeted tests pass. No issues found. | +| 10 | Documentation & comments | ☑ | Inline [1]/[2]/[3] labels now align with the method header list and the test file labels, resolving the prior mis-ordering. No issues found. | + +_No new findings — the changes in this module are clean._ diff --git a/code-reviews/Host/findings.md b/code-reviews/Host/findings.md index fccbb7f1..1e52bba7 100644 --- a/code-reviews/Host/findings.md +++ b/code-reviews/Host/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.Host` | | Design doc | `docs/requirements/Component-Host.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | +| Commit reviewed | `1f9de8a2` | | Open findings | 0 | ## Summary @@ -1320,3 +1320,26 @@ the matrix, and note KpiHistory's recorder singleton is intentionally absent fro **Resolution** Resolved 2026-06-20 (commit `fd618cf1`): added `KpiHistory` and `Transport` rows to the Component Registration Matrix in Component-Host.md (both Central=Yes/Site=No/DI=Yes), reflecting their actual central-only registration. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** A single new `ProjectReference` to `ZB.MOM.WW.ScadaBridge.KpiHistory.csproj` was added to the Host's `.csproj` (commit fd618cf1, the review-remediation pass). This makes the Host's pre-existing compile-time dependency on KpiHistory (used in `Program.cs` via `AddKpiHistory` and in `AkkaHostedService.cs` via `KpiHistoryOptions` + the `KpiHistoryRecorderActor` cluster singleton) explicit rather than relying on a transitive reference through CentralUI. + +**Verdict:** The change is a one-line build-graph fix and it is correct, minimal, and regression-free. The Host directly consumes KpiHistory public types in Program.cs and AkkaHostedService.cs, so the explicit direct reference is necessary (not redundant) and removes the latent fragility of depending on a transitive project reference via CentralUI. There is no duplicate reference, the KpiHistory project is a solution member, and the Host project builds cleanly with 0 warnings under TreatWarningsAsErrors. The actor wiring it enables (kpi-history-recorder singleton, not readiness-gated, with a CoordinatedShutdown drain) matches Component-KpiHistory.md and CLAUDE.md design intent. No issues found in the delta. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Reference resolves and is required by direct type usage in Program.cs/AkkaHostedService.cs; build succeeds. No issues found. | +| 2 | Akka.NET conventions | ☑ | Enabled wiring (KpiHistoryRecorderActor singleton + proxy + PhaseClusterLeave GracefulStop drain) follows the established singleton pattern; not readiness-gated per design. No issues found. | +| 3 | Concurrency & thread safety | ☑ | csproj change introduces no concurrency surface; actor opens its own per-tick DI scope as documented. No issues found. | +| 4 | Error handling & resilience | ☑ | Graceful-stop drain wraps GracefulStop in try/catch with PoisonPill fallback; unchanged by this diff. No issues found. | +| 5 | Security | ☑ | No secrets, network, or trust-boundary surface in a project reference. No issues found. | +| 6 | Performance & resource management | ☑ | Build-graph-only change; KpiHistory is a lightweight project (Akka + DI/Options). No runtime cost added. No issues found. | +| 7 | Design-document adherence | ☑ | KpiHistory is component #26, central-only, not readiness-gated; the reference and the singleton wiring it enables match Component-KpiHistory.md and CLAUDE.md. No issues found. | +| 8 | Code organization & conventions | ☑ | Reference placed alongside other ScadaBridge ProjectReferences in logical order; no duplicate; KpiHistory is a slnx member. No issues found. | +| 9 | Testing coverage | ☑ | A build-graph fix is exercised by solution compilation and the KpiHistory.Tests project; no Host-specific test is warranted for a csproj reference. No issues found. | +| 10 | Documentation & comments | ☑ | Existing Program.cs/AkkaHostedService.cs comments accurately describe the KPI History wiring this reference supports. No issues found. | + +_No new findings — the changes in this module are clean._ diff --git a/code-reviews/InboundAPI/findings.md b/code-reviews/InboundAPI/findings.md index e623e5b8..1750f284 100644 --- a/code-reviews/InboundAPI/findings.md +++ b/code-reviews/InboundAPI/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.InboundAPI` | | Design doc | `docs/requirements/Component-InboundAPI.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | -| Open findings | 0 | +| Commit reviewed | `1f9de8a2` | +| Open findings | 1 | ## Summary @@ -1633,3 +1633,47 @@ databases."), mirroring how `RouteAccessor` throws on cross-site routing. Regres `InboundScript_Database_DiagnosesClean` (CentralUI.Tests) drives a script using all three `Database` methods through `ScriptAnalysisService.Diagnose(... ScriptKind.InboundApi)` and asserts no `CS`/`SCADA` markers — alongside the existing `InboundScript_WaitForAttribute_DiagnosesClean`. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** InboundDatabaseHelper was migrated from blocking ADO.NET (.GetAwaiter().GetResult() + sync ExecuteScalar/ExecuteReader) to a fully async path (await using conn/cmd/reader; token forwarded to ExecuteScalarAsync/ExecuteReaderAsync/ReadAsync), gained an ExecuteAsync write method (InboundAPI-026) and a CommandTimeout backstop derived from the method timeout (InboundAPI-027). RouteHelper/RouteTarget gained a separate request-abort token (WithRequestAborted), and WaitForAttribute was re-bounded by its own wait timeout (+ a 5s WaitResponseGrace backstop) plus the abort/explicit tokens, deliberately excluding the method deadline (InboundAPI-029, spec §6). InboundScriptExecutor wires the method timeout into the DB helper and threads the raw request-abort token into the route helper. + +**Verdict:** The remediation is largely sound and well-tested. The DB-helper async migration is correct end-to-end: GetConnectionAsync returns an already-opened connection, await using disposes connection/command/reader properly, the deadline token reaches every async DB call, the CommandTimeout backstop is derived safely (Math.Ceiling, only set when positive), writes are added per spec, and SQL-injection protection via parameter binding is preserved and regression-tested. RouteHelper builder composition is intact and spec §6 (wait bounded by its own timeout, not the method deadline, still cancellable by client disconnect) is honoured and well covered. The one substantive concern is in WaitForAttribute: the new local 5-second grace backstop is tighter than the 30s IntegrationTimeout round-trip slack the CommunicationService layer already budgets for the same Ask, so a slow-but-legitimate site response can be cancelled locally and surface to the script as a thrown exception instead of the spec-mandated clean false. The InboundAPI project builds clean and all 40 changed-area tests pass. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | WaitForAttribute local 5s grace backstop is shorter than the 30s IntegrationTimeout round-trip slack the CommunicationService Ask budgets; a slow site response gets cancelled locally and thrown instead of returning false. See finding. | +| 2 | Akka.NET conventions | ☑ | No actor code changed; RouteTarget delegates to IInstanceRouter/CommunicationService which Asks at the cluster boundary. No captured sender/this, no shared mutable actor state. No issues. | +| 3 | Concurrency & thread safety | ☑ | RouteHelper/RouteTarget are immutable (readonly fields, builder returns new instances); InboundDatabaseHelper is per-execution and not shared. Per-wait CTS created and disposed locally. No issues. | +| 4 | Error handling & resilience | ☑ | Null-gateway throws InvalidOperationException on first use (tested); GetConnectionAsync disposes on open failure. The grace-backstop converting a clean false into a thrown OperationCanceledException is the resilience gap noted in the finding. | +| 5 | Security | ☑ | SQL injection closed via parameter binding (regression-tested with an injection payload); named connections only, no arbitrary connection string; scripts never touch System.Data. No secret logging introduced. No issues. | +| 6 | Performance & resource management | ☑ | Blocking .GetAwaiter().GetResult() removed (no pool-thread blocking); await using disposes connection/command/reader; per-wait CTS and linked CTS both disposed via using. No issues. | +| 7 | Design-document adherence | ☑ | Component-InboundAPI.md Database Access (async, reads+writes, named connections, parameter binding, CommandTimeout backstop) and Routing Behavior §6 (wait bounded by its own timeout, abort still cancels) match the code. Doc is in sync. | +| 8 | Code organization & conventions | ☑ | CreateCommand factory extracted to dedupe; builder pattern consistent across With* methods; UTC timestamps (DateTimeOffset.UtcNow) preserved on routed requests. No issues. | +| 9 | Testing coverage | ☑ | Strong: reads/writes/injection/null-gateway/execute-path cancellation for the DB helper; wait-not-bounded-by-deadline, explicit-token, client-disconnect for the route. Gap: no test asserts behaviour when the local grace backstop fires before a slow site response (the finding scenario). | +| 10 | Documentation & comments | ☑ | XML docs and inline rationale (InboundAPI-026/027/029) are thorough and accurate. The grace-backstop comment overstates safety relative to the 30s IntegrationTimeout slack it sits inside, but that is documentation of the finding, not a separate issue. | + +**New findings from this re-review (1):** + +### InboundAPI-031 — Wait grace backstop tighter than the layer's round-trip slack + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ZB.MOM.WW.ScadaBridge.InboundAPI/RouteHelper.cs:266` | + +**Description** + +RouteTarget.WaitForAttribute builds a local backstop CTS of `timeout + WaitResponseGrace` (5s) and links it into the token passed to RouteToWaitForAttributeAsync (RouteHelper.cs:112, 266-269). The layer beneath — CommunicationService.RouteToWaitForAttributeAsync (CommunicationService.cs:601) — already bounds the same cluster Ask by `request.Timeout + IntegrationTimeout`, and IntegrationTimeout defaults to 30s (CommunicationOptions.cs:22). So the local backstop fires 25s earlier than the round-trip budget the lower layer deliberately allows. The site enforces the wait timeout and returns Matched=false; if that response is delivered between `timeout+5s` and `timeout+30s` (e.g. under cluster load, GC pause, or network latency — well within the slack the lower layer budgets), the local linked CTS cancels the in-flight Ask, which throws OperationCanceledException/TaskCanceledException straight through CommunicationServiceInstanceRouter (no try/catch) to the script — converting a clean, spec-mandated `false` (spec §6) into an unhandled exception/500. The code comment at RouteHelper.cs:107-111 claims the grace 'must sit slightly LATER than the wait timeout … must not pre-empt the site's own timed-out response', but 5s is shorter than the 30s the layer itself permits for that very response, so the backstop does exactly what the comment says it must not. + +**Recommendation** + +Align the local backstop with the round-trip slack the CommunicationService Ask uses (i.e. derive the grace from CommunicationOptions.IntegrationTimeout rather than a hard-coded 5s), or drop the local backstop entirely and rely on the CommunicationService Ask's own `timeout + IntegrationTimeout` bound (the request-abort and explicit caller tokens can still be linked in). Either way, ensure the local cancellation cannot fire before the lower layer would have delivered the site's timed-out response. Add a test that delays the substitute router's response past the grace but within the round-trip budget and asserts the call still returns false rather than throwing. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/KpiHistory/findings.md b/code-reviews/KpiHistory/findings.md index 8582a58c..dc3cf5b9 100644 --- a/code-reviews/KpiHistory/findings.md +++ b/code-reviews/KpiHistory/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.KpiHistory` | | Design doc | `docs/requirements/Component-KpiHistory.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | +| Commit reviewed | `1f9de8a2` | | Open findings | 0 | ## Summary @@ -305,3 +305,26 @@ from KpiHistory-003. **Resolution** Resolved 2026-06-20 (commit `fd618cf1`): corrected the `KpiSeriesBucketer` param XML doc — dropped the false 'largest-timestamp-wins / stable ties' claim; now states it requires ascending-sorted input and selects last-in-iteration otherwise. Doc-only. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** The diff adds an in-flight guard (`_sampleInFlight` bool field) to `KpiHistoryRecorderActor` so overlapping sample passes can never run concurrently. `HandleSampleTick` now skips (coalesces) a `SampleTick` and logs at debug if a pass is already in flight; otherwise it raises the guard before launching the off-thread `RunSamplePass`. The `SampleComplete` receive handler lowers the guard (it fires on both success and fault paths via the PipeTo projection). XML doc comments were updated accordingly. Tests: a new deterministic regression test (`OverlappingTick_WhileFirstPassInFlight_DoesNotStartSecondPass`) uses a gated repository to prove the second tick is skipped and the guard correctly resets; the pre-existing recovery test was hardened to re-send the tick per poll to avoid racing the asynchronous guard reset. + +**Verdict:** The change is sound, minimal, and regression-free. It faithfully mirrors the already-shipped `NotificationOutboxActor` dispatch in-flight-guard pattern (skip-if-busy, raise-before-launch, lower-on-PipeTo-completion). The guard field is read and written only on the actor thread, so there is no thread-safety hazard; it cannot wedge permanently because `RunSamplePass` never throws and both PipeTo success and failure projections emit `SampleComplete`. The skip-on-overlap behaviour is consistent with the design doc, which describes best-effort per-tick sampling with no strict "a sample must land every interval" guarantee. The new behaviour is covered by a deterministic regression test, and all 4 actor tests pass. No new issues found. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Guard raised before launch, lowered on both success+fault PipeTo paths; cannot wedge (RunSamplePass never throws). 1:1 raise/lower pairing — no double-lower. No issues found. | +| 2 | Akka.NET conventions | ☑ | Off-thread work via PipeTo to Self; guard mutated only on the actor thread in the SampleComplete handler. Matches the documented NotificationOutboxActor pattern. No issues found. | +| 3 | Concurrency & thread safety | ☑ | _sampleInFlight is touched only on the actor thread (HandleSampleTick + SampleComplete receive); no shared mutable state or captured this/sender in the off-thread pass. No issues found. | +| 4 | Error handling & resilience | ☑ | Faulted pass still produces SampleComplete (belt-and-braces failure projection), so the guard always clears; best-effort observability contract preserved. No issues found. | +| 5 | Security | ☑ | No new I/O, no secrets, no user input, no injection surface introduced. Debug-level skip log carries no sensitive data. No issues found. | +| 6 | Performance & resource management | ☑ | The guard's purpose is precisely to avoid load amplification on a slow/recovering DB by preventing overlapping passes. No new allocations or leaks; field is reset on restart. No issues found. | +| 7 | Design-document adherence | ☑ | Component-KpiHistory.md describes best-effort per-tick sampling with no strict per-interval landing guarantee; coalescing overlaps is consistent and the doc already says the recorder mirrors NotificationOutboxActor. No drift. | +| 8 | Code organization & conventions | ☑ | Single new private bool field with thorough XML doc; inline comments accurate. Consistent with the sibling actor's naming/structure. No issues found. | +| 9 | Testing coverage | ☑ | New deterministic gated-repository regression test pins one-pass-per-tick and guard reset; existing recovery test hardened against the async-reset race. All 4 tests pass. Coverage is adequate for the delta. | +| 10 | Documentation & comments | ☑ | Field, handler, and SampleComplete XML docs updated to explain the guard, the enqueue-not-coalesce timer rationale, and the lower-on-fault behaviour. Accurate and clear. No issues found. | + +_No new findings — the changes in this module are clean._ diff --git a/code-reviews/ManagementService/findings.md b/code-reviews/ManagementService/findings.md index 15ad9c58..82e98dd1 100644 --- a/code-reviews/ManagementService/findings.md +++ b/code-reviews/ManagementService/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.ManagementService` | | Design doc | `docs/requirements/Component-ManagementService.md` | | Status | Reviewed | -| Last reviewed | 2026-06-19 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `d6ead8ae` | +| Commit reviewed | `1f9de8a2` | | Open findings | 0 (1 Deferred — see ManagementService-012) | ## Summary @@ -1216,3 +1216,26 @@ Make the comments accurate after fixing the gating. Resolved 2026-06-19 (commit `cd8e4872`): with both update commands moved to Administrator (MS-024), the SMTP "Admin-only" comment is now accurate and the SMS list comment was updated to "Admin-only". + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** Three remediation changes to ManagementActor.cs: (1) MgmtSvc-021 moved UpdateSmtpConfigCommand and UpdateSmsConfigCommand from the Designer role gate to the Administrator gate in GetRequiredRole; (2) MgmtSvc-021 changed HandleUpdateSmsConfig to treat an empty/whitespace AuthToken as "omitted" (IsNullOrWhiteSpace guard) instead of clearing the stored Twilio secret; (3) Security-023 added a ValidateMappingRole helper that rejects any LDAP-group-mapping Role not in the canonical Roles.All set (case-insensitive membership, verbatim storage) at the create/update write paths. + +**Verdict:** The changed code is clean, correct, and regression-free. All three fixes are sound: the SMTP/SMS admin gate now matches the design doc (Component-NotificationService.md line 19: "Admin role (SMTP and SMS configuration)") and the RequireAdmin UI pages, closing a privilege gap where a Designer could rotate production credentials via CLI/API. The empty-AuthToken guard correctly prevents silent secret wipe while preserving the intentionally-asymmetric null-only guard on SMTP Credentials (which may be cleared for anonymous SMTP). The role-mapping validation rejects non-canonical roles before any DB write via a clean COMMAND_FAILED client error (ManagementCommandException), and its case-insensitive membership / verbatim-storage choice flows correctly through RoleMapper which compares roles case-insensitively. Test coverage is thorough and matches each change point precisely. No new issues introduced. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | IsNullOrWhiteSpace guard correctly preserves the stored AuthToken on blank input; ValidateMappingRole rejects before DB write; gate reassignment is internally consistent (commands removed from Designer block, added to Administrator block, no duplication). No issues found. | +| 2 | Akka.NET conventions | ☑ | No actor-model changes; handlers remain static async returning into the existing PipeTo/error envelope path. ManagementCommandException surfaces as ManagementError COMMAND_FAILED, not an internal-detail leak. No issues found. | +| 3 | Concurrency & thread safety | ☑ | ValidateMappingRole is a pure static method; no shared mutable state, no captured sender/this. Scoped repos resolved per-command as before. No issues found. | +| 4 | Error handling & resilience | ☑ | Validation throws a curated ManagementCommandException whose message is safe to surface verbatim (no server/DB internals); falls into the existing finding-016 client-safe error path. No issues found. | +| 5 | Security | ☑ | Core intent of the diff: closes a privilege-escalation gap (SMTP/SMS secret rotation now Admin-only, matching UI RequireAdmin), prevents accidental secret wipe, and rejects non-canonical role claims. AuthToken still projected away in responses/audit. No issues found. | +| 6 | Performance & resource management | ☑ | Roles.All.Contains over a 6-element array per write is negligible; no new allocations on hot paths, no IDisposable/stream lifetimes touched. No issues found. | +| 7 | Design-document adherence | ☑ | Component-NotificationService.md line 19 explicitly mandates Admin role for SMTP/SMS config; the gate change aligns code with spec (prior Designer gate was the drift). Role-canonicalization matches Roles.cs source-of-truth intent. No stale doc identified. No issues found. | +| 8 | Code organization & conventions | ☑ | ValidateMappingRole factored once and reused by both create/update handlers; comments are accurate and cite the asymmetry rationale with SmtpConfiguration. Consistent with surrounding style. No issues found. | +| 9 | Testing coverage | ☑ | New RoleMappingValidationTests (unknown role, misspelled canonical, no-row asserts, canonical success with verbatim casing) plus ManagementActorTests deltas covering Admin-positive, Designer/Viewer-rejection, and empty-AuthToken preservation. Each change point is directly exercised. No issues found. | +| 10 | Documentation & comments | ☑ | MgmtSvc-020 comment updated from 'Designer-gated' to 'Admin-only' to match the new gate; MgmtSvc-021 and Security-023 comments accurately describe rationale and explicitly scope out the deferred case-sensitivity asymmetry. No stale comments. No issues found. | + +_No new findings — the changes in this module are clean._ diff --git a/code-reviews/README.md b/code-reviews/README.md index 49c1c4da..12bed095 100644 --- a/code-reviews/README.md +++ b/code-reviews/README.md @@ -41,38 +41,38 @@ module file and counted in **Total**. |----------|---------------| | Critical | 0 | | High | 0 | -| Medium | 0 | -| Low | 0 | -| **Total** | **0** | +| Medium | 4 | +| Low | 4 | +| **Total** | **8** | ## Module Status | Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total | |--------|---------------|--------|----------------|------|-------| -| [AuditLog](AuditLog/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 16 | +| [AuditLog](AuditLog/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/1 | 1 | 17 | | [CLI](CLI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 24 | -| [CentralUI](CentralUI/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 36 | +| [CentralUI](CentralUI/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/1 | 1 | 37 | | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 15 | -| [Commons](Commons/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 | +| [Commons](Commons/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/0 | 0 | 26 | | [Communication](Communication/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 24 | -| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 27 | -| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 | -| [DeploymentManager](DeploymentManager/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 | -| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 | +| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/0 | 0 | 27 | +| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/1/0 | 1 | 27 | +| [DeploymentManager](DeploymentManager/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/0 | 0 | 27 | +| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/0 | 0 | 26 | | [HealthMonitoring](HealthMonitoring/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 | -| [Host](Host/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 26 | -| [InboundAPI](InboundAPI/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 30 | -| [KpiHistory](KpiHistory/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 6 | -| [ManagementService](ManagementService/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 26 | +| [Host](Host/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/0 | 0 | 26 | +| [InboundAPI](InboundAPI/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/1/0 | 1 | 31 | +| [KpiHistory](KpiHistory/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/0 | 0 | 6 | +| [ManagementService](ManagementService/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/0 | 0 | 26 | | [NotificationOutbox](NotificationOutbox/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 13 | | [NotificationService](NotificationService/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 28 | -| [ScriptAnalysis](ScriptAnalysis/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 8 | +| [ScriptAnalysis](ScriptAnalysis/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/1 | 1 | 9 | | [Security](Security/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 | -| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 9 | -| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 | -| [SiteRuntime](SiteRuntime/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 31 | -| [StoreAndForward](StoreAndForward/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 27 | -| [TemplateEngine](TemplateEngine/findings.md) | 2026-06-20 | `4307c381` | 0/0/0/0 | 0 | 25 | +| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/0 | 0 | 9 | +| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/0 | 0 | 27 | +| [SiteRuntime](SiteRuntime/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/1/1 | 2 | 33 | +| [StoreAndForward](StoreAndForward/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/1/0 | 1 | 28 | +| [TemplateEngine](TemplateEngine/findings.md) | 2026-06-24 | `1f9de8a2` | 0/0/0/0 | 0 | 25 | | [Transport](Transport/findings.md) | 2026-06-19 | `d6ead8ae` | 0/0/0/0 | 0 | 15 | ## Pending Findings @@ -90,10 +90,20 @@ _None open._ _None open._ -### Medium (0) +### Medium (4) -_None open._ +| ID | Module | Title | +|----|--------|-------| +| DataConnectionLayer-029 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Resubscribe during orphaned in-flight subscribe leaks an alarm feed | +| InboundAPI-031 | [InboundAPI](InboundAPI/findings.md) | Wait grace backstop tighter than the layer's round-trip slack | +| SiteRuntime-032 | [SiteRuntime](SiteRuntime/findings.md) | Deleting a disabled instance leaks the deployed count | +| StoreAndForward-028 | [StoreAndForward](StoreAndForward/findings.md) | StopAsync resets register-guard but not _bufferedCount, double-counting gauge on restart | -### Low (0) +### Low (4) -_None open._ +| ID | Module | Title | +|----|--------|-------| +| AuditLog-017 | [AuditLog](AuditLog/findings.md) | AuditLog-014 guard path not exercised by a test | +| CentralUI-037 | [CentralUI](CentralUI/findings.md) | RunInSandboxAsync doc comment omits new Database-throws behavior | +| ScriptAnalysis-009 | [ScriptAnalysis](ScriptAnalysis/findings.md) | Inline Pass 1 comment no longer reflects parameterized base references | +| SiteRuntime-033 | [SiteRuntime](SiteRuntime/findings.md) | Native-alarm doc stale re: per-condition eviction and cap return-to-normal | diff --git a/code-reviews/ScriptAnalysis/findings.md b/code-reviews/ScriptAnalysis/findings.md index 8a6a94ea..bf29888f 100644 --- a/code-reviews/ScriptAnalysis/findings.md +++ b/code-reviews/ScriptAnalysis/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis` | | Design doc | `docs/requirements/Component-ScriptAnalysis.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | -| Open findings | 0 | +| Commit reviewed | `1f9de8a2` | +| Open findings | 1 | ## Summary @@ -405,3 +405,47 @@ defence for the compile-running consumers). Won't Fix 2026-06-20: the static gate is intentionally defence-in-depth, not a runtime sandbox — this is the explicitly-documented design posture, not a defect. Recorded as a standing risk for visibility. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** The diff remediates prior finding ScriptAnalysis-001: when TRUSTED_PLATFORM_ASSEMBLIES is unavailable (single-file/AOT/trimmed host), BuildAnalysisReferences now folds new ForbiddenAnchorAssemblies (Process, Socket, File, Thread, Assembly, Marshal) into the minimal fallback so a bare forbidden type in an allowed namespace still resolves and is flagged, sets an observable AnalysisReferencesDegraded flag plus a Trace.TraceWarning, and refactors per-assembly add into TryAddAssembly. A new BuildMinimalFallbackReferences() helper and a FindViolations(code, baseReferences, extraReferences) overload let tests pin the degraded path; the extraReferences XML-doc was corrected to state extra references only widen resolution and can never whitelist a forbidden API. Test files added 11 adversarial cases (SA-003) covering TPA-fallback, extension methods, verbatim/Unicode-escape identifiers, unsafe blocks, and comment/string-literal false-positive guards. + +**Verdict:** The changed code is sound and regression-free. The SA-001 fix is correctly scoped: forbidden-API anchor assemblies are folded in only on the degraded (no-TPA) path and only into the validator's AnalysisReferences, never into DefaultReferences, so the RoslynScriptCompiler compile gate keeps rejecting forbidden types as undefined symbols (its independent second layer of defence is preserved). Static field-initialization order is correct (anchor lists declared before AnalysisReferences), the degradation is now loud, and the new FindViolations overload is fully backward-compatible with all four existing single-arg consumers. The 34 ScriptTrustValidator tests pass, including the 11 new adversarial cases, and the design doc (REQ-SA-2, TPA-fallback section) is fully in sync with the code. The only observations are minor: no test asserts the production non-degraded flag value, and one inline comment is slightly imprecise inside the parameterized overload — neither affects correctness or security. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Fallback enrichment, tpaAvailable detection (byPath.Count>0), and TryAddAssembly dedup logic are correct. Static init order (DefaultAssemblies/ForbiddenAnchorAssemblies before AnalysisReferences) is safe. No issues found. | +| 2 | Akka.NET conventions | ☑ | Module has no actors/Akka dependency by design; the change adds only static policy/validator members. Not applicable; no issues found. | +| 3 | Concurrency & thread safety | ☑ | AnalysisReferencesDegraded is set once during static initialization of AnalysisReferences (single-threaded type-init) and read-only thereafter; the new overload is stateless and per-call. No shared mutable state introduced. | +| 4 | Error handling & resilience | ☑ | TryAddAssembly swallows per-assembly read errors (fail-open on a single bad assembly, as before) and the degraded fallback is now LOUD via flag + Trace.TraceWarning rather than silent. Fail-safe posture preserved. | +| 5 | Security | ☑ | Core fix verified: anchors added only to AnalysisReferences on the degraded path, never to DefaultReferences, so the compile-gate's undefined-symbol defence is intact. Bare Process/Socket now caught on the minimal set (test-proven). extraReferences cannot produce a false allow. No new bypass introduced. | +| 6 | Performance & resource management | ☑ | Anchor enrichment adds at most 6 MetadataReference.CreateFromFile calls, and only on the degraded path, all at one-time static init. No per-call cost added (the pre-existing no-cache concern, SA-006, is unchanged and out of delta scope). | +| 7 | Design-document adherence | ☑ | Component-ScriptAnalysis.md (lines 86-88) documents DefaultReferences vs AnalysisReferences, ForbiddenAnchorAssemblies, the not-silent fallback, and AnalysisReferencesDegraded exactly as implemented. No drift in either direction. | +| 8 | Code organization & conventions | ☑ | Extraction of TryAddAssembly removes duplication; ForbiddenAnchorAssemblies / BuildMinimalFallbackReferences are well-placed and thoroughly XML-documented. Overload delegation pattern is clean. No issues found. | +| 9 | Testing coverage | ☑ | 11 new adversarial tests close SA-003 gaps and pass (34/34). Minor: no test asserts AnalysisReferencesDegraded==false on a normal (TPA-present) host, so the production non-degraded path is exercised only implicitly. | +| 10 | Documentation & comments | ☑ | New XML-docs are accurate and the corrected extraReferences doc matches verified behaviour. Minor: the inline comment at ScriptTrustValidator.cs:106-111 still says 'Use the full trusted-platform reference set' but the method now resolves against the baseReferences parameter (minimal-enriched on the fallback path). | + +**New findings from this re-review (1):** + +### ScriptAnalysis-009 — Inline Pass 1 comment no longer reflects parameterized base references + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ZB.MOM.WW.ScadaBridge.ScriptAnalysis/ScriptTrustValidator.cs:106` | + +**Description** + +The inline comment at lines 106-111 still asserts 'Use the full trusted-platform reference set (not the minimal runtime-fidelity DefaultReferences)'. After the refactor, Pass 1 now resolves against the baseReferences parameter (line 112), which on the test/degraded path is the minimal anchor-enriched fallback set, not the full TPA set. The comment describes only the default public-entry-point case and is misleading for a maintainer reading the parameterized overload. The overload's own XML-doc (lines 77-89) is correct, so impact is limited to an internal comment. + +**Recommendation** + +Reword the comment to note that Pass 1 resolves against the supplied baseReferences — full TPA set on the public entry point, minimal anchor-enriched fallback when a caller (e.g. tests) passes BuildMinimalFallbackReferences(). + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/SiteCallAudit/findings.md b/code-reviews/SiteCallAudit/findings.md index 83059753..b4331532 100644 --- a/code-reviews/SiteCallAudit/findings.md +++ b/code-reviews/SiteCallAudit/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` | | Design doc | `docs/requirements/Component-SiteCallAudit.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | +| Commit reviewed | `1f9de8a2` | | Open findings | 0 | ## Summary @@ -558,3 +558,26 @@ missing observability. **Resolution** Resolved 2026-06-20 (commit `fd618cf1`): `ReconcileSiteAsync` now consumes `response.MoreAvailable` via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** Two remediation items in SiteCallAuditActor.cs. SiteCallAudit-007: a new `_backgroundTimersEnabled` master switch decouples the daily terminal-row purge timer (now gated on this flag alone, since it needs only the repository) from the reconciliation timer (which additionally requires IPullSiteCallsClient/ISiteEnumerator and logs a Warning when they are absent) — the repo-only MSSQL test ctor disables both. SiteCallAudit-009: ReconcileSiteAsync now consumes PullSiteCallsResponse.MoreAvailable to drain a backlog within one tick via a page loop bounded by MaxReconciliationPagesPerTick (50), persisting the cursor per page, breaking on a single-timestamp no-progress pin (maxUpdated <= since) with a Warning, and logging Info on hitting the page ceiling. New tests cover the production-ctor-without-collaborators purge path, the multi-page within-tick drain, and the no-progress-pin bound. + +**Verdict:** The changed code is sound and regression-free. The SiteCallAudit-007 decoupling correctly fixes the unbounded-growth risk (a host omitting the reconciliation client now still purges), and the master-switch design keeps the MSSQL read/upsert tests free of scheduled side effects. The SiteCallAudit-009 continuation drain guarantees forward progress (no-progress pin break), bounds dispatcher occupancy (50-page ceiling), and relies on the existing monotonic/idempotent upsert to dedupe the inclusive-boundary re-pull; per-page cursor persistence preserves already-drained rows across a later-page fault caught by the per-site try/catch. XML doc comments and constructor comments were updated consistently with the new behaviour, and new unit tests exercise all three remediation paths. The project builds clean (0 warnings, 0 errors) and all referenced test/options members exist. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | MoreAvailable drain terminates: no-progress pin (maxUpdated<=since) breaks, page ceiling bounds the loop, inclusive >= boundary re-pull deduped by monotonic upsert. Purge/reconciliation gating logic correct. No issues found. | +| 2 | Akka.NET conventions | ☑ | Self-tick scheduling unchanged in pattern; handlers stay alive via per-site/per-tick try/catch; PipeTo used on read/relay paths; Sender captured before await. Timer arming in PreStart, cancel in PostStop. No issues found. | +| 3 | Concurrency & thread safety | ☑ | _reconciliationCursors and per-page cursor mutated only on the actor thread inside the awaited tick handler; no captured this/sender in the new loop closures. No shared mutable state introduced. No issues found. | +| 4 | Error handling & resilience | ☑ | Per-site fault isolation retained; per-page cursor persistence keeps drained rows on a later-page throw; no-progress and ceiling cases logged at Warning/Info respectively. Purge continue-on-error unchanged. No issues found. | +| 5 | Security | ☑ | No new external input, SQL, secrets, or logging of sensitive data; site id and timestamps logged are not sensitive. No issues found. | +| 6 | Performance & resource management | ☑ | Within-tick drain is bounded at 50 pages x batch-size; DI scope opened once per tick and awaited-disposed (CreateAsyncScope) — pre-existing pattern, not regressed by holding it across up to 50 awaited upsert pages. No issues found. | +| 7 | Design-document adherence | ☑ | Matches Component-SiteCallAudit.md: self-heal pull (Reconciliation), 365-day terminal purge (Retention), eventually-consistent mirror. SiteCallAudit-007 strengthens the unbounded-growth guard the doc implies. No drift found. | +| 8 | Code organization & conventions | ☑ | New const MaxReconciliationPagesPerTick and _backgroundTimersEnabled field placed and documented consistently with surrounding style; ctors set the flag explicitly. No issues found. | +| 9 | Testing coverage | ☑ | New tests cover all three paths: production-ctor-no-collaborators purge (SiteCallAudit-007), multi-page within-tick drain with cursor advance assertion, and single-timestamp saturation pin bound (<10 pulls vs 50 ceiling). Test helpers verified present. No gaps found. | +| 10 | Documentation & comments | ☑ | Class-level and ctor XML docs accurately rewritten to describe independent purge/reconciliation preconditions and the MoreAvailable continuation drain; comments match the implemented behaviour. No issues found. | + +_No new findings — the changes in this module are clean._ diff --git a/code-reviews/SiteEventLogging/findings.md b/code-reviews/SiteEventLogging/findings.md index b663b7cb..de4bec01 100644 --- a/code-reviews/SiteEventLogging/findings.md +++ b/code-reviews/SiteEventLogging/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging` | | Design doc | `docs/requirements/Component-SiteEventLogging.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | +| Commit reviewed | `1f9de8a2` | | Open findings | 0 | ## Summary @@ -1436,3 +1436,26 @@ fail against the current code and pass once SiteEventLogging-024 is fixed. **Resolution** Resolved 2026-06-20 (commit `fd618cf1`): added `Query_FiltersByTimeRange_HandlesNonUtcOffset` — seeds known UTC instants, queries with a +05:00 offset, and asserts the returned row identities (not just count). Verified failing pre-fix, passing post-fix. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** The diff normalises the `From`/`To` time-range bounds in `EventLogQueryService.ExecuteQuery` to UTC via `.ToUniversalTime()` before `ToString("o")` (lines 84, 92), with explanatory comments tagged SiteEventLogging-024 (re-opening -016). A matching regression test (`Query_FiltersByTimeRange_HandlesNonUtcOffset`) seeds three UTC rows and queries with `+05:00` bounds that bracket only the middle row, asserting on row identity. + +**Verdict:** The change is a correct, well-targeted bug fix. Stored timestamps are written by `SiteEventLogger` as `DateTimeOffset.UtcNow.ToString("o")` (always `+00:00`) and compared lexicographically under BINARY collation; the prior verbatim `ToString("o")` of a non-UTC `DateTimeOffset` emitted a different wall-clock value and `+05:00` suffix that sorted wrongly against stored `+00:00` rows, silently including/excluding the wrong events. `.ToUniversalTime()` normalises both the instant and the offset suffix to match stored values, aligning the code with the system-wide "all timestamps UTC" invariant and the doc's time-range filter requirement. The fix is regression-free (parameterised, no behavioural change for callers already passing UTC) and is backed by a focused test that genuinely fails on the unfixed code and passes after. No new issues found. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Fix correctly normalises both bounds to UTC; matches stored ISO-8601 +00:00 format and BINARY lexicographic comparison semantics. No off-by-one or boundary regression. | +| 2 | Akka.NET conventions | ☑ | No actor code touched; query runs via lock-guarded WithConnection. No issues found. | +| 3 | Concurrency & thread safety | ☑ | DateTimeOffset normalisation is pure/stateless; read still funnels through the recorder's write lock. No shared mutable state introduced. | +| 4 | Error handling & resilience | ☑ | Existing try/catch returns a failed EventLogQueryResponse; the change adds no new throw paths (ToUniversalTime cannot throw on a valid DateTimeOffset). No issues found. | +| 5 | Security | ☑ | Bounds remain bound as SqliteParameters; no injection surface change. No secrets in logs. No issues found. | +| 6 | Performance & resource management | ☑ | Two extra in-memory conversions per query; negligible. No new allocations of concern, no resource leaks. | +| 7 | Design-document adherence | ☑ | Aligns with CLAUDE.md 'all timestamps UTC' invariant and Component-SiteEventLogging time-range filter spec; no doc drift introduced. | +| 8 | Code organization & conventions | ☑ | Consistent with existing parameter-binding style; comments accurately describe the stored format and the bug. No issues found. | +| 9 | Testing coverage | ☑ | New test seeds UTC rows, queries with +05:00 bounds, asserts on row identity; genuinely fails pre-fix, passes post-fix. Good targeted coverage of the regression. | +| 10 | Documentation & comments | ☑ | SiteEventLogging-024 comments correctly explain UTC ISO-8601 storage, BINARY lexicographic comparison, and the failure mode. Accurate and not stale. | + +_No new findings — the changes in this module are clean._ diff --git a/code-reviews/SiteRuntime/findings.md b/code-reviews/SiteRuntime/findings.md index 0565a1b7..f8ae6da7 100644 --- a/code-reviews/SiteRuntime/findings.md +++ b/code-reviews/SiteRuntime/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` | | Design doc | `docs/requirements/Component-SiteRuntime.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | -| Open findings | 0 | +| Commit reviewed | `1f9de8a2` | +| Open findings | 2 | ## Summary @@ -1638,3 +1638,70 @@ secret-handling finding in its own right. **Resolution** Resolved 2026-06-20 (commit `fd618cf1`): the site no longer persists notification-list / SMTP config and purges any already-persisted rows (incl. the plaintext SMTP password) on both apply paths (`HandleDeployArtifacts`, replication `HandleApplyArtifacts`), inside the all-or-nothing apply. Paired with DeploymentManager-025 (central stops shipping). Tables retained but kept empty. + +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** The diff lands four remediation slices: (1) SiteRuntime-029 — HandleDisable/HandleDelete now cancel a buffered mid-redeploy (telling the displaced deployer "superseded"), and HandleDelete gates the _totalDeployedCount decrement on a new wasPresent flag; (2) SiteRuntime-027 — a new NativeAlarmDropped message is Tell'd from NativeAlarmActor to InstanceActor on every terminal mirror drop (snapshot-swap, retention drop, cap eviction) so InstanceActor.HandleNativeAlarmDropped evicts the stale _latestAlarmEvents/_alarmStates/_alarmTimestamps key; (3) SiteRuntime-028 — EnforceCap now emits a return-to-normal for a still-active evicted condition before dropping it; (4) DeploymentManager-025/SiteRuntime-031 — both the primary (DeploymentManagerActor) and replica (SiteReplicationActor) artifact-apply paths stop persisting notification lists/SMTP config and instead call the new SiteStorageService.PurgeCentralOnlyNotificationConfigAsync to delete any pre-fix rows (including the plaintext SMTP password). + +**Verdict:** Three of the four fixes are correct and well-tested: the native-alarm key-eviction (027), the cap-eviction return-to-normal (028), and the central-only notification/SMTP purge (031) are sound, message ordering is preserved (return-to-normal is Tell'd before NativeAlarmDropped on the same sender/receiver pair, so the stream sees the clear before the key is evicted), and the purge runs unconditionally and idempotently on both apply paths against tables that always exist. The SiteRuntime-029 redeploy/delete-race fix is correct for the case it targets (delete-during-redeploy, covered by a new passing test), but it over-corrected the counter guard: gating the _totalDeployedCount decrement on presence in _instanceActors-or-_terminatingActorsByName means deleting a DISABLED instance (which is counted in _totalDeployedCount yet absent from both maps) no longer decrements the count, leaking the deployed/disabled tally on the health dashboard. No test exercises the disable-then-delete count path, so the regression is uncaught. One Low doc-drift item: the SiteRuntime component doc's native-alarm retention/cap sections were not updated for the per-condition _latestAlarmEvents eviction (027) or the cap-eviction return-to-normal emit (028). + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Found: HandleDelete's new wasPresent guard fails to decrement _totalDeployedCount when deleting a disabled instance (absent from both maps but counted). Redeploy-race and native-alarm fixes are otherwise correct. | +| 2 | Akka.NET conventions | ☑ | NativeAlarmDropped uses Tell (hot path); single sender/receiver pair preserves ordering so return-to-normal precedes the drop signal. No captured sender/this in closures. PipeTo used for async results. No issues. | +| 3 | Concurrency & thread safety | ☑ | All map mutations (_pendingRedeploys, _terminatingActorsByName, _alarms, _latestAlarmEvents) occur on the actor thread; LogDeploymentEvent off-thread touches only readonly _serviceProvider. No issues. | +| 4 | Error handling & resilience | ☑ | Purge runs inside SiteReplicationActor's try/catch; native persistence stays fire-and-forget OnlyOnFaulted. Displaced deployers are told Failed-superseded rather than left waiting. No issues. | +| 5 | Security | ☑ | PurgeCentralOnlyNotificationConfigAsync proactively deletes any pre-fix plaintext SMTP password rows on every apply — a positive security remediation. DELETE statements are static SQL, no injection. No issues. | +| 6 | Performance & resource management | ☑ | SiteRuntime-027 eviction fixes an unbounded _latestAlarmEvents growth for sources that mint a fresh SourceReference per occurrence. Purge opens a short-lived SqliteConnection per apply (acceptable, low frequency). No issues. | +| 7 | Design-document adherence | ☑ | Central-only notification/SMTP purge matches the documented decision (sites never deliver). Cap-eviction now emits a final state change, aligning with the retention-drop contract. | +| 8 | Code organization & conventions | ☑ | NativeAlarmDropped placed in Messages/, NotifyParentDropped centralizes the Tell, wasPresent reads clearly. Comments are accurate and reference the issue IDs. No issues. | +| 9 | Testing coverage | ☑ | SiteRuntime-027/028 and delete-during-redeploy (SR029) have new passing tests. Gap: no test asserts _totalDeployedCount after disable-then-delete, so the counter-leak regression is uncaught; snapshot-swap drop signal not directly asserted. | +| 10 | Documentation & comments | ☑ | Code comments are strong. Doc drift: Component-SiteRuntime.md native-alarm retention/cap sections do not mention the new per-condition _latestAlarmEvents eviction (027) or the cap-eviction return-to-normal (028). | + +**New findings from this re-review (2):** + +### SiteRuntime-032 — Deleting a disabled instance leaks the deployed count + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:660-690` | + +**Description** + +The SiteRuntime-029 fix gates the _totalDeployedCount decrement on a new wasPresent flag that is set true only when the instance is live in _instanceActors OR mid-redeploy in _terminatingActorsByName (lines 661-684). But a DISABLED instance is in NEITHER map: HandleDisable removes it from _instanceActors (line 555) and never adds it to _terminatingActorsByName, yet it remains counted in _totalDeployedCount (startup sets _totalDeployedCount = msg.Configs.Count over ALL configs incl. disabled at line 271, and disable never decrements). So deleting a disabled instance hits the _terminatingActorsByName miss, then the _instanceActors miss, leaves wasPresent=false, and skips the decrement. The instance's deployed-config row is removed from SQLite but _totalDeployedCount stays too high — UpdateInstanceCounts then reports a phantom 'deployed=N, disabled=N-_instanceActors.Count' for an instance that no longer exists. The pre-fix code decremented unconditionally (clamped at 0), so this is a behavioral regression. Each disable→delete cycle accumulates the drift on the health dashboard. The original concern the guard targeted (a delete for a WHOLLY-UNKNOWN instance driving the count negative) was already mitigated by the existing Math.Max(0, ...) clamp. + +_Severity note (orchestrator): recorded **Medium** rather than the reviewer's initial High. `_totalDeployedCount` feeds only `UpdateInstanceCounts()` -> the health collector's deployed/enabled/disabled tiles; no runtime behaviour is gated on it, and the drift self-heals on the next singleton restart/failover (startup reloads the count from SQLite via `_totalDeployedCount = msg.Configs.Count`). Impact is a monotonic over-count of the disabled tile, accumulating one per disable->delete cycle until restart — incorrect but observational, hence Medium._ + +**Recommendation** + +Treat 'has a persisted deployed config' as present, not just 'in one of the two in-memory maps'. Either (a) check storage for an existing deployed-config row before deciding wasPresent (e.g. set wasPresent based on the RemoveDeployedConfigAsync result returning whether a row was removed), or (b) keep the unconditional Math.Max(0, _totalDeployedCount - 1) for the delete path (its clamp already prevents going negative) and rely on it; the disabled-instance case then decrements correctly. Add a regression test asserting _totalDeployedCount returns to 0 after deploy → disable → delete (no re-enable). + +**Resolution** + +_Unresolved._ + +### SiteRuntime-033 — Native-alarm doc stale re: per-condition eviction and cap return-to-normal + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `docs/requirements/Component-SiteRuntime.md:272` | + +**Description** + +The native-alarm spec was not updated for two behaviors this diff introduces. (1) The 'Per-source cap' bullet (line 272) states only that the oldest condition is dropped and 'the eviction is logged', but EnforceCap now also emits a return-to-normal AlarmStateChanged for a still-active evicted condition (SiteRuntime-028, NativeAlarmActor.cs:315-318) — the doc implies a silent drop with no state change. (2) The 'Latest-event retention' and 'Reset semantics' bullets (lines 291, 294) describe _latestAlarmEvents as cleared only on redeploy/undeploy, but the SiteRuntime-027 fix now evicts a condition's key per-drop (snapshot-swap, retention drop, cap eviction) via the new NativeAlarmDropped signal — the doc omits this per-condition eviction, which is the whole point of the memory-leak fix. + +**Recommendation** + +Update the Per-source cap bullet to note that an active evicted condition emits a final return-to-normal before being dropped (mirroring the retention-drop bullet at line 271), and update the retention/reset bullets to document that _latestAlarmEvents keys for native conditions are evicted per-condition when the condition leaves the mirror (via NativeAlarmDropped), not only on redeploy/undeploy. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/StoreAndForward/findings.md b/code-reviews/StoreAndForward/findings.md index 8f6f8f6f..95bd9c98 100644 --- a/code-reviews/StoreAndForward/findings.md +++ b/code-reviews/StoreAndForward/findings.md @@ -5,10 +5,10 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward` | | Design doc | `docs/requirements/Component-StoreAndForward.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | -| Open findings | 0 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) | +| Commit reviewed | `1f9de8a2` | +| Open findings | 1 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) | ## Summary @@ -1736,3 +1736,46 @@ not a forced fix. Options, in rough order of preference: Deferred 2026-06-20: the per-retry `"Retried"` site-event volume is spec-conformant (Component-StoreAndForward.md lists 'retried' as a logged activity) and the site log is capped, so the residual concern is history eviction, not disk exhaustion. Whether to drop/sample/threshold the event vs. document the volume is a design-vs-scale decision for the owner. +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** The diff implements the StoreAndForward-025 remediation in StoreAndForwardService.cs: it adds a `_queueDepthProvider` field that retains the exact delegate registered with the process-global `ScadaBridgeTelemetry` queue-depth gauge in `StartAsync`, and reworks `StopAsync` so that after the in-flight retry sweep drains it deregisters that delegate by identity via the new `ScadaBridgeTelemetry.ClearQueueDepthProvider` (compare-and-clear), nulls `_queueDepthProvider`, and resets the `_queueDepthProviderRegistered` one-time guard to 0. The sweep-wait block was also refactored from an early-return guard into an `if (inflight is not null && !inflight.IsCompleted)` block (behaviour-preserving). A companion `ClearQueueDepthProvider` was added to ScadaBridgeTelemetry and three regression tests were added to QueueDepthGaugeTests. + +**Verdict:** The StoreAndForward-025 fix is largely sound: the identity-checked compare-and-clear correctly prevents a late stop from stomping a successor instance's provider (proven by the passing late-stop test), the reference-type `Interlocked.CompareExchange` overload binds correctly, the project builds clean, and all five QueueDepthGaugeTests pass. The StopAsync refactor preserves the StoreAndForward-024 sweep-wait semantics. However, the change introduces one regression in the restart path it newly enables: resetting `_queueDepthProviderRegistered` to 0 without resetting `_bufferedCount` means a Stop→Start cycle on a non-empty buffer re-runs the additive seed and double-counts still-buffered rows into the gauge — contradicting the fix's own comment that a later StartAsync "re-registers cleanly." Impact is gauge-accuracy only (the metric is documented as approximate), so it is bounded, but it is a real behavioural drift versus the prior permanent-latch guard. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | StopAsync resets _queueDepthProviderRegistered to 0 but never resets _bufferedCount, so a same-instance Stop→Start over a non-empty buffer re-runs Interlocked.Add(pending) and double-counts the gauge. See finding. | +| 2 | Akka.NET conventions | ☑ | No issues found. StoreAndForwardService is a plain service, not an actor; no actor-state, Tell/Ask, or supervision concerns touched by the diff. | +| 3 | Concurrency & thread safety | ☑ | Telemetry clear uses Interlocked.CompareExchange (reference overload) and the gauge reads via Volatile — sync-safe. _queueDepthProvider is read/written only on the same instance during the documented single-threaded Start/Stop lifecycle. No data race in the changed code. | +| 4 | Error handling & resilience | ☑ | StopAsync's bounded WaitAsync with TimeoutException + catch-all (swallow + log Warning) is preserved through the refactor; the provider clear runs unconditionally afterward so a timed-out sweep still releases the gauge slot. No issues found. | +| 5 | Security | ☑ | No issues found. No secrets, no SQL/LDAP, no script-trust surface, no logging of sensitive data in the changed code. | +| 6 | Performance & resource management | ☑ | The fix correctly addresses the original StoreAndForward-025 leak: the global delegate slot no longer pins the dead service after a graceful StopAsync, and the gauge stops reporting a frozen depth. | +| 7 | Design-document adherence | ☑ | Component-StoreAndForward.md references buffer-depth metrics only generically; the gauge provider lifecycle is an implementation detail. No spec drift. Note: Host AkkaHostedService.StopAsync never calls StoreAndForwardService.StopAsync, so the clear path is not exercised on real host shutdown — pre-existing wiring gap, out of scope of this file. | +| 8 | Code organization & conventions | ☑ | No issues found. New field, telemetry method, and XML docs follow existing patterns; the explicit (Func) cast on the seed delegate is needed for overload disambiguation. | +| 9 | Testing coverage | ☑ | Three new tests cover clear-on-stop, late-stop-doesn't-clobber-successor, and the additive seed. No test covers a Stop→Start restart on the same instance with a non-empty buffer, which is exactly the double-count path the finding describes. | +| 10 | Documentation & comments | ☑ | The StopAsync XML doc and the inline comment at lines 405-406/451-452 claim a later StartAsync 're-registers cleanly', which is inaccurate for _bufferedCount — the comment overstates the cleanliness of the restart path. Tied to the finding. | + +**New findings from this re-review (1):** + +### StoreAndForward-028 — StopAsync resets register-guard but not _bufferedCount, double-counting gauge on restart + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:459` | + +**Description** + +The StoreAndForward-025 fix added `Interlocked.Exchange(ref _queueDepthProviderRegistered, 0)` in StopAsync (line 459) so a later StartAsync re-registers — and the new XML doc/comment (lines 405-406, 451-452) explicitly claims StartAsync then 're-registers cleanly.' But StopAsync never resets `_bufferedCount` (verified: it is only ever Increment/Decrement/Add, never reset — lines 154/359/619/678/712/751/973). On a same-instance Stop→Start cycle the guard CompareExchange at line 357 now succeeds again, so `Interlocked.Add(ref _bufferedCount, pending)` (line 359) re-adds the freshly-read durable Pending count on top of the leftover in-memory value from the prior lifecycle. When the service stops with a non-empty buffer (the normal case — S&F stops with queued rows), `_bufferedCount` is ~N at stop and stays N; the restart adds another N, so the gauge reports ~2N. Before this change the guard was a permanent latch (never reset), so a second StartAsync short-circuited and the seed never re-ran — no double-count. The service is a DI singleton and the Host comment notes the test harness triggers second start/stop interleavings, so this restart path is reachable. Impact is confined to gauge accuracy (the depth metric is documented as approximate, dashboard-only — no data loss), but it is a real regression in the path this change newly enables and mis-documents. + +**Recommendation** + +Reset the cached counter on stop alongside the guard, e.g. `Interlocked.Exchange(ref _bufferedCount, 0);` immediately before/after the `Interlocked.Exchange(ref _queueDepthProviderRegistered, 0)` at line 459, so a subsequent StartAsync seeds from a clean zero base. Then add a regression test to QueueDepthGaugeTests that buffers N rows, calls StopAsync, calls StartAsync again on the same instance, and asserts the gauge reports N (not 2N). Alternatively, correct the 're-registers cleanly' comment to document that restart is unsupported and keep the guard a permanent latch. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/TemplateEngine/findings.md b/code-reviews/TemplateEngine/findings.md index de52e5c4..f4e27b7d 100644 --- a/code-reviews/TemplateEngine/findings.md +++ b/code-reviews/TemplateEngine/findings.md @@ -5,9 +5,9 @@ | Module | `src/ZB.MOM.WW.ScadaBridge.TemplateEngine` | | Design doc | `docs/requirements/Component-TemplateEngine.md` | | Status | Reviewed | -| Last reviewed | 2026-06-20 | +| Last reviewed | 2026-06-24 | | Reviewer | claude-agent | -| Commit reviewed | `4307c381` | +| Commit reviewed | `1f9de8a2` | | Open findings | 0 | ## Summary @@ -1388,3 +1388,25 @@ and add a legacy-array test asserting names are returned. Resolved 2026-06-20 (commit `fd618cf1`): deleted the dead, misnamed `SemanticValidator.ParseParameterDefinitions` (no production caller per grep; its only references were two tests, which asserted the buggy type-not-name output). Tests removed. +## Re-review — 2026-06-24 (commit `1f9de8a2`) + +Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator. + +**Changes reviewed:** Two changes since 4307c381. (1) TemplateFolderService.cs: each of the five folder mutators (Create/Rename/Move/Reorder/Delete) now issues a second SaveChangesAsync after auditService.LogAsync, so the audit row staged on the EF change tracker is actually persisted instead of being discarded when the ManagementActor's DI scope is disposed. (2) SemanticValidator.cs: the unused internal ParseParameterDefinitions(string?) helper was deleted along with its two unit tests; the matching call-order tests for folder audit persistence were added. + +**Verdict:** Both deltas are correct, minimal, and regression-free. The audit-persistence fix closes a real defect: AuditService.LogAsync only stages an AuditLogEntry on the change tracker (it explicitly documents that the caller must SaveChanges), so before this change folder-mutation audit rows were never flushed — violating the design doc's "all template changes are audit logged" requirement. The fix mirrors the established, deliberate TemplateService pattern (save entity to populate the key, log, then save the staged audit row) and is covered by new call-order tests asserting a save follows the log. The ParseParameterDefinitions removal is genuinely dead code (no remaining references in src or tests, no design-doc footprint, System.Text.Json still used elsewhere so no orphaned using). Project builds with zero warnings; all 88 TemplateFolderService + SemanticValidator tests pass. No new issues found. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Second SaveChangesAsync correctly flushes the staged audit row (LogAsync only stages, per AuditService.cs:59-61). For Rename/Move/Reorder/Delete the entity id is already known so the audit entityId is correct. No issues found. | +| 2 | Akka.NET conventions | ☑ | No actor code touched; TemplateFolderService is a plain async service invoked from within a ManagementActor DI scope. No Tell/Ask, supervision, or closure-capture concerns in the delta. | +| 3 | Concurrency & thread safety | ☑ | No shared mutable state introduced; per-request scoped DbContext, sequential awaits. No issues found. | +| 4 | Error handling & resilience | ☑ | The added saves are non-atomic with the entity save (separate transaction), matching the accepted TemplateService trade-off; audit-as-best-effort is consistent with system design. No issues found. | +| 5 | Security | ☑ | No secrets, no injection surface; audit serialization handles cycles/depth in AuditService. No change to security posture. | +| 6 | Performance & resource management | ☑ | One extra round-trip SaveChangesAsync per folder mutation — negligible for low-frequency admin operations and required for correctness. No leaks (no IDisposable introduced). | +| 7 | Design-document adherence | ☑ | Fix realizes Component-TemplateEngine.md:218 'all template changes are audit logged'. ParseParameterDefinitions removal drops no documented capability. | +| 8 | Code organization & conventions | ☑ | Mirrors TemplateService pattern exactly; dead-code removal leaves no orphaned usings (build has 0 warnings). Clean. | +| 9 | Testing coverage | ☑ | New tests assert a SaveChangesAsync follows LogAsync for all five mutators plus Times.Exactly(2) save counts; obsolete ParseParameterDefinitions tests removed. 88 tests pass. | +| 10 | Documentation & comments | ☑ | Each added save has an accurate explanatory comment cross-referencing CreateFolderAsync/TemplateService and the DI-scope-disposal rationale. XML doc on the removed method went with it. | + +_No new findings — the changes in this module are clean._