docs(code-review): re-review 17 changed modules at 1f9de8a2 — 8 new findings
Re-reviewed the modules whose source changed since the last review baseline (full-review remediation fd618cf1 + InboundAPI Database-helper fixes b3c90143), focused on whether the fixes are sound and regression-free.

9 of 17 modules clean; 8 new findings (0 Critical, 0 High, 4 Medium, 4 Low), all code-verified by the orchestrator before recording:

- DataConnectionLayer-029 (Med): DCL-023's unsubscribe-clears-in-flight reopens a double-subscribe window that leaks an orphaned alarm feed; the alarm completion handler overwrites the subscription id without the tag-path guard at line 908.
- InboundAPI-031 (Med): WaitForAttribute's 5s grace backstop is tighter than the CommunicationService Ask's timeout+IntegrationTimeout (30s) round-trip slack, so a slow-but-valid timed-out 'false' arriving in the 5-30s window is cancelled into an unhandled OperationCanceledException/500 (contradicts spec 6 + its own comment).
- SiteRuntime-032 (Med): SiteRuntime-029's wasPresent guard skips the deployed-count decrement when deleting a DISABLED instance (absent from both maps), drifting the health-dashboard tally; self-heals on singleton restart (observational, hence Med).
- StoreAndForward-028 (Med): StoreAndForward-025 resets the register-guard but not _bufferedCount, so a same-instance Stop->Start re-seeds the depth gauge to ~2N.
- AuditLog-017, CentralUI-037, ScriptAnalysis-009, SiteRuntime-033 (Low): a test-coverage gap plus stale doc-comments/spec following the remediation.

Header commit/date bumped to 1f9de8a2 / 2026-06-24 on all 17 modules; README regenerated (8 pending / 576 total).
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` |
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Last reviewed | 2026-06-24 |
| Reviewer | claude-agent |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
| Commit reviewed | `1f9de8a2` |
| Open findings | 2 |

## Summary

@@ -1638,3 +1638,70 @@ secret-handling finding in its own right.
**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): the site no longer persists notification-list / SMTP config and purges any already-persisted rows (incl. the plaintext SMTP password) on both apply paths (`HandleDeployArtifacts`, replication `HandleApplyArtifacts`), inside the all-or-nothing apply. Paired with DeploymentManager-025 (central stops shipping). Tables retained but kept empty.

## Re-review — 2026-06-24 (commit `1f9de8a2`)

Focused re-review of the changes since the prior review, verifying that the code-review remediation and feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator.

**Changes reviewed:** The diff lands four remediation slices:

1. SiteRuntime-029: `HandleDisable`/`HandleDelete` now cancel a buffered mid-redeploy (telling the displaced deployer "superseded"), and `HandleDelete` gates the `_totalDeployedCount` decrement on a new `wasPresent` flag.
2. SiteRuntime-027: a new `NativeAlarmDropped` message is Tell'd from `NativeAlarmActor` to `InstanceActor` on every terminal mirror drop (snapshot-swap, retention drop, cap eviction) so that `InstanceActor.HandleNativeAlarmDropped` evicts the stale `_latestAlarmEvents`/`_alarmStates`/`_alarmTimestamps` key.
3. SiteRuntime-028: `EnforceCap` now emits a return-to-normal for a still-active evicted condition before dropping it.
4. DeploymentManager-025/SiteRuntime-031: both the primary (`DeploymentManagerActor`) and replica (`SiteReplicationActor`) artifact-apply paths stop persisting notification lists/SMTP config and instead call the new `SiteStorageService.PurgeCentralOnlyNotificationConfigAsync` to delete any pre-fix rows (including the plaintext SMTP password).

**Verdict:** Three of the four fixes are correct and well-tested: the native-alarm key-eviction (027), the cap-eviction return-to-normal (028), and the central-only notification/SMTP purge (031) are sound. Message ordering is preserved (return-to-normal is Tell'd before `NativeAlarmDropped` on the same sender/receiver pair, so the stream sees the clear before the key is evicted), and the purge runs unconditionally and idempotently on both apply paths against tables that always exist.

The SiteRuntime-029 redeploy/delete-race fix is correct for the case it targets (delete-during-redeploy, covered by a new passing test), but it over-corrected the counter guard: gating the `_totalDeployedCount` decrement on presence in `_instanceActors` or `_terminatingActorsByName` means that deleting a disabled instance (which is counted in `_totalDeployedCount` yet absent from both maps) no longer decrements the count, leaking the deployed/disabled tally on the health dashboard. No test exercises the disable-then-delete count path, so the regression is uncaught.

One Low doc-drift item: the SiteRuntime component doc's native-alarm retention/cap sections were not updated for the per-condition `_latestAlarmEvents` eviction (027) or the cap-eviction return-to-normal emit (028).

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Found: `HandleDelete`'s new `wasPresent` guard fails to decrement `_totalDeployedCount` when deleting a disabled instance (absent from both maps but counted). Redeploy-race and native-alarm fixes are otherwise correct. |
| 2 | Akka.NET conventions | ☑ | `NativeAlarmDropped` uses `Tell` (hot path); single sender/receiver pair preserves ordering so return-to-normal precedes the drop signal. No captured sender/this in closures. `PipeTo` used for async results. No issues. |
| 3 | Concurrency & thread safety | ☑ | All map mutations (`_pendingRedeploys`, `_terminatingActorsByName`, `_alarms`, `_latestAlarmEvents`) occur on the actor thread; `LogDeploymentEvent` off-thread touches only readonly `_serviceProvider`. No issues. |
| 4 | Error handling & resilience | ☑ | Purge runs inside `SiteReplicationActor`'s try/catch; native persistence stays fire-and-forget `OnlyOnFaulted`. Displaced deployers are told Failed-superseded rather than left waiting. No issues. |
| 5 | Security | ☑ | `PurgeCentralOnlyNotificationConfigAsync` proactively deletes any pre-fix plaintext SMTP password rows on every apply, a positive security remediation. `DELETE` statements are static SQL, no injection. No issues. |
| 6 | Performance & resource management | ☑ | SiteRuntime-027 eviction fixes an unbounded `_latestAlarmEvents` growth for sources that mint a fresh `SourceReference` per occurrence. Purge opens a short-lived `SqliteConnection` per apply (acceptable, low frequency). No issues. |
| 7 | Design-document adherence | ☑ | Central-only notification/SMTP purge matches the documented decision (sites never deliver). Cap-eviction now emits a final state change, aligning with the retention-drop contract. |
| 8 | Code organization & conventions | ☑ | `NativeAlarmDropped` placed in `Messages/`, `NotifyParentDropped` centralizes the `Tell`, `wasPresent` reads clearly. Comments are accurate and reference the issue IDs. No issues. |
| 9 | Testing coverage | ☑ | SiteRuntime-027/028 and delete-during-redeploy (SR029) have new passing tests. Gap: no test asserts `_totalDeployedCount` after disable-then-delete, so the counter-leak regression is uncaught; snapshot-swap drop signal not directly asserted. |
| 10 | Documentation & comments | ☑ | Code comments are strong. Doc drift: `Component-SiteRuntime.md` native-alarm retention/cap sections do not mention the new per-condition `_latestAlarmEvents` eviction (027) or the cap-eviction return-to-normal (028). |

**New findings from this re-review (2):**

### SiteRuntime-032 — Deleting a disabled instance leaks the deployed count

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:660-690` |

**Description**

The SiteRuntime-029 fix gates the `_totalDeployedCount` decrement on a new `wasPresent` flag that is set true only when the instance is live in `_instanceActors` or mid-redeploy in `_terminatingActorsByName` (lines 661-684). A disabled instance, however, is in neither map: `HandleDisable` removes it from `_instanceActors` (line 555) and never adds it to `_terminatingActorsByName`, yet it remains counted in `_totalDeployedCount` (startup sets `_totalDeployedCount = msg.Configs.Count` over all configs, including disabled ones, at line 271, and disable never decrements).

Deleting a disabled instance therefore misses in `_terminatingActorsByName`, then misses in `_instanceActors`, leaves `wasPresent` false, and skips the decrement. The instance's deployed-config row is removed from SQLite but `_totalDeployedCount` stays too high, so `UpdateInstanceCounts` reports a phantom `deployed=N, disabled=N-_instanceActors.Count` for an instance that no longer exists.

The pre-fix code decremented unconditionally (clamped at 0), so this is a behavioral regression. Each disable→delete cycle adds one more to the drift shown on the health dashboard. The original concern the guard targeted (a delete for a wholly unknown instance driving the count negative) was already mitigated by the existing `Math.Max(0, ...)` clamp.

_Severity note (orchestrator): recorded **Medium** rather than the reviewer's initial High. `_totalDeployedCount` feeds only `UpdateInstanceCounts()` -> the health collector's deployed/enabled/disabled tiles; no runtime behaviour is gated on it, and the drift self-heals on the next singleton restart/failover (startup reloads the count from SQLite via `_totalDeployedCount = msg.Configs.Count`). Impact is a monotonic over-count of the disabled tile, accumulating one per disable->delete cycle until restart; incorrect but observational, hence Medium._

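
The drift described above can be reproduced with a small stand-alone model of the counter bookkeeping. This is a sketch under stated assumptions, not the actual `DeploymentManagerActor`: the `CounterModel` class, its method names, the instance name, and the use of plain sets in place of the actor maps and the SQLite table are all hypothetical. Only the field names and the `wasPresent` guard follow the description.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical walk-through: deploy one instance, disable it, then delete it.
var model = new CounterModel();
model.Deploy("pump-1");
model.Disable("pump-1");
model.Delete("pump-1");

// The persisted config is gone, but the guarded decrement was skipped.
Console.WriteLine($"deployed={model.TotalDeployedCount}, live={model.LiveCount}");
// prints "deployed=1, live=0"

// Stand-alone model of the bookkeeping described in SiteRuntime-032.
// Sets stand in for the actor maps; none of this is the real actor code.
sealed class CounterModel
{
    private readonly HashSet<string> _instanceActors = new();
    private readonly HashSet<string> _terminatingActorsByName = new();
    private readonly HashSet<string> _persistedConfigs = new(); // stands in for SQLite rows
    private int _totalDeployedCount;

    public int TotalDeployedCount => _totalDeployedCount;
    public int LiveCount => _instanceActors.Count;

    public void Deploy(string name)
    {
        if (_persistedConfigs.Add(name)) _totalDeployedCount++;
        _instanceActors.Add(name);
    }

    // Disable stops the live actor but keeps the config counted as deployed.
    public void Disable(string name) => _instanceActors.Remove(name);

    public void Delete(string name)
    {
        // Guard as described: "present" only if mid-redeploy or live.
        bool wasPresent = _terminatingActorsByName.Remove(name);
        if (_instanceActors.Remove(name)) wasPresent = true;

        _persistedConfigs.Remove(name);

        if (wasPresent)
            _totalDeployedCount = Math.Max(0, _totalDeployedCount - 1);
    }
}
```

In this model the disabled instance is found in neither set, so the count stays at 1 after the delete, which matches the phantom deployed/disabled tally the finding describes.
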
**Recommendation**

Treat 'has a persisted deployed config' as present, not just 'in one of the two in-memory maps'. Either:

- (a) check storage for an existing deployed-config row before deciding `wasPresent` (e.g. set `wasPresent` based on the `RemoveDeployedConfigAsync` result returning whether a row was removed), or
- (b) keep the unconditional `Math.Max(0, _totalDeployedCount - 1)` for the delete path (its clamp already prevents going negative) and rely on it; the disabled-instance case then decrements correctly.

Add a regression test asserting `_totalDeployedCount` returns to 0 after deploy → disable → delete (no re-enable).

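
Option (a) and the regression scenario can be sketched against the same kind of stand-alone model. The assumptions are significant: `FixedCounterModel` and the synchronous `RemoveDeployedConfig` helper are hypothetical, and the recommendation only proposes that `RemoveDeployedConfigAsync` could report whether a row was removed; it does not say the method does so today.

```csharp
using System;
using System.Collections.Generic;

// Regression scenario from the recommendation: deploy -> disable -> delete.
var model = new FixedCounterModel();
model.Deploy("pump-1");
model.Disable("pump-1");
model.Delete("pump-1");
Console.WriteLine(model.TotalDeployedCount); // prints 0

// Deleting a wholly unknown instance must not touch the count.
model.Deploy("pump-2");
model.Delete("no-such-instance");
Console.WriteLine(model.TotalDeployedCount); // prints 1

sealed class FixedCounterModel
{
    private readonly HashSet<string> _instanceActors = new();
    private readonly HashSet<string> _persistedConfigs = new(); // stands in for SQLite rows
    private int _totalDeployedCount;

    public int TotalDeployedCount => _totalDeployedCount;

    public void Deploy(string name)
    {
        if (_persistedConfigs.Add(name)) _totalDeployedCount++;
        _instanceActors.Add(name);
    }

    public void Disable(string name) => _instanceActors.Remove(name);

    // Hypothetical synchronous stand-in for RemoveDeployedConfigAsync that
    // reports whether a persisted row actually existed.
    private bool RemoveDeployedConfig(string name) => _persistedConfigs.Remove(name);

    public void Delete(string name)
    {
        _instanceActors.Remove(name);

        // Option (a): "present" means "had a persisted deployed config".
        bool wasPresent = RemoveDeployedConfig(name);
        if (wasPresent)
            _totalDeployedCount = Math.Max(0, _totalDeployedCount - 1);

        // Option (b) would instead drop the guard and rely on the clamp:
        // _totalDeployedCount = Math.Max(0, _totalDeployedCount - 1);
    }
}
```

One design difference is visible in this model: with option (b) the second scenario would print 0, because the clamp only stops the count from going negative and does not stop an unknown-instance delete from decrementing a positive count. Whether that case can occur in the real actor is not something this sketch can confirm.
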
**Resolution**

_Unresolved._

### SiteRuntime-033 — Native-alarm doc stale re: per-condition eviction and cap return-to-normal

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `docs/requirements/Component-SiteRuntime.md:272` |

**Description**

The native-alarm spec was not updated for two behaviors this diff introduces:

1. The 'Per-source cap' bullet (line 272) states only that the oldest condition is dropped and 'the eviction is logged', but `EnforceCap` now also emits a return-to-normal `AlarmStateChanged` for a still-active evicted condition (SiteRuntime-028, `NativeAlarmActor.cs:315-318`). The doc implies a silent drop with no state change.
2. The 'Latest-event retention' and 'Reset semantics' bullets (lines 291, 294) describe `_latestAlarmEvents` as cleared only on redeploy/undeploy, but the SiteRuntime-027 fix now evicts a condition's key per-drop (snapshot-swap, retention drop, cap eviction) via the new `NativeAlarmDropped` signal. The doc omits this per-condition eviction, which is the whole point of the memory-leak fix.

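
The behavior the spec needs to describe can be illustrated with a minimal model of cap eviction. This is a sketch only: `AlarmMirrorModel`, the string log, and the condition keys are hypothetical stand-ins, and the real `NativeAlarmActor` sends Akka.NET messages instead of appending strings. The message names and the emit order (return-to-normal first, then the drop signal) follow the review text.

```csharp
using System;
using System.Collections.Generic;

var log = new List<string>();
var mirror = new AlarmMirrorModel(cap: 2, emit: log.Add);
mirror.Raise("cond-A");
mirror.Raise("cond-B");
mirror.Raise("cond-C"); // exceeds the cap, so the oldest condition (cond-A) is evicted

foreach (var line in log) Console.WriteLine(line);
// The last two lines printed are:
//   AlarmStateChanged cond-A normal
//   NativeAlarmDropped cond-A

sealed class AlarmMirrorModel
{
    private readonly int _cap;
    private readonly Action<string> _emit;
    // Oldest condition first; Active marks a condition that has not cleared.
    private readonly Queue<(string Key, bool Active)> _alarms = new();

    public AlarmMirrorModel(int cap, Action<string> emit)
    {
        _cap = cap;
        _emit = emit;
    }

    public void Raise(string key)
    {
        _alarms.Enqueue((key, true));
        _emit($"AlarmStateChanged {key} active");
        EnforceCap();
    }

    private void EnforceCap()
    {
        while (_alarms.Count > _cap)
        {
            var oldest = _alarms.Dequeue();
            // Clear a still-active condition before it disappears...
            if (oldest.Active)
                _emit($"AlarmStateChanged {oldest.Key} normal");
            // ...then tell the parent to evict its per-condition keys.
            _emit($"NativeAlarmDropped {oldest.Key}");
        }
    }
}
```

Because both signals come from the same emitter in a fixed order, a consumer in this model always sees the clear before the key is evicted, which is the ordering property the verdict relies on.
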
**Recommendation**

Update the Per-source cap bullet to note that an active evicted condition emits a final return-to-normal before being dropped (mirroring the retention-drop bullet at line 271), and update the retention/reset bullets to document that `_latestAlarmEvents` keys for native conditions are evicted per-condition when the condition leaves the mirror (via `NativeAlarmDropped`), not only on redeploy/undeploy.

**Resolution**

_Unresolved._