docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites; DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime` |
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 3 |
| Commit reviewed | `4307c381` |
| Open findings | 0 |

## Summary

@@ -109,6 +109,40 @@ _Re-review (2026-05-28, `1eb6e97`):_

## Findings

#### Re-review 2026-06-20 (commit `4307c381`) — full review

The module was re-reviewed in full at `4307c381` (current HEAD). The diff against the
prior baseline `1eb6e97` shows as 100% additions because of the ScadaBridge rename
(the project moved paths), so the whole module was re-read at its current state rather
than relying on the diff. Health is good: all prior findings 001–026 remain
Resolved/Deferred with no regressions observed in their fixed call sites
(SetAttribute DCL routing, the watch+buffer redeploy, the SiteRuntime-020 terminating
shadow, the OperationTrackingStore read/write split + safe Dispose, the
AuditingDb `Inner` accessor, invariant-culture numeric parsing, the
`_attributes` child-snapshot isolation). The new M7 surface (NativeAlarmActor,
CertStoreActor) and the WaitForAttribute/batch-write additions are generally
well-built. Five new findings were recorded, all in the native-alarm and
deployment-lifecycle bookkeeping: an unbounded `_latestAlarmEvents` map on the
Instance Actor (027, Medium), a phantom-active alarm left behind by
`NativeAlarmActor.EnforceCap` (028, Medium), a delete-during-pending-redeploy
that both over-decrements the deployed counter and resurrects the deleted
instance (029, Medium), missing tests for the cap/retention/replication paths
(030, Low), and stale site-side persistence of notification-list + SMTP config
that the design decision moved to central-only (031, Low).

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | EnforceCap leaves phantom-active alarm (028); delete-during-redeploy over-decrements + resurrects (029). HiLo/RoC/edge logic otherwise sound. |
| 2 | Akka.NET conventions | ✓ | Supervision (Resume coordinators / Stop short-lived) correct; execution actors on ScriptExecutionScheduler; cert broadcast uses PipeTo, captures `sender` before async. Trigger-eval mailbox-block stays Deferred (014). |
| 3 | Concurrency & thread safety | ✓ | Child-snapshot isolation (017) intact; `_createdConnections` actor-confined via ApplyArtifact… dispatch (021); OperationTrackingStore read/write split (024). No new actor-thread violations. |
| 4 | Error handling & resilience | ✓ | Best-effort persistence/replication paths log-on-fault; deploy reports Success only post-persist (005). Native source-unavailable retains+marks-uncertain. |
| 5 | Security | ✓ | Trust verdict delegated to shared ScriptTrustValidator; CertStoreActor path-traversal-guarded; no reflection in scripts; SQL params captured per M4 design (redaction at write time). |
| 6 | Performance & resource management | ✓ | `_latestAlarmEvents` unbounded growth on native-alarm churn (027). ScriptExecutionScheduler background threads OK; per-call SQLite connections acceptable. |
| 7 | Design-document adherence | ✓ | Actor hierarchy / native-alarm wiring conform. Stale site persistence of notification-list + SMTP config vs. "central-only, not deployed to sites" (031). |
| 8 | Code organization & conventions | ✓ | Reflection anti-pattern eliminated (006/022); Options owned by component; additive message evolution honoured. |
| 9 | Testing coverage | ✓ | Broad new suites (NativeAlarmActor, WaitForAttribute, SetAttribute, ExecutionActor, scheduler). EnforceCap/retention-drop + SiteReplicationActor still untested (030). |
| 10 | Documentation & comments | ✓ | XML docs accurate and extensive; ReplicationMessages now documented (026). No new stale comments found. |

### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer

| | |
@@ -1370,3 +1404,237 @@ the direction (outbound to peer / inbound from peer) and what is replicated.
The two pre-existing group-header XML blocks were converted to plain `//`
comments to avoid orphaned doc-summaries above the first record in each group.
Marker-base-type idea left out of scope.

### SiteRuntime-027 — `InstanceActor._latestAlarmEvents` grows without bound as native conditions churn

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:67`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs:1007` |

**Description**

`HandleAlarmStateChanged` stores the latest enriched `AlarmStateChanged` per
`changed.AlarmName` into `_latestAlarmEvents` (line 1013) and **never removes
anything from that dictionary**. For computed alarms the key set is bounded (one
entry per configured alarm). For **native** alarms, however, the key is the
per-condition `SourceReference` (every `NativeAlarmActor.Emit` stamps
`AlarmName = t.SourceReference`). When a native condition fully runs its course,
either because `NativeAlarmActor.ApplyLiveTransition` drops it from the mirror once
`!Active && Acknowledged` (NativeAlarmActor.cs:243) or because `EnforceCap` evicts
the oldest, the Instance Actor still holds the final `AlarmStateChanged` for that
`SourceReference` forever as a (Normal) entry. There is no message back to the
Instance Actor to evict it.

On an OPC UA A&C source that mints a fresh `SourceReference` per occurrence
(conditionId/branchId style references are common), the map accumulates one
permanently-retained entry per distinct condition the instance has *ever* seen.
The `MirroredAlarmCapPerSource` cap (default 1000) bounds the
`NativeAlarmActor._alarms` set but does **not** bound `_latestAlarmEvents`, which
is on the long-lived Instance Actor. Over weeks of uptime on a busy site this is a
steadily growing per-instance memory leak, and `BuildAlarmStatesSnapshot`
(line 1055, `_latestAlarmEvents.Values.ToList()`) makes every DebugView snapshot
proportionally larger and slower, returning thousands of stale Normal rows.

**Recommendation**

Evict from `_latestAlarmEvents` when a native condition reaches its terminal
return-to-normal+dropped state. The cleanest signal is the `AlarmStateChanged`
the `NativeAlarmActor` already emits on drop-out: have the `NativeAlarmActor`
stamp a "final/dropped" marker (an additive bool on `AlarmStateChanged`, or a
dedicated `NativeAlarmDropped(sourceReference)` Tell to the parent), and have
`HandleAlarmStateChanged` call `_latestAlarmEvents.Remove(...)` on it. Alternatively,
cap/prune `_latestAlarmEvents` for native keys, mirroring
`MirroredAlarmCapPerSource`. Computed-alarm keys must stay (they are
configuration-bounded).

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): added an additive `NativeAlarmDropped(SourceReference)` Tell emitted by `NativeAlarmActor` at all terminal-drop sites (snapshot-swap removal, retention drop, cap eviction) after the return-to-normal emit; `InstanceActor.HandleNativeAlarmDropped` removes the native key from `_latestAlarmEvents` (and `_alarmStates`/`_alarmTimestamps`). Computed-alarm keys are never dropped.

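The eviction described in this finding can be illustrated with a small model of the bookkeeping. This is a minimal Python sketch, not the module's C# actor code: `InstanceAlarmCache` and its method names are hypothetical stand-ins for the `_latestAlarmEvents` handling on the Instance Actor, and the message shape is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmStateChanged:
    alarm_name: str  # computed alarm name, or the SourceReference for native alarms
    active: bool


class InstanceAlarmCache:
    """Hypothetical stand-in for the Instance Actor's _latestAlarmEvents map."""

    def __init__(self, computed_alarm_names):
        self._computed = frozenset(computed_alarm_names)
        self._latest = {}

    def handle_alarm_state_changed(self, changed):
        # Latest event wins, keyed by alarm name / SourceReference.
        self._latest[changed.alarm_name] = changed

    def handle_native_alarm_dropped(self, source_reference):
        # Computed-alarm keys are configuration-bounded and are never evicted.
        if source_reference in self._computed:
            return
        self._latest.pop(source_reference, None)

    def snapshot(self):
        return list(self._latest.values())


cache = InstanceAlarmCache(computed_alarm_names=["HighTemp"])
cache.handle_alarm_state_changed(AlarmStateChanged("HighTemp", active=False))

# Simulate a source that mints a fresh SourceReference per occurrence.
for i in range(1000):
    ref = f"cond-{i}"
    cache.handle_alarm_state_changed(AlarmStateChanged(ref, active=True))
    cache.handle_alarm_state_changed(AlarmStateChanged(ref, active=False))
    cache.handle_native_alarm_dropped(ref)

snapshot_size = len(cache.snapshot())
```

Without the `handle_native_alarm_dropped` call the map in this model would hold 1001 entries after the loop; with it, only the computed alarm remains.
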
### SiteRuntime-028 — `NativeAlarmActor.EnforceCap` drops a condition without a return-to-normal, leaving a phantom Active alarm

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/NativeAlarmActor.cs:281` |

**Description**

`EnforceCap` evicts the oldest mirrored conditions once the per-source cap is
exceeded: it removes them from `_alarms`, calls `PersistDelete`, and logs the
eviction, but it does **not** `Emit` a return-to-normal `AlarmStateChanged` for
the dropped condition. If the evicted condition was still `Active`, the Instance
Actor's `_latestAlarmEvents` (and, downstream, central's gRPC stream and the
operator Alarm Summary page) keep showing it as **Active** indefinitely. The
mirror has silently forgotten the condition, so no later transition can ever clear
it: a phantom stuck-active alarm.

This is inconsistent with the sibling drop path `ApplySnapshotSwap`
(NativeAlarmActor.cs:205), which correctly emits
`prior.Condition with { Active = false }` for every condition that falls out of
the new snapshot. The retention drop in `ApplyLiveTransition` (line 243) is safe
only because it drops *after* emitting the condition's own already-inactive state;
`EnforceCap` drops a condition whose last-emitted state may still be Active, with
no compensating emit.

The design doc explicitly states the cap eviction is "logged — there is no silent
truncation," but logging alone does not reconcile the operator-visible state: the
alarm view is silently wrong.

**Recommendation**

In `EnforceCap`, before removing each overflow condition, call
`Emit(a, a.Condition with { Active = false })` (mirroring `ApplySnapshotSwap`) so
the eviction produces a return-to-normal on the stream and clears the phantom. Add
a test asserting that exceeding `MirroredAlarmCapPerSource` with an active oldest
condition emits a Normal `AlarmStateChanged` for the evicted `SourceReference`.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `NativeAlarmActor.EnforceCap` now emits `Emit(evicted, evicted.Condition with { Active = false })` for a still-active evicted condition before removing it, clearing the phantom stuck-Active alarm on the stream/UI. Cap-eviction test added.

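The emit-before-evict ordering can be sketched with a toy mirror. This is an illustrative Python model under stated assumptions, not the C# implementation: `NativeAlarmMirror` is a hypothetical class, "oldest" is assumed to mean first-seen order, and the `emitted` list stands in for the outbound `AlarmStateChanged` stream.

```python
from collections import OrderedDict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Condition:
    source_reference: str
    active: bool
    acknowledged: bool = False


class NativeAlarmMirror:
    """Hypothetical stand-in for the per-source mirror (_alarms) and its cap."""

    def __init__(self, cap):
        self._cap = cap
        self._alarms = OrderedDict()  # insertion order, oldest first (assumed)
        self.emitted = []             # stands in for the outbound event stream

    def apply(self, condition):
        self._alarms[condition.source_reference] = condition
        self.emitted.append(condition)
        self._enforce_cap()

    def _enforce_cap(self):
        while len(self._alarms) > self._cap:
            oldest_ref = next(iter(self._alarms))
            evicted = self._alarms[oldest_ref]
            if evicted.active:
                # Emit a return-to-normal first so downstream views clear the alarm.
                self.emitted.append(replace(evicted, active=False))
            del self._alarms[oldest_ref]


mirror = NativeAlarmMirror(cap=2)
for ref in ("A", "B", "C"):
    mirror.apply(Condition(ref, active=True))

last_emitted = mirror.emitted[-1]
```

In this model the third `apply` overflows the cap, so the last emitted event is an inactive copy of condition `A`, and `A` is no longer in the mirror.
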
### SiteRuntime-029 — A delete (or disable) arriving during a pending redeploy over-decrements the counter and is undone by `HandleTerminated`

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:627`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:402` |

**Description**

`HandleDelete` and `HandleDisable` only consult `_instanceActors` to find the live
actor; neither consults the SiteRuntime-020 `_terminatingActorsByName` /
`_pendingRedeploys` bookkeeping. When a redeployment is in flight, `HandleDeploy`
has already removed the instance from `_instanceActors`, stopped the predecessor,
and buffered a `PendingRedeploy` keyed by the terminating ref. If a
`DeleteInstanceCommand` then arrives for that same instance *before* the
`Terminated` signal fires:

1. `_instanceActors.TryGetValue` misses (the entry was removed at redeploy time),
   so no actor is stopped.
2. `_totalDeployedCount = Math.Max(0, _totalDeployedCount - 1)` runs anyway,
   over-decrementing, because the redeploy path did not increment for an update
   (the count had already been adjusted), so the deployed/disabled counts reported
   to the health collector drift.
3. `RemoveDeployedConfigAsync` deletes the SQLite row and the deleter is told
   success.
4. The buffered `_pendingRedeploys` entry is **untouched**, so when `Terminated`
   fires, `HandleTerminated` calls `ApplyDeployment(..., isRedeploy: true)`, which
   re-creates the Instance Actor and re-writes the deployed config to SQLite,
   **resurrecting the instance the operator just deleted**, with the counter now
   inconsistent.

`HandleDisable` has the milder version: the disable persists `is_enabled = false`,
but `HandleTerminated` then re-stores the config with `isEnabled: true` and
re-creates the actor, so a disable issued mid-redeploy is silently reverted to
enabled.

The window is the redeploy-termination interval: small, but reliably hit when
central issues a delete/disable immediately after a deploy (e.g. an operator
correcting a mistaken deploy, or an automated teardown), exactly the kind of rapid
command sequence SiteRuntime-020 was filed to harden.

**Recommendation**

Make `HandleDelete`/`HandleDisable` authoritative over the mid-redeploy state:
before falling through, check `_terminatingActorsByName`. On a hit, drop the
buffered `_pendingRedeploys` entry (so `HandleTerminated` does not resurrect),
clear the shadow, and for delete tell the buffered redeploy's `OriginalSender` it
was superseded, mirroring the last-write-wins handling already in `HandleDeploy`.
Only decrement `_totalDeployedCount` when an instance was actually present
(in `_instanceActors` **or** terminating). Add a regression test:
deploy → redeploy → delete-before-Terminated asserts the instance stays deleted
and the counter is correct.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `HandleDelete`/`HandleDisable` now check `_terminatingActorsByName` before the `_instanceActors` fall-through; on a hit they drop the buffered `_pendingRedeploys` entry and clear the shadow so `HandleTerminated` can't resurrect the instance, and `_totalDeployedCount` is decremented only when an instance was actually present. Delete-during-redeploy race test added (verified failing pre-fix).

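The deploy → redeploy → delete-before-Terminated sequence can be traced with a simplified model of the bookkeeping. This is a hedged Python sketch, not the actor code: `DeploymentBookkeeping` and its method names are hypothetical, actor refs are plain strings, and the counter semantics (a redeploy leaves the deployed count unchanged) are an assumption made for illustration.

```python
class DeploymentBookkeeping:
    """Hypothetical model of the deployment manager's instance bookkeeping."""

    def __init__(self):
        self.instance_actors = {}      # name -> live actor ref
        self.terminating_by_name = {}  # name -> ref of the predecessor being stopped
        self.pending_redeploys = {}    # terminating ref -> (name, buffered config)
        self.total_deployed = 0
        self._next_ref = 0

    def _spawn(self, name):
        self._next_ref += 1
        self.instance_actors[name] = f"{name}#{self._next_ref}"

    def deploy(self, name, config):
        if name in self.instance_actors:
            # Redeploy: stop the predecessor and buffer the config until Terminated.
            old = self.instance_actors.pop(name)
            self.terminating_by_name[name] = old
            self.pending_redeploys[old] = (name, config)
        elif name in self.terminating_by_name:
            # Last write wins while the predecessor is still terminating.
            self.pending_redeploys[self.terminating_by_name[name]] = (name, config)
        else:
            self._spawn(name)
            self.total_deployed += 1

    def delete(self, name):
        present = False
        terminating = self.terminating_by_name.pop(name, None)
        if terminating is not None:
            # Drop the buffered redeploy so terminated() cannot resurrect the instance.
            self.pending_redeploys.pop(terminating, None)
            present = True
        if self.instance_actors.pop(name, None) is not None:
            present = True
        if present:
            self.total_deployed = max(0, self.total_deployed - 1)
        return present

    def terminated(self, ref):
        pending = self.pending_redeploys.pop(ref, None)
        if pending is None:
            return  # nothing buffered: the redeploy was superseded by a delete
        name, _config = pending
        self.terminating_by_name.pop(name, None)
        self._spawn(name)


book = DeploymentBookkeeping()
book.deploy("pump-1", {"v": 1})
book.deploy("pump-1", {"v": 2})            # redeploy in flight
old_ref = book.terminating_by_name["pump-1"]
deleted = book.delete("pump-1")            # arrives before Terminated
book.terminated(old_ref)                   # late Terminated signal
```

In this model the late `terminated` call finds no buffered redeploy, so the instance stays deleted and the counter ends at zero.
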
### SiteRuntime-030 — Native-alarm cap/retention-drop and `SiteReplicationActor` paths are untested

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/NativeAlarmActorTests.cs`, `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/` |

**Description**

`NativeAlarmActorTests` covers subscribe, raise, snapshot-swap return-to-normal,
out-of-order rejection, ack, and site-event emission, but there is **no** test for
`EnforceCap` (the per-source cap eviction) or for the `ApplyLiveTransition`
retention drop (`!Active && Acknowledged`). Both are non-trivial state
transitions, and the cap path harbours an observable defect (SiteRuntime-028) that
a targeted test would have caught. `SiteReplicationActor` remains entirely
untested: this is the carried-forward gap that SiteRuntime-016 explicitly deferred
to a clustered-ActorSystem harness, still outstanding at this commit (the actor
calls `Cluster.Get(Context.System)` in its constructor, so it needs a clustered
HOCON test host). The outbound forward, inbound apply, and `SendToPeer`
no-peer-drop behaviour are unverified.

**Recommendation**

Add `NativeAlarmActor` tests for (a) cap eviction emits a return-to-normal for an
evicted active condition (pairs with SiteRuntime-028) and (b) a resolved condition
(inactive+acked) drops out and deletes its SQLite row. Stand up the clustered
test harness SiteRuntime-016 called for and cover `SiteReplicationActor`'s
outbound/inbound/peer-tracking paths.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): native-alarm cap-eviction and retention-drop tests added (pairs with -027/-028). The clustered `SiteReplicationActor` test harness remains deferred (needs a clustered ActorSystem host, consistent with the prior -016 deferral).

### SiteRuntime-031 — Site still persists notification-list and SMTP config that the design moved to central-only

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:90`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Persistence/SiteStorageService.cs:105`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1383`, `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs:1417` |

**Description**

The design decision (CLAUDE.md "External Integrations" / Component-NotificationService;
echoed in Component-SiteRuntime.md "System-Wide Artifact Handling": *"Notification
lists and SMTP configuration are central-only and are not deployed to sites"*) makes
notification delivery central-only: sites store-and-forward to central and never
talk to SMTP. Yet SiteRuntime still:

- creates the `notification_lists` and `smtp_configurations` SQLite tables
  (`SiteStorageService.InitializeAsync`),
- writes them from `HandleDeployArtifacts` (`StoreNotificationListAsync` /
  `StoreSmtpConfigurationAsync`, DeploymentManagerActor.cs:1383/1417) and the
  replication path `SiteReplicationActor.HandleApplyArtifacts`, and
- accepts a `DeployArtifactsCommand` contract that still carries
  `NotificationLists` / `SmtpConfigurations`.

The SMTP path persists the SMTP `password` field (SiteStorageService.cs:693) into
plaintext site SQLite. If central no longer populates these (per the design), this
is dead code carrying a latent secret-at-rest footprint; if central still does, the
site is storing SMTP credentials it must never use. Both cases contradict the
central-only delivery decision. Either the code/tables are stale and should be
removed, or the design doc is stale and the decision needs re-stating. This straddles
the SiteRuntime/NotificationService boundary and the shared `DeployArtifactsCommand`
contract, so the direction is a design-owner call rather than a clean in-module fix.

**Recommendation**

Confirm with the design owner whether central still ships notification-list / SMTP
artifacts to sites. If not (the stated decision), remove the `notification_lists`
and `smtp_configurations` tables, the two `Store…Async` writers, the replication
branches, and the corresponding `DeployArtifactsCommand` fields (a coordinated
cross-module change). If the decision has changed, update Component-SiteRuntime.md
and Component-NotificationService.md and treat the plaintext SMTP password as a
secret-handling finding in its own right.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): the site no longer persists notification-list / SMTP config and purges any already-persisted rows (incl. the plaintext SMTP password) on both apply paths (`HandleDeployArtifacts`, replication `HandleApplyArtifacts`), inside the all-or-nothing apply. Paired with DeploymentManager-025 (central stops shipping). Tables retained but kept empty.

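A purge that runs inside the all-or-nothing apply could look like the following. This is a minimal Python/SQLite sketch for illustration only, not the module's C# storage code: the column layouts, the `shared_scripts` table, and the `make_db` / `apply_artifacts` helpers are hypothetical; only the `notification_lists` and `smtp_configurations` table names come from the finding.

```python
import sqlite3


def make_db():
    """Build an in-memory site database with stale notification/SMTP rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE notification_lists (name TEXT PRIMARY KEY);
        CREATE TABLE smtp_configurations (host TEXT PRIMARY KEY, password TEXT);
        CREATE TABLE shared_scripts (name TEXT PRIMARY KEY, body TEXT NOT NULL);
        INSERT INTO notification_lists VALUES ('ops');
        INSERT INTO smtp_configurations VALUES ('smtp.example.test', 'hunter2');
        """
    )
    return conn


def apply_artifacts(conn, shared_scripts):
    """Apply artifacts atomically and purge stale rows in the same transaction."""
    with conn:  # commits on success, rolls back if the body raises
        conn.execute("DELETE FROM notification_lists")
        conn.execute("DELETE FROM smtp_configurations")
        for name, body in shared_scripts:
            if not body:
                raise ValueError(f"empty script body for {name!r}")
            conn.execute(
                "INSERT OR REPLACE INTO shared_scripts (name, body) VALUES (?, ?)",
                (name, body),
            )


conn = make_db()
apply_artifacts(conn, [("util", "return 1;")])
remaining_smtp = conn.execute("SELECT COUNT(*) FROM smtp_configurations").fetchone()[0]
remaining_lists = conn.execute("SELECT COUNT(*) FROM notification_lists").fetchone()[0]
```

Because the purge shares the transaction with the rest of the apply, a failed apply in this sketch rolls the purge back as well, so the site never ends up half-applied.
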