code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
This commit is contained in:
Joseph Doherty
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
+335 -3
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Communication` |
| Design doc | `docs/requirements/Component-Communication.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -42,6 +42,47 @@ gRPC-supplied `correlation_id` flows straight into an Akka actor name
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
#### Re-review 2026-05-28 (commit `1eb6e97`)
All prior findings (Communication-001..015) remain `Resolved` in this commit. The
re-review walked all 10 checklist categories again on the surface that has not
been re-examined before — the central↔site command/control routing surface
(`CentralCommunicationActor`, `SiteCommunicationActor`) rather than the
previously-mined gRPC streaming surface — and uncovered a cluster of defects
around the connection-state-change workflow. The single material finding is
**`HandleConnectionStateChanged` is dead code**: no production code path emits
`ConnectionStateChanged`, so the documented "kill active debug streams for the
disconnected site" + "mark in-progress deployments as failed" workflow never
fires at runtime (Communication-016). The downstream consequence is
**`_inProgressDeployments` grows unboundedly** — entries are inserted on every
deployment but only cleaned via that dead path (Communication-017). Three
smaller items round out the re-review: site heartbeats hard-code
`IsActive: true` regardless of node role (Communication-018), the
60-second-periodic `LoadSiteAddressesFromDb` task has no CancellationToken so a
hung DB query has no upper bound (Communication-019), the
`SiteAddressCacheLoaded` internal message carries a mutable
`Dictionary`/`List` (Communication-020), `SiteStreamGrpcServer.SubscribeInstance`
leaks the StreamRelayActor if `_streamSubscriber.Subscribe` throws between
`ActorOf` and the `try` block (Communication-021), and `_debugSubscriptions`
keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
(Communication-022). Seven new findings, all Open: one High, one Medium, five
Low.
## Checklist coverage 2026-05-28
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `HandleConnectionStateChanged` and its `_inProgressDeployments` / `_debugSubscriptions` cleanup never fire — the connection-state workflow is dead (Communication-016, Communication-017). `_debugSubscriptions` correlation-ID overwrite risk (Communication-022). |
| 2 | Akka.NET conventions | ✓ | `SiteAddressCacheLoaded` carries mutable `Dictionary<string, List<string>>` — violates message-immutability convention (Communication-020). `Forward`/`PipeTo`/Sender-capture all clean. |
| 3 | Concurrency & thread safety | ✓ | All mutable state mutated on the actor thread. `_subscriptions` ConcurrentDictionary use disciplined. No new issues. |
| 4 | Error handling & resilience | ✓ | `LoadSiteAddressesFromDb` lacks a `CancellationToken` propagation point (Communication-019). `SubscribeInstance` leaks the relay actor if `Subscribe` throws pre-try (Communication-021). |
| 5 | Security | ✓ | Correlation-id validation in place (Communication-014). No new issues. |
| 6 | Performance & resource management | ✓ | `_inProgressDeployments` grows unboundedly (Communication-017). gRPC client/server lifecycles otherwise clean. |
| 7 | Design-document adherence | ✓ | `ConnectionStateChanged` handler is dead code — the doc-stated "kill streams on disconnect, fail in-progress deployments" workflow does not actually run (Communication-016). Site heartbeats always report `IsActive: true` regardless of role (Communication-018). |
| 8 | Code organization & conventions | ✓ | Options pattern correct; mapper placement and proto evolution are additive-only. No new issues. |
| 9 | Testing coverage | ✓ | `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled` exercises a code path that no production caller ever drives — gives false confidence (related to Communication-016). |
| 10 | Documentation & comments | ✓ | Detailed XML docs added in this commit. No new issues. |
## Checklist coverage
| # | Category | Examined | Notes |
@@ -726,3 +767,294 @@ gained `On_GrpcError_Reconnects_To_Other_Node_Endpoint`, which uses a new
per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's
NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather
than being masked by an endpoint-agnostic mock.
### Communication-016 — `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:169`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:338-375` |
**Description**
`CentralCommunicationActor.HandleConnectionStateChanged` is wired to
`Receive<ConnectionStateChanged>` and implements two important workflows on
`IsConnected == false`: (1) kill every active debug stream for the disconnected
site (`_debugSubscriptions` walk → `DebugStreamTerminated` Tell to each
subscriber); (2) mark every in-progress deployment for that site as failed
(`_inProgressDeployments` walk → entry removal). Both are documented in the
component design doc's "Connection Failure Behavior" section and in WP-5 of the
work plan referenced in the class's own XML doc comment.
A repo-wide search (`grep -rn ConnectionStateChanged src/ tests/`) shows **no
production code ever emits `ConnectionStateChanged`**. The only producers are
the unit test `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
(line 137) and the Commons message-roundtrip test. The
`CentralCommunicationActor` therefore never receives one in production, the
disconnect-cleanup workflow never fires, and `_debugSubscriptions` /
`_inProgressDeployments` are never pruned via this path.
Concrete consequences:
- A site goes down → its active debug streams do **not** get a synchronous
`DebugStreamTerminated` notification from central. The bridge actor must
detect the disconnect itself via gRPC keepalive timing out (~25s) or TCP RST.
Subscribers wait that long for the `OnStreamTerminated` callback instead of
the documented "immediately killed by central" behaviour.
- In-progress deployments to a disconnected site continue to occupy the
Ask-reply window and only fail when the Ask times out at the
`CommunicationService.DeployInstanceAsync` layer (120s). They are never
proactively marked failed.
- The unit test gives a strong false impression that the workflow works — it
exercises a code path that has no production caller.
The design doc and CLAUDE.md mention "ClusterClient handles failover between
NodeA and NodeB internally — there is no application-level NodeA preference /
NodeB fallback logic" — so the ClusterClient mechanism is the documented
failover transport. But that says nothing about *signalling* a fully-down
remote cluster to central's coordinator actor, which is exactly what
`ConnectionStateChanged` was meant to do.
**Recommendation**
Pick one of:
- Wire a producer for `ConnectionStateChanged` — e.g. subscribe to
`ClusterClient`'s contact-point/cluster events (`ClusterClient.ContactPoints`
Refresh / `ContactPointAdded` / `ContactPointRemoved`) or watch the
ClusterClient actor for a "no contact points reachable" state — and have it
publish `ConnectionStateChanged` to `Self` on each transition.
- If the documented "synchronously kill streams on disconnect" behaviour is
intentionally being dropped in favour of the slower keepalive-based
detection, delete the handler, the `ConnectionStateChanged` record, and the
related `_debugSubscriptions` / `_inProgressDeployments` tracking, then
update the design doc's "Connection Failure Behavior" section accordingly.
Either way, replace `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
— at present it asserts a behaviour that no production code triggers.
---
### Communication-017 — `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:73`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:501`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:357-367` |
**Description**
`TrackMessageForCleanup` inserts `_inProgressDeployments[deploy.DeploymentId] =
envelope.SiteId` on every `DeployInstanceCommand` routed to a site (line 501).
The only places that *remove* from `_inProgressDeployments` are:
- `HandleConnectionStateChanged` on `IsConnected == false` (line 366) — which
per Communication-016 never fires in production.
- `PostStop` (line 553) — only on actor death (central failover).
There is **no removal on the normal happy path** — neither when the site replies
`DeploymentStatusResponse` (the reply goes to the Ask's temporary reply actor,
not back through `CentralCommunicationActor`), nor on Ask timeout. Every
successful or failed deployment leaves its entry behind for the lifetime of the
process.
Memory impact is modest (each entry is ~70-100 bytes), but the dictionary grows
monotonically. Over months of operation across all sites a central node could
accumulate tens of thousands of entries — a real, observable leak. More
seriously, the field is *also* the source-of-truth set the
`HandleConnectionStateChanged` walk uses to fail in-progress deployments, so
even if a `ConnectionStateChanged` *were* fired today, the walk would
"fail" thousands of already-completed deployments and Tell their (now stale)
correlation-IDs into the void.
`_debugSubscriptions` (line 67) shares the same shape — but a normal debug
session ends with an `UnsubscribeDebugViewRequest` that *does* drive cleanup
(line 497), so leaks are only realised when a consumer crashes without
unsubscribing.
**Recommendation**
Either remove `_inProgressDeployments` entirely (it has no other consumer once
Communication-016 is fixed by deletion) or, if the disconnect-cleanup workflow
is retained, add a removal hook on the reply path. The simplest fix is to
subscribe `CentralCommunicationActor` to the Ask reply: route
`DeployInstanceCommand` through the actor with the actor as the Ask sender,
forward the reply to the original caller, and `_inProgressDeployments.Remove`
in the same handler. (Today the Ask is taken on the *actor* itself by the
caller, so the reply skips the coordinator.)
---
### Communication-018 — Site heartbeats hard-code `IsActive: true` regardless of node role
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:357-371` |
**Description**
`SiteCommunicationActor.SendHeartbeatToCentral` builds
`new HeartbeatMessage(_siteId, hostname, IsActive: true, DateTimeOffset.UtcNow)`
on every periodic tick (line 366), with no inspection of whether this node is
actually the active site node or a standby. The `HeartbeatMessage.IsActive`
field thus carries the literal value `true` on every heartbeat from every
node, and the field is effectively dead — central's `HandleHeartbeat` doesn't
consume it either (line 297 only passes `SiteId` and `Timestamp` to
`MarkHeartbeat`).
Per CLAUDE.md's Cluster & Failover section the active/standby distinction is
real ("Both nodes are seed nodes", "keep-oldest split-brain resolver",
"automatic dual-node recovery"), so a heartbeat that *could* carry node-role
information would be useful for the central health dashboard distinguishing
"active node down, standby up" from "site fully offline". As shipped, the
field is contract noise and a future implementer might mistakenly assume it
already carries meaningful state.
**Recommendation**
Either (a) resolve the current cluster role at heartbeat-send time and pass it
through — e.g. `Cluster.Get(Context.System).SelfRoles.Contains("active")` or
the project's existing role mechanism — and have the central aggregator
consume `IsActive`; or (b) drop the `IsActive` field from `HeartbeatMessage`
(additive-only-evolution: deprecate the field, default to `true`, plan
removal in a major message contract revision).
---
### Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:397-431` |
**Description**
`LoadSiteAddressesFromDb` runs `await repo.GetAllSitesAsync()` inside
`Task.Run(async () => ...).PipeTo(self)` with no cancellation token (line 404).
The repository signature accepts `CancellationToken` (the test mock declares
`GetAllSitesAsync(Arg.Any<CancellationToken>())`), but the actor calls the
no-arg overload — so a hung MS SQL connection has no upper bound. The
60-second-periodic refresh keeps firing; each tick spawns a fresh `Task.Run`
that piles up if the database is consistently slow. The actor itself is
unaffected (it's not blocked), but pending tasks and DB connection-pool
resources accumulate, and the `Status.Failure` handler (Communication-006)
never fires because the task never faults — it just sits.
**Recommendation**
Maintain a per-load `CancellationTokenSource` with a deadline (e.g. the same
60s the refresh runs on, or a configurable timeout in `CommunicationOptions`).
Pass its `Token` to `GetAllSitesAsync`. Cancel the prior token before spinning
a new load to avoid task accumulation.
---
### Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:567` |
**Description**
The Akka.NET convention is that messages crossing actor boundaries (even
internal Self-messages over an async task boundary) are immutable.
`SiteAddressCacheLoaded(Dictionary<string, List<string>> SiteContacts)` is a
record but its `SiteContacts` payload is a mutable `Dictionary` whose values
are mutable `List<string>`. Constructed inside `Task.Run` and handed off to
the actor, the cache could in principle be mutated by either side; in
practice nothing does, but the type is a stale-evidence guarantee that
CLAUDE.md's "message immutability" rule is being followed only by convention.
**Recommendation**
Change the record signature to use `IReadOnlyDictionary<string, IReadOnlyList<string>>`
(or `ImmutableDictionary` / `ImmutableArray<string>`) and freeze the data
before piping. The cost is negligible — the payload is built and consumed
once per refresh tick.
---
### Communication-021 — `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:188-200` |
**Description**
`SubscribeInstance` performs these statements in order (lines 189-194), all
*before* the `try` block at line 200:
1. `Interlocked.Increment(ref _actorCounter)`
2. `_actorSystem!.ActorOf(Props.Create(typeof(StreamRelayActor), ...))`
3. `_streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor)`
If step 3 throws (the subscriber is wired but its `Subscribe` faults — a stale
instance name, a temporary index lookup failure, etc.), the exception escapes
the method as an unhandled `RpcException` *and* leaks the freshly-created
`relayActor`. The `finally` block at line 211 is unreachable because the
throw happens before the `try`. The actor's `_activeStreams` entry, the
`StreamEntry.Cts`, and the `Channel<SiteStreamEvent>` are also leaked.
In normal operation `_streamSubscriber.Subscribe` does not throw, so the bug is
latent — but a misbehaving site runtime (e.g. `SiteStreamManager` faulted
because the actor system is shutting down) would surface it.
**Recommendation**
Restructure to either (a) wrap the `Subscribe` call in a `try` whose `catch`
stops the relay actor and disposes the CTS, or (b) move the actor + subscriber
creation *inside* the existing `try` block (the `finally` will then handle
cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down
past the `try {` brace.
---
### Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:493` |
**Description**
`TrackMessageForCleanup` on `SubscribeDebugViewRequest` does
`_debugSubscriptions[sub.CorrelationId] = (envelope.SiteId, Sender)` (line 493).
The dictionary indexer silently overwrites any prior entry for the same
`CorrelationId`. If two debug sessions ever reuse the same correlation ID (e.g.
two Blazor users start a stream at the same moment with a non-GUID id, or a
caller bug, or a malicious caller as flagged in the cousin
Communication-014), the first subscriber's entry is overwritten and lost —
on a later `ConnectionStateChanged(false)` (per Communication-016 it never
actually fires today, but the design intent stands), only the *second*
subscriber would be notified of the disconnect.
`DebugStreamService.StartStreamAsync` uses `Guid.NewGuid().ToString("N")` as
the session id (`DebugStreamService.cs:97`), so a real collision is
astronomically unlikely in normal operation. But the central side is not
defending itself: a CLI consumer or a future caller is implicitly trusted to
generate globally-unique ids.
**Recommendation**
When the slot is already occupied, log a Warning and either reject the new
subscription with an error response or evict the prior subscriber via
`DebugStreamTerminated` before installing the new one. Mirrors the
`SiteStreamGrpcServer` defensive behaviour where a duplicate `correlation_id`
cancels the existing stream (line 167).