code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97
Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.
regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
This commit is contained in:
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Communication` |
|
||||
| Design doc | `docs/requirements/Component-Communication.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 7 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -42,6 +42,47 @@ gRPC-supplied `correlation_id` flows straight into an Akka actor name
|
||||
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
|
||||
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
|
||||
|
||||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||||
|
||||
All prior findings (Communication-001..015) remain `Resolved` in this commit. The
|
||||
re-review walked all 10 checklist categories again on the surface that has not
|
||||
been re-examined before — the central↔site command/control routing surface
|
||||
(`CentralCommunicationActor`, `SiteCommunicationActor`) rather than the
|
||||
previously-mined gRPC streaming surface — and uncovered a cluster of defects
|
||||
around the connection-state-change workflow. The single material finding is
|
||||
**`HandleConnectionStateChanged` is dead code**: no production code path emits
|
||||
`ConnectionStateChanged`, so the documented "kill active debug streams for the
|
||||
disconnected site" + "mark in-progress deployments as failed" workflow never
|
||||
fires at runtime (Communication-016). The downstream consequence is
|
||||
**`_inProgressDeployments` grows unboundedly** — entries are inserted on every
|
||||
deployment but only cleaned via that dead path (Communication-017). Three
|
||||
smaller items round out the re-review: site heartbeats hard-code
|
||||
`IsActive: true` regardless of node role (Communication-018), the
|
||||
60-second-periodic `LoadSiteAddressesFromDb` task has no CancellationToken so a
|
||||
hung DB query has no upper bound (Communication-019), the
|
||||
`SiteAddressCacheLoaded` internal message carries a mutable
|
||||
`Dictionary`/`List` (Communication-020), `SiteStreamGrpcServer.SubscribeInstance`
|
||||
leaks the StreamRelayActor if `_streamSubscriber.Subscribe` throws between
|
||||
`ActorOf` and the `try` block (Communication-021), and `_debugSubscriptions`
|
||||
keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
|
||||
(Communication-022). Seven new findings, all Open: one High, one Medium, five
|
||||
Low.
|
||||
|
||||
## Checklist coverage 2026-05-28
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | `HandleConnectionStateChanged` and its `_inProgressDeployments` / `_debugSubscriptions` cleanup never fire — the connection-state workflow is dead (Communication-016, Communication-017). `_debugSubscriptions` correlation-ID overwrite risk (Communication-022). |
|
||||
| 2 | Akka.NET conventions | ✓ | `SiteAddressCacheLoaded` carries mutable `Dictionary<string, List<string>>` — violates message-immutability convention (Communication-020). `Forward`/`PipeTo`/Sender-capture all clean. |
|
||||
| 3 | Concurrency & thread safety | ✓ | All mutable state mutated on the actor thread. `_subscriptions` ConcurrentDictionary use disciplined. No new issues. |
|
||||
| 4 | Error handling & resilience | ✓ | `LoadSiteAddressesFromDb` lacks a `CancellationToken` propagation point (Communication-019). `SubscribeInstance` leaks the relay actor if `Subscribe` throws pre-try (Communication-021). |
|
||||
| 5 | Security | ✓ | Correlation-id validation in place (Communication-014). No new issues. |
|
||||
| 6 | Performance & resource management | ✓ | `_inProgressDeployments` grows unboundedly (Communication-017). gRPC client/server lifecycles otherwise clean. |
|
||||
| 7 | Design-document adherence | ✓ | `ConnectionStateChanged` handler is dead code — the doc-stated "kill streams on disconnect, fail in-progress deployments" workflow does not actually run (Communication-016). Site heartbeats always report `IsActive: true` regardless of role (Communication-018). |
|
||||
| 8 | Code organization & conventions | ✓ | Options pattern correct; mapper placement and proto evolution are additive-only. No new issues. |
|
||||
| 9 | Testing coverage | ✓ | `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled` exercises a code path that no production caller ever drives — gives false confidence (related to Communication-016). |
|
||||
| 10 | Documentation & comments | ✓ | Detailed XML docs added in this commit. No new issues. |
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -726,3 +767,294 @@ gained `On_GrpcError_Reconnects_To_Other_Node_Endpoint`, which uses a new
|
||||
per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's
|
||||
NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather
|
||||
than being masked by an endpoint-agnostic mock.
|
||||
|
||||
### Communication-016 — `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:169`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:338-375` |
|
||||
|
||||
**Description**
|
||||
|
||||
`CentralCommunicationActor.HandleConnectionStateChanged` is wired to
|
||||
`Receive<ConnectionStateChanged>` and implements two important workflows on
|
||||
`IsConnected == false`: (1) kill every active debug stream for the disconnected
|
||||
site (`_debugSubscriptions` walk → `DebugStreamTerminated` Tell to each
|
||||
subscriber); (2) mark every in-progress deployment for that site as failed
|
||||
(`_inProgressDeployments` walk → entry removal). Both are documented in the
|
||||
component design doc's "Connection Failure Behavior" section and in WP-5 of the
|
||||
work plan referenced in the class's own XML doc comment.
|
||||
|
||||
A repo-wide search (`grep -rn ConnectionStateChanged src/ tests/`) shows **no
|
||||
production code ever emits `ConnectionStateChanged`**. The only producers are
|
||||
the unit test `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
|
||||
(line 137) and the Commons message-roundtrip test. The
|
||||
`CentralCommunicationActor` therefore never receives one in production, the
|
||||
disconnect-cleanup workflow never fires, and `_debugSubscriptions` /
|
||||
`_inProgressDeployments` are never pruned via this path.
|
||||
|
||||
Concrete consequences:
|
||||
- A site goes down → its active debug streams do **not** get a synchronous
|
||||
`DebugStreamTerminated` notification from central. The bridge actor must
|
||||
detect the disconnect itself via gRPC keepalive timing out (~25s) or TCP RST.
|
||||
Subscribers wait that long for the `OnStreamTerminated` callback instead of
|
||||
the documented "immediately killed by central" behaviour.
|
||||
- In-progress deployments to a disconnected site continue to occupy the
|
||||
Ask-reply window and only fail when the Ask times out at the
|
||||
`CommunicationService.DeployInstanceAsync` layer (120s). They are never
|
||||
proactively marked failed.
|
||||
- The unit test gives a strong false impression that the workflow works — it
|
||||
exercises a code path that has no production caller.
|
||||
|
||||
The design doc and CLAUDE.md mention "ClusterClient handles failover between
|
||||
NodeA and NodeB internally — there is no application-level NodeA preference /
|
||||
NodeB fallback logic" — so the ClusterClient mechanism is the documented
|
||||
failover transport. But that says nothing about *signalling* a fully-down
|
||||
remote cluster to central's coordinator actor, which is exactly what
|
||||
`ConnectionStateChanged` was meant to do.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Pick one of:
|
||||
- Wire a producer for `ConnectionStateChanged` — e.g. subscribe to
|
||||
`ClusterClient`'s contact-point/cluster events (`ClusterClient.ContactPoints`
|
||||
Refresh / `ContactPointAdded` / `ContactPointRemoved`) or watch the
|
||||
ClusterClient actor for a "no contact points reachable" state — and have it
|
||||
publish `ConnectionStateChanged` to `Self` on each transition.
|
||||
- If the documented "synchronously kill streams on disconnect" behaviour is
|
||||
intentionally being dropped in favour of the slower keepalive-based
|
||||
detection, delete the handler, the `ConnectionStateChanged` record, and the
|
||||
related `_debugSubscriptions` / `_inProgressDeployments` tracking, then
|
||||
update the design doc's "Connection Failure Behavior" section accordingly.
|
||||
|
||||
Either way, replace `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
|
||||
— at present it asserts a behaviour that no production code triggers.
|
||||
|
||||
---
|
||||
|
||||
### Communication-017 — `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Performance & resource management |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:73`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:501`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:357-367` |
|
||||
|
||||
**Description**
|
||||
|
||||
`TrackMessageForCleanup` inserts `_inProgressDeployments[deploy.DeploymentId] =
|
||||
envelope.SiteId` on every `DeployInstanceCommand` routed to a site (line 501).
|
||||
The only places that *remove* from `_inProgressDeployments` are:
|
||||
- `HandleConnectionStateChanged` on `IsConnected == false` (line 366) — which
|
||||
per Communication-016 never fires in production.
|
||||
- `PostStop` (line 553) — only on actor death (central failover).
|
||||
|
||||
There is **no removal on the normal happy path** — neither when the site replies
|
||||
`DeploymentStatusResponse` (the reply goes to the Ask's temporary reply actor,
|
||||
not back through `CentralCommunicationActor`), nor on Ask timeout. Every
|
||||
successful or failed deployment leaves its entry behind for the lifetime of the
|
||||
process.
|
||||
|
||||
Memory impact is modest (each entry is ~70-100 bytes), but the dictionary grows
|
||||
monotonically. Over months of operation across all sites a central node could
|
||||
accumulate tens of thousands of entries — a real, observable leak. More
|
||||
seriously, the field is *also* the source-of-truth set the
|
||||
`HandleConnectionStateChanged` walk uses to fail in-progress deployments, so
|
||||
even if a `ConnectionStateChanged` *were* fired today, the walk would
|
||||
"fail" thousands of already-completed deployments and Tell their (now stale)
|
||||
correlation-IDs into the void.
|
||||
|
||||
`_debugSubscriptions` (line 67) shares the same shape — but a normal debug
|
||||
session ends with an `UnsubscribeDebugViewRequest` that *does* drive cleanup
|
||||
(line 497), so leaks are only realised when a consumer crashes without
|
||||
unsubscribing.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either remove `_inProgressDeployments` entirely (it has no other consumer once
|
||||
Communication-016 is fixed by deletion) or, if the disconnect-cleanup workflow
|
||||
is retained, add a removal hook on the reply path. The simplest fix is to
|
||||
subscribe `CentralCommunicationActor` to the Ask reply: route
|
||||
`DeployInstanceCommand` through the actor with the actor as the Ask sender,
|
||||
forward the reply to the original caller, and `_inProgressDeployments.Remove`
|
||||
in the same handler. (Today the Ask is taken on the *actor* itself by the
|
||||
caller, so the reply skips the coordinator.)
|
||||
|
||||
---
|
||||
|
||||
### Communication-018 — Site heartbeats hard-code `IsActive: true` regardless of node role
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:357-371` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SiteCommunicationActor.SendHeartbeatToCentral` builds
|
||||
`new HeartbeatMessage(_siteId, hostname, IsActive: true, DateTimeOffset.UtcNow)`
|
||||
on every periodic tick (line 366), with no inspection of whether this node is
|
||||
actually the active site node or a standby. The `HeartbeatMessage.IsActive`
|
||||
field thus carries the literal value `true` on every heartbeat from every
|
||||
node, and the field is effectively dead — central's `HandleHeartbeat` doesn't
|
||||
consume it either (line 297 only passes `SiteId` and `Timestamp` to
|
||||
`MarkHeartbeat`).
|
||||
|
||||
Per CLAUDE.md's Cluster & Failover section the active/standby distinction is
|
||||
real ("Both nodes are seed nodes", "keep-oldest split-brain resolver",
|
||||
"automatic dual-node recovery"), so a heartbeat that *could* carry node-role
|
||||
information would be useful for the central health dashboard distinguishing
|
||||
"active node down, standby up" from "site fully offline". As shipped, the
|
||||
field is contract noise and a future implementer might mistakenly assume it
|
||||
already carries meaningful state.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) resolve the current cluster role at heartbeat-send time and pass it
|
||||
through — e.g. `Cluster.Get(Context.System).SelfRoles.Contains("active")` or
|
||||
the project's existing role mechanism — and have the central aggregator
|
||||
consume `IsActive`; or (b) drop the `IsActive` field from `HeartbeatMessage`
|
||||
(additive-only-evolution: deprecate the field, default to `true`, plan
|
||||
removal in a major message contract revision).
|
||||
|
||||
---
|
||||
|
||||
### Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:397-431` |
|
||||
|
||||
**Description**
|
||||
|
||||
`LoadSiteAddressesFromDb` runs `await repo.GetAllSitesAsync()` inside
|
||||
`Task.Run(async () => ...).PipeTo(self)` with no cancellation token (line 404).
|
||||
The repository signature accepts `CancellationToken` (the test mock declares
|
||||
`GetAllSitesAsync(Arg.Any<CancellationToken>())`), but the actor calls the
|
||||
no-arg overload — so a hung MS SQL connection has no upper bound. The
|
||||
60-second-periodic refresh keeps firing; each tick spawns a fresh `Task.Run`
|
||||
that piles up if the database is consistently slow. The actor itself is
|
||||
unaffected (it's not blocked), but pending tasks and DB connection-pool
|
||||
resources accumulate, and the `Status.Failure` handler (Communication-006)
|
||||
never fires because the task never faults — it just sits.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Maintain a per-load `CancellationTokenSource` with a deadline (e.g. the same
|
||||
60s the refresh runs on, or a configurable timeout in `CommunicationOptions`).
|
||||
Pass its `Token` to `GetAllSitesAsync`. Cancel the prior token before spinning
|
||||
a new load to avoid task accumulation.
|
||||
|
||||
---
|
||||
|
||||
### Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Akka.NET conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:567` |
|
||||
|
||||
**Description**
|
||||
|
||||
The Akka.NET convention is that messages crossing actor boundaries (even
|
||||
internal Self-messages over an async task boundary) are immutable.
|
||||
`SiteAddressCacheLoaded(Dictionary<string, List<string>> SiteContacts)` is a
|
||||
record but its `SiteContacts` payload is a mutable `Dictionary` whose values
|
||||
are mutable `List<string>`. Constructed inside `Task.Run` and handed off to
|
||||
the actor, the cache could in principle be mutated by either side; in
|
||||
practice nothing does, but the type is a stale-evidence guarantee that
|
||||
CLAUDE.md's "message immutability" rule is being followed only by convention.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Change the record signature to use `IReadOnlyDictionary<string, IReadOnlyList<string>>`
|
||||
(or `ImmutableDictionary` / `ImmutableArray<string>`) and freeze the data
|
||||
before piping. The cost is negligible — the payload is built and consumed
|
||||
once per refresh tick.
|
||||
|
||||
---
|
||||
|
||||
### Communication-021 — `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:188-200` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SubscribeInstance` performs these statements in order (lines 189-194), all
|
||||
*before* the `try` block at line 200:
|
||||
1. `Interlocked.Increment(ref _actorCounter)`
|
||||
2. `_actorSystem!.ActorOf(Props.Create(typeof(StreamRelayActor), ...))`
|
||||
3. `_streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor)`
|
||||
|
||||
If step 3 throws (the subscriber is wired but its `Subscribe` faults — a stale
|
||||
instance name, a temporary index lookup failure, etc.), the exception escapes
|
||||
the method as an unhandled `RpcException` *and* leaks the freshly-created
|
||||
`relayActor`. The `finally` block at line 211 is unreachable because the
|
||||
throw happens before the `try`. The actor's `_activeStreams` entry, the
|
||||
`StreamEntry.Cts`, and the `Channel<SiteStreamEvent>` are also leaked.
|
||||
|
||||
In normal operation `_streamSubscriber.Subscribe` does not throw, so the bug is
|
||||
latent — but a misbehaving site runtime (e.g. `SiteStreamManager` faulted
|
||||
because the actor system is shutting down) would surface it.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Restructure to either (a) wrap the `Subscribe` call in a `try` whose `catch`
|
||||
stops the relay actor and disposes the CTS, or (b) move the actor + subscriber
|
||||
creation *inside* the existing `try` block (the `finally` will then handle
|
||||
cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down
|
||||
past the `try {` brace.
|
||||
|
||||
---
|
||||
|
||||
### Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:493` |
|
||||
|
||||
**Description**
|
||||
|
||||
`TrackMessageForCleanup` on `SubscribeDebugViewRequest` does
|
||||
`_debugSubscriptions[sub.CorrelationId] = (envelope.SiteId, Sender)` (line 493).
|
||||
The dictionary indexer silently overwrites any prior entry for the same
|
||||
`CorrelationId`. If two debug sessions ever reuse the same correlation ID (e.g.
|
||||
two Blazor users start a stream at the same moment with a non-GUID id, or a
|
||||
caller bug, or a malicious caller as flagged in the cousin
|
||||
Communication-014), the first subscriber's entry is overwritten and lost —
|
||||
on a later `ConnectionStateChanged(false)` (per Communication-016 it never
|
||||
actually fires today, but the design intent stands), only the *second*
|
||||
subscriber would be notified of the disconnect.
|
||||
|
||||
`DebugStreamService.StartStreamAsync` uses `Guid.NewGuid().ToString("N")` as
|
||||
the session id (`DebugStreamService.cs:97`), so a real collision is
|
||||
astronomically unlikely in normal operation. But the central side is not
|
||||
defending itself: a CLI consumer or a future caller is implicitly trusted to
|
||||
generate globally-unique ids.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
When the slot is already occupied, log a Warning and either reject the new
|
||||
subscription with an error response or evict the prior subscriber via
|
||||
`DebugStreamTerminated` before installing the new one. Mirrors the
|
||||
`SiteStreamGrpcServer` defensive behaviour where a duplicate `correlation_id`
|
||||
cancels the existing stream (line 167).
|
||||
|
||||
Reference in New Issue
Block a user