code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including first-time reviews of the four newer components (AuditLog, NotificationOutbox, SiteCallAudit, Transport) — so the code-reviews/ index reflects today's codebase rather than the 2026-05-16 baseline. 172 new Open findings (0 Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules. regen-readme.py now derives each module's Last reviewed + Commit from its findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future single-module re-reviews show their own date in the Module Status table.
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.Communication` |
 | Design doc | `docs/requirements/Component-Communication.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |

 ## Summary

@@ -42,6 +42,47 @@ gRPC-supplied `correlation_id` flows straight into an Akka actor name
 (Communication-014), and the factory's endpoint-reuse defect is masked by the test
 mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+All prior findings (Communication-001..015) remain `Resolved` in this commit. The
+re-review walked all 10 checklist categories again on the surface that has not
+been re-examined before — the central↔site command/control routing surface
+(`CentralCommunicationActor`, `SiteCommunicationActor`) rather than the
+previously-mined gRPC streaming surface — and uncovered a cluster of defects
+around the connection-state-change workflow. The single material finding is
+**`HandleConnectionStateChanged` is dead code**: no production code path emits
+`ConnectionStateChanged`, so the documented "kill active debug streams for the
+disconnected site" + "mark in-progress deployments as failed" workflow never
+fires at runtime (Communication-016). The downstream consequence is
+**`_inProgressDeployments` grows unboundedly** — entries are inserted on every
+deployment but only cleaned via that dead path (Communication-017). Three
+smaller items round out the re-review: site heartbeats hard-code
+`IsActive: true` regardless of node role (Communication-018), the
+60-second-periodic `LoadSiteAddressesFromDb` task has no CancellationToken so a
+hung DB query has no upper bound (Communication-019), the
+`SiteAddressCacheLoaded` internal message carries a mutable
+`Dictionary`/`List` (Communication-020), `SiteStreamGrpcServer.SubscribeInstance`
+leaks the StreamRelayActor if `_streamSubscriber.Subscribe` throws between
+`ActorOf` and the `try` block (Communication-021), and `_debugSubscriptions`
+keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
+(Communication-022). Seven new findings, all Open: one High, one Medium, five
+Low.
+
+## Checklist coverage 2026-05-28
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | `HandleConnectionStateChanged` and its `_inProgressDeployments` / `_debugSubscriptions` cleanup never fire — the connection-state workflow is dead (Communication-016, Communication-017). `_debugSubscriptions` correlation-ID overwrite risk (Communication-022). |
+| 2 | Akka.NET conventions | ✓ | `SiteAddressCacheLoaded` carries mutable `Dictionary<string, List<string>>` — violates message-immutability convention (Communication-020). `Forward`/`PipeTo`/Sender-capture all clean. |
+| 3 | Concurrency & thread safety | ✓ | All mutable state mutated on the actor thread. `_subscriptions` ConcurrentDictionary use disciplined. No new issues. |
+| 4 | Error handling & resilience | ✓ | `LoadSiteAddressesFromDb` lacks a `CancellationToken` propagation point (Communication-019). `SubscribeInstance` leaks the relay actor if `Subscribe` throws pre-try (Communication-021). |
+| 5 | Security | ✓ | Correlation-id validation in place (Communication-014). No new issues. |
+| 6 | Performance & resource management | ✓ | `_inProgressDeployments` grows unboundedly (Communication-017). gRPC client/server lifecycles otherwise clean. |
+| 7 | Design-document adherence | ✓ | `ConnectionStateChanged` handler is dead code — the doc-stated "kill streams on disconnect, fail in-progress deployments" workflow does not actually run (Communication-016). Site heartbeats always report `IsActive: true` regardless of role (Communication-018). |
+| 8 | Code organization & conventions | ✓ | Options pattern correct; mapper placement and proto evolution are additive-only. No new issues. |
+| 9 | Testing coverage | ✓ | `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled` exercises a code path that no production caller ever drives — gives false confidence (related to Communication-016). |
+| 10 | Documentation & comments | ✓ | Detailed XML docs added in this commit. No new issues. |
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -726,3 +767,294 @@ gained `On_GrpcError_Reconnects_To_Other_Node_Endpoint`, which uses a new
 per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's
 NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather
 than being masked by an endpoint-agnostic mock.
+
+### Communication-016 — `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:169`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:338-375` |
+
+**Description**
+
+`CentralCommunicationActor.HandleConnectionStateChanged` is wired to
+`Receive<ConnectionStateChanged>` and implements two important workflows on
+`IsConnected == false`: (1) kill every active debug stream for the disconnected
+site (`_debugSubscriptions` walk → `DebugStreamTerminated` Tell to each
+subscriber); (2) mark every in-progress deployment for that site as failed
+(`_inProgressDeployments` walk → entry removal). Both are documented in the
+component design doc's "Connection Failure Behavior" section and in WP-5 of the
+work plan referenced in the class's own XML doc comment.
+
+A repo-wide search (`grep -rn ConnectionStateChanged src/ tests/`) shows **no
+production code ever emits `ConnectionStateChanged`**. The only producers are
+the unit test `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
+(line 137) and the Commons message-roundtrip test. The
+`CentralCommunicationActor` therefore never receives one in production, the
+disconnect-cleanup workflow never fires, and `_debugSubscriptions` /
+`_inProgressDeployments` are never pruned via this path.
+
+Concrete consequences:
+- A site goes down → its active debug streams do **not** get a synchronous
+  `DebugStreamTerminated` notification from central. The bridge actor must
+  detect the disconnect itself via gRPC keepalive timing out (~25s) or TCP RST.
+  Subscribers wait that long for the `OnStreamTerminated` callback instead of
+  the documented "immediately killed by central" behaviour.
+- In-progress deployments to a disconnected site continue to occupy the
+  Ask-reply window and only fail when the Ask times out at the
+  `CommunicationService.DeployInstanceAsync` layer (120s). They are never
+  proactively marked failed.
+- The unit test gives a strong false impression that the workflow works — it
+  exercises a code path that has no production caller.
+
+The design doc and CLAUDE.md mention "ClusterClient handles failover between
+NodeA and NodeB internally — there is no application-level NodeA preference /
+NodeB fallback logic" — so the ClusterClient mechanism is the documented
+failover transport. But that says nothing about *signalling* a fully-down
+remote cluster to central's coordinator actor, which is exactly what
+`ConnectionStateChanged` was meant to do.
+
+**Recommendation**
+
+Pick one of:
+- Wire a producer for `ConnectionStateChanged` — e.g. subscribe to
+  `ClusterClient`'s contact-point/cluster events (`ClusterClient.ContactPoints`
+  Refresh / `ContactPointAdded` / `ContactPointRemoved`) or watch the
+  ClusterClient actor for a "no contact points reachable" state — and have it
+  publish `ConnectionStateChanged` to `Self` on each transition.
+- If the documented "synchronously kill streams on disconnect" behaviour is
+  intentionally being dropped in favour of the slower keepalive-based
+  detection, delete the handler, the `ConnectionStateChanged` record, and the
+  related `_debugSubscriptions` / `_inProgressDeployments` tracking, then
+  update the design doc's "Connection Failure Behavior" section accordingly.
+
+Either way, replace `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
+— at present it asserts a behaviour that no production code triggers.
+
+---
+
+### Communication-017 — `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:73`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:501`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:357-367` |
+
+**Description**
+
+`TrackMessageForCleanup` inserts `_inProgressDeployments[deploy.DeploymentId] =
+envelope.SiteId` on every `DeployInstanceCommand` routed to a site (line 501).
+The only places that *remove* from `_inProgressDeployments` are:
+- `HandleConnectionStateChanged` on `IsConnected == false` (line 366) — which
+  per Communication-016 never fires in production.
+- `PostStop` (line 553) — only on actor death (central failover).
+
+There is **no removal on the normal happy path** — neither when the site replies
+`DeploymentStatusResponse` (the reply goes to the Ask's temporary reply actor,
+not back through `CentralCommunicationActor`), nor on Ask timeout. Every
+successful or failed deployment leaves its entry behind for the lifetime of the
+process.
+
+Memory impact is modest (each entry is ~70-100 bytes), but the dictionary grows
+monotonically. Over months of operation across all sites a central node could
+accumulate tens of thousands of entries — a real, observable leak. More
+seriously, the field is *also* the source-of-truth set the
+`HandleConnectionStateChanged` walk uses to fail in-progress deployments, so
+even if a `ConnectionStateChanged` *were* fired today, the walk would
+"fail" thousands of already-completed deployments and Tell their (now stale)
+correlation-IDs into the void.
+
+`_debugSubscriptions` (line 67) shares the same shape — but a normal debug
+session ends with an `UnsubscribeDebugViewRequest` that *does* drive cleanup
+(line 497), so leaks are only realised when a consumer crashes without
+unsubscribing.
+
+**Recommendation**
+
+Either remove `_inProgressDeployments` entirely (it has no other consumer once
+Communication-016 is fixed by deletion) or, if the disconnect-cleanup workflow
+is retained, add a removal hook on the reply path. The simplest fix is to
+subscribe `CentralCommunicationActor` to the Ask reply: route
+`DeployInstanceCommand` through the actor with the actor as the Ask sender,
+forward the reply to the original caller, and `_inProgressDeployments.Remove`
+in the same handler. (Today the Ask is taken on the *actor* itself by the
+caller, so the reply skips the coordinator.)
+
+---
+
+### Communication-018 — Site heartbeats hard-code `IsActive: true` regardless of node role
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:357-371` |
+
+**Description**
+
+`SiteCommunicationActor.SendHeartbeatToCentral` builds
+`new HeartbeatMessage(_siteId, hostname, IsActive: true, DateTimeOffset.UtcNow)`
+on every periodic tick (line 366), with no inspection of whether this node is
+actually the active site node or a standby. The `HeartbeatMessage.IsActive`
+field thus carries the literal value `true` on every heartbeat from every
+node, and the field is effectively dead — central's `HandleHeartbeat` doesn't
+consume it either (line 297 only passes `SiteId` and `Timestamp` to
+`MarkHeartbeat`).
+
+Per CLAUDE.md's Cluster & Failover section the active/standby distinction is
+real ("Both nodes are seed nodes", "keep-oldest split-brain resolver",
+"automatic dual-node recovery"), so a heartbeat that *could* carry node-role
+information would be useful for the central health dashboard distinguishing
+"active node down, standby up" from "site fully offline". As shipped, the
+field is contract noise and a future implementer might mistakenly assume it
+already carries meaningful state.
+
+**Recommendation**
+
+Either (a) resolve the current cluster role at heartbeat-send time and pass it
+through — e.g. `Cluster.Get(Context.System).SelfRoles.Contains("active")` or
+the project's existing role mechanism — and have the central aggregator
+consume `IsActive`; or (b) drop the `IsActive` field from `HeartbeatMessage`
+(additive-only-evolution: deprecate the field, default to `true`, plan
+removal in a major message contract revision).
+
+---
+
+### Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:397-431` |
+
+**Description**
+
+`LoadSiteAddressesFromDb` runs `await repo.GetAllSitesAsync()` inside
+`Task.Run(async () => ...).PipeTo(self)` with no cancellation token (line 404).
+The repository signature accepts `CancellationToken` (the test mock declares
+`GetAllSitesAsync(Arg.Any<CancellationToken>())`), but the actor calls the
+no-arg overload — so a hung MS SQL connection has no upper bound. The
+60-second-periodic refresh keeps firing; each tick spawns a fresh `Task.Run`
+that piles up if the database is consistently slow. The actor itself is
+unaffected (it's not blocked), but pending tasks and DB connection-pool
+resources accumulate, and the `Status.Failure` handler (Communication-006)
+never fires because the task never faults — it just sits.
+
+**Recommendation**
+
+Maintain a per-load `CancellationTokenSource` with a deadline (e.g. the same
+60s the refresh runs on, or a configurable timeout in `CommunicationOptions`).
+Pass its `Token` to `GetAllSitesAsync`. Cancel the prior token before spinning
+a new load to avoid task accumulation.
+
+---
+
+### Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:567` |
+
+**Description**
+
+The Akka.NET convention is that messages crossing actor boundaries (even
+internal Self-messages over an async task boundary) are immutable.
+`SiteAddressCacheLoaded(Dictionary<string, List<string>> SiteContacts)` is a
+record but its `SiteContacts` payload is a mutable `Dictionary` whose values
+are mutable `List<string>`. Constructed inside `Task.Run` and handed off to
+the actor, the cache could in principle be mutated by either side; in
+practice nothing does, but the type is a stale-evidence guarantee that
+CLAUDE.md's "message immutability" rule is being followed only by convention.
+
+**Recommendation**
+
+Change the record signature to use `IReadOnlyDictionary<string, IReadOnlyList<string>>`
+(or `ImmutableDictionary` / `ImmutableArray<string>`) and freeze the data
+before piping. The cost is negligible — the payload is built and consumed
+once per refresh tick.
+
+---
+
+### Communication-021 — `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:188-200` |
+
+**Description**
+
+`SubscribeInstance` performs these statements in order (lines 189-194), all
+*before* the `try` block at line 200:
+1. `Interlocked.Increment(ref _actorCounter)`
+2. `_actorSystem!.ActorOf(Props.Create(typeof(StreamRelayActor), ...))`
+3. `_streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor)`
+
+If step 3 throws (the subscriber is wired but its `Subscribe` faults — a stale
+instance name, a temporary index lookup failure, etc.), the exception escapes
+the method as an unhandled `RpcException` *and* leaks the freshly-created
+`relayActor`. The `finally` block at line 211 is unreachable because the
+throw happens before the `try`. The actor's `_activeStreams` entry, the
+`StreamEntry.Cts`, and the `Channel<SiteStreamEvent>` are also leaked.
+
+In normal operation `_streamSubscriber.Subscribe` does not throw, so the bug is
+latent — but a misbehaving site runtime (e.g. `SiteStreamManager` faulted
+because the actor system is shutting down) would surface it.
+
+**Recommendation**
+
+Restructure to either (a) wrap the `Subscribe` call in a `try` whose `catch`
+stops the relay actor and disposes the CTS, or (b) move the actor + subscriber
+creation *inside* the existing `try` block (the `finally` will then handle
+cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down
+past the `try {` brace.
+
+---
+
+### Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:493` |
+
+**Description**
+
+`TrackMessageForCleanup` on `SubscribeDebugViewRequest` does
+`_debugSubscriptions[sub.CorrelationId] = (envelope.SiteId, Sender)` (line 493).
+The dictionary indexer silently overwrites any prior entry for the same
+`CorrelationId`. If two debug sessions ever reuse the same correlation ID (e.g.
+two Blazor users start a stream at the same moment with a non-GUID id, or a
+caller bug, or a malicious caller as flagged in the cousin
+Communication-014), the first subscriber's entry is overwritten and lost —
+on a later `ConnectionStateChanged(false)` (per Communication-016 it never
+actually fires today, but the design intent stands), only the *second*
+subscriber would be notified of the disconnect.
+
+`DebugStreamService.StartStreamAsync` uses `Guid.NewGuid().ToString("N")` as
+the session id (`DebugStreamService.cs:97`), so a real collision is
+astronomically unlikely in normal operation. But the central side is not
+defending itself: a CLI consumer or a future caller is implicitly trusted to
+generate globally-unique ids.
+
+**Recommendation**
+
+When the slot is already occupied, log a Warning and either reject the new
+subscription with an error response or evict the prior subscriber via
+`DebugStreamTerminated` before installing the new one. Mirrors the
+`SiteStreamGrpcServer` defensive behaviour where a duplicate `correlation_id`
+cancels the existing stream (line 167).