docs: add code review process and baseline review of all 19 modules

Establishes a per-module code review workflow under code-reviews/ and records the 2026-05-16 baseline review (commit 9c60592): 241 findings across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low). This is the clean starting point for remediation work.
2026-05-16 18:09:09 -04:00
parent 9c60592632
commit 977d7369a7
23 changed files with 8899 additions and 0 deletions
--- a/code-reviews/Communication/findings.md
+++ b/code-reviews/Communication/findings.md
@@ -0,0 +1,404 @@
+# Code Review — Communication
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.Communication` |
+| Design doc | `docs/requirements/Component-Communication.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-16 |
+| Reviewer | claude-agent |
+| Commit reviewed | `9c60592` |
+| Open findings | 11 |
+
+## Summary
+
+The Communication module is generally well-structured and matches the design doc's
+two-transport model (ClusterClient for command/control, gRPC server-streaming for
+real-time data). The actors keep mutable state on the actor thread, use `PipeTo` for
+async work, and the gRPC server/client lifecycle is mostly disciplined. However the
+review found one Critical issue (a `TimeoutException` from `DebugStreamService` leaves
+an orphaned bridge actor and an active site-side subscription, leaking resources on
+every snapshot timeout) and several High/Medium issues clustered around two themes:
+**(a) gRPC subscription bookkeeping races** — `SiteStreamGrpcClient` overwrites and
+removes subscription entries by correlation ID without disposal or ownership checks,
+so reconnect cycles leak `CancellationTokenSource`es and can cancel the wrong stream;
+and **(b) missing supervision strategy** on the coordinator actors, contrary to the
+CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is otherwise
+good. Test coverage is broad for happy paths but has gaps around failover, cache
+mutation races, and the snapshot-timeout cleanup path.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling `CleanupGrpc`, subscription-map races. |
+| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; `Sender` captured in async-launched closure path. |
+| 3 | Concurrency & thread safety | ✓ | `SiteStreamGrpcClient._subscriptions` overwrite/remove race; `_siteClients` field reassignment unused but non-readonly. |
+| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; `LoadSiteAddressesFromDb` swallows DB failures silently. |
+| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). |
+| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; `SiteStreamGrpcClientFactory.Dispose` blocks on async. |
+| 7 | Design-document adherence | ✓ | `GrpcMaxStreamLifetime` / keepalive options defined but never applied; hard-coded values used instead. |
+| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. |
+| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. |
+| 10 | Documentation & comments | ✓ | XML comment on `DebugStreamBridgeActor` says "Persistent actor" — it is not an Akka.Persistence actor. |
+
+## Findings
+
+### Communication-001 — Snapshot timeout leaves orphaned bridge actor and site subscription
+
+| | |
+|--|--|
+| Severity | Critical |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/DebugStreamService.cs:139`, `src/ScadaLink.Communication/DebugStreamService.cs:149` |
+
+**Description**
+
+When `StartStreamAsync` times out waiting for the initial snapshot it calls
+`StopStream(sessionId)` and throws. `StopStream` only sends `StopDebugStream` to the
+bridge actor **if the session is still in `_sessions`**. But the bridge actor was added
+to `_sessions` at line 124 and is only removed by `onTerminatedWrapper`. The serious
+case is the race where `onTerminatedWrapper` fires first (e.g. site disconnect arrives
+during the wait): `snapshotTcs.TrySetException` completes the await with an
+`InvalidOperationException` rather than `OperationCanceledException`, which is **not**
+caught by the `catch (OperationCanceledException)` block. The exception propagates
+uncaught, `StopStream` is never reached, and if the bridge actor is instead orphaned
+(snapshot never arrives, site silent, no terminate) the only cleanup is the 5-minute
+`ReceiveTimeout` in the actor — meaning a site-side `StreamRelayActor` and gRPC stream
+can stay alive for up to 5 minutes after the central caller has given up. Combined with
+the 30s timeout, every transient snapshot delay leaks site resources for minutes.
+
+**Recommendation**
+
+In `StartStreamAsync`, wrap the `await` so that *any* failure or cancellation
+deterministically calls `StopStream(sessionId)` (e.g. `try/catch (Exception)` or a
+`finally` that stops the session when the result was not returned). Ensure
+`StopStream` is idempotent and always sends `StopDebugStream` even if the session was
+already removed, so the bridge actor (and its site-side subscription) is torn down
+promptly rather than waiting for the orphan `ReceiveTimeout`.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:170`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:143` |
+
+**Description**
+
+On a gRPC stream error, `HandleGrpcError` increments the retry count, flips
+`_useNodeA`, and schedules `OpenGrpcStream`. `OpenGrpcStream` cancels and disposes
+`_grpcCts` and starts a fresh `SubscribeInstance` call — but it never calls
+`client.Unsubscribe(_correlationId)` on the *old* node's client, and the site-side
+`SiteStreamGrpcServer` keys active streams by `correlation_id` only. Because the new
+subscription goes to the *other* node (`_useNodeA` flipped), the old node's
+`SiteStreamGrpcServer` still has an active stream + `StreamRelayActor` +
+`SiteStreamManager` subscription for that correlation ID. The old node only learns the
+client is gone via TCP RST or keepalive — exactly the failure mode that triggered the
+reconnect (network partition / silent node), so detection may take ~25s or never. Each
+reconnect can therefore leave a zombie relay actor on the failed node. `CleanupGrpc`
+(which *does* call `Unsubscribe`) is only invoked on terminal paths, not between
+reconnect attempts.
+
+**Recommendation**
+
+Before reconnecting in `HandleGrpcError` / at the top of `OpenGrpcStream`, call
+`Unsubscribe(_correlationId)` on the client for the *previous* endpoint (the one that
+just failed) so the local CTS is cancelled and — where the channel is still alive —
+the gRPC cancellation reaches the site and stops the relay actor.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:77`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:106` |
+
+**Description**
+
+`SubscribeAsync` does `_subscriptions[correlationId] = cts;` (line 77),
+unconditionally overwriting any existing entry for that correlation ID without
+cancelling or disposing the previous `CancellationTokenSource`. The `finally` block
+then does `_subscriptions.TryRemove(correlationId, out _)` (line 106) which removes
+the entry **by key only, regardless of which CTS is stored**. Because
+`DebugStreamBridgeActor` reuses the same `_correlationId` across reconnect attempts
+(and `SiteStreamGrpcClientFactory` returns the same `SiteStreamGrpcClient` for a site
+even after a node flip), two `SubscribeAsync` calls can briefly share a correlation
+ID. The first call's `finally` then removes the *second* call's CTS entry, so a later
+`Unsubscribe(correlationId)` finds nothing and the live stream is never cancelled — an
+orphan. Conversely the overwritten CTS is leaked (never disposed).
+
+**Recommendation**
+
+When inserting, cancel+dispose any prior CTS for that correlation ID. In the `finally`,
+remove only if the stored CTS is the one this call created (use the
+`TryRemove(KeyValuePair)` overload, mirroring what `SiteStreamGrpcServer` already does
+with `StreamEntry`). Consider keying subscriptions by a per-call GUID rather than the
+caller-supplied correlation ID.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:42`, `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:22` |
+
+**Description**
+
+CLAUDE.md ("Explicit supervision strategies: Resume for coordinator actors, Stop for
+short-lived execution actors") requires coordinator actors to use an explicit `Resume`
+supervision strategy. `CentralCommunicationActor` and `SiteCommunicationActor` are
+long-lived coordinators (they own the per-site ClusterClient map, debug
+subscriptions, in-progress deployments) but neither overrides `SupervisorStrategy`.
+They fall back to the Akka default (`OneForOneStrategy` with `Restart`). A child fault
+— e.g. a `ClusterClient` child of `CentralCommunicationActor` created by
+`DefaultSiteClientFactory` — would `Restart` under the default strategy, and any
+exception in the coordinator itself would restart it, wiping `_siteClients`,
+`_debugSubscriptions`, and `_inProgressDeployments` silently. The design intent is
+`Resume` so transient child faults do not discard coordinator state.
+
+**Recommendation**
+
+Override `SupervisorStrategy` on both actors to return an explicit
+`OneForOneStrategy` with `Directive.Resume` (or the project's standard coordinator
+strategy), matching the documented decision and other coordinator actors.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:25`, `src/ScadaLink.Communication/CommunicationOptions.cs:36` |
+
+**Description**
+
+`CommunicationOptions` exposes `GrpcKeepAlivePingDelay`, `GrpcKeepAlivePingTimeout`,
+`GrpcMaxStreamLifetime`, and `GrpcMaxConcurrentStreams`, and the design doc's
+"gRPC Connection Keepalive" section explicitly states these are configurable. However
+`SiteStreamGrpcClient`'s constructor hard-codes `KeepAlivePingDelay =
+TimeSpan.FromSeconds(15)` and `KeepAlivePingTimeout = TimeSpan.FromSeconds(10)`
+instead of reading the options. `GrpcMaxStreamLifetime` (the documented "Session
+timeout — 4 hours" third layer of dead-client detection) is not referenced anywhere
+— `SiteStreamGrpcServer.SubscribeInstance` creates a linked CTS from the call
+cancellation token only, with no `CancelAfter`. The 4-hour zombie-stream safety net
+described in the design doc does not exist in code. `GrpcMaxConcurrentStreams` is also
+not wired to the server (`SiteStreamGrpcServer` takes a `maxConcurrentStreams`
+constructor parameter defaulting to 100, but nothing binds the option to it).
+
+**Recommendation**
+
+Flow `CommunicationOptions` into `SiteStreamGrpcClient` and `SiteStreamGrpcServer`
+(via the factory / DI). Apply `GrpcKeepAlivePingDelay` / `GrpcKeepAlivePingTimeout` to
+the `SocketsHttpHandler`, bind `GrpcMaxConcurrentStreams` to the server's limit, and
+implement the `GrpcMaxStreamLifetime` session timeout with `CancelAfter` on the
+server-side stream CTS — or, if the 4-hour cap is intentionally dropped, remove the
+option and update the design doc.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-006 — Site address load failures are silently swallowed, leaving a stale cache
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:204` |
+
+**Description**
+
+`LoadSiteAddressesFromDb` runs the repository query inside `Task.Run(...).PipeTo(self)`.
+If `GetAllSitesAsync` throws (database unavailable, transient connection error), the
+faulted task is piped to `Self` as a `Status.Failure`. `CentralCommunicationActor` has
+no `Receive<Status.Failure>` handler, so the failure becomes an unhandled message
+(logged at debug, not surfaced) and the periodic refresh silently fails. If the
+*first* startup load fails the actor runs with an empty `_siteClients` map — every
+`SiteEnvelope` is dropped (line 187) and every Ask times out with no indication of the
+root cause.
+
+**Recommendation**
+
+Add a `Receive<Status.Failure>` handler that logs the load failure at Warning/Error
+level so operators can distinguish "site has no addresses configured" from "database
+is down". Optionally surface a health metric for repeated load failures.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-007 — `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async)
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53` |
+
+**Description**
+
+`Dispose()` calls `DisposeAsync().AsTask().GetAwaiter().GetResult()`. This is the
+classic sync-over-async pattern: it blocks the calling thread until all per-site
+`SiteStreamGrpcClient.DisposeAsync` calls complete. If `Dispose` is invoked from a
+context with a single-threaded synchronization context or from DI container shutdown
+on a constrained thread pool, this can deadlock or stall host shutdown. The class
+already implements `IAsyncDisposable`.
+
+**Recommendation**
+
+Prefer registering and disposing the factory through `IAsyncDisposable` only (modern
+.NET DI honours it for singletons). If a synchronous `Dispose` must remain, dispose
+the underlying `GrpcChannel`s directly (synchronous) rather than blocking on the async
+path, or document why blocking is safe here.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:71`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:174` |
+
+**Description**
+
+`_retryCount` is reset to 0 every time a single `AttributeValueChanged` or
+`AlarmStateChanged` event is received (lines 72, 77). Combined with `MaxRetries = 3`,
+a stream that connects, delivers exactly one event, then fails — repeatedly — will
+reconnect forever. The design doc states "max 3 retries, terminate the session if all
+retries fail"; the current logic only terminates after 3 *consecutive* failures with
+zero intervening events, so a flapping site never trips the limit and the debug
+session (and its site-side relay) lives on indefinitely. The `ReceiveTimeout` orphan
+net is also reset by every received message, so it does not bound this case either.
+
+**Recommendation**
+
+Either reset `_retryCount` only after the stream has been stably connected for some
+minimum duration (e.g. a timer armed on stream open, cancelled on the next error), or
+keep a separate cumulative reconnect counter / time window that bounds total
+reconnects regardless of intervening events.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-009 — `_siteClients` field is mutable and reassignable; cache update is not atomic on failure
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:53`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:240` |
+
+**Description**
+
+`_siteClients` is a non-`readonly` `Dictionary` field. It is only mutated on the actor
+thread (correct), but the field is needlessly reassignable, and
+`HandleSiteAddressCacheLoaded` mutates it in place across several loops. If
+`ActorPath.Parse` throws on a malformed address mid-loop (e.g. a site row with a
+garbage `NodeAAddress`), the method aborts partway through, having already stopped
+some ClusterClients and added others — leaving the cache partially updated with no
+recovery until the next 60s refresh. The other actor mutable collections
+(`_debugSubscriptions`, `_inProgressDeployments`) are correctly `readonly`.
+
+**Recommendation**
+
+Mark `_siteClients` `readonly`. Validate/parse all addresses up front (or wrap
+`ActorPath.Parse` in a try/catch that logs and skips the bad site) so a single
+malformed site record cannot abort the whole refresh and leave a half-updated cache.
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-010 — `DebugStreamBridgeActor` XML doc incorrectly describes it as a "Persistent actor"
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:10` |
+
+**Description**
+
+The class summary opens with "Persistent actor (one per active debug session)...".
+The actor derives from `ReceiveActor`, not a persistent actor base class, holds no
+`PersistenceId`, and writes no journal/snapshot. "Persistent" is misleading — debug
+sessions are explicitly "session-based and temporary" per the design doc. A reader
+could assume state survives restart, which it does not.
+
+**Recommendation**
+
+Reword the summary to "Long-lived (per active debug session) actor on the central
+side..." or similar, removing the word "Persistent".
+
+**Resolution**
+
+_Unresolved._
+
+### Communication-011 — No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.Communication.Tests/` (module-wide) |
+
+**Description**
+
+The test suite covers happy-path routing, handler-not-registered failures, heartbeat
+bumping, cache refresh, and gRPC bridge reconnect/retry. However several critical
+paths identified in this review have no coverage:
+
+- The `DebugStreamService.StartStreamAsync` snapshot-timeout path (Communication-001)
+  — no test verifies bridge actor / site subscription teardown on timeout, nor the
+  `onTerminated`-before-snapshot race that throws a non-`OperationCanceledException`.
+- `CentralCommunicationActor` behaviour when `LoadSiteAddressesFromDb` faults
+  (Communication-006) — `RefreshSiteAddresses_UpdatesCache` only exercises success.
+- `SiteStreamGrpcClient` subscription-map overwrite/removal race (Communication-003)
+  and gRPC reconnect not unsubscribing the old node (Communication-002).
+- A malformed `NodeAAddress` aborting `HandleSiteAddressCacheLoaded` (Communication-009).
+
+**Recommendation**
+
+Add tests for: snapshot timeout / pre-snapshot termination cleanup; address-load
+failure logging and empty-cache behaviour; reusing a correlation ID across
+`SubscribeAsync` calls; and a malformed site address during cache refresh.
+
+**Resolution**
+
+_Unresolved._