Establishes a per-module code review workflow under code-reviews/ and
records the 2026-05-16 baseline review (commit 9c60592): 241 findings
across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low).
This is the clean starting point for remediation work.
19 KiB
Code Review — Communication
| Field | Value |
|---|---|
| Module | src/ScadaLink.Communication |
| Design doc | docs/requirements/Component-Communication.md |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | 9c60592 |
| Open findings | 11 |
Summary
The Communication module is generally well-structured and matches the design doc's
two-transport model (ClusterClient for command/control, gRPC server-streaming for
real-time data). The actors keep mutable state on the actor thread, use PipeTo for
async work, and the gRPC server/client lifecycle is mostly disciplined. However the
review found one Critical issue (a TimeoutException from DebugStreamService leaves
an orphaned bridge actor and an active site-side subscription, leaking resources on
every snapshot timeout) and several High/Medium issues clustered around two themes:
(a) gRPC subscription bookkeeping races — SiteStreamGrpcClient overwrites and
removes subscription entries by correlation ID without disposal or ownership checks,
so reconnect cycles leak CancellationTokenSourcees and can cancel the wrong stream;
and (b) missing supervision strategy on the coordinator actors, contrary to the
CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is otherwise
good. Test coverage is broad for happy paths but has gaps around failover, cache
mutation races, and the snapshot-timeout cleanup path.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling CleanupGrpc, subscription-map races. |
| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; Sender captured in async-launched closure path. |
| 3 | Concurrency & thread safety | ✓ | SiteStreamGrpcClient._subscriptions overwrite/remove race; _siteClients field reassignment unused but non-readonly. |
| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; LoadSiteAddressesFromDb swallows DB failures silently. |
| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). |
| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; SiteStreamGrpcClientFactory.Dispose blocks on async. |
| 7 | Design-document adherence | ✓ | GrpcMaxStreamLifetime / keepalive options defined but never applied; hard-coded values used instead. |
| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. |
| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. |
| 10 | Documentation & comments | ✓ | XML comment on DebugStreamBridgeActor says "Persistent actor" — it is not an Akka.Persistence actor. |
Findings
Communication-001 — Snapshot timeout leaves orphaned bridge actor and site subscription
| Severity | Critical |
| Category | Performance & resource management |
| Status | Open |
| Location | src/ScadaLink.Communication/DebugStreamService.cs:139, src/ScadaLink.Communication/DebugStreamService.cs:149 |
Description
When StartStreamAsync times out waiting for the initial snapshot it calls
StopStream(sessionId) and throws. StopStream only sends StopDebugStream to the
bridge actor if the session is still in _sessions. But the bridge actor was added
to _sessions at line 124 and is only removed by onTerminatedWrapper. The serious
case is the race where onTerminatedWrapper fires first (e.g. site disconnect arrives
during the wait): snapshotTcs.TrySetException completes the await with an
InvalidOperationException rather than OperationCanceledException, which is not
caught by the catch (OperationCanceledException) block. The exception propagates
uncaught, StopStream is never reached, and if the bridge actor is instead orphaned
(snapshot never arrives, site silent, no terminate) the only cleanup is the 5-minute
ReceiveTimeout in the actor — meaning a site-side StreamRelayActor and gRPC stream
can stay alive for up to 5 minutes after the central caller has given up. Combined with
the 30s timeout, every transient snapshot delay leaks site resources for minutes.
Recommendation
In StartStreamAsync, wrap the await so that any failure or cancellation
deterministically calls StopStream(sessionId) (e.g. try/catch (Exception) or a
finally that stops the session when the result was not returned). Ensure
StopStream is idempotent and always sends StopDebugStream even if the session was
already removed, so the bridge actor (and its site-side subscription) is torn down
promptly rather than waiting for the orphan ReceiveTimeout.
Resolution
Unresolved.
Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:170, src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:143 |
Description
On a gRPC stream error, HandleGrpcError increments the retry count, flips
_useNodeA, and schedules OpenGrpcStream. OpenGrpcStream cancels and disposes
_grpcCts and starts a fresh SubscribeInstance call — but it never calls
client.Unsubscribe(_correlationId) on the old node's client, and the site-side
SiteStreamGrpcServer keys active streams by correlation_id only. Because the new
subscription goes to the other node (_useNodeA flipped), the old node's
SiteStreamGrpcServer still has an active stream + StreamRelayActor +
SiteStreamManager subscription for that correlation ID. The old node only learns the
client is gone via TCP RST or keepalive — exactly the failure mode that triggered the
reconnect (network partition / silent node), so detection may take ~25s or never. Each
reconnect can therefore leave a zombie relay actor on the failed node. CleanupGrpc
(which does call Unsubscribe) is only invoked on terminal paths, not between
reconnect attempts.
Recommendation
Before reconnecting in HandleGrpcError / at the top of OpenGrpcStream, call
Unsubscribe(_correlationId) on the client for the previous endpoint (the one that
just failed) so the local CTS is cancelled and — where the channel is still alive —
the gRPC cancellation reaches the site and stops the relay actor.
Resolution
Unresolved.
Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:77, src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:106 |
Description
SubscribeAsync does _subscriptions[correlationId] = cts; (line 77),
unconditionally overwriting any existing entry for that correlation ID without
cancelling or disposing the previous CancellationTokenSource. The finally block
then does _subscriptions.TryRemove(correlationId, out _) (line 106) which removes
the entry by key only, regardless of which CTS is stored. Because
DebugStreamBridgeActor reuses the same _correlationId across reconnect attempts
(and SiteStreamGrpcClientFactory returns the same SiteStreamGrpcClient for a site
even after a node flip), two SubscribeAsync calls can briefly share a correlation
ID. The first call's finally then removes the second call's CTS entry, so a later
Unsubscribe(correlationId) finds nothing and the live stream is never cancelled — an
orphan. Conversely the overwritten CTS is leaked (never disposed).
Recommendation
When inserting, cancel+dispose any prior CTS for that correlation ID. In the finally,
remove only if the stored CTS is the one this call created (use the
TryRemove(KeyValuePair) overload, mirroring what SiteStreamGrpcServer already does
with StreamEntry). Consider keying subscriptions by a per-call GUID rather than the
caller-supplied correlation ID.
Resolution
Unresolved.
Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:42, src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:22 |
Description
CLAUDE.md ("Explicit supervision strategies: Resume for coordinator actors, Stop for
short-lived execution actors") requires coordinator actors to use an explicit Resume
supervision strategy. CentralCommunicationActor and SiteCommunicationActor are
long-lived coordinators (they own the per-site ClusterClient map, debug
subscriptions, in-progress deployments) but neither overrides SupervisorStrategy.
They fall back to the Akka default (OneForOneStrategy with Restart). A child fault
— e.g. a ClusterClient child of CentralCommunicationActor created by
DefaultSiteClientFactory — would Restart under the default strategy, and any
exception in the coordinator itself would restart it, wiping _siteClients,
_debugSubscriptions, and _inProgressDeployments silently. The design intent is
Resume so transient child faults do not discard coordinator state.
Recommendation
Override SupervisorStrategy on both actors to return an explicit
OneForOneStrategy with Directive.Resume (or the project's standard coordinator
strategy), matching the documented decision and other coordinator actors.
Resolution
Unresolved.
Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:25, src/ScadaLink.Communication/CommunicationOptions.cs:36 |
Description
CommunicationOptions exposes GrpcKeepAlivePingDelay, GrpcKeepAlivePingTimeout,
GrpcMaxStreamLifetime, and GrpcMaxConcurrentStreams, and the design doc's
"gRPC Connection Keepalive" section explicitly states these are configurable. However
SiteStreamGrpcClient's constructor hard-codes KeepAlivePingDelay = TimeSpan.FromSeconds(15) and KeepAlivePingTimeout = TimeSpan.FromSeconds(10)
instead of reading the options. GrpcMaxStreamLifetime (the documented "Session
timeout — 4 hours" third layer of dead-client detection) is not referenced anywhere
— SiteStreamGrpcServer.SubscribeInstance creates a linked CTS from the call
cancellation token only, with no CancelAfter. The 4-hour zombie-stream safety net
described in the design doc does not exist in code. GrpcMaxConcurrentStreams is also
not wired to the server (SiteStreamGrpcServer takes a maxConcurrentStreams
constructor parameter defaulting to 100, but nothing binds the option to it).
Recommendation
Flow CommunicationOptions into SiteStreamGrpcClient and SiteStreamGrpcServer
(via the factory / DI). Apply GrpcKeepAlivePingDelay / GrpcKeepAlivePingTimeout to
the SocketsHttpHandler, bind GrpcMaxConcurrentStreams to the server's limit, and
implement the GrpcMaxStreamLifetime session timeout with CancelAfter on the
server-side stream CTS — or, if the 4-hour cap is intentionally dropped, remove the
option and update the design doc.
Resolution
Unresolved.
Communication-006 — Site address load failures are silently swallowed, leaving a stale cache
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:204 |
Description
LoadSiteAddressesFromDb runs the repository query inside Task.Run(...).PipeTo(self).
If GetAllSitesAsync throws (database unavailable, transient connection error), the
faulted task is piped to Self as a Status.Failure. CentralCommunicationActor has
no Receive<Status.Failure> handler, so the failure becomes an unhandled message
(logged at debug, not surfaced) and the periodic refresh silently fails. If the
first startup load fails the actor runs with an empty _siteClients map — every
SiteEnvelope is dropped (line 187) and every Ask times out with no indication of the
root cause.
Recommendation
Add a Receive<Status.Failure> handler that logs the load failure at Warning/Error
level so operators can distinguish "site has no addresses configured" from "database
is down". Optionally surface a health metric for repeated load failures.
Resolution
Unresolved.
Communication-007 — SiteStreamGrpcClientFactory.Dispose blocks on async work (sync-over-async)
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53 |
Description
Dispose() calls DisposeAsync().AsTask().GetAwaiter().GetResult(). This is the
classic sync-over-async pattern: it blocks the calling thread until all per-site
SiteStreamGrpcClient.DisposeAsync calls complete. If Dispose is invoked from a
context with a single-threaded synchronization context or from DI container shutdown
on a constrained thread pool, this can deadlock or stall host shutdown. The class
already implements IAsyncDisposable.
Recommendation
Prefer registering and disposing the factory through IAsyncDisposable only (modern
.NET DI honours it for singletons). If a synchronous Dispose must remain, dispose
the underlying GrpcChannels directly (synchronous) rather than blocking on the async
path, or document why blocking is safe here.
Resolution
Unresolved.
Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:71, src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:174 |
Description
_retryCount is reset to 0 every time a single AttributeValueChanged or
AlarmStateChanged event is received (lines 72, 77). Combined with MaxRetries = 3,
a stream that connects, delivers exactly one event, then fails — repeatedly — will
reconnect forever. The design doc states "max 3 retries, terminate the session if all
retries fail"; the current logic only terminates after 3 consecutive failures with
zero intervening events, so a flapping site never trips the limit and the debug
session (and its site-side relay) lives on indefinitely. The ReceiveTimeout orphan
net is also reset by every received message, so it does not bound this case either.
Recommendation
Either reset _retryCount only after the stream has been stably connected for some
minimum duration (e.g. a timer armed on stream open, cancelled on the next error), or
keep a separate cumulative reconnect counter / time window that bounds total
reconnects regardless of intervening events.
Resolution
Unresolved.
Communication-009 — _siteClients field is mutable and reassignable; cache update is not atomic on failure
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:53, src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:240 |
Description
_siteClients is a non-readonly Dictionary field. It is only mutated on the actor
thread (correct), but the field is needlessly reassignable, and
HandleSiteAddressCacheLoaded mutates it in place across several loops. If
ActorPath.Parse throws on a malformed address mid-loop (e.g. a site row with a
garbage NodeAAddress), the method aborts partway through, having already stopped
some ClusterClients and added others — leaving the cache partially updated with no
recovery until the next 60s refresh. The other actor mutable collections
(_debugSubscriptions, _inProgressDeployments) are correctly readonly.
Recommendation
Mark _siteClients readonly. Validate/parse all addresses up front (or wrap
ActorPath.Parse in a try/catch that logs and skips the bad site) so a single
malformed site record cannot abort the whole refresh and leave a half-updated cache.
Resolution
Unresolved.
Communication-010 — DebugStreamBridgeActor XML doc incorrectly describes it as a "Persistent actor"
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:10 |
Description
The class summary opens with "Persistent actor (one per active debug session)...".
The actor derives from ReceiveActor, not a persistent actor base class, holds no
PersistenceId, and writes no journal/snapshot. "Persistent" is misleading — debug
sessions are explicitly "session-based and temporary" per the design doc. A reader
could assume state survives restart, which it does not.
Recommendation
Reword the summary to "Long-lived (per active debug session) actor on the central side..." or similar, removing the word "Persistent".
Resolution
Unresolved.
Communication-011 — No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | tests/ScadaLink.Communication.Tests/ (module-wide) |
Description
The test suite covers happy-path routing, handler-not-registered failures, heartbeat bumping, cache refresh, and gRPC bridge reconnect/retry. However several critical paths identified in this review have no coverage:
- The
DebugStreamService.StartStreamAsyncsnapshot-timeout path (Communication-001) — no test verifies bridge actor / site subscription teardown on timeout, nor theonTerminated-before-snapshot race that throws a non-OperationCanceledException. CentralCommunicationActorbehaviour whenLoadSiteAddressesFromDbfaults (Communication-006) —RefreshSiteAddresses_UpdatesCacheonly exercises success.SiteStreamGrpcClientsubscription-map overwrite/removal race (Communication-003) and gRPC reconnect not unsubscribing the old node (Communication-002).- A malformed
NodeAAddressabortingHandleSiteAddressCacheLoaded(Communication-009).
Recommendation
Add tests for: snapshot timeout / pre-snapshot termination cleanup; address-load
failure logging and empty-cache behaviour; reusing a correlation ID across
SubscribeAsync calls; and a malformed site address during cache refresh.
Resolution
Unresolved.