Files
scadalink-design/code-reviews/Communication/findings.md

40 KiB

Code Review — Communication

Field Value
Module src/ScadaLink.Communication
Design doc docs/requirements/Component-Communication.md
Status Reviewed
Last reviewed 2026-05-17
Reviewer claude-agent
Commit reviewed 39d737e
Open findings 0

Summary

The Communication module is generally well-structured and matches the design doc's two-transport model (ClusterClient for command/control, gRPC server-streaming for real-time data). The actors keep mutable state on the actor thread, use PipeTo for async work, and the gRPC server/client lifecycle is mostly disciplined. However the review found several High and Medium issues clustered around two themes: (a) gRPC subscription bookkeeping racesSiteStreamGrpcClient overwrites and removes subscription entries by correlation ID without disposal or ownership checks, so reconnect cycles leak CancellationTokenSourcees and can cancel the wrong stream; and (b) missing supervision strategy on the coordinator actors, contrary to the CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is otherwise good. Test coverage is broad for happy paths but has gaps around failover, cache mutation races, and the snapshot-timeout cleanup path.

Re-review 2026-05-17 (commit 39d737e)

All prior findings (Communication-001..011) are confirmed Resolved in this commit and the fixes hold up against the source. The re-review walked all 10 checklist categories again and uncovered a previously-missed defect at the centre of the gRPC node-failover path: SiteStreamGrpcClientFactory.GetOrCreate caches one client per site identifier and silently ignores the grpcEndpoint argument on a cache hit. The DebugStreamBridgeActor reconnect logic flips _useNodeA and passes the other node's endpoint, but the factory hands back the original NodeA-bound client every time — so the documented "try the other site node endpoint" failover never actually moves to NodeB (Communication-012). The same caching defect means a site address change is never picked up because RemoveSiteAsync has no production caller (Communication-013). Two Low findings round out the re-review: an untrusted gRPC-supplied correlation_id flows straight into an Akka actor name (Communication-014), and the factory's endpoint-reuse defect is masked by the test mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved.
2 Akka.NET conventions Coordinator Resume strategies now present and verified. No new issues.
3 Concurrency & thread safety Subscription-map register/remove now ownership-checked. _siteClients readonly. No new issues.
4 Error handling & resilience Status.Failure handler added; reconnect unsubscribes prior stream. No new issues.
5 Security Re-review: public gRPC correlation_id flows unvalidated into an Akka actor name (Communication-014).
6 Performance & resource management Synchronous Dispose paths fixed; CTS leaks resolved. No new issues.
7 Design-document adherence Re-review: site gRPC address-change disposal not wired — RemoveSiteAsync is dead code (Communication-013). gRPC options now applied.
8 Code organization & conventions Options pattern correct; public records still declared in actor files (acceptable). No structural issues.
9 Testing coverage Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015).
10 Documentation & comments DebugStreamBridgeActor summary corrected. No new issues.

Findings

Communication-001 — Early stream termination escapes StartStreamAsync's narrow exception handling

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.Communication/DebugStreamService.cs:130-143

Re-triaged 2026-05-16: originally filed Critical, claiming an orphaned bridge actor and a multi-minute site-side resource leak on every snapshot timeout. On verification that impact does not occur: DebugStreamBridgeActor calls CleanupGrpc() and Context.Stop(Self) on every path that invokes onTerminated (site disconnect, gRPC max-retries, ReceiveTimeout), so it always self-terminates and releases its gRPC subscription; and the pure-timeout path does reach StopStream, which also stops it. The genuine defect described below is an error-handling gap, not a leak — severity corrected to Medium.

Description

StartStreamAsync awaits the initial snapshot inside a try whose only handler is catch (OperationCanceledException). When the stream terminates before the snapshot arrives, onTerminatedWrapper completes the await via snapshotTcs.TrySetException(new InvalidOperationException(...)). That InvalidOperationException is not an OperationCanceledException, so it escapes the catch entirely: the caller (Blazor debug view / SignalR hub) receives a raw, untranslated exception, and StartStreamAsync performs no teardown of its own on that path — it relies implicitly on the bridge actor self-terminating. Cleanup from the service side is therefore not deterministic, and the failure surfaced to the caller is not a meaningful, documented result.

Recommendation

In StartStreamAsync, catch any exception from the snapshot await, deterministically tear down the bridge actor (Tell(StopDebugStream) via the local actor reference, since a racing onTerminatedWrapper may already have removed the session entry), and translate the failure into a meaningful exception for the caller.

Resolution

Resolved 2026-05-16. The catch (OperationCanceledException)-only block in StartStreamAsync was replaced with catch (Exception): it removes the session entry, sends StopDebugStream to the bridge actor via the local reference (idempotent — the actor may already be stopping itself), and throws a descriptive exception — TimeoutException for the 30s timeout, otherwise an InvalidOperationException that names the instance/site and wraps the underlying cause. Regression test DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException fails against the pre-fix code and passes after. Fixed by the commit whose message references Communication-001.

Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors

Severity High
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:170, src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:143

Description

On a gRPC stream error, HandleGrpcError increments the retry count, flips _useNodeA, and schedules OpenGrpcStream. OpenGrpcStream cancels and disposes _grpcCts and starts a fresh SubscribeInstance call — but it never calls client.Unsubscribe(_correlationId) on the old node's client, and the site-side SiteStreamGrpcServer keys active streams by correlation_id only. Because the new subscription goes to the other node (_useNodeA flipped), the old node's SiteStreamGrpcServer still has an active stream + StreamRelayActor + SiteStreamManager subscription for that correlation ID. The old node only learns the client is gone via TCP RST or keepalive — exactly the failure mode that triggered the reconnect (network partition / silent node), so detection may take ~25s or never. Each reconnect can therefore leave a zombie relay actor on the failed node. CleanupGrpc (which does call Unsubscribe) is only invoked on terminal paths, not between reconnect attempts.

Recommendation

Before reconnecting in HandleGrpcError / at the top of OpenGrpcStream, call Unsubscribe(_correlationId) on the client for the previous endpoint (the one that just failed) so the local CTS is cancelled and — where the channel is still alive — the gRPC cancellation reaches the site and stops the relay actor.

Resolution

Resolved 2026-05-16 (commit <pending>). Root cause confirmed against source: HandleGrpcError flipped _useNodeA and scheduled OpenGrpcStream without ever unsubscribing the failed stream, leaving the old node's StreamRelayActor zombie until TCP/keepalive timeout. Fix: HandleGrpcError now resolves the client for the previous endpoint (before flipping _useNodeA) and calls Unsubscribe(_correlationId) on it, so the local CTS is cancelled and gRPC cancellation reaches the still-alive site. Regression test DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect fails against the pre-fix code and passes after.

Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream

Severity High
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:77, src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:106

Description

SubscribeAsync does _subscriptions[correlationId] = cts; (line 77), unconditionally overwriting any existing entry for that correlation ID without cancelling or disposing the previous CancellationTokenSource. The finally block then does _subscriptions.TryRemove(correlationId, out _) (line 106) which removes the entry by key only, regardless of which CTS is stored. Because DebugStreamBridgeActor reuses the same _correlationId across reconnect attempts (and SiteStreamGrpcClientFactory returns the same SiteStreamGrpcClient for a site even after a node flip), two SubscribeAsync calls can briefly share a correlation ID. The first call's finally then removes the second call's CTS entry, so a later Unsubscribe(correlationId) finds nothing and the live stream is never cancelled — an orphan. Conversely the overwritten CTS is leaked (never disposed).

Recommendation

When inserting, cancel+dispose any prior CTS for that correlation ID. In the finally, remove only if the stored CTS is the one this call created (use the TryRemove(KeyValuePair) overload, mirroring what SiteStreamGrpcServer already does with StreamEntry). Consider keying subscriptions by a per-call GUID rather than the caller-supplied correlation ID.

Resolution

Resolved 2026-05-16 (commit <pending>). Root cause confirmed against source: the inline _subscriptions[correlationId] = cts overwrote a prior CTS without cancel/dispose (leak), and the finally's TryRemove(correlationId, out _) removed by key only — a racing reconnect's live CTS could be removed by the prior call's finally, orphaning the live stream. Fix: extracted two internal helpers used by SubscribeAsyncRegisterSubscription cancels+disposes any existing CTS for the correlation ID before inserting, and RemoveSubscription uses the ConcurrentDictionary.TryRemove(KeyValuePair) overload so it removes only the CTS that call created (mirroring SiteStreamGrpcServer's StreamEntry pattern). Regression tests SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts and SiteStreamGrpcClientTests.RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement fail against the pre-fix logic and pass after.

Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)

Severity Medium
Category Akka.NET conventions
Status Resolved
Location src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:42, src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:22

Description

CLAUDE.md ("Explicit supervision strategies: Resume for coordinator actors, Stop for short-lived execution actors") requires coordinator actors to use an explicit Resume supervision strategy. CentralCommunicationActor and SiteCommunicationActor are long-lived coordinators (they own the per-site ClusterClient map, debug subscriptions, in-progress deployments) but neither overrides SupervisorStrategy. They fall back to the Akka default (OneForOneStrategy with Restart). A child fault — e.g. a ClusterClient child of CentralCommunicationActor created by DefaultSiteClientFactory — would Restart under the default strategy, and any exception in the coordinator itself would restart it, wiping _siteClients, _debugSubscriptions, and _inProgressDeployments silently. The design intent is Resume so transient child faults do not discard coordinator state.

Recommendation

Override SupervisorStrategy on both actors to return an explicit OneForOneStrategy with Directive.Resume (or the project's standard coordinator strategy), matching the documented decision and other coordinator actors.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: neither CentralCommunicationActor nor SiteCommunicationActor overrode SupervisorStrategy, so child faults fell back to the Akka default (Restart). Note that an actor's own SupervisorStrategy governs its children — a transient child fault would Restart the child and discard its in-memory state, contrary to the CLAUDE.md "Resume for coordinator actors" decision. Fix: both actors now override SupervisorStrategy() to return a OneForOneStrategy with an unbounded Decider resolving to Directive.Resume (mirroring DataConnectionManagerActor). Regression tests CoordinatorSupervisionTests.CentralCommunicationActor_SupervisorStrategy_IsResume and CoordinatorSupervisionTests.SiteCommunicationActor_SupervisorStrategy_IsResume fail against the pre-fix code (decider yields Restart) and pass after.

Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied

Severity Medium
Category Design-document adherence
Status Resolved
Location src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:25, src/ScadaLink.Communication/CommunicationOptions.cs:36

Description

CommunicationOptions exposes GrpcKeepAlivePingDelay, GrpcKeepAlivePingTimeout, GrpcMaxStreamLifetime, and GrpcMaxConcurrentStreams, and the design doc's "gRPC Connection Keepalive" section explicitly states these are configurable. However SiteStreamGrpcClient's constructor hard-codes KeepAlivePingDelay = TimeSpan.FromSeconds(15) and KeepAlivePingTimeout = TimeSpan.FromSeconds(10) instead of reading the options. GrpcMaxStreamLifetime (the documented "Session timeout — 4 hours" third layer of dead-client detection) is not referenced anywhere — SiteStreamGrpcServer.SubscribeInstance creates a linked CTS from the call cancellation token only, with no CancelAfter. The 4-hour zombie-stream safety net described in the design doc does not exist in code. GrpcMaxConcurrentStreams is also not wired to the server (SiteStreamGrpcServer takes a maxConcurrentStreams constructor parameter defaulting to 100, but nothing binds the option to it).

Recommendation

Flow CommunicationOptions into SiteStreamGrpcClient and SiteStreamGrpcServer (via the factory / DI). Apply GrpcKeepAlivePingDelay / GrpcKeepAlivePingTimeout to the SocketsHttpHandler, bind GrpcMaxConcurrentStreams to the server's limit, and implement the GrpcMaxStreamLifetime session timeout with CancelAfter on the server-side stream CTS — or, if the 4-hour cap is intentionally dropped, remove the option and update the design doc.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: SiteStreamGrpcClient hard-coded the keepalive values, GrpcMaxStreamLifetime was referenced nowhere, and GrpcMaxConcurrentStreams was never bound to the server. Fix (scoped to src/ScadaLink.Communication): SiteStreamGrpcClient gained a constructor taking CommunicationOptions and now applies GrpcKeepAlivePingDelay/GrpcKeepAlivePingTimeout to its SocketsHttpHandler; SiteStreamGrpcClientFactory gained an IOptions<CommunicationOptions> DI constructor and flows the options into every client it creates; SiteStreamGrpcServer gained an IOptions<CommunicationOptions> DI constructor that binds GrpcMaxConcurrentStreams and implements the documented 4-hour session timeout via CancellationTokenSource.CancelAfter(GrpcMaxStreamLifetime) on the per-stream CTS. The Host's existing AddSingleton<SiteStreamGrpcServer>() registration resolves the new DI constructor via greedy resolution — no Host change required. Regression tests GrpcOptionsWiringTests.SiteStreamGrpcClient_AppliesKeepAliveFromOptions, GrpcOptionsWiringTests.SiteStreamGrpcClientFactory_FlowsOptionsToCreatedClients, and GrpcOptionsWiringTests.SiteStreamGrpcServer_BindsMaxConcurrentStreamsAndLifetimeFromOptions exercise the wiring (they require the new members to even compile).

Communication-006 — Site address load failures are silently swallowed, leaving a stale cache

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:204

Description

LoadSiteAddressesFromDb runs the repository query inside Task.Run(...).PipeTo(self). If GetAllSitesAsync throws (database unavailable, transient connection error), the faulted task is piped to Self as a Status.Failure. CentralCommunicationActor has no Receive<Status.Failure> handler, so the failure becomes an unhandled message (logged at debug, not surfaced) and the periodic refresh silently fails. If the first startup load fails the actor runs with an empty _siteClients map — every SiteEnvelope is dropped (line 187) and every Ask times out with no indication of the root cause.

Recommendation

Add a Receive<Status.Failure> handler that logs the load failure at Warning/Error level so operators can distinguish "site has no addresses configured" from "database is down". Optionally surface a health metric for repeated load failures.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: a faulted LoadSiteAddressesFromDb task is piped to Self as a Status.Failure, but the actor had no handler for it — the failure became an unhandled message (debug-level only) and the periodic refresh failed silently. Fix: added a Receive<Status.Failure> handler that logs the load failure at Warning with the underlying exception as the cause, so operators can distinguish a missing-addresses configuration from a database outage. Regression test CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed (repository query throws) asserts the Warning is emitted — it produces no warning against the pre-fix code and passes after.

Communication-007 — SiteStreamGrpcClientFactory.Dispose blocks on async work (sync-over-async)

Severity Medium
Category Performance & resource management
Status Resolved
Location src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53

Description

Dispose() calls DisposeAsync().AsTask().GetAwaiter().GetResult(). This is the classic sync-over-async pattern: it blocks the calling thread until all per-site SiteStreamGrpcClient.DisposeAsync calls complete. If Dispose is invoked from a context with a single-threaded synchronization context or from DI container shutdown on a constrained thread pool, this can deadlock or stall host shutdown. The class already implements IAsyncDisposable.

Recommendation

Prefer registering and disposing the factory through IAsyncDisposable only (modern .NET DI honours it for singletons). If a synchronous Dispose must remain, dispose the underlying GrpcChannels directly (synchronous) rather than blocking on the async path, or document why blocking is safe here.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: Dispose() called DisposeAsync().AsTask().GetAwaiter().GetResult(), the classic sync-over-async pattern. Fix: SiteStreamGrpcClient now also implements IDisposable with a synchronous Dispose() that releases its CancellationTokenSources and underlying GrpcChannel directly (all of that teardown is inherently synchronous); SiteStreamGrpcClientFactory.Dispose() now disposes each cached client via that synchronous path with no blocking on the async path. A CreateClient seam was extracted so the test can substitute a tracking client while still exercising the factory's real caching/disposal machinery. Regression test SiteStreamGrpcClientFactoryDisposeTests.Dispose_DisposesClientsSynchronously_NotViaAsyncPath fails against the pre-fix code (clients disposed via DisposeAsync) and passes after; Dispose_DoesNotDeadlock_UnderSingleThreadedSynchronizationContext guards the stall path.

Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:71, src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:174

Description

_retryCount is reset to 0 every time a single AttributeValueChanged or AlarmStateChanged event is received (lines 72, 77). Combined with MaxRetries = 3, a stream that connects, delivers exactly one event, then fails — repeatedly — will reconnect forever. The design doc states "max 3 retries, terminate the session if all retries fail"; the current logic only terminates after 3 consecutive failures with zero intervening events, so a flapping site never trips the limit and the debug session (and its site-side relay) lives on indefinitely. The ReceiveTimeout orphan net is also reset by every received message, so it does not bound this case either.

Recommendation

Either reset _retryCount only after the stream has been stably connected for some minimum duration (e.g. a timer armed on stream open, cancelled on the next error), or keep a separate cumulative reconnect counter / time window that bounds total reconnects regardless of intervening events.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: _retryCount was reset to 0 on every received AttributeValueChanged/AlarmStateChanged, so a stream that connected, delivered one event, then failed — repeatedly — never tripped MaxRetries. Fix (recommendation option a): the per-event reset was removed; instead OpenGrpcStream arms a single StabilityWindow timer (60s default, internal-settable for tests), and only when it fires (GrpcStreamStable) — i.e. the stream stayed up long enough to be considered recovered — is _retryCount reset. HandleGrpcError cancels that timer, so a stream that fails before the window elapses does not recover its retry budget. A flapping stream therefore terminates after MaxRetries regardless of intervening events. Regression test DebugStreamBridgeActorTests.FlappingStream_DeliveringEventsBetweenFailures_StillTerminatesAfterMaxRetries fails against the pre-fix code (actor never terminates) and passes after; RetryCount_RecoveredOnlyAfterStreamStaysStableForStabilityWindow verifies the budget is recovered after a stable interval. The pre-existing test that codified the buggy per-event reset (Grpc_Error_Resets_RetryCount_On_Successful_Event) was replaced.

Communication-009 — _siteClients field is mutable and reassignable; cache update is not atomic on failure

Severity Low
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:53, src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:240

Description

_siteClients is a non-readonly Dictionary field. It is only mutated on the actor thread (correct), but the field is needlessly reassignable, and HandleSiteAddressCacheLoaded mutates it in place across several loops. If ActorPath.Parse throws on a malformed address mid-loop (e.g. a site row with a garbage NodeAAddress), the method aborts partway through, having already stopped some ClusterClients and added others — leaving the cache partially updated with no recovery until the next 60s refresh. The other actor mutable collections (_debugSubscriptions, _inProgressDeployments) are correctly readonly.

Recommendation

Mark _siteClients readonly. Validate/parse all addresses up front (or wrap ActorPath.Parse in a try/catch that logs and skips the bad site) so a single malformed site record cannot abort the whole refresh and leave a half-updated cache.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: _siteClients was a non-readonly field, and HandleSiteAddressCacheLoaded's add/update loop called ActorPath.Parse per site with no guard — a malformed NodeAAddress threw mid-loop and aborted the refresh, leaving the cache half-updated until the next 60s cycle. Fix: _siteClients is now readonly, and the per-site ActorPath.Parse is wrapped in a try/catch that logs the bad address at Warning and continues to the next site, so a single garbage row cannot starve other sites of their ClusterClient. Regression test CentralCommunicationActorTests.MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered (bad site ordered before a good one) fails against the pre-fix code (good site never registered) and passes after.

Communication-010 — DebugStreamBridgeActor XML doc incorrectly describes it as a "Persistent actor"

Severity Low
Category Documentation & comments
Status Resolved
Location src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:10

Description

The class summary opens with "Persistent actor (one per active debug session)...". The actor derives from ReceiveActor, not a persistent actor base class, holds no PersistenceId, and writes no journal/snapshot. "Persistent" is misleading — debug sessions are explicitly "session-based and temporary" per the design doc. A reader could assume state survives restart, which it does not.

Recommendation

Reword the summary to "Long-lived (per active debug session) actor on the central side..." or similar, removing the word "Persistent".

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: the class summary opened with "Persistent actor (one per active debug session)..." but the actor derives from ReceiveActor, holds no PersistenceId, and writes no journal/snapshot. Fix (documentation only — no behaviour change, so no regression test): the summary was reworded to "Long-lived (one per active debug session) actor on the central side. Debug sessions are session-based and temporary — this actor holds no persisted state and does not derive from an Akka.Persistence base class; its state does not survive a restart."

Communication-011 — No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak

Severity Low
Category Testing coverage
Status Resolved
Location tests/ScadaLink.Communication.Tests/ (module-wide)

Description

The test suite covers happy-path routing, handler-not-registered failures, heartbeat bumping, cache refresh, and gRPC bridge reconnect/retry. However several critical paths identified in this review have no coverage:

  • The DebugStreamService.StartStreamAsync snapshot-timeout path (Communication-001) — no test verifies bridge actor / site subscription teardown on timeout, nor the onTerminated-before-snapshot race that throws a non-OperationCanceledException.
  • CentralCommunicationActor behaviour when LoadSiteAddressesFromDb faults (Communication-006) — RefreshSiteAddresses_UpdatesCache only exercises success.
  • SiteStreamGrpcClient subscription-map overwrite/removal race (Communication-003) and gRPC reconnect not unsubscribing the old node (Communication-002).
  • A malformed NodeAAddress aborting HandleSiteAddressCacheLoaded (Communication-009).

Recommendation

Add tests for: snapshot timeout / pre-snapshot termination cleanup; address-load failure logging and empty-cache behaviour; reusing a correlation ID across SubscribeAsync calls; and a malformed site address during cache refresh.

Resolution

Resolved 2026-05-16 (commit pending). This is a meta-coverage finding; every gap it enumerates is now covered by a regression test (each fails against its pre-fix code and passes after):

  • Snapshot timeout / pre-snapshot termination (Communication-001) — DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException.
  • gRPC reconnect not unsubscribing the old node (Communication-002) — DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect.
  • SiteStreamGrpcClient subscription-map overwrite/removal race (Communication-003) — SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts and RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement.
  • LoadSiteAddressesFromDb fault (Communication-006) — CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed.
  • Malformed NodeAAddress aborting HandleSiteAddressCacheLoaded (Communication-009) — CentralCommunicationActorTests.MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered (added with this finding's resolution). The full module suite (dotnet test tests/ScadaLink.Communication.Tests) is green at 111 passing tests.

Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover

Severity High
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39, src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166

Description

SiteStreamGrpcClientFactory.GetOrCreate is _clients.GetOrAdd(siteIdentifier, _ => CreateClient(grpcEndpoint)) — it keys the cache by site identifier only and the grpcEndpoint argument is used exclusively for the first-ever creation. Every subsequent call for that site returns the originally-cached SiteStreamGrpcClient, which is permanently bound to the GrpcChannel of whatever endpoint was passed first.

DebugStreamBridgeActor relies on the opposite behaviour. On a gRPC stream error, HandleGrpcError flips _useNodeA and OpenGrpcStream recomputes endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress, then calls _grpcFactory.GetOrCreate(_siteIdentifier, endpoint) expecting a client connected to the other node. Because the factory ignores the new endpoint, the bridge actor reconnects to the same failed NodeA endpoint on every retry. The design doc's core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if NodeA failed, or vice versa") is therefore inoperative — when a site node goes down, the debug stream cannot move to the surviving node and simply exhausts MaxRetries against the dead endpoint and terminates. The _useNodeA flip, the previousEndpoint computation in HandleGrpcError, and the CleanupGrpc endpoint selection are all dead logic. (Communication-002's Unsubscribe-before-reconnect fix still functions, but it unsubscribes and re-subscribes on the same client/node rather than the intended other node.)

Recommendation

Make the per-site client aware of both endpoints, or key the cache by (siteIdentifier, endpoint), or have GetOrCreate detect an endpoint change and dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if NodeA connection fails"), the cleanest fix is to give SiteStreamGrpcClient (or a per-site holder) both NodeA/NodeB addresses and let it switch channels internally, removing the endpoint argument from GetOrCreate entirely. Add a test that drives a real SiteStreamGrpcClientFactory through a node flip and asserts the second client targets the other endpoint.

Resolution

Resolved 2026-05-17 (commit pending). Root cause confirmed against source: GetOrCreate was _clients.GetOrAdd(siteIdentifier, …) — keyed by site identifier only, so the grpcEndpoint argument was honoured solely on first creation and the NodeA→NodeB flip reconnected to the dead endpoint forever. Fix: SiteStreamGrpcClient now exposes its bound Endpoint, and GetOrCreate compares the cached client's endpoint against the requested one — on a mismatch it atomically installs (via ConcurrentDictionary.AddOrUpdate) a fresh client for the new endpoint and disposes the stale one, so a node flip actually moves to the surviving node. Regression tests SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint and GetOrCreate_SameEndpoint_DoesNotDisposeOrRecreate fail against the pre-fix factory and pass after.

Communication-013 — Site gRPC address changes are never applied; RemoveSiteAsync has no production caller

Severity Medium
Category Design-document adherence
Status Resolved
Location src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58

Description

The design doc states that SiteStreamGrpcClientFactory "Disposes clients on site removal or address change." RemoveSiteAsync implements the disposal mechanism, but a repo-wide search finds no production caller — only tests invoke it. Combined with the cache-by-site-identifier behaviour (Communication-012), the consequence is that once a site's SiteStreamGrpcClient is created, a later edit to that site's GrpcNodeAAddress / GrpcNodeBAddress (via the Central UI or CLI) is never reflected in the cached client — it keeps using the stale channel for the life of the process. CentralCommunicationActor already refreshes the Akka address cache every 60s and recreates ClusterClients on change, but there is no equivalent invalidation path wired into the gRPC client factory. A site whose gRPC endpoints are corrected after an initial misconfiguration will never have working debug streaming until the central node is restarted.

Recommendation

Wire a site-removal / address-change signal into SiteStreamGrpcClientFactory — e.g. have CentralCommunicationActor (which already detects address changes in HandleSiteAddressCacheLoaded) call RemoveSiteAsync for sites whose gRPC addresses changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If the on-the-fly address-change requirement is intentionally dropped, remove RemoveSiteAsync and correct the design doc.

Resolution

Resolved 2026-05-17 (commit pending). The address-change staleness — the primary impact ("never have working debug streaming until the central node is restarted") — is fixed in-module by the Communication-012 change: GetOrCreate is now endpoint-change-aware, so the next time DebugStreamBridgeActor requests a stream with a corrected GrpcNodeAAddress/GrpcNodeBAddress the stale cached client is disposed and replaced — no central restart needed and no external wiring required. RemoveSiteAsync is retained as the disposal path for full site removal (a deleted site record) and its doc comment now states that role explicitly; wiring a delete-site callback belongs to the site-management flow in another module and is out of this module's scope. Regression test SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_DisposesPriorClient fails against the pre-fix factory (stale client never disposed/replaced) and passes after.

Communication-014 — Untrusted gRPC correlation_id flows directly into an Akka actor name

Severity Low
Category Security
Status Resolved
Location src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124

Description

SubscribeInstance is a public gRPC endpoint hosted on each site node. It creates the relay actor with $"stream-relay-{request.CorrelationId}-{actorSeq}" as the actor name, where request.CorrelationId comes straight off the wire. Akka actor names have a restricted character set; a correlation_id containing /, whitespace, or other disallowed characters makes ActorSystem.ActorOf throw InvalidActorNameException. That exception is not caught inside SubscribeInstance, so it escapes as an unhandled RPC fault (and after the _streamSubscriber.Subscribe / _activeStreams entry has already been set up for the duration, though the finally does not run because the throw is before the try). In practice central always supplies a GUID, so impact is low, but the server is trusting client-supplied input to be actor-name-safe.

Recommendation

Validate request.CorrelationId on entry (non-empty, matches an expected GUID/safe pattern) and reject with StatusCode.InvalidArgument otherwise; or derive the actor name solely from the internal _actorCounter and keep the correlation ID only as actor state / dictionary key.

Resolution

Resolved 2026-05-17 (commit pending). Root cause confirmed: SubscribeInstance fed the off-the-wire request.CorrelationId straight into the stream-relay-… actor name, so an id with /, whitespace, or other disallowed characters made ActorOf throw InvalidActorNameException as an unhandled RPC fault. Fix: SubscribeInstance now validates CorrelationId on entry — rejecting null/empty or any value failing ActorPath.IsValidPathElement with StatusCode.InvalidArgument before any actor or subscription state is created. Regression test SiteStreamGrpcServerTests.RejectsCorrelationIdThatIsNotActorNameSafe (theory: /-bearing, whitespace, empty, $-prefixed ids) fails against the pre-fix server and passes after; AcceptsActorNameSafeCorrelationId confirms a normal GUID is still accepted.

Communication-015 — No test exercises the real gRPC client factory across a node flip

Severity Low
Category Testing coverage
Status Resolved
Location tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401, tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs

Description

DebugStreamBridgeActorTests exercises the reconnect/failover paths through MockSiteStreamGrpcClientFactory, which returns one fixed mock client regardless of the grpcEndpoint argument. This is exactly the behaviour the real SiteStreamGrpcClientFactory exhibits incorrectly (Communication-012), so the mock masks the defect: On_GrpcError_Reconnects_To_Other_Node passes even though the real factory never reaches the other node. SiteStreamGrpcClientFactoryTests only asserts GetOrCreate returns the same client for the same site — it never checks what happens when the same site is requested with a different endpoint.

Recommendation

Add a SiteStreamGrpcClientFactoryTests case that calls GetOrCreate(site, endpointA) then GetOrCreate(site, endpointB) and asserts the second call targets endpointB (it should fail today and pass after Communication-012 is fixed). Have the bridge-actor test's mock factory track the endpoint per call so node-flip coverage is meaningful.

Resolution

Resolved 2026-05-17 (commit pending). SiteStreamGrpcClientFactoryTests gained GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint and GetOrCreate_EndpointChanged_DisposesPriorClient, which drive the real SiteStreamGrpcClientFactory across a node flip and assert the second call yields a client bound to the new endpoint with the stale one disposed — both fail against the pre-fix factory and pass after (the Communication-012 fix). DebugStreamBridgeActorTests gained On_GrpcError_Reconnects_To_Other_Node_Endpoint, which uses a new EndpointTrackingGrpcClientFactory test double that hands out a distinct mock client per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather than being masked by an endpoint-agnostic mock.