Files

T

Joseph Doherty 6ae0fea558 fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings

Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
  captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId
  retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop — by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks
  via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep
  with a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its
  own try/catch so a throw doesn't leak the relay actor or _activeStreams
  entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects
  empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted
from prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).

2026-05-28 07:13:28 -04:00

60 KiB

Raw Blame History

Code Review — Communication

Field	Value
Module	`src/ScadaLink.Communication`
Design doc	`docs/requirements/Component-Communication.md`
Status	Reviewed
Last reviewed	2026-05-28
Reviewer	claude-agent
Commit reviewed	`1eb6e97`
Open findings	4

Summary

The Communication module is generally well-structured and matches the design doc's two-transport model (ClusterClient for command/control, gRPC server-streaming for real-time data). The actors keep mutable state on the actor thread, use PipeTo for async work, and the gRPC server/client lifecycle is mostly disciplined. However the review found several High and Medium issues clustered around two themes: (a) gRPC subscription bookkeeping races — SiteStreamGrpcClient overwrites and removes subscription entries by correlation ID without disposal or ownership checks, so reconnect cycles leak CancellationTokenSourcees and can cancel the wrong stream; and (b) missing supervision strategy on the coordinator actors, contrary to the CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is otherwise good. Test coverage is broad for happy paths but has gaps around failover, cache mutation races, and the snapshot-timeout cleanup path.

Re-review 2026-05-17 (commit `39d737e`)

All prior findings (Communication-001..011) are confirmed Resolved in this commit and the fixes hold up against the source. The re-review walked all 10 checklist categories again and uncovered a previously-missed defect at the centre of the gRPC node-failover path: SiteStreamGrpcClientFactory.GetOrCreate caches one client per site identifier and silently ignores the grpcEndpoint argument on a cache hit. The DebugStreamBridgeActor reconnect logic flips _useNodeA and passes the other node's endpoint, but the factory hands back the original NodeA-bound client every time — so the documented "try the other site node endpoint" failover never actually moves to NodeB (Communication-012). The same caching defect means a site address change is never picked up because RemoveSiteAsync has no production caller (Communication-013). Two Low findings round out the re-review: an untrusted gRPC-supplied correlation_id flows straight into an Akka actor name (Communication-014), and the factory's endpoint-reuse defect is masked by the test mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.

Re-review 2026-05-28 (commit `1eb6e97`)

All prior findings (Communication-001..015) remain Resolved in this commit. The re-review walked all 10 checklist categories again on the surface that has not been re-examined before — the central↔site command/control routing surface (CentralCommunicationActor, SiteCommunicationActor) rather than the previously-mined gRPC streaming surface — and uncovered a cluster of defects around the connection-state-change workflow. The single material finding is HandleConnectionStateChanged is dead code: no production code path emits ConnectionStateChanged, so the documented "kill active debug streams for the disconnected site" + "mark in-progress deployments as failed" workflow never fires at runtime (Communication-016). The downstream consequence is _inProgressDeployments grows unboundedly — entries are inserted on every deployment but only cleaned via that dead path (Communication-017). Three smaller items round out the re-review: site heartbeats hard-code IsActive: true regardless of node role (Communication-018), the 60-second-periodic LoadSiteAddressesFromDb task has no CancellationToken so a hung DB query has no upper bound (Communication-019), the SiteAddressCacheLoaded internal message carries a mutable Dictionary/List (Communication-020), SiteStreamGrpcServer.SubscribeInstance leaks the StreamRelayActor if _streamSubscriber.Subscribe throws between ActorOf and the try block (Communication-021), and _debugSubscriptions keyed by caller-supplied CorrelationId could orphan a subscriber on ID reuse (Communication-022). Seven new findings, all Open: one High, one Medium, five Low.

Checklist coverage 2026-05-28

#	Category	Examined	Notes
1	Correctness & logic bugs	✓	`HandleConnectionStateChanged` and its `_inProgressDeployments` / `_debugSubscriptions` cleanup never fire — the connection-state workflow is dead (Communication-016, Communication-017). `_debugSubscriptions` correlation-ID overwrite risk (Communication-022).
2	Akka.NET conventions	✓	`SiteAddressCacheLoaded` carries mutable `Dictionary<string, List<string>>` — violates message-immutability convention (Communication-020). `Forward`/`PipeTo`/Sender-capture all clean.
3	Concurrency & thread safety	✓	All mutable state mutated on the actor thread. `_subscriptions` ConcurrentDictionary use disciplined. No new issues.
4	Error handling & resilience	✓	`LoadSiteAddressesFromDb` lacks a `CancellationToken` propagation point (Communication-019). `SubscribeInstance` leaks the relay actor if `Subscribe` throws pre-try (Communication-021).
5	Security	✓	Correlation-id validation in place (Communication-014). No new issues.
6	Performance & resource management	✓	`_inProgressDeployments` grows unboundedly (Communication-017). gRPC client/server lifecycles otherwise clean.
7	Design-document adherence	✓	`ConnectionStateChanged` handler is dead code — the doc-stated "kill streams on disconnect, fail in-progress deployments" workflow does not actually run (Communication-016). Site heartbeats always report `IsActive: true` regardless of role (Communication-018).
8	Code organization & conventions	✓	Options pattern correct; mapper placement and proto evolution are additive-only. No new issues.
9	Testing coverage	✓	`CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled` exercises a code path that no production caller ever drives — gives false confidence (related to Communication-016).
10	Documentation & comments	✓	Detailed XML docs added in this commit. No new issues.

Checklist coverage

#	Category	Examined	Notes
1	Correctness & logic bugs	✓	Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved.
2	Akka.NET conventions	✓	Coordinator `Resume` strategies now present and verified. No new issues.
3	Concurrency & thread safety	✓	Subscription-map register/remove now ownership-checked. `_siteClients` readonly. No new issues.
4	Error handling & resilience	✓	`Status.Failure` handler added; reconnect unsubscribes prior stream. No new issues.
5	Security	✓	Re-review: public gRPC `correlation_id` flows unvalidated into an Akka actor name (Communication-014).
6	Performance & resource management	✓	Synchronous `Dispose` paths fixed; CTS leaks resolved. No new issues.
7	Design-document adherence	✓	Re-review: site gRPC address-change disposal not wired — `RemoveSiteAsync` is dead code (Communication-013). gRPC options now applied.
8	Code organization & conventions	✓	Options pattern correct; public records still declared in actor files (acceptable). No structural issues.
9	Testing coverage	✓	Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015).
10	Documentation & comments	✓	`DebugStreamBridgeActor` summary corrected. No new issues.

Findings

Communication-001 — Early stream termination escapes StartStreamAsync's narrow exception handling


Severity	Medium
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.Communication/DebugStreamService.cs:130-143`

Re-triaged 2026-05-16: originally filed Critical, claiming an orphaned bridge actor and a multi-minute site-side resource leak on every snapshot timeout. On verification that impact does not occur: DebugStreamBridgeActor calls CleanupGrpc() and Context.Stop(Self) on every path that invokes onTerminated (site disconnect, gRPC max-retries, ReceiveTimeout), so it always self-terminates and releases its gRPC subscription; and the pure-timeout path does reach StopStream, which also stops it. The genuine defect described below is an error-handling gap, not a leak — severity corrected to Medium.

Description

StartStreamAsync awaits the initial snapshot inside a try whose only handler is catch (OperationCanceledException). When the stream terminates before the snapshot arrives, onTerminatedWrapper completes the await via snapshotTcs.TrySetException(new InvalidOperationException(...)). That InvalidOperationException is not an OperationCanceledException, so it escapes the catch entirely: the caller (Blazor debug view / SignalR hub) receives a raw, untranslated exception, and StartStreamAsync performs no teardown of its own on that path — it relies implicitly on the bridge actor self-terminating. Cleanup from the service side is therefore not deterministic, and the failure surfaced to the caller is not a meaningful, documented result.

Recommendation

In StartStreamAsync, catch any exception from the snapshot await, deterministically tear down the bridge actor (Tell(StopDebugStream) via the local actor reference, since a racing onTerminatedWrapper may already have removed the session entry), and translate the failure into a meaningful exception for the caller.

Resolution

Resolved 2026-05-16. The catch (OperationCanceledException)-only block in StartStreamAsync was replaced with catch (Exception): it removes the session entry, sends StopDebugStream to the bridge actor via the local reference (idempotent — the actor may already be stopping itself), and throws a descriptive exception — TimeoutException for the 30s timeout, otherwise an InvalidOperationException that names the instance/site and wraps the underlying cause. Regression test DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException fails against the pre-fix code and passes after. Fixed by the commit whose message references Communication-001.

Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors


Severity	High
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:170`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:143`

Description

On a gRPC stream error, HandleGrpcError increments the retry count, flips _useNodeA, and schedules OpenGrpcStream. OpenGrpcStream cancels and disposes _grpcCts and starts a fresh SubscribeInstance call — but it never calls client.Unsubscribe(_correlationId) on the old node's client, and the site-side SiteStreamGrpcServer keys active streams by correlation_id only. Because the new subscription goes to the other node (_useNodeA flipped), the old node's SiteStreamGrpcServer still has an active stream + StreamRelayActor + SiteStreamManager subscription for that correlation ID. The old node only learns the client is gone via TCP RST or keepalive — exactly the failure mode that triggered the reconnect (network partition / silent node), so detection may take ~25s or never. Each reconnect can therefore leave a zombie relay actor on the failed node. CleanupGrpc (which does call Unsubscribe) is only invoked on terminal paths, not between reconnect attempts.

Recommendation

Before reconnecting in HandleGrpcError / at the top of OpenGrpcStream, call Unsubscribe(_correlationId) on the client for the previous endpoint (the one that just failed) so the local CTS is cancelled and — where the channel is still alive — the gRPC cancellation reaches the site and stops the relay actor.

Resolution

Resolved 2026-05-16 (commit <pending>). Root cause confirmed against source: HandleGrpcError flipped _useNodeA and scheduled OpenGrpcStream without ever unsubscribing the failed stream, leaving the old node's StreamRelayActor zombie until TCP/keepalive timeout. Fix: HandleGrpcError now resolves the client for the previous endpoint (before flipping _useNodeA) and calls Unsubscribe(_correlationId) on it, so the local CTS is cancelled and gRPC cancellation reaches the still-alive site. Regression test DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect fails against the pre-fix code and passes after.

Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream


Severity	High
Category	Concurrency & thread safety
Status	Resolved
Location	`src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:77`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:106`

Description

SubscribeAsync does _subscriptions[correlationId] = cts; (line 77), unconditionally overwriting any existing entry for that correlation ID without cancelling or disposing the previous CancellationTokenSource. The finally block then does _subscriptions.TryRemove(correlationId, out _) (line 106) which removes the entry by key only, regardless of which CTS is stored. Because DebugStreamBridgeActor reuses the same _correlationId across reconnect attempts (and SiteStreamGrpcClientFactory returns the same SiteStreamGrpcClient for a site even after a node flip), two SubscribeAsync calls can briefly share a correlation ID. The first call's finally then removes the second call's CTS entry, so a later Unsubscribe(correlationId) finds nothing and the live stream is never cancelled — an orphan. Conversely the overwritten CTS is leaked (never disposed).

Recommendation

When inserting, cancel+dispose any prior CTS for that correlation ID. In the finally, remove only if the stored CTS is the one this call created (use the TryRemove(KeyValuePair) overload, mirroring what SiteStreamGrpcServer already does with StreamEntry). Consider keying subscriptions by a per-call GUID rather than the caller-supplied correlation ID.

Resolution

Resolved 2026-05-16 (commit <pending>). Root cause confirmed against source: the inline _subscriptions[correlationId] = cts overwrote a prior CTS without cancel/dispose (leak), and the finally's TryRemove(correlationId, out _) removed by key only — a racing reconnect's live CTS could be removed by the prior call's finally, orphaning the live stream. Fix: extracted two internal helpers used by SubscribeAsync — RegisterSubscription cancels+disposes any existing CTS for the correlation ID before inserting, and RemoveSubscription uses the ConcurrentDictionary.TryRemove(KeyValuePair) overload so it removes only the CTS that call created (mirroring SiteStreamGrpcServer's StreamEntry pattern). Regression tests SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts and SiteStreamGrpcClientTests.RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement fail against the pre-fix logic and pass after.

Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)


Severity	Medium
Category	Akka.NET conventions
Status	Resolved
Location	`src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:42`, `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:22`

Description

CLAUDE.md ("Explicit supervision strategies: Resume for coordinator actors, Stop for short-lived execution actors") requires coordinator actors to use an explicit Resume supervision strategy. CentralCommunicationActor and SiteCommunicationActor are long-lived coordinators (they own the per-site ClusterClient map, debug subscriptions, in-progress deployments) but neither overrides SupervisorStrategy. They fall back to the Akka default (OneForOneStrategy with Restart). A child fault — e.g. a ClusterClient child of CentralCommunicationActor created by DefaultSiteClientFactory — would Restart under the default strategy, and any exception in the coordinator itself would restart it, wiping _siteClients, _debugSubscriptions, and _inProgressDeployments silently. The design intent is Resume so transient child faults do not discard coordinator state.

Recommendation

Override SupervisorStrategy on both actors to return an explicit OneForOneStrategy with Directive.Resume (or the project's standard coordinator strategy), matching the documented decision and other coordinator actors.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: neither CentralCommunicationActor nor SiteCommunicationActor overrode SupervisorStrategy, so child faults fell back to the Akka default (Restart). Note that an actor's own SupervisorStrategy governs its children — a transient child fault would Restart the child and discard its in-memory state, contrary to the CLAUDE.md "Resume for coordinator actors" decision. Fix: both actors now override SupervisorStrategy() to return a OneForOneStrategy with an unbounded Decider resolving to Directive.Resume (mirroring DataConnectionManagerActor). Regression tests CoordinatorSupervisionTests.CentralCommunicationActor_SupervisorStrategy_IsResume and CoordinatorSupervisionTests.SiteCommunicationActor_SupervisorStrategy_IsResume fail against the pre-fix code (decider yields Restart) and pass after.

Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied


Severity	Medium
Category	Design-document adherence
Status	Resolved
Location	`src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:25`, `src/ScadaLink.Communication/CommunicationOptions.cs:36`

Description

CommunicationOptions exposes GrpcKeepAlivePingDelay, GrpcKeepAlivePingTimeout, GrpcMaxStreamLifetime, and GrpcMaxConcurrentStreams, and the design doc's "gRPC Connection Keepalive" section explicitly states these are configurable. However SiteStreamGrpcClient's constructor hard-codes KeepAlivePingDelay = TimeSpan.FromSeconds(15) and KeepAlivePingTimeout = TimeSpan.FromSeconds(10) instead of reading the options. GrpcMaxStreamLifetime (the documented "Session timeout — 4 hours" third layer of dead-client detection) is not referenced anywhere — SiteStreamGrpcServer.SubscribeInstance creates a linked CTS from the call cancellation token only, with no CancelAfter. The 4-hour zombie-stream safety net described in the design doc does not exist in code. GrpcMaxConcurrentStreams is also not wired to the server (SiteStreamGrpcServer takes a maxConcurrentStreams constructor parameter defaulting to 100, but nothing binds the option to it).

Recommendation

Flow CommunicationOptions into SiteStreamGrpcClient and SiteStreamGrpcServer (via the factory / DI). Apply GrpcKeepAlivePingDelay / GrpcKeepAlivePingTimeout to the SocketsHttpHandler, bind GrpcMaxConcurrentStreams to the server's limit, and implement the GrpcMaxStreamLifetime session timeout with CancelAfter on the server-side stream CTS — or, if the 4-hour cap is intentionally dropped, remove the option and update the design doc.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: SiteStreamGrpcClient hard-coded the keepalive values, GrpcMaxStreamLifetime was referenced nowhere, and GrpcMaxConcurrentStreams was never bound to the server. Fix (scoped to src/ScadaLink.Communication): SiteStreamGrpcClient gained a constructor taking CommunicationOptions and now applies GrpcKeepAlivePingDelay/GrpcKeepAlivePingTimeout to its SocketsHttpHandler; SiteStreamGrpcClientFactory gained an IOptions<CommunicationOptions> DI constructor and flows the options into every client it creates; SiteStreamGrpcServer gained an IOptions<CommunicationOptions> DI constructor that binds GrpcMaxConcurrentStreams and implements the documented 4-hour session timeout via CancellationTokenSource.CancelAfter(GrpcMaxStreamLifetime) on the per-stream CTS. The Host's existing AddSingleton<SiteStreamGrpcServer>() registration resolves the new DI constructor via greedy resolution — no Host change required. Regression tests GrpcOptionsWiringTests.SiteStreamGrpcClient_AppliesKeepAliveFromOptions, GrpcOptionsWiringTests.SiteStreamGrpcClientFactory_FlowsOptionsToCreatedClients, and GrpcOptionsWiringTests.SiteStreamGrpcServer_BindsMaxConcurrentStreamsAndLifetimeFromOptions exercise the wiring (they require the new members to even compile).

Communication-006 — Site address load failures are silently swallowed, leaving a stale cache


Severity	Medium
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:204`

Description

LoadSiteAddressesFromDb runs the repository query inside Task.Run(...).PipeTo(self). If GetAllSitesAsync throws (database unavailable, transient connection error), the faulted task is piped to Self as a Status.Failure. CentralCommunicationActor has no Receive<Status.Failure> handler, so the failure becomes an unhandled message (logged at debug, not surfaced) and the periodic refresh silently fails. If the first startup load fails the actor runs with an empty _siteClients map — every SiteEnvelope is dropped (line 187) and every Ask times out with no indication of the root cause.

Recommendation

Add a Receive<Status.Failure> handler that logs the load failure at Warning/Error level so operators can distinguish "site has no addresses configured" from "database is down". Optionally surface a health metric for repeated load failures.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: a faulted LoadSiteAddressesFromDb task is piped to Self as a Status.Failure, but the actor had no handler for it — the failure became an unhandled message (debug-level only) and the periodic refresh failed silently. Fix: added a Receive<Status.Failure> handler that logs the load failure at Warning with the underlying exception as the cause, so operators can distinguish a missing-addresses configuration from a database outage. Regression test CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed (repository query throws) asserts the Warning is emitted — it produces no warning against the pre-fix code and passes after.

Communication-007 — `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async)


Severity	Medium
Category	Performance & resource management
Status	Resolved
Location	`src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53`

Description

Dispose() calls DisposeAsync().AsTask().GetAwaiter().GetResult(). This is the classic sync-over-async pattern: it blocks the calling thread until all per-site SiteStreamGrpcClient.DisposeAsync calls complete. If Dispose is invoked from a context with a single-threaded synchronization context or from DI container shutdown on a constrained thread pool, this can deadlock or stall host shutdown. The class already implements IAsyncDisposable.

Recommendation

Prefer registering and disposing the factory through IAsyncDisposable only (modern .NET DI honours it for singletons). If a synchronous Dispose must remain, dispose the underlying GrpcChannels directly (synchronous) rather than blocking on the async path, or document why blocking is safe here.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: Dispose() called DisposeAsync().AsTask().GetAwaiter().GetResult(), the classic sync-over-async pattern. Fix: SiteStreamGrpcClient now also implements IDisposable with a synchronous Dispose() that releases its CancellationTokenSources and underlying GrpcChannel directly (all of that teardown is inherently synchronous); SiteStreamGrpcClientFactory.Dispose() now disposes each cached client via that synchronous path with no blocking on the async path. A CreateClient seam was extracted so the test can substitute a tracking client while still exercising the factory's real caching/disposal machinery. Regression test SiteStreamGrpcClientFactoryDisposeTests.Dispose_DisposesClientsSynchronously_NotViaAsyncPath fails against the pre-fix code (clients disposed via DisposeAsync) and passes after; Dispose_DoesNotDeadlock_UnderSingleThreadedSynchronizationContext guards the stall path.

Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely


Severity	Medium
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:71`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:174`

Description

_retryCount is reset to 0 every time a single AttributeValueChanged or AlarmStateChanged event is received (lines 72, 77). Combined with MaxRetries = 3, a stream that connects, delivers exactly one event, then fails — repeatedly — will reconnect forever. The design doc states "max 3 retries, terminate the session if all retries fail"; the current logic only terminates after 3 consecutive failures with zero intervening events, so a flapping site never trips the limit and the debug session (and its site-side relay) lives on indefinitely. The ReceiveTimeout orphan net is also reset by every received message, so it does not bound this case either.

Recommendation

Either reset _retryCount only after the stream has been stably connected for some minimum duration (e.g. a timer armed on stream open, cancelled on the next error), or keep a separate cumulative reconnect counter / time window that bounds total reconnects regardless of intervening events.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: _retryCount was reset to 0 on every received AttributeValueChanged/AlarmStateChanged, so a stream that connected, delivered one event, then failed — repeatedly — never tripped MaxRetries. Fix (recommendation option a): the per-event reset was removed; instead OpenGrpcStream arms a single StabilityWindow timer (60s default, internal-settable for tests), and only when it fires (GrpcStreamStable) — i.e. the stream stayed up long enough to be considered recovered — is _retryCount reset. HandleGrpcError cancels that timer, so a stream that fails before the window elapses does not recover its retry budget. A flapping stream therefore terminates after MaxRetries regardless of intervening events. Regression test DebugStreamBridgeActorTests.FlappingStream_DeliveringEventsBetweenFailures_StillTerminatesAfterMaxRetries fails against the pre-fix code (actor never terminates) and passes after; RetryCount_RecoveredOnlyAfterStreamStaysStableForStabilityWindow verifies the budget is recovered after a stable interval. The pre-existing test that codified the buggy per-event reset (Grpc_Error_Resets_RetryCount_On_Successful_Event) was replaced.

Communication-009 — `_siteClients` field is mutable and reassignable; cache update is not atomic on failure


Severity	Low
Category	Concurrency & thread safety
Status	Resolved
Location	`src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:53`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:240`

Description

_siteClients is a non-readonly Dictionary field. It is only mutated on the actor thread (correct), but the field is needlessly reassignable, and HandleSiteAddressCacheLoaded mutates it in place across several loops. If ActorPath.Parse throws on a malformed address mid-loop (e.g. a site row with a garbage NodeAAddress), the method aborts partway through, having already stopped some ClusterClients and added others — leaving the cache partially updated with no recovery until the next 60s refresh. The other actor mutable collections (_debugSubscriptions, _inProgressDeployments) are correctly readonly.

Recommendation

Mark _siteClients readonly. Validate/parse all addresses up front (or wrap ActorPath.Parse in a try/catch that logs and skips the bad site) so a single malformed site record cannot abort the whole refresh and leave a half-updated cache.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: _siteClients was a non-readonly field, and HandleSiteAddressCacheLoaded's add/update loop called ActorPath.Parse per site with no guard — a malformed NodeAAddress threw mid-loop and aborted the refresh, leaving the cache half-updated until the next 60s cycle. Fix: _siteClients is now readonly, and the per-site ActorPath.Parse is wrapped in a try/catch that logs the bad address at Warning and continues to the next site, so a single garbage row cannot starve other sites of their ClusterClient. Regression test CentralCommunicationActorTests.MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered (bad site ordered before a good one) fails against the pre-fix code (good site never registered) and passes after.

Communication-010 — `DebugStreamBridgeActor` XML doc incorrectly describes it as a "Persistent actor"


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:10`

Description

The class summary opens with "Persistent actor (one per active debug session)...". The actor derives from ReceiveActor, not a persistent actor base class, holds no PersistenceId, and writes no journal/snapshot. "Persistent" is misleading — debug sessions are explicitly "session-based and temporary" per the design doc. A reader could assume state survives restart, which it does not.

Recommendation

Reword the summary to "Long-lived (per active debug session) actor on the central side..." or similar, removing the word "Persistent".

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed: the class summary opened with "Persistent actor (one per active debug session)..." but the actor derives from ReceiveActor, holds no PersistenceId, and writes no journal/snapshot. Fix (documentation only — no behaviour change, so no regression test): the summary was reworded to "Long-lived (one per active debug session) actor on the central side. Debug sessions are session-based and temporary — this actor holds no persisted state and does not derive from an Akka.Persistence base class; its state does not survive a restart."

Communication-011 — No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak


Severity	Low
Category	Testing coverage
Status	Resolved
Location	`tests/ScadaLink.Communication.Tests/` (module-wide)

Description

The test suite covers happy-path routing, handler-not-registered failures, heartbeat bumping, cache refresh, and gRPC bridge reconnect/retry. However several critical paths identified in this review have no coverage:

The DebugStreamService.StartStreamAsync snapshot-timeout path (Communication-001) — no test verifies bridge actor / site subscription teardown on timeout, nor the onTerminated-before-snapshot race that throws a non-OperationCanceledException.
CentralCommunicationActor behaviour when LoadSiteAddressesFromDb faults (Communication-006) — RefreshSiteAddresses_UpdatesCache only exercises success.
SiteStreamGrpcClient subscription-map overwrite/removal race (Communication-003) and gRPC reconnect not unsubscribing the old node (Communication-002).
A malformed NodeAAddress aborting HandleSiteAddressCacheLoaded (Communication-009).

Recommendation

Add tests for: snapshot timeout / pre-snapshot termination cleanup; address-load failure logging and empty-cache behaviour; reusing a correlation ID across SubscribeAsync calls; and a malformed site address during cache refresh.

Resolution

Resolved 2026-05-16 (commit pending). This is a meta-coverage finding; every gap it enumerates is now covered by a regression test (each fails against its pre-fix code and passes after):

Snapshot timeout / pre-snapshot termination (Communication-001) — DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException.
gRPC reconnect not unsubscribing the old node (Communication-002) — DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect.
SiteStreamGrpcClient subscription-map overwrite/removal race (Communication-003) — SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts and RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement.
LoadSiteAddressesFromDb fault (Communication-006) — CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed.
Malformed NodeAAddress aborting HandleSiteAddressCacheLoaded (Communication-009) — CentralCommunicationActorTests.MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered (added with this finding's resolution). The full module suite (dotnet test tests/ScadaLink.Communication.Tests) is green at 111 passing tests.

Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover


Severity	High
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166`

Description

SiteStreamGrpcClientFactory.GetOrCreate is _clients.GetOrAdd(siteIdentifier, _ => CreateClient(grpcEndpoint)) — it keys the cache by site identifier only and the grpcEndpoint argument is used exclusively for the first-ever creation. Every subsequent call for that site returns the originally-cached SiteStreamGrpcClient, which is permanently bound to the GrpcChannel of whatever endpoint was passed first.

DebugStreamBridgeActor relies on the opposite behaviour. On a gRPC stream error, HandleGrpcError flips _useNodeA and OpenGrpcStream recomputes endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress, then calls _grpcFactory.GetOrCreate(_siteIdentifier, endpoint) expecting a client connected to the other node. Because the factory ignores the new endpoint, the bridge actor reconnects to the same failed NodeA endpoint on every retry. The design doc's core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if NodeA failed, or vice versa") is therefore inoperative — when a site node goes down, the debug stream cannot move to the surviving node and simply exhausts MaxRetries against the dead endpoint and terminates. The _useNodeA flip, the previousEndpoint computation in HandleGrpcError, and the CleanupGrpc endpoint selection are all dead logic. (Communication-002's Unsubscribe-before-reconnect fix still functions, but it unsubscribes and re-subscribes on the same client/node rather than the intended other node.)

Recommendation

Make the per-site client aware of both endpoints, or key the cache by (siteIdentifier, endpoint), or have GetOrCreate detect an endpoint change and dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if NodeA connection fails"), the cleanest fix is to give SiteStreamGrpcClient (or a per-site holder) both NodeA/NodeB addresses and let it switch channels internally, removing the endpoint argument from GetOrCreate entirely. Add a test that drives a real SiteStreamGrpcClientFactory through a node flip and asserts the second client targets the other endpoint.

Resolution

Resolved 2026-05-17 (commit pending). Root cause confirmed against source: GetOrCreate was _clients.GetOrAdd(siteIdentifier, …) — keyed by site identifier only, so the grpcEndpoint argument was honoured solely on first creation and the NodeA→NodeB flip reconnected to the dead endpoint forever. Fix: SiteStreamGrpcClient now exposes its bound Endpoint, and GetOrCreate compares the cached client's endpoint against the requested one — on a mismatch it atomically installs (via ConcurrentDictionary.AddOrUpdate) a fresh client for the new endpoint and disposes the stale one, so a node flip actually moves to the surviving node. Regression tests SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint and GetOrCreate_SameEndpoint_DoesNotDisposeOrRecreate fail against the pre-fix factory and pass after.

Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller


Severity	Medium
Category	Design-document adherence
Status	Resolved
Location	`src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58`

Description

The design doc states that SiteStreamGrpcClientFactory "Disposes clients on site removal or address change." RemoveSiteAsync implements the disposal mechanism, but a repo-wide search finds no production caller — only tests invoke it. Combined with the cache-by-site-identifier behaviour (Communication-012), the consequence is that once a site's SiteStreamGrpcClient is created, a later edit to that site's GrpcNodeAAddress / GrpcNodeBAddress (via the Central UI or CLI) is never reflected in the cached client — it keeps using the stale channel for the life of the process. CentralCommunicationActor already refreshes the Akka address cache every 60s and recreates ClusterClients on change, but there is no equivalent invalidation path wired into the gRPC client factory. A site whose gRPC endpoints are corrected after an initial misconfiguration will never have working debug streaming until the central node is restarted.

Recommendation

Wire a site-removal / address-change signal into SiteStreamGrpcClientFactory — e.g. have CentralCommunicationActor (which already detects address changes in HandleSiteAddressCacheLoaded) call RemoveSiteAsync for sites whose gRPC addresses changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If the on-the-fly address-change requirement is intentionally dropped, remove RemoveSiteAsync and correct the design doc.

Resolution

Resolved 2026-05-17 (commit pending). The address-change staleness — the primary impact ("never have working debug streaming until the central node is restarted") — is fixed in-module by the Communication-012 change: GetOrCreate is now endpoint-change-aware, so the next time DebugStreamBridgeActor requests a stream with a corrected GrpcNodeAAddress/GrpcNodeBAddress the stale cached client is disposed and replaced — no central restart needed and no external wiring required. RemoveSiteAsync is retained as the disposal path for full site removal (a deleted site record) and its doc comment now states that role explicitly; wiring a delete-site callback belongs to the site-management flow in another module and is out of this module's scope. Regression test SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_DisposesPriorClient fails against the pre-fix factory (stale client never disposed/replaced) and passes after.

Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name


Severity	Low
Category	Security
Status	Resolved
Location	`src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124`

Description

SubscribeInstance is a public gRPC endpoint hosted on each site node. It creates the relay actor with $"stream-relay-{request.CorrelationId}-{actorSeq}" as the actor name, where request.CorrelationId comes straight off the wire. Akka actor names have a restricted character set; a correlation_id containing /, whitespace, or other disallowed characters makes ActorSystem.ActorOf throw InvalidActorNameException. That exception is not caught inside SubscribeInstance, so it escapes as an unhandled RPC fault (and after the _streamSubscriber.Subscribe / _activeStreams entry has already been set up for the duration, though the finally does not run because the throw is before the try). In practice central always supplies a GUID, so impact is low, but the server is trusting client-supplied input to be actor-name-safe.

Recommendation

Validate request.CorrelationId on entry (non-empty, matches an expected GUID/safe pattern) and reject with StatusCode.InvalidArgument otherwise; or derive the actor name solely from the internal _actorCounter and keep the correlation ID only as actor state / dictionary key.

Resolution

Resolved 2026-05-17 (commit pending). Root cause confirmed: SubscribeInstance fed the off-the-wire request.CorrelationId straight into the stream-relay-… actor name, so an id with /, whitespace, or other disallowed characters made ActorOf throw InvalidActorNameException as an unhandled RPC fault. Fix: SubscribeInstance now validates CorrelationId on entry — rejecting null/empty or any value failing ActorPath.IsValidPathElement with StatusCode.InvalidArgument before any actor or subscription state is created. Regression test SiteStreamGrpcServerTests.RejectsCorrelationIdThatIsNotActorNameSafe (theory: /-bearing, whitespace, empty, $-prefixed ids) fails against the pre-fix server and passes after; AcceptsActorNameSafeCorrelationId confirms a normal GUID is still accepted.

Communication-015 — No test exercises the real gRPC client factory across a node flip


Severity	Low
Category	Testing coverage
Status	Resolved
Location	`tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs`

Description

DebugStreamBridgeActorTests exercises the reconnect/failover paths through MockSiteStreamGrpcClientFactory, which returns one fixed mock client regardless of the grpcEndpoint argument. This is exactly the behaviour the real SiteStreamGrpcClientFactory exhibits incorrectly (Communication-012), so the mock masks the defect: On_GrpcError_Reconnects_To_Other_Node passes even though the real factory never reaches the other node. SiteStreamGrpcClientFactoryTests only asserts GetOrCreate returns the same client for the same site — it never checks what happens when the same site is requested with a different endpoint.

Recommendation

Add a SiteStreamGrpcClientFactoryTests case that calls GetOrCreate(site, endpointA) then GetOrCreate(site, endpointB) and asserts the second call targets endpointB (it should fail today and pass after Communication-012 is fixed). Have the bridge-actor test's mock factory track the endpoint per call so node-flip coverage is meaningful.

Resolution

Resolved 2026-05-17 (commit pending). SiteStreamGrpcClientFactoryTests gained GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint and GetOrCreate_EndpointChanged_DisposesPriorClient, which drive the real SiteStreamGrpcClientFactory across a node flip and assert the second call yields a client bound to the new endpoint with the stale one disposed — both fail against the pre-fix factory and pass after (the Communication-012 fix). DebugStreamBridgeActorTests gained On_GrpcError_Reconnects_To_Other_Node_Endpoint, which uses a new EndpointTrackingGrpcClientFactory test double that hands out a distinct mock client per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather than being masked by an endpoint-agnostic mock.

Communication-016 — `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires


Severity	High
Category	Design-document adherence
Status	Resolved
Location	`src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:169`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:338-375`

Resolution — deleted the dead code path in favour of the keepalive-based detection that is the actual production behaviour: removed the Receive<ConnectionStateChanged> handler, the HandleConnectionStateChanged method, the _debugSubscriptions / _inProgressDeployments tracking dicts

the TrackMessageForCleanup helper that fed them, and the dead message record src/ScadaLink.Commons/Messages/Communication/ConnectionStateChanged.cs. The two dead tests (ConnectionLost_DebugStreamsKilled in CentralCommunicationActorTests, RoundTrip_ConnectionStateChanged_Succeeds in CompatibilityTests) were removed alongside. The design doc docs/requirements/Component-Communication.md "Connection Failure Behavior" section was updated to state explicitly that disconnect is detected at the transport layer (gRPC keepalive PING ~25 s for debug streams; Ask-timeout at the CommunicationService layer for command/control), with no application-level signal. DebugStreamTerminated survives because DebugStreamBridgeActor uses it for an unrelated intra-actor stop signal.

Description

CentralCommunicationActor.HandleConnectionStateChanged is wired to Receive<ConnectionStateChanged> and implements two important workflows on IsConnected == false: (1) kill every active debug stream for the disconnected site (_debugSubscriptions walk → DebugStreamTerminated Tell to each subscriber); (2) mark every in-progress deployment for that site as failed (_inProgressDeployments walk → entry removal). Both are documented in the component design doc's "Connection Failure Behavior" section and in WP-5 of the work plan referenced in the class's own XML doc comment.

A repo-wide search (grep -rn ConnectionStateChanged src/ tests/) shows no production code ever emits ConnectionStateChanged. The only producers are the unit test CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled (line 137) and the Commons message-roundtrip test. The CentralCommunicationActor therefore never receives one in production, the disconnect-cleanup workflow never fires, and _debugSubscriptions / _inProgressDeployments are never pruned via this path.

Concrete consequences:

A site goes down → its active debug streams do not get a synchronous DebugStreamTerminated notification from central. The bridge actor must detect the disconnect itself via gRPC keepalive timing out (~25s) or TCP RST. Subscribers wait that long for the OnStreamTerminated callback instead of the documented "immediately killed by central" behaviour.
In-progress deployments to a disconnected site continue to occupy the Ask-reply window and only fail when the Ask times out at the CommunicationService.DeployInstanceAsync layer (120s). They are never proactively marked failed.
The unit test gives a strong false impression that the workflow works — it exercises a code path that has no production caller.

The design doc and CLAUDE.md mention "ClusterClient handles failover between NodeA and NodeB internally — there is no application-level NodeA preference / NodeB fallback logic" — so the ClusterClient mechanism is the documented failover transport. But that says nothing about signalling a fully-down remote cluster to central's coordinator actor, which is exactly what ConnectionStateChanged was meant to do.

Recommendation

Pick one of:

Wire a producer for ConnectionStateChanged — e.g. subscribe to ClusterClient's contact-point/cluster events (ClusterClient.ContactPoints Refresh / ContactPointAdded / ContactPointRemoved) or watch the ClusterClient actor for a "no contact points reachable" state — and have it publish ConnectionStateChanged to Self on each transition.
If the documented "synchronously kill streams on disconnect" behaviour is intentionally being dropped in favour of the slower keepalive-based detection, delete the handler, the ConnectionStateChanged record, and the related _debugSubscriptions / _inProgressDeployments tracking, then update the design doc's "Connection Failure Behavior" section accordingly.

Either way, replace CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled — at present it asserts a behaviour that no production code triggers.

Communication-017 — `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up


Severity	Medium
Category	Performance & resource management
Status	Open
Location	`src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:73`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:501`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:357-367`

Description

TrackMessageForCleanup inserts _inProgressDeployments[deploy.DeploymentId] = envelope.SiteId on every DeployInstanceCommand routed to a site (line 501). The only places that remove from _inProgressDeployments are:

HandleConnectionStateChanged on IsConnected == false (line 366) — which per Communication-016 never fires in production.
PostStop (line 553) — only on actor death (central failover).

There is no removal on the normal happy path — neither when the site replies DeploymentStatusResponse (the reply goes to the Ask's temporary reply actor, not back through CentralCommunicationActor), nor on Ask timeout. Every successful or failed deployment leaves its entry behind for the lifetime of the process.

Memory impact is modest (each entry is ~70-100 bytes), but the dictionary grows monotonically. Over months of operation across all sites a central node could accumulate tens of thousands of entries — a real, observable leak. More seriously, the field is also the source-of-truth set the HandleConnectionStateChanged walk uses to fail in-progress deployments, so even if a ConnectionStateChanged were fired today, the walk would "fail" thousands of already-completed deployments and Tell their (now stale) correlation-IDs into the void.

_debugSubscriptions (line 67) shares the same shape — but a normal debug session ends with an UnsubscribeDebugViewRequest that does drive cleanup (line 497), so leaks are only realised when a consumer crashes without unsubscribing.

Recommendation

Either remove _inProgressDeployments entirely (it has no other consumer once Communication-016 is fixed by deletion) or, if the disconnect-cleanup workflow is retained, add a removal hook on the reply path. The simplest fix is to subscribe CentralCommunicationActor to the Ask reply: route DeployInstanceCommand through the actor with the actor as the Ask sender, forward the reply to the original caller, and _inProgressDeployments.Remove in the same handler. (Today the Ask is taken on the actor itself by the caller, so the reply skips the coordinator.)

Communication-018 — Site heartbeats hard-code `IsActive: true` regardless of node role


Severity	Low
Category	Design-document adherence
Status	Resolved
Location	`src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:376-465`

Description

SiteCommunicationActor.SendHeartbeatToCentral builds new HeartbeatMessage(_siteId, hostname, IsActive: true, DateTimeOffset.UtcNow) on every periodic tick (line 366), with no inspection of whether this node is actually the active site node or a standby. The HeartbeatMessage.IsActive field thus carries the literal value true on every heartbeat from every node, and the field is effectively dead — central's HandleHeartbeat doesn't consume it either (line 297 only passes SiteId and Timestamp to MarkHeartbeat).

Per CLAUDE.md's Cluster & Failover section the active/standby distinction is real ("Both nodes are seed nodes", "keep-oldest split-brain resolver", "automatic dual-node recovery"), so a heartbeat that could carry node-role information would be useful for the central health dashboard distinguishing "active node down, standby up" from "site fully offline". As shipped, the field is contract noise and a future implementer might mistakenly assume it already carries meaningful state.

Recommendation

Either (a) resolve the current cluster role at heartbeat-send time and pass it through — e.g. Cluster.Get(Context.System).SelfRoles.Contains("active") or the project's existing role mechanism — and have the central aggregator consume IsActive; or (b) drop the IsActive field from HeartbeatMessage (additive-only-evolution: deprecate the field, default to true, plan removal in a major message contract revision).

Resolution (2026-05-28): Took option (a). SiteCommunicationActor now accepts an optional Func<bool>? isActiveCheck (default = real Cluster.Get leader check mirroring ActiveNodeGate / ActiveNodeHealthCheck) and SendHeartbeatToCentral stamps HeartbeatMessage.IsActive with the result. A try/catch keeps the heartbeat tick alive when the cluster state is unreadable (warm-up / TestKit without cluster plugin) — falls back to IsActive: false, the safe non-claiming value. Added parameterised test Heartbeat_StampsIsActive_FromInjectedCheck. Tests green (199/199 in Communication.Tests).

Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository


Severity	Low
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:397-431`

Description

LoadSiteAddressesFromDb runs await repo.GetAllSitesAsync() inside Task.Run(async () => ...).PipeTo(self) with no cancellation token (line 404). The repository signature accepts CancellationToken (the test mock declares GetAllSitesAsync(Arg.Any<CancellationToken>())), but the actor calls the no-arg overload — so a hung MS SQL connection has no upper bound. The 60-second-periodic refresh keeps firing; each tick spawns a fresh Task.Run that piles up if the database is consistently slow. The actor itself is unaffected (it's not blocked), but pending tasks and DB connection-pool resources accumulate, and the Status.Failure handler (Communication-006) never fires because the task never faults — it just sits.

Recommendation

Maintain a per-load CancellationTokenSource with a deadline (e.g. the same 60s the refresh runs on, or a configurable timeout in CommunicationOptions). Pass its Token to GetAllSitesAsync. Cancel the prior token before spinning a new load to avoid task accumulation.

Resolution (2026-05-28): Added a per-actor lifecycle CancellationTokenSource on CentralCommunicationActor, cancelled+disposed in PostStop. Its Token is now passed into repo.GetAllSitesAsync(ct) so a hung MS SQL query is bounded by actor shutdown rather than holding piped tasks open. The existing 60-second refresh cadence and Status.Failure handler (Comm-006) are unchanged — a deadline-per-load was scoped out as a future enhancement; this fix addresses the immediate "no upper bound on actor stop" concern called out in the finding.

Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types


Severity	Low
Category	Akka.NET conventions
Status	Open
Location	`src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:567`

Description

The Akka.NET convention is that messages crossing actor boundaries (even internal Self-messages over an async task boundary) are immutable. SiteAddressCacheLoaded(Dictionary<string, List<string>> SiteContacts) is a record but its SiteContacts payload is a mutable Dictionary whose values are mutable List<string>. Constructed inside Task.Run and handed off to the actor, the cache could in principle be mutated by either side; in practice nothing does, but the type is a stale-evidence guarantee that CLAUDE.md's "message immutability" rule is being followed only by convention.

Recommendation

Change the record signature to use IReadOnlyDictionary<string, IReadOnlyList<string>> (or ImmutableDictionary / ImmutableArray<string>) and freeze the data before piping. The cost is negligible — the payload is built and consumed once per refresh tick.

Communication-021 — `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try


Severity	Low
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:188-200`

Description

SubscribeInstance performs these statements in order (lines 189-194), all before the try block at line 200:

Interlocked.Increment(ref _actorCounter)
_actorSystem!.ActorOf(Props.Create(typeof(StreamRelayActor), ...))
_streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor)

If step 3 throws (the subscriber is wired but its Subscribe faults — a stale instance name, a temporary index lookup failure, etc.), the exception escapes the method as an unhandled RpcException and leaks the freshly-created relayActor. The finally block at line 211 is unreachable because the throw happens before the try. The actor's _activeStreams entry, the StreamEntry.Cts, and the Channel<SiteStreamEvent> are also leaked.

In normal operation _streamSubscriber.Subscribe does not throw, so the bug is latent — but a misbehaving site runtime (e.g. SiteStreamManager faulted because the actor system is shutting down) would surface it.

Recommendation

Restructure to either (a) wrap the Subscribe call in a try whose catch stops the relay actor and disposes the CTS, or (b) move the actor + subscriber creation inside the existing try block (the finally will then handle cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down past the try { brace.

Resolution (2026-05-28): Took option (a). _streamSubscriber.Subscribe(...) is now wrapped in its own try/catch — on throw, the freshly-created relay actor is stopped via _actorSystem.Stop, the bounded channel is completed via channel.Writer.TryComplete(), and the _activeStreams entry is removed via the ownership-preserving TryRemove(KeyValuePair) overload before the exception is re-thrown to the caller. Added regression test SiteStreamGrpcServerTests.Comm021_SubscribeThrows_StopsRelayActorAndRemovesActiveStreamEntry using an NSubstitute ISiteStreamSubscriber that throws on Subscribe; asserts ActiveStreamCount == 0 and that RemoveSubscriber was NOT called (confirming the catch path, not the finally path).

Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber


Severity	Low
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:493`

Description

TrackMessageForCleanup on SubscribeDebugViewRequest does _debugSubscriptions[sub.CorrelationId] = (envelope.SiteId, Sender) (line 493). The dictionary indexer silently overwrites any prior entry for the same CorrelationId. If two debug sessions ever reuse the same correlation ID (e.g. two Blazor users start a stream at the same moment with a non-GUID id, or a caller bug, or a malicious caller as flagged in the cousin Communication-014), the first subscriber's entry is overwritten and lost — on a later ConnectionStateChanged(false) (per Communication-016 it never actually fires today, but the design intent stands), only the second subscriber would be notified of the disconnect.

DebugStreamService.StartStreamAsync uses Guid.NewGuid().ToString("N") as the session id (DebugStreamService.cs:97), so a real collision is astronomically unlikely in normal operation. But the central side is not defending itself: a CLI consumer or a future caller is implicitly trusted to generate globally-unique ids.

Recommendation

When the slot is already occupied, log a Warning and either reject the new subscription with an error response or evict the prior subscriber via DebugStreamTerminated before installing the new one. Mirrors the SiteStreamGrpcServer defensive behaviour where a duplicate correlation_id cancels the existing stream (line 167).

Resolution (2026-05-28): Closed by Comm-016 — field removed in commit ac96b83. The _debugSubscriptions dictionary, TrackMessageForCleanup helper, and the HandleConnectionStateChanged handler that consumed them were all deleted as part of Comm-016's dead-code purge. There is no longer any caller-supplied correlation-id keyed map to overwrite — the orphan-on-reuse hazard is gone.

60 KiB Raw Blame History

Code Review — Communication

Summary

Re-review 2026-05-17 (commit 39d737e)

Re-review 2026-05-28 (commit 1eb6e97)

Checklist coverage 2026-05-28

Checklist coverage

Findings

Communication-001 — Early stream termination escapes StartStreamAsync's narrow exception handling

Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors

Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream

Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)

Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied

Communication-006 — Site address load failures are silently swallowed, leaving a stale cache

Communication-007 — SiteStreamGrpcClientFactory.Dispose blocks on async work (sync-over-async)

Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely

Communication-009 — _siteClients field is mutable and reassignable; cache update is not atomic on failure

Communication-010 — DebugStreamBridgeActor XML doc incorrectly describes it as a "Persistent actor"

Communication-011 — No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak

Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover

Communication-013 — Site gRPC address changes are never applied; RemoveSiteAsync has no production caller

Communication-014 — Untrusted gRPC correlation_id flows directly into an Akka actor name

Communication-015 — No test exercises the real gRPC client factory across a node flip

Communication-016 — HandleConnectionStateChanged is dead code — the documented disconnect-cleanup workflow never fires

Communication-017 — _inProgressDeployments grows unboundedly — successful deployments are never cleaned up

Communication-018 — Site heartbeats hard-code IsActive: true regardless of node role

Communication-019 — LoadSiteAddressesFromDb does not pass a CancellationToken to the repository

Communication-020 — SiteAddressCacheLoaded carries mutable Dictionary/List types

Communication-021 — SiteStreamGrpcServer.SubscribeInstance leaks the StreamRelayActor if Subscribe throws pre-try

Communication-022 — _debugSubscriptions keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber

60 KiB

Raw Blame History

Re-review 2026-05-17 (commit `39d737e`)

Re-review 2026-05-28 (commit `1eb6e97`)

Communication-007 — `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async)

Communication-009 — `_siteClients` field is mutable and reassignable; cache update is not atomic on failure

Communication-010 — `DebugStreamBridgeActor` XML doc incorrectly describes it as a "Persistent actor"

Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller

Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name

Communication-016 — `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires

Communication-017 — `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up

Communication-018 — Site heartbeats hard-code `IsActive: true` regardless of node role

Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository

Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types

Communication-021 — `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try

Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber