Async cancellation hygiene, fire-and-forget observability, retry/shutdown semantics, and audit-row coverage across 9 modules. Highlights: Cancellation & lifecycle: - AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the captured SyncContext that risked sync-over-async deadlock. - AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS, threaded through drain paths instead of CancellationToken.None. - Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls. - Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM during the bounded-retry window aborts cleanly. Cursor / retry / counter correctness: - AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since` when any row's idempotent insert is still being retried (per-EventId retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical abandon). No more silent abandonment of permanently-failing rows. - ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's SPLIT loop — by class-doc construction the catch could only mask real failures and let the next iteration create permanent partition holes. - HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot per-interval counters before sending, restore via new ISiteHealthCollector.AddIntervalCounters on transport failure so counts aren't silently lost. Fire-and-forget / shutdown waits: - InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks via OnlyOnFaulted continuation (Warning log; response unchanged). - SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep with a bounded SweepShutdownWaitTimeout (10s). Leak / refactor: - Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its own try/catch so a throw doesn't leak the relay actor or _activeStreams entry. - Comm-022: VERIFIED already-closed by Comm-016's dead-code purge. - CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync (auth-failure exit-code contract unified). Defensive / validation: - CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config prints a warning and returns defaults instead of crashing the CLI. - Host-022: ParseLevel emits stderr one-shot warning for unrecognised MinimumLevel instead of silently coercing to Information. - ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the per-call CTS is the sole timeout source (was clipped to 100s by .NET). - Security-020: New SecurityOptionsValidator (IValidateOptions) rejects empty LdapServer/LdapSearchBase with ValidateOnStart. - DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/ DeleteTimedOut audit entries (mirrors DeployFailed pattern). Plus reconciled stale per-module Open-findings counters that had drifted from prior sessions. 20+ new regression tests across 11 test projects; build clean; affected suites all green. README regenerated: 75 open (was 93).
60 KiB
Code Review — Communication
| Field | Value |
|---|---|
| Module | src/ScadaLink.Communication |
| Design doc | docs/requirements/Component-Communication.md |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | 1eb6e97 |
| Open findings | 4 |
Summary
The Communication module is generally well-structured and matches the design doc's
two-transport model (ClusterClient for command/control, gRPC server-streaming for
real-time data). The actors keep mutable state on the actor thread, use PipeTo for
async work, and the gRPC server/client lifecycle is mostly disciplined. However the
review found several High and Medium issues clustered around two themes:
(a) gRPC subscription bookkeeping races — SiteStreamGrpcClient overwrites and
removes subscription entries by correlation ID without disposal or ownership checks,
so reconnect cycles leak CancellationTokenSourcees and can cancel the wrong stream;
and (b) missing supervision strategy on the coordinator actors, contrary to the
CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is otherwise
good. Test coverage is broad for happy paths but has gaps around failover, cache
mutation races, and the snapshot-timeout cleanup path.
Re-review 2026-05-17 (commit 39d737e)
All prior findings (Communication-001..011) are confirmed Resolved in this commit
and the fixes hold up against the source. The re-review walked all 10 checklist
categories again and uncovered a previously-missed defect at the centre of the gRPC
node-failover path: SiteStreamGrpcClientFactory.GetOrCreate caches one client per
site identifier and silently ignores the grpcEndpoint argument on a cache hit. The
DebugStreamBridgeActor reconnect logic flips _useNodeA and passes the other
node's endpoint, but the factory hands back the original NodeA-bound client every
time — so the documented "try the other site node endpoint" failover never actually
moves to NodeB (Communication-012). The same caching defect means a site address
change is never picked up because RemoveSiteAsync has no production caller
(Communication-013). Two Low findings round out the re-review: an untrusted
gRPC-supplied correlation_id flows straight into an Akka actor name
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
Re-review 2026-05-28 (commit 1eb6e97)
All prior findings (Communication-001..015) remain Resolved in this commit. The
re-review walked all 10 checklist categories again on the surface that has not
been re-examined before — the central↔site command/control routing surface
(CentralCommunicationActor, SiteCommunicationActor) rather than the
previously-mined gRPC streaming surface — and uncovered a cluster of defects
around the connection-state-change workflow. The single material finding is
HandleConnectionStateChanged is dead code: no production code path emits
ConnectionStateChanged, so the documented "kill active debug streams for the
disconnected site" + "mark in-progress deployments as failed" workflow never
fires at runtime (Communication-016). The downstream consequence is
_inProgressDeployments grows unboundedly — entries are inserted on every
deployment but only cleaned via that dead path (Communication-017). Three
smaller items round out the re-review: site heartbeats hard-code
IsActive: true regardless of node role (Communication-018), the
60-second-periodic LoadSiteAddressesFromDb task has no CancellationToken so a
hung DB query has no upper bound (Communication-019), the
SiteAddressCacheLoaded internal message carries a mutable
Dictionary/List (Communication-020), SiteStreamGrpcServer.SubscribeInstance
leaks the StreamRelayActor if _streamSubscriber.Subscribe throws between
ActorOf and the try block (Communication-021), and _debugSubscriptions
keyed by caller-supplied CorrelationId could orphan a subscriber on ID reuse
(Communication-022). Seven new findings, all Open: one High, one Medium, five
Low.
Checklist coverage 2026-05-28
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ✓ | HandleConnectionStateChanged and its _inProgressDeployments / _debugSubscriptions cleanup never fire — the connection-state workflow is dead (Communication-016, Communication-017). _debugSubscriptions correlation-ID overwrite risk (Communication-022). |
| 2 | Akka.NET conventions | ✓ | SiteAddressCacheLoaded carries mutable Dictionary<string, List<string>> — violates message-immutability convention (Communication-020). Forward/PipeTo/Sender-capture all clean. |
| 3 | Concurrency & thread safety | ✓ | All mutable state mutated on the actor thread. _subscriptions ConcurrentDictionary use disciplined. No new issues. |
| 4 | Error handling & resilience | ✓ | LoadSiteAddressesFromDb lacks a CancellationToken propagation point (Communication-019). SubscribeInstance leaks the relay actor if Subscribe throws pre-try (Communication-021). |
| 5 | Security | ✓ | Correlation-id validation in place (Communication-014). No new issues. |
| 6 | Performance & resource management | ✓ | _inProgressDeployments grows unboundedly (Communication-017). gRPC client/server lifecycles otherwise clean. |
| 7 | Design-document adherence | ✓ | ConnectionStateChanged handler is dead code — the doc-stated "kill streams on disconnect, fail in-progress deployments" workflow does not actually run (Communication-016). Site heartbeats always report IsActive: true regardless of role (Communication-018). |
| 8 | Code organization & conventions | ✓ | Options pattern correct; mapper placement and proto evolution are additive-only. No new issues. |
| 9 | Testing coverage | ✓ | CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled exercises a code path that no production caller ever drives — gives false confidence (related to Communication-016). |
| 10 | Documentation & comments | ✓ | Detailed XML docs added in this commit. No new issues. |
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ✓ | Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved. |
| 2 | Akka.NET conventions | ✓ | Coordinator Resume strategies now present and verified. No new issues. |
| 3 | Concurrency & thread safety | ✓ | Subscription-map register/remove now ownership-checked. _siteClients readonly. No new issues. |
| 4 | Error handling & resilience | ✓ | Status.Failure handler added; reconnect unsubscribes prior stream. No new issues. |
| 5 | Security | ✓ | Re-review: public gRPC correlation_id flows unvalidated into an Akka actor name (Communication-014). |
| 6 | Performance & resource management | ✓ | Synchronous Dispose paths fixed; CTS leaks resolved. No new issues. |
| 7 | Design-document adherence | ✓ | Re-review: site gRPC address-change disposal not wired — RemoveSiteAsync is dead code (Communication-013). gRPC options now applied. |
| 8 | Code organization & conventions | ✓ | Options pattern correct; public records still declared in actor files (acceptable). No structural issues. |
| 9 | Testing coverage | ✓ | Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015). |
| 10 | Documentation & comments | ✓ | DebugStreamBridgeActor summary corrected. No new issues. |
Findings
Communication-001 — Early stream termination escapes StartStreamAsync's narrow exception handling
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.Communication/DebugStreamService.cs:130-143 |
Re-triaged 2026-05-16: originally filed Critical, claiming an orphaned bridge actor
and a multi-minute site-side resource leak on every snapshot timeout. On verification
that impact does not occur: DebugStreamBridgeActor calls CleanupGrpc() and
Context.Stop(Self) on every path that invokes onTerminated (site disconnect, gRPC
max-retries, ReceiveTimeout), so it always self-terminates and releases its gRPC
subscription; and the pure-timeout path does reach StopStream, which also stops it.
The genuine defect described below is an error-handling gap, not a leak — severity
corrected to Medium.
Description
StartStreamAsync awaits the initial snapshot inside a try whose only handler is
catch (OperationCanceledException). When the stream terminates before the snapshot
arrives, onTerminatedWrapper completes the await via
snapshotTcs.TrySetException(new InvalidOperationException(...)). That
InvalidOperationException is not an OperationCanceledException, so it escapes the
catch entirely: the caller (Blazor debug view / SignalR hub) receives a raw,
untranslated exception, and StartStreamAsync performs no teardown of its own on that
path — it relies implicitly on the bridge actor self-terminating. Cleanup from the
service side is therefore not deterministic, and the failure surfaced to the caller is
not a meaningful, documented result.
Recommendation
In StartStreamAsync, catch any exception from the snapshot await, deterministically
tear down the bridge actor (Tell(StopDebugStream) via the local actor reference, since
a racing onTerminatedWrapper may already have removed the session entry), and translate
the failure into a meaningful exception for the caller.
Resolution
Resolved 2026-05-16. The catch (OperationCanceledException)-only block in
StartStreamAsync was replaced with catch (Exception): it removes the session entry,
sends StopDebugStream to the bridge actor via the local reference (idempotent — the
actor may already be stopping itself), and throws a descriptive exception —
TimeoutException for the 30s timeout, otherwise an InvalidOperationException that
names the instance/site and wraps the underlying cause. Regression test
DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException
fails against the pre-fix code and passes after. Fixed by the commit whose message
references Communication-001.
Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:170, src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:143 |
Description
On a gRPC stream error, HandleGrpcError increments the retry count, flips
_useNodeA, and schedules OpenGrpcStream. OpenGrpcStream cancels and disposes
_grpcCts and starts a fresh SubscribeInstance call — but it never calls
client.Unsubscribe(_correlationId) on the old node's client, and the site-side
SiteStreamGrpcServer keys active streams by correlation_id only. Because the new
subscription goes to the other node (_useNodeA flipped), the old node's
SiteStreamGrpcServer still has an active stream + StreamRelayActor +
SiteStreamManager subscription for that correlation ID. The old node only learns the
client is gone via TCP RST or keepalive — exactly the failure mode that triggered the
reconnect (network partition / silent node), so detection may take ~25s or never. Each
reconnect can therefore leave a zombie relay actor on the failed node. CleanupGrpc
(which does call Unsubscribe) is only invoked on terminal paths, not between
reconnect attempts.
Recommendation
Before reconnecting in HandleGrpcError / at the top of OpenGrpcStream, call
Unsubscribe(_correlationId) on the client for the previous endpoint (the one that
just failed) so the local CTS is cancelled and — where the channel is still alive —
the gRPC cancellation reaches the site and stops the relay actor.
Resolution
Resolved 2026-05-16 (commit <pending>). Root cause confirmed against source:
HandleGrpcError flipped _useNodeA and scheduled OpenGrpcStream without ever
unsubscribing the failed stream, leaving the old node's StreamRelayActor zombie until
TCP/keepalive timeout. Fix: HandleGrpcError now resolves the client for the
previous endpoint (before flipping _useNodeA) and calls Unsubscribe(_correlationId)
on it, so the local CTS is cancelled and gRPC cancellation reaches the still-alive site.
Regression test DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect
fails against the pre-fix code and passes after.
Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:77, src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:106 |
Description
SubscribeAsync does _subscriptions[correlationId] = cts; (line 77),
unconditionally overwriting any existing entry for that correlation ID without
cancelling or disposing the previous CancellationTokenSource. The finally block
then does _subscriptions.TryRemove(correlationId, out _) (line 106) which removes
the entry by key only, regardless of which CTS is stored. Because
DebugStreamBridgeActor reuses the same _correlationId across reconnect attempts
(and SiteStreamGrpcClientFactory returns the same SiteStreamGrpcClient for a site
even after a node flip), two SubscribeAsync calls can briefly share a correlation
ID. The first call's finally then removes the second call's CTS entry, so a later
Unsubscribe(correlationId) finds nothing and the live stream is never cancelled — an
orphan. Conversely the overwritten CTS is leaked (never disposed).
Recommendation
When inserting, cancel+dispose any prior CTS for that correlation ID. In the finally,
remove only if the stored CTS is the one this call created (use the
TryRemove(KeyValuePair) overload, mirroring what SiteStreamGrpcServer already does
with StreamEntry). Consider keying subscriptions by a per-call GUID rather than the
caller-supplied correlation ID.
Resolution
Resolved 2026-05-16 (commit <pending>). Root cause confirmed against source: the
inline _subscriptions[correlationId] = cts overwrote a prior CTS without
cancel/dispose (leak), and the finally's TryRemove(correlationId, out _) removed by
key only — a racing reconnect's live CTS could be removed by the prior call's finally,
orphaning the live stream. Fix: extracted two internal helpers used by SubscribeAsync
— RegisterSubscription cancels+disposes any existing CTS for the correlation ID before
inserting, and RemoveSubscription uses the ConcurrentDictionary.TryRemove(KeyValuePair)
overload so it removes only the CTS that call created (mirroring SiteStreamGrpcServer's
StreamEntry pattern). Regression tests
SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts
and SiteStreamGrpcClientTests.RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement
fail against the pre-fix logic and pass after.
Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:42, src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:22 |
Description
CLAUDE.md ("Explicit supervision strategies: Resume for coordinator actors, Stop for
short-lived execution actors") requires coordinator actors to use an explicit Resume
supervision strategy. CentralCommunicationActor and SiteCommunicationActor are
long-lived coordinators (they own the per-site ClusterClient map, debug
subscriptions, in-progress deployments) but neither overrides SupervisorStrategy.
They fall back to the Akka default (OneForOneStrategy with Restart). A child fault
— e.g. a ClusterClient child of CentralCommunicationActor created by
DefaultSiteClientFactory — would Restart under the default strategy, and any
exception in the coordinator itself would restart it, wiping _siteClients,
_debugSubscriptions, and _inProgressDeployments silently. The design intent is
Resume so transient child faults do not discard coordinator state.
Recommendation
Override SupervisorStrategy on both actors to return an explicit
OneForOneStrategy with Directive.Resume (or the project's standard coordinator
strategy), matching the documented decision and other coordinator actors.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: neither
CentralCommunicationActor nor SiteCommunicationActor overrode SupervisorStrategy,
so child faults fell back to the Akka default (Restart). Note that an actor's own
SupervisorStrategy governs its children — a transient child fault would Restart
the child and discard its in-memory state, contrary to the CLAUDE.md "Resume for
coordinator actors" decision. Fix: both actors now override SupervisorStrategy() to
return a OneForOneStrategy with an unbounded Decider resolving to Directive.Resume
(mirroring DataConnectionManagerActor). Regression tests
CoordinatorSupervisionTests.CentralCommunicationActor_SupervisorStrategy_IsResume and
CoordinatorSupervisionTests.SiteCommunicationActor_SupervisorStrategy_IsResume fail
against the pre-fix code (decider yields Restart) and pass after.
Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:25, src/ScadaLink.Communication/CommunicationOptions.cs:36 |
Description
CommunicationOptions exposes GrpcKeepAlivePingDelay, GrpcKeepAlivePingTimeout,
GrpcMaxStreamLifetime, and GrpcMaxConcurrentStreams, and the design doc's
"gRPC Connection Keepalive" section explicitly states these are configurable. However
SiteStreamGrpcClient's constructor hard-codes KeepAlivePingDelay = TimeSpan.FromSeconds(15) and KeepAlivePingTimeout = TimeSpan.FromSeconds(10)
instead of reading the options. GrpcMaxStreamLifetime (the documented "Session
timeout — 4 hours" third layer of dead-client detection) is not referenced anywhere
— SiteStreamGrpcServer.SubscribeInstance creates a linked CTS from the call
cancellation token only, with no CancelAfter. The 4-hour zombie-stream safety net
described in the design doc does not exist in code. GrpcMaxConcurrentStreams is also
not wired to the server (SiteStreamGrpcServer takes a maxConcurrentStreams
constructor parameter defaulting to 100, but nothing binds the option to it).
Recommendation
Flow CommunicationOptions into SiteStreamGrpcClient and SiteStreamGrpcServer
(via the factory / DI). Apply GrpcKeepAlivePingDelay / GrpcKeepAlivePingTimeout to
the SocketsHttpHandler, bind GrpcMaxConcurrentStreams to the server's limit, and
implement the GrpcMaxStreamLifetime session timeout with CancelAfter on the
server-side stream CTS — or, if the 4-hour cap is intentionally dropped, remove the
option and update the design doc.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: SiteStreamGrpcClient
hard-coded the keepalive values, GrpcMaxStreamLifetime was referenced nowhere, and
GrpcMaxConcurrentStreams was never bound to the server. Fix (scoped to
src/ScadaLink.Communication): SiteStreamGrpcClient gained a constructor taking
CommunicationOptions and now applies GrpcKeepAlivePingDelay/GrpcKeepAlivePingTimeout
to its SocketsHttpHandler; SiteStreamGrpcClientFactory gained an
IOptions<CommunicationOptions> DI constructor and flows the options into every client
it creates; SiteStreamGrpcServer gained an IOptions<CommunicationOptions> DI
constructor that binds GrpcMaxConcurrentStreams and implements the documented 4-hour
session timeout via CancellationTokenSource.CancelAfter(GrpcMaxStreamLifetime) on the
per-stream CTS. The Host's existing AddSingleton<SiteStreamGrpcServer>() registration
resolves the new DI constructor via greedy resolution — no Host change required.
Regression tests GrpcOptionsWiringTests.SiteStreamGrpcClient_AppliesKeepAliveFromOptions,
GrpcOptionsWiringTests.SiteStreamGrpcClientFactory_FlowsOptionsToCreatedClients, and
GrpcOptionsWiringTests.SiteStreamGrpcServer_BindsMaxConcurrentStreamsAndLifetimeFromOptions
exercise the wiring (they require the new members to even compile).
Communication-006 — Site address load failures are silently swallowed, leaving a stale cache
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:204 |
Description
LoadSiteAddressesFromDb runs the repository query inside Task.Run(...).PipeTo(self).
If GetAllSitesAsync throws (database unavailable, transient connection error), the
faulted task is piped to Self as a Status.Failure. CentralCommunicationActor has
no Receive<Status.Failure> handler, so the failure becomes an unhandled message
(logged at debug, not surfaced) and the periodic refresh silently fails. If the
first startup load fails the actor runs with an empty _siteClients map — every
SiteEnvelope is dropped (line 187) and every Ask times out with no indication of the
root cause.
Recommendation
Add a Receive<Status.Failure> handler that logs the load failure at Warning/Error
level so operators can distinguish "site has no addresses configured" from "database
is down". Optionally surface a health metric for repeated load failures.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: a faulted
LoadSiteAddressesFromDb task is piped to Self as a Status.Failure, but the actor
had no handler for it — the failure became an unhandled message (debug-level only) and
the periodic refresh failed silently. Fix: added a Receive<Status.Failure> handler
that logs the load failure at Warning with the underlying exception as the cause, so
operators can distinguish a missing-addresses configuration from a database outage.
Regression test
CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed
(repository query throws) asserts the Warning is emitted — it produces no warning
against the pre-fix code and passes after.
Communication-007 — SiteStreamGrpcClientFactory.Dispose blocks on async work (sync-over-async)
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53 |
Description
Dispose() calls DisposeAsync().AsTask().GetAwaiter().GetResult(). This is the
classic sync-over-async pattern: it blocks the calling thread until all per-site
SiteStreamGrpcClient.DisposeAsync calls complete. If Dispose is invoked from a
context with a single-threaded synchronization context or from DI container shutdown
on a constrained thread pool, this can deadlock or stall host shutdown. The class
already implements IAsyncDisposable.
Recommendation
Prefer registering and disposing the factory through IAsyncDisposable only (modern
.NET DI honours it for singletons). If a synchronous Dispose must remain, dispose
the underlying GrpcChannels directly (synchronous) rather than blocking on the async
path, or document why blocking is safe here.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: Dispose() called
DisposeAsync().AsTask().GetAwaiter().GetResult(), the classic sync-over-async pattern.
Fix: SiteStreamGrpcClient now also implements IDisposable with a synchronous
Dispose() that releases its CancellationTokenSources and underlying GrpcChannel
directly (all of that teardown is inherently synchronous); SiteStreamGrpcClientFactory.Dispose()
now disposes each cached client via that synchronous path with no blocking on the async
path. A CreateClient seam was extracted so the test can substitute a tracking client
while still exercising the factory's real caching/disposal machinery. Regression test
SiteStreamGrpcClientFactoryDisposeTests.Dispose_DisposesClientsSynchronously_NotViaAsyncPath
fails against the pre-fix code (clients disposed via DisposeAsync) and passes after;
Dispose_DoesNotDeadlock_UnderSingleThreadedSynchronizationContext guards the stall path.
Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:71, src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:174 |
Description
_retryCount is reset to 0 every time a single AttributeValueChanged or
AlarmStateChanged event is received (lines 72, 77). Combined with MaxRetries = 3,
a stream that connects, delivers exactly one event, then fails — repeatedly — will
reconnect forever. The design doc states "max 3 retries, terminate the session if all
retries fail"; the current logic only terminates after 3 consecutive failures with
zero intervening events, so a flapping site never trips the limit and the debug
session (and its site-side relay) lives on indefinitely. The ReceiveTimeout orphan
net is also reset by every received message, so it does not bound this case either.
Recommendation
Either reset _retryCount only after the stream has been stably connected for some
minimum duration (e.g. a timer armed on stream open, cancelled on the next error), or
keep a separate cumulative reconnect counter / time window that bounds total
reconnects regardless of intervening events.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: _retryCount was reset to
0 on every received AttributeValueChanged/AlarmStateChanged, so a stream that
connected, delivered one event, then failed — repeatedly — never tripped MaxRetries.
Fix (recommendation option a): the per-event reset was removed; instead OpenGrpcStream
arms a single StabilityWindow timer (60s default, internal-settable for tests), and
only when it fires (GrpcStreamStable) — i.e. the stream stayed up long enough to be
considered recovered — is _retryCount reset. HandleGrpcError cancels that timer, so
a stream that fails before the window elapses does not recover its retry budget. A
flapping stream therefore terminates after MaxRetries regardless of intervening
events. Regression test
DebugStreamBridgeActorTests.FlappingStream_DeliveringEventsBetweenFailures_StillTerminatesAfterMaxRetries
fails against the pre-fix code (actor never terminates) and passes after;
RetryCount_RecoveredOnlyAfterStreamStaysStableForStabilityWindow verifies the budget
is recovered after a stable interval. The pre-existing test that codified the buggy
per-event reset (Grpc_Error_Resets_RetryCount_On_Successful_Event) was replaced.
Communication-009 — _siteClients field is mutable and reassignable; cache update is not atomic on failure
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:53, src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:240 |
Description
_siteClients is a non-readonly Dictionary field. It is only mutated on the actor
thread (correct), but the field is needlessly reassignable, and
HandleSiteAddressCacheLoaded mutates it in place across several loops. If
ActorPath.Parse throws on a malformed address mid-loop (e.g. a site row with a
garbage NodeAAddress), the method aborts partway through, having already stopped
some ClusterClients and added others — leaving the cache partially updated with no
recovery until the next 60s refresh. The other actor mutable collections
(_debugSubscriptions, _inProgressDeployments) are correctly readonly.
Recommendation
Mark _siteClients readonly. Validate/parse all addresses up front (or wrap
ActorPath.Parse in a try/catch that logs and skips the bad site) so a single
malformed site record cannot abort the whole refresh and leave a half-updated cache.
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: _siteClients was a
non-readonly field, and HandleSiteAddressCacheLoaded's add/update loop called
ActorPath.Parse per site with no guard — a malformed NodeAAddress threw mid-loop and
aborted the refresh, leaving the cache half-updated until the next 60s cycle. Fix:
_siteClients is now readonly, and the per-site ActorPath.Parse is wrapped in a
try/catch that logs the bad address at Warning and continues to the next site, so a
single garbage row cannot starve other sites of their ClusterClient. Regression test
CentralCommunicationActorTests.MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered
(bad site ordered before a good one) fails against the pre-fix code (good site never
registered) and passes after.
Communication-010 — DebugStreamBridgeActor XML doc incorrectly describes it as a "Persistent actor"
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:10 |
Description
The class summary opens with "Persistent actor (one per active debug session)...".
The actor derives from ReceiveActor, not a persistent actor base class, holds no
PersistenceId, and writes no journal/snapshot. "Persistent" is misleading — debug
sessions are explicitly "session-based and temporary" per the design doc. A reader
could assume state survives restart, which it does not.
Recommendation
Reword the summary to "Long-lived (per active debug session) actor on the central side..." or similar, removing the word "Persistent".
Resolution
Resolved 2026-05-16 (commit pending). Root cause confirmed: the class summary opened
with "Persistent actor (one per active debug session)..." but the actor derives from
ReceiveActor, holds no PersistenceId, and writes no journal/snapshot. Fix
(documentation only — no behaviour change, so no regression test): the summary was
reworded to "Long-lived (one per active debug session) actor on the central side. Debug
sessions are session-based and temporary — this actor holds no persisted state and does
not derive from an Akka.Persistence base class; its state does not survive a restart."
Communication-011 — No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ScadaLink.Communication.Tests/ (module-wide) |
Description
The test suite covers happy-path routing, handler-not-registered failures, heartbeat bumping, cache refresh, and gRPC bridge reconnect/retry. However several critical paths identified in this review have no coverage:
- The
DebugStreamService.StartStreamAsyncsnapshot-timeout path (Communication-001) — no test verifies bridge actor / site subscription teardown on timeout, nor theonTerminated-before-snapshot race that throws a non-OperationCanceledException. CentralCommunicationActorbehaviour whenLoadSiteAddressesFromDbfaults (Communication-006) —RefreshSiteAddresses_UpdatesCacheonly exercises success.SiteStreamGrpcClientsubscription-map overwrite/removal race (Communication-003) and gRPC reconnect not unsubscribing the old node (Communication-002).- A malformed
NodeAAddressabortingHandleSiteAddressCacheLoaded(Communication-009).
Recommendation
Add tests for: snapshot timeout / pre-snapshot termination cleanup; address-load
failure logging and empty-cache behaviour; reusing a correlation ID across
SubscribeAsync calls; and a malformed site address during cache refresh.
Resolution
Resolved 2026-05-16 (commit pending). This is a meta-coverage finding; every gap it enumerates is now covered by a regression test (each fails against its pre-fix code and passes after):
- Snapshot timeout / pre-snapshot termination (Communication-001) —
DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException. - gRPC reconnect not unsubscribing the old node (Communication-002) —
DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect. SiteStreamGrpcClientsubscription-map overwrite/removal race (Communication-003) —SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCtsandRemoveSubscription_OnlyRemovesOwnCts_NotAReplacement.LoadSiteAddressesFromDbfault (Communication-006) —CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed.- Malformed
NodeAAddressabortingHandleSiteAddressCacheLoaded(Communication-009) —CentralCommunicationActorTests.MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered(added with this finding's resolution). The full module suite (dotnet test tests/ScadaLink.Communication.Tests) is green at 111 passing tests.
Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39, src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166 |
Description
SiteStreamGrpcClientFactory.GetOrCreate is _clients.GetOrAdd(siteIdentifier, _ => CreateClient(grpcEndpoint)) — it keys the cache by site identifier only and the
grpcEndpoint argument is used exclusively for the first-ever creation. Every
subsequent call for that site returns the originally-cached SiteStreamGrpcClient,
which is permanently bound to the GrpcChannel of whatever endpoint was passed first.
DebugStreamBridgeActor relies on the opposite behaviour. On a gRPC stream error,
HandleGrpcError flips _useNodeA and OpenGrpcStream recomputes
endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress, then calls
_grpcFactory.GetOrCreate(_siteIdentifier, endpoint) expecting a client connected to
the other node. Because the factory ignores the new endpoint, the bridge actor
reconnects to the same failed NodeA endpoint on every retry. The design doc's
core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if
NodeA failed, or vice versa") is therefore inoperative — when a site node goes down,
the debug stream cannot move to the surviving node and simply exhausts MaxRetries
against the dead endpoint and terminates. The _useNodeA flip, the previousEndpoint
computation in HandleGrpcError, and the CleanupGrpc endpoint selection are all
dead logic. (Communication-002's Unsubscribe-before-reconnect fix still functions,
but it unsubscribes and re-subscribes on the same client/node rather than the
intended other node.)
Recommendation
Make the per-site client aware of both endpoints, or key the cache by
(siteIdentifier, endpoint), or have GetOrCreate detect an endpoint change and
dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if
NodeA connection fails"), the cleanest fix is to give SiteStreamGrpcClient (or a
per-site holder) both NodeA/NodeB addresses and let it switch channels internally,
removing the endpoint argument from GetOrCreate entirely. Add a test that drives a
real SiteStreamGrpcClientFactory through a node flip and asserts the second client
targets the other endpoint.
Resolution
Resolved 2026-05-17 (commit pending). Root cause confirmed against source:
GetOrCreate was _clients.GetOrAdd(siteIdentifier, …) — keyed by site identifier
only, so the grpcEndpoint argument was honoured solely on first creation and the
NodeA→NodeB flip reconnected to the dead endpoint forever. Fix:
SiteStreamGrpcClient now exposes its bound Endpoint, and GetOrCreate compares
the cached client's endpoint against the requested one — on a mismatch it atomically
installs (via ConcurrentDictionary.AddOrUpdate) a fresh client for the new endpoint
and disposes the stale one, so a node flip actually moves to the surviving node.
Regression tests SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint
and GetOrCreate_SameEndpoint_DoesNotDisposeOrRecreate fail against the pre-fix
factory and pass after.
Communication-013 — Site gRPC address changes are never applied; RemoveSiteAsync has no production caller
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58 |
Description
The design doc states that SiteStreamGrpcClientFactory "Disposes clients on site
removal or address change." RemoveSiteAsync implements the disposal mechanism, but
a repo-wide search finds no production caller — only tests invoke it. Combined
with the cache-by-site-identifier behaviour (Communication-012), the consequence is
that once a site's SiteStreamGrpcClient is created, a later edit to that site's
GrpcNodeAAddress / GrpcNodeBAddress (via the Central UI or CLI) is never reflected
in the cached client — it keeps using the stale channel for the life of the process.
CentralCommunicationActor already refreshes the Akka address cache every 60s and
recreates ClusterClients on change, but there is no equivalent invalidation path
wired into the gRPC client factory. A site whose gRPC endpoints are corrected after
an initial misconfiguration will never have working debug streaming until the central
node is restarted.
Recommendation
Wire a site-removal / address-change signal into SiteStreamGrpcClientFactory —
e.g. have CentralCommunicationActor (which already detects address changes in
HandleSiteAddressCacheLoaded) call RemoveSiteAsync for sites whose gRPC addresses
changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If
the on-the-fly address-change requirement is intentionally dropped, remove
RemoveSiteAsync and correct the design doc.
Resolution
Resolved 2026-05-17 (commit pending). The address-change staleness — the primary
impact ("never have working debug streaming until the central node is restarted") —
is fixed in-module by the Communication-012 change: GetOrCreate is now
endpoint-change-aware, so the next time DebugStreamBridgeActor requests a stream
with a corrected GrpcNodeAAddress/GrpcNodeBAddress the stale cached client is
disposed and replaced — no central restart needed and no external wiring required.
RemoveSiteAsync is retained as the disposal path for full site removal (a deleted
site record) and its doc comment now states that role explicitly; wiring a
delete-site callback belongs to the site-management flow in another module and is out
of this module's scope. Regression test
SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_DisposesPriorClient
fails against the pre-fix factory (stale client never disposed/replaced) and passes
after.
Communication-014 — Untrusted gRPC correlation_id flows directly into an Akka actor name
| Severity | Low |
| Category | Security |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124 |
Description
SubscribeInstance is a public gRPC endpoint hosted on each site node. It creates the
relay actor with $"stream-relay-{request.CorrelationId}-{actorSeq}" as the actor
name, where request.CorrelationId comes straight off the wire. Akka actor names have
a restricted character set; a correlation_id containing /, whitespace, or other
disallowed characters makes ActorSystem.ActorOf throw InvalidActorNameException.
That exception is not caught inside SubscribeInstance, so it escapes as an unhandled
RPC fault (and after the _streamSubscriber.Subscribe / _activeStreams entry has
already been set up for the duration, though the finally does not run because the
throw is before the try). In practice central always supplies a GUID, so impact is
low, but the server is trusting client-supplied input to be actor-name-safe.
Recommendation
Validate request.CorrelationId on entry (non-empty, matches an expected GUID/safe
pattern) and reject with StatusCode.InvalidArgument otherwise; or derive the actor
name solely from the internal _actorCounter and keep the correlation ID only as
actor state / dictionary key.
Resolution
Resolved 2026-05-17 (commit pending). Root cause confirmed: SubscribeInstance fed
the off-the-wire request.CorrelationId straight into the stream-relay-… actor
name, so an id with /, whitespace, or other disallowed characters made ActorOf
throw InvalidActorNameException as an unhandled RPC fault. Fix: SubscribeInstance
now validates CorrelationId on entry — rejecting null/empty or any value failing
ActorPath.IsValidPathElement with StatusCode.InvalidArgument before any actor or
subscription state is created. Regression test
SiteStreamGrpcServerTests.RejectsCorrelationIdThatIsNotActorNameSafe (theory:
/-bearing, whitespace, empty, $-prefixed ids) fails against the pre-fix server
and passes after; AcceptsActorNameSafeCorrelationId confirms a normal GUID is still
accepted.
Communication-015 — No test exercises the real gRPC client factory across a node flip
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401, tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs |
Description
DebugStreamBridgeActorTests exercises the reconnect/failover paths through
MockSiteStreamGrpcClientFactory, which returns one fixed mock client regardless of
the grpcEndpoint argument. This is exactly the behaviour the real
SiteStreamGrpcClientFactory exhibits incorrectly (Communication-012), so the mock
masks the defect: On_GrpcError_Reconnects_To_Other_Node passes even though the real
factory never reaches the other node. SiteStreamGrpcClientFactoryTests only asserts
GetOrCreate returns the same client for the same site — it never checks what happens
when the same site is requested with a different endpoint.
Recommendation
Add a SiteStreamGrpcClientFactoryTests case that calls GetOrCreate(site, endpointA)
then GetOrCreate(site, endpointB) and asserts the second call targets endpointB
(it should fail today and pass after Communication-012 is fixed). Have the bridge-actor
test's mock factory track the endpoint per call so node-flip coverage is meaningful.
Resolution
Resolved 2026-05-17 (commit pending). SiteStreamGrpcClientFactoryTests gained
GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint and
GetOrCreate_EndpointChanged_DisposesPriorClient, which drive the real
SiteStreamGrpcClientFactory across a node flip and assert the second call yields a
client bound to the new endpoint with the stale one disposed — both fail against the
pre-fix factory and pass after (the Communication-012 fix). DebugStreamBridgeActorTests
gained On_GrpcError_Reconnects_To_Other_Node_Endpoint, which uses a new
EndpointTrackingGrpcClientFactory test double that hands out a distinct mock client
per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's
NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather
than being masked by an endpoint-agnostic mock.
Communication-016 — HandleConnectionStateChanged is dead code — the documented disconnect-cleanup workflow never fires
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:169, src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:338-375 |
Resolution — deleted the dead code path in favour of the keepalive-based
detection that is the actual production behaviour: removed the
Receive<ConnectionStateChanged> handler, the HandleConnectionStateChanged
method, the _debugSubscriptions / _inProgressDeployments tracking dicts
- the
TrackMessageForCleanuphelper that fed them, and the dead message recordsrc/ScadaLink.Commons/Messages/Communication/ConnectionStateChanged.cs. The two dead tests (ConnectionLost_DebugStreamsKilledin CentralCommunicationActorTests,RoundTrip_ConnectionStateChanged_Succeedsin CompatibilityTests) were removed alongside. The design docdocs/requirements/Component-Communication.md"Connection Failure Behavior" section was updated to state explicitly that disconnect is detected at the transport layer (gRPC keepalive PING ~25 s for debug streams; Ask-timeout at the CommunicationService layer for command/control), with no application-level signal.DebugStreamTerminatedsurvives becauseDebugStreamBridgeActoruses it for an unrelated intra-actor stop signal.
Description
CentralCommunicationActor.HandleConnectionStateChanged is wired to
Receive<ConnectionStateChanged> and implements two important workflows on
IsConnected == false: (1) kill every active debug stream for the disconnected
site (_debugSubscriptions walk → DebugStreamTerminated Tell to each
subscriber); (2) mark every in-progress deployment for that site as failed
(_inProgressDeployments walk → entry removal). Both are documented in the
component design doc's "Connection Failure Behavior" section and in WP-5 of the
work plan referenced in the class's own XML doc comment.
A repo-wide search (grep -rn ConnectionStateChanged src/ tests/) shows no
production code ever emits ConnectionStateChanged. The only producers are
the unit test CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled
(line 137) and the Commons message-roundtrip test. The
CentralCommunicationActor therefore never receives one in production, the
disconnect-cleanup workflow never fires, and _debugSubscriptions /
_inProgressDeployments are never pruned via this path.
Concrete consequences:
- A site goes down → its active debug streams do not get a synchronous
DebugStreamTerminatednotification from central. The bridge actor must detect the disconnect itself via gRPC keepalive timing out (~25s) or TCP RST. Subscribers wait that long for theOnStreamTerminatedcallback instead of the documented "immediately killed by central" behaviour. - In-progress deployments to a disconnected site continue to occupy the
Ask-reply window and only fail when the Ask times out at the
CommunicationService.DeployInstanceAsynclayer (120s). They are never proactively marked failed. - The unit test gives a strong false impression that the workflow works — it exercises a code path that has no production caller.
The design doc and CLAUDE.md mention "ClusterClient handles failover between
NodeA and NodeB internally — there is no application-level NodeA preference /
NodeB fallback logic" — so the ClusterClient mechanism is the documented
failover transport. But that says nothing about signalling a fully-down
remote cluster to central's coordinator actor, which is exactly what
ConnectionStateChanged was meant to do.
Recommendation
Pick one of:
- Wire a producer for
ConnectionStateChanged— e.g. subscribe toClusterClient's contact-point/cluster events (ClusterClient.ContactPointsRefresh /ContactPointAdded/ContactPointRemoved) or watch the ClusterClient actor for a "no contact points reachable" state — and have it publishConnectionStateChangedtoSelfon each transition. - If the documented "synchronously kill streams on disconnect" behaviour is
intentionally being dropped in favour of the slower keepalive-based
detection, delete the handler, the
ConnectionStateChangedrecord, and the related_debugSubscriptions/_inProgressDeploymentstracking, then update the design doc's "Connection Failure Behavior" section accordingly.
Either way, replace CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled
— at present it asserts a behaviour that no production code triggers.
Communication-017 — _inProgressDeployments grows unboundedly — successful deployments are never cleaned up
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:73, src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:501, src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:357-367 |
Description
TrackMessageForCleanup inserts _inProgressDeployments[deploy.DeploymentId] = envelope.SiteId on every DeployInstanceCommand routed to a site (line 501).
The only places that remove from _inProgressDeployments are:
HandleConnectionStateChangedonIsConnected == false(line 366) — which per Communication-016 never fires in production.PostStop(line 553) — only on actor death (central failover).
There is no removal on the normal happy path — neither when the site replies
DeploymentStatusResponse (the reply goes to the Ask's temporary reply actor,
not back through CentralCommunicationActor), nor on Ask timeout. Every
successful or failed deployment leaves its entry behind for the lifetime of the
process.
Memory impact is modest (each entry is ~70-100 bytes), but the dictionary grows
monotonically. Over months of operation across all sites a central node could
accumulate tens of thousands of entries — a real, observable leak. More
seriously, the field is also the source-of-truth set the
HandleConnectionStateChanged walk uses to fail in-progress deployments, so
even if a ConnectionStateChanged were fired today, the walk would
"fail" thousands of already-completed deployments and Tell their (now stale)
correlation-IDs into the void.
_debugSubscriptions (line 67) shares the same shape — but a normal debug
session ends with an UnsubscribeDebugViewRequest that does drive cleanup
(line 497), so leaks are only realised when a consumer crashes without
unsubscribing.
Recommendation
Either remove _inProgressDeployments entirely (it has no other consumer once
Communication-016 is fixed by deletion) or, if the disconnect-cleanup workflow
is retained, add a removal hook on the reply path. The simplest fix is to
subscribe CentralCommunicationActor to the Ask reply: route
DeployInstanceCommand through the actor with the actor as the Ask sender,
forward the reply to the original caller, and _inProgressDeployments.Remove
in the same handler. (Today the Ask is taken on the actor itself by the
caller, so the reply skips the coordinator.)
Communication-018 — Site heartbeats hard-code IsActive: true regardless of node role
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:376-465 |
Description
SiteCommunicationActor.SendHeartbeatToCentral builds
new HeartbeatMessage(_siteId, hostname, IsActive: true, DateTimeOffset.UtcNow)
on every periodic tick (line 366), with no inspection of whether this node is
actually the active site node or a standby. The HeartbeatMessage.IsActive
field thus carries the literal value true on every heartbeat from every
node, and the field is effectively dead — central's HandleHeartbeat doesn't
consume it either (line 297 only passes SiteId and Timestamp to
MarkHeartbeat).
Per CLAUDE.md's Cluster & Failover section the active/standby distinction is real ("Both nodes are seed nodes", "keep-oldest split-brain resolver", "automatic dual-node recovery"), so a heartbeat that could carry node-role information would be useful for the central health dashboard distinguishing "active node down, standby up" from "site fully offline". As shipped, the field is contract noise and a future implementer might mistakenly assume it already carries meaningful state.
Recommendation
Either (a) resolve the current cluster role at heartbeat-send time and pass it
through — e.g. Cluster.Get(Context.System).SelfRoles.Contains("active") or
the project's existing role mechanism — and have the central aggregator
consume IsActive; or (b) drop the IsActive field from HeartbeatMessage
(additive-only-evolution: deprecate the field, default to true, plan
removal in a major message contract revision).
Resolution (2026-05-28): Took option (a). SiteCommunicationActor now
accepts an optional Func<bool>? isActiveCheck (default = real Cluster.Get
leader check mirroring ActiveNodeGate / ActiveNodeHealthCheck) and
SendHeartbeatToCentral stamps HeartbeatMessage.IsActive with the result.
A try/catch keeps the heartbeat tick alive when the cluster state is
unreadable (warm-up / TestKit without cluster plugin) — falls back to
IsActive: false, the safe non-claiming value. Added parameterised test
Heartbeat_StampsIsActive_FromInjectedCheck. Tests green (199/199 in
Communication.Tests).
Communication-019 — LoadSiteAddressesFromDb does not pass a CancellationToken to the repository
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:397-431 |
Description
LoadSiteAddressesFromDb runs await repo.GetAllSitesAsync() inside
Task.Run(async () => ...).PipeTo(self) with no cancellation token (line 404).
The repository signature accepts CancellationToken (the test mock declares
GetAllSitesAsync(Arg.Any<CancellationToken>())), but the actor calls the
no-arg overload — so a hung MS SQL connection has no upper bound. The
60-second-periodic refresh keeps firing; each tick spawns a fresh Task.Run
that piles up if the database is consistently slow. The actor itself is
unaffected (it's not blocked), but pending tasks and DB connection-pool
resources accumulate, and the Status.Failure handler (Communication-006)
never fires because the task never faults — it just sits.
Recommendation
Maintain a per-load CancellationTokenSource with a deadline (e.g. the same
60s the refresh runs on, or a configurable timeout in CommunicationOptions).
Pass its Token to GetAllSitesAsync. Cancel the prior token before spinning
a new load to avoid task accumulation.
Resolution (2026-05-28): Added a per-actor lifecycle CancellationTokenSource
on CentralCommunicationActor, cancelled+disposed in PostStop. Its Token
is now passed into repo.GetAllSitesAsync(ct) so a hung MS SQL query is
bounded by actor shutdown rather than holding piped tasks open. The existing
60-second refresh cadence and Status.Failure handler (Comm-006) are unchanged
— a deadline-per-load was scoped out as a future enhancement; this fix
addresses the immediate "no upper bound on actor stop" concern called out in
the finding.
Communication-020 — SiteAddressCacheLoaded carries mutable Dictionary/List types
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:567 |
Description
The Akka.NET convention is that messages crossing actor boundaries (even
internal Self-messages over an async task boundary) are immutable.
SiteAddressCacheLoaded(Dictionary<string, List<string>> SiteContacts) is a
record but its SiteContacts payload is a mutable Dictionary whose values
are mutable List<string>. Constructed inside Task.Run and handed off to
the actor, the cache could in principle be mutated by either side; in
practice nothing does, but the type is a stale-evidence guarantee that
CLAUDE.md's "message immutability" rule is being followed only by convention.
Recommendation
Change the record signature to use IReadOnlyDictionary<string, IReadOnlyList<string>>
(or ImmutableDictionary / ImmutableArray<string>) and freeze the data
before piping. The cost is negligible — the payload is built and consumed
once per refresh tick.
Communication-021 — SiteStreamGrpcServer.SubscribeInstance leaks the StreamRelayActor if Subscribe throws pre-try
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:188-200 |
Description
SubscribeInstance performs these statements in order (lines 189-194), all
before the try block at line 200:
Interlocked.Increment(ref _actorCounter)_actorSystem!.ActorOf(Props.Create(typeof(StreamRelayActor), ...))_streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor)
If step 3 throws (the subscriber is wired but its Subscribe faults — a stale
instance name, a temporary index lookup failure, etc.), the exception escapes
the method as an unhandled RpcException and leaks the freshly-created
relayActor. The finally block at line 211 is unreachable because the
throw happens before the try. The actor's _activeStreams entry, the
StreamEntry.Cts, and the Channel<SiteStreamEvent> are also leaked.
In normal operation _streamSubscriber.Subscribe does not throw, so the bug is
latent — but a misbehaving site runtime (e.g. SiteStreamManager faulted
because the actor system is shutting down) would surface it.
Recommendation
Restructure to either (a) wrap the Subscribe call in a try whose catch
stops the relay actor and disposes the CTS, or (b) move the actor + subscriber
creation inside the existing try block (the finally will then handle
cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down
past the try { brace.
Resolution (2026-05-28): Took option (a). _streamSubscriber.Subscribe(...)
is now wrapped in its own try/catch — on throw, the freshly-created relay actor
is stopped via _actorSystem.Stop, the bounded channel is completed via
channel.Writer.TryComplete(), and the _activeStreams entry is removed via
the ownership-preserving TryRemove(KeyValuePair) overload before the
exception is re-thrown to the caller. Added regression test
SiteStreamGrpcServerTests.Comm021_SubscribeThrows_StopsRelayActorAndRemovesActiveStreamEntry
using an NSubstitute ISiteStreamSubscriber that throws on Subscribe;
asserts ActiveStreamCount == 0 and that RemoveSubscriber was NOT called
(confirming the catch path, not the finally path).
Communication-022 — _debugSubscriptions keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:67, src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:493 |
Description
TrackMessageForCleanup on SubscribeDebugViewRequest does
_debugSubscriptions[sub.CorrelationId] = (envelope.SiteId, Sender) (line 493).
The dictionary indexer silently overwrites any prior entry for the same
CorrelationId. If two debug sessions ever reuse the same correlation ID (e.g.
two Blazor users start a stream at the same moment with a non-GUID id, or a
caller bug, or a malicious caller as flagged in the cousin
Communication-014), the first subscriber's entry is overwritten and lost —
on a later ConnectionStateChanged(false) (per Communication-016 it never
actually fires today, but the design intent stands), only the second
subscriber would be notified of the disconnect.
DebugStreamService.StartStreamAsync uses Guid.NewGuid().ToString("N") as
the session id (DebugStreamService.cs:97), so a real collision is
astronomically unlikely in normal operation. But the central side is not
defending itself: a CLI consumer or a future caller is implicitly trusted to
generate globally-unique ids.
Recommendation
When the slot is already occupied, log a Warning and either reject the new
subscription with an error response or evict the prior subscriber via
DebugStreamTerminated before installing the new one. Mirrors the
SiteStreamGrpcServer defensive behaviour where a duplicate correlation_id
cancels the existing stream (line 167).
Resolution (2026-05-28): Closed by Comm-016 — field removed in commit ac96b83.
The _debugSubscriptions dictionary, TrackMessageForCleanup helper, and the
HandleConnectionStateChanged handler that consumed them were all deleted as
part of Comm-016's dead-code purge. There is no longer any caller-supplied
correlation-id keyed map to overwrite — the orphan-on-reuse hazard is gone.