fix(communication): resolve Communication-004..008 — Resume supervision, gRPC option wiring, address-load logging, sync dispose, flap detection

2026-05-16 20:58:03 -04:00
parent 3e7a3d7e31
commit 31a6995d24
12 changed files with 656 additions and 51 deletions
--- a/code-reviews/Communication/findings.md
+++ b/code-reviews/Communication/findings.md
@@ -8,7 +8,7 @@
 | Last reviewed | 2026-05-16 |
 | Reviewer | claude-agent |
 | Commit reviewed | `9c60592` |
-| Open findings | 8 |
+| Open findings | 3 |

 ## Summary

@@ -187,7 +187,7 @@ fail against the pre-fix logic and pass after.
 |--|--|
 | Severity | Medium |
 | Category | Akka.NET conventions |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:42`, `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:22` |

 **Description**
@@ -212,7 +212,17 @@ strategy), matching the documented decision and other coordinator actors.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Root cause confirmed: neither
+`CentralCommunicationActor` nor `SiteCommunicationActor` overrode `SupervisorStrategy`,
+so child faults fell back to the Akka default (`Restart`). Note that an actor's own
+`SupervisorStrategy` governs its *children* — a transient child fault would `Restart`
+the child and discard its in-memory state, contrary to the CLAUDE.md "Resume for
+coordinator actors" decision. Fix: both actors now override `SupervisorStrategy()` to
+return a `OneForOneStrategy` with an unbounded `Decider` resolving to `Directive.Resume`
+(mirroring `DataConnectionManagerActor`). Regression tests
+`CoordinatorSupervisionTests.CentralCommunicationActor_SupervisorStrategy_IsResume` and
+`CoordinatorSupervisionTests.SiteCommunicationActor_SupervisorStrategy_IsResume` fail
+against the pre-fix code (decider yields `Restart`) and pass after.

 ### Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied

@@ -220,7 +230,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:25`, `src/ScadaLink.Communication/CommunicationOptions.cs:36` |

 **Description**
@@ -249,7 +259,22 @@ option and update the design doc.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Root cause confirmed: `SiteStreamGrpcClient`
+hard-coded the keepalive values, `GrpcMaxStreamLifetime` was referenced nowhere, and
+`GrpcMaxConcurrentStreams` was never bound to the server. Fix (scoped to
+`src/ScadaLink.Communication`): `SiteStreamGrpcClient` gained a constructor taking
+`CommunicationOptions` and now applies `GrpcKeepAlivePingDelay`/`GrpcKeepAlivePingTimeout`
+to its `SocketsHttpHandler`; `SiteStreamGrpcClientFactory` gained an
+`IOptions<CommunicationOptions>` DI constructor and flows the options into every client
+it creates; `SiteStreamGrpcServer` gained an `IOptions<CommunicationOptions>` DI
+constructor that binds `GrpcMaxConcurrentStreams` and implements the documented 4-hour
+session timeout via `CancellationTokenSource.CancelAfter(GrpcMaxStreamLifetime)` on the
+per-stream CTS. The Host's existing `AddSingleton<SiteStreamGrpcServer>()` registration
+resolves the new DI constructor via greedy resolution — no Host change required.
+Regression tests `GrpcOptionsWiringTests.SiteStreamGrpcClient_AppliesKeepAliveFromOptions`,
+`GrpcOptionsWiringTests.SiteStreamGrpcClientFactory_FlowsOptionsToCreatedClients`, and
+`GrpcOptionsWiringTests.SiteStreamGrpcServer_BindsMaxConcurrentStreamsAndLifetimeFromOptions`
+exercise the wiring (they require the new members to even compile).

 ### Communication-006 — Site address load failures are silently swallowed, leaving a stale cache

@@ -257,7 +282,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Error handling & resilience |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:204` |

 **Description**
@@ -279,7 +304,16 @@ is down". Optionally surface a health metric for repeated load failures.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Root cause confirmed: a faulted
+`LoadSiteAddressesFromDb` task is piped to `Self` as a `Status.Failure`, but the actor
+had no handler for it — the failure became an unhandled message (debug-level only) and
+the periodic refresh failed silently. Fix: added a `Receive<Status.Failure>` handler
+that logs the load failure at `Warning` with the underlying exception as the cause, so
+operators can distinguish a missing-addresses configuration from a database outage.
+Regression test
+`CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed`
+(repository query throws) asserts the Warning is emitted — it produces no warning
+against the pre-fix code and passes after.

 ### Communication-007 — `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async)

@@ -287,7 +321,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Performance & resource management |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53` |

 **Description**
@@ -308,7 +342,17 @@ path, or document why blocking is safe here.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Root cause confirmed: `Dispose()` called
+`DisposeAsync().AsTask().GetAwaiter().GetResult()`, the classic sync-over-async pattern.
+Fix: `SiteStreamGrpcClient` now also implements `IDisposable` with a synchronous
+`Dispose()` that releases its CancellationTokenSources and underlying `GrpcChannel`
+directly (all of that teardown is inherently synchronous); `SiteStreamGrpcClientFactory.Dispose()`
+now disposes each cached client via that synchronous path with no blocking on the async
+path. A `CreateClient` seam was extracted so the test can substitute a tracking client
+while still exercising the factory's real caching/disposal machinery. Regression test
+`SiteStreamGrpcClientFactoryDisposeTests.Dispose_DisposesClientsSynchronously_NotViaAsyncPath`
+fails against the pre-fix code (clients disposed via `DisposeAsync`) and passes after;
+`Dispose_DoesNotDeadlock_UnderSingleThreadedSynchronizationContext` guards the stall path.

 ### Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely

@@ -316,7 +360,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:71`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:174` |

 **Description**
@@ -339,7 +383,21 @@ reconnects regardless of intervening events.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Root cause confirmed: `_retryCount` was reset to
+0 on every received `AttributeValueChanged`/`AlarmStateChanged`, so a stream that
+connected, delivered one event, then failed — repeatedly — never tripped `MaxRetries`.
+Fix (recommendation option a): the per-event reset was removed; instead `OpenGrpcStream`
+arms a single `StabilityWindow` timer (60s default, internal-settable for tests), and
+only when it fires (`GrpcStreamStable`) — i.e. the stream stayed up long enough to be
+considered recovered — is `_retryCount` reset. `HandleGrpcError` cancels that timer, so
+a stream that fails before the window elapses does not recover its retry budget. A
+flapping stream therefore terminates after `MaxRetries` regardless of intervening
+events. Regression test
+`DebugStreamBridgeActorTests.FlappingStream_DeliveringEventsBetweenFailures_StillTerminatesAfterMaxRetries`
+fails against the pre-fix code (actor never terminates) and passes after;
+`RetryCount_RecoveredOnlyAfterStreamStaysStableForStabilityWindow` verifies the budget
+is recovered after a stable interval. The pre-existing test that codified the buggy
+per-event reset (`Grpc_Error_Resets_RetryCount_On_Successful_Event`) was replaced.

 ### Communication-009 — `_siteClients` field is mutable and reassignable; cache update is not atomic on failure