fix(communication): resolve Communication-004..008 — Resume supervision, gRPC option wiring, address-load logging, sync dispose, flap detection
This commit is contained in:
@@ -8,7 +8,7 @@
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 8 |
|
||||
| Open findings | 3 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -187,7 +187,7 @@ fail against the pre-fix logic and pass after.
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Akka.NET conventions |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:42`, `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:22` |
|
||||
|
||||
**Description**
|
||||
@@ -212,7 +212,17 @@ strategy), matching the documented decision and other coordinator actors.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit pending). Root cause confirmed: neither
|
||||
`CentralCommunicationActor` nor `SiteCommunicationActor` overrode `SupervisorStrategy`,
|
||||
so child faults fell back to the Akka default (`Restart`). Note that an actor's own
|
||||
`SupervisorStrategy` governs its *children* — a transient child fault would `Restart`
|
||||
the child and discard its in-memory state, contrary to the CLAUDE.md "Resume for
|
||||
coordinator actors" decision. Fix: both actors now override `SupervisorStrategy()` to
|
||||
return a `OneForOneStrategy` with an unbounded `Decider` resolving to `Directive.Resume`
|
||||
(mirroring `DataConnectionManagerActor`). Regression tests
|
||||
`CoordinatorSupervisionTests.CentralCommunicationActor_SupervisorStrategy_IsResume` and
|
||||
`CoordinatorSupervisionTests.SiteCommunicationActor_SupervisorStrategy_IsResume` fail
|
||||
against the pre-fix code (decider yields `Restart`) and pass after.
|
||||
|
||||
### Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied
|
||||
|
||||
@@ -220,7 +230,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:25`, `src/ScadaLink.Communication/CommunicationOptions.cs:36` |
|
||||
|
||||
**Description**
|
||||
@@ -249,7 +259,22 @@ option and update the design doc.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit pending). Root cause confirmed: `SiteStreamGrpcClient`
|
||||
hard-coded the keepalive values, `GrpcMaxStreamLifetime` was referenced nowhere, and
|
||||
`GrpcMaxConcurrentStreams` was never bound to the server. Fix (scoped to
|
||||
`src/ScadaLink.Communication`): `SiteStreamGrpcClient` gained a constructor taking
|
||||
`CommunicationOptions` and now applies `GrpcKeepAlivePingDelay`/`GrpcKeepAlivePingTimeout`
|
||||
to its `SocketsHttpHandler`; `SiteStreamGrpcClientFactory` gained an
|
||||
`IOptions<CommunicationOptions>` DI constructor and flows the options into every client
|
||||
it creates; `SiteStreamGrpcServer` gained an `IOptions<CommunicationOptions>` DI
|
||||
constructor that binds `GrpcMaxConcurrentStreams` and implements the documented 4-hour
|
||||
session timeout via `CancellationTokenSource.CancelAfter(GrpcMaxStreamLifetime)` on the
|
||||
per-stream CTS. The Host's existing `AddSingleton<SiteStreamGrpcServer>()` registration
|
||||
resolves the new DI constructor via greedy resolution — no Host change required.
|
||||
Regression tests `GrpcOptionsWiringTests.SiteStreamGrpcClient_AppliesKeepAliveFromOptions`,
|
||||
`GrpcOptionsWiringTests.SiteStreamGrpcClientFactory_FlowsOptionsToCreatedClients`, and
|
||||
`GrpcOptionsWiringTests.SiteStreamGrpcServer_BindsMaxConcurrentStreamsAndLifetimeFromOptions`
|
||||
exercise the wiring (they require the new members to even compile).
|
||||
|
||||
### Communication-006 — Site address load failures are silently swallowed, leaving a stale cache
|
||||
|
||||
@@ -257,7 +282,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:204` |
|
||||
|
||||
**Description**
|
||||
@@ -279,7 +304,16 @@ is down". Optionally surface a health metric for repeated load failures.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit pending). Root cause confirmed: a faulted
|
||||
`LoadSiteAddressesFromDb` task is piped to `Self` as a `Status.Failure`, but the actor
|
||||
had no handler for it — the failure became an unhandled message (debug-level only) and
|
||||
the periodic refresh failed silently. Fix: added a `Receive<Status.Failure>` handler
|
||||
that logs the load failure at `Warning` with the underlying exception as the cause, so
|
||||
operators can distinguish a missing-addresses configuration from a database outage.
|
||||
Regression test
|
||||
`CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed`
|
||||
(repository query throws) asserts the Warning is emitted — it produces no warning
|
||||
against the pre-fix code and passes after.
|
||||
|
||||
### Communication-007 — `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async)
|
||||
|
||||
@@ -287,7 +321,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Performance & resource management |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53` |
|
||||
|
||||
**Description**
|
||||
@@ -308,7 +342,17 @@ path, or document why blocking is safe here.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit pending). Root cause confirmed: `Dispose()` called
|
||||
`DisposeAsync().AsTask().GetAwaiter().GetResult()`, the classic sync-over-async pattern.
|
||||
Fix: `SiteStreamGrpcClient` now also implements `IDisposable` with a synchronous
|
||||
`Dispose()` that releases its CancellationTokenSources and underlying `GrpcChannel`
|
||||
directly (all of that teardown is inherently synchronous); `SiteStreamGrpcClientFactory.Dispose()`
|
||||
now disposes each cached client via that synchronous path with no blocking on the async
|
||||
path. A `CreateClient` seam was extracted so the test can substitute a tracking client
|
||||
while still exercising the factory's real caching/disposal machinery. Regression test
|
||||
`SiteStreamGrpcClientFactoryDisposeTests.Dispose_DisposesClientsSynchronously_NotViaAsyncPath`
|
||||
fails against the pre-fix code (clients disposed via `DisposeAsync`) and passes after;
|
||||
`Dispose_DoesNotDeadlock_UnderSingleThreadedSynchronizationContext` guards the stall path.
|
||||
|
||||
### Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely
|
||||
|
||||
@@ -316,7 +360,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:71`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:174` |
|
||||
|
||||
**Description**
|
||||
@@ -339,7 +383,21 @@ reconnects regardless of intervening events.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit pending). Root cause confirmed: `_retryCount` was reset to
|
||||
0 on every received `AttributeValueChanged`/`AlarmStateChanged`, so a stream that
|
||||
connected, delivered one event, then failed — repeatedly — never tripped `MaxRetries`.
|
||||
Fix (recommendation option a): the per-event reset was removed; instead `OpenGrpcStream`
|
||||
arms a single `StabilityWindow` timer (60s default, internal-settable for tests), and
|
||||
only when it fires (`GrpcStreamStable`) — i.e. the stream stayed up long enough to be
|
||||
considered recovered — is `_retryCount` reset. `HandleGrpcError` cancels that timer, so
|
||||
a stream that fails before the window elapses does not recover its retry budget. A
|
||||
flapping stream therefore terminates after `MaxRetries` regardless of intervening
|
||||
events. Regression test
|
||||
`DebugStreamBridgeActorTests.FlappingStream_DeliveringEventsBetweenFailures_StillTerminatesAfterMaxRetries`
|
||||
fails against the pre-fix code (actor never terminates) and passes after;
|
||||
`RetryCount_RecoveredOnlyAfterStreamStaysStableForStabilityWindow` verifies the budget
|
||||
is recovered after a stable interval. The pre-existing test that codified the buggy
|
||||
per-event reset (`Grpc_Error_Resets_RetryCount_On_Successful_Event`) was replaced.
|
||||
|
||||
### Communication-009 — `_siteClients` field is mutable and reassignable; cache update is not atomic on failure
|
||||
|
||||
|
||||
Reference in New Issue
Block a user