fix(communication): resolve Communication-002/003 — gRPC reconnect stream cleanup and subscription map safety

Joseph Doherty
2026-05-16 19:33:09 -04:00
parent 87f14c190a
commit 301e7fb854
5 changed files with 134 additions and 7 deletions


@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 10 |
| Open findings | 8 |
## Summary
@@ -98,7 +98,7 @@ references `Communication-001`.
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:170`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:143` |
**Description**
@@ -126,7 +126,14 @@ the gRPC cancellation reaches the site and stops the relay actor.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against source:
`HandleGrpcError` flipped `_useNodeA` and scheduled `OpenGrpcStream` without ever
unsubscribing the failed stream, leaving the old node's `StreamRelayActor` running as a zombie until
TCP/keepalive timeout. Fix: `HandleGrpcError` now resolves the client for the
*previous* endpoint (before flipping `_useNodeA`) and calls `Unsubscribe(_correlationId)`
on it, so the local CTS is cancelled and gRPC cancellation reaches the still-alive site.
Regression test `DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect`
fails against the pre-fix code and passes after.
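
The ordering described in this resolution can be sketched as follows. This is a minimal illustration, not the actual `DebugStreamBridgeActor` code: `IStreamClient`, `RecordingClient`, and `ReconnectSketch` are hypothetical names, and the scheduling of `OpenGrpcStream` is omitted. Only `HandleGrpcError`, `_useNodeA`, `_correlationId`, and `Unsubscribe` are taken from the text above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical client abstraction; the real client type is not shown in this review.
public interface IStreamClient
{
    void Unsubscribe(string correlationId);
}

// Hypothetical test double that records which correlation IDs were unsubscribed.
public sealed class RecordingClient : IStreamClient
{
    public List<string> Unsubscribed { get; } = new List<string>();
    public void Unsubscribe(string correlationId) => Unsubscribed.Add(correlationId);
}

// Hypothetical stand-in for the bridge actor's reconnect logic.
public sealed class ReconnectSketch
{
    private readonly IStreamClient _nodeA;
    private readonly IStreamClient _nodeB;
    private readonly string _correlationId;
    private bool _useNodeA = true;

    public ReconnectSketch(IStreamClient nodeA, IStreamClient nodeB, string correlationId)
    {
        _nodeA = nodeA;
        _nodeB = nodeB;
        _correlationId = correlationId;
    }

    public bool UseNodeA => _useNodeA;

    public void HandleGrpcError()
    {
        // Resolve the client for the endpoint that just failed BEFORE flipping,
        // so the unsubscribe targets the old stream rather than the new one.
        IStreamClient previous = _useNodeA ? _nodeA : _nodeB;
        previous.Unsubscribe(_correlationId);

        // Only now switch endpoints; the real actor would schedule OpenGrpcStream here.
        _useNodeA = !_useNodeA;
    }
}
```

If the flip happened first, `previous` would resolve to the new endpoint and the failed stream would never be cancelled, which is the pre-fix behaviour described above.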
### Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream
@@ -134,7 +141,7 @@ _Unresolved._
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:77`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:106` |
**Description**
@@ -161,7 +168,18 @@ caller-supplied correlation ID.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against source: the
inline `_subscriptions[correlationId] = cts` overwrote a prior CTS without
cancel/dispose (leak), and the `finally`'s `TryRemove(correlationId, out _)` removed by
key only — a racing reconnect's live CTS could be removed by the prior call's `finally`,
orphaning the live stream. Fix: extracted two internal helpers used by `SubscribeAsync`:
`RegisterSubscription` cancels and disposes any existing CTS for the correlation ID before
inserting, and `RemoveSubscription` uses the `ConcurrentDictionary.TryRemove(KeyValuePair)`
overload so it removes only the CTS that call created (mirroring `SiteStreamGrpcServer`'s
`StreamEntry` pattern). Regression tests
`SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts`
and `SiteStreamGrpcClientTests.RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement`
fail against the pre-fix logic and pass after.
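
The two helpers described in this resolution might look roughly like the sketch below. It assumes .NET 5 or later, where `ConcurrentDictionary.TryRemove(KeyValuePair<TKey, TValue>)` is available. The class name `SubscriptionMapSketch`, the helper `IsRegistered`, and the method signatures are assumptions made for illustration; the actual `SiteStreamGrpcClient` helpers may differ.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the client's subscription bookkeeping.
public sealed class SubscriptionMapSketch
{
    private readonly ConcurrentDictionary<string, CancellationTokenSource> _subscriptions =
        new ConcurrentDictionary<string, CancellationTokenSource>();

    // Registers a fresh CTS for the correlation ID. If one is already present,
    // it is atomically replaced, then cancelled and disposed.
    public CancellationTokenSource RegisterSubscription(string correlationId)
    {
        var cts = new CancellationTokenSource();
        while (true)
        {
            if (_subscriptions.TryGetValue(correlationId, out var existing))
            {
                // TryUpdate succeeds only if the stored value is still 'existing'.
                if (_subscriptions.TryUpdate(correlationId, cts, existing))
                {
                    existing.Cancel();
                    existing.Dispose();
                    return cts;
                }
            }
            else if (_subscriptions.TryAdd(correlationId, cts))
            {
                return cts;
            }
            // Lost a race with another writer; retry with the current state.
        }
    }

    // Removes the entry only if it still maps to the caller's own CTS,
    // so a stale call cannot evict a newer subscription.
    public bool RemoveSubscription(string correlationId, CancellationTokenSource cts)
    {
        bool removed = _subscriptions.TryRemove(
            new KeyValuePair<string, CancellationTokenSource>(correlationId, cts));
        if (removed)
        {
            cts.Dispose();
        }
        return removed;
    }

    // Illustration-only helper: true if the map currently holds this exact CTS.
    public bool IsRegistered(string correlationId, CancellationTokenSource cts)
    {
        return _subscriptions.TryGetValue(correlationId, out var current)
            && ReferenceEquals(current, cts);
    }
}
```

With this shape, a late `finally` from a superseded call passes its own CTS to `RemoveSubscription`, the key/value comparison fails, and the replacement subscription stays in the map.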
### Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)