fix(communication): resolve Communication-002/003 — gRPC reconnect stream cleanup and subscription map safety
@@ -8,7 +8,7 @@
 | Last reviewed | 2026-05-16 |
 | Reviewer | claude-agent |
 | Commit reviewed | `9c60592` |
-| Open findings | 10 |
+| Open findings | 8 |
 
 ## Summary
 
@@ -98,7 +98,7 @@ references `Communication-001`.
 |--|--|
 | Severity | High |
 | Category | Error handling & resilience |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:170`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:143` |
 
 **Description**
@@ -126,7 +126,14 @@ the gRPC cancellation reaches the site and stops the relay actor.
 
 **Resolution**
 
-_Unresolved._
+Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against source:
+`HandleGrpcError` flipped `_useNodeA` and scheduled `OpenGrpcStream` without ever
+unsubscribing the failed stream, leaving the old node's `StreamRelayActor` zombie until
+TCP/keepalive timeout. Fix: `HandleGrpcError` now resolves the client for the
+*previous* endpoint (before flipping `_useNodeA`) and calls `Unsubscribe(_correlationId)`
+on it, so the local CTS is cancelled and gRPC cancellation reaches the still-alive site.
+Regression test `DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect`
+fails against the pre-fix code and passes after.
 
 ### Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream
 
@@ -134,7 +141,7 @@ _Unresolved._
 |--|--|
 | Severity | High |
 | Category | Concurrency & thread safety |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:77`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:106` |
 
 **Description**
@@ -161,7 +168,18 @@ caller-supplied correlation ID.
 
 **Resolution**
 
-_Unresolved._
+Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against source: the
+inline `_subscriptions[correlationId] = cts` overwrote a prior CTS without
+cancel/dispose (leak), and the `finally`'s `TryRemove(correlationId, out _)` removed by
+key only — a racing reconnect's live CTS could be removed by the prior call's `finally`,
+orphaning the live stream. Fix: extracted two internal helpers used by `SubscribeAsync`
+— `RegisterSubscription` cancels+disposes any existing CTS for the correlation ID before
+inserting, and `RemoveSubscription` uses the `ConcurrentDictionary.TryRemove(KeyValuePair)`
+overload so it removes only the CTS that call created (mirroring `SiteStreamGrpcServer`'s
+`StreamEntry` pattern). Regression tests
+`SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts`
+and `SiteStreamGrpcClientTests.RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement`
+fail against the pre-fix logic and pass after.
 
 ### Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)
 