fix(communication): resolve Communication-012..015 — endpoint-aware gRPC client cache, address-change recreation, correlation-id validation, node-flip tests
@@ -8,7 +8,7 @@
 | Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
 | Commit reviewed | `39d737e` |
-| Open findings | 4 |
+| Open findings | 0 |
 
 ## Summary
 
@@ -543,7 +543,7 @@ The full module suite (`dotnet test tests/ScadaLink.Communication.Tests`) is gre
 |--|--|
 | Severity | High |
 | Category | Correctness & logic bugs |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166` |
 
 **Description**
@@ -582,7 +582,17 @@ targets the other endpoint.
 
 **Resolution**
 
-_Unresolved._
+Resolved 2026-05-17 (commit pending). Root cause confirmed against source:
+`GetOrCreate` was `_clients.GetOrAdd(siteIdentifier, …)` — keyed by site identifier
+only, so the `grpcEndpoint` argument was honoured solely on first creation and the
+NodeA→NodeB flip reconnected to the dead endpoint forever. Fix:
+`SiteStreamGrpcClient` now exposes its bound `Endpoint`, and `GetOrCreate` compares
+the cached client's endpoint against the requested one — on a mismatch it atomically
+installs (via `ConcurrentDictionary.AddOrUpdate`) a fresh client for the new endpoint
+and disposes the stale one, so a node flip actually moves to the surviving node.
+Regression tests `SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint`
+and `GetOrCreate_SameEndpoint_DoesNotDisposeOrRecreate` fail against the pre-fix
+factory and pass after.
 
 ### Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller
 
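Reviewer note on the Communication-012 resolution in the hunk above: the endpoint-aware cache it describes could look roughly like the sketch below. This is not the committed code. `SketchClient` and `SketchClientFactory` are hypothetical stand-ins for `SiteStreamGrpcClient` and `SiteStreamGrpcClientFactory`, and the ordinal string comparison of endpoints is an assumption. Only `ConcurrentDictionary.AddOrUpdate` is taken from the resolution text.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for SiteStreamGrpcClient: models only the bound
// endpoint and disposal, not the real gRPC channel.
public sealed class SketchClient : IDisposable
{
    public SketchClient(string endpoint) => Endpoint = endpoint;
    public string Endpoint { get; }
    public bool IsDisposed { get; private set; }
    public void Dispose() => IsDisposed = true;
}

// Hypothetical stand-in for SiteStreamGrpcClientFactory.
public sealed class SketchClientFactory
{
    private readonly ConcurrentDictionary<string, SketchClient> _clients = new();

    public SketchClient GetOrCreate(string siteIdentifier, string grpcEndpoint)
    {
        SketchClient? stale = null;

        var current = _clients.AddOrUpdate(
            siteIdentifier,
            // No cached client for this site yet: create one.
            _ => new SketchClient(grpcEndpoint),
            // A client is cached: keep it only if it is bound to the same endpoint.
            (_, existing) =>
            {
                // The update delegate can run more than once under contention,
                // so reset the captured value on every attempt.
                stale = null;
                if (string.Equals(existing.Endpoint, grpcEndpoint, StringComparison.Ordinal))
                    return existing;

                stale = existing;
                return new SketchClient(grpcEndpoint);
            });

        // Dispose the replaced client outside the dictionary update.
        // Assumption: Dispose is idempotent, since a racing caller may also dispose it.
        stale?.Dispose();
        return current;
    }
}
```

A client created by an update attempt that loses a race is simply dropped in this sketch; a real factory holding gRPC channels would need to dispose those as well.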
@@ -590,7 +600,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58` |
 
 **Description**
@@ -619,7 +629,19 @@ the on-the-fly address-change requirement is intentionally dropped, remove
 
 **Resolution**
 
-_Unresolved._
+Resolved 2026-05-17 (commit pending). The address-*change* staleness — the primary
+impact ("never have working debug streaming until the central node is restarted") —
+is fixed in-module by the Communication-012 change: `GetOrCreate` is now
+endpoint-change-aware, so the next time `DebugStreamBridgeActor` requests a stream
+with a corrected `GrpcNodeAAddress`/`GrpcNodeBAddress` the stale cached client is
+disposed and replaced — no central restart needed and no external wiring required.
+`RemoveSiteAsync` is retained as the disposal path for full site *removal* (a deleted
+site record) and its doc comment now states that role explicitly; wiring a
+delete-site callback belongs to the site-management flow in another module and is out
+of this module's scope. Regression test
+`SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_DisposesPriorClient`
+fails against the pre-fix factory (stale client never disposed/replaced) and passes
+after.
 
 ### Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name
 
@@ -627,7 +649,7 @@ _Unresolved._
 |--|--|
 | Severity | Low |
 | Category | Security |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124` |
 
 **Description**
@@ -652,7 +674,17 @@ actor state / dictionary key.
 
 **Resolution**
 
-_Unresolved._
+Resolved 2026-05-17 (commit pending). Root cause confirmed: `SubscribeInstance` fed
+the off-the-wire `request.CorrelationId` straight into the `stream-relay-…` actor
+name, so an id with `/`, whitespace, or other disallowed characters made `ActorOf`
+throw `InvalidActorNameException` as an unhandled RPC fault. Fix: `SubscribeInstance`
+now validates `CorrelationId` on entry — rejecting null/empty or any value failing
+`ActorPath.IsValidPathElement` with `StatusCode.InvalidArgument` before any actor or
+subscription state is created. Regression test
+`SiteStreamGrpcServerTests.RejectsCorrelationIdThatIsNotActorNameSafe` (theory:
+`/`-bearing, whitespace, empty, `$`-prefixed ids) fails against the pre-fix server
+and passes after; `AcceptsActorNameSafeCorrelationId` confirms a normal GUID is still
+accepted.
 
 ### Communication-015 — No test exercises the real gRPC client factory across a node flip
 
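Reviewer note on the Communication-014 resolution in the hunk above: the entry check it describes might be sketched as follows. This is an illustration, not the committed code. `CorrelationIdGuard` and `EnsureActorNameSafe` are hypothetical names and the error message is made up. `ActorPath.IsValidPathElement` and `StatusCode.InvalidArgument` are the calls named in the resolution text, and the sketch assumes the Akka.NET and gRPC core API packages are referenced.

```csharp
using Akka.Actor;   // ActorPath.IsValidPathElement
using Grpc.Core;    // RpcException, Status, StatusCode

// Hypothetical helper; the real check lives inside SubscribeInstance.
public static class CorrelationIdGuard
{
    // Throws an InvalidArgument RPC error unless the id can be used
    // as a single Akka actor path element.
    public static void EnsureActorNameSafe(string? correlationId)
    {
        if (string.IsNullOrEmpty(correlationId) ||
            !ActorPath.IsValidPathElement(correlationId))
        {
            throw new RpcException(new Status(
                StatusCode.InvalidArgument,
                "correlation_id must be a non-empty, actor-name-safe value."));
        }
    }
}
```

Calling such a guard as the first statement of the RPC handler keeps a malformed id from reaching `ActorOf`, so the caller sees a normal gRPC status instead of an unhandled fault.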
@@ -660,7 +692,7 @@ _Unresolved._
 |--|--|
 | Severity | Low |
 | Category | Testing coverage |
-| Status | Open |
+| Status | Resolved |
 | Location | `tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs` |
 
 **Description**
@@ -683,4 +715,14 @@ test's mock factory track the endpoint per call so node-flip coverage is meaning
 
 **Resolution**
 
-_Unresolved._
+Resolved 2026-05-17 (commit pending). `SiteStreamGrpcClientFactoryTests` gained
+`GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint` and
+`GetOrCreate_EndpointChanged_DisposesPriorClient`, which drive the *real*
+`SiteStreamGrpcClientFactory` across a node flip and assert the second call yields a
+client bound to the new endpoint with the stale one disposed — both fail against the
+pre-fix factory and pass after (the Communication-012 fix). `DebugStreamBridgeActorTests`
+gained `On_GrpcError_Reconnects_To_Other_Node_Endpoint`, which uses a new
+`EndpointTrackingGrpcClientFactory` test double that hands out a distinct mock client
+per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's
+NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather
+than being masked by an endpoint-agnostic mock.