docs(code-reviews): re-review batch 1 at 39d737e — CentralUI, CLI, ClusterInfrastructure, Commons, Communication
17 new findings: CentralUI-020..025, CLI-014..016, ClusterInfrastructure-009..010, Commons-013..014, Communication-012..015.
This commit is contained in:
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Communication` |
|
||||
| Design doc | `docs/requirements/Component-Communication.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 4 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -25,20 +25,37 @@ CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is othe
|
||||
good. Test coverage is broad for happy paths but has gaps around failover, cache
|
||||
mutation races, and the snapshot-timeout cleanup path.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All prior findings (Communication-001..011) are confirmed `Resolved` in this commit
|
||||
and the fixes hold up against the source. The re-review walked all 10 checklist
|
||||
categories again and uncovered a previously-missed defect at the centre of the gRPC
|
||||
node-failover path: **`SiteStreamGrpcClientFactory.GetOrCreate` caches one client per
|
||||
site identifier and silently ignores the `grpcEndpoint` argument on a cache hit**. The
|
||||
`DebugStreamBridgeActor` reconnect logic flips `_useNodeA` and passes the *other*
|
||||
node's endpoint, but the factory hands back the original NodeA-bound client every
|
||||
time — so the documented "try the other site node endpoint" failover never actually
|
||||
moves to NodeB (Communication-012). The same caching defect means a site address
|
||||
change is never picked up because `RemoveSiteAsync` has no production caller
|
||||
(Communication-013). Two Low findings round out the re-review: an untrusted
|
||||
gRPC-supplied `correlation_id` flows straight into an Akka actor name
|
||||
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
|
||||
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling `CleanupGrpc`, subscription-map races. |
|
||||
| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; `Sender` captured in async-launched closure path. |
|
||||
| 3 | Concurrency & thread safety | ✓ | `SiteStreamGrpcClient._subscriptions` overwrite/remove race; `_siteClients` field reassignment unused but non-readonly. |
|
||||
| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; `LoadSiteAddressesFromDb` swallows DB failures silently. |
|
||||
| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). |
|
||||
| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; `SiteStreamGrpcClientFactory.Dispose` blocks on async. |
|
||||
| 7 | Design-document adherence | ✓ | `GrpcMaxStreamLifetime` / keepalive options defined but never applied; hard-coded values used instead. |
|
||||
| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. |
|
||||
| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. |
|
||||
| 10 | Documentation & comments | ✓ | XML comment on `DebugStreamBridgeActor` says "Persistent actor" — it is not an Akka.Persistence actor. |
|
||||
| 1 | Correctness & logic bugs | ✓ | Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved. |
|
||||
| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies now present and verified. No new issues. |
|
||||
| 3 | Concurrency & thread safety | ✓ | Subscription-map register/remove now ownership-checked. `_siteClients` readonly. No new issues. |
|
||||
| 4 | Error handling & resilience | ✓ | `Status.Failure` handler added; reconnect unsubscribes prior stream. No new issues. |
|
||||
| 5 | Security | ✓ | Re-review: public gRPC `correlation_id` flows unvalidated into an Akka actor name (Communication-014). |
|
||||
| 6 | Performance & resource management | ✓ | Synchronous `Dispose` paths fixed; CTS leaks resolved. No new issues. |
|
||||
| 7 | Design-document adherence | ✓ | Re-review: site gRPC address-change disposal not wired — `RemoveSiteAsync` is dead code (Communication-013). gRPC options now applied. |
|
||||
| 8 | Code organization & conventions | ✓ | Options pattern correct; public records still declared in actor files (acceptable). No structural issues. |
|
||||
| 9 | Testing coverage | ✓ | Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015). |
|
||||
| 10 | Documentation & comments | ✓ | `DebugStreamBridgeActor` summary corrected. No new issues. |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -519,3 +536,151 @@ passes after):
|
||||
(added with this finding's resolution).
|
||||
The full module suite (`dotnet test tests/ScadaLink.Communication.Tests`) is green at
|
||||
111 passing tests.
|
||||
|
||||
### Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SiteStreamGrpcClientFactory.GetOrCreate` is `_clients.GetOrAdd(siteIdentifier, _ =>
|
||||
CreateClient(grpcEndpoint))` — it keys the cache by **site identifier only** and the
|
||||
`grpcEndpoint` argument is used *exclusively* for the first-ever creation. Every
|
||||
subsequent call for that site returns the originally-cached `SiteStreamGrpcClient`,
|
||||
which is permanently bound to the `GrpcChannel` of whatever endpoint was passed first.
|
||||
|
||||
`DebugStreamBridgeActor` relies on the opposite behaviour. On a gRPC stream error,
|
||||
`HandleGrpcError` flips `_useNodeA` and `OpenGrpcStream` recomputes
|
||||
`endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress`, then calls
|
||||
`_grpcFactory.GetOrCreate(_siteIdentifier, endpoint)` expecting a client connected to
|
||||
the *other* node. Because the factory ignores the new endpoint, the bridge actor
|
||||
reconnects to the **same failed NodeA endpoint** on every retry. The design doc's
|
||||
core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if
|
||||
NodeA failed, or vice versa") is therefore inoperative — when a site node goes down,
|
||||
the debug stream cannot move to the surviving node and simply exhausts `MaxRetries`
|
||||
against the dead endpoint and terminates. The `_useNodeA` flip, the `previousEndpoint`
|
||||
computation in `HandleGrpcError`, and the `CleanupGrpc` endpoint selection are all
|
||||
dead logic. (Communication-002's `Unsubscribe`-before-reconnect fix still functions,
|
||||
but it unsubscribes and re-subscribes on the *same* client/node rather than the
|
||||
intended other node.)
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make the per-site client aware of both endpoints, or key the cache by
|
||||
`(siteIdentifier, endpoint)`, or have `GetOrCreate` detect an endpoint change and
|
||||
dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if
|
||||
NodeA connection fails"), the cleanest fix is to give `SiteStreamGrpcClient` (or a
|
||||
per-site holder) both NodeA/NodeB addresses and let it switch channels internally,
|
||||
removing the endpoint argument from `GetOrCreate` entirely. Add a test that drives a
|
||||
real `SiteStreamGrpcClientFactory` through a node flip and asserts the second client
|
||||
targets the other endpoint.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58` |
|
||||
|
||||
**Description**
|
||||
|
||||
The design doc states that `SiteStreamGrpcClientFactory` "Disposes clients on site
|
||||
removal or address change." `RemoveSiteAsync` implements the disposal mechanism, but
|
||||
a repo-wide search finds **no production caller** — only tests invoke it. Combined
|
||||
with the cache-by-site-identifier behaviour (Communication-012), the consequence is
|
||||
that once a site's `SiteStreamGrpcClient` is created, a later edit to that site's
|
||||
`GrpcNodeAAddress` / `GrpcNodeBAddress` (via the Central UI or CLI) is never reflected
|
||||
in the cached client — it keeps using the stale channel for the life of the process.
|
||||
`CentralCommunicationActor` already refreshes the *Akka* address cache every 60s and
|
||||
recreates ClusterClients on change, but there is no equivalent invalidation path
|
||||
wired into the gRPC client factory. A site whose gRPC endpoints are corrected after
|
||||
an initial misconfiguration will never have working debug streaming until the central
|
||||
node is restarted.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Wire a site-removal / address-change signal into `SiteStreamGrpcClientFactory` —
|
||||
e.g. have `CentralCommunicationActor` (which already detects address changes in
|
||||
`HandleSiteAddressCacheLoaded`) call `RemoveSiteAsync` for sites whose gRPC addresses
|
||||
changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If
|
||||
the on-the-fly address-change requirement is intentionally dropped, remove
|
||||
`RemoveSiteAsync` and correct the design doc.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SubscribeInstance` is a public gRPC endpoint hosted on each site node. It creates the
|
||||
relay actor with `$"stream-relay-{request.CorrelationId}-{actorSeq}"` as the actor
|
||||
name, where `request.CorrelationId` comes straight off the wire. Akka actor names have
|
||||
a restricted character set; a `correlation_id` containing `/`, whitespace, or other
|
||||
disallowed characters makes `ActorSystem.ActorOf` throw `InvalidActorNameException`.
|
||||
That exception is not caught inside `SubscribeInstance`, so it escapes as an unhandled
|
||||
RPC fault (and after the `_streamSubscriber.Subscribe` / `_activeStreams` entry has
|
||||
already been set up for the duration, though the `finally` does not run because the
|
||||
throw is before the `try`). In practice central always supplies a GUID, so impact is
|
||||
low, but the server is trusting client-supplied input to be actor-name-safe.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Validate `request.CorrelationId` on entry (non-empty, matches an expected GUID/safe
|
||||
pattern) and reject with `StatusCode.InvalidArgument` otherwise; or derive the actor
|
||||
name solely from the internal `_actorCounter` and keep the correlation ID only as
|
||||
actor state / dictionary key.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Communication-015 — No test exercises the real gRPC client factory across a node flip
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs` |
|
||||
|
||||
**Description**
|
||||
|
||||
`DebugStreamBridgeActorTests` exercises the reconnect/failover paths through
|
||||
`MockSiteStreamGrpcClientFactory`, which returns one fixed mock client regardless of
|
||||
the `grpcEndpoint` argument. This is exactly the behaviour the *real*
|
||||
`SiteStreamGrpcClientFactory` exhibits incorrectly (Communication-012), so the mock
|
||||
masks the defect: `On_GrpcError_Reconnects_To_Other_Node` passes even though the real
|
||||
factory never reaches the other node. `SiteStreamGrpcClientFactoryTests` only asserts
|
||||
`GetOrCreate` returns the same client for the same site — it never checks what happens
|
||||
when the same site is requested with a *different* endpoint.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a `SiteStreamGrpcClientFactoryTests` case that calls `GetOrCreate(site, endpointA)`
|
||||
then `GetOrCreate(site, endpointB)` and asserts the second call targets `endpointB`
|
||||
(it should fail today and pass after Communication-012 is fixed). Have the bridge-actor
|
||||
test's mock factory track the endpoint per call so node-flip coverage is meaningful.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
Reference in New Issue
Block a user