7b0b9c7365
Solution + 23 src projects + 26 test projects renamed; folders, csproj, namespaces, and ScadaLinkDbContext/ScadaBridgeDbContext class updated. ActorSystem "scadalink" → "scadabridge", Akka seed-node URLs migrated. SQL roles/logins, LDAP domains, CLI command name, and CLI config dir (~/.scadalink → ~/.scadabridge) also renamed. Build green; 5 Host.Tests fail awaiting SQL login rename in next commit. Pre-existing StaleTagMonitor timing flakes unchanged. Rename script committed at tools/rename-to-scadabridge.sh.
1123 lines
62 KiB
Markdown
1123 lines
62 KiB
Markdown
# Code Review — Communication
|
|
|
|
| Field | Value |
|
|
|-------|-------|
|
|
| Module | `src/ZB.MOM.WW.ScadaBridge.Communication` |
|
|
| Design doc | `docs/requirements/Component-Communication.md` |
|
|
| Status | Reviewed |
|
|
| Last reviewed | 2026-05-28 |
|
|
| Reviewer | claude-agent |
|
|
| Commit reviewed | `1eb6e97` |
|
|
| Open findings | 2 |
|
|
|
|
## Summary
|
|
|
|
The Communication module is generally well-structured and matches the design doc's
|
|
two-transport model (ClusterClient for command/control, gRPC server-streaming for
|
|
real-time data). The actors keep mutable state on the actor thread, use `PipeTo` for
|
|
async work, and the gRPC server/client lifecycle is mostly disciplined. However the
|
|
review found several High and Medium issues clustered around two themes:
|
|
**(a) gRPC subscription bookkeeping races** — `SiteStreamGrpcClient` overwrites and
|
|
removes subscription entries by correlation ID without disposal or ownership checks,
|
|
so reconnect cycles leak `CancellationTokenSource`es and can cancel the wrong stream;
|
|
and **(b) missing supervision strategy** on the coordinator actors, contrary to the
|
|
CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is otherwise
|
|
good. Test coverage is broad for happy paths but has gaps around failover, cache
|
|
mutation races, and the snapshot-timeout cleanup path.
|
|
|
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
|
|
|
All prior findings (Communication-001..011) are confirmed `Resolved` in this commit
|
|
and the fixes hold up against the source. The re-review walked all 10 checklist
|
|
categories again and uncovered a previously-missed defect at the centre of the gRPC
|
|
node-failover path: **`SiteStreamGrpcClientFactory.GetOrCreate` caches one client per
|
|
site identifier and silently ignores the `grpcEndpoint` argument on a cache hit**. The
|
|
`DebugStreamBridgeActor` reconnect logic flips `_useNodeA` and passes the *other*
|
|
node's endpoint, but the factory hands back the original NodeA-bound client every
|
|
time — so the documented "try the other site node endpoint" failover never actually
|
|
moves to NodeB (Communication-012). The same caching defect means a site address
|
|
change is never picked up because `RemoveSiteAsync` has no production caller
|
|
(Communication-013). Two Low findings round out the re-review: an untrusted
|
|
gRPC-supplied `correlation_id` flows straight into an Akka actor name
|
|
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
|
|
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
|
|
|
|
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
|
|
|
All prior findings (Communication-001..015) remain `Resolved` in this commit. The
|
|
re-review walked all 10 checklist categories again on the surface that has not
|
|
been re-examined before — the central↔site command/control routing surface
|
|
(`CentralCommunicationActor`, `SiteCommunicationActor`) rather than the
|
|
previously-mined gRPC streaming surface — and uncovered a cluster of defects
|
|
around the connection-state-change workflow. The single material finding is
|
|
**`HandleConnectionStateChanged` is dead code**: no production code path emits
|
|
`ConnectionStateChanged`, so the documented "kill active debug streams for the
|
|
disconnected site" + "mark in-progress deployments as failed" workflow never
|
|
fires at runtime (Communication-016). The downstream consequence is
|
|
**`_inProgressDeployments` grows unboundedly** — entries are inserted on every
|
|
deployment but only cleaned via that dead path (Communication-017). Three
|
|
smaller items round out the re-review: site heartbeats hard-code
|
|
`IsActive: true` regardless of node role (Communication-018), the
|
|
60-second-periodic `LoadSiteAddressesFromDb` task has no CancellationToken so a
|
|
hung DB query has no upper bound (Communication-019), the
|
|
`SiteAddressCacheLoaded` internal message carries a mutable
|
|
`Dictionary`/`List` (Communication-020), `SiteStreamGrpcServer.SubscribeInstance`
|
|
leaks the StreamRelayActor if `_streamSubscriber.Subscribe` throws between
|
|
`ActorOf` and the `try` block (Communication-021), and `_debugSubscriptions`
|
|
keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
|
|
(Communication-022). Seven new findings, all Open: one High, one Medium, five
|
|
Low.
|
|
|
|
## Checklist coverage 2026-05-28
|
|
|
|
| # | Category | Examined | Notes |
|
|
|---|----------|----------|-------|
|
|
| 1 | Correctness & logic bugs | ✓ | `HandleConnectionStateChanged` and its `_inProgressDeployments` / `_debugSubscriptions` cleanup never fire — the connection-state workflow is dead (Communication-016, Communication-017). `_debugSubscriptions` correlation-ID overwrite risk (Communication-022). |
|
|
| 2 | Akka.NET conventions | ✓ | `SiteAddressCacheLoaded` carries mutable `Dictionary<string, List<string>>` — violates message-immutability convention (Communication-020). `Forward`/`PipeTo`/Sender-capture all clean. |
|
|
| 3 | Concurrency & thread safety | ✓ | All mutable state mutated on the actor thread. `_subscriptions` ConcurrentDictionary use disciplined. No new issues. |
|
|
| 4 | Error handling & resilience | ✓ | `LoadSiteAddressesFromDb` lacks a `CancellationToken` propagation point (Communication-019). `SubscribeInstance` leaks the relay actor if `Subscribe` throws pre-try (Communication-021). |
|
|
| 5 | Security | ✓ | Correlation-id validation in place (Communication-014). No new issues. |
|
|
| 6 | Performance & resource management | ✓ | `_inProgressDeployments` grows unboundedly (Communication-017). gRPC client/server lifecycles otherwise clean. |
|
|
| 7 | Design-document adherence | ✓ | `ConnectionStateChanged` handler is dead code — the doc-stated "kill streams on disconnect, fail in-progress deployments" workflow does not actually run (Communication-016). Site heartbeats always report `IsActive: true` regardless of role (Communication-018). |
|
|
| 8 | Code organization & conventions | ✓ | Options pattern correct; mapper placement and proto evolution are additive-only. No new issues. |
|
|
| 9 | Testing coverage | ✓ | `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled` exercises a code path that no production caller ever drives — gives false confidence (related to Communication-016). |
|
|
| 10 | Documentation & comments | ✓ | Detailed XML docs added in this commit. No new issues. |
|
|
|
|
## Checklist coverage
|
|
|
|
| # | Category | Examined | Notes |
|
|
|---|----------|----------|-------|
|
|
| 1 | Correctness & logic bugs | ✓ | Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved. |
|
|
| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies now present and verified. No new issues. |
|
|
| 3 | Concurrency & thread safety | ✓ | Subscription-map register/remove now ownership-checked. `_siteClients` readonly. No new issues. |
|
|
| 4 | Error handling & resilience | ✓ | `Status.Failure` handler added; reconnect unsubscribes prior stream. No new issues. |
|
|
| 5 | Security | ✓ | Re-review: public gRPC `correlation_id` flows unvalidated into an Akka actor name (Communication-014). |
|
|
| 6 | Performance & resource management | ✓ | Synchronous `Dispose` paths fixed; CTS leaks resolved. No new issues. |
|
|
| 7 | Design-document adherence | ✓ | Re-review: site gRPC address-change disposal not wired — `RemoveSiteAsync` is dead code (Communication-013). gRPC options now applied. |
|
|
| 8 | Code organization & conventions | ✓ | Options pattern correct; public records still declared in actor files (acceptable). No structural issues. |
|
|
| 9 | Testing coverage | ✓ | Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015). |
|
|
| 10 | Documentation & comments | ✓ | `DebugStreamBridgeActor` summary corrected. No new issues. |
|
|
|
|
## Findings
|
|
|
|
### Communication-001 — Early stream termination escapes StartStreamAsync's narrow exception handling
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Error handling & resilience |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/DebugStreamService.cs:130-143` |
|
|
|
|
**Re-triaged 2026-05-16:** originally filed Critical, claiming an orphaned bridge actor
|
|
and a multi-minute site-side resource leak on every snapshot timeout. On verification
|
|
that impact does **not** occur: `DebugStreamBridgeActor` calls `CleanupGrpc()` and
|
|
`Context.Stop(Self)` on every path that invokes `onTerminated` (site disconnect, gRPC
|
|
max-retries, `ReceiveTimeout`), so it always self-terminates and releases its gRPC
|
|
subscription; and the pure-timeout path does reach `StopStream`, which also stops it.
|
|
The genuine defect described below is an error-handling gap, not a leak — severity
|
|
corrected to Medium.
|
|
|
|
**Description**
|
|
|
|
`StartStreamAsync` awaits the initial snapshot inside a `try` whose only handler is
|
|
`catch (OperationCanceledException)`. When the stream terminates before the snapshot
|
|
arrives, `onTerminatedWrapper` completes the await via
|
|
`snapshotTcs.TrySetException(new InvalidOperationException(...))`. That
|
|
`InvalidOperationException` is not an `OperationCanceledException`, so it escapes the
|
|
catch entirely: the caller (Blazor debug view / SignalR hub) receives a raw,
|
|
untranslated exception, and `StartStreamAsync` performs no teardown of its own on that
|
|
path — it relies implicitly on the bridge actor self-terminating. Cleanup from the
|
|
service side is therefore not deterministic, and the failure surfaced to the caller is
|
|
not a meaningful, documented result.
|
|
|
|
**Recommendation**
|
|
|
|
In `StartStreamAsync`, catch any exception from the snapshot await, deterministically
|
|
tear down the bridge actor (`Tell(StopDebugStream)` via the local actor reference, since
|
|
a racing `onTerminatedWrapper` may already have removed the session entry), and translate
|
|
the failure into a meaningful exception for the caller.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16. The `catch (OperationCanceledException)`-only block in
|
|
`StartStreamAsync` was replaced with `catch (Exception)`: it removes the session entry,
|
|
sends `StopDebugStream` to the bridge actor via the local reference (idempotent — the
|
|
actor may already be stopping itself), and throws a descriptive exception —
|
|
`TimeoutException` for the 30s timeout, otherwise an `InvalidOperationException` that
|
|
names the instance/site and wraps the underlying cause. Regression test
|
|
`DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException`
|
|
fails against the pre-fix code and passes after. Fixed by the commit whose message
|
|
references `Communication-001`.
|
|
|
|
### Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Error handling & resilience |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:170`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:143` |
|
|
|
|
**Description**
|
|
|
|
On a gRPC stream error, `HandleGrpcError` increments the retry count, flips
|
|
`_useNodeA`, and schedules `OpenGrpcStream`. `OpenGrpcStream` cancels and disposes
|
|
`_grpcCts` and starts a fresh `SubscribeInstance` call — but it never calls
|
|
`client.Unsubscribe(_correlationId)` on the *old* node's client, and the site-side
|
|
`SiteStreamGrpcServer` keys active streams by `correlation_id` only. Because the new
|
|
subscription goes to the *other* node (`_useNodeA` flipped), the old node's
|
|
`SiteStreamGrpcServer` still has an active stream + `StreamRelayActor` +
|
|
`SiteStreamManager` subscription for that correlation ID. The old node only learns the
|
|
client is gone via TCP RST or keepalive — exactly the failure mode that triggered the
|
|
reconnect (network partition / silent node), so detection may take ~25s or never. Each
|
|
reconnect can therefore leave a zombie relay actor on the failed node. `CleanupGrpc`
|
|
(which *does* call `Unsubscribe`) is only invoked on terminal paths, not between
|
|
reconnect attempts.
|
|
|
|
**Recommendation**
|
|
|
|
Before reconnecting in `HandleGrpcError` / at the top of `OpenGrpcStream`, call
|
|
`Unsubscribe(_correlationId)` on the client for the *previous* endpoint (the one that
|
|
just failed) so the local CTS is cancelled and — where the channel is still alive —
|
|
the gRPC cancellation reaches the site and stops the relay actor.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against source:
|
|
`HandleGrpcError` flipped `_useNodeA` and scheduled `OpenGrpcStream` without ever
|
|
unsubscribing the failed stream, leaving the old node's `StreamRelayActor` zombie until
|
|
TCP/keepalive timeout. Fix: `HandleGrpcError` now resolves the client for the
|
|
*previous* endpoint (before flipping `_useNodeA`) and calls `Unsubscribe(_correlationId)`
|
|
on it, so the local CTS is cancelled and gRPC cancellation reaches the still-alive site.
|
|
Regression test `DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect`
|
|
fails against the pre-fix code and passes after.
|
|
|
|
### Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Concurrency & thread safety |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClient.cs:77`, `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClient.cs:106` |
|
|
|
|
**Description**
|
|
|
|
`SubscribeAsync` does `_subscriptions[correlationId] = cts;` (line 77),
|
|
unconditionally overwriting any existing entry for that correlation ID without
|
|
cancelling or disposing the previous `CancellationTokenSource`. The `finally` block
|
|
then does `_subscriptions.TryRemove(correlationId, out _)` (line 106) which removes
|
|
the entry **by key only, regardless of which CTS is stored**. Because
|
|
`DebugStreamBridgeActor` reuses the same `_correlationId` across reconnect attempts
|
|
(and `SiteStreamGrpcClientFactory` returns the same `SiteStreamGrpcClient` for a site
|
|
even after a node flip), two `SubscribeAsync` calls can briefly share a correlation
|
|
ID. The first call's `finally` then removes the *second* call's CTS entry, so a later
|
|
`Unsubscribe(correlationId)` finds nothing and the live stream is never cancelled — an
|
|
orphan. Conversely the overwritten CTS is leaked (never disposed).
|
|
|
|
**Recommendation**
|
|
|
|
When inserting, cancel+dispose any prior CTS for that correlation ID. In the `finally`,
|
|
remove only if the stored CTS is the one this call created (use the
|
|
`TryRemove(KeyValuePair)` overload, mirroring what `SiteStreamGrpcServer` already does
|
|
with `StreamEntry`). Consider keying subscriptions by a per-call GUID rather than the
|
|
caller-supplied correlation ID.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against source: the
|
|
inline `_subscriptions[correlationId] = cts` overwrote a prior CTS without
|
|
cancel/dispose (leak), and the `finally`'s `TryRemove(correlationId, out _)` removed by
|
|
key only — a racing reconnect's live CTS could be removed by the prior call's `finally`,
|
|
orphaning the live stream. Fix: extracted two internal helpers used by `SubscribeAsync`
|
|
— `RegisterSubscription` cancels+disposes any existing CTS for the correlation ID before
|
|
inserting, and `RemoveSubscription` uses the `ConcurrentDictionary.TryRemove(KeyValuePair)`
|
|
overload so it removes only the CTS that call created (mirroring `SiteStreamGrpcServer`'s
|
|
`StreamEntry` pattern). Regression tests
|
|
`SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts`
|
|
and `SiteStreamGrpcClientTests.RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement`
|
|
fail against the pre-fix logic and pass after.
|
|
|
|
### Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Akka.NET conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:42`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/SiteCommunicationActor.cs:22` |
|
|
|
|
**Description**
|
|
|
|
CLAUDE.md ("Explicit supervision strategies: Resume for coordinator actors, Stop for
|
|
short-lived execution actors") requires coordinator actors to use an explicit `Resume`
|
|
supervision strategy. `CentralCommunicationActor` and `SiteCommunicationActor` are
|
|
long-lived coordinators (they own the per-site ClusterClient map, debug
|
|
subscriptions, in-progress deployments) but neither overrides `SupervisorStrategy`.
|
|
They fall back to the Akka default (`OneForOneStrategy` with `Restart`). A child fault
|
|
— e.g. a `ClusterClient` child of `CentralCommunicationActor` created by
|
|
`DefaultSiteClientFactory` — would `Restart` under the default strategy, and any
|
|
exception in the coordinator itself would restart it, wiping `_siteClients`,
|
|
`_debugSubscriptions`, and `_inProgressDeployments` silently. The design intent is
|
|
`Resume` so transient child faults do not discard coordinator state.
|
|
|
|
**Recommendation**
|
|
|
|
Override `SupervisorStrategy` on both actors to return an explicit
|
|
`OneForOneStrategy` with `Directive.Resume` (or the project's standard coordinator
|
|
strategy), matching the documented decision and other coordinator actors.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit pending). Root cause confirmed: neither
|
|
`CentralCommunicationActor` nor `SiteCommunicationActor` overrode `SupervisorStrategy`,
|
|
so child faults fell back to the Akka default (`Restart`). Note that an actor's own
|
|
`SupervisorStrategy` governs its *children* — a transient child fault would `Restart`
|
|
the child and discard its in-memory state, contrary to the CLAUDE.md "Resume for
|
|
coordinator actors" decision. Fix: both actors now override `SupervisorStrategy()` to
|
|
return a `OneForOneStrategy` with an unbounded `Decider` resolving to `Directive.Resume`
|
|
(mirroring `DataConnectionManagerActor`). Regression tests
|
|
`CoordinatorSupervisionTests.CentralCommunicationActor_SupervisorStrategy_IsResume` and
|
|
`CoordinatorSupervisionTests.SiteCommunicationActor_SupervisorStrategy_IsResume` fail
|
|
against the pre-fix code (decider yields `Restart`) and pass after.
|
|
|
|
### Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClient.cs:25`, `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationOptions.cs:36` |
|
|
|
|
**Description**
|
|
|
|
`CommunicationOptions` exposes `GrpcKeepAlivePingDelay`, `GrpcKeepAlivePingTimeout`,
|
|
`GrpcMaxStreamLifetime`, and `GrpcMaxConcurrentStreams`, and the design doc's
|
|
"gRPC Connection Keepalive" section explicitly states these are configurable. However
|
|
`SiteStreamGrpcClient`'s constructor hard-codes `KeepAlivePingDelay =
|
|
TimeSpan.FromSeconds(15)` and `KeepAlivePingTimeout = TimeSpan.FromSeconds(10)`
|
|
instead of reading the options. `GrpcMaxStreamLifetime` (the documented "Session
|
|
timeout — 4 hours" third layer of dead-client detection) is not referenced anywhere
|
|
— `SiteStreamGrpcServer.SubscribeInstance` creates a linked CTS from the call
|
|
cancellation token only, with no `CancelAfter`. The 4-hour zombie-stream safety net
|
|
described in the design doc does not exist in code. `GrpcMaxConcurrentStreams` is also
|
|
not wired to the server (`SiteStreamGrpcServer` takes a `maxConcurrentStreams`
|
|
constructor parameter defaulting to 100, but nothing binds the option to it).
|
|
|
|
**Recommendation**
|
|
|
|
Flow `CommunicationOptions` into `SiteStreamGrpcClient` and `SiteStreamGrpcServer`
|
|
(via the factory / DI). Apply `GrpcKeepAlivePingDelay` / `GrpcKeepAlivePingTimeout` to
|
|
the `SocketsHttpHandler`, bind `GrpcMaxConcurrentStreams` to the server's limit, and
|
|
implement the `GrpcMaxStreamLifetime` session timeout with `CancelAfter` on the
|
|
server-side stream CTS — or, if the 4-hour cap is intentionally dropped, remove the
|
|
option and update the design doc.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit pending). Root cause confirmed: `SiteStreamGrpcClient`
|
|
hard-coded the keepalive values, `GrpcMaxStreamLifetime` was referenced nowhere, and
|
|
`GrpcMaxConcurrentStreams` was never bound to the server. Fix (scoped to
|
|
`src/ZB.MOM.WW.ScadaBridge.Communication`): `SiteStreamGrpcClient` gained a constructor taking
|
|
`CommunicationOptions` and now applies `GrpcKeepAlivePingDelay`/`GrpcKeepAlivePingTimeout`
|
|
to its `SocketsHttpHandler`; `SiteStreamGrpcClientFactory` gained an
|
|
`IOptions<CommunicationOptions>` DI constructor and flows the options into every client
|
|
it creates; `SiteStreamGrpcServer` gained an `IOptions<CommunicationOptions>` DI
|
|
constructor that binds `GrpcMaxConcurrentStreams` and implements the documented 4-hour
|
|
session timeout via `CancellationTokenSource.CancelAfter(GrpcMaxStreamLifetime)` on the
|
|
per-stream CTS. The Host's existing `AddSingleton<SiteStreamGrpcServer>()` registration
|
|
resolves the new DI constructor via greedy resolution — no Host change required.
|
|
Regression tests `GrpcOptionsWiringTests.SiteStreamGrpcClient_AppliesKeepAliveFromOptions`,
|
|
`GrpcOptionsWiringTests.SiteStreamGrpcClientFactory_FlowsOptionsToCreatedClients`, and
|
|
`GrpcOptionsWiringTests.SiteStreamGrpcServer_BindsMaxConcurrentStreamsAndLifetimeFromOptions`
|
|
exercise the wiring (they require the new members to even compile).
|
|
|
|
### Communication-006 — Site address load failures are silently swallowed, leaving a stale cache
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Error handling & resilience |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:204` |
|
|
|
|
**Description**
|
|
|
|
`LoadSiteAddressesFromDb` runs the repository query inside `Task.Run(...).PipeTo(self)`.
|
|
If `GetAllSitesAsync` throws (database unavailable, transient connection error), the
|
|
faulted task is piped to `Self` as a `Status.Failure`. `CentralCommunicationActor` has
|
|
no `Receive<Status.Failure>` handler, so the failure becomes an unhandled message
|
|
(logged at debug, not surfaced) and the periodic refresh silently fails. If the
|
|
*first* startup load fails the actor runs with an empty `_siteClients` map — every
|
|
`SiteEnvelope` is dropped (line 187) and every Ask times out with no indication of the
|
|
root cause.
|
|
|
|
**Recommendation**
|
|
|
|
Add a `Receive<Status.Failure>` handler that logs the load failure at Warning/Error
|
|
level so operators can distinguish "site has no addresses configured" from "database
|
|
is down". Optionally surface a health metric for repeated load failures.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit pending). Root cause confirmed: a faulted
|
|
`LoadSiteAddressesFromDb` task is piped to `Self` as a `Status.Failure`, but the actor
|
|
had no handler for it — the failure became an unhandled message (debug-level only) and
|
|
the periodic refresh failed silently. Fix: added a `Receive<Status.Failure>` handler
|
|
that logs the load failure at `Warning` with the underlying exception as the cause, so
|
|
operators can distinguish a missing-addresses configuration from a database outage.
|
|
Regression test
|
|
`CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed`
|
|
(repository query throws) asserts the Warning is emitted — it produces no warning
|
|
against the pre-fix code and passes after.
|
|
|
|
### Communication-007 — `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async)
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Performance & resource management |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53` |
|
|
|
|
**Description**
|
|
|
|
`Dispose()` calls `DisposeAsync().AsTask().GetAwaiter().GetResult()`. This is the
|
|
classic sync-over-async pattern: it blocks the calling thread until all per-site
|
|
`SiteStreamGrpcClient.DisposeAsync` calls complete. If `Dispose` is invoked from a
|
|
context with a single-threaded synchronization context or from DI container shutdown
|
|
on a constrained thread pool, this can deadlock or stall host shutdown. The class
|
|
already implements `IAsyncDisposable`.
|
|
|
|
**Recommendation**
|
|
|
|
Prefer registering and disposing the factory through `IAsyncDisposable` only (modern
|
|
.NET DI honours it for singletons). If a synchronous `Dispose` must remain, dispose
|
|
the underlying `GrpcChannel`s directly (synchronous) rather than blocking on the async
|
|
path, or document why blocking is safe here.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit pending). Root cause confirmed: `Dispose()` called
|
|
`DisposeAsync().AsTask().GetAwaiter().GetResult()`, the classic sync-over-async pattern.
|
|
Fix: `SiteStreamGrpcClient` now also implements `IDisposable` with a synchronous
|
|
`Dispose()` that releases its CancellationTokenSources and underlying `GrpcChannel`
|
|
directly (all of that teardown is inherently synchronous); `SiteStreamGrpcClientFactory.Dispose()`
|
|
now disposes each cached client via that synchronous path with no blocking on the async
|
|
path. A `CreateClient` seam was extracted so the test can substitute a tracking client
|
|
while still exercising the factory's real caching/disposal machinery. Regression test
|
|
`SiteStreamGrpcClientFactoryDisposeTests.Dispose_DisposesClientsSynchronously_NotViaAsyncPath`
|
|
fails against the pre-fix code (clients disposed via `DisposeAsync`) and passes after;
|
|
`Dispose_DoesNotDeadlock_UnderSingleThreadedSynchronizationContext` guards the stall path.
|
|
|
|
### Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:71`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:174` |
|
|
|
|
**Description**
|
|
|
|
`_retryCount` is reset to 0 every time a single `AttributeValueChanged` or
|
|
`AlarmStateChanged` event is received (lines 72, 77). Combined with `MaxRetries = 3`,
|
|
a stream that connects, delivers exactly one event, then fails — repeatedly — will
|
|
reconnect forever. The design doc states "max 3 retries, terminate the session if all
|
|
retries fail"; the current logic only terminates after 3 *consecutive* failures with
|
|
zero intervening events, so a flapping site never trips the limit and the debug
|
|
session (and its site-side relay) lives on indefinitely. The `ReceiveTimeout` orphan
|
|
net is also reset by every received message, so it does not bound this case either.
|
|
|
|
**Recommendation**
|
|
|
|
Either reset `_retryCount` only after the stream has been stably connected for some
|
|
minimum duration (e.g. a timer armed on stream open, cancelled on the next error), or
|
|
keep a separate cumulative reconnect counter / time window that bounds total
|
|
reconnects regardless of intervening events.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit pending). Root cause confirmed: `_retryCount` was reset to
|
|
0 on every received `AttributeValueChanged`/`AlarmStateChanged`, so a stream that
|
|
connected, delivered one event, then failed — repeatedly — never tripped `MaxRetries`.
|
|
Fix (recommendation option a): the per-event reset was removed; instead `OpenGrpcStream`
|
|
arms a single `StabilityWindow` timer (60s default, internal-settable for tests), and
|
|
only when it fires (`GrpcStreamStable`) — i.e. the stream stayed up long enough to be
|
|
considered recovered — is `_retryCount` reset. `HandleGrpcError` cancels that timer, so
|
|
a stream that fails before the window elapses does not recover its retry budget. A
|
|
flapping stream therefore terminates after `MaxRetries` regardless of intervening
|
|
events. Regression test
|
|
`DebugStreamBridgeActorTests.FlappingStream_DeliveringEventsBetweenFailures_StillTerminatesAfterMaxRetries`
|
|
fails against the pre-fix code (actor never terminates) and passes after;
|
|
`RetryCount_RecoveredOnlyAfterStreamStaysStableForStabilityWindow` verifies the budget
|
|
is recovered after a stable interval. The pre-existing test that codified the buggy
|
|
per-event reset (`Grpc_Error_Resets_RetryCount_On_Successful_Event`) was replaced.
|
|
|
|
### Communication-009 — `_siteClients` field is mutable and reassignable; cache update is not atomic on failure
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Concurrency & thread safety |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:53`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:240` |
|
|
|
|
**Description**
|
|
|
|
`_siteClients` is a non-`readonly` `Dictionary` field. It is only mutated on the actor
|
|
thread (correct), but the field is needlessly reassignable, and
|
|
`HandleSiteAddressCacheLoaded` mutates it in place across several loops. If
|
|
`ActorPath.Parse` throws on a malformed address mid-loop (e.g. a site row with a
|
|
garbage `NodeAAddress`), the method aborts partway through, having already stopped
|
|
some ClusterClients and added others — leaving the cache partially updated with no
|
|
recovery until the next 60s refresh. The other actor mutable collections
|
|
(`_debugSubscriptions`, `_inProgressDeployments`) are correctly `readonly`.
|
|
|
|
**Recommendation**
|
|
|
|
Mark `_siteClients` `readonly`. Validate/parse all addresses up front (or wrap
|
|
`ActorPath.Parse` in a try/catch that logs and skips the bad site) so a single
|
|
malformed site record cannot abort the whole refresh and leave a half-updated cache.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit pending). Root cause confirmed: `_siteClients` was a
|
|
non-`readonly` field, and `HandleSiteAddressCacheLoaded`'s add/update loop called
|
|
`ActorPath.Parse` per site with no guard — a malformed `NodeAAddress` threw mid-loop and
|
|
aborted the refresh, leaving the cache half-updated until the next 60s cycle. Fix:
|
|
`_siteClients` is now `readonly`, and the per-site `ActorPath.Parse` is wrapped in a
|
|
try/catch that logs the bad address at Warning and `continue`s to the next site, so a
|
|
single garbage row cannot starve other sites of their ClusterClient. Regression test
|
|
`CentralCommunicationActorTests.MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered`
|
|
(bad site ordered before a good one) fails against the pre-fix code (good site never
|
|
registered) and passes after.
|
|
|
|
### Communication-010 — `DebugStreamBridgeActor` XML doc incorrectly describes it as a "Persistent actor"
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:10` |
|
|
|
|
**Description**
|
|
|
|
The class summary opens with "Persistent actor (one per active debug session)...".
|
|
The actor derives from `ReceiveActor`, not a persistent actor base class, holds no
|
|
`PersistenceId`, and writes no journal/snapshot. "Persistent" is misleading — debug
|
|
sessions are explicitly "session-based and temporary" per the design doc. A reader
|
|
could assume state survives restart, which it does not.
|
|
|
|
**Recommendation**
|
|
|
|
Reword the summary to "Long-lived (per active debug session) actor on the central
|
|
side..." or similar, removing the word "Persistent".
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit pending). Root cause confirmed: the class summary opened
|
|
with "Persistent actor (one per active debug session)..." but the actor derives from
|
|
`ReceiveActor`, holds no `PersistenceId`, and writes no journal/snapshot. Fix
|
|
(documentation only — no behaviour change, so no regression test): the summary was
|
|
reworded to "Long-lived (one per active debug session) actor on the central side. Debug
|
|
sessions are session-based and temporary — this actor holds no persisted state and does
|
|
not derive from an Akka.Persistence base class; its state does not survive a restart."
|
|
|
|
### Communication-011 — No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Testing coverage |
|
|
| Status | Resolved |
|
|
| Location | `tests/ZB.MOM.WW.ScadaBridge.Communication.Tests/` (module-wide) |
|
|
|
|
**Description**
|
|
|
|
The test suite covers happy-path routing, handler-not-registered failures, heartbeat
|
|
bumping, cache refresh, and gRPC bridge reconnect/retry. However several critical
|
|
paths identified in this review have no coverage:
|
|
|
|
- The `DebugStreamService.StartStreamAsync` snapshot-timeout path (Communication-001)
|
|
— no test verifies bridge actor / site subscription teardown on timeout, nor the
|
|
`onTerminated`-before-snapshot race that throws a non-`OperationCanceledException`.
|
|
- `CentralCommunicationActor` behaviour when `LoadSiteAddressesFromDb` faults
|
|
(Communication-006) — `RefreshSiteAddresses_UpdatesCache` only exercises success.
|
|
- `SiteStreamGrpcClient` subscription-map overwrite/removal race (Communication-003)
|
|
and gRPC reconnect not unsubscribing the old node (Communication-002).
|
|
- A malformed `NodeAAddress` aborting `HandleSiteAddressCacheLoaded` (Communication-009).
|
|
|
|
**Recommendation**
|
|
|
|
Add tests for: snapshot timeout / pre-snapshot termination cleanup; address-load
|
|
failure logging and empty-cache behaviour; reusing a correlation ID across
|
|
`SubscribeAsync` calls; and a malformed site address during cache refresh.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit pending). This is a meta-coverage finding; every gap it
|
|
enumerates is now covered by a regression test (each fails against its pre-fix code and
|
|
passes after):
|
|
- Snapshot timeout / pre-snapshot termination (Communication-001) —
|
|
`DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException`.
|
|
- gRPC reconnect not unsubscribing the old node (Communication-002) —
|
|
`DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect`.
|
|
- `SiteStreamGrpcClient` subscription-map overwrite/removal race (Communication-003) —
|
|
`SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts`
|
|
and `RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement`.
|
|
- `LoadSiteAddressesFromDb` fault (Communication-006) —
|
|
`CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed`.
|
|
- Malformed `NodeAAddress` aborting `HandleSiteAddressCacheLoaded` (Communication-009) —
|
|
`CentralCommunicationActorTests.MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered`
|
|
(added with this finding's resolution).
|
|
The full module suite (`dotnet test tests/ZB.MOM.WW.ScadaBridge.Communication.Tests`) is green at
|
|
111 passing tests.
|
|
|
|
### Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:166` |
|
|
|
|
**Description**
|
|
|
|
`SiteStreamGrpcClientFactory.GetOrCreate` is `_clients.GetOrAdd(siteIdentifier, _ =>
|
|
CreateClient(grpcEndpoint))` — it keys the cache by **site identifier only** and the
|
|
`grpcEndpoint` argument is used *exclusively* for the first-ever creation. Every
|
|
subsequent call for that site returns the originally-cached `SiteStreamGrpcClient`,
|
|
which is permanently bound to the `GrpcChannel` of whatever endpoint was passed first.
|
|
|
|
`DebugStreamBridgeActor` relies on the opposite behaviour. On a gRPC stream error,
|
|
`HandleGrpcError` flips `_useNodeA` and `OpenGrpcStream` recomputes
|
|
`endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress`, then calls
|
|
`_grpcFactory.GetOrCreate(_siteIdentifier, endpoint)` expecting a client connected to
|
|
the *other* node. Because the factory ignores the new endpoint, the bridge actor
|
|
reconnects to the **same failed NodeA endpoint** on every retry. The design doc's
|
|
core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if
|
|
NodeA failed, or vice versa") is therefore inoperative — when a site node goes down,
|
|
the debug stream cannot move to the surviving node and simply exhausts `MaxRetries`
|
|
against the dead endpoint and terminates. The `_useNodeA` flip, the `previousEndpoint`
|
|
computation in `HandleGrpcError`, and the `CleanupGrpc` endpoint selection are all
|
|
dead logic. (Communication-002's `Unsubscribe`-before-reconnect fix still functions,
|
|
but it unsubscribes and re-subscribes on the *same* client/node rather than the
|
|
intended other node.)
|
|
|
|
**Recommendation**
|
|
|
|
Make the per-site client aware of both endpoints, or key the cache by
|
|
`(siteIdentifier, endpoint)`, or have `GetOrCreate` detect an endpoint change and
|
|
dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if
|
|
NodeA connection fails"), the cleanest fix is to give `SiteStreamGrpcClient` (or a
|
|
per-site holder) both NodeA/NodeB addresses and let it switch channels internally,
|
|
removing the endpoint argument from `GetOrCreate` entirely. Add a test that drives a
|
|
real `SiteStreamGrpcClientFactory` through a node flip and asserts the second client
|
|
targets the other endpoint.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-17 (commit pending). Root cause confirmed against source:
|
|
`GetOrCreate` was `_clients.GetOrAdd(siteIdentifier, …)` — keyed by site identifier
|
|
only, so the `grpcEndpoint` argument was honoured solely on first creation and the
|
|
NodeA→NodeB flip reconnected to the dead endpoint forever. Fix:
|
|
`SiteStreamGrpcClient` now exposes its bound `Endpoint`, and `GetOrCreate` compares
|
|
the cached client's endpoint against the requested one — on a mismatch it atomically
|
|
installs (via `ConcurrentDictionary.AddOrUpdate`) a fresh client for the new endpoint
|
|
and disposes the stale one, so a node flip actually moves to the surviving node.
|
|
Regression tests `SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint`
|
|
and `GetOrCreate_SameEndpoint_DoesNotDisposeOrRecreate` fail against the pre-fix
|
|
factory and pass after.
|
|
|
|
### Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58` |
|
|
|
|
**Description**
|
|
|
|
The design doc states that `SiteStreamGrpcClientFactory` "Disposes clients on site
|
|
removal or address change." `RemoveSiteAsync` implements the disposal mechanism, but
|
|
a repo-wide search finds **no production caller** — only tests invoke it. Combined
|
|
with the cache-by-site-identifier behaviour (Communication-012), the consequence is
|
|
that once a site's `SiteStreamGrpcClient` is created, a later edit to that site's
|
|
`GrpcNodeAAddress` / `GrpcNodeBAddress` (via the Central UI or CLI) is never reflected
|
|
in the cached client — it keeps using the stale channel for the life of the process.
|
|
`CentralCommunicationActor` already refreshes the *Akka* address cache every 60s and
|
|
recreates ClusterClients on change, but there is no equivalent invalidation path
|
|
wired into the gRPC client factory. A site whose gRPC endpoints are corrected after
|
|
an initial misconfiguration will never have working debug streaming until the central
|
|
node is restarted.
|
|
|
|
**Recommendation**
|
|
|
|
Wire a site-removal / address-change signal into `SiteStreamGrpcClientFactory` —
|
|
e.g. have `CentralCommunicationActor` (which already detects address changes in
|
|
`HandleSiteAddressCacheLoaded`) call `RemoveSiteAsync` for sites whose gRPC addresses
|
|
changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If
|
|
the on-the-fly address-change requirement is intentionally dropped, remove
|
|
`RemoveSiteAsync` and correct the design doc.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-17 (commit pending). The address-*change* staleness — the primary
|
|
impact ("never have working debug streaming until the central node is restarted") —
|
|
is fixed in-module by the Communication-012 change: `GetOrCreate` is now
|
|
endpoint-change-aware, so the next time `DebugStreamBridgeActor` requests a stream
|
|
with a corrected `GrpcNodeAAddress`/`GrpcNodeBAddress` the stale cached client is
|
|
disposed and replaced — no central restart needed and no external wiring required.
|
|
`RemoveSiteAsync` is retained as the disposal path for full site *removal* (a deleted
|
|
site record) and its doc comment now states that role explicitly; wiring a
|
|
delete-site callback belongs to the site-management flow in another module and is out
|
|
of this module's scope. Regression test
|
|
`SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_DisposesPriorClient`
|
|
fails against the pre-fix factory (stale client never disposed/replaced) and passes
|
|
after.
|
|
|
|
### Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Security |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcServer.cs:124` |
|
|
|
|
**Description**
|
|
|
|
`SubscribeInstance` is a public gRPC endpoint hosted on each site node. It creates the
|
|
relay actor with `$"stream-relay-{request.CorrelationId}-{actorSeq}"` as the actor
|
|
name, where `request.CorrelationId` comes straight off the wire. Akka actor names have
|
|
a restricted character set; a `correlation_id` containing `/`, whitespace, or other
|
|
disallowed characters makes `ActorSystem.ActorOf` throw `InvalidActorNameException`.
|
|
That exception is not caught inside `SubscribeInstance`, so it escapes as an unhandled
|
|
RPC fault (and after the `_streamSubscriber.Subscribe` / `_activeStreams` entry has
|
|
already been set up for the duration, though the `finally` does not run because the
|
|
throw is before the `try`). In practice central always supplies a GUID, so impact is
|
|
low, but the server is trusting client-supplied input to be actor-name-safe.
|
|
|
|
**Recommendation**
|
|
|
|
Validate `request.CorrelationId` on entry (non-empty, matches an expected GUID/safe
|
|
pattern) and reject with `StatusCode.InvalidArgument` otherwise; or derive the actor
|
|
name solely from the internal `_actorCounter` and keep the correlation ID only as
|
|
actor state / dictionary key.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-17 (commit pending). Root cause confirmed: `SubscribeInstance` fed
|
|
the off-the-wire `request.CorrelationId` straight into the `stream-relay-…` actor
|
|
name, so an id with `/`, whitespace, or other disallowed characters made `ActorOf`
|
|
throw `InvalidActorNameException` as an unhandled RPC fault. Fix: `SubscribeInstance`
|
|
now validates `CorrelationId` on entry — rejecting null/empty or any value failing
|
|
`ActorPath.IsValidPathElement` with `StatusCode.InvalidArgument` before any actor or
|
|
subscription state is created. Regression test
|
|
`SiteStreamGrpcServerTests.RejectsCorrelationIdThatIsNotActorNameSafe` (theory:
|
|
`/`-bearing, whitespace, empty, `$`-prefixed ids) fails against the pre-fix server
|
|
and passes after; `AcceptsActorNameSafeCorrelationId` confirms a normal GUID is still
|
|
accepted.
|
|
|
|
### Communication-015 — No test exercises the real gRPC client factory across a node flip
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Testing coverage |
|
|
| Status | Resolved |
|
|
| Location | `tests/ZB.MOM.WW.ScadaBridge.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ZB.MOM.WW.ScadaBridge.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs` |
|
|
|
|
**Description**
|
|
|
|
`DebugStreamBridgeActorTests` exercises the reconnect/failover paths through
|
|
`MockSiteStreamGrpcClientFactory`, which returns one fixed mock client regardless of
|
|
the `grpcEndpoint` argument. This is exactly the behaviour the *real*
|
|
`SiteStreamGrpcClientFactory` exhibits incorrectly (Communication-012), so the mock
|
|
masks the defect: `On_GrpcError_Reconnects_To_Other_Node` passes even though the real
|
|
factory never reaches the other node. `SiteStreamGrpcClientFactoryTests` only asserts
|
|
`GetOrCreate` returns the same client for the same site — it never checks what happens
|
|
when the same site is requested with a *different* endpoint.
|
|
|
|
**Recommendation**
|
|
|
|
Add a `SiteStreamGrpcClientFactoryTests` case that calls `GetOrCreate(site, endpointA)`
|
|
then `GetOrCreate(site, endpointB)` and asserts the second call targets `endpointB`
|
|
(it should fail today and pass after Communication-012 is fixed). Have the bridge-actor
|
|
test's mock factory track the endpoint per call so node-flip coverage is meaningful.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-17 (commit pending). `SiteStreamGrpcClientFactoryTests` gained
|
|
`GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint` and
|
|
`GetOrCreate_EndpointChanged_DisposesPriorClient`, which drive the *real*
|
|
`SiteStreamGrpcClientFactory` across a node flip and assert the second call yields a
|
|
client bound to the new endpoint with the stale one disposed — both fail against the
|
|
pre-fix factory and pass after (the Communication-012 fix). `DebugStreamBridgeActorTests`
|
|
gained `On_GrpcError_Reconnects_To_Other_Node_Endpoint`, which uses a new
|
|
`EndpointTrackingGrpcClientFactory` test double that hands out a distinct mock client
|
|
per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's
|
|
NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather
|
|
than being masked by an endpoint-agnostic mock.
|
|
|
|
### Communication-016 — `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:169`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:338-375` |
|
|
|
|
**Resolution** — deleted the dead code path in favour of the keepalive-based
|
|
detection that is the actual production behaviour: removed the
|
|
`Receive<ConnectionStateChanged>` handler, the `HandleConnectionStateChanged`
|
|
method, the `_debugSubscriptions` / `_inProgressDeployments` tracking dicts
|
|
+ the `TrackMessageForCleanup` helper that fed them, and the dead message
|
|
record `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Communication/ConnectionStateChanged.cs`.
|
|
The two dead tests (`ConnectionLost_DebugStreamsKilled` in
|
|
CentralCommunicationActorTests, `RoundTrip_ConnectionStateChanged_Succeeds`
|
|
in CompatibilityTests) were removed alongside. The design doc
|
|
`docs/requirements/Component-Communication.md` "Connection Failure Behavior"
|
|
section was updated to state explicitly that disconnect is detected at the
|
|
transport layer (gRPC keepalive PING ~25 s for debug streams; Ask-timeout
|
|
at the CommunicationService layer for command/control), with no
|
|
application-level signal. `DebugStreamTerminated` survives because
|
|
`DebugStreamBridgeActor` uses it for an unrelated intra-actor stop signal.
|
|
|
|
**Description**
|
|
|
|
`CentralCommunicationActor.HandleConnectionStateChanged` is wired to
|
|
`Receive<ConnectionStateChanged>` and implements two important workflows on
|
|
`IsConnected == false`: (1) kill every active debug stream for the disconnected
|
|
site (`_debugSubscriptions` walk → `DebugStreamTerminated` Tell to each
|
|
subscriber); (2) mark every in-progress deployment for that site as failed
|
|
(`_inProgressDeployments` walk → entry removal). Both are documented in the
|
|
component design doc's "Connection Failure Behavior" section and in WP-5 of the
|
|
work plan referenced in the class's own XML doc comment.
|
|
|
|
A repo-wide search (`grep -rn ConnectionStateChanged src/ tests/`) shows **no
|
|
production code ever emits `ConnectionStateChanged`**. The only producers are
|
|
the unit test `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
|
|
(line 137) and the Commons message-roundtrip test. The
|
|
`CentralCommunicationActor` therefore never receives one in production, the
|
|
disconnect-cleanup workflow never fires, and `_debugSubscriptions` /
|
|
`_inProgressDeployments` are never pruned via this path.
|
|
|
|
Concrete consequences:
|
|
- A site goes down → its active debug streams do **not** get a synchronous
|
|
`DebugStreamTerminated` notification from central. The bridge actor must
|
|
detect the disconnect itself via gRPC keepalive timing out (~25s) or TCP RST.
|
|
Subscribers wait that long for the `OnStreamTerminated` callback instead of
|
|
the documented "immediately killed by central" behaviour.
|
|
- In-progress deployments to a disconnected site continue to occupy the
|
|
Ask-reply window and only fail when the Ask times out at the
|
|
`CommunicationService.DeployInstanceAsync` layer (120s). They are never
|
|
proactively marked failed.
|
|
- The unit test gives a strong false impression that the workflow works — it
|
|
exercises a code path that has no production caller.
|
|
|
|
The design doc and CLAUDE.md mention "ClusterClient handles failover between
|
|
NodeA and NodeB internally — there is no application-level NodeA preference /
|
|
NodeB fallback logic" — so the ClusterClient mechanism is the documented
|
|
failover transport. But that says nothing about *signalling* a fully-down
|
|
remote cluster to central's coordinator actor, which is exactly what
|
|
`ConnectionStateChanged` was meant to do.
|
|
|
|
**Recommendation**
|
|
|
|
Pick one of:
|
|
- Wire a producer for `ConnectionStateChanged` — e.g. subscribe to
|
|
`ClusterClient`'s contact-point/cluster events (`ClusterClient.ContactPoints`
|
|
Refresh / `ContactPointAdded` / `ContactPointRemoved`) or watch the
|
|
ClusterClient actor for a "no contact points reachable" state — and have it
|
|
publish `ConnectionStateChanged` to `Self` on each transition.
|
|
- If the documented "synchronously kill streams on disconnect" behaviour is
|
|
intentionally being dropped in favour of the slower keepalive-based
|
|
detection, delete the handler, the `ConnectionStateChanged` record, and the
|
|
related `_debugSubscriptions` / `_inProgressDeployments` tracking, then
|
|
update the design doc's "Connection Failure Behavior" section accordingly.
|
|
|
|
Either way, replace `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
|
|
— at present it asserts a behaviour that no production code triggers.
|
|
|
|
---
|
|
|
|
### Communication-017 — `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Performance & resource management |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:73`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:501`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:357-367` |
|
|
|
|
**Resolution (2026-05-28):** Closed by Comm-016 — field removed in commit ac96b83.
|
|
The `_inProgressDeployments` dictionary, the `TrackMessageForCleanup` helper,
|
|
and the `HandleConnectionStateChanged` handler that consumed them were all
|
|
deleted as part of Comm-016's dead-code purge. A grep for `_inProgressDeployments`
|
|
in `CentralCommunicationActor.cs` finds only the explanatory comment block at
|
|
line 63-74 that documents the removal. There is no longer any unbounded-growth
|
|
hazard — the field does not exist.
|
|
|
|
**Description**
|
|
|
|
`TrackMessageForCleanup` inserts `_inProgressDeployments[deploy.DeploymentId] =
|
|
envelope.SiteId` on every `DeployInstanceCommand` routed to a site (line 501).
|
|
The only places that *remove* from `_inProgressDeployments` are:
|
|
- `HandleConnectionStateChanged` on `IsConnected == false` (line 366) — which
|
|
per Communication-016 never fires in production.
|
|
- `PostStop` (line 553) — only on actor death (central failover).
|
|
|
|
There is **no removal on the normal happy path** — neither when the site replies
|
|
`DeploymentStatusResponse` (the reply goes to the Ask's temporary reply actor,
|
|
not back through `CentralCommunicationActor`), nor on Ask timeout. Every
|
|
successful or failed deployment leaves its entry behind for the lifetime of the
|
|
process.
|
|
|
|
Memory impact is modest (each entry is ~70-100 bytes), but the dictionary grows
|
|
monotonically. Over months of operation across all sites a central node could
|
|
accumulate tens of thousands of entries — a real, observable leak. More
|
|
seriously, the field is *also* the source-of-truth set the
|
|
`HandleConnectionStateChanged` walk uses to fail in-progress deployments, so
|
|
even if a `ConnectionStateChanged` *were* fired today, the walk would
|
|
"fail" thousands of already-completed deployments and Tell their (now stale)
|
|
correlation-IDs into the void.
|
|
|
|
`_debugSubscriptions` (line 67) shares the same shape — but a normal debug
|
|
session ends with an `UnsubscribeDebugViewRequest` that *does* drive cleanup
|
|
(line 497), so leaks are only realised when a consumer crashes without
|
|
unsubscribing.
|
|
|
|
**Recommendation**
|
|
|
|
Either remove `_inProgressDeployments` entirely (it has no other consumer once
|
|
Communication-016 is fixed by deletion) or, if the disconnect-cleanup workflow
|
|
is retained, add a removal hook on the reply path. The simplest fix is to
|
|
subscribe `CentralCommunicationActor` to the Ask reply: route
|
|
`DeployInstanceCommand` through the actor with the actor as the Ask sender,
|
|
forward the reply to the original caller, and `_inProgressDeployments.Remove`
|
|
in the same handler. (Today the Ask is taken on the *actor* itself by the
|
|
caller, so the reply skips the coordinator.)
|
|
|
|
---
|
|
|
|
### Communication-018 — Site heartbeats hard-code `IsActive: true` regardless of node role
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/SiteCommunicationActor.cs:376-465` |
|
|
|
|
**Description**
|
|
|
|
`SiteCommunicationActor.SendHeartbeatToCentral` builds
|
|
`new HeartbeatMessage(_siteId, hostname, IsActive: true, DateTimeOffset.UtcNow)`
|
|
on every periodic tick (line 366), with no inspection of whether this node is
|
|
actually the active site node or a standby. The `HeartbeatMessage.IsActive`
|
|
field thus carries the literal value `true` on every heartbeat from every
|
|
node, and the field is effectively dead — central's `HandleHeartbeat` doesn't
|
|
consume it either (line 297 only passes `SiteId` and `Timestamp` to
|
|
`MarkHeartbeat`).
|
|
|
|
Per CLAUDE.md's Cluster & Failover section the active/standby distinction is
|
|
real ("Both nodes are seed nodes", "keep-oldest split-brain resolver",
|
|
"automatic dual-node recovery"), so a heartbeat that *could* carry node-role
|
|
information would be useful for the central health dashboard distinguishing
|
|
"active node down, standby up" from "site fully offline". As shipped, the
|
|
field is contract noise and a future implementer might mistakenly assume it
|
|
already carries meaningful state.
|
|
|
|
**Recommendation**
|
|
|
|
Either (a) resolve the current cluster role at heartbeat-send time and pass it
|
|
through — e.g. `Cluster.Get(Context.System).SelfRoles.Contains("active")` or
|
|
the project's existing role mechanism — and have the central aggregator
|
|
consume `IsActive`; or (b) drop the `IsActive` field from `HeartbeatMessage`
|
|
(additive-only-evolution: deprecate the field, default to `true`, plan
|
|
removal in a major message contract revision).
|
|
|
|
**Resolution (2026-05-28):** Took option (a). `SiteCommunicationActor` now
|
|
accepts an optional `Func<bool>? isActiveCheck` (default = real `Cluster.Get`
|
|
leader check mirroring `ActiveNodeGate` / `ActiveNodeHealthCheck`) and
|
|
`SendHeartbeatToCentral` stamps `HeartbeatMessage.IsActive` with the result.
|
|
A try/catch keeps the heartbeat tick alive when the cluster state is
|
|
unreadable (warm-up / TestKit without cluster plugin) — falls back to
|
|
`IsActive: false`, the safe non-claiming value. Added parameterised test
|
|
`Heartbeat_StampsIsActive_FromInjectedCheck`. Tests green (199/199 in
|
|
Communication.Tests).
|
|
|
|
---
|
|
|
|
### Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Error handling & resilience |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:397-431` |
|
|
|
|
**Description**
|
|
|
|
`LoadSiteAddressesFromDb` runs `await repo.GetAllSitesAsync()` inside
|
|
`Task.Run(async () => ...).PipeTo(self)` with no cancellation token (line 404).
|
|
The repository signature accepts `CancellationToken` (the test mock declares
|
|
`GetAllSitesAsync(Arg.Any<CancellationToken>())`), but the actor calls the
|
|
no-arg overload — so a hung MS SQL connection has no upper bound. The
|
|
60-second-periodic refresh keeps firing; each tick spawns a fresh `Task.Run`
|
|
that piles up if the database is consistently slow. The actor itself is
|
|
unaffected (it's not blocked), but pending tasks and DB connection-pool
|
|
resources accumulate, and the `Status.Failure` handler (Communication-006)
|
|
never fires because the task never faults — it just sits.
|
|
|
|
**Recommendation**
|
|
|
|
Maintain a per-load `CancellationTokenSource` with a deadline (e.g. the same
|
|
60s the refresh runs on, or a configurable timeout in `CommunicationOptions`).
|
|
Pass its `Token` to `GetAllSitesAsync`. Cancel the prior token before spinning
|
|
a new load to avoid task accumulation.
|
|
|
|
**Resolution (2026-05-28):** Added a per-actor lifecycle `CancellationTokenSource`
|
|
on `CentralCommunicationActor`, cancelled+disposed in `PostStop`. Its `Token`
|
|
is now passed into `repo.GetAllSitesAsync(ct)` so a hung MS SQL query is
|
|
bounded by actor shutdown rather than holding piped tasks open. The existing
|
|
60-second refresh cadence and `Status.Failure` handler (Comm-006) are unchanged
|
|
— a deadline-per-load was scoped out as a future enhancement; this fix
|
|
addresses the immediate "no upper bound on actor stop" concern called out in
|
|
the finding.
|
|
|
|
---
|
|
|
|
### Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Akka.NET conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:567` |
|
|
|
|
**Resolution (2026-05-28):** `SiteAddressCacheLoaded`'s `SiteContacts` payload is now typed as `IReadOnlyDictionary<string, IReadOnlyList<string>>`, enforcing the Akka.NET message-immutability convention at the type level rather than relying on producer discipline. The producer (`LoadSiteAddressesFromDb`) builds the working buckets as before and wraps each inner `List<string>` with `AsReadOnly()` before constructing the message — the freeze is local to the single refresh tick and the cost is negligible. The consumer (`HandleSiteAddressCacheLoaded`) only ever read via `Keys`, foreach-deconstruct, `Select`, `Count` and `ToImmutableHashSet`, all of which are supported by the new read-only types, so no consumer changes were needed. The existing `MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered` and `ClusterClientRouting_RoutesToConfiguredSite` regression tests exercise the producer→consumer flow and continue to pass under the read-only types.
|
|
|
|
**Description**
|
|
|
|
The Akka.NET convention is that messages crossing actor boundaries (even
|
|
internal Self-messages over an async task boundary) are immutable.
|
|
`SiteAddressCacheLoaded(Dictionary<string, List<string>> SiteContacts)` is a
|
|
record but its `SiteContacts` payload is a mutable `Dictionary` whose values
|
|
are mutable `List<string>`. Constructed inside `Task.Run` and handed off to
|
|
the actor, the cache could in principle be mutated by either side; in
|
|
practice nothing does, but the type is a stale-evidence guarantee that
|
|
CLAUDE.md's "message immutability" rule is being followed only by convention.
|
|
|
|
**Recommendation**
|
|
|
|
Change the record signature to use `IReadOnlyDictionary<string, IReadOnlyList<string>>`
|
|
(or `ImmutableDictionary` / `ImmutableArray<string>`) and freeze the data
|
|
before piping. The cost is negligible — the payload is built and consumed
|
|
once per refresh tick.
|
|
|
|
---
|
|
|
|
### Communication-021 — `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Error handling & resilience |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcServer.cs:188-200` |
|
|
|
|
**Description**
|
|
|
|
`SubscribeInstance` performs these statements in order (lines 189-194), all
|
|
*before* the `try` block at line 200:
|
|
1. `Interlocked.Increment(ref _actorCounter)`
|
|
2. `_actorSystem!.ActorOf(Props.Create(typeof(StreamRelayActor), ...))`
|
|
3. `_streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor)`
|
|
|
|
If step 3 throws (the subscriber is wired but its `Subscribe` faults — a stale
|
|
instance name, a temporary index lookup failure, etc.), the exception escapes
|
|
the method as an unhandled `RpcException` *and* leaks the freshly-created
|
|
`relayActor`. The `finally` block at line 211 is unreachable because the
|
|
throw happens before the `try`. The actor's `_activeStreams` entry, the
|
|
`StreamEntry.Cts`, and the `Channel<SiteStreamEvent>` are also leaked.
|
|
|
|
In normal operation `_streamSubscriber.Subscribe` does not throw, so the bug is
|
|
latent — but a misbehaving site runtime (e.g. `SiteStreamManager` faulted
|
|
because the actor system is shutting down) would surface it.
|
|
|
|
**Recommendation**
|
|
|
|
Restructure to either (a) wrap the `Subscribe` call in a `try` whose `catch`
|
|
stops the relay actor and disposes the CTS, or (b) move the actor + subscriber
|
|
creation *inside* the existing `try` block (the `finally` will then handle
|
|
cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down
|
|
past the `try {` brace.
|
|
|
|
**Resolution (2026-05-28):** Took option (a). `_streamSubscriber.Subscribe(...)`
|
|
is now wrapped in its own try/catch — on throw, the freshly-created relay actor
|
|
is stopped via `_actorSystem.Stop`, the bounded channel is completed via
|
|
`channel.Writer.TryComplete()`, and the `_activeStreams` entry is removed via
|
|
the ownership-preserving `TryRemove(KeyValuePair)` overload before the
|
|
exception is re-thrown to the caller. Added regression test
|
|
`SiteStreamGrpcServerTests.Comm021_SubscribeThrows_StopsRelayActorAndRemovesActiveStreamEntry`
|
|
using an NSubstitute `ISiteStreamSubscriber` that throws on Subscribe;
|
|
asserts `ActiveStreamCount == 0` and that `RemoveSubscriber` was NOT called
|
|
(confirming the catch path, not the finally path).
|
|
|
|
---
|
|
|
|
### Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:493` |
|
|
|
|
**Description**
|
|
|
|
`TrackMessageForCleanup` on `SubscribeDebugViewRequest` does
|
|
`_debugSubscriptions[sub.CorrelationId] = (envelope.SiteId, Sender)` (line 493).
|
|
The dictionary indexer silently overwrites any prior entry for the same
|
|
`CorrelationId`. If two debug sessions ever reuse the same correlation ID (e.g.
|
|
two Blazor users start a stream at the same moment with a non-GUID id, or a
|
|
caller bug, or a malicious caller as flagged in the cousin
|
|
Communication-014), the first subscriber's entry is overwritten and lost —
|
|
on a later `ConnectionStateChanged(false)` (per Communication-016 it never
|
|
actually fires today, but the design intent stands), only the *second*
|
|
subscriber would be notified of the disconnect.
|
|
|
|
`DebugStreamService.StartStreamAsync` uses `Guid.NewGuid().ToString("N")` as
|
|
the session id (`DebugStreamService.cs:97`), so a real collision is
|
|
astronomically unlikely in normal operation. But the central side is not
|
|
defending itself: a CLI consumer or a future caller is implicitly trusted to
|
|
generate globally-unique ids.
|
|
|
|
**Recommendation**
|
|
|
|
When the slot is already occupied, log a Warning and either reject the new
|
|
subscription with an error response or evict the prior subscriber via
|
|
`DebugStreamTerminated` before installing the new one. Mirrors the
|
|
`SiteStreamGrpcServer` defensive behaviour where a duplicate `correlation_id`
|
|
cancels the existing stream (line 167).
|
|
|
|
**Resolution (2026-05-28):** Closed by Comm-016 — field removed in commit ac96b83.
|
|
The `_debugSubscriptions` dictionary, `TrackMessageForCleanup` helper, and the
|
|
`HandleConnectionStateChanged` handler that consumed them were all deleted as
|
|
part of Comm-016's dead-code purge. There is no longer any caller-supplied
|
|
correlation-id keyed map to overwrite — the orphan-on-reuse hazard is gone.
|