Files
ScadaBridge/code-reviews/Communication/findings.md
T
Joseph Doherty d39089f4ed docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
2026-06-20 18:02:32 -04:00

1257 lines
72 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — Communication
| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.Communication` |
| Design doc | `docs/requirements/Component-Communication.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
## Summary
The Communication module is generally well-structured and matches the design doc's
two-transport model (ClusterClient for command/control, gRPC server-streaming for
real-time data). The actors keep mutable state on the actor thread, use `PipeTo` for
async work, and the gRPC server/client lifecycle is mostly disciplined. However the
review found several High and Medium issues clustered around two themes:
**(a) gRPC subscription bookkeeping races** — `SiteStreamGrpcClient` overwrites and
removes subscription entries by correlation ID without disposal or ownership checks,
so reconnect cycles leak `CancellationTokenSource`es and can cancel the wrong stream;
and **(b) missing supervision strategy** on the coordinator actors, contrary to the
CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is otherwise
good. Test coverage is broad for happy paths but has gaps around failover, cache
mutation races, and the snapshot-timeout cleanup path.
#### Re-review 2026-05-17 (commit `39d737e`)
All prior findings (Communication-001..011) are confirmed `Resolved` in this commit
and the fixes hold up against the source. The re-review walked all 10 checklist
categories again and uncovered a previously-missed defect at the centre of the gRPC
node-failover path: **`SiteStreamGrpcClientFactory.GetOrCreate` caches one client per
site identifier and silently ignores the `grpcEndpoint` argument on a cache hit**. The
`DebugStreamBridgeActor` reconnect logic flips `_useNodeA` and passes the *other*
node's endpoint, but the factory hands back the original NodeA-bound client every
time — so the documented "try the other site node endpoint" failover never actually
moves to NodeB (Communication-012). The same caching defect means a site address
change is never picked up because `RemoveSiteAsync` has no production caller
(Communication-013). Two Low findings round out the re-review: an untrusted
gRPC-supplied `correlation_id` flows straight into an Akka actor name
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
#### Re-review 2026-05-28 (commit `1eb6e97`)
All prior findings (Communication-001..015) remain `Resolved` in this commit. The
re-review walked all 10 checklist categories again on the surface that has not
been re-examined before — the central↔site command/control routing surface
(`CentralCommunicationActor`, `SiteCommunicationActor`) rather than the
previously-mined gRPC streaming surface — and uncovered a cluster of defects
around the connection-state-change workflow. The single material finding is
**`HandleConnectionStateChanged` is dead code**: no production code path emits
`ConnectionStateChanged`, so the documented "kill active debug streams for the
disconnected site" + "mark in-progress deployments as failed" workflow never
fires at runtime (Communication-016). The downstream consequence is
**`_inProgressDeployments` grows unboundedly** — entries are inserted on every
deployment but only cleaned via that dead path (Communication-017). Three
smaller items round out the re-review: site heartbeats hard-code
`IsActive: true` regardless of node role (Communication-018), the
60-second-periodic `LoadSiteAddressesFromDb` task has no CancellationToken so a
hung DB query has no upper bound (Communication-019), the
`SiteAddressCacheLoaded` internal message carries a mutable
`Dictionary`/`List` (Communication-020), `SiteStreamGrpcServer.SubscribeInstance`
leaks the StreamRelayActor if `_streamSubscriber.Subscribe` throws between
`ActorOf` and the `try` block (Communication-021), and `_debugSubscriptions`
keyed by caller-supplied `CorrelationId` could orphan a subscriber on ID reuse
(Communication-022). Seven new findings, all Open: one High, one Medium, five
Low.
#### Re-review 2026-06-20 (commit `4307c381`) — full review
All prior findings (Communication-001..022) remain closed (Resolved, or Resolved-by-Comm-016).
This module was renamed `ScadaLink → ZB.MOM.WW.ScadaBridge` since `1eb6e97`, so the
`git diff 1eb6e97 HEAD` shows the whole module as added; the review re-examined the
full current state across all 10 categories, with focus on the surface added since the
last review — the four new gRPC RPCs on `SiteStreamService` (`IngestAuditEvents`,
`IngestCachedTelemetry`, `PullAuditEvents`, `PullSiteCalls`) and the M7 command/control
forwards (`SearchAddressSpaceCommand`, `WriteTagRequest`, `VerifyEndpointCommand`,
cert-trust commands). The module is healthy: actor state stays on the actor thread,
`PipeTo`/`Forward`/Sender-capture are clean, gRPC channel/stream/CTS lifecycles are
disciplined, the duplicate-stream and Subscribe-throws cleanup paths are correct, and
the gRPC ingest/pull handlers degrade best-effort (empty ack / empty response, rows stay
Pending) on fault. The only new findings are **design-doc drift**: the doc still claims
`SiteStreamService` has "a single RPC `SubscribeInstance`" and that audit/telemetry
reconciliation rides ClusterClient, when in fact four RPCs now exist and the
reconciliation *pull* runs over gRPC (Communication-023); and the AlarmStateUpdate field
table stops at field 21 while the proto/code carry fields 2223 (Communication-024). Two
new findings, both Open: zero Critical/High, zero Medium-code, one Medium-doc, one Low.
## Checklist coverage 2026-06-20
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `SubscribeInstance` duplicate-removal + Subscribe-throws cleanup (Comm-021) correct; telemetry open/close gauge is balanced (counted only after Subscribe succeeds, decremented in finally). `DebugStreamBridgeActor` stream-first buffer/dedup sound. No new issues. |
| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies present on both actors; `Forward` preserves Sender on every site relay; `SiteAddressCacheLoaded` now read-only-typed (Comm-020). gRPC ingest handlers capture Sender before `PipeTo`. No new issues. |
| 3 | Concurrency & thread safety | ✓ | `_activeStreams`/`_subscriptions` use ownership-preserving `TryRemove(KeyValuePair)`; `GetOrCreate` uses `AddOrUpdate` atomically; bridge state mutated only on the actor thread via `Self.Tell`. No new issues. |
| 4 | Error handling & resilience | ✓ | gRPC `Ingest*`/`Pull*` degrade to empty ack/response on fault (rows stay Pending) — divergent from the ClusterClient path's `Status.Failure` pipe but behaviourally equivalent (both → Pending+retry), and explicitly documented. `LoadSiteAddressesFromDb` lifecycle-CT bounded (Comm-019). No new issues. |
| 5 | Security | ✓ | `correlation_id` validated against `ActorPath.IsValidPathElement` before actor naming (Comm-014). `Pull*` reject `batch_size <= 0` with `InvalidArgument`. Secrets ride encrypted blocks, not this layer. No new issues. |
| 6 | Performance & resource management | ✓ | gRPC channel disposed synchronously (Comm-007); per-stream CTS `CancelAfter(GrpcMaxStreamLifetime)`; bounded `Channel` (1000, DropOldest). `RemoveSiteAsync` still test-only but Comm-013 documents that endpoint change is handled by `GetOrCreate`. No new issues. |
| 7 | Design-document adherence | ✓ | Doc claims a *single* RPC and ClusterClient-only reconciliation; reality is 4 RPCs + gRPC reconciliation pulls (Communication-023). AlarmStateUpdate field table stops at 21; proto/code carry 2223 (Communication-024). |
| 8 | Code organization & conventions | ✓ | DTO mappers correctly hosted in Communication (owns generated DTOs); proto evolution is additive (field numbers 2223 appended, never reused); Options pattern intact. No new issues. |
| 9 | Testing coverage | ✓ | `ProtoContractTests` guards oneof variants; `StreamRelayActorTests` + `ConvertToDomainEvent` cover enriched fields 823 end-to-end; dedicated proto tests cover the audit/telemetry/pull DTOs. `ProtoRoundtripTests.AlarmStateUpdate_RoundTrip` only asserts fields 15, but the gap is covered elsewhere — no separate finding. |
| 10 | Documentation & comments | ✓ | In-code XML docs are thorough and accurate (the prior-finding comments are precise). The drift is in the design doc only (Communication-023/024), not the code comments. |
## Checklist coverage 2026-05-28
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `HandleConnectionStateChanged` and its `_inProgressDeployments` / `_debugSubscriptions` cleanup never fire — the connection-state workflow is dead (Communication-016, Communication-017). `_debugSubscriptions` correlation-ID overwrite risk (Communication-022). |
| 2 | Akka.NET conventions | ✓ | `SiteAddressCacheLoaded` carries mutable `Dictionary<string, List<string>>` — violates message-immutability convention (Communication-020). `Forward`/`PipeTo`/Sender-capture all clean. |
| 3 | Concurrency & thread safety | ✓ | All mutable state mutated on the actor thread. `_subscriptions` ConcurrentDictionary use disciplined. No new issues. |
| 4 | Error handling & resilience | ✓ | `LoadSiteAddressesFromDb` lacks a `CancellationToken` propagation point (Communication-019). `SubscribeInstance` leaks the relay actor if `Subscribe` throws pre-try (Communication-021). |
| 5 | Security | ✓ | Correlation-id validation in place (Communication-014). No new issues. |
| 6 | Performance & resource management | ✓ | `_inProgressDeployments` grows unboundedly (Communication-017). gRPC client/server lifecycles otherwise clean. |
| 7 | Design-document adherence | ✓ | `ConnectionStateChanged` handler is dead code — the doc-stated "kill streams on disconnect, fail in-progress deployments" workflow does not actually run (Communication-016). Site heartbeats always report `IsActive: true` regardless of role (Communication-018). |
| 8 | Code organization & conventions | ✓ | Options pattern correct; mapper placement and proto evolution are additive-only. No new issues. |
| 9 | Testing coverage | ✓ | `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled` exercises a code path that no production caller ever drives — gives false confidence (related to Communication-016). |
| 10 | Documentation & comments | ✓ | Detailed XML docs added in this commit. No new issues. |
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved. |
| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies now present and verified. No new issues. |
| 3 | Concurrency & thread safety | ✓ | Subscription-map register/remove now ownership-checked. `_siteClients` readonly. No new issues. |
| 4 | Error handling & resilience | ✓ | `Status.Failure` handler added; reconnect unsubscribes prior stream. No new issues. |
| 5 | Security | ✓ | Re-review: public gRPC `correlation_id` flows unvalidated into an Akka actor name (Communication-014). |
| 6 | Performance & resource management | ✓ | Synchronous `Dispose` paths fixed; CTS leaks resolved. No new issues. |
| 7 | Design-document adherence | ✓ | Re-review: site gRPC address-change disposal not wired — `RemoveSiteAsync` is dead code (Communication-013). gRPC options now applied. |
| 8 | Code organization & conventions | ✓ | Options pattern correct; public records still declared in actor files (acceptable). No structural issues. |
| 9 | Testing coverage | ✓ | Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015). |
| 10 | Documentation & comments | ✓ | `DebugStreamBridgeActor` summary corrected. No new issues. |
## Findings
### Communication-001 — Early stream termination escapes StartStreamAsync's narrow exception handling
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/DebugStreamService.cs:130-143` |
**Re-triaged 2026-05-16:** originally filed Critical, claiming an orphaned bridge actor
and a multi-minute site-side resource leak on every snapshot timeout. On verification
that impact does **not** occur: `DebugStreamBridgeActor` calls `CleanupGrpc()` and
`Context.Stop(Self)` on every path that invokes `onTerminated` (site disconnect, gRPC
max-retries, `ReceiveTimeout`), so it always self-terminates and releases its gRPC
subscription; and the pure-timeout path does reach `StopStream`, which also stops it.
The genuine defect described below is an error-handling gap, not a leak — severity
corrected to Medium.
**Description**
`StartStreamAsync` awaits the initial snapshot inside a `try` whose only handler is
`catch (OperationCanceledException)`. When the stream terminates before the snapshot
arrives, `onTerminatedWrapper` completes the await via
`snapshotTcs.TrySetException(new InvalidOperationException(...))`. That
`InvalidOperationException` is not an `OperationCanceledException`, so it escapes the
catch entirely: the caller (Blazor debug view / SignalR hub) receives a raw,
untranslated exception, and `StartStreamAsync` performs no teardown of its own on that
path — it relies implicitly on the bridge actor self-terminating. Cleanup from the
service side is therefore not deterministic, and the failure surfaced to the caller is
not a meaningful, documented result.
**Recommendation**
In `StartStreamAsync`, catch any exception from the snapshot await, deterministically
tear down the bridge actor (`Tell(StopDebugStream)` via the local actor reference, since
a racing `onTerminatedWrapper` may already have removed the session entry), and translate
the failure into a meaningful exception for the caller.
**Resolution**
Resolved 2026-05-16. The `catch (OperationCanceledException)`-only block in
`StartStreamAsync` was replaced with `catch (Exception)`: it removes the session entry,
sends `StopDebugStream` to the bridge actor via the local reference (idempotent — the
actor may already be stopping itself), and throws a descriptive exception —
`TimeoutException` for the 30s timeout, otherwise an `InvalidOperationException` that
names the instance/site and wraps the underlying cause. Regression test
`DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException`
fails against the pre-fix code and passes after. Fixed by the commit whose message
references `Communication-001`.
### Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:170`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:143` |
**Description**
On a gRPC stream error, `HandleGrpcError` increments the retry count, flips
`_useNodeA`, and schedules `OpenGrpcStream`. `OpenGrpcStream` cancels and disposes
`_grpcCts` and starts a fresh `SubscribeInstance` call — but it never calls
`client.Unsubscribe(_correlationId)` on the *old* node's client, and the site-side
`SiteStreamGrpcServer` keys active streams by `correlation_id` only. Because the new
subscription goes to the *other* node (`_useNodeA` flipped), the old node's
`SiteStreamGrpcServer` still has an active stream + `StreamRelayActor` +
`SiteStreamManager` subscription for that correlation ID. The old node only learns the
client is gone via TCP RST or keepalive — exactly the failure mode that triggered the
reconnect (network partition / silent node), so detection may take ~25s or never. Each
reconnect can therefore leave a zombie relay actor on the failed node. `CleanupGrpc`
(which *does* call `Unsubscribe`) is only invoked on terminal paths, not between
reconnect attempts.
**Recommendation**
Before reconnecting in `HandleGrpcError` / at the top of `OpenGrpcStream`, call
`Unsubscribe(_correlationId)` on the client for the *previous* endpoint (the one that
just failed) so the local CTS is cancelled and — where the channel is still alive —
the gRPC cancellation reaches the site and stops the relay actor.
**Resolution**
Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against source:
`HandleGrpcError` flipped `_useNodeA` and scheduled `OpenGrpcStream` without ever
unsubscribing the failed stream, leaving the old node's `StreamRelayActor` zombie until
TCP/keepalive timeout. Fix: `HandleGrpcError` now resolves the client for the
*previous* endpoint (before flipping `_useNodeA`) and calls `Unsubscribe(_correlationId)`
on it, so the local CTS is cancelled and gRPC cancellation reaches the still-alive site.
Regression test `DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect`
fails against the pre-fix code and passes after.
### Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClient.cs:77`, `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClient.cs:106` |
**Description**
`SubscribeAsync` does `_subscriptions[correlationId] = cts;` (line 77),
unconditionally overwriting any existing entry for that correlation ID without
cancelling or disposing the previous `CancellationTokenSource`. The `finally` block
then does `_subscriptions.TryRemove(correlationId, out _)` (line 106) which removes
the entry **by key only, regardless of which CTS is stored**. Because
`DebugStreamBridgeActor` reuses the same `_correlationId` across reconnect attempts
(and `SiteStreamGrpcClientFactory` returns the same `SiteStreamGrpcClient` for a site
even after a node flip), two `SubscribeAsync` calls can briefly share a correlation
ID. The first call's `finally` then removes the *second* call's CTS entry, so a later
`Unsubscribe(correlationId)` finds nothing and the live stream is never cancelled — an
orphan. Conversely the overwritten CTS is leaked (never disposed).
**Recommendation**
When inserting, cancel+dispose any prior CTS for that correlation ID. In the `finally`,
remove only if the stored CTS is the one this call created (use the
`TryRemove(KeyValuePair)` overload, mirroring what `SiteStreamGrpcServer` already does
with `StreamEntry`). Consider keying subscriptions by a per-call GUID rather than the
caller-supplied correlation ID.
**Resolution**
Resolved 2026-05-16 (commit `<pending>`). Root cause confirmed against source: the
inline `_subscriptions[correlationId] = cts` overwrote a prior CTS without
cancel/dispose (leak), and the `finally`'s `TryRemove(correlationId, out _)` removed by
key only — a racing reconnect's live CTS could be removed by the prior call's `finally`,
orphaning the live stream. Fix: extracted two internal helpers used by `SubscribeAsync`
`RegisterSubscription` cancels+disposes any existing CTS for the correlation ID before
inserting, and `RemoveSubscription` uses the `ConcurrentDictionary.TryRemove(KeyValuePair)`
overload so it removes only the CTS that call created (mirroring `SiteStreamGrpcServer`'s
`StreamEntry` pattern). Regression tests
`SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts`
and `SiteStreamGrpcClientTests.RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement`
fail against the pre-fix logic and pass after.
### Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:42`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/SiteCommunicationActor.cs:22` |
**Description**
CLAUDE.md ("Explicit supervision strategies: Resume for coordinator actors, Stop for
short-lived execution actors") requires coordinator actors to use an explicit `Resume`
supervision strategy. `CentralCommunicationActor` and `SiteCommunicationActor` are
long-lived coordinators (they own the per-site ClusterClient map, debug
subscriptions, in-progress deployments) but neither overrides `SupervisorStrategy`.
They fall back to the Akka default (`OneForOneStrategy` with `Restart`). A child fault
— e.g. a `ClusterClient` child of `CentralCommunicationActor` created by
`DefaultSiteClientFactory` — would `Restart` under the default strategy, and any
exception in the coordinator itself would restart it, wiping `_siteClients`,
`_debugSubscriptions`, and `_inProgressDeployments` silently. The design intent is
`Resume` so transient child faults do not discard coordinator state.
**Recommendation**
Override `SupervisorStrategy` on both actors to return an explicit
`OneForOneStrategy` with `Directive.Resume` (or the project's standard coordinator
strategy), matching the documented decision and other coordinator actors.
**Resolution**
Resolved 2026-05-16 (commit pending). Root cause confirmed: neither
`CentralCommunicationActor` nor `SiteCommunicationActor` overrode `SupervisorStrategy`,
so child faults fell back to the Akka default (`Restart`). Note that an actor's own
`SupervisorStrategy` governs its *children* — a transient child fault would `Restart`
the child and discard its in-memory state, contrary to the CLAUDE.md "Resume for
coordinator actors" decision. Fix: both actors now override `SupervisorStrategy()` to
return a `OneForOneStrategy` with an unbounded `Decider` resolving to `Directive.Resume`
(mirroring `DataConnectionManagerActor`). Regression tests
`CoordinatorSupervisionTests.CentralCommunicationActor_SupervisorStrategy_IsResume` and
`CoordinatorSupervisionTests.SiteCommunicationActor_SupervisorStrategy_IsResume` fail
against the pre-fix code (decider yields `Restart`) and pass after.
### Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClient.cs:25`, `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationOptions.cs:36` |
**Description**
`CommunicationOptions` exposes `GrpcKeepAlivePingDelay`, `GrpcKeepAlivePingTimeout`,
`GrpcMaxStreamLifetime`, and `GrpcMaxConcurrentStreams`, and the design doc's
"gRPC Connection Keepalive" section explicitly states these are configurable. However
`SiteStreamGrpcClient`'s constructor hard-codes `KeepAlivePingDelay =
TimeSpan.FromSeconds(15)` and `KeepAlivePingTimeout = TimeSpan.FromSeconds(10)`
instead of reading the options. `GrpcMaxStreamLifetime` (the documented "Session
timeout — 4 hours" third layer of dead-client detection) is not referenced anywhere
`SiteStreamGrpcServer.SubscribeInstance` creates a linked CTS from the call
cancellation token only, with no `CancelAfter`. The 4-hour zombie-stream safety net
described in the design doc does not exist in code. `GrpcMaxConcurrentStreams` is also
not wired to the server (`SiteStreamGrpcServer` takes a `maxConcurrentStreams`
constructor parameter defaulting to 100, but nothing binds the option to it).
**Recommendation**
Flow `CommunicationOptions` into `SiteStreamGrpcClient` and `SiteStreamGrpcServer`
(via the factory / DI). Apply `GrpcKeepAlivePingDelay` / `GrpcKeepAlivePingTimeout` to
the `SocketsHttpHandler`, bind `GrpcMaxConcurrentStreams` to the server's limit, and
implement the `GrpcMaxStreamLifetime` session timeout with `CancelAfter` on the
server-side stream CTS — or, if the 4-hour cap is intentionally dropped, remove the
option and update the design doc.
**Resolution**
Resolved 2026-05-16 (commit pending). Root cause confirmed: `SiteStreamGrpcClient`
hard-coded the keepalive values, `GrpcMaxStreamLifetime` was referenced nowhere, and
`GrpcMaxConcurrentStreams` was never bound to the server. Fix (scoped to
`src/ZB.MOM.WW.ScadaBridge.Communication`): `SiteStreamGrpcClient` gained a constructor taking
`CommunicationOptions` and now applies `GrpcKeepAlivePingDelay`/`GrpcKeepAlivePingTimeout`
to its `SocketsHttpHandler`; `SiteStreamGrpcClientFactory` gained an
`IOptions<CommunicationOptions>` DI constructor and flows the options into every client
it creates; `SiteStreamGrpcServer` gained an `IOptions<CommunicationOptions>` DI
constructor that binds `GrpcMaxConcurrentStreams` and implements the documented 4-hour
session timeout via `CancellationTokenSource.CancelAfter(GrpcMaxStreamLifetime)` on the
per-stream CTS. The Host's existing `AddSingleton<SiteStreamGrpcServer>()` registration
resolves the new DI constructor via greedy resolution — no Host change required.
Regression tests `GrpcOptionsWiringTests.SiteStreamGrpcClient_AppliesKeepAliveFromOptions`,
`GrpcOptionsWiringTests.SiteStreamGrpcClientFactory_FlowsOptionsToCreatedClients`, and
`GrpcOptionsWiringTests.SiteStreamGrpcServer_BindsMaxConcurrentStreamsAndLifetimeFromOptions`
exercise the wiring (they require the new members to even compile).
### Communication-006 — Site address load failures are silently swallowed, leaving a stale cache
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:204` |
**Description**
`LoadSiteAddressesFromDb` runs the repository query inside `Task.Run(...).PipeTo(self)`.
If `GetAllSitesAsync` throws (database unavailable, transient connection error), the
faulted task is piped to `Self` as a `Status.Failure`. `CentralCommunicationActor` has
no `Receive<Status.Failure>` handler, so the failure becomes an unhandled message
(logged at debug, not surfaced) and the periodic refresh silently fails. If the
*first* startup load fails the actor runs with an empty `_siteClients` map — every
`SiteEnvelope` is dropped (line 187) and every Ask times out with no indication of the
root cause.
**Recommendation**
Add a `Receive<Status.Failure>` handler that logs the load failure at Warning/Error
level so operators can distinguish "site has no addresses configured" from "database
is down". Optionally surface a health metric for repeated load failures.
**Resolution**
Resolved 2026-05-16 (commit pending). Root cause confirmed: a faulted
`LoadSiteAddressesFromDb` task is piped to `Self` as a `Status.Failure`, but the actor
had no handler for it — the failure became an unhandled message (debug-level only) and
the periodic refresh failed silently. Fix: added a `Receive<Status.Failure>` handler
that logs the load failure at `Warning` with the underlying exception as the cause, so
operators can distinguish a missing-addresses configuration from a database outage.
Regression test
`CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed`
(repository query throws) asserts the Warning is emitted — it produces no warning
against the pre-fix code and passes after.
### Communication-007 — `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async)
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53` |
**Description**
`Dispose()` calls `DisposeAsync().AsTask().GetAwaiter().GetResult()`. This is the
classic sync-over-async pattern: it blocks the calling thread until all per-site
`SiteStreamGrpcClient.DisposeAsync` calls complete. If `Dispose` is invoked from a
context with a single-threaded synchronization context or from DI container shutdown
on a constrained thread pool, this can deadlock or stall host shutdown. The class
already implements `IAsyncDisposable`.
**Recommendation**
Prefer registering and disposing the factory through `IAsyncDisposable` only (modern
.NET DI honours it for singletons). If a synchronous `Dispose` must remain, dispose
the underlying `GrpcChannel`s directly (synchronous) rather than blocking on the async
path, or document why blocking is safe here.
**Resolution**
Resolved 2026-05-16 (commit pending). Root cause confirmed: `Dispose()` called
`DisposeAsync().AsTask().GetAwaiter().GetResult()`, the classic sync-over-async pattern.
Fix: `SiteStreamGrpcClient` now also implements `IDisposable` with a synchronous
`Dispose()` that releases its CancellationTokenSources and underlying `GrpcChannel`
directly (all of that teardown is inherently synchronous); `SiteStreamGrpcClientFactory.Dispose()`
now disposes each cached client via that synchronous path with no blocking on the async
path. A `CreateClient` seam was extracted so the test can substitute a tracking client
while still exercising the factory's real caching/disposal machinery. Regression test
`SiteStreamGrpcClientFactoryDisposeTests.Dispose_DisposesClientsSynchronously_NotViaAsyncPath`
fails against the pre-fix code (clients disposed via `DisposeAsync`) and passes after;
`Dispose_DoesNotDeadlock_UnderSingleThreadedSynchronizationContext` guards the stall path.
### Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:71`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:174` |
**Description**
`_retryCount` is reset to 0 every time a single `AttributeValueChanged` or
`AlarmStateChanged` event is received (lines 72, 77). Combined with `MaxRetries = 3`,
a stream that connects, delivers exactly one event, then fails — repeatedly — will
reconnect forever. The design doc states "max 3 retries, terminate the session if all
retries fail"; the current logic only terminates after 3 *consecutive* failures with
zero intervening events, so a flapping site never trips the limit and the debug
session (and its site-side relay) lives on indefinitely. The `ReceiveTimeout` orphan
net is also reset by every received message, so it does not bound this case either.
**Recommendation**
Either reset `_retryCount` only after the stream has been stably connected for some
minimum duration (e.g. a timer armed on stream open, cancelled on the next error), or
keep a separate cumulative reconnect counter / time window that bounds total
reconnects regardless of intervening events.
**Resolution**
Resolved 2026-05-16 (commit pending). Root cause confirmed: `_retryCount` was reset to
0 on every received `AttributeValueChanged`/`AlarmStateChanged`, so a stream that
connected, delivered one event, then failed — repeatedly — never tripped `MaxRetries`.
Fix (recommendation option a): the per-event reset was removed; instead `OpenGrpcStream`
arms a single `StabilityWindow` timer (60s default, internal-settable for tests), and
only when it fires (`GrpcStreamStable`) — i.e. the stream stayed up long enough to be
considered recovered — is `_retryCount` reset. `HandleGrpcError` cancels that timer, so
a stream that fails before the window elapses does not recover its retry budget. A
flapping stream therefore terminates after `MaxRetries` regardless of intervening
events. Regression test
`DebugStreamBridgeActorTests.FlappingStream_DeliveringEventsBetweenFailures_StillTerminatesAfterMaxRetries`
fails against the pre-fix code (actor never terminates) and passes after;
`RetryCount_RecoveredOnlyAfterStreamStaysStableForStabilityWindow` verifies the budget
is recovered after a stable interval. The pre-existing test that codified the buggy
per-event reset (`Grpc_Error_Resets_RetryCount_On_Successful_Event`) was replaced.
### Communication-009 — `_siteClients` field is mutable and reassignable; cache update is not atomic on failure
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:53`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:240` |
**Description**
`_siteClients` is a non-`readonly` `Dictionary` field. It is only mutated on the actor
thread (correct), but the field is needlessly reassignable, and
`HandleSiteAddressCacheLoaded` mutates it in place across several loops. If
`ActorPath.Parse` throws on a malformed address mid-loop (e.g. a site row with a
garbage `NodeAAddress`), the method aborts partway through, having already stopped
some ClusterClients and added others — leaving the cache partially updated with no
recovery until the next 60s refresh. The other actor mutable collections
(`_debugSubscriptions`, `_inProgressDeployments`) are correctly `readonly`.
**Recommendation**
Mark `_siteClients` `readonly`. Validate/parse all addresses up front (or wrap
`ActorPath.Parse` in a try/catch that logs and skips the bad site) so a single
malformed site record cannot abort the whole refresh and leave a half-updated cache.
**Resolution**
Resolved 2026-05-16 (commit pending). Root cause confirmed: `_siteClients` was a
non-`readonly` field, and `HandleSiteAddressCacheLoaded`'s add/update loop called
`ActorPath.Parse` per site with no guard — a malformed `NodeAAddress` threw mid-loop and
aborted the refresh, leaving the cache half-updated until the next 60s cycle. Fix:
`_siteClients` is now `readonly`, and the per-site `ActorPath.Parse` is wrapped in a
try/catch that logs the bad address at Warning and `continue`s to the next site, so a
single garbage row cannot starve other sites of their ClusterClient. Regression test
`CentralCommunicationActorTests.MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered`
(bad site ordered before a good one) fails against the pre-fix code (good site never
registered) and passes after.
### Communication-010 — `DebugStreamBridgeActor` XML doc incorrectly describes it as a "Persistent actor"
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:10` |
**Description**
The class summary opens with "Persistent actor (one per active debug session)...".
The actor derives from `ReceiveActor`, not a persistent actor base class, holds no
`PersistenceId`, and writes no journal/snapshot. "Persistent" is misleading — debug
sessions are explicitly "session-based and temporary" per the design doc. A reader
could assume state survives restart, which it does not.
**Recommendation**
Reword the summary to "Long-lived (per active debug session) actor on the central
side..." or similar, removing the word "Persistent".
**Resolution**
Resolved 2026-05-16 (commit pending). Root cause confirmed: the class summary opened
with "Persistent actor (one per active debug session)..." but the actor derives from
`ReceiveActor`, holds no `PersistenceId`, and writes no journal/snapshot. Fix
(documentation only — no behaviour change, so no regression test): the summary was
reworded to "Long-lived (one per active debug session) actor on the central side. Debug
sessions are session-based and temporary — this actor holds no persisted state and does
not derive from an Akka.Persistence base class; its state does not survive a restart."
### Communication-011 — No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.Communication.Tests/` (module-wide) |
**Description**
The test suite covers happy-path routing, handler-not-registered failures, heartbeat
bumping, cache refresh, and gRPC bridge reconnect/retry. However several critical
paths identified in this review have no coverage:
- The `DebugStreamService.StartStreamAsync` snapshot-timeout path (Communication-001)
— no test verifies bridge actor / site subscription teardown on timeout, nor the
`onTerminated`-before-snapshot race that throws a non-`OperationCanceledException`.
- `CentralCommunicationActor` behaviour when `LoadSiteAddressesFromDb` faults
(Communication-006) — `RefreshSiteAddresses_UpdatesCache` only exercises success.
- `SiteStreamGrpcClient` subscription-map overwrite/removal race (Communication-003)
and gRPC reconnect not unsubscribing the old node (Communication-002).
- A malformed `NodeAAddress` aborting `HandleSiteAddressCacheLoaded` (Communication-009).
**Recommendation**
Add tests for: snapshot timeout / pre-snapshot termination cleanup; address-load
failure logging and empty-cache behaviour; reusing a correlation ID across
`SubscribeAsync` calls; and a malformed site address during cache refresh.
**Resolution**
Resolved 2026-05-16 (commit pending). This is a meta-coverage finding; every gap it
enumerates is now covered by a regression test (each fails against its pre-fix code and
passes after):
- Snapshot timeout / pre-snapshot termination (Communication-001) —
`DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException`.
- gRPC reconnect not unsubscribing the old node (Communication-002) —
`DebugStreamBridgeActorTests.On_GrpcError_Unsubscribes_Old_Stream_Before_Reconnect`.
- `SiteStreamGrpcClient` subscription-map overwrite/removal race (Communication-003) —
`SiteStreamGrpcClientTests.RegisterSubscription_ReusedCorrelationId_CancelsAndDisposesPriorCts`
and `RemoveSubscription_OnlyRemovesOwnCts_NotAReplacement`.
- `LoadSiteAddressesFromDb` fault (Communication-006) —
`CentralCommunicationActorTests.LoadSiteAddressesFailure_IsLoggedNotSilentlySwallowed`.
- Malformed `NodeAAddress` aborting `HandleSiteAddressCacheLoaded` (Communication-009) —
`CentralCommunicationActorTests.MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered`
(added with this finding's resolution).
The full module suite (`dotnet test tests/ZB.MOM.WW.ScadaBridge.Communication.Tests`) is green at
111 passing tests.
### Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/DebugStreamBridgeActor.cs:166` |
**Description**
`SiteStreamGrpcClientFactory.GetOrCreate` is `_clients.GetOrAdd(siteIdentifier, _ =>
CreateClient(grpcEndpoint))` — it keys the cache by **site identifier only** and the
`grpcEndpoint` argument is used *exclusively* for the first-ever creation. Every
subsequent call for that site returns the originally-cached `SiteStreamGrpcClient`,
which is permanently bound to the `GrpcChannel` of whatever endpoint was passed first.
`DebugStreamBridgeActor` relies on the opposite behaviour. On a gRPC stream error,
`HandleGrpcError` flips `_useNodeA` and `OpenGrpcStream` recomputes
`endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress`, then calls
`_grpcFactory.GetOrCreate(_siteIdentifier, endpoint)` expecting a client connected to
the *other* node. Because the factory ignores the new endpoint, the bridge actor
reconnects to the **same failed NodeA endpoint** on every retry. The design doc's
core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if
NodeA failed, or vice versa") is therefore inoperative — when a site node goes down,
the debug stream cannot move to the surviving node and simply exhausts `MaxRetries`
against the dead endpoint and terminates. The `_useNodeA` flip, the `previousEndpoint`
computation in `HandleGrpcError`, and the `CleanupGrpc` endpoint selection are all
dead logic. (Communication-002's `Unsubscribe`-before-reconnect fix still functions,
but it unsubscribes and re-subscribes on the *same* client/node rather than the
intended other node.)
**Recommendation**
Make the per-site client aware of both endpoints, or key the cache by
`(siteIdentifier, endpoint)`, or have `GetOrCreate` detect an endpoint change and
dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if
NodeA connection fails"), the cleanest fix is to give `SiteStreamGrpcClient` (or a
per-site holder) both NodeA/NodeB addresses and let it switch channels internally,
removing the endpoint argument from `GetOrCreate` entirely. Add a test that drives a
real `SiteStreamGrpcClientFactory` through a node flip and asserts the second client
targets the other endpoint.
**Resolution**
Resolved 2026-05-17 (commit pending). Root cause confirmed against source:
`GetOrCreate` was `_clients.GetOrAdd(siteIdentifier, …)` — keyed by site identifier
only, so the `grpcEndpoint` argument was honoured solely on first creation and the
NodeA→NodeB flip reconnected to the dead endpoint forever. Fix:
`SiteStreamGrpcClient` now exposes its bound `Endpoint`, and `GetOrCreate` compares
the cached client's endpoint against the requested one — on a mismatch it atomically
installs (via `ConcurrentDictionary.AddOrUpdate`) a fresh client for the new endpoint
and disposes the stale one, so a node flip actually moves to the surviving node.
Regression tests `SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint`
and `GetOrCreate_SameEndpoint_DoesNotDisposeOrRecreate` fail against the pre-fix
factory and pass after.
### Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58` |
**Description**
The design doc states that `SiteStreamGrpcClientFactory` "Disposes clients on site
removal or address change." `RemoveSiteAsync` implements the disposal mechanism, but
a repo-wide search finds **no production caller** — only tests invoke it. Combined
with the cache-by-site-identifier behaviour (Communication-012), the consequence is
that once a site's `SiteStreamGrpcClient` is created, a later edit to that site's
`GrpcNodeAAddress` / `GrpcNodeBAddress` (via the Central UI or CLI) is never reflected
in the cached client — it keeps using the stale channel for the life of the process.
`CentralCommunicationActor` already refreshes the *Akka* address cache every 60s and
recreates ClusterClients on change, but there is no equivalent invalidation path
wired into the gRPC client factory. A site whose gRPC endpoints are corrected after
an initial misconfiguration will never have working debug streaming until the central
node is restarted.
**Recommendation**
Wire a site-removal / address-change signal into `SiteStreamGrpcClientFactory`
e.g. have `CentralCommunicationActor` (which already detects address changes in
`HandleSiteAddressCacheLoaded`) call `RemoveSiteAsync` for sites whose gRPC addresses
changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If
the on-the-fly address-change requirement is intentionally dropped, remove
`RemoveSiteAsync` and correct the design doc.
**Resolution**
Resolved 2026-05-17 (commit pending). The address-*change* staleness — the primary
impact ("never have working debug streaming until the central node is restarted") —
is fixed in-module by the Communication-012 change: `GetOrCreate` is now
endpoint-change-aware, so the next time `DebugStreamBridgeActor` requests a stream
with a corrected `GrpcNodeAAddress`/`GrpcNodeBAddress` the stale cached client is
disposed and replaced — no central restart needed and no external wiring required.
`RemoveSiteAsync` is retained as the disposal path for full site *removal* (a deleted
site record) and its doc comment now states that role explicitly; wiring a
delete-site callback belongs to the site-management flow in another module and is out
of this module's scope. Regression test
`SiteStreamGrpcClientFactoryTests.GetOrCreate_EndpointChanged_DisposesPriorClient`
fails against the pre-fix factory (stale client never disposed/replaced) and passes
after.
### Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcServer.cs:124` |
**Description**
`SubscribeInstance` is a public gRPC endpoint hosted on each site node. It creates the
relay actor with `$"stream-relay-{request.CorrelationId}-{actorSeq}"` as the actor
name, where `request.CorrelationId` comes straight off the wire. Akka actor names have
a restricted character set; a `correlation_id` containing `/`, whitespace, or other
disallowed characters makes `ActorSystem.ActorOf` throw `InvalidActorNameException`.
That exception is not caught inside `SubscribeInstance`, so it escapes as an unhandled
RPC fault (and after the `_streamSubscriber.Subscribe` / `_activeStreams` entry has
already been set up for the duration, though the `finally` does not run because the
throw is before the `try`). In practice central always supplies a GUID, so impact is
low, but the server is trusting client-supplied input to be actor-name-safe.
**Recommendation**
Validate `request.CorrelationId` on entry (non-empty, matches an expected GUID/safe
pattern) and reject with `StatusCode.InvalidArgument` otherwise; or derive the actor
name solely from the internal `_actorCounter` and keep the correlation ID only as
actor state / dictionary key.
**Resolution**
Resolved 2026-05-17 (commit pending). Root cause confirmed: `SubscribeInstance` fed
the off-the-wire `request.CorrelationId` straight into the `stream-relay-…` actor
name, so an id with `/`, whitespace, or other disallowed characters made `ActorOf`
throw `InvalidActorNameException` as an unhandled RPC fault. Fix: `SubscribeInstance`
now validates `CorrelationId` on entry — rejecting null/empty or any value failing
`ActorPath.IsValidPathElement` with `StatusCode.InvalidArgument` before any actor or
subscription state is created. Regression test
`SiteStreamGrpcServerTests.RejectsCorrelationIdThatIsNotActorNameSafe` (theory:
`/`-bearing, whitespace, empty, `$`-prefixed ids) fails against the pre-fix server
and passes after; `AcceptsActorNameSafeCorrelationId` confirms a normal GUID is still
accepted.
### Communication-015 — No test exercises the real gRPC client factory across a node flip
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ZB.MOM.WW.ScadaBridge.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs` |
**Description**
`DebugStreamBridgeActorTests` exercises the reconnect/failover paths through
`MockSiteStreamGrpcClientFactory`, which returns one fixed mock client regardless of
the `grpcEndpoint` argument. This is exactly the behaviour the *real*
`SiteStreamGrpcClientFactory` exhibits incorrectly (Communication-012), so the mock
masks the defect: `On_GrpcError_Reconnects_To_Other_Node` passes even though the real
factory never reaches the other node. `SiteStreamGrpcClientFactoryTests` only asserts
`GetOrCreate` returns the same client for the same site — it never checks what happens
when the same site is requested with a *different* endpoint.
**Recommendation**
Add a `SiteStreamGrpcClientFactoryTests` case that calls `GetOrCreate(site, endpointA)`
then `GetOrCreate(site, endpointB)` and asserts the second call targets `endpointB`
(it should fail today and pass after Communication-012 is fixed). Have the bridge-actor
test's mock factory track the endpoint per call so node-flip coverage is meaningful.
**Resolution**
Resolved 2026-05-17 (commit pending). `SiteStreamGrpcClientFactoryTests` gained
`GetOrCreate_EndpointChanged_ReturnsClientBoundToNewEndpoint` and
`GetOrCreate_EndpointChanged_DisposesPriorClient`, which drive the *real*
`SiteStreamGrpcClientFactory` across a node flip and assert the second call yields a
client bound to the new endpoint with the stale one disposed — both fail against the
pre-fix factory and pass after (the Communication-012 fix). `DebugStreamBridgeActorTests`
gained `On_GrpcError_Reconnects_To_Other_Node_Endpoint`, which uses a new
`EndpointTrackingGrpcClientFactory` test double that hands out a distinct mock client
per endpoint (instead of one fixed mock regardless of endpoint), so the bridge actor's
NodeA→NodeB reconnect is now verified to actually target the NodeB endpoint rather
than being masked by an endpoint-agnostic mock.
### Communication-016 — `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:169`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:338-375` |
**Resolution** — deleted the dead code path in favour of the keepalive-based
detection that is the actual production behaviour: removed the
`Receive<ConnectionStateChanged>` handler, the `HandleConnectionStateChanged`
method, the `_debugSubscriptions` / `_inProgressDeployments` tracking dicts
+ the `TrackMessageForCleanup` helper that fed them, and the dead message
record `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Communication/ConnectionStateChanged.cs`.
The two dead tests (`ConnectionLost_DebugStreamsKilled` in
CentralCommunicationActorTests, `RoundTrip_ConnectionStateChanged_Succeeds`
in CompatibilityTests) were removed alongside. The design doc
`docs/requirements/Component-Communication.md` "Connection Failure Behavior"
section was updated to state explicitly that disconnect is detected at the
transport layer (gRPC keepalive PING ~25 s for debug streams; Ask-timeout
at the CommunicationService layer for command/control), with no
application-level signal. `DebugStreamTerminated` survives because
`DebugStreamBridgeActor` uses it for an unrelated intra-actor stop signal.
**Description**
`CentralCommunicationActor.HandleConnectionStateChanged` is wired to
`Receive<ConnectionStateChanged>` and implements two important workflows on
`IsConnected == false`: (1) kill every active debug stream for the disconnected
site (`_debugSubscriptions` walk → `DebugStreamTerminated` Tell to each
subscriber); (2) mark every in-progress deployment for that site as failed
(`_inProgressDeployments` walk → entry removal). Both are documented in the
component design doc's "Connection Failure Behavior" section and in WP-5 of the
work plan referenced in the class's own XML doc comment.
A repo-wide search (`grep -rn ConnectionStateChanged src/ tests/`) shows **no
production code ever emits `ConnectionStateChanged`**. The only producers are
the unit test `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
(line 137) and the Commons message-roundtrip test. The
`CentralCommunicationActor` therefore never receives one in production, the
disconnect-cleanup workflow never fires, and `_debugSubscriptions` /
`_inProgressDeployments` are never pruned via this path.
Concrete consequences:
- A site goes down → its active debug streams do **not** get a synchronous
`DebugStreamTerminated` notification from central. The bridge actor must
detect the disconnect itself via gRPC keepalive timing out (~25s) or TCP RST.
Subscribers wait that long for the `OnStreamTerminated` callback instead of
the documented "immediately killed by central" behaviour.
- In-progress deployments to a disconnected site continue to occupy the
Ask-reply window and only fail when the Ask times out at the
`CommunicationService.DeployInstanceAsync` layer (120s). They are never
proactively marked failed.
- The unit test gives a strong false impression that the workflow works — it
exercises a code path that has no production caller.
The design doc and CLAUDE.md mention "ClusterClient handles failover between
NodeA and NodeB internally — there is no application-level NodeA preference /
NodeB fallback logic" — so the ClusterClient mechanism is the documented
failover transport. But that says nothing about *signalling* a fully-down
remote cluster to central's coordinator actor, which is exactly what
`ConnectionStateChanged` was meant to do.
**Recommendation**
Pick one of:
- Wire a producer for `ConnectionStateChanged` — e.g. subscribe to
`ClusterClient`'s contact-point/cluster events (`ClusterClient.ContactPoints`
Refresh / `ContactPointAdded` / `ContactPointRemoved`) or watch the
ClusterClient actor for a "no contact points reachable" state — and have it
publish `ConnectionStateChanged` to `Self` on each transition.
- If the documented "synchronously kill streams on disconnect" behaviour is
intentionally being dropped in favour of the slower keepalive-based
detection, delete the handler, the `ConnectionStateChanged` record, and the
related `_debugSubscriptions` / `_inProgressDeployments` tracking, then
update the design doc's "Connection Failure Behavior" section accordingly.
Either way, replace `CentralCommunicationActorTests.ConnectionLost_DebugStreamsKilled`
— at present it asserts a behaviour that no production code triggers.
---
### Communication-017 — `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:73`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:501`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:357-367` |
**Resolution (2026-05-28):** Closed by Comm-016 — field removed in commit ac96b83.
The `_inProgressDeployments` dictionary, the `TrackMessageForCleanup` helper,
and the `HandleConnectionStateChanged` handler that consumed them were all
deleted as part of Comm-016's dead-code purge. A grep for `_inProgressDeployments`
in `CentralCommunicationActor.cs` finds only the explanatory comment block at
line 63-74 that documents the removal. There is no longer any unbounded-growth
hazard — the field does not exist.
**Description**
`TrackMessageForCleanup` inserts `_inProgressDeployments[deploy.DeploymentId] =
envelope.SiteId` on every `DeployInstanceCommand` routed to a site (line 501).
The only places that *remove* from `_inProgressDeployments` are:
- `HandleConnectionStateChanged` on `IsConnected == false` (line 366) — which
per Communication-016 never fires in production.
- `PostStop` (line 553) — only on actor death (central failover).
There is **no removal on the normal happy path** — neither when the site replies
`DeploymentStatusResponse` (the reply goes to the Ask's temporary reply actor,
not back through `CentralCommunicationActor`), nor on Ask timeout. Every
successful or failed deployment leaves its entry behind for the lifetime of the
process.
Memory impact is modest (each entry is ~70-100 bytes), but the dictionary grows
monotonically. Over months of operation across all sites a central node could
accumulate tens of thousands of entries — a real, observable leak. More
seriously, the field is *also* the source-of-truth set the
`HandleConnectionStateChanged` walk uses to fail in-progress deployments, so
even if a `ConnectionStateChanged` *were* fired today, the walk would
"fail" thousands of already-completed deployments and Tell their (now stale)
correlation-IDs into the void.
`_debugSubscriptions` (line 67) shares the same shape — but a normal debug
session ends with an `UnsubscribeDebugViewRequest` that *does* drive cleanup
(line 497), so leaks are only realised when a consumer crashes without
unsubscribing.
**Recommendation**
Either remove `_inProgressDeployments` entirely (it has no other consumer once
Communication-016 is fixed by deletion) or, if the disconnect-cleanup workflow
is retained, add a removal hook on the reply path. The simplest fix is to
subscribe `CentralCommunicationActor` to the Ask reply: route
`DeployInstanceCommand` through the actor with the actor as the Ask sender,
forward the reply to the original caller, and `_inProgressDeployments.Remove`
in the same handler. (Today the Ask is taken on the *actor* itself by the
caller, so the reply skips the coordinator.)
---
### Communication-018 — Site heartbeats hard-code `IsActive: true` regardless of node role
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/SiteCommunicationActor.cs:376-465` |
**Description**
`SiteCommunicationActor.SendHeartbeatToCentral` builds
`new HeartbeatMessage(_siteId, hostname, IsActive: true, DateTimeOffset.UtcNow)`
on every periodic tick (line 366), with no inspection of whether this node is
actually the active site node or a standby. The `HeartbeatMessage.IsActive`
field thus carries the literal value `true` on every heartbeat from every
node, and the field is effectively dead — central's `HandleHeartbeat` doesn't
consume it either (line 297 only passes `SiteId` and `Timestamp` to
`MarkHeartbeat`).
Per CLAUDE.md's Cluster & Failover section the active/standby distinction is
real ("Both nodes are seed nodes", "keep-oldest split-brain resolver",
"automatic dual-node recovery"), so a heartbeat that *could* carry node-role
information would be useful for the central health dashboard distinguishing
"active node down, standby up" from "site fully offline". As shipped, the
field is contract noise and a future implementer might mistakenly assume it
already carries meaningful state.
**Recommendation**
Either (a) resolve the current cluster role at heartbeat-send time and pass it
through — e.g. `Cluster.Get(Context.System).SelfRoles.Contains("active")` or
the project's existing role mechanism — and have the central aggregator
consume `IsActive`; or (b) drop the `IsActive` field from `HeartbeatMessage`
(additive-only-evolution: deprecate the field, default to `true`, plan
removal in a major message contract revision).
**Resolution (2026-05-28):** Took option (a). `SiteCommunicationActor` now
accepts an optional `Func<bool>? isActiveCheck` (default = real `Cluster.Get`
leader check mirroring `ActiveNodeGate` / `ActiveNodeHealthCheck`) and
`SendHeartbeatToCentral` stamps `HeartbeatMessage.IsActive` with the result.
A try/catch keeps the heartbeat tick alive when the cluster state is
unreadable (warm-up / TestKit without cluster plugin) — falls back to
`IsActive: false`, the safe non-claiming value. Added parameterised test
`Heartbeat_StampsIsActive_FromInjectedCheck`. Tests green (199/199 in
Communication.Tests).
---
### Communication-019 — `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:397-431` |
**Description**
`LoadSiteAddressesFromDb` runs `await repo.GetAllSitesAsync()` inside
`Task.Run(async () => ...).PipeTo(self)` with no cancellation token (line 404).
The repository signature accepts `CancellationToken` (the test mock declares
`GetAllSitesAsync(Arg.Any<CancellationToken>())`), but the actor calls the
no-arg overload — so a hung MS SQL connection has no upper bound. The
60-second-periodic refresh keeps firing; each tick spawns a fresh `Task.Run`
that piles up if the database is consistently slow. The actor itself is
unaffected (it's not blocked), but pending tasks and DB connection-pool
resources accumulate, and the `Status.Failure` handler (Communication-006)
never fires because the task never faults — it just sits.
**Recommendation**
Maintain a per-load `CancellationTokenSource` with a deadline (e.g. the same
60s the refresh runs on, or a configurable timeout in `CommunicationOptions`).
Pass its `Token` to `GetAllSitesAsync`. Cancel the prior token before spinning
a new load to avoid task accumulation.
**Resolution (2026-05-28):** Added a per-actor lifecycle `CancellationTokenSource`
on `CentralCommunicationActor`, cancelled+disposed in `PostStop`. Its `Token`
is now passed into `repo.GetAllSitesAsync(ct)` so a hung MS SQL query is
bounded by actor shutdown rather than holding piped tasks open. The existing
60-second refresh cadence and `Status.Failure` handler (Comm-006) are unchanged
— a deadline-per-load was scoped out as a future enhancement; this fix
addresses the immediate "no upper bound on actor stop" concern called out in
the finding.
---
### Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:567` |
**Resolution (2026-05-28):** `SiteAddressCacheLoaded`'s `SiteContacts` payload is now typed as `IReadOnlyDictionary<string, IReadOnlyList<string>>`, enforcing the Akka.NET message-immutability convention at the type level rather than relying on producer discipline. The producer (`LoadSiteAddressesFromDb`) builds the working buckets as before and wraps each inner `List<string>` with `AsReadOnly()` before constructing the message — the freeze is local to the single refresh tick and the cost is negligible. The consumer (`HandleSiteAddressCacheLoaded`) only ever read via `Keys`, foreach-deconstruct, `Select`, `Count` and `ToImmutableHashSet`, all of which are supported by the new read-only types, so no consumer changes were needed. The existing `MalformedSiteAddress_DoesNotAbortRefresh_OtherSitesStillRegistered` and `ClusterClientRouting_RoutesToConfiguredSite` regression tests exercise the producer→consumer flow and continue to pass under the read-only types.
**Description**
The Akka.NET convention is that messages crossing actor boundaries (even
internal Self-messages over an async task boundary) are immutable.
`SiteAddressCacheLoaded(Dictionary<string, List<string>> SiteContacts)` is a
record but its `SiteContacts` payload is a mutable `Dictionary` whose values
are mutable `List<string>`. Constructed inside `Task.Run` and handed off to
the actor, the cache could in principle be mutated by either side; in
practice nothing does, but the type is a stale-evidence guarantee that
CLAUDE.md's "message immutability" rule is being followed only by convention.
**Recommendation**
Change the record signature to use `IReadOnlyDictionary<string, IReadOnlyList<string>>`
(or `ImmutableDictionary` / `ImmutableArray<string>`) and freeze the data
before piping. The cost is negligible — the payload is built and consumed
once per refresh tick.
---
### Communication-021 — `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Grpc/SiteStreamGrpcServer.cs:188-200` |
**Description**
`SubscribeInstance` performs these statements in order (lines 189-194), all
*before* the `try` block at line 200:
1. `Interlocked.Increment(ref _actorCounter)`
2. `_actorSystem!.ActorOf(Props.Create(typeof(StreamRelayActor), ...))`
3. `_streamSubscriber.Subscribe(request.InstanceUniqueName, relayActor)`
If step 3 throws (the subscriber is wired but its `Subscribe` faults — a stale
instance name, a temporary index lookup failure, etc.), the exception escapes
the method as an unhandled `RpcException` *and* leaks the freshly-created
`relayActor`. The `finally` block at line 211 is unreachable because the
throw happens before the `try`. The actor's `_activeStreams` entry, the
`StreamEntry.Cts`, and the `Channel<SiteStreamEvent>` are also leaked.
In normal operation `_streamSubscriber.Subscribe` does not throw, so the bug is
latent — but a misbehaving site runtime (e.g. `SiteStreamManager` faulted
because the actor system is shutting down) would surface it.
**Recommendation**
Restructure to either (a) wrap the `Subscribe` call in a `try` whose `catch`
stops the relay actor and disposes the CTS, or (b) move the actor + subscriber
creation *inside* the existing `try` block (the `finally` will then handle
cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down
past the `try {` brace.
**Resolution (2026-05-28):** Took option (a). `_streamSubscriber.Subscribe(...)`
is now wrapped in its own try/catch — on throw, the freshly-created relay actor
is stopped via `_actorSystem.Stop`, the bounded channel is completed via
`channel.Writer.TryComplete()`, and the `_activeStreams` entry is removed via
the ownership-preserving `TryRemove(KeyValuePair)` overload before the
exception is re-thrown to the caller. Added regression test
`SiteStreamGrpcServerTests.Comm021_SubscribeThrows_StopsRelayActorAndRemovesActiveStreamEntry`
using an NSubstitute `ISiteStreamSubscriber` that throws on Subscribe;
asserts `ActiveStreamCount == 0` and that `RemoveSubscriber` was NOT called
(confirming the catch path, not the finally path).
---
### Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/CentralCommunicationActor.cs:493` |
**Description**
`TrackMessageForCleanup` on `SubscribeDebugViewRequest` does
`_debugSubscriptions[sub.CorrelationId] = (envelope.SiteId, Sender)` (line 493).
The dictionary indexer silently overwrites any prior entry for the same
`CorrelationId`. If two debug sessions ever reuse the same correlation ID (e.g.
two Blazor users start a stream at the same moment with a non-GUID id, or a
caller bug, or a malicious caller as flagged in the cousin
Communication-014), the first subscriber's entry is overwritten and lost —
on a later `ConnectionStateChanged(false)` (per Communication-016 it never
actually fires today, but the design intent stands), only the *second*
subscriber would be notified of the disconnect.
`DebugStreamService.StartStreamAsync` uses `Guid.NewGuid().ToString("N")` as
the session id (`DebugStreamService.cs:97`), so a real collision is
astronomically unlikely in normal operation. But the central side is not
defending itself: a CLI consumer or a future caller is implicitly trusted to
generate globally-unique ids.
**Recommendation**
When the slot is already occupied, log a Warning and either reject the new
subscription with an error response or evict the prior subscriber via
`DebugStreamTerminated` before installing the new one. Mirrors the
`SiteStreamGrpcServer` defensive behaviour where a duplicate `correlation_id`
cancels the existing stream (line 167).
**Resolution (2026-05-28):** Closed by Comm-016 — field removed in commit ac96b83.
The `_debugSubscriptions` dictionary, `TrackMessageForCleanup` helper, and the
`HandleConnectionStateChanged` handler that consumed them were all deleted as
part of Comm-016's dead-code purge. There is no longer any caller-supplied
correlation-id keyed map to overwrite — the orphan-on-reuse hazard is gone.
---
### Communication-023 — Design doc describes `SiteStreamService` as a single RPC and reconciliation as ClusterClient-only; both are now stale
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `docs/requirements/Component-Communication.md:85`, `docs/requirements/Component-Communication.md:160-166`, `src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto:8-14` |
**Description**
The design doc's "gRPC Proto Definition" section states: *"**Service**: `SiteStreamService`
with **a single RPC** `SubscribeInstance(InstanceStreamRequest) returns (stream
SiteStreamEvent)`."* (line 85). The proto and the generated service now expose **five**
RPCs — `SubscribeInstance`, `IngestAuditEvents`, `IngestCachedTelemetry`,
`PullAuditEvents`, and `PullSiteCalls` (`sitestream.proto:8-14`,
`SiteStreamGrpcServer.cs:211/330/386/436/519`). The four new RPCs are not mentioned
anywhere in the component doc — neither the proto section, the responsibilities, the
topology diagram, nor the Dependencies (the server now also depends on `ISiteAuditQueue`
and `IOperationTrackingStore`).
This drift also contradicts two transport statements elsewhere in the same doc. The
"Cached Call Telemetry" section (line 166) and the "Notification Submission" section
(line 158) both assert **"Transport: ClusterClient"**, and the reconciliation-pull
paragraph (line 164) describes `CachedCallReconcileRequest`/`CachedCallReconcileResponse`.
In the shipped code the reconciliation **pull** runs over **gRPC**, not ClusterClient:
`GrpcPullAuditEventsClient` / `GrpcPullSiteCallsClient`
(`src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/`) call the `PullAuditEvents` / `PullSiteCalls`
RPCs against `SiteStreamGrpcServer`. So the gRPC channel — documented as carrying
**real-time data only** ("`SiteCommunicationActor` and `CentralCommunicationActor` have no
role in streaming event delivery", "gRPC server-streaming for real-time data") — now also
carries site→central audit/telemetry **ingest** and central→site **reconciliation pulls**.
A reader relying on the doc to understand the two-transport split (the central CLAUDE.md
"Data & Communication" invariant) would be materially misled about what flows over which
transport.
**Recommendation**
Update `Component-Communication.md`: (1) replace "a single RPC" with the full RPC list
and document each new RPC's direction, payload, and idempotency; (2) reconcile the
"Transport: ClusterClient" / `CachedCallReconcile*` wording in §10 (and the Site Call
Audit / Audit Log interaction notes) with the gRPC `PullAuditEvents` / `PullSiteCalls`
reality; (3) add `ISiteAuditQueue` / `IOperationTrackingStore` to the SiteStreamGrpcServer
description and Dependencies. If the doc's "gRPC = real-time data only" framing is meant
to be a hard invariant, state explicitly that audit/telemetry ingest and reconciliation
are the documented exceptions.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the design doc (Component-Communication.md) now
enumerates all five `SiteStreamService` RPCs (the one server-streaming `SubscribeInstance`
plus the unary `IngestAuditEvents`/`IngestCachedTelemetry`/`PullAuditEvents`/`PullSiteCalls`),
reconciles the §10 reconciliation wording with the gRPC-pull reality (push telemetry stays
ClusterClient; reconciliation pulls are gRPC request/response), and adds `ISiteAuditQueue`/`IOperationTrackingStore`
to Dependencies. Doc-only.
---
### Communication-024 — AlarmStateUpdate proto carries fields 2223 but the design-doc field table stops at 21
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `docs/requirements/Component-Communication.md:92-114`, `src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto:87-88` |
**Description**
The "Enriched AlarmStateUpdate" section states the message "was extended **additively**:
existing fields 17 are unchanged, and **fields 821** carry the enriched native-alarm
state" (line 92), and the field table enumerates fields 821 (lines 94-109). The proto
now also defines **field 22 `native_source_canonical_name`** and **field 23
`is_configured_placeholder`** (`sitestream.proto:87-88`), both populated on the server
(`StreamRelayActor.HandleAlarmStateChanged`, lines 97-98) and read back on the client
(`SiteStreamGrpcClient.ConvertToDomainEvent`, lines 257-258). The doc's field table and
its "fields 821" / "repopulated from fields 821" prose (lines 92, 112) are stale — they
omit fields 2223. `is_configured_placeholder` in particular is load-bearing: it is the
discriminator `StreamRelayActor` uses to drop snapshot-only placeholder rows before
packing the live stream (lines 63-66), a behaviour the doc does not capture.
The proto evolution itself is correct (additive, field numbers appended and never reused),
so this is purely doc staleness, not a contract regression.
**Recommendation**
Extend the AlarmStateUpdate field table to fields 2223, update the "fields 821" prose to
"823" in both the intro and the client-mapping bullet, and add a one-line note that
`is_configured_placeholder` rows are dropped at the relay (never relayed to the live gRPC
stream).
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the `AlarmStateUpdate` field table was extended to
fields 2223 (`native_source_canonical_name`, `is_configured_placeholder`) with a note that
placeholder rows are dropped at the relay. Doc-only.