fix(communication): resolve Communication-001 — early stream termination handling

DebugStreamService.StartStreamAsync awaited the initial debug snapshot inside
a try whose only handler was catch (OperationCanceledException). When the
stream terminated before the snapshot arrived, onTerminatedWrapper completed
the await with an InvalidOperationException that escaped the catch — the
caller got a raw, untranslated exception and the service did no teardown of
its own on that path.

The handler is now catch (Exception): it removes the session entry, sends
StopDebugStream to the bridge actor via the local reference (deterministic,
idempotent teardown), and throws a descriptive exception: TimeoutException
for the 30s timeout, otherwise an InvalidOperationException naming the
instance/site and wrapping the cause.

Re-triaged Critical -> Medium: the originally-claimed multi-minute site-side
resource leak does not occur (the bridge actor self-terminates on every
onTerminated path). Adds the first DebugStreamService test, which fails
against the pre-fix code.
Joseph Doherty
2026-05-16 18:32:52 -04:00
parent 239bee3bc4
commit a9ceba00d0
4 changed files with 146 additions and 44 deletions

@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
-| Open findings | 11 |
+| Open findings | 10 |
## Summary
@@ -16,9 +16,7 @@ The Communication module is generally well-structured and matches the design doc
two-transport model (ClusterClient for command/control, gRPC server-streaming for
real-time data). The actors keep mutable state on the actor thread, use `PipeTo` for
async work, and the gRPC server/client lifecycle is mostly disciplined. However the
-review found one Critical issue (a `TimeoutException` from `DebugStreamService` leaves
-an orphaned bridge actor and an active site-side subscription, leaking resources on
-every snapshot timeout) and several High/Medium issues clustered around two themes:
+review found several High and Medium issues clustered around two themes:
**(a) gRPC subscription bookkeeping races** — `SiteStreamGrpcClient` overwrites and
removes subscription entries by correlation ID without disposal or ownership checks,
so reconnect cycles leak `CancellationTokenSource`es and can cancel the wrong stream;
@@ -44,43 +42,55 @@ mutation races, and the snapshot-timeout cleanup path.
## Findings
-### Communication-001 — Snapshot timeout leaves orphaned bridge actor and site subscription
+### Communication-001 — Early stream termination escapes StartStreamAsync's narrow exception handling
| | |
|--|--|
-| Severity | Critical |
-| Category | Performance & resource management |
-| Status | Open |
-| Location | `src/ScadaLink.Communication/DebugStreamService.cs:139`, `src/ScadaLink.Communication/DebugStreamService.cs:149` |
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Resolved |
+| Location | `src/ScadaLink.Communication/DebugStreamService.cs:130-143` |
+**Re-triaged 2026-05-16:** originally filed Critical, claiming an orphaned bridge actor
+and a multi-minute site-side resource leak on every snapshot timeout. On verification
+that impact does **not** occur: `DebugStreamBridgeActor` calls `CleanupGrpc()` and
+`Context.Stop(Self)` on every path that invokes `onTerminated` (site disconnect, gRPC
+max-retries, `ReceiveTimeout`), so it always self-terminates and releases its gRPC
+subscription; and the pure-timeout path does reach `StopStream`, which also stops it.
+The genuine defect described below is an error-handling gap, not a leak; severity
+corrected to Medium.
**Description**
-When `StartStreamAsync` times out waiting for the initial snapshot it calls
-`StopStream(sessionId)` and throws. `StopStream` only sends `StopDebugStream` to the
-bridge actor **if the session is still in `_sessions`**. But the bridge actor was added
-to `_sessions` at line 124 and is only removed by `onTerminatedWrapper`. The serious
-case is the race where `onTerminatedWrapper` fires first (e.g. site disconnect arrives
-during the wait): `snapshotTcs.TrySetException` completes the await with an
-`InvalidOperationException` rather than `OperationCanceledException`, which is **not**
-caught by the `catch (OperationCanceledException)` block. The exception propagates
-uncaught, `StopStream` is never reached, and if the bridge actor is instead orphaned
-(snapshot never arrives, site silent, no terminate) the only cleanup is the 5-minute
-`ReceiveTimeout` in the actor, meaning a site-side `StreamRelayActor` and gRPC stream
-can stay alive for up to 5 minutes after the central caller has given up. Combined with
-the 30s timeout, every transient snapshot delay leaks site resources for minutes.
+`StartStreamAsync` awaits the initial snapshot inside a `try` whose only handler is
+`catch (OperationCanceledException)`. When the stream terminates before the snapshot
+arrives, `onTerminatedWrapper` completes the await via
+`snapshotTcs.TrySetException(new InvalidOperationException(...))`. That
+`InvalidOperationException` is not an `OperationCanceledException`, so it escapes the
+catch entirely: the caller (Blazor debug view / SignalR hub) receives a raw,
+untranslated exception, and `StartStreamAsync` performs no teardown of its own on that
+path; it relies implicitly on the bridge actor self-terminating. Cleanup from the
+service side is therefore not deterministic, and the failure surfaced to the caller is
+not a meaningful, documented result.
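As an illustration only (not part of the commit), the escape described above can be reproduced in a few lines. This is a minimal sketch under stated assumptions: `NarrowCatchDemo`, `AwaitSnapshotAsync`, and `RunAsync` are hypothetical names, and the wait is assumed to be expressed with `Task.WaitAsync` (.NET 6+), which may differ from how the service actually implements its 30s timeout.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class NarrowCatchDemo
{
    // Hypothetical stand-in for the snapshot await with the pre-fix, narrow handler.
    public static async Task<string> AwaitSnapshotAsync(
        TaskCompletionSource<string> snapshotTcs, CancellationToken timeout)
    {
        try
        {
            // WaitAsync throws an OperationCanceledException if the token fires first.
            return await snapshotTcs.Task.WaitAsync(timeout);
        }
        catch (OperationCanceledException)
        {
            // Only cancellation/timeout is translated here.
            throw new TimeoutException("Timed out waiting for the initial snapshot.");
        }
    }

    public static async Task<string> RunAsync()
    {
        var tcs = new TaskCompletionSource<string>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

        // Simulates the termination callback firing before the snapshot arrives.
        tcs.TrySetException(new InvalidOperationException("Stream terminated"));

        try
        {
            await AwaitSnapshotAsync(tcs, cts.Token);
            return "no exception";
        }
        catch (Exception ex)
        {
            // Report which exception type reached the caller.
            return ex.GetType().Name;
        }
    }
}
```

Here the caller sees the raw `InvalidOperationException`, because the narrow handler never runs for a faulted (rather than cancelled) task.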
**Recommendation**
-In `StartStreamAsync`, wrap the `await` so that *any* failure or cancellation
-deterministically calls `StopStream(sessionId)` (e.g. `try/catch (Exception)` or a
-`finally` that stops the session when the result was not returned). Ensure
-`StopStream` is idempotent and always sends `StopDebugStream` even if the session was
-already removed, so the bridge actor (and its site-side subscription) is torn down
-promptly rather than waiting for the orphan `ReceiveTimeout`.
+In `StartStreamAsync`, catch any exception from the snapshot await, deterministically
+tear down the bridge actor (`Tell(StopDebugStream)` via the local actor reference, since
+a racing `onTerminatedWrapper` may already have removed the session entry), and translate
+the failure into a meaningful exception for the caller.
**Resolution**
-_Unresolved._
+Resolved 2026-05-16. The `catch (OperationCanceledException)`-only block in
+`StartStreamAsync` was replaced with `catch (Exception)`: it removes the session entry,
+sends `StopDebugStream` to the bridge actor via the local reference (idempotent, since
+the actor may already be stopping itself), and throws a descriptive exception:
+`TimeoutException` for the 30s timeout, otherwise an `InvalidOperationException` that
+names the instance/site and wraps the underlying cause. Regression test
+`DebugStreamServiceTests.StartStreamAsync_StreamTerminatesBeforeSnapshot_ThrowsMeaningfulException`
+fails against the pre-fix code and passes after. Fixed by the commit whose message
+references `Communication-001`.
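For illustration only (not the committed code), the shape of the fix described in the resolution might look like the sketch below. Everything here is an assumption made for the example: `SnapshotWaiter`, `WaitForSnapshotAsync`, the string-valued session map, and the stop counter are hypothetical stand-ins for the service's session dictionary and for `Tell(StopDebugStream)` on the locally held actor reference; the exception messages are invented.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class SnapshotWaiter
{
    // Hypothetical stand-in: the real service maps sessions to bridge actors.
    private readonly ConcurrentDictionary<string, string> _sessions = new();
    private int _stopMessagesSent;

    public int StopMessagesSent => _stopMessagesSent;
    public bool HasSession(string sessionId) => _sessions.ContainsKey(sessionId);

    public async Task<string> WaitForSnapshotAsync(
        string sessionId, string instance, Task<string> snapshot, CancellationToken timeout)
    {
        _sessions[sessionId] = instance;
        try
        {
            return await snapshot.WaitAsync(timeout);
        }
        catch (Exception ex)
        {
            // Safe even if a racing termination callback already removed the entry.
            _sessions.TryRemove(sessionId, out _);

            // Stand-in for sending the stop message via the local actor reference,
            // which does not depend on the session entry still existing.
            Interlocked.Increment(ref _stopMessagesSent);

            // Translate the failure into something meaningful for the caller.
            if (ex is OperationCanceledException)
            {
                throw new TimeoutException(
                    $"Timed out waiting for the initial snapshot from '{instance}'.", ex);
            }

            throw new InvalidOperationException(
                $"Debug stream for '{instance}' terminated before the snapshot arrived.", ex);
        }
    }
}
```

The point of the sketch is that teardown does not go through a lookup that a racing callback can invalidate, and that both failure modes reach the caller as a descriptive exception with the original cause preserved as the inner exception.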
### Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors