fix(server): propagate watch-list cancellation; doc + test gaps (Server-051..053)

Joseph Doherty
2026-06-15 02:39:11 -04:00
parent 410acc92eb
commit 258e09e0de
6 changed files with 283 additions and 6 deletions
+95 -2
@@ -4,8 +4,8 @@
|---|---|
| Module | `src/ZB.MOM.WW.MxGateway.Server` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `42b0037` |
| Review date | 2026-06-15 |
| Commit reviewed | `410acc9` |
| Status | Re-reviewed |
| Open findings | 0 |
@@ -120,6 +120,38 @@ contention nor the bounded `_events` channel saw any changes in this wave.
| 9 | Testing coverage | No issues found in this module — see Tests-026 in the Tests module for the missing EventsHub broadcast coverage. |
| 10 | Documentation & comments | Issues found: Server-040, Server-043 (both documentation gaps). |
### 2026-06-15 re-review (commit 410acc9)
Re-review pass at `410acc9` over the `42b0037..HEAD` diff. The diff is large (~137 files)
but the bulk is vendored theme/CSS/font asset swaps (`wwwroot`), generated code, and the
shared-library auth refactor / TLS cert-autogen / lazy-browse / canonical-audit waves that
each carry their own design+plan and were verified in passing only. This pass is scoped to
the **alarm-provider subtag-fallback** wave the task called out: the central
`GatewayAlarmMonitor` provider-mode seeding + failover/failback handling, the new
`AlarmWatchListResolver` / `IAlarmWatchListResolver`, `AlarmFallbackOptions` /
`AlarmDiscoveryOptions` / `AlarmSubtagNameOptions` and their `GatewayOptionsValidator`
wiring, the `DashboardAlarmProviderStatus` badge + `AlarmsPage.razor` hub attach, the
provider-mode gauge + `provider_switches` counter (`GatewayMetrics`,
`AlarmProviderSwitchReason`), the Galaxy alarm-attribute discovery query
(`GalaxyRepository.GetAlarmAttributesAsync` / `AlarmAttributesSql` / `GalaxyAlarmAttributeRow`),
the `/auth/login` POST move + configurable `Dashboard:CookieName`, and the
`BrowseChildrenRequest` scope-resolver entry. Prior findings Server-044 through Server-050
are confirmed resolved by the SessionManager/GatewaySession changes in range and remain
closed. New findings filed against this pass: Server-051..053.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Server-051 (`AlarmWatchListResolver.ResolveAsync`'s broad `catch (Exception)` swallows `OperationCanceledException`, contradicting the `IAlarmWatchListResolver` cancellation contract). |
| 2 | mxaccessgw conventions | No issues found — file-scoped namespaces, `sealed`, `Async` suffix, Options pattern, MXAccess-aligned naming all hold; no UI component libraries (badge is Bootstrap-only); the alarm SQL is a parameterless constant; no secret/tag-value logging; well-known reason strings centralised in `AlarmProviderReasons`. |
| 3 | Concurrency & thread safety | No issues found — `_providerMode`/`_providerDegraded`/`_providerReason`/`_providerSince` are read/written only under `_sync`; `BroadcastToAll` runs under `_sync`; the reconcile after a mode change is intentionally awaited outside `_sync` to avoid the documented self-deadlock; the provider-mode gauge is serialized on `GatewayMetrics._syncRoot`. |
| 4 | Error handling & resilience | Issues found: Server-051 (cancellation swallowed in the resolver — also an error-handling/contract concern). |
| 5 | Security | No issues found — `BrowseChildren` runs the same `ResolveBrowseSubtrees()` constraint scoping and `MetadataRead` scope as `DiscoverHierarchy`; the configurable `Dashboard:CookieName` falls back to the canonical default and cannot be blanked; the `/auth/login` POST keeps antiforgery + return-URL sanitisation. |
| 6 | Performance & resource management | No issues found in the alarm-fallback code — discovery is a one-shot per subscribe lifecycle; the watch-list is composed once. |
| 7 | Design-document adherence | No issues found — `docs/GatewayConfiguration.md`, `docs/Metrics.md`, `docs/GalaxyRepository.md`, and the `docs/plans/2026-06-13-alarm-subtag-fallback*` / `2026-06-15-forced-subtag-mode-fix.md` plans were landed in the same range and match the code. |
| 8 | Code organization & conventions | No issues found — new alarm types live under `Alarms/`, options under `Configuration/`, metric helper under `Metrics/`, registered via `AddGatewayAlarms`. |
| 9 | Testing coverage | Issues found: Server-053 (`AlarmWatchListResolver` `ExcludeAttributes`-vs-`IncludeAttributes` precedence and the resolver's cancellation contract are untested; no redundant-mode-change guard test). |
| 10 | Documentation & comments | Issues found: Server-052 (`IAlarmWatchListResolver` XML contract claims cancellation propagates while the implementation swallows it; the `Discovery:ExcludeAttributes` doc says "Repository-derived watch-list" while the code also removes matching explicit `IncludeAttributes`). |
## Findings
### Server-001
@@ -929,3 +961,64 @@ Today neither call site has a Blazor error boundary, so an unhandled exception l
**Recommendation:** Add a general `catch (Exception exception)` after the `SessionManagerException` catch in both `CloseSessionAsync` and `KillWorkerAsync`, log a warning (matching the SessionManagerException pattern), and return `DashboardSessionAdminResult.Fail($"{operation} failed unexpectedly. See the gateway log for details.")`. This makes the result type truly the only output the page sees. Add a regression test using a `ThrowingSessionManager` that throws e.g. `InvalidOperationException` from `KillWorkerAsync` and asserts the service returns a failing result rather than propagating.
**Resolution:** 2026-05-24 — Added the recommended general `catch (Exception)` arms to both `DashboardSessionAdminService.CloseSessionAsync` and `KillWorkerAsync` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs`), placed after the `SessionManagerException` catches and behind a `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) throw;` so caller cancellation still propagates cleanly. The new catches log a warning with actor + session id and return `DashboardSessionAdminResult.Fail("{Operation} failed unexpectedly for session {SessionId}. See the gateway log for details.")`, mirroring the SessionManagerException pattern. Regression tests in `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs`: `CloseSessionAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` (the `ISessionManager` throws `InvalidOperationException("unexpected")`) and `KillWorkerAsync_WhenManagerThrowsUnexpected_ReturnsFriendlyFail` (throws `IOException("pipe broken")`); both assert the service returns a failing result with a non-blank message rather than propagating. The fake's new `CloseThrowsUnexpected` / `KillThrowsUnexpected` properties hold the configured exception. Confirmed to fail before the fix (raw exception propagated) and pass after.
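The catch ordering described in this resolution can be sketched as follows. This is a minimal illustration, not the gateway's code: `AdminResult`, `AdminServiceSketch`, and `RunAsync` are hypothetical stand-ins for `DashboardSessionAdminResult` and the service methods, the `SessionManagerException` arm is omitted, and console output replaces the real logger.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for DashboardSessionAdminResult.
public sealed record AdminResult(bool Succeeded, string Message)
{
    public static AdminResult Ok() => new(true, string.Empty);
    public static AdminResult Fail(string message) => new(false, message);
}

public static class AdminServiceSketch
{
    // Runs an admin action and converts unexpected failures into a result,
    // while letting caller-requested cancellation propagate.
    public static async Task<AdminResult> RunAsync(
        string operation,
        string sessionId,
        Func<CancellationToken, Task> action,
        CancellationToken cancellationToken)
    {
        try
        {
            await action(cancellationToken).ConfigureAwait(false);
            return AdminResult.Ok();
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // The caller asked to cancel: rethrow instead of reporting a failure.
            throw;
        }
        catch (Exception ex)
        {
            // Anything else becomes a friendly failure (logging is simplified here).
            Console.WriteLine($"{operation} failed for {sessionId}: {ex.GetType().Name}");
            return AdminResult.Fail(
                $"{operation} failed unexpectedly for session {sessionId}. See the gateway log for details.");
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        var result = await AdminServiceSketch.RunAsync(
            "KillWorker", "s-1", _ => throw new IOException("pipe broken"), CancellationToken.None);
        Console.WriteLine($"{result.Succeeded}: {result.Message}");
    }
}
```

Because the first arm is filtered on `cancellationToken.IsCancellationRequested`, an `OperationCanceledException` that did not originate from the caller's token falls through to the general arm and is reported as a failure.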
### Server-051
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Alarms/AlarmWatchListResolver.cs:64-78` |
| Status | Resolved |
**Description:** `AlarmWatchListResolver.ResolveAsync` wraps the Galaxy Repository discovery call in an unfiltered `catch (Exception ex)` that logs a warning and continues with an empty discovery set, leaving a configuration-only watch-list:
```csharp
try { rows = await _repository.GetAlarmAttributesAsync(cancellationToken)...; }
catch (Exception ex) { _logger.LogWarning(ex, "...continuing with configuration-only watch-list."); rows = []; }
```
`OperationCanceledException` and its subclass `TaskCanceledException` derive from `Exception`, so a cancellation triggered while `GetAlarmAttributesAsync` is awaiting SQL is **swallowed**, not propagated. The resolver then returns a config-only (possibly empty) watch-list as though discovery had completed normally. This directly contradicts the `IAlarmWatchListResolver.ResolveAsync` XML contract, which states: *"Cancellation is the one exception: a triggered cancellationToken still propagates an OperationCanceledException."* In practice the resolver is called from `GatewayAlarmMonitor.SubscribeAlarmsAsync` on the monitor's lifecycle token; if the gateway is shutting down (or the monitor lifecycle is being torn down) mid-discovery, the resolver hides the cancellation and the monitor proceeds to issue `SubscribeAlarms` with that incomplete watch-list instead of unwinding promptly. The `GalaxyRepository.GetAlarmAttributesAsync` SQL path does honour the token (`OpenAsync(ct)` / `ExecuteReaderAsync(ct)` / `ReadAsync(ct)`), so a real cancellation can land inside this catch.
**Recommendation:** Add a `catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { throw; }` ahead of the general catch (or filter the general catch with `when (ex is not OperationCanceledException)`), so cancellation propagates per the documented contract while genuine discovery failures still degrade to a config-only list. Add a regression test that cancels the token mid-`GetAlarmAttributesAsync` and asserts `OperationCanceledException` propagates.
**Resolution:** Resolved 2026-06-15. Confirmed against source: the bare `catch (Exception ex)` swallowed `OperationCanceledException`. Filtered the catch with `when (ex is not OperationCanceledException)` so a real cancellation propagates per the `IAlarmWatchListResolver` contract while genuine discovery failures still degrade to a config-only list. Regression test: `AlarmWatchListResolverTests.ResolveAsync_RepositoryCancelled_PropagatesOperationCanceled` (failed before the fix, passes after).
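The filtered catch described in this resolution can be sketched in isolation. This is an illustration under assumptions, not the resolver's actual source: `DiscoverAsync` and `WatchListSketch` are hypothetical names standing in for the repository call and the resolver, and console output replaces the real logger.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the repository discovery call.
public delegate Task<IReadOnlyList<string>> DiscoverAsync(CancellationToken cancellationToken);

public static class WatchListSketch
{
    // Returns the discovered rows, or an empty list when discovery fails
    // for any reason other than cancellation.
    public static async Task<IReadOnlyList<string>> ResolveAsync(
        DiscoverAsync discover, CancellationToken cancellationToken)
    {
        try
        {
            return await discover(cancellationToken).ConfigureAwait(false);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Genuine discovery failure: degrade to a configuration-only list.
            Console.WriteLine($"discovery failed, continuing without it: {ex.Message}");
            return Array.Empty<string>();
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        // A non-cancellation failure is swallowed and yields no discovered rows.
        var degraded = await WatchListSketch.ResolveAsync(
            _ => throw new InvalidOperationException("sql down"), CancellationToken.None);
        Console.WriteLine($"rows after failure: {degraded.Count}");

        // A cancelled discovery call escapes the filtered catch.
        using var cts = new CancellationTokenSource();
        cts.Cancel();
        try
        {
            await WatchListSketch.ResolveAsync(
                ct => Task.FromCanceled<IReadOnlyList<string>>(ct), cts.Token);
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("cancellation propagated");
        }
    }
}
```

The type-based filter lets every `OperationCanceledException` through, including `TaskCanceledException`, regardless of which token triggered it. The alternative named in the recommendation, a separate arm filtered on `cancellationToken.IsCancellationRequested`, is narrower: it propagates only cancellations requested through the caller's token.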
### Server-052
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Alarms/IAlarmWatchListResolver.cs:24-30`, `src/ZB.MOM.WW.MxGateway.Server/Alarms/AlarmWatchListResolver.cs:101-114`, `docs/GatewayConfiguration.md:247` |
| Status | Resolved |
**Description:** Two prose-vs-code mismatches in the watch-list resolver:
1. The `IAlarmWatchListResolver.ResolveAsync` XML `<returns>` promises that a triggered `cancellationToken` propagates an `OperationCanceledException`, but the implementation swallows it (see Server-051). Whichever way Server-051 is resolved, exactly one of the doc or the code is currently wrong; right now the doc over-promises.
2. `AlarmDiscoveryOptions.ExcludeAttributes` and `docs/GatewayConfiguration.md:247` both describe `ExcludeAttributes` as removing entries from the **"Repository-derived"** watch-list. The implementation's `ordered.RemoveAll(e => excluded.Contains(e.Reference))` runs over the combined list — Galaxy-Repository rows **and** the explicit `Discovery:IncludeAttributes` entries appended just above it — so an exclude entry that matches an explicit include silently removes that include too. The behaviour is defensible (excludes win) but is not what the "Repository-derived" wording says, and an operator who adds an attribute via `IncludeAttributes` and also lists it in `ExcludeAttributes` would be surprised it disappears.
**Recommendation:** For (1), align the `IAlarmWatchListResolver` doc with whatever Server-051 settles on. For (2), either restrict the exclude to GR-discovered rows (apply `RemoveAll` before appending the `IncludeAttributes` entries) or update the option XML doc and `GatewayConfiguration.md` to say excludes are applied to the merged GR-plus-include list and therefore also suppress matching explicit includes.
**Resolution:** Resolved 2026-06-15. (1) The doc no longer over-promises: the Server-051 fix makes the implementation propagate `OperationCanceledException`, so the `IAlarmWatchListResolver.ResolveAsync` `<returns>` doc is now accurate and was left unchanged. (2) Kept the "excludes win" code behaviour (excludes applied to the merged GR-plus-include list) and corrected the prose to match: `AlarmDiscoveryOptions.ExcludeAttributes` XML doc and `docs/GatewayConfiguration.md:247` now state the exclude runs after the GR rows and explicit `IncludeAttributes` are combined, so an exclude matching an explicit include suppresses it too. The "excludes win" precedence is pinned by `AlarmWatchListResolverTests.ResolveAsync_ExcludeAlsoSuppressesMatchingExplicitInclude`.
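The "excludes win" precedence can be illustrated with a small sketch. It assumes plain string references and ordinal comparison. `ExcludePrecedenceSketch.Compose` and the attribute names are hypothetical, and the real resolver works with richer entries (`e.Reference`) and may compare differently.

```csharp
using System;
using System.Collections.Generic;

public static class ExcludePrecedenceSketch
{
    // Merges repository-discovered references with explicit includes, then
    // applies excludes to the merged list, so excludes also drop includes.
    public static List<string> Compose(
        IEnumerable<string> discovered,
        IEnumerable<string> includes,
        IEnumerable<string> excludes)
    {
        var ordered = new List<string>(discovered);
        ordered.AddRange(includes);

        // Assumption: ordinal, case-sensitive matching.
        var excluded = new HashSet<string>(excludes, StringComparer.Ordinal);
        ordered.RemoveAll(reference => excluded.Contains(reference));
        return ordered;
    }
}

public static class Program
{
    public static void Main()
    {
        var watchList = ExcludePrecedenceSketch.Compose(
            discovered: new[] { "Tank1.Level.HiHi", "Tank1.Level.Lo" },
            includes: new[] { "Pump7.Fault" },
            excludes: new[] { "Pump7.Fault", "Tank1.Level.Lo" });

        // Only "Tank1.Level.HiHi" survives: the explicit include was excluded too.
        Console.WriteLine(string.Join(", ", watchList));
    }
}
```

Applying `RemoveAll` before `AddRange` would give the other semantics considered in the recommendation, where excludes affect only repository-discovered rows.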
### Server-053
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Alarms/AlarmWatchListResolverTests.cs`, `src/ZB.MOM.WW.MxGateway.Tests/Alarms/GatewayAlarmMonitorProviderModeTests.cs` |
| Status | Resolved |
**Description:** The new alarm-fallback surface is broadly well-tested (`AlarmWatchListResolverTests`, `GatewayAlarmMonitorProviderModeTests`, `DashboardBrowseAndAlarmModelTests`, `GalaxyAlarmAttributeMappingTests`, `GatewayOptionsValidatorTests`), but two behaviours that the diff introduced have no coverage:
- **Resolver cancellation contract (Server-051):** no test cancels the token mid-discovery and asserts `OperationCanceledException` propagates. The existing `ResolveAsync_RepositoryThrows_LogsAndReturnsConfigOnlySet` asserts only the swallow path, so a cancellation test is precisely the case that would have caught the Server-051 bug, and its absence is why the contract violation went unnoticed.
- **Exclude-vs-include precedence (Server-052 item 2):** no test exercises a `Discovery:IncludeAttributes` entry that also appears in `ExcludeAttributes`, so the "excludes also drop explicit includes" behaviour is unpinned and would silently change if the merge order were edited.
Additionally, `GatewayAlarmMonitor.ApplyProviderModeChangeAsync` increments the `mxgateway.alarms.provider_switches` counter and resets `_providerSince` unconditionally on every `OnAlarmProviderModeChanged` event, with no guard for a redundant event whose `toMode` equals the current mode; there is no test asserting the from==to / no-op behaviour either way.
**Recommendation:** Add resolver tests for (a) cancellation propagation and (b) an include that is also excluded; and a `GatewayAlarmMonitorProviderModeTests` test pinning the provider-switch counter behaviour for a same-mode repeat event (whichever semantics the team intends). These lock down the contracts the Server-051/052 findings expose.
**Resolution:** Resolved 2026-06-15. Added all three missing tests: (a) `AlarmWatchListResolverTests.ResolveAsync_RepositoryCancelled_PropagatesOperationCanceled` (cancellation propagation, also covers Server-051); (b) `AlarmWatchListResolverTests.ResolveAsync_ExcludeAlsoSuppressesMatchingExplicitInclude` (exclude-vs-include precedence, also Server-052 item 2); and (c) `GatewayAlarmMonitorProviderModeTests.ProviderModeChange_RepeatedSameMode_RecordsASwitchForEachEvent`, which pins the existing semantics — each worker-reported `OnAlarmProviderModeChanged` event records a `provider_switches` increment (and resets `_providerSince`) even when `toMode` equals the current mode, since the worker is the authority on when a mode change occurred and the gateway does not synthesize or suppress it.
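The same-mode semantics pinned by test (c) can be sketched as below. This is a simplified model rather than `GatewayAlarmMonitor` itself: `ProviderModeTrackerSketch`, its members, and the mode names `"Native"` and `"Subtag"` are hypothetical, and a plain field stands in for the `mxgateway.alarms.provider_switches` counter.

```csharp
using System;

public sealed class ProviderModeTrackerSketch
{
    private readonly object _sync = new();
    private string _mode = "Native";      // hypothetical initial mode name
    private DateTimeOffset _since;
    private long _switches;

    // Records every reported mode change, even when toMode equals the
    // current mode: no redundancy guard, matching the pinned semantics.
    public void OnModeChanged(string toMode, DateTimeOffset now)
    {
        lock (_sync)
        {
            _mode = toMode;
            _since = now;
            _switches++;
        }
    }

    public (string Mode, DateTimeOffset Since, long Switches) Snapshot()
    {
        lock (_sync)
        {
            return (_mode, _since, _switches);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var tracker = new ProviderModeTrackerSketch();
        var first = new DateTimeOffset(2026, 6, 15, 0, 0, 0, TimeSpan.Zero);

        tracker.OnModeChanged("Subtag", first);
        tracker.OnModeChanged("Subtag", first.AddSeconds(5)); // same-mode repeat

        var (mode, since, switches) = tracker.Snapshot();
        Console.WriteLine($"{mode} since {since:O}, switches={switches}");
    }
}
```

A guarded variant would return early when `toMode` equals `_mode`. The resolution deliberately does not do that, since the worker is treated as the authority on when a mode change occurred.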