test(gateway): cover failback reason, FromFeed/SinceUtc badge paths; style + bounded drain (Tests-032..035)

2026-06-15 02:46:06 -04:00
parent b57d02cc4d
commit 56dd56954b
5 changed files with 328 additions and 15 deletions
@@ -4,13 +4,43 @@
 |---|---|
 | Module | `src/ZB.MOM.WW.MxGateway.Tests` |
 | Reviewer | Claude Code |
-| Review date | 2026-05-24 |
-| Commit reviewed | `42b0037` |
+| Review date | 2026-06-15 |
+| Commit reviewed | `410acc9` |
 | Status | Re-reviewed |
 | Open findings | 0 |

 ## Checklist coverage

+### 2026-06-15 re-review (commit `410acc9`)
+
+Re-review of the `42b0037..410acc9` diff (≈57 files), scoped to the alarm-provider
+fallback feature: the end-to-end failover/failback lifecycle test
+(`AlarmFailoverEndToEndTests`), the provider-mode/metric tests
+(`GatewayAlarmMonitorProviderModeTests`), the watch-list resolver tests
+(`AlarmWatchListResolverTests`), the validator additions
+(`GatewayOptionsValidatorTests` AlarmFallback block), the dashboard badge model
+(`DashboardBrowseAndAlarmModelTests`), the alarm metric tests
+(`GatewayMetricsTests`), the Galaxy alarm mapper (`GalaxyAlarmAttributeMappingTests`),
+and the new `provider_status` / degraded-provenance protobuf round-trips. The
+non-alarm churn in the diff (kill/shutdown SessionManager tests closing prior
+Tests-028/029, XML-doc-only additions to `SessionManagerBulkTests`/`GatewaySessionTests`,
+browse-tab and TLS tests) was walked but is not the review focus.
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness & logic bugs | No issues found in this diff. The lifecycle test correctly disambiguates the recovery `ProviderStatus` from the baseline by matching on `Reason == "recovered"`; the `ModeString_MapsToForcedProviderMode` `Assert.Empty(WatchList)` is weak (the stub resolver returns `[]` regardless of mode) but not wrong. |
+| 2 | mxaccessgw conventions | Issue found: Tests-034 (`GatewayLogRedactorSeamTests.cs` is in the global namespace with redundant `System.Collections.Generic`/`Xunit` usings, `var`, and a non-`sealed` `public class` — the same C# style drift family as the resolved Tests-008). |
+| 3 | Concurrency & thread safety | Issue found: Tests-035 (`AlarmFailoverEndToEndTests.DegradedTransition_*`'s second-subscriber `await foreach`-to-`SnapshotComplete` has no `WaitTimeout`, so a regression that never emits `SnapshotComplete` hangs the test instead of failing cleanly). The metric/feed reader races are correctly gated by `baselineReceived` TCS before emitting events. |
+| 4 | Error handling & resilience | No issues found in this diff. `AlarmWatchListResolverTests.ResolveAsync_RepositoryThrows_LogsAndReturnsConfigOnlySet` covers the discovery-unavailable degradation path; validator failure paths are well covered. |
+| 5 | Security | No issues found in this diff. The redaction seam assertion in `GatewayLogRedactorSeamTests` (despite its style drift) meaningfully pins API-key masking in `ClientIdentity`; secured-bulk credential round-trips are pinned. |
+| 6 | Performance & resource management | No issues found in this diff. Monitors/CTSs are disposed; `using GatewayMetrics`/`using GatewayAlarmMonitor` throughout. |
+| 7 | Design-document adherence | No issues found in this diff. Tests match the alarm-fallback plan and the forced-vs-failover-degraded badge distinction. |
+| 8 | Code organization & conventions | See Tests-034. The two alarm-monitor test files replicate (not share) the `FakeSessionManager`/`StubWatchListResolver` harness; the in-file remark documents this is deliberate to keep the sibling untouched — acceptable, not filed. |
+| 9 | Testing coverage | Issues found: Tests-032 (the monitor's `toMode`→`AlarmProviderSwitchReason` derivation — Subtag→Failover, Alarmmgr→Failback — is untested: `Failback` is asserted nowhere and the monitor tests check only the switch *count*, so a swapped/`Unknown` reason regression passes), Tests-033 (`DashboardAlarmProviderStatus.FromFeed` and its non-provider-status `ArgumentException` guard, the `SinceUtc` mapping, the `DegradedLabel` text, and the `Degraded && Mode==Alarmmgr` guard branch are all uncovered). |
+| 10 | Documentation & comments | No issues found in this diff. New alarm test files carry orienting class-level summaries; `GalaxyAlarmAttributeMappingTests`'s "derivation" framing slightly overstates the pass-through mapper but is harmless. |
+
+### 2026-05-24 re-review of the Tests-013–019 batch
+
 This pass (commit `a020350`) re-reviews the module after the Tests-013–019 batch was resolved alongside Server-017, Server-021, and Contracts-010.

 | # | Category | Result |
@@ -557,3 +587,63 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t
 **Recommendation:** (a) The cheap fix: have `ThrowOnceThenYieldSnapshotService` record `_firstThrowAt = DateTimeOffset.UtcNow` immediately before the `throw`, and change the assertion to `secondSubscribeAt - firstThrowAt >= reconnectDelay - 10ms` — the gap then measures only the reconnect delay, eliminating the variable scheduling baseline. (b) The deeper fix: extend `DashboardSnapshotPublisher` to accept an `ITimeProvider`-style delay seam (or a virtual `DelayAsync` hook) so a `ManualTimeProvider` could advance time deterministically. (a) is preferred for now; (b) belongs as a follow-up if more reconnect-loop tests are added.

 **Resolution:** 2026-05-24 — Applied option (a). Added `FirstThrowAt` to `ThrowOnceThenYieldSnapshotService` and set it via `FirstThrowAt = DateTimeOffset.UtcNow;` immediately before the first-call `throw`. Removed the pre-`StartAsync` `startedAt` baseline; the assertion now reads `gap = secondSubscribeAt - firstThrowAt` (both timestamps captured inside the fake), and the 10 ms slack absorbs the Windows `Task.Delay` quantum without the variable `StartAsync` / scheduling overhead in the baseline. This is the same flake-isolation pattern Tests-006 / Tests-017 used (measuring only the production delay, not test-side setup). Suite green; the test passes deterministically across repeated runs.
+
+### Tests-032
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Testing coverage |
+| Location | `src/ZB.MOM.WW.MxGateway.Server/Alarms/GatewayAlarmMonitor.cs:435-441`, `src/ZB.MOM.WW.MxGateway.Tests/Alarms/GatewayAlarmMonitorProviderModeTests.cs`, `src/ZB.MOM.WW.MxGateway.Tests/Alarms/AlarmFailoverEndToEndTests.cs` |
+| Status | Resolved |
+
+**Description:** `GatewayAlarmMonitor.HandleProviderModeChanged` derives the provider-switch reason from the target mode: `toMode switch { Subtag => Failover, Alarmmgr => Failback, _ => Unknown }` (lines 435-439), then calls `_metrics.AlarmProviderSwitched(fromModeInt, toModeInt, switchReason)`. No test in the diff asserts this derivation. `GatewayAlarmMonitorProviderModeTests.ProviderModeChange_BroadcastsDegradedStatus_AndIncrementsSwitchMetric` only asserts the switch *count* (`switchCount == 1`) — it never inspects the `from`/`to`/`reason` tags on the measurement. `AlarmFailoverEndToEndTests.ProviderFailoverAndFailback_FullLifecycle` drives both a failover (alarmmgr→subtag) and a failback (subtag→alarmmgr) but asserts only on feed `ProviderStatus` messages, not on the metric tags. The only place the `reason` tag is read is `GatewayMetricsTests.AlarmProviderSwitched_IncrementsCounterWithExpectedTags`, which passes `AlarmProviderSwitchReason.Failover` *explicitly* to the metrics layer — that pins the metrics-side tag formatting, not the monitor's `toMode→reason` mapping. `AlarmProviderSwitchReason.Failback` is asserted nowhere in the suite. A regression that swapped the Failover/Failback arms, or collapsed them to `Unknown`, would pass every existing test while emitting wrong dashboard/observability data for every failback.
+
+**Recommendation:** Extend `GatewayAlarmMonitorProviderModeTests` (or add a failback case) to capture the `reason` tag through a `MeterListener` and assert it equals `"failover"` on an alarmmgr→subtag change and `"failback"` on a subtag→alarmmgr change, mirroring the tag-capturing pattern already in `GatewayMetricsTests.AlarmProviderSwitched_IncrementsCounterWithExpectedTags`. This pins the monitor's `toMode→AlarmProviderSwitchReason` derivation, not just the count.
+
+**Resolution:** 2026-06-15 — Confirmed root cause: the existing monitor tests asserted only the switch *count*, and `Failback` was asserted nowhere in the suite, so a swapped/`Unknown` reason arm would pass. Added `GatewayAlarmMonitorProviderModeTests.ProviderModeChange_FailoverThenFailback_RecordsCorrectReasonTags`, which captures the `reason` tag off the `mxgateway.alarms.provider_switches` counter via a `MeterListener` and drives an alarmmgr→subtag change then a subtag→alarmmgr change, asserting the captured reasons are exactly `["failover", "failback"]`. This pins the monitor's `toMode→AlarmProviderSwitchReason` derivation (`ApplyProviderModeChangeAsync`). Test passes against current production code (no production change); no bug found.
+
+### Tests-033
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Testing coverage |
+| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardAlarmProviderStatus.cs`, `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardBrowseAndAlarmModelTests.cs:140-195` |
+| Status | Resolved |
+
+**Description:** The three new badge-mapping tests cover `FromProviderStatus` for green (Alarmmgr/not-degraded), amber (Subtag/degraded), and cyan (Subtag/forced). Several adjacent behaviours of the same projection are uncovered: (1) `DashboardAlarmProviderStatus.FromFeed(AlarmFeedMessage)` — the public entry the dashboard SignalR snapshot path actually calls — and its `ArgumentException` thrown when the message is not a `ProviderStatus` payload have zero coverage in the suite (a grep for `FromFeed` across the test project returns no hits). (2) The `SinceUtc` field (`status.Since?.ToDateTimeOffset()`) is never asserted, so a regression dropping or mis-converting the badge timestamp would not be caught. (3) The `DegradedLabel` constant text ("Subtag monitoring (degraded)") is asserted nowhere — the amber test only checks the `bg-warning` CSS class, so a label swap would pass. (4) The `degraded = status.Degraded || status.Mode == AlarmProviderMode.Subtag` guard's second branch (`Degraded == true` while `Mode == Alarmmgr`) — an explicitly-degraded alarmmgr status — is untested, so the "guard against either being set independently" comment in the product code is unverified.
+
+**Recommendation:** Add `FromFeed_NonProviderStatusPayload_Throws` (asserting `ArgumentException`) and `FromFeed_ProviderStatusPayload_ProjectsBadge`; assert `SinceUtc` on a status carrying a `Since` timestamp; assert `model.Label == DashboardAlarmProviderStatus.DegradedLabel` in the amber test; and add a `Degraded=true, Mode=Alarmmgr` case asserting it maps to the degraded (amber) badge per the independent-flag guard.
+
+**Resolution:** 2026-06-15 — Confirmed the four coverage gaps against `DashboardAlarmProviderStatus`. Added to `DashboardBrowseAndAlarmModelTests`: `FromFeed_ProviderStatusPayload_ProjectsBadge` and `FromFeed_NonProviderStatusPayload_Throws` (the latter asserts `ArgumentException` for a `SnapshotComplete` feed message); `FromProviderStatus_WithSinceTimestamp_MapsSinceUtc` (pins `SinceUtc` round-trips the protobuf `Since` timestamp); `FromProviderStatus_Alarmmgr_DegradedFlagSet_WarningBadge` (the `Degraded && Mode==Alarmmgr` independent-flag branch maps to the amber degraded badge); and a `DegradedLabel` text assertion added to the existing amber `FromProviderStatus_Subtag_Degraded_WarningBadge`. All pass against current production code (no production change); no bug found.
+
+### Tests-034
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | mxaccessgw conventions |
+| Location | `src/ZB.MOM.WW.MxGateway.Tests/Diagnostics/GatewayLogRedactorSeamTests.cs:1-15` |
+| Status | Resolved |
+
+**Description:** `GatewayLogRedactorSeamTests.cs` diverges from the project's C# style guide and the rest of the test suite: it declares no file-scoped namespace (the class lands in the global namespace, unlike every other test file which sits under `ZB.MOM.WW.MxGateway.Tests.*`); it carries redundant explicit `using System.Collections.Generic;` and `using Xunit;` (both are implicit global usings in this project, enforced elsewhere — see the resolved Tests-008); it uses `var` for `redactor`/`props` where the suite uses explicit types per `docs/style-guides/CSharpStyleGuide.md`; and it declares `public class` rather than the project's `sealed`-by-default convention. The redaction assertion itself is sound (it meaningfully pins API-key masking in `ClientIdentity`), so this is purely the same style-drift family as the previously-filed-and-resolved Tests-008, not a correctness issue.
+
+**Recommendation:** Add `namespace ZB.MOM.WW.MxGateway.Tests.Diagnostics;` (file-scoped), drop the redundant `System.Collections.Generic`/`Xunit` usings, mark the class `public sealed class`, and replace the two `var` declarations with explicit types (`GatewayLogRedactorSeam` / `Dictionary<string, object?>`).
+
+**Resolution:** 2026-06-15 — Confirmed the style drift. Rewrote `GatewayLogRedactorSeamTests.cs` to add the file-scoped `namespace ZB.MOM.WW.MxGateway.Tests.Diagnostics;`, dropped the redundant `using System.Collections.Generic;`/`using Xunit;` (both implicit global usings), marked the class `public sealed class`, and replaced the two `var` declarations with explicit `GatewayLogRedactorSeam` / `Dictionary<string, object?>` types. The single `Redact_MasksApiKeyInClientIdentity` assertion is unchanged and still passes.
+
+### Tests-035
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Location | `src/ZB.MOM.WW.MxGateway.Tests/Alarms/AlarmFailoverEndToEndTests.cs:315-329` |
+| Status | Resolved |
+
+**Description:** In `DegradedTransition_CachedThenReplayed_CarriesDegradedAndSourceProviderToNewSubscriber`, the second-subscriber loop iterates `monitor.StreamAsync(null, newStreamCts.Token)` with no timeout, breaking only when a `SnapshotComplete` payload arrives (lines 317-329). Every other wait in this file routes through `WaitForAsync(..., WaitTimeout)` or `Task.WaitAsync(WaitTimeout)`; this `await foreach` does not. If a regression caused the monitor to stop emitting `SnapshotComplete` for a new subscriber (e.g. a snapshot-replay path that throws before the terminal message), the test would hang on the `await foreach` rather than fail with a `TimeoutException`, relying on the xUnit `longRunningTestSeconds` warning or the CI hard-kill instead of a clean assertion failure. The first subscriber in the same test is correctly bounded by `WaitForAsync`.
+
+**Recommendation:** Bound the second-subscriber drain with the same `WaitTimeout` used elsewhere — e.g. link `newStreamCts` to a `CancellationTokenSource.CreateLinkedTokenSource` plus `CancelAfter(WaitTimeout)`, or wrap the drain in a `Task` awaited via `WaitAsync(WaitTimeout)` — so a missing `SnapshotComplete` surfaces as a deterministic failure rather than a hang.
+
+**Resolution:** 2026-06-15 — Confirmed the unbounded `await foreach` in `DegradedTransition_CachedThenReplayed_CarriesDegradedAndSourceProviderToNewSubscriber`. Bounded the second-subscriber drain with a `CancellationTokenSource.CreateLinkedTokenSource(newStreamCts.Token, drainTimeoutCts.Token)` where `drainTimeoutCts.CancelAfter(WaitTimeout)`, and wrapped the loop in a `try/catch (OperationCanceledException) when (drainTimeoutCts.IsCancellationRequested)` that rethrows a `TimeoutException`. A regression that never emits `SnapshotComplete` now fails cleanly instead of hanging. Test still passes.