fix(worker): resilient failover switch; FIPS-safe synthetic GUID; dup-reference guard + tests (Worker-026..028, Worker.Tests-031..033)
This commit is contained in:
@@ -4,11 +4,48 @@
|
||||
|---|---|
|
||||
| Module | `src/ZB.MOM.WW.MxGateway.Worker.Tests` |
|
||||
| Reviewer | Claude Code |
|
||||
| Review date | 2026-05-24 |
|
||||
| Commit reviewed | `42b0037` |
|
||||
| Review date | 2026-06-15 |
|
||||
| Commit reviewed | `410acc9` |
|
||||
| Status | Re-reviewed |
|
||||
| Open findings | 0 |
|
||||
|
||||
## 2026-06-15 re-review (commit `410acc9`)
|
||||
|
||||
Re-review of the alarm-fallback test additions in `git diff 42b0037..HEAD --
|
||||
src/ZB.MOM.WW.MxGateway.Worker.Tests/`. New unit suites land for the subtag
|
||||
fallback (`SubtagAlarmConsumerTests`, `SubtagAlarmStateMachineTests`,
|
||||
`SyntheticAlarmGuidTests`, `LmxSubtagAlarmSourceTests`) and the auto-failover
|
||||
composite (`FailoverAlarmConsumerTests`); the existing alarm suites are updated
|
||||
for the `SubscribeAlarmsCommand`-based handler signature, the
|
||||
`(eq, affinity, comFactory)` handler-factory delegate, and the new
|
||||
degraded/source-provider fields. Most of the change is genuinely new coverage
|
||||
plus a large volume of XML-doc additions on existing test doubles (benign).
|
||||
|
||||
Findings: the failover state-machine transitions (failover at threshold,
|
||||
failback after stable probes, intermittent-failure reset, before/after-switch
|
||||
forwarding, ack delegation, `ProbeOnce`-never-re-Subscribes) are all covered;
|
||||
the acked latch (`OutOfOrderAckThenClear_StillEmitsAckRtn`), the dup-address
|
||||
guard (`DuplicateActiveSubtag_Throws`), and the exact-match-vs-substring ack
|
||||
resolution (`AcknowledgeByName_PrefixNameDoesNotFalseMatch`,
|
||||
`AcknowledgeByGuid_*`) are all pinned. Three coverage gaps remain
|
||||
(Worker.Tests-031/032/033), all in new alarm-fallback code paths. The two
|
||||
newest files (`SyntheticAlarmGuidTests`, `LmxSubtagAlarmSourceTests`) omit an
|
||||
explicit `using Xunit;` but compile via the `<Using Include="Xunit" />` global
|
||||
using in the csproj, so that is not a finding.
|
||||
|
||||
| # | Category | Result |
|
||||
|---|---|---|
|
||||
| 1 | Correctness & logic bugs | No issues found — failover/state-machine/ack tests assert meaningful post-conditions (mode, emitted state, target subtag address) and do not pass for the wrong reason; the prefix-name and unknown-guid negative cases pin the exact-match contract. |
|
||||
| 2 | mxaccessgw conventions | No issues found — new test methods follow `Method_Scenario_Expectation`; STA-affinity respected (state machine / consumer driven synchronously through internal seams). |
|
||||
| 3 | Concurrency & thread safety | No issues found — new failover/subtag suites are single-threaded and event-driven; no wall-clock floors or fixed sleeps were introduced (the `MxAccessValueCacheTests` change only deletes the old Worker.Tests-020 comment block). |
|
||||
| 4 | Error handling & resilience | Issues found: Worker.Tests-032 — the `RunPrimary` `when (ex is not OutOfMemoryException)` filter (the OOM-safe catch path) and the `FailoverSettings` clamp branches are untested. |
|
||||
| 5 | Security | No issues found — no secrets/credentials; ack-operator identity fields are sentinels. |
|
||||
| 6 | Performance & resource management | No issues found — `IDisposable` test subjects use `using`; the `LmxSubtagAlarmSource` dispose-idempotency / unadvise-only-advised-handles teardown is regression-tested. |
|
||||
| 7 | Design-document adherence | No issues found — tests mirror the alarm-fallback plan (degraded flag, synthetic GUID, subtag-ack via ack-comment, single-subscribe primary). |
|
||||
| 8 | Code organization & conventions | No issues found — new suites live under `MxAccess/`; test doubles are per-file (acceptable for these narrow fakes). |
|
||||
| 9 | Testing coverage | Issues found: Worker.Tests-031 (`ProbeIntervalSeconds` throttle-active branch never exercised — every test uses `probeIntervalSeconds: 0`), Worker.Tests-033 (`SubtagAlarmStateMachine` ack-while-inactive and priority-subtag branches uncovered). |
|
||||
| 10 | Documentation & comments | No issues found — test XML docs match assertions; no misleading names observed. |
|
||||
|
||||
## 2026-05-24 re-review (commit `42b0037`)
|
||||
|
||||
**Re-review: no new findings.** `git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker.Tests` returns empty — the Worker.Tests module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the `d692232` pass. The header is bumped to track the latest reviewed commit; Worker.Tests-001..030 remain closed.
|
||||
@@ -533,3 +570,48 @@ findings (Worker.Tests-001 through -030) are unaffected.
|
||||
**Recommendation:** Either (a) reassign `CreateCancelEnvelope` to a sequence value `>` shutdown (or pass the sequence as a parameter, matching `CreateGatewayHelloEnvelope`'s parameter style), so the wire trace reads in ascending order; (b) add an XML-doc note on the cancel test stating that the worker has no inbound monotonicity check and the test ignores envelope sequence ordering; (c) parameterise all four helper methods so each test passes its desired sequence and the literal numbers stop carrying implicit meaning. Option (c) is the cleanest because `CreateGatewayHelloEnvelope` is already parameter-driven for nonce/version.
|
||||
|
||||
**Resolution:** 2026-05-20 — Took option (c): parameterised `CreateGatewayHelloEnvelope`/`CreateCommandEnvelope`/`CreateCancelEnvelope`/`CreateShutdownEnvelope` with a `ulong sequence` argument (defaults 1/2/2/3 respectively, matching the typical Hello/Command/Cancel/Shutdown ordering), so the literal sequence values no longer carry implicit meaning. Updated the cancel-correlation test's wire trace to ascend (Hello=1, Cancel=2, Shutdown=3) and added a comment noting that the worker has no inbound monotonicity check — the parameter exists so multi-frame tests can pin the trace ordering explicitly when needed.
|
||||
|
||||
### Worker.Tests-031
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Medium |
|
||||
| Category | Testing coverage |
|
||||
| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs` (all `FailoverSettings` constructions) |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** Every `FailoverSettings` in `FailoverAlarmConsumerTests` is built with `probeIntervalSeconds: 0`, which deliberately *disables* the probe throttle. The throttle-active branch in `FailoverAlarmConsumer.ProbeOnce` (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs:211-215`) — where a probe is *skipped* because fewer than `ProbeIntervalSeconds` have elapsed since `lastProbeAtUtc` — is therefore never exercised. This is a genuine production behaviour: the failback cadence is the only thing preventing a degraded worker from hammering the broken primary with a `PollOnce` on every timer tick, and `AlarmCommandHandlerTests.Subscribe_AutoModeWithWatchList_...` wires a real non-zero `FailbackProbeIntervalSeconds = 1` into the handler, so the throttle is on the live path. A regression that inverted the comparison (probing only *after* the interval became `>=` instead of skipping while `<`), dropped the `lastProbeAtUtc` update, or removed the throttle entirely would not be caught by any test. The task brief named "ProbeIntervalSeconds enforcement" as an explicit focus area.
|
||||
|
||||
**Recommendation:** Add a test that constructs `FailoverSettings(threshold: 1, probeIntervalSeconds: <N>, stableProbes: 1)` with a non-zero interval, forces failover, makes the primary healthy, then calls `ProbeOnce()` twice in quick succession and asserts the second call did *not* probe (e.g. assert `primary.Polls` advanced by exactly one and `Mode` is still `Subtag`). Because the throttle reads `DateTime.UtcNow` directly, either accept a coarse same-wall-clock-instant assertion (two back-to-back calls reliably fall inside any interval ≥ 1s) or, preferably, refactor `ProbeOnce` to take an injectable clock so the throttle boundary can be pinned deterministically without wall-clock dependence (consistent with the Worker.Tests-020 manual-time-source approach).
|
||||
|
||||
**Resolution:** 2026-06-15 — Took the coarse same-wall-clock-instant approach (no production-code clock injection needed). Added `FailoverAlarmConsumerTests.ProbeOnce_WithNonZeroInterval_ThrottlesSecondProbeWithinInterval`: builds `FailoverSettings(threshold: 1, probeIntervalSeconds: 3600, stableProbes: 5)`, forces failover to Subtag, makes the primary healthy, then calls `ProbeOnce()` twice back-to-back. The first probe re-polls the primary (`primary.Polls == 1`); the second falls inside the 3600s interval and is throttled, so `primary.Polls` is unchanged and `Mode` stays `Subtag`. `stableProbes: 5` keeps a single clean probe from failing back, so the throttled `ProbeOnce` path stays in scope. A 1-hour interval makes the two back-to-back calls reliably fall inside the window without any timing flakiness.
|
||||
|
||||
### Worker.Tests-032
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs` |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** Two resilience branches of `FailoverAlarmConsumer` are uncovered by the new tests. (1) `RunPrimary` catches `Exception ex when (ex is not OutOfMemoryException)` (`FailoverAlarmConsumer.cs:295`) — the OOM-safe catch path the task brief explicitly called out. No test throws `OutOfMemoryException` from the primary to verify it *propagates* (rather than being swallowed and counted toward the failover threshold like every other exception); the `FlakyPrimary` fake throws only `COMException`. A regression that broadened the filter to swallow OOM would convert a fatal allocation failure into a silent failover. (2) The `FailoverSettings` constructor clamps `threshold < 1 → 1` and `stableProbes < 1 → 1` (`FailoverSettings.cs:38-40`); no test passes a sub-1 value to confirm the clamp, so a misconfigured `ConsecutiveFailureThreshold = 0` from the gateway could change failover semantics undetected.
|
||||
|
||||
**Recommendation:** Add a `FlakyPrimary`-style fake (or a flag on the existing one) that throws `OutOfMemoryException` from `PollOnce`, and assert `sut.PollOnce()` rethrows it via `Assert.Throws<OutOfMemoryException>` and that no `ProviderModeChanged` fired. Add a small `FailoverSettings` fact (or `[Theory]`) asserting `new FailoverSettings(0, 0, 0).Threshold == 1` and `.StableProbes == 1` to pin the clamp.
|
||||
|
||||
**Resolution:** 2026-06-15 — Added a `ThrowOutOfMemoryOnPoll` flag to the existing `FlakyPrimary` fake (its `PollOnce` throws `OutOfMemoryException` when set, checked before the `COMException` branch). Regression test `FailoverAlarmConsumerTests.RunPrimary_WhenPrimaryThrowsOutOfMemory_PropagatesAndDoesNotFailOver` drives `PollOnce` through the primary, asserts `Assert.Throws<OutOfMemoryException>`, and asserts no `ProviderModeChanged` fired and `Mode` stays `Alarmmgr` — pinning that the `when (ex is not OutOfMemoryException)` filter lets OOM propagate rather than swallowing it and counting it toward the failover threshold. The clamp is pinned by `FailoverSettings_ClampsSubMinimumValues` (a `[Theory]`): `(0,0,0)→(1,0,1)`, `(-5,-5,-5)→(1,0,1)`, and a pass-through `(3,7,2)→(3,7,2)` to confirm in-range values are not altered.
|
||||
|
||||
### Worker.Tests-033
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs` |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** `SubtagAlarmStateMachineTests` covers the core transition matrix and the acked latch well, but two branches of the new state machine are unexercised. (1) The ack-while-inactive path in `SubtagAlarmStateMachine.ApplyAcked` (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs:156-164`): when `.acked` flips true while the alarm is *not* active, the machine must emit nothing and must *not* set `AckedDuringEpisode` — otherwise a stale ack from a prior episode could mis-latch the next raise into a spurious `ACK_RTN`. No test drives an `.acked` change without a preceding active raise. (2) The priority-subtag path (`SubtagRole.Priority` → `state.Priority = CoerceInt(...)`, line 76-78): `SubtagAlarmConsumerTests.Subscribe_AdvisesAllSubtagsIncludingAckComment` confirms the priority subtag is *advised*, but no test raises a priority value change and asserts it flows into the emitted/snapshot record's `Priority`, so `CoerceInt` and the priority assignment are untested in the state-machine layer.
|
||||
|
||||
**Recommendation:** Add (a) `AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch` — apply `.acked=true` with no prior active raise, assert `Apply` returns empty, then raise active and clear and assert the clear emits `UnackRtn` (proving the stale ack did not latch); and (b) `PriorityChange_FlowsIntoEmittedRecord` — apply a priority value then an active raise and assert the emitted record's `Priority` equals the supplied value (and a `CoerceInt` string/garbage case falls back).
|
||||
|
||||
**Resolution:** 2026-06-15 — Added both tests to `SubtagAlarmStateMachineTests`. `AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch` applies `.acked=true` with no preceding active raise (asserts `Apply` returns empty), then drives a fresh raise→clear episode and asserts the clear emits `UnackRtn` — proving the stale inactive ack did not latch `AckedDuringEpisode`. `PriorityChange_FlowsIntoEmittedRecord` (the target now includes a `PrioritySubtag`) applies an `int` priority `750` (asserts the priority change emits nothing), raises active and asserts the emitted record's `Priority == 750` (exercising `CoerceInt`'s `int` path and the priority assignment), then applies a non-numeric `"not-a-number"` priority and asserts the snapshot `Priority` is still `750` (the `CoerceInt` string fallback keeps the prior value, not zero).
|
||||
|
||||
Reference in New Issue
Block a user