diff --git a/code-reviews/Worker.Tests/findings.md b/code-reviews/Worker.Tests/findings.md index fd06549..3b4ecb0 100644 --- a/code-reviews/Worker.Tests/findings.md +++ b/code-reviews/Worker.Tests/findings.md @@ -4,11 +4,48 @@ |---|---| | Module | `src/ZB.MOM.WW.MxGateway.Worker.Tests` | | Reviewer | Claude Code | -| Review date | 2026-05-24 | -| Commit reviewed | `42b0037` | +| Review date | 2026-06-15 | +| Commit reviewed | `410acc9` | | Status | Re-reviewed | | Open findings | 0 | +## 2026-06-15 re-review (commit `410acc9`) + +Re-review of the alarm-fallback test additions in `git diff 42b0037..HEAD -- +src/ZB.MOM.WW.MxGateway.Worker.Tests/`. New unit suites land for the subtag +fallback (`SubtagAlarmConsumerTests`, `SubtagAlarmStateMachineTests`, +`SyntheticAlarmGuidTests`, `LmxSubtagAlarmSourceTests`) and the auto-failover +composite (`FailoverAlarmConsumerTests`); the existing alarm suites are updated +for the `SubscribeAlarmsCommand`-based handler signature, the +`(eq, affinity, comFactory)` handler-factory delegate, and the new +degraded/source-provider fields. Most of the change is genuinely new coverage +plus a large volume of XML-doc additions on existing test doubles (benign). + +Findings: the failover state-machine transitions (failover at threshold, +failback after stable probes, intermittent-failure reset, before/after-switch +forwarding, ack delegation, `ProbeOnce`-never-re-Subscribes) are all covered; +the acked latch (`OutOfOrderAckThenClear_StillEmitsAckRtn`), the dup-address +guard (`DuplicateActiveSubtag_Throws`), and the exact-match-vs-substring ack +resolution (`AcknowledgeByName_PrefixNameDoesNotFalseMatch`, +`AcknowledgeByGuid_*`) are all pinned. Three coverage gaps remain +(Worker.Tests-031/032/033), all in new alarm-fallback code paths. The two +newest files (`SyntheticAlarmGuidTests`, `LmxSubtagAlarmSourceTests`) omit an +explicit `using Xunit;` but compile via the `` global +using in the csproj, so that is not a finding. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found — failover/state-machine/ack tests assert meaningful post-conditions (mode, emitted state, target subtag address) and do not pass for the wrong reason; the prefix-name and unknown-guid negative cases pin the exact-match contract. | +| 2 | mxaccessgw conventions | No issues found — new test methods follow `Method_Scenario_Expectation`; STA-affinity respected (state machine / consumer driven synchronously through internal seams). | +| 3 | Concurrency & thread safety | No issues found — new failover/subtag suites are single-threaded and event-driven; no wall-clock floors or fixed sleeps were introduced (the `MxAccessValueCacheTests` change only deletes the old Worker.Tests-020 comment block). | +| 4 | Error handling & resilience | Issues found: Worker.Tests-032 — the `RunPrimary` `when (ex is not OutOfMemoryException)` filter (the OOM-safe catch path) and the `FailoverSettings` clamp branches are untested. | +| 5 | Security | No issues found — no secrets/credentials; ack-operator identity fields are sentinels. | +| 6 | Performance & resource management | No issues found — `IDisposable` test subjects use `using`; the `LmxSubtagAlarmSource` dispose-idempotency / unadvise-only-advised-handles teardown is regression-tested. | +| 7 | Design-document adherence | No issues found — tests mirror the alarm-fallback plan (degraded flag, synthetic GUID, subtag-ack via ack-comment, single-subscribe primary). | +| 8 | Code organization & conventions | No issues found — new suites live under `MxAccess/`; test doubles are per-file (acceptable for these narrow fakes). | +| 9 | Testing coverage | Issues found: Worker.Tests-031 (`ProbeIntervalSeconds` throttle-active branch never exercised — every test uses `probeIntervalSeconds: 0`), Worker.Tests-033 (`SubtagAlarmStateMachine` ack-while-inactive and priority-subtag branches uncovered). | +| 10 | Documentation & comments | No issues found — test XML docs match assertions; no misleading names observed. | + ## 2026-05-24 re-review (commit `42b0037`) **Re-review: no new findings.** `git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker.Tests` returns empty — the Worker.Tests module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the `d692232` pass. The header is bumped to track the latest reviewed commit; Worker.Tests-001..030 remain closed. @@ -533,3 +570,48 @@ findings (Worker.Tests-001 through -030) are unaffected. **Recommendation:** Either (a) reassign `CreateCancelEnvelope` to a sequence value `>` shutdown (or pass the sequence as a parameter, matching `CreateGatewayHelloEnvelope`'s parameter style), so the wire trace reads in ascending order; (b) add an XML-doc note on the cancel test stating that the worker has no inbound monotonicity check and the test ignores envelope sequence ordering; (c) parameterise all four helper methods so each test passes its desired sequence and the literal numbers stop carrying implicit meaning. Option (c) is the cleanest because `CreateGatewayHelloEnvelope` is already parameter-driven for nonce/version. **Resolution:** 2026-05-20 — Took option (c): parameterised `CreateGatewayHelloEnvelope`/`CreateCommandEnvelope`/`CreateCancelEnvelope`/`CreateShutdownEnvelope` with a `ulong sequence` argument (defaults 1/2/2/3 respectively, matching the typical Hello/Command/Cancel/Shutdown ordering), so the literal sequence values no longer carry implicit meaning. Updated the cancel-correlation test's wire trace to ascend (Hello=1, Cancel=2, Shutdown=3) and added a comment noting that the worker has no inbound monotonicity check — the parameter exists so multi-frame tests can pin the trace ordering explicitly when needed. + +### Worker.Tests-031 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Testing coverage | +| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs` (all `FailoverSettings` constructions) | +| Status | Resolved | + +**Description:** Every `FailoverSettings` in `FailoverAlarmConsumerTests` is built with `probeIntervalSeconds: 0`, which deliberately *disables* the probe throttle. The throttle-active branch in `FailoverAlarmConsumer.ProbeOnce` (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs:211-215`) — where a probe is *skipped* because fewer than `ProbeIntervalSeconds` have elapsed since `lastProbeAtUtc` — is therefore never exercised. This is a genuine production behaviour: the failback cadence is the only thing preventing a degraded worker from hammering the broken primary with a `PollOnce` on every timer tick, and `AlarmCommandHandlerTests.Subscribe_AutoModeWithWatchList_...` wires a real non-zero `FailbackProbeIntervalSeconds = 1` into the handler, so the throttle is on the live path. A regression that inverted the comparison (probing only *after* the interval became `>=` instead of skipping while `<`), dropped the `lastProbeAtUtc` update, or removed the throttle entirely would not be caught by any test. The task brief named "ProbeIntervalSeconds enforcement" as an explicit focus area. + +**Recommendation:** Add a test that constructs `FailoverSettings(threshold: 1, probeIntervalSeconds: , stableProbes: 1)` with a non-zero interval, forces failover, makes the primary healthy, then calls `ProbeOnce()` twice in quick succession and asserts the second call did *not* probe (e.g. assert `primary.Polls` advanced by exactly one and `Mode` is still `Subtag`). Because the throttle reads `DateTime.UtcNow` directly, either accept a coarse same-wall-clock-instant assertion (two back-to-back calls reliably fall inside any interval ≥ 1s) or, preferably, refactor `ProbeOnce` to take an injectable clock so the throttle boundary can be pinned deterministically without wall-clock dependence (consistent with the Worker.Tests-020 manual-time-source approach). + +**Resolution:** 2026-06-15 — Took the coarse same-wall-clock-instant approach (no production-code clock injection needed). Added `FailoverAlarmConsumerTests.ProbeOnce_WithNonZeroInterval_ThrottlesSecondProbeWithinInterval`: builds `FailoverSettings(threshold: 1, probeIntervalSeconds: 3600, stableProbes: 5)`, forces failover to Subtag, makes the primary healthy, then calls `ProbeOnce()` twice back-to-back. The first probe re-polls the primary (`primary.Polls == 1`); the second falls inside the 3600s interval and is throttled, so `primary.Polls` is unchanged and `Mode` stays `Subtag`. `stableProbes: 5` keeps a single clean probe from failing back, so the throttled `ProbeOnce` path stays in scope. A 1-hour interval makes the two back-to-back calls reliably fall inside the window without any timing flakiness. + +### Worker.Tests-032 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs` | +| Status | Resolved | + +**Description:** Two resilience branches of `FailoverAlarmConsumer` are uncovered by the new tests. (1) `RunPrimary` catches `Exception ex when (ex is not OutOfMemoryException)` (`FailoverAlarmConsumer.cs:295`) — the OOM-safe catch path the task brief explicitly called out. No test throws `OutOfMemoryException` from the primary to verify it *propagates* (rather than being swallowed and counted toward the failover threshold like every other exception); the `FlakyPrimary` fake throws only `COMException`. A regression that broadened the filter to swallow OOM would convert a fatal allocation failure into a silent failover. (2) The `FailoverSettings` constructor clamps `threshold < 1 → 1` and `stableProbes < 1 → 1` (`FailoverSettings.cs:38-40`); no test passes a sub-1 value to confirm the clamp, so a misconfigured `ConsecutiveFailureThreshold = 0` from the gateway could change failover semantics undetected. + +**Recommendation:** Add a `FlakyPrimary`-style fake (or a flag on the existing one) that throws `OutOfMemoryException` from `PollOnce`, and assert `sut.PollOnce()` rethrows it via `Assert.Throws` and that no `ProviderModeChanged` fired. Add a small `FailoverSettings` fact (or `[Theory]`) asserting `new FailoverSettings(0, 0, 0).Threshold == 1` and `.StableProbes == 1` to pin the clamp. + +**Resolution:** 2026-06-15 — Added a `ThrowOutOfMemoryOnPoll` flag to the existing `FlakyPrimary` fake (its `PollOnce` throws `OutOfMemoryException` when set, checked before the `COMException` branch). Regression test `FailoverAlarmConsumerTests.RunPrimary_WhenPrimaryThrowsOutOfMemory_PropagatesAndDoesNotFailOver` drives `PollOnce` through the primary, asserts `Assert.Throws`, and asserts no `ProviderModeChanged` fired and `Mode` stays `Alarmmgr` — pinning that the `when (ex is not OutOfMemoryException)` filter lets OOM propagate rather than swallowing it and counting it toward the failover threshold. The clamp is pinned by `FailoverSettings_ClampsSubMinimumValues` (a `[Theory]`): `(0,0,0)→(1,0,1)`, `(-5,-5,-5)→(1,0,1)`, and a pass-through `(3,7,2)→(3,7,2)` to confirm in-range values are not altered. + +### Worker.Tests-033 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Testing coverage | +| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs` | +| Status | Resolved | + +**Description:** `SubtagAlarmStateMachineTests` covers the core transition matrix and the acked latch well, but two branches of the new state machine are unexercised. (1) The ack-while-inactive path in `SubtagAlarmStateMachine.ApplyAcked` (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs:156-164`): when `.acked` flips true while the alarm is *not* active, the machine must emit nothing and must *not* set `AckedDuringEpisode` — otherwise a stale ack from a prior episode could mis-latch the next raise into a spurious `ACK_RTN`. No test drives an `.acked` change without a preceding active raise. (2) The priority-subtag path (`SubtagRole.Priority` → `state.Priority = CoerceInt(...)`, line 76-78): `SubtagAlarmConsumerTests.Subscribe_AdvisesAllSubtagsIncludingAckComment` confirms the priority subtag is *advised*, but no test raises a priority value change and asserts it flows into the emitted/snapshot record's `Priority`, so `CoerceInt` and the priority assignment are untested in the state-machine layer. + +**Recommendation:** Add (a) `AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch` — apply `.acked=true` with no prior active raise, assert `Apply` returns empty, then raise active and clear and assert the clear emits `UnackRtn` (proving the stale ack did not latch); and (b) `PriorityChange_FlowsIntoEmittedRecord` — apply a priority value then an active raise and assert the emitted record's `Priority` equals the supplied value (and a `CoerceInt` string/garbage case falls back). + +**Resolution:** 2026-06-15 — Added both tests to `SubtagAlarmStateMachineTests`. `AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch` applies `.acked=true` with no preceding active raise (asserts `Apply` returns empty), then drives a fresh raise→clear episode and asserts the clear emits `UnackRtn` — proving the stale inactive ack did not latch `AckedDuringEpisode`. `PriorityChange_FlowsIntoEmittedRecord` (the target now includes a `PrioritySubtag`) applies an `int` priority `750` (asserts the priority change emits nothing), raises active and asserts the emitted record's `Priority == 750` (exercising `CoerceInt`'s `int` path and the priority assignment), then applies a non-numeric `"not-a-number"` priority and asserts the snapshot `Priority` is still `750` (the `CoerceInt` string fallback keeps the prior value, not zero). diff --git a/code-reviews/Worker/findings.md b/code-reviews/Worker/findings.md index 1a6273c..d7e4d73 100644 --- a/code-reviews/Worker/findings.md +++ b/code-reviews/Worker/findings.md @@ -4,11 +4,38 @@ |---|---| | Module | `src/ZB.MOM.WW.MxGateway.Worker` | | Reviewer | Claude Code | -| Review date | 2026-05-24 | -| Commit reviewed | `42b0037` | +| Review date | 2026-06-15 | +| Commit reviewed | `410acc9` | | Status | Re-reviewed | | Open findings | 0 | +## 2026-06-15 re-review (commit `410acc9`) + +Re-review of the `42b0037..410acc9` diff — the alarm-provider subtag-fallback +feature (`git diff 42b0037..410acc9 -- src/ZB.MOM.WW.MxGateway.Worker/`). New +substantive code: `SubtagAlarmConsumer`, `SubtagAlarmStateMachine`, +`FailoverAlarmConsumer`, `LmxSubtagAlarmSource`, `SyntheticAlarmGuid`, +`AlarmProviderModeChange`, `FailoverSettings`, `ISubtagAlarmSource` / +`SubtagValueChange`, plus the degraded/`source_provider` propagation in +`AlarmDispatcher` / `MxAccessAlarmEventSink` / `MxAccessEventMapper`, the +`ForcedMode`/watch-list routing and STA-COM-factory threading in +`AlarmCommandHandler` / `MxAccessStaSession`, and the `SubscribeAlarmsCommand` +re-plumb in `MxAccessCommandExecutor`. Three new findings: **Worker-026** (High), +**Worker-027** (Medium), **Worker-028** (Low). Worker-001..025 remain closed. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found. Subtag synthesis (`SubtagAlarmStateMachine` raise/ack/clear, `AckedDuringEpisode` latch, segment-boundary name derivation), exact-match ack resolution (`ResolveTargetByName` avoids the prefix false-positive), and `MapTransition`'s `Unspecified→*Alm` raise path are all sound. | +| 2 | mxaccessgw conventions | No issues found. The synthesis is worker-side and every degraded record/event carries `degraded=true` + `source_provider=SUBTAG`, satisfying the explicit opt-in non-parity exception to the "never synthesize events" rule. The gateway never instantiates COM. net48 constraint respected — `AlarmProviderModeChange`/`FailoverSettings` are plain classes with get-only ctor-assigned props (no init/positional records); no `WriteRecord`-style init usage introduced. | +| 3 | Concurrency & thread safety | Issue found: Worker-026 (an exception in the failover switch path — `SwitchToStandby`'s priming snapshot or either switch's `ProviderModeChanged` handler — escapes the state machine after `active` has already flipped, killing the STA alarm-poll loop with no mode-changed event). STA affinity itself is sound: `LmxSubtagAlarmSource` owns its own apartment-bound `LMXProxyServerClass`, all consumer calls are STA-confined via `AlarmCommandHandler`'s affinity guard, and `Dispose` UnAdvises before tearing handles down so a late pump callback cannot re-enter. | +| 4 | Error handling & resilience | Issue found: Worker-027 (`SyntheticAlarmGuid` uses `MD5.Create()`, which throws on a net48 FIPS-policy host — breaking every subtag transition stamp and snapshot, and feeding Worker-026's poll-loop-kill path). `FailoverSettings` clamps tunables to safe minimums; `LmxSubtagAlarmSource` teardown is best-effort/idempotent. | +| 5 | Security | No issues found. No secret/credential logging on the alarm path; ack comments are operator-supplied alarm metadata, not secrets. Synthetic GUID is non-cryptographic by design and not a security control. | +| 6 | Performance & resource management | No issues found. `LmxSubtagAlarmSource` releases its COM object via `FinalReleaseComObject` and tracks advised-vs-added handles so `Dispose` only UnAdvises what it advised. The standby is armed once and gated-by-active rather than churning subscribe/unsubscribe per switch. | +| 7 | Design-document adherence | No issues found. Implementation matches `docs/plans/2026-06-13-alarm-subtag-fallback-design.md` (auto-failover/failback, ack-comment-write ack, worker-side synthesis, additive proto fields). The probe re-polls the still-subscribed primary (single-subscribe constraint) as the design's "Superseded" notes describe. | +| 8 | Code organization & conventions | Issue found: Worker-028 (the dup-subtag-address guard in `SubtagAlarmStateMachine.Bind` does not cover duplicate `AlarmFullReference` entries, which silently overwrite in `targetsByReference`/`_statesByReference`). One-public-type-per-file is otherwise respected for the new files. | +| 9 | Testing coverage | No standalone finding. New unit suites exist for each major component (`SubtagAlarmConsumerTests`, `SubtagAlarmStateMachineTests`, `FailoverAlarmConsumerTests`, `LmxSubtagAlarmSourceTests`, `SyntheticAlarmGuidTests`), matching the design's test matrix. The switch-path exception fragility (Worker-026) and the dup-reference case (Worker-028) are untested edge cases noted in those findings. | +| 10 | Documentation & comments | No issues found. The new types carry accurate XML docs; the net48-constraint rationale is documented inline on `FailoverSettings`/`AlarmProviderModeChange`; the "why PollOnce only, no re-Subscribe" and probe-throttle behaviour are documented on `FailoverAlarmConsumer.ProbeOnce`. | + ## 2026-05-24 re-review (commit `42b0037`) **Re-review: no new findings.** `git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker` returns empty — the Worker module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the `d692232` pass. The header is bumped to track the latest reviewed commit; Worker-001..025 remain closed. @@ -464,3 +491,50 @@ _runtimeSession = _runtimeSessionFactory() Match the pattern `AlarmCommandHandler.Subscribe` already uses for `consumerFactory()` (`AlarmCommandHandler.cs:76-77`). **Resolution:** 2026-05-20 — `WorkerPipeSession.RunAsync` now uses `_runtimeSession = _runtimeSessionFactory() ?? throw new InvalidOperationException("Worker runtime session factory returned null.");`, matching the pattern `AlarmCommandHandler.Subscribe` uses for its `consumerFactory()`. A null factory return now produces a clear diagnostic exception at the call site instead of NRE-ing on the next dereference (and the `finally` block's `_runtimeSession?.Dispose()` silently no-oping on a half-initialized session). Regression test `WorkerPipeSessionTests.RunAsync_WhenRuntimeSessionFactoryReturnsNull_ThrowsDiagnosticException` drives `RunAsync` with `() => null!` and asserts the diagnostic `InvalidOperationException` is thrown with the expected message. + +### Worker-026 + +| Field | Value | +|---|---| +| Severity | High | +| Category | Concurrency & thread safety | +| Location | `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs:289-338`, `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessStaSession.cs:307-320` | +| Status | Resolved | + +**Description:** `FailoverAlarmConsumer.SwitchToStandby` flips `active = Active.Standby` / `mode = Subtag` first, then calls `_ = standby.SnapshotActiveAlarms();` (the priming side-effect), and only then calls `RaiseModeChanged(...)`. If `standby.SnapshotActiveAlarms()` throws, the exception escapes `SwitchToStandby`, escapes the `catch` in `RunPrimary`, and escapes `FailoverAlarmConsumer.PollOnce`/`Subscribe`. The `SubtagAlarmConsumer.SnapshotActiveAlarms` path is not exception-free: it calls `StampSynthetic` → `SyntheticAlarmGuid.ForReference` (which throws on a FIPS host — see Worker-027) and walks live state. The same exposure exists for `RaiseModeChanged` itself: the attached `AlarmCommandHandler.OnProviderModeChanged` handler runs synchronously and calls `eventQueue.Enqueue(...)`, which throws `MxAccessEventQueueOverflowException` at capacity; that also propagates out of both `SwitchToStandby` and `SwitchToPrimary`. + +When this happens the consumer has **already** transitioned `active`/`mode` to Standby (or Primary) but the `ProviderModeChanged` event is never emitted — so the gateway never learns the feed went degraded. Worse, because the failover calls run on the worker's STA inside `RunAlarmPollLoopAsync`, the escaping exception lands in that loop's trailing `catch (Exception)` arm (`MxAccessStaSession.cs:307-320`), which records a single fault and **permanently stops the alarm poll loop**. The standby is then never pumped or probed again — i.e. a transient primary COM fault that should have produced a clean degraded-mode handoff instead produces a total, undetected alarm outage for the session, defeating the entire purpose of the fallback feature. There is no safe operator workaround short of restarting the session. + +**Recommendation:** Make the switch atomic and exception-isolated: raise `ProviderModeChanged` (and perform the priming snapshot) inside their own `try`/`catch` so a snapshot or handler failure cannot abort the switch or unwind into the poll loop. Order the state flip so the mode-changed notification is guaranteed to fire even if priming fails (e.g. flip state, raise mode-changed in a guarded block, then attempt the priming snapshot in a separate guarded block whose failure is logged/faulted but non-fatal). Add a regression test where the standby's `SnapshotActiveAlarms` throws on the first call after failover, asserting (a) `ProviderModeChanged` still fires and (b) `PollOnce` does not rethrow. + +**Resolution:** 2026-06-15 — Reordered and exception-isolated the failover switch in `FailoverAlarmConsumer`. `SwitchToStandby` now flips `active`/`mode`, then raises `ProviderModeChanged` FIRST (so the gateway always learns the feed went degraded), then primes the standby snapshot via a new `TryPrimeStandbySnapshot()` whose failure is swallowed (`catch when ex is not OutOfMemoryException`) — a priming failure can no longer abort the switch or unwind into the poll loop. `RaiseModeChanged` itself now wraps `ProviderModeChanged?.Invoke` in a `try`/`catch (when ex is not OutOfMemoryException)` so a subscriber handler exception (e.g. `AlarmCommandHandler.OnProviderModeChanged`'s `eventQueue.Enqueue` overflowing) cannot escape `SwitchToStandby`/`SwitchToPrimary` into `RunAlarmPollLoopAsync`'s trailing catch and permanently stop alarm polling. `OutOfMemoryException` is deliberately allowed to propagate. The MXAccessStaSession poll-loop arm is unchanged — the fix prevents the escape rather than catching it there. Regression tests in `FailoverAlarmConsumerTests`: `Failover_WhenStandbyPrimingSnapshotThrows_StillRaisesModeChangeAndDoesNotRethrow` (standby `SnapshotActiveAlarms` throws on the priming call → `ProviderModeChanged` still fires, `Mode` is Subtag, `Subscribe`/`PollOnce` do not rethrow) and `Failover_WhenModeChangedHandlerThrows_SwitchStillTakesEffectAndDoesNotRethrow` (a throwing `ProviderModeChanged` subscriber → switch still takes effect, no rethrow). + +### Worker-027 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Error handling & resilience | +| Location | `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs:38-40` | +| Status | Resolved | + +**Description:** `SyntheticAlarmGuid.ForReference` derives the deterministic alarm GUID via `using MD5 md5 = MD5.Create();`. The worker targets .NET Framework 4.8, where `MD5.Create()` returns `MD5CryptoServiceProvider`. When the host has the Windows FIPS-compliance policy enabled (`Enabled=1` under `HKLM\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy`), the non-validated `MD5CryptoServiceProvider` constructor throws `InvalidOperationException` ("This implementation is not part of the Windows Platform FIPS validated cryptographic algorithms."). `SyntheticAlarmGuid.ForReference` is on the hot path of the subtag fallback: `SubtagAlarmConsumer.StampSynthetic` calls it for **every** synthesized transition and **every** snapshot record. On a FIPS host the subtag fallback therefore throws on first use; combined with Worker-026 that exception kills the STA alarm-poll loop, so the fallback is not merely degraded but completely non-functional exactly when it is needed (after the primary alarmmgr provider has failed). The comment already notes MD5 is "never for security" — the issue is availability under FIPS policy, not cryptographic strength. The regulated deployment hosts (Zimmer) are a plausible FIPS environment. + +**Recommendation:** Replace `MD5.Create()` with a FIPS-agnostic non-cryptographic 128-bit hash that does not route through the crypto FIPS gate — e.g. compute the 16 GUID bytes from a stable hash that does not use `System.Security.Cryptography` (a fixed FNV-1a / xxHash-style derivation over the UTF-8 bytes), or use `SHA256` truncated to 16 bytes via the managed `SHA256Managed`/`IncrementalHash` only if confirmed FIPS-safe on net48 (it is not guaranteed — prefer the non-crypto route). The mapping only needs determinism and collision resistance for distinct references, not cryptographic properties. Add a test that exercises `ForReference` without depending on a crypto provider. + +**Resolution:** 2026-06-15 — Replaced the `MD5.Create()` derivation in `SyntheticAlarmGuid.ForReference` with a pure-managed FNV-1a hash: two independent 64-bit FNV-1a passes over the UTF-8 bytes (the high pass mixes the byte index into its accumulator to decorrelate the halves) fill the low/high 64 bits of the 128-bit GUID, and the input length is folded in so the empty string is non-degenerate (never `Guid.Empty`). The `using System.Security.Cryptography;` import is gone, so no FIPS-gated `MD5CryptoServiceProvider` is ever constructed — the subtag fallback no longer throws on a FIPS-policy host. The derivation stays deterministic and distinct-per-reference. The existing `SyntheticAlarmGuidTests` (`SameReference_SameGuid`, `DifferentReference_DifferentGuid`, `Reference_ProducesNonEmptyGuid`) pin only those properties — not a specific GUID literal — so they continue to pass unchanged; no test needed a value update. Added regression tests `SyntheticAlarmGuidTests.EmptyReference_ProducesNonEmptyGuid` (length-fold guard against a degenerate all-zero result) and `ForReference_UnderFipsEnforcement_DoesNotThrowAndStaysDeterministic` (sets the managed `UseLegacyFipsThrow` AppContext switch and asserts the derivation still succeeds deterministically; a regression reintroducing a FIPS-gated provider would throw here). + +### Worker-028 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs:43-52`, `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmConsumer.cs:70-75` | +| Status | Resolved | + +**Description:** `SubtagAlarmStateMachine.Bind` throws `ArgumentException` on a duplicate subtag **item address** (the documented dup-address guard), but neither the state machine nor `SubtagAlarmConsumer` guards against a duplicate `AlarmFullReference` in the watch list. When two `AlarmSubtagTarget` entries share an `AlarmFullReference` but use different subtag addresses, `_statesByReference[target.AlarmFullReference] = state` and `targetsByReference[reference] = target` each silently overwrite the earlier entry, while the earlier target's subtag addresses are still bound to an orphaned `AlarmState`. The orphaned state is mutated by incoming value changes but is invisible to `SnapshotActive` (which iterates only the surviving `_statesByReference.Values`) and to ack resolution (which uses the surviving `targetsByReference`). The result is silently inconsistent synthesized state for that reference. This is a watch-list configuration error (the gateway resolves the watch list), so impact is limited, but the asymmetry — addresses are guarded, references are not — is surprising and silent. + +**Recommendation:** Add a duplicate-`AlarmFullReference` guard symmetric with the dup-address guard: throw a descriptive `ArgumentException` from the `SubtagAlarmStateMachine` (or `SubtagAlarmConsumer`) constructor when two watch-list entries share a reference, so a misconfigured watch list fails fast at subscribe time rather than producing silently inconsistent state. Cover it with a unit test. + +**Resolution:** 2026-06-15 — Added a duplicate-`AlarmFullReference` guard in the `SubtagAlarmStateMachine` constructor symmetric with the existing dup-address guard in `Bind`: before adding each target's `_statesByReference` entry it checks `ContainsKey` (the dictionary is `OrdinalIgnoreCase`, matching the consumer's `targetsByReference` lookup) and throws a descriptive `ArgumentException` ("Duplicate alarm full reference '{reference}' is bound to more than one alarm target."). Because `SubtagAlarmConsumer` constructs the state machine before populating its own `targetsByReference`, this guard fires before the consumer's silent overwrite too, covering both dictionaries from one canonical check. Regression test `SubtagAlarmStateMachineTests.DuplicateAlarmFullReference_Throws` (two targets sharing a reference but using distinct active subtags → `ArgumentException`). diff --git a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs index 07060fb..575bcfa 100644 --- a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs +++ b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs @@ -27,6 +27,15 @@ public sealed class FailoverAlarmConsumerTests public event EventHandler? AlarmTransitionEmitted; public bool ThrowOnPoll = true; + + /// + /// When set, throws + /// instead of a + /// , to + /// exercise the OOM-safe exception filter (Worker.Tests-032). + /// + public bool ThrowOutOfMemoryOnPoll; + public int Polls; /// @@ -48,6 +57,11 @@ public sealed class FailoverAlarmConsumerTests public void PollOnce() { Polls++; + if (ThrowOutOfMemoryOnPoll) + { + throw new OutOfMemoryException("simulated allocation failure"); + } + if (ThrowOnPoll) { throw new System.Runtime.InteropServices.COMException("boom", unchecked((int)0x80004005)); @@ -75,6 +89,15 @@ public sealed class FailoverAlarmConsumerTests public bool Subscribed; + /// + /// When set, throws — modeling a + /// priming-snapshot failure during failover (Worker-026). + /// + public bool ThrowOnSnapshot; + + /// Number of calls. + public int SnapshotCalls; + public void Subscribe(string s) => Subscribed = true; public void PollOnce() { } @@ -83,7 +106,16 @@ public sealed class FailoverAlarmConsumerTests public int AcknowledgeByName(string n, string p, string gr, string c, string a, string b, string d, string e) => 22; - public IReadOnlyList SnapshotActiveAlarms() => Array.Empty(); + public IReadOnlyList SnapshotActiveAlarms() + { + SnapshotCalls++; + if (ThrowOnSnapshot) + { + throw new InvalidOperationException("priming snapshot failed"); + } + + return Array.Empty(); + } public void Dispose() { } @@ -291,4 +323,151 @@ public sealed class FailoverAlarmConsumerTests sut.ProbeOnce(); // clean 3 → failback Assert.Equal(AlarmProviderMode.Alarmmgr, sut.Mode); } + + /// + /// Worker-026 regression: when the standby's priming + /// SnapshotActiveAlarms throws during failover, the switch must + /// still (a) fire ProviderModeChanged so the gateway learns the + /// feed went degraded, (b) leave + /// in Subtag, and (c) not rethrow out of PollOnce (which on the + /// real STA would land in the poll loop's trailing catch and permanently + /// stop alarm delivery). + /// + [Fact] + public void Failover_WhenStandbyPrimingSnapshotThrows_StillRaisesModeChangeAndDoesNotRethrow() + { + FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = true }; + StubStandby standby = new StubStandby { ThrowOnSnapshot = true }; + FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 0, stableProbes: 1); + using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings); + + List changes = new List(); + sut.ProviderModeChanged += (_, e) => changes.Add(e); + + // threshold=1 → the Subscribe failure triggers the switch, which primes + // the standby snapshot (throwing). The exception must be contained. + Exception? escaped = Record.Exception(() => sut.Subscribe(@"\\HOST\Galaxy!Area")); + + Assert.Null(escaped); + Assert.Single(changes); + Assert.Equal(AlarmProviderMode.Subtag, changes[0].Mode); + Assert.Equal(AlarmProviderMode.Subtag, sut.Mode); + Assert.True(standby.SnapshotCalls >= 1); // priming was attempted + + // A subsequent degraded PollOnce (standby.PollOnce + ProbeOnce) must also + // not rethrow the snapshot failure. + Exception? pollEscaped = Record.Exception(() => sut.PollOnce()); + Assert.Null(pollEscaped); + } + + /// + /// Worker-026 regression: when a ProviderModeChanged subscriber's + /// handler throws (modeling the AlarmCommandHandler's event-queue enqueue + /// overflowing at capacity), the switch must still take effect and the + /// exception must not escape the switch path into the poll loop. + /// + [Fact] + public void Failover_WhenModeChangedHandlerThrows_SwitchStillTakesEffectAndDoesNotRethrow() + { + FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = true }; + StubStandby standby = new StubStandby(); + FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 0, stableProbes: 1); + using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings); + + int handlerInvocations = 0; + sut.ProviderModeChanged += (_, _) => + { + handlerInvocations++; + throw new InvalidOperationException("subscriber handler blew up"); + }; + + Exception? escaped = Record.Exception(() => sut.Subscribe(@"\\HOST\Galaxy!Area")); + + Assert.Null(escaped); + Assert.Equal(1, handlerInvocations); // the event still fired + Assert.Equal(AlarmProviderMode.Subtag, sut.Mode); // the switch still took effect + } + + /// + /// Worker.Tests-031 regression: with a non-zero + /// , two back-to-back + /// ProbeOnce calls must throttle — the second falls inside the + /// interval and must NOT re-poll the primary. Two consecutive calls + /// reliably fall inside any interval of one second or more, so this needs + /// no injected clock. + /// + [Fact] + public void ProbeOnce_WithNonZeroInterval_ThrottlesSecondProbeWithinInterval() + { + FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = true }; + StubStandby standby = new StubStandby(); + // stableProbes high enough that a single clean probe cannot fail back, + // so Mode stays Subtag and ProbeOnce remains the throttled path. + FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 3600, stableProbes: 5); + using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings); + + sut.Subscribe(@"\\HOST\Galaxy!Area"); // threshold=1 → switch to Subtag + Assert.Equal(AlarmProviderMode.Subtag, sut.Mode); + + primary.ThrowOnPoll = false; // primary healthy so a probe would poll cleanly + + sut.ProbeOnce(); // first probe runs: re-polls the primary + int pollsAfterFirstProbe = primary.Polls; + Assert.Equal(1, pollsAfterFirstProbe); + + sut.ProbeOnce(); // within the 3600s interval → throttled, must NOT re-poll + + Assert.Equal(pollsAfterFirstProbe, primary.Polls); + Assert.Equal(AlarmProviderMode.Subtag, sut.Mode); + } + + /// + /// Worker.Tests-032 regression: RunPrimary's + /// when (ex is not OutOfMemoryException) filter must let an + /// propagate rather than swallowing it + /// and counting it toward the failover threshold. No mode change must + /// fire — a fatal allocation failure is not a clean degraded handoff. + /// + [Fact] + public void RunPrimary_WhenPrimaryThrowsOutOfMemory_PropagatesAndDoesNotFailOver() + { + FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = false, ThrowOutOfMemoryOnPoll = true }; + StubStandby standby = new StubStandby(); + FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 0, stableProbes: 1); + using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings); + + bool modeChanged = false; + sut.ProviderModeChanged += (_, _) => modeChanged = true; + + sut.Subscribe(@"\\HOST\Galaxy!Area"); // Subscribe path does not poll; no throw here + + Assert.Throws(() => sut.PollOnce()); + Assert.False(modeChanged); + Assert.Equal(AlarmProviderMode.Alarmmgr, sut.Mode); + } + + /// + /// Worker.Tests-032 regression: clamps + /// sub-1 threshold and stableProbes (and sub-0 + /// probeIntervalSeconds) to their safe minimums so a misconfigured + /// bind cannot change failover semantics. + /// + [Theory] + [InlineData(0, 0, 0, 1, 0, 1)] + [InlineData(-5, -5, -5, 1, 0, 1)] + [InlineData(3, 7, 2, 3, 7, 2)] + public void FailoverSettings_ClampsSubMinimumValues( + int threshold, + int probeInterval, + int stableProbes, + int expectedThreshold, + int expectedProbeInterval, + int expectedStableProbes) + { + FailoverSettings settings = new FailoverSettings(threshold, probeInterval, stableProbes); + + Assert.Equal(expectedThreshold, settings.Threshold); + Assert.Equal(expectedProbeInterval, settings.ProbeIntervalSeconds); + Assert.Equal(expectedStableProbes, settings.StableProbes); + } } diff --git a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs index 15b8493..6f1ee53 100644 --- a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs +++ b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs @@ -109,6 +109,31 @@ public sealed class SubtagAlarmStateMachineTests Assert.Throws(() => new SubtagAlarmStateMachine(new[] { first, second })); } + /// + /// Worker-028 regression: two watch-list entries sharing an + /// (but using distinct + /// subtag addresses) must throw at construction, symmetric with the + /// duplicate-address guard, rather than silently overwriting the earlier + /// reference's state and orphaning its bound addresses. + /// + [Fact] + public void DuplicateAlarmFullReference_Throws() + { + var first = new AlarmSubtagTarget + { + AlarmFullReference = "Galaxy!Area.Tank01.Level.HiHi", + SourceObjectReference = "Tank01", + ActiveSubtag = "Tank01.Level.HiHi.active", + }; + var second = new AlarmSubtagTarget + { + AlarmFullReference = "Galaxy!Area.Tank01.Level.HiHi", + SourceObjectReference = "Tank01", + ActiveSubtag = "Other.active", + }; + Assert.Throws(() => new SubtagAlarmStateMachine(new[] { first, second })); + } + [Fact] public void AckedTrueWhileActive_EmitsAck() { @@ -162,4 +187,64 @@ public sealed class SubtagAlarmStateMachineTests var events = sm.Apply("Some.Other.Tag.active", true, DateTime.UtcNow); Assert.Empty(events); } + + /// + /// Worker.Tests-033 regression: an ack arriving while the alarm is NOT + /// active must emit nothing and must NOT latch + /// AckedDuringEpisode — otherwise a stale ack from a prior episode + /// would mis-latch the next raise into a spurious ACK_RTN on clear. The + /// subsequent raise/clear must therefore still emit UNACK_RTN. + /// + [Fact] + public void AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch() + { + var sm = new SubtagAlarmStateMachine(new[] { Target() }); + var ts = new DateTime(2026, 6, 13, 9, 0, 0, DateTimeKind.Utc); + + // Ack with no preceding active raise: must be a no-op. + var ackEvents = sm.Apply("Tank01.Level.HiHi.acked", true, ts); + Assert.Empty(ackEvents); + + // A fresh episode: raise then clear. Because the earlier ack must not + // have latched AckedDuringEpisode, the clear must be UNACK_RTN. + sm.Apply("Tank01.Level.HiHi.active", true, ts.AddSeconds(5)); + var clearEvents = sm.Apply("Tank01.Level.HiHi.active", false, ts.AddSeconds(10)); + var clear = Assert.Single(clearEvents); + Assert.Equal(MxAlarmStateKind.UnackRtn, clear.Record.State); + } + + /// + /// Worker.Tests-033 regression: a priority-subtag value change must flow + /// through CoerceInt into the emitted record's + /// . A non-numeric value must + /// leave the prior priority unchanged (the CoerceInt fallback path). + /// + [Fact] + public void PriorityChange_FlowsIntoEmittedRecord() + { + var target = new AlarmSubtagTarget + { + AlarmFullReference = "Galaxy!Area.Tank01.Level.HiHi", + SourceObjectReference = "Tank01", + ActiveSubtag = "Tank01.Level.HiHi.active", + AckedSubtag = "Tank01.Level.HiHi.acked", + PrioritySubtag = "Tank01.Level.HiHi.priority", + }; + var sm = new SubtagAlarmStateMachine(new[] { target }); + var ts = new DateTime(2026, 6, 13, 9, 0, 0, DateTimeKind.Utc); + + // A priority change alone emits nothing but records the priority. + var priorityEvents = sm.Apply("Tank01.Level.HiHi.priority", 750, ts); + Assert.Empty(priorityEvents); + + // Raise: the emitted record carries the recorded priority. + var raiseEvents = sm.Apply("Tank01.Level.HiHi.active", true, ts.AddSeconds(1)); + var raise = Assert.Single(raiseEvents); + Assert.Equal(750, raise.Record.Priority); + + // A non-numeric priority must fall back to the existing value, not zero. + sm.Apply("Tank01.Level.HiHi.priority", "not-a-number", ts.AddSeconds(2)); + var snap = Assert.Single(sm.SnapshotActive()); + Assert.Equal(750, snap.Priority); + } } diff --git a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SyntheticAlarmGuidTests.cs b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SyntheticAlarmGuidTests.cs index 6877bf3..f0a1021 100644 --- a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SyntheticAlarmGuidTests.cs +++ b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SyntheticAlarmGuidTests.cs @@ -24,4 +24,43 @@ public sealed class SyntheticAlarmGuidTests [Fact] public void Reference_ProducesNonEmptyGuid() => Assert.NotEqual(Guid.Empty, SyntheticAlarmGuid.ForReference("A.B.C")); + + /// + /// Verifies the empty string still derives a non-empty GUID. The length + /// fold in the derivation prevents a degenerate all-zero (Guid.Empty) + /// result, which would collide with the unset-record default downstream. + /// + [Fact] + public void EmptyReference_ProducesNonEmptyGuid() => + Assert.NotEqual(Guid.Empty, SyntheticAlarmGuid.ForReference(string.Empty)); + + /// + /// Worker-027 regression: + /// must derive its GUID without routing through + /// , because on net48 + /// MD5.Create() throws under the Windows FIPS-compliance policy. + /// This test enables the per-AppContext FIPS-enforcement switch (which the + /// managed crypto factories honour) and asserts the derivation still + /// succeeds deterministically — a regression that reintroduced a FIPS-gated + /// provider would throw here instead of returning a stable GUID. + /// + [Fact] + public void ForReference_UnderFipsEnforcement_DoesNotThrowAndStaysDeterministic() + { + const string switchName = "Switch.System.Security.Cryptography.UseLegacyFipsThrow"; + bool original = AppContext.TryGetSwitch(switchName, out bool value) && value; + AppContext.SetSwitch(switchName, true); + try + { + Guid first = SyntheticAlarmGuid.ForReference("Galaxy!Area.Tank01.Level.HiHi"); + Guid second = SyntheticAlarmGuid.ForReference("Galaxy!Area.Tank01.Level.HiHi"); + + Assert.NotEqual(Guid.Empty, first); + Assert.Equal(first, second); + } + finally + { + AppContext.SetSwitch(switchName, original); + } + } } diff --git a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs index 079c8b9..9c12ece 100644 --- a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs +++ b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs @@ -313,12 +313,20 @@ public sealed class FailoverAlarmConsumer : IMxAccessAlarmConsumer consecutiveFailures = 0; cleanProbes = 0; + // Emit the mode-changed notification FIRST and in a guarded block, so + // the gateway always learns the feed went degraded even if the priming + // snapshot below throws. A handler exception here must never escape the + // switch — escaping would (a) leave `active` flipped with no + // notification and (b) unwind into RunAlarmPollLoopAsync's trailing + // catch, which permanently stops alarm polling (Worker-026). + RaiseModeChanged(AlarmProviderMode.Subtag, reason, hresult); + // Warm the standby snapshot for the gateway hand-off. The gateway // reconciles state from this snapshot, so the return value is not - // consumed here — the call exists for its priming side effect. - _ = standby.SnapshotActiveAlarms(); - - RaiseModeChanged(AlarmProviderMode.Subtag, reason, hresult); + // consumed here — the call exists for its priming side effect. A + // failure to prime is non-fatal: the switch has already completed and + // been announced, and the standby's live transitions will still flow. + TryPrimeStandbySnapshot(); } private void SwitchToPrimary(string reason, int hresult) @@ -327,14 +335,49 @@ public sealed class FailoverAlarmConsumer : IMxAccessAlarmConsumer mode = AlarmProviderMode.Alarmmgr; consecutiveFailures = 0; cleanProbes = 0; + + // Guarded so a ProviderModeChanged handler exception cannot escape into + // the STA poll loop and kill alarm delivery (Worker-026). RaiseModeChanged(AlarmProviderMode.Alarmmgr, reason, hresult); } + /// + /// Primes the standby snapshot for the gateway hand-off, swallowing any + /// failure. The switch has already completed and the mode change has + /// already been announced before this runs, so a priming failure must + /// not abort the switch or unwind into the poll loop. + /// + private void TryPrimeStandbySnapshot() + { + try + { + _ = standby.SnapshotActiveAlarms(); + } + catch (Exception ex) when (ex is not OutOfMemoryException) + { + // Non-fatal: the standby is active and its live transitions still + // flow; the gateway will reconcile from subsequent records. Do not + // let a transient snapshot failure escape and stop the poll loop. + } + } + private void RaiseModeChanged(AlarmProviderMode newMode, string reason, int hresult) { - ProviderModeChanged?.Invoke( - this, - new AlarmProviderModeChange(newMode, reason, hresult, DateTime.UtcNow)); + try + { + ProviderModeChanged?.Invoke( + this, + new AlarmProviderModeChange(newMode, reason, hresult, DateTime.UtcNow)); + } + catch (Exception ex) when (ex is not OutOfMemoryException) + { + // A subscriber's OnProviderModeChanged handler threw (e.g. the + // AlarmCommandHandler's eventQueue.Enqueue hitting capacity). The + // switch itself has already taken effect; swallow so the failure + // cannot unwind into RunAlarmPollLoopAsync and permanently stop + // alarm polling (Worker-026). The event-queue overflow it most + // likely signals is already surfaced as a fault on the IPC path. + } } private void OnChildTransition(object? sender, MxAlarmTransitionEvent e) diff --git a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs index 80a0aa6..3cf7261 100644 --- a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs +++ b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs @@ -42,8 +42,22 @@ public sealed class SubtagAlarmStateMachine foreach (AlarmSubtagTarget target in targets) { + // Guard duplicate references symmetrically with the dup-address guard + // in Bind: two watch-list entries that share an AlarmFullReference but + // differ in subtag addresses would otherwise silently overwrite the + // earlier _statesByReference entry while its addresses stay bound to an + // orphaned (and therefore invisible) AlarmState, producing silently + // inconsistent synthesized state. Fail fast at subscribe time instead. + string reference = target.AlarmFullReference ?? string.Empty; + if (_statesByReference.ContainsKey(reference)) + { + throw new ArgumentException( + $"Duplicate alarm full reference '{reference}' is bound to more than one alarm target.", + nameof(targets)); + } + var state = new AlarmState(target); - _statesByReference[target.AlarmFullReference] = state; + _statesByReference[reference] = state; Bind(target.ActiveSubtag, state, SubtagRole.Active); Bind(target.AckedSubtag, state, SubtagRole.Acked); diff --git a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs index 18d67e0..df47661 100644 --- a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs +++ b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs @@ -1,5 +1,4 @@ using System; -using System.Security.Cryptography; using System.Text; namespace ZB.MOM.WW.MxGateway.Worker.MxAccess; @@ -11,8 +10,29 @@ namespace ZB.MOM.WW.MxGateway.Worker.MxAccess; /// repeated transitions for the same alarm reference correlate downstream /// (acknowledge, snapshot, OPC UA mapping) without an alarmmgr-supplied GUID. /// +/// +/// The 128-bit value is computed with a fixed FNV-1a hash over the UTF-8 +/// bytes of the reference, deliberately not via +/// System.Security.Cryptography. On .NET Framework 4.8 +/// MD5.Create() returns the non-validated +/// MD5CryptoServiceProvider, whose constructor throws under the +/// Windows FIPS-compliance policy ("not part of the Windows Platform FIPS +/// validated cryptographic algorithms"). Because this derivation needs only +/// determinism and distinctness — never cryptographic strength — a plain +/// non-crypto hash avoids the FIPS gate entirely, so the subtag fallback +/// keeps working on regulated (FIPS-enabled) hosts exactly when it is needed. +/// internal static class SyntheticAlarmGuid { + // 64-bit FNV-1a constants (RFC-style; widely used reference values). + private const ulong FnvOffsetBasis = 14695981039346656037UL; + private const ulong FnvPrime = 1099511628211UL; + + // A second independent seed for the high 8 bytes so the full 128-bit value + // is well-distributed across distinct references rather than two correlated + // halves of the same single-pass hash. + private const ulong FnvSecondSeed = 1469598103934665603UL; + /// /// Produces a stable for the given alarm reference. /// The same reference always maps to the same GUID; distinct references @@ -32,11 +52,39 @@ internal static class SyntheticAlarmGuid byte[] bytes = Encoding.UTF8.GetBytes(reference); - // MD5 is used purely for a stable, non-cryptographic identity mapping - // (reference -> 16-byte GUID), never for security. Its 128-bit output - // fits a GUID exactly, which is why it is preferred here. - using MD5 md5 = MD5.Create(); - byte[] hash = md5.ComputeHash(bytes); - return new Guid(hash); + // Two independent FNV-1a passes fill the low and high 64 bits of the + // 128-bit value. The second pass mixes the running length into its seed + // so single-character differences and re-orderings still diverge in both + // halves, avoiding correlated-half collisions a single pass would risk. + ulong low = FnvOffsetBasis; + ulong high = FnvSecondSeed; + for (int i = 0; i < bytes.Length; i++) + { + byte b = bytes[i]; + + low ^= b; + low *= FnvPrime; + + high ^= unchecked(b + (ulong)i); + high *= FnvPrime; + } + + // Fold the length in so the empty string and other short inputs are not + // degenerate (an all-zero / Guid.Empty result is undesirable downstream). + low ^= (ulong)bytes.Length; + low *= FnvPrime; + + byte[] guidBytes = new byte[16]; + WriteUInt64(guidBytes, 0, low); + WriteUInt64(guidBytes, 8, high); + return new Guid(guidBytes); + } + + private static void WriteUInt64(byte[] buffer, int offset, ulong value) + { + for (int i = 0; i < 8; i++) + { + buffer[offset + i] = (byte)(value >> (i * 8)); + } } }