diff --git a/code-reviews/Worker.Tests/findings.md b/code-reviews/Worker.Tests/findings.md
index fd06549..3b4ecb0 100644
--- a/code-reviews/Worker.Tests/findings.md
+++ b/code-reviews/Worker.Tests/findings.md
@@ -4,11 +4,48 @@
|---|---|
| Module | `src/ZB.MOM.WW.MxGateway.Worker.Tests` |
| Reviewer | Claude Code |
-| Review date | 2026-05-24 |
-| Commit reviewed | `42b0037` |
+| Review date | 2026-06-15 |
+| Commit reviewed | `410acc9` |
| Status | Re-reviewed |
| Open findings | 0 |
+## 2026-06-15 re-review (commit `410acc9`)
+
+Re-review of the alarm-fallback test additions in `git diff 42b0037..HEAD --
+src/ZB.MOM.WW.MxGateway.Worker.Tests/`. New unit suites land for the subtag
+fallback (`SubtagAlarmConsumerTests`, `SubtagAlarmStateMachineTests`,
+`SyntheticAlarmGuidTests`, `LmxSubtagAlarmSourceTests`) and the auto-failover
+composite (`FailoverAlarmConsumerTests`); the existing alarm suites are updated
+for the `SubscribeAlarmsCommand`-based handler signature, the
+`(eq, affinity, comFactory)` handler-factory delegate, and the new
+degraded/source-provider fields. Most of the change is genuinely new coverage
+plus a large volume of XML-doc additions on existing test doubles (benign).
+
+Findings: the failover state-machine transitions (failover at threshold,
+failback after stable probes, intermittent-failure reset, before/after-switch
+forwarding, ack delegation, `ProbeOnce`-never-re-Subscribes) are all covered;
+the acked latch (`OutOfOrderAckThenClear_StillEmitsAckRtn`), the dup-address
+guard (`DuplicateActiveSubtag_Throws`), and the exact-match-vs-substring ack
+resolution (`AcknowledgeByName_PrefixNameDoesNotFalseMatch`,
+`AcknowledgeByGuid_*`) are all pinned. Three coverage gaps remain
+(Worker.Tests-031/032/033), all in new alarm-fallback code paths. The two
+newest files (`SyntheticAlarmGuidTests`, `LmxSubtagAlarmSourceTests`) omit an
+explicit `using Xunit;` but compile via the `` global
+using in the csproj, so that is not a finding.
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness & logic bugs | No issues found — failover/state-machine/ack tests assert meaningful post-conditions (mode, emitted state, target subtag address) and do not pass for the wrong reason; the prefix-name and unknown-guid negative cases pin the exact-match contract. |
+| 2 | mxaccessgw conventions | No issues found — new test methods follow `Method_Scenario_Expectation`; STA-affinity respected (state machine / consumer driven synchronously through internal seams). |
+| 3 | Concurrency & thread safety | No issues found — new failover/subtag suites are single-threaded and event-driven; no wall-clock floors or fixed sleeps were introduced (the `MxAccessValueCacheTests` change only deletes the old Worker.Tests-020 comment block). |
+| 4 | Error handling & resilience | Issues found: Worker.Tests-032 — the `RunPrimary` `when (ex is not OutOfMemoryException)` filter (the OOM-safe catch path) and the `FailoverSettings` clamp branches are untested. |
+| 5 | Security | No issues found — no secrets/credentials; ack-operator identity fields are sentinels. |
+| 6 | Performance & resource management | No issues found — `IDisposable` test subjects use `using`; the `LmxSubtagAlarmSource` dispose-idempotency / unadvise-only-advised-handles teardown is regression-tested. |
+| 7 | Design-document adherence | No issues found — tests mirror the alarm-fallback plan (degraded flag, synthetic GUID, subtag-ack via ack-comment, single-subscribe primary). |
+| 8 | Code organization & conventions | No issues found — new suites live under `MxAccess/`; test doubles are per-file (acceptable for these narrow fakes). |
+| 9 | Testing coverage | Issues found: Worker.Tests-031 (`ProbeIntervalSeconds` throttle-active branch never exercised — every test uses `probeIntervalSeconds: 0`), Worker.Tests-033 (`SubtagAlarmStateMachine` ack-while-inactive and priority-subtag branches uncovered). |
+| 10 | Documentation & comments | No issues found — test XML docs match assertions; no misleading names observed. |
+
## 2026-05-24 re-review (commit `42b0037`)
**Re-review: no new findings.** `git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker.Tests` returns empty — the Worker.Tests module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the `d692232` pass. The header is bumped to track the latest reviewed commit; Worker.Tests-001..030 remain closed.
@@ -533,3 +570,48 @@ findings (Worker.Tests-001 through -030) are unaffected.
**Recommendation:** Either (a) reassign `CreateCancelEnvelope` to a sequence value `>` shutdown (or pass the sequence as a parameter, matching `CreateGatewayHelloEnvelope`'s parameter style), so the wire trace reads in ascending order; (b) add an XML-doc note on the cancel test stating that the worker has no inbound monotonicity check and the test ignores envelope sequence ordering; (c) parameterise all four helper methods so each test passes its desired sequence and the literal numbers stop carrying implicit meaning. Option (c) is the cleanest because `CreateGatewayHelloEnvelope` is already parameter-driven for nonce/version.
**Resolution:** 2026-05-20 — Took option (c): parameterised `CreateGatewayHelloEnvelope`/`CreateCommandEnvelope`/`CreateCancelEnvelope`/`CreateShutdownEnvelope` with a `ulong sequence` argument (defaults 1/2/2/3 respectively, matching the typical Hello/Command/Cancel/Shutdown ordering), so the literal sequence values no longer carry implicit meaning. Updated the cancel-correlation test's wire trace to ascend (Hello=1, Cancel=2, Shutdown=3) and added a comment noting that the worker has no inbound monotonicity check — the parameter exists so multi-frame tests can pin the trace ordering explicitly when needed.
+
+### Worker.Tests-031
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Testing coverage |
+| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs` (all `FailoverSettings` constructions) |
+| Status | Resolved |
+
+**Description:** Every `FailoverSettings` in `FailoverAlarmConsumerTests` is built with `probeIntervalSeconds: 0`, which deliberately *disables* the probe throttle. The throttle-active branch in `FailoverAlarmConsumer.ProbeOnce` (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs:211-215`) — where a probe is *skipped* because fewer than `ProbeIntervalSeconds` have elapsed since `lastProbeAtUtc` — is therefore never exercised. This is a genuine production behaviour: the failback cadence is the only thing preventing a degraded worker from hammering the broken primary with a `PollOnce` on every timer tick, and `AlarmCommandHandlerTests.Subscribe_AutoModeWithWatchList_...` wires a real non-zero `FailbackProbeIntervalSeconds = 1` into the handler, so the throttle is on the live path. A regression that inverted the comparison (probing only *after* the interval became `>=` instead of skipping while `<`), dropped the `lastProbeAtUtc` update, or removed the throttle entirely would not be caught by any test. The task brief named "ProbeIntervalSeconds enforcement" as an explicit focus area.
+
+**Recommendation:** Add a test that constructs `FailoverSettings(threshold: 1, probeIntervalSeconds: , stableProbes: 1)` with a non-zero interval, forces failover, makes the primary healthy, then calls `ProbeOnce()` twice in quick succession and asserts the second call did *not* probe (e.g. assert `primary.Polls` advanced by exactly one and `Mode` is still `Subtag`). Because the throttle reads `DateTime.UtcNow` directly, either accept a coarse same-wall-clock-instant assertion (two back-to-back calls reliably fall inside any interval ≥ 1s) or, preferably, refactor `ProbeOnce` to take an injectable clock so the throttle boundary can be pinned deterministically without wall-clock dependence (consistent with the Worker.Tests-020 manual-time-source approach).
+
+**Resolution:** 2026-06-15 — Took the coarse same-wall-clock-instant approach (no production-code clock injection needed). Added `FailoverAlarmConsumerTests.ProbeOnce_WithNonZeroInterval_ThrottlesSecondProbeWithinInterval`: builds `FailoverSettings(threshold: 1, probeIntervalSeconds: 3600, stableProbes: 5)`, forces failover to Subtag, makes the primary healthy, then calls `ProbeOnce()` twice back-to-back. The first probe re-polls the primary (`primary.Polls == 1`); the second falls inside the 3600s interval and is throttled, so `primary.Polls` is unchanged and `Mode` stays `Subtag`. `stableProbes: 5` keeps a single clean probe from failing back, so the throttled `ProbeOnce` path stays in scope. A 1-hour interval makes the two back-to-back calls reliably fall inside the window without any timing flakiness.
+
+### Worker.Tests-032
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs` |
+| Status | Resolved |
+
+**Description:** Two resilience branches of `FailoverAlarmConsumer` are uncovered by the new tests. (1) `RunPrimary` catches `Exception ex when (ex is not OutOfMemoryException)` (`FailoverAlarmConsumer.cs:295`) — the OOM-safe catch path the task brief explicitly called out. No test throws `OutOfMemoryException` from the primary to verify it *propagates* (rather than being swallowed and counted toward the failover threshold like every other exception); the `FlakyPrimary` fake throws only `COMException`. A regression that broadened the filter to swallow OOM would convert a fatal allocation failure into a silent failover. (2) The `FailoverSettings` constructor clamps `threshold < 1 → 1` and `stableProbes < 1 → 1` (`FailoverSettings.cs:38-40`); no test passes a sub-1 value to confirm the clamp, so a misconfigured `ConsecutiveFailureThreshold = 0` from the gateway could change failover semantics undetected.
+
+**Recommendation:** Add a `FlakyPrimary`-style fake (or a flag on the existing one) that throws `OutOfMemoryException` from `PollOnce`, and assert `sut.PollOnce()` rethrows it via `Assert.Throws` and that no `ProviderModeChanged` fired. Add a small `FailoverSettings` fact (or `[Theory]`) asserting `new FailoverSettings(0, 0, 0).Threshold == 1` and `.StableProbes == 1` to pin the clamp.
+
+**Resolution:** 2026-06-15 — Added a `ThrowOutOfMemoryOnPoll` flag to the existing `FlakyPrimary` fake (its `PollOnce` throws `OutOfMemoryException` when set, checked before the `COMException` branch). Regression test `FailoverAlarmConsumerTests.RunPrimary_WhenPrimaryThrowsOutOfMemory_PropagatesAndDoesNotFailOver` drives `PollOnce` through the primary, asserts `Assert.Throws`, and asserts no `ProviderModeChanged` fired and `Mode` stays `Alarmmgr` — pinning that the `when (ex is not OutOfMemoryException)` filter lets OOM propagate rather than swallowing it and counting it toward the failover threshold. The clamp is pinned by `FailoverSettings_ClampsSubMinimumValues` (a `[Theory]`): `(0,0,0)→(1,0,1)`, `(-5,-5,-5)→(1,0,1)`, and a pass-through `(3,7,2)→(3,7,2)` to confirm in-range values are not altered.
+
+### Worker.Tests-033
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Testing coverage |
+| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs` |
+| Status | Resolved |
+
+**Description:** `SubtagAlarmStateMachineTests` covers the core transition matrix and the acked latch well, but two branches of the new state machine are unexercised. (1) The ack-while-inactive path in `SubtagAlarmStateMachine.ApplyAcked` (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs:156-164`): when `.acked` flips true while the alarm is *not* active, the machine must emit nothing and must *not* set `AckedDuringEpisode` — otherwise a stale ack from a prior episode could mis-latch the next raise into a spurious `ACK_RTN`. No test drives an `.acked` change without a preceding active raise. (2) The priority-subtag path (`SubtagRole.Priority` → `state.Priority = CoerceInt(...)`, line 76-78): `SubtagAlarmConsumerTests.Subscribe_AdvisesAllSubtagsIncludingAckComment` confirms the priority subtag is *advised*, but no test raises a priority value change and asserts it flows into the emitted/snapshot record's `Priority`, so `CoerceInt` and the priority assignment are untested in the state-machine layer.
+
+**Recommendation:** Add (a) `AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch` — apply `.acked=true` with no prior active raise, assert `Apply` returns empty, then raise active and clear and assert the clear emits `UnackRtn` (proving the stale ack did not latch); and (b) `PriorityChange_FlowsIntoEmittedRecord` — apply a priority value then an active raise and assert the emitted record's `Priority` equals the supplied value (and a `CoerceInt` string/garbage case falls back).
+
+**Resolution:** 2026-06-15 — Added both tests to `SubtagAlarmStateMachineTests`. `AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch` applies `.acked=true` with no preceding active raise (asserts `Apply` returns empty), then drives a fresh raise→clear episode and asserts the clear emits `UnackRtn` — proving the stale inactive ack did not latch `AckedDuringEpisode`. `PriorityChange_FlowsIntoEmittedRecord` (the target now includes a `PrioritySubtag`) applies an `int` priority `750` (asserts the priority change emits nothing), raises active and asserts the emitted record's `Priority == 750` (exercising `CoerceInt`'s `int` path and the priority assignment), then applies a non-numeric `"not-a-number"` priority and asserts the snapshot `Priority` is still `750` (the `CoerceInt` string fallback keeps the prior value, not zero).
diff --git a/code-reviews/Worker/findings.md b/code-reviews/Worker/findings.md
index 1a6273c..d7e4d73 100644
--- a/code-reviews/Worker/findings.md
+++ b/code-reviews/Worker/findings.md
@@ -4,11 +4,38 @@
|---|---|
| Module | `src/ZB.MOM.WW.MxGateway.Worker` |
| Reviewer | Claude Code |
-| Review date | 2026-05-24 |
-| Commit reviewed | `42b0037` |
+| Review date | 2026-06-15 |
+| Commit reviewed | `410acc9` |
| Status | Re-reviewed |
| Open findings | 0 |
+## 2026-06-15 re-review (commit `410acc9`)
+
+Re-review of the `42b0037..410acc9` diff — the alarm-provider subtag-fallback
+feature (`git diff 42b0037..410acc9 -- src/ZB.MOM.WW.MxGateway.Worker/`). New
+substantive code: `SubtagAlarmConsumer`, `SubtagAlarmStateMachine`,
+`FailoverAlarmConsumer`, `LmxSubtagAlarmSource`, `SyntheticAlarmGuid`,
+`AlarmProviderModeChange`, `FailoverSettings`, `ISubtagAlarmSource` /
+`SubtagValueChange`, plus the degraded/`source_provider` propagation in
+`AlarmDispatcher` / `MxAccessAlarmEventSink` / `MxAccessEventMapper`, the
+`ForcedMode`/watch-list routing and STA-COM-factory threading in
+`AlarmCommandHandler` / `MxAccessStaSession`, and the `SubscribeAlarmsCommand`
+re-plumb in `MxAccessCommandExecutor`. Three new findings: **Worker-026** (High),
+**Worker-027** (Medium), **Worker-028** (Low). Worker-001..025 remain closed.
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness & logic bugs | No issues found. Subtag synthesis (`SubtagAlarmStateMachine` raise/ack/clear, `AckedDuringEpisode` latch, segment-boundary name derivation), exact-match ack resolution (`ResolveTargetByName` avoids the prefix false-positive), and `MapTransition`'s `Unspecified→*Alm` raise path are all sound. |
+| 2 | mxaccessgw conventions | No issues found. The synthesis is worker-side and every degraded record/event carries `degraded=true` + `source_provider=SUBTAG`, satisfying the explicit opt-in non-parity exception to the "never synthesize events" rule. The gateway never instantiates COM. net48 constraint respected — `AlarmProviderModeChange`/`FailoverSettings` are plain classes with get-only ctor-assigned props (no init/positional records); no `WriteRecord`-style init usage introduced. |
+| 3 | Concurrency & thread safety | Issue found: Worker-026 (an exception in the failover switch path — `SwitchToStandby`'s priming snapshot or either switch's `ProviderModeChanged` handler — escapes the state machine after `active` has already flipped, killing the STA alarm-poll loop with no mode-changed event). STA affinity itself is sound: `LmxSubtagAlarmSource` owns its own apartment-bound `LMXProxyServerClass`, all consumer calls are STA-confined via `AlarmCommandHandler`'s affinity guard, and `Dispose` UnAdvises before tearing handles down so a late pump callback cannot re-enter. |
+| 4 | Error handling & resilience | Issue found: Worker-027 (`SyntheticAlarmGuid` uses `MD5.Create()`, which throws on a net48 FIPS-policy host — breaking every subtag transition stamp and snapshot, and feeding Worker-026's poll-loop-kill path). `FailoverSettings` clamps tunables to safe minimums; `LmxSubtagAlarmSource` teardown is best-effort/idempotent. |
+| 5 | Security | No issues found. No secret/credential logging on the alarm path; ack comments are operator-supplied alarm metadata, not secrets. Synthetic GUID is non-cryptographic by design and not a security control. |
+| 6 | Performance & resource management | No issues found. `LmxSubtagAlarmSource` releases its COM object via `FinalReleaseComObject` and tracks advised-vs-added handles so `Dispose` only UnAdvises what it advised. The standby is armed once and gated-by-active rather than churning subscribe/unsubscribe per switch. |
+| 7 | Design-document adherence | No issues found. Implementation matches `docs/plans/2026-06-13-alarm-subtag-fallback-design.md` (auto-failover/failback, ack-comment-write ack, worker-side synthesis, additive proto fields). The probe re-polls the still-subscribed primary (single-subscribe constraint) as the design's "Superseded" notes describe. |
+| 8 | Code organization & conventions | Issue found: Worker-028 (the dup-subtag-address guard in `SubtagAlarmStateMachine.Bind` does not cover duplicate `AlarmFullReference` entries, which silently overwrite in `targetsByReference`/`_statesByReference`). One-public-type-per-file is otherwise respected for the new files. |
+| 9 | Testing coverage | No standalone finding. New unit suites exist for each major component (`SubtagAlarmConsumerTests`, `SubtagAlarmStateMachineTests`, `FailoverAlarmConsumerTests`, `LmxSubtagAlarmSourceTests`, `SyntheticAlarmGuidTests`), matching the design's test matrix. The switch-path exception fragility (Worker-026) and the dup-reference case (Worker-028) are untested edge cases noted in those findings. |
+| 10 | Documentation & comments | No issues found. The new types carry accurate XML docs; the net48-constraint rationale is documented inline on `FailoverSettings`/`AlarmProviderModeChange`; the "why PollOnce only, no re-Subscribe" and probe-throttle behaviour are documented on `FailoverAlarmConsumer.ProbeOnce`. |
+
## 2026-05-24 re-review (commit `42b0037`)
**Re-review: no new findings.** `git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker` returns empty — the Worker module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the `d692232` pass. The header is bumped to track the latest reviewed commit; Worker-001..025 remain closed.
@@ -464,3 +491,50 @@ _runtimeSession = _runtimeSessionFactory()
Match the pattern `AlarmCommandHandler.Subscribe` already uses for `consumerFactory()` (`AlarmCommandHandler.cs:76-77`).
**Resolution:** 2026-05-20 — `WorkerPipeSession.RunAsync` now uses `_runtimeSession = _runtimeSessionFactory() ?? throw new InvalidOperationException("Worker runtime session factory returned null.");`, matching the pattern `AlarmCommandHandler.Subscribe` uses for its `consumerFactory()`. A null factory return now produces a clear diagnostic exception at the call site instead of NRE-ing on the next dereference (and the `finally` block's `_runtimeSession?.Dispose()` silently no-oping on a half-initialized session). Regression test `WorkerPipeSessionTests.RunAsync_WhenRuntimeSessionFactoryReturnsNull_ThrowsDiagnosticException` drives `RunAsync` with `() => null!` and asserts the diagnostic `InvalidOperationException` is thrown with the expected message.
+
+### Worker-026
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Location | `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs:289-338`, `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessStaSession.cs:307-320` |
+| Status | Resolved |
+
+**Description:** `FailoverAlarmConsumer.SwitchToStandby` flips `active = Active.Standby` / `mode = Subtag` first, then calls `_ = standby.SnapshotActiveAlarms();` (the priming side-effect), and only then calls `RaiseModeChanged(...)`. If `standby.SnapshotActiveAlarms()` throws, the exception escapes `SwitchToStandby`, escapes the `catch` in `RunPrimary`, and escapes `FailoverAlarmConsumer.PollOnce`/`Subscribe`. The `SubtagAlarmConsumer.SnapshotActiveAlarms` path is not exception-free: it calls `StampSynthetic` → `SyntheticAlarmGuid.ForReference` (which throws on a FIPS host — see Worker-027) and walks live state. The same exposure exists for `RaiseModeChanged` itself: the attached `AlarmCommandHandler.OnProviderModeChanged` handler runs synchronously and calls `eventQueue.Enqueue(...)`, which throws `MxAccessEventQueueOverflowException` at capacity; that also propagates out of both `SwitchToStandby` and `SwitchToPrimary`.
+
+When this happens the consumer has **already** transitioned `active`/`mode` to Standby (or Primary) but the `ProviderModeChanged` event is never emitted — so the gateway never learns the feed went degraded. Worse, because the failover calls run on the worker's STA inside `RunAlarmPollLoopAsync`, the escaping exception lands in that loop's trailing `catch (Exception)` arm (`MxAccessStaSession.cs:307-320`), which records a single fault and **permanently stops the alarm poll loop**. The standby is then never pumped or probed again — i.e. a transient primary COM fault that should have produced a clean degraded-mode handoff instead produces a total, undetected alarm outage for the session, defeating the entire purpose of the fallback feature. There is no safe operator workaround short of restarting the session.
+
+**Recommendation:** Make the switch atomic and exception-isolated: raise `ProviderModeChanged` (and perform the priming snapshot) inside their own `try`/`catch` so a snapshot or handler failure cannot abort the switch or unwind into the poll loop. Order the state flip so the mode-changed notification is guaranteed to fire even if priming fails (e.g. flip state, raise mode-changed in a guarded block, then attempt the priming snapshot in a separate guarded block whose failure is logged/faulted but non-fatal). Add a regression test where the standby's `SnapshotActiveAlarms` throws on the first call after failover, asserting (a) `ProviderModeChanged` still fires and (b) `PollOnce` does not rethrow.
+
+**Resolution:** 2026-06-15 — Reordered and exception-isolated the failover switch in `FailoverAlarmConsumer`. `SwitchToStandby` now flips `active`/`mode`, then raises `ProviderModeChanged` FIRST (so the gateway always learns the feed went degraded), then primes the standby snapshot via a new `TryPrimeStandbySnapshot()` whose failure is swallowed (`catch when ex is not OutOfMemoryException`) — a priming failure can no longer abort the switch or unwind into the poll loop. `RaiseModeChanged` itself now wraps `ProviderModeChanged?.Invoke` in a `try`/`catch (when ex is not OutOfMemoryException)` so a subscriber handler exception (e.g. `AlarmCommandHandler.OnProviderModeChanged`'s `eventQueue.Enqueue` overflowing) cannot escape `SwitchToStandby`/`SwitchToPrimary` into `RunAlarmPollLoopAsync`'s trailing catch and permanently stop alarm polling. `OutOfMemoryException` is deliberately allowed to propagate. The MXAccessStaSession poll-loop arm is unchanged — the fix prevents the escape rather than catching it there. Regression tests in `FailoverAlarmConsumerTests`: `Failover_WhenStandbyPrimingSnapshotThrows_StillRaisesModeChangeAndDoesNotRethrow` (standby `SnapshotActiveAlarms` throws on the priming call → `ProviderModeChanged` still fires, `Mode` is Subtag, `Subscribe`/`PollOnce` do not rethrow) and `Failover_WhenModeChangedHandlerThrows_SwitchStillTakesEffectAndDoesNotRethrow` (a throwing `ProviderModeChanged` subscriber → switch still takes effect, no rethrow).
+
+### Worker-027
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Location | `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs:38-40` |
+| Status | Resolved |
+
+**Description:** `SyntheticAlarmGuid.ForReference` derives the deterministic alarm GUID via `using MD5 md5 = MD5.Create();`. The worker targets .NET Framework 4.8, where `MD5.Create()` returns `MD5CryptoServiceProvider`. When the host has the Windows FIPS-compliance policy enabled (`Enabled=1` under `HKLM\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy`), the non-validated `MD5CryptoServiceProvider` constructor throws `InvalidOperationException` ("This implementation is not part of the Windows Platform FIPS validated cryptographic algorithms."). `SyntheticAlarmGuid.ForReference` is on the hot path of the subtag fallback: `SubtagAlarmConsumer.StampSynthetic` calls it for **every** synthesized transition and **every** snapshot record. On a FIPS host the subtag fallback therefore throws on first use; combined with Worker-026 that exception kills the STA alarm-poll loop, so the fallback is not merely degraded but completely non-functional exactly when it is needed (after the primary alarmmgr provider has failed). The comment already notes MD5 is "never for security" — the issue is availability under FIPS policy, not cryptographic strength. The regulated deployment hosts (Zimmer) are a plausible FIPS environment.
+
+**Recommendation:** Replace `MD5.Create()` with a FIPS-agnostic non-cryptographic 128-bit hash that does not route through the crypto FIPS gate — e.g. compute the 16 GUID bytes from a stable hash that does not use `System.Security.Cryptography` (a fixed FNV-1a / xxHash-style derivation over the UTF-8 bytes), or use `SHA256` truncated to 16 bytes via the managed `SHA256Managed`/`IncrementalHash` only if confirmed FIPS-safe on net48 (it is not guaranteed — prefer the non-crypto route). The mapping only needs determinism and collision resistance for distinct references, not cryptographic properties. Add a test that exercises `ForReference` without depending on a crypto provider.
+
+**Resolution:** 2026-06-15 — Replaced the `MD5.Create()` derivation in `SyntheticAlarmGuid.ForReference` with a pure-managed FNV-1a hash: two independent 64-bit FNV-1a passes over the UTF-8 bytes (the high pass mixes the byte index into its accumulator to decorrelate the halves) fill the low/high 64 bits of the 128-bit GUID, and the input length is folded in so the empty string is non-degenerate (never `Guid.Empty`). The `using System.Security.Cryptography;` import is gone, so no FIPS-gated `MD5CryptoServiceProvider` is ever constructed — the subtag fallback no longer throws on a FIPS-policy host. The derivation stays deterministic and distinct-per-reference. The existing `SyntheticAlarmGuidTests` (`SameReference_SameGuid`, `DifferentReference_DifferentGuid`, `Reference_ProducesNonEmptyGuid`) pin only those properties — not a specific GUID literal — so they continue to pass unchanged; no test needed a value update. Added regression tests `SyntheticAlarmGuidTests.EmptyReference_ProducesNonEmptyGuid` (length-fold guard against a degenerate all-zero result) and `ForReference_UnderFipsEnforcement_DoesNotThrowAndStaysDeterministic` (sets the managed `UseLegacyFipsThrow` AppContext switch and asserts the derivation still succeeds deterministically; a regression reintroducing a FIPS-gated provider would throw here).
+
+### Worker-028
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Location | `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs:43-52`, `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmConsumer.cs:70-75` |
+| Status | Resolved |
+
+**Description:** `SubtagAlarmStateMachine.Bind` throws `ArgumentException` on a duplicate subtag **item address** (the documented dup-address guard), but neither the state machine nor `SubtagAlarmConsumer` guards against a duplicate `AlarmFullReference` in the watch list. When two `AlarmSubtagTarget` entries share an `AlarmFullReference` but use different subtag addresses, `_statesByReference[target.AlarmFullReference] = state` and `targetsByReference[reference] = target` each silently overwrite the earlier entry, while the earlier target's subtag addresses are still bound to an orphaned `AlarmState`. The orphaned state is mutated by incoming value changes but is invisible to `SnapshotActive` (which iterates only the surviving `_statesByReference.Values`) and to ack resolution (which uses the surviving `targetsByReference`). The result is silently inconsistent synthesized state for that reference. This is a watch-list configuration error (the gateway resolves the watch list), so impact is limited, but the asymmetry — addresses are guarded, references are not — is surprising and silent.
+
+**Recommendation:** Add a duplicate-`AlarmFullReference` guard symmetric with the dup-address guard: throw a descriptive `ArgumentException` from the `SubtagAlarmStateMachine` (or `SubtagAlarmConsumer`) constructor when two watch-list entries share a reference, so a misconfigured watch list fails fast at subscribe time rather than producing silently inconsistent state. Cover it with a unit test.
+
+**Resolution:** 2026-06-15 — Added a duplicate-`AlarmFullReference` guard in the `SubtagAlarmStateMachine` constructor symmetric with the existing dup-address guard in `Bind`: before adding each target's `_statesByReference` entry it checks `ContainsKey` (the dictionary is `OrdinalIgnoreCase`, matching the consumer's `targetsByReference` lookup) and throws a descriptive `ArgumentException` ("Duplicate alarm full reference '{reference}' is bound to more than one alarm target."). Because `SubtagAlarmConsumer` constructs the state machine before populating its own `targetsByReference`, this guard fires before the consumer's silent overwrite too, covering both dictionaries from one canonical check. Regression test `SubtagAlarmStateMachineTests.DuplicateAlarmFullReference_Throws` (two targets sharing a reference but using distinct active subtags → `ArgumentException`).
diff --git a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs
index 07060fb..575bcfa 100644
--- a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs
+++ b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs
@@ -27,6 +27,15 @@ public sealed class FailoverAlarmConsumerTests
public event EventHandler? AlarmTransitionEmitted;
public bool ThrowOnPoll = true;
+
+ ///
+ /// When set, throws
+ /// instead of a
+ /// , to
+ /// exercise the OOM-safe exception filter (Worker.Tests-032).
+ ///
+ public bool ThrowOutOfMemoryOnPoll;
+
public int Polls;
///
@@ -48,6 +57,11 @@ public sealed class FailoverAlarmConsumerTests
public void PollOnce()
{
Polls++;
+ if (ThrowOutOfMemoryOnPoll)
+ {
+ throw new OutOfMemoryException("simulated allocation failure");
+ }
+
if (ThrowOnPoll)
{
throw new System.Runtime.InteropServices.COMException("boom", unchecked((int)0x80004005));
@@ -75,6 +89,15 @@ public sealed class FailoverAlarmConsumerTests
public bool Subscribed;
+ ///
+ /// When set, throws — modeling a
+ /// priming-snapshot failure during failover (Worker-026).
+ ///
+ public bool ThrowOnSnapshot;
+
+ /// Number of calls.
+ public int SnapshotCalls;
+
public void Subscribe(string s) => Subscribed = true;
public void PollOnce() { }
@@ -83,7 +106,16 @@ public sealed class FailoverAlarmConsumerTests
public int AcknowledgeByName(string n, string p, string gr, string c, string a, string b, string d, string e) => 22;
- public IReadOnlyList SnapshotActiveAlarms() => Array.Empty();
+ public IReadOnlyList SnapshotActiveAlarms()
+ {
+ SnapshotCalls++;
+ if (ThrowOnSnapshot)
+ {
+ throw new InvalidOperationException("priming snapshot failed");
+ }
+
+ return Array.Empty();
+ }
public void Dispose() { }
@@ -291,4 +323,151 @@ public sealed class FailoverAlarmConsumerTests
sut.ProbeOnce(); // clean 3 → failback
Assert.Equal(AlarmProviderMode.Alarmmgr, sut.Mode);
}
+
+ ///
+ /// Worker-026 regression: when the standby's priming
+ /// SnapshotActiveAlarms throws during failover, the switch must
+ /// still (a) fire ProviderModeChanged so the gateway learns the
+ /// feed went degraded, (b) leave
+ /// in Subtag, and (c) not rethrow out of PollOnce (which on the
+ /// real STA would land in the poll loop's trailing catch and permanently
+ /// stop alarm delivery).
+ ///
+ [Fact]
+ public void Failover_WhenStandbyPrimingSnapshotThrows_StillRaisesModeChangeAndDoesNotRethrow()
+ {
+ FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = true };
+ StubStandby standby = new StubStandby { ThrowOnSnapshot = true };
+ FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 0, stableProbes: 1);
+ using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings);
+
+ List changes = new List();
+ sut.ProviderModeChanged += (_, e) => changes.Add(e);
+
+ // threshold=1 → the Subscribe failure triggers the switch, which primes
+ // the standby snapshot (throwing). The exception must be contained.
+ Exception? escaped = Record.Exception(() => sut.Subscribe(@"\\HOST\Galaxy!Area"));
+
+ Assert.Null(escaped);
+ Assert.Single(changes);
+ Assert.Equal(AlarmProviderMode.Subtag, changes[0].Mode);
+ Assert.Equal(AlarmProviderMode.Subtag, sut.Mode);
+ Assert.True(standby.SnapshotCalls >= 1); // priming was attempted
+
+ // A subsequent degraded PollOnce (standby.PollOnce + ProbeOnce) must also
+ // not rethrow the snapshot failure.
+ Exception? pollEscaped = Record.Exception(() => sut.PollOnce());
+ Assert.Null(pollEscaped);
+ }
+
+ ///
+ /// Worker-026 regression: when a ProviderModeChanged subscriber's
+ /// handler throws (modeling the AlarmCommandHandler's event-queue enqueue
+ /// overflowing at capacity), the switch must still take effect and the
+ /// exception must not escape the switch path into the poll loop.
+ ///
+ [Fact]
+ public void Failover_WhenModeChangedHandlerThrows_SwitchStillTakesEffectAndDoesNotRethrow()
+ {
+ FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = true };
+ StubStandby standby = new StubStandby();
+ FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 0, stableProbes: 1);
+ using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings);
+
+ int handlerInvocations = 0;
+ sut.ProviderModeChanged += (_, _) =>
+ {
+ handlerInvocations++;
+ throw new InvalidOperationException("subscriber handler blew up");
+ };
+
+ Exception? escaped = Record.Exception(() => sut.Subscribe(@"\\HOST\Galaxy!Area"));
+
+ Assert.Null(escaped);
+ Assert.Equal(1, handlerInvocations); // the event still fired
+ Assert.Equal(AlarmProviderMode.Subtag, sut.Mode); // the switch still took effect
+ }
+
+ ///
+ /// Worker.Tests-031 regression: with a non-zero
+ /// , two back-to-back
+ /// ProbeOnce calls must throttle — the second falls inside the
+ /// interval and must NOT re-poll the primary. Two consecutive calls
+ /// reliably fall inside any interval of one second or more, so this needs
+ /// no injected clock.
+ ///
+ [Fact]
+ public void ProbeOnce_WithNonZeroInterval_ThrottlesSecondProbeWithinInterval()
+ {
+ FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = true };
+ StubStandby standby = new StubStandby();
+ // stableProbes high enough that a single clean probe cannot fail back,
+ // so Mode stays Subtag and ProbeOnce remains the throttled path.
+ FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 3600, stableProbes: 5);
+ using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings);
+
+ sut.Subscribe(@"\\HOST\Galaxy!Area"); // threshold=1 → switch to Subtag
+ Assert.Equal(AlarmProviderMode.Subtag, sut.Mode);
+
+ primary.ThrowOnPoll = false; // primary healthy so a probe would poll cleanly
+
+ sut.ProbeOnce(); // first probe runs: re-polls the primary
+ int pollsAfterFirstProbe = primary.Polls;
+ Assert.Equal(1, pollsAfterFirstProbe);
+
+ sut.ProbeOnce(); // within the 3600s interval → throttled, must NOT re-poll
+
+ Assert.Equal(pollsAfterFirstProbe, primary.Polls);
+ Assert.Equal(AlarmProviderMode.Subtag, sut.Mode);
+ }
+
+ ///
+ /// Worker.Tests-032 regression: RunPrimary's
+ /// when (ex is not OutOfMemoryException) filter must let an
+ /// propagate rather than swallowing it
+ /// and counting it toward the failover threshold. No mode change must
+ /// fire — a fatal allocation failure is not a clean degraded handoff.
+ ///
+ [Fact]
+ public void RunPrimary_WhenPrimaryThrowsOutOfMemory_PropagatesAndDoesNotFailOver()
+ {
+ FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = false, ThrowOutOfMemoryOnPoll = true };
+ StubStandby standby = new StubStandby();
+ FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 0, stableProbes: 1);
+ using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings);
+
+ bool modeChanged = false;
+ sut.ProviderModeChanged += (_, _) => modeChanged = true;
+
+ sut.Subscribe(@"\\HOST\Galaxy!Area"); // Subscribe path does not poll; no throw here
+
+ Assert.Throws(() => sut.PollOnce());
+ Assert.False(modeChanged);
+ Assert.Equal(AlarmProviderMode.Alarmmgr, sut.Mode);
+ }
+
+ ///
+ /// Worker.Tests-032 regression: clamps
+ /// sub-1 threshold and stableProbes (and sub-0
+ /// probeIntervalSeconds) to their safe minimums so a misconfigured
+ /// bind cannot change failover semantics.
+ ///
+ [Theory]
+ [InlineData(0, 0, 0, 1, 0, 1)]
+ [InlineData(-5, -5, -5, 1, 0, 1)]
+ [InlineData(3, 7, 2, 3, 7, 2)]
+ public void FailoverSettings_ClampsSubMinimumValues(
+ int threshold,
+ int probeInterval,
+ int stableProbes,
+ int expectedThreshold,
+ int expectedProbeInterval,
+ int expectedStableProbes)
+ {
+ FailoverSettings settings = new FailoverSettings(threshold, probeInterval, stableProbes);
+
+ Assert.Equal(expectedThreshold, settings.Threshold);
+ Assert.Equal(expectedProbeInterval, settings.ProbeIntervalSeconds);
+ Assert.Equal(expectedStableProbes, settings.StableProbes);
+ }
}
diff --git a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs
index 15b8493..6f1ee53 100644
--- a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs
+++ b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs
@@ -109,6 +109,31 @@ public sealed class SubtagAlarmStateMachineTests
Assert.Throws(() => new SubtagAlarmStateMachine(new[] { first, second }));
}
+ ///
+ /// Worker-028 regression: two watch-list entries sharing an
+ /// (but using distinct
+ /// subtag addresses) must throw at construction, symmetric with the
+ /// duplicate-address guard, rather than silently overwriting the earlier
+ /// reference's state and orphaning its bound addresses.
+ ///
+ [Fact]
+ public void DuplicateAlarmFullReference_Throws()
+ {
+ var first = new AlarmSubtagTarget
+ {
+ AlarmFullReference = "Galaxy!Area.Tank01.Level.HiHi",
+ SourceObjectReference = "Tank01",
+ ActiveSubtag = "Tank01.Level.HiHi.active",
+ };
+ var second = new AlarmSubtagTarget
+ {
+ AlarmFullReference = "Galaxy!Area.Tank01.Level.HiHi",
+ SourceObjectReference = "Tank01",
+ ActiveSubtag = "Other.active",
+ };
+ Assert.Throws(() => new SubtagAlarmStateMachine(new[] { first, second }));
+ }
+
[Fact]
public void AckedTrueWhileActive_EmitsAck()
{
@@ -162,4 +187,64 @@ public sealed class SubtagAlarmStateMachineTests
var events = sm.Apply("Some.Other.Tag.active", true, DateTime.UtcNow);
Assert.Empty(events);
}
+
+ ///
+ /// Worker.Tests-033 regression: an ack arriving while the alarm is NOT
+ /// active must emit nothing and must NOT latch
+ /// AckedDuringEpisode — otherwise a stale ack from a prior episode
+ /// would mis-latch the next raise into a spurious ACK_RTN on clear. The
+ /// subsequent raise/clear must therefore still emit UNACK_RTN.
+ ///
+ [Fact]
+ public void AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch()
+ {
+ var sm = new SubtagAlarmStateMachine(new[] { Target() });
+ var ts = new DateTime(2026, 6, 13, 9, 0, 0, DateTimeKind.Utc);
+
+ // Ack with no preceding active raise: must be a no-op.
+ var ackEvents = sm.Apply("Tank01.Level.HiHi.acked", true, ts);
+ Assert.Empty(ackEvents);
+
+ // A fresh episode: raise then clear. Because the earlier ack must not
+ // have latched AckedDuringEpisode, the clear must be UNACK_RTN.
+ sm.Apply("Tank01.Level.HiHi.active", true, ts.AddSeconds(5));
+ var clearEvents = sm.Apply("Tank01.Level.HiHi.active", false, ts.AddSeconds(10));
+ var clear = Assert.Single(clearEvents);
+ Assert.Equal(MxAlarmStateKind.UnackRtn, clear.Record.State);
+ }
+
+ ///
+ /// Worker.Tests-033 regression: a priority-subtag value change must flow
+ /// through CoerceInt into the emitted record's
+ /// . A non-numeric value must
+ /// leave the prior priority unchanged (the CoerceInt fallback path).
+ ///
+ [Fact]
+ public void PriorityChange_FlowsIntoEmittedRecord()
+ {
+ var target = new AlarmSubtagTarget
+ {
+ AlarmFullReference = "Galaxy!Area.Tank01.Level.HiHi",
+ SourceObjectReference = "Tank01",
+ ActiveSubtag = "Tank01.Level.HiHi.active",
+ AckedSubtag = "Tank01.Level.HiHi.acked",
+ PrioritySubtag = "Tank01.Level.HiHi.priority",
+ };
+ var sm = new SubtagAlarmStateMachine(new[] { target });
+ var ts = new DateTime(2026, 6, 13, 9, 0, 0, DateTimeKind.Utc);
+
+ // A priority change alone emits nothing but records the priority.
+ var priorityEvents = sm.Apply("Tank01.Level.HiHi.priority", 750, ts);
+ Assert.Empty(priorityEvents);
+
+ // Raise: the emitted record carries the recorded priority.
+ var raiseEvents = sm.Apply("Tank01.Level.HiHi.active", true, ts.AddSeconds(1));
+ var raise = Assert.Single(raiseEvents);
+ Assert.Equal(750, raise.Record.Priority);
+
+ // A non-numeric priority must fall back to the existing value, not zero.
+ sm.Apply("Tank01.Level.HiHi.priority", "not-a-number", ts.AddSeconds(2));
+ var snap = Assert.Single(sm.SnapshotActive());
+ Assert.Equal(750, snap.Priority);
+ }
}
diff --git a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SyntheticAlarmGuidTests.cs b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SyntheticAlarmGuidTests.cs
index 6877bf3..f0a1021 100644
--- a/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SyntheticAlarmGuidTests.cs
+++ b/src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SyntheticAlarmGuidTests.cs
@@ -24,4 +24,43 @@ public sealed class SyntheticAlarmGuidTests
[Fact]
public void Reference_ProducesNonEmptyGuid() =>
Assert.NotEqual(Guid.Empty, SyntheticAlarmGuid.ForReference("A.B.C"));
+
+ ///
+ /// Verifies the empty string still derives a non-empty GUID. The length
+ /// fold in the derivation prevents a degenerate all-zero (Guid.Empty)
+ /// result, which would collide with the unset-record default downstream.
+ ///
+ [Fact]
+ public void EmptyReference_ProducesNonEmptyGuid() =>
+ Assert.NotEqual(Guid.Empty, SyntheticAlarmGuid.ForReference(string.Empty));
+
+ ///
+ /// Worker-027 regression:
+ /// must derive its GUID without routing through
+ /// , because on net48
+ /// MD5.Create() throws under the Windows FIPS-compliance policy.
+ /// This test enables the per-AppContext FIPS-enforcement switch (which the
+ /// managed crypto factories honour) and asserts the derivation still
+ /// succeeds deterministically — a regression that reintroduced a FIPS-gated
+ /// provider would throw here instead of returning a stable GUID.
+ ///
+ [Fact]
+ public void ForReference_UnderFipsEnforcement_DoesNotThrowAndStaysDeterministic()
+ {
+ const string switchName = "Switch.System.Security.Cryptography.UseLegacyFipsThrow";
+ bool original = AppContext.TryGetSwitch(switchName, out bool value) && value;
+ AppContext.SetSwitch(switchName, true);
+ try
+ {
+ Guid first = SyntheticAlarmGuid.ForReference("Galaxy!Area.Tank01.Level.HiHi");
+ Guid second = SyntheticAlarmGuid.ForReference("Galaxy!Area.Tank01.Level.HiHi");
+
+ Assert.NotEqual(Guid.Empty, first);
+ Assert.Equal(first, second);
+ }
+ finally
+ {
+ AppContext.SetSwitch(switchName, original);
+ }
+ }
}
diff --git a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs
index 079c8b9..9c12ece 100644
--- a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs
+++ b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs
@@ -313,12 +313,20 @@ public sealed class FailoverAlarmConsumer : IMxAccessAlarmConsumer
consecutiveFailures = 0;
cleanProbes = 0;
+ // Emit the mode-changed notification FIRST and in a guarded block, so
+ // the gateway always learns the feed went degraded even if the priming
+ // snapshot below throws. A handler exception here must never escape the
+ // switch — escaping would (a) leave `active` flipped with no
+ // notification and (b) unwind into RunAlarmPollLoopAsync's trailing
+ // catch, which permanently stops alarm polling (Worker-026).
+ RaiseModeChanged(AlarmProviderMode.Subtag, reason, hresult);
+
// Warm the standby snapshot for the gateway hand-off. The gateway
// reconciles state from this snapshot, so the return value is not
- // consumed here — the call exists for its priming side effect.
- _ = standby.SnapshotActiveAlarms();
-
- RaiseModeChanged(AlarmProviderMode.Subtag, reason, hresult);
+ // consumed here — the call exists for its priming side effect. A
+ // failure to prime is non-fatal: the switch has already completed and
+ // been announced, and the standby's live transitions will still flow.
+ TryPrimeStandbySnapshot();
}
private void SwitchToPrimary(string reason, int hresult)
@@ -327,14 +335,49 @@ public sealed class FailoverAlarmConsumer : IMxAccessAlarmConsumer
mode = AlarmProviderMode.Alarmmgr;
consecutiveFailures = 0;
cleanProbes = 0;
+
+ // Guarded so a ProviderModeChanged handler exception cannot escape into
+ // the STA poll loop and kill alarm delivery (Worker-026).
RaiseModeChanged(AlarmProviderMode.Alarmmgr, reason, hresult);
}
+ ///
+ /// Primes the standby snapshot for the gateway hand-off, swallowing any
+ /// failure. The switch has already completed and the mode change has
+ /// already been announced before this runs, so a priming failure must
+ /// not abort the switch or unwind into the poll loop.
+ ///
+ private void TryPrimeStandbySnapshot()
+ {
+ try
+ {
+ _ = standby.SnapshotActiveAlarms();
+ }
+ catch (Exception ex) when (ex is not OutOfMemoryException)
+ {
+ // Non-fatal: the standby is active and its live transitions still
+ // flow; the gateway will reconcile from subsequent records. Do not
+ // let a transient snapshot failure escape and stop the poll loop.
+ }
+ }
+
private void RaiseModeChanged(AlarmProviderMode newMode, string reason, int hresult)
{
- ProviderModeChanged?.Invoke(
- this,
- new AlarmProviderModeChange(newMode, reason, hresult, DateTime.UtcNow));
+ try
+ {
+ ProviderModeChanged?.Invoke(
+ this,
+ new AlarmProviderModeChange(newMode, reason, hresult, DateTime.UtcNow));
+ }
+ catch (Exception ex) when (ex is not OutOfMemoryException)
+ {
+ // A subscriber's OnProviderModeChanged handler threw (e.g. the
+ // AlarmCommandHandler's eventQueue.Enqueue hitting capacity). The
+ // switch itself has already taken effect; swallow so the failure
+ // cannot unwind into RunAlarmPollLoopAsync and permanently stop
+ // alarm polling (Worker-026). The event-queue overflow it most
+ // likely signals is already surfaced as a fault on the IPC path.
+ }
}
private void OnChildTransition(object? sender, MxAlarmTransitionEvent e)
diff --git a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs
index 80a0aa6..3cf7261 100644
--- a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs
+++ b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs
@@ -42,8 +42,22 @@ public sealed class SubtagAlarmStateMachine
foreach (AlarmSubtagTarget target in targets)
{
+ // Guard duplicate references symmetrically with the dup-address guard
+ // in Bind: two watch-list entries that share an AlarmFullReference but
+ // differ in subtag addresses would otherwise silently overwrite the
+ // earlier _statesByReference entry while its addresses stay bound to an
+ // orphaned (and therefore invisible) AlarmState, producing silently
+ // inconsistent synthesized state. Fail fast at subscribe time instead.
+ string reference = target.AlarmFullReference ?? string.Empty;
+ if (_statesByReference.ContainsKey(reference))
+ {
+ throw new ArgumentException(
+ $"Duplicate alarm full reference '{reference}' is bound to more than one alarm target.",
+ nameof(targets));
+ }
+
var state = new AlarmState(target);
- _statesByReference[target.AlarmFullReference] = state;
+ _statesByReference[reference] = state;
Bind(target.ActiveSubtag, state, SubtagRole.Active);
Bind(target.AckedSubtag, state, SubtagRole.Acked);
diff --git a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs
index 18d67e0..df47661 100644
--- a/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs
+++ b/src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs
@@ -1,5 +1,4 @@
using System;
-using System.Security.Cryptography;
using System.Text;
namespace ZB.MOM.WW.MxGateway.Worker.MxAccess;
@@ -11,8 +10,29 @@ namespace ZB.MOM.WW.MxGateway.Worker.MxAccess;
/// repeated transitions for the same alarm reference correlate downstream
/// (acknowledge, snapshot, OPC UA mapping) without an alarmmgr-supplied GUID.
///
+///
+/// The 128-bit value is computed with a fixed FNV-1a hash over the UTF-8
+/// bytes of the reference, deliberately not via
+/// System.Security.Cryptography. On .NET Framework 4.8
+/// MD5.Create() returns the non-validated
+/// MD5CryptoServiceProvider, whose constructor throws under the
+/// Windows FIPS-compliance policy ("not part of the Windows Platform FIPS
+/// validated cryptographic algorithms"). Because this derivation needs only
+/// determinism and distinctness — never cryptographic strength — a plain
+/// non-crypto hash avoids the FIPS gate entirely, so the subtag fallback
+/// keeps working on regulated (FIPS-enabled) hosts exactly when it is needed.
+///
internal static class SyntheticAlarmGuid
{
+ // 64-bit FNV-1a constants (RFC-style; widely used reference values).
+ private const ulong FnvOffsetBasis = 14695981039346656037UL;
+ private const ulong FnvPrime = 1099511628211UL;
+
+ // A second independent seed for the high 8 bytes so the full 128-bit value
+ // is well-distributed across distinct references rather than two correlated
+ // halves of the same single-pass hash.
+ private const ulong FnvSecondSeed = 1469598103934665603UL;
+
///
/// Produces a stable for the given alarm reference.
/// The same reference always maps to the same GUID; distinct references
@@ -32,11 +52,39 @@ internal static class SyntheticAlarmGuid
byte[] bytes = Encoding.UTF8.GetBytes(reference);
- // MD5 is used purely for a stable, non-cryptographic identity mapping
- // (reference -> 16-byte GUID), never for security. Its 128-bit output
- // fits a GUID exactly, which is why it is preferred here.
- using MD5 md5 = MD5.Create();
- byte[] hash = md5.ComputeHash(bytes);
- return new Guid(hash);
+ // Two independent FNV-1a passes fill the low and high 64 bits of the
+ // 128-bit value. The second pass mixes the running length into its seed
+ // so single-character differences and re-orderings still diverge in both
+ // halves, avoiding correlated-half collisions a single pass would risk.
+ ulong low = FnvOffsetBasis;
+ ulong high = FnvSecondSeed;
+ for (int i = 0; i < bytes.Length; i++)
+ {
+ byte b = bytes[i];
+
+ low ^= b;
+ low *= FnvPrime;
+
+ high ^= unchecked(b + (ulong)i);
+ high *= FnvPrime;
+ }
+
+ // Fold the length in so the empty string and other short inputs are not
+ // degenerate (an all-zero / Guid.Empty result is undesirable downstream).
+ low ^= (ulong)bytes.Length;
+ low *= FnvPrime;
+
+ byte[] guidBytes = new byte[16];
+ WriteUInt64(guidBytes, 0, low);
+ WriteUInt64(guidBytes, 8, high);
+ return new Guid(guidBytes);
+ }
+
+ private static void WriteUInt64(byte[] buffer, int offset, ulong value)
+ {
+ for (int i = 0; i < 8; i++)
+ {
+ buffer[offset + i] = (byte)(value >> (i * 8));
+ }
}
}