fix(worker): resilient failover switch; FIPS-safe synthetic GUID; dup-reference guard + tests (Worker-026..028, Worker.Tests-031..033)

Joseph Doherty
2026-06-15 02:56:15 -04:00
parent ddf2d84fbc
commit cebe67e9bd
8 changed files with 584 additions and 20 deletions
+84 -2
@@ -4,11 +4,48 @@
|---|---|
| Module | `src/ZB.MOM.WW.MxGateway.Worker.Tests` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `42b0037` |
| Review date | 2026-06-15 |
| Commit reviewed | `410acc9` |
| Status | Re-reviewed |
| Open findings | 0 |
## 2026-06-15 re-review (commit `410acc9`)
Re-review of the alarm-fallback test additions in `git diff 42b0037..HEAD --
src/ZB.MOM.WW.MxGateway.Worker.Tests/`. New unit suites land for the subtag
fallback (`SubtagAlarmConsumerTests`, `SubtagAlarmStateMachineTests`,
`SyntheticAlarmGuidTests`, `LmxSubtagAlarmSourceTests`) and the auto-failover
composite (`FailoverAlarmConsumerTests`); the existing alarm suites are updated
for the `SubscribeAlarmsCommand`-based handler signature, the
`(eq, affinity, comFactory)` handler-factory delegate, and the new
degraded/source-provider fields. Most of the change is genuinely new coverage
plus a large volume of XML-doc additions on existing test doubles (benign).
Findings: the failover state-machine transitions (failover at threshold,
failback after stable probes, intermittent-failure reset, before/after-switch
forwarding, ack delegation, `ProbeOnce`-never-re-Subscribes) are all covered;
the acked latch (`OutOfOrderAckThenClear_StillEmitsAckRtn`), the dup-address
guard (`DuplicateActiveSubtag_Throws`), and the exact-match-vs-substring ack
resolution (`AcknowledgeByName_PrefixNameDoesNotFalseMatch`,
`AcknowledgeByGuid_*`) are all pinned. Three coverage gaps remain
(Worker.Tests-031/032/033), all in new alarm-fallback code paths. The two
newest files (`SyntheticAlarmGuidTests`, `LmxSubtagAlarmSourceTests`) omit an
explicit `using Xunit;` but compile via the `<Using Include="Xunit" />` global
using in the csproj, so that is not a finding.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found — failover/state-machine/ack tests assert meaningful post-conditions (mode, emitted state, target subtag address) and do not pass for the wrong reason; the prefix-name and unknown-guid negative cases pin the exact-match contract. |
| 2 | mxaccessgw conventions | No issues found — new test methods follow `Method_Scenario_Expectation`; STA-affinity respected (state machine / consumer driven synchronously through internal seams). |
| 3 | Concurrency & thread safety | No issues found — new failover/subtag suites are single-threaded and event-driven; no wall-clock floors or fixed sleeps were introduced (the `MxAccessValueCacheTests` change only deletes the old Worker.Tests-020 comment block). |
| 4 | Error handling & resilience | Issues found: Worker.Tests-032 — the `RunPrimary` `when (ex is not OutOfMemoryException)` filter (the OOM-safe catch path) and the `FailoverSettings` clamp branches are untested. |
| 5 | Security | No issues found — no secrets/credentials; ack-operator identity fields are sentinels. |
| 6 | Performance & resource management | No issues found — `IDisposable` test subjects use `using`; the `LmxSubtagAlarmSource` dispose-idempotency / unadvise-only-advised-handles teardown is regression-tested. |
| 7 | Design-document adherence | No issues found — tests mirror the alarm-fallback plan (degraded flag, synthetic GUID, subtag-ack via ack-comment, single-subscribe primary). |
| 8 | Code organization & conventions | No issues found — new suites live under `MxAccess/`; test doubles are per-file (acceptable for these narrow fakes). |
| 9 | Testing coverage | Issues found: Worker.Tests-031 (`ProbeIntervalSeconds` throttle-active branch never exercised — every test uses `probeIntervalSeconds: 0`), Worker.Tests-033 (`SubtagAlarmStateMachine` ack-while-inactive and priority-subtag branches uncovered). |
| 10 | Documentation & comments | No issues found — test XML docs match assertions; no misleading names observed. |
## 2026-05-24 re-review (commit `42b0037`)
**Re-review: no new findings.** `git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker.Tests` returns empty — the Worker.Tests module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the `d692232` pass. The header is bumped to track the latest reviewed commit; Worker.Tests-001..030 remain closed.
@@ -533,3 +570,48 @@ findings (Worker.Tests-001 through -030) are unaffected.
**Recommendation:** Either (a) reassign `CreateCancelEnvelope` to a sequence value `>` shutdown (or pass the sequence as a parameter, matching `CreateGatewayHelloEnvelope`'s parameter style), so the wire trace reads in ascending order; (b) add an XML-doc note on the cancel test stating that the worker has no inbound monotonicity check and the test ignores envelope sequence ordering; (c) parameterise all four helper methods so each test passes its desired sequence and the literal numbers stop carrying implicit meaning. Option (c) is the cleanest because `CreateGatewayHelloEnvelope` is already parameter-driven for nonce/version.
**Resolution:** 2026-05-20 — Took option (c): parameterised `CreateGatewayHelloEnvelope`/`CreateCommandEnvelope`/`CreateCancelEnvelope`/`CreateShutdownEnvelope` with a `ulong sequence` argument (defaults 1/2/2/3 respectively, matching the typical Hello/Command/Cancel/Shutdown ordering), so the literal sequence values no longer carry implicit meaning. Updated the cancel-correlation test's wire trace to ascend (Hello=1, Cancel=2, Shutdown=3) and added a comment noting that the worker has no inbound monotonicity check — the parameter exists so multi-frame tests can pin the trace ordering explicitly when needed.
### Worker.Tests-031
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs` (all `FailoverSettings` constructions) |
| Status | Resolved |
**Description:** Every `FailoverSettings` in `FailoverAlarmConsumerTests` is built with `probeIntervalSeconds: 0`, which deliberately *disables* the probe throttle. The throttle-active branch in `FailoverAlarmConsumer.ProbeOnce` (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs:211-215`), where a probe is *skipped* because fewer than `ProbeIntervalSeconds` seconds have elapsed since `lastProbeAtUtc`, is therefore never exercised. This is a genuine production behaviour: the failback cadence is the only thing preventing a degraded worker from hammering the broken primary with a `PollOnce` on every timer tick, and `AlarmCommandHandlerTests.Subscribe_AutoModeWithWatchList_...` wires a real non-zero `FailbackProbeIntervalSeconds = 1` into the handler, so the throttle is on the live path. A regression that inverted the comparison (skipping probes once the interval has elapsed instead of while it is still running), dropped the `lastProbeAtUtc` update, or removed the throttle entirely would not be caught by any test. The task brief named "ProbeIntervalSeconds enforcement" as an explicit focus area.
**Recommendation:** Add a test that constructs `FailoverSettings(threshold: 1, probeIntervalSeconds: <N>, stableProbes: 1)` with a non-zero interval, forces failover, makes the primary healthy, then calls `ProbeOnce()` twice in quick succession and asserts the second call did *not* probe (e.g. assert `primary.Polls` advanced by exactly one and `Mode` is still `Subtag`). Because the throttle reads `DateTime.UtcNow` directly, either accept a coarse same-wall-clock-instant assertion (two back-to-back calls reliably fall inside any interval ≥ 1s) or, preferably, refactor `ProbeOnce` to take an injectable clock so the throttle boundary can be pinned deterministically without wall-clock dependence (consistent with the Worker.Tests-020 manual-time-source approach).
**Resolution:** 2026-06-15 — Took the coarse same-wall-clock-instant approach (no production-code clock injection needed). Added `FailoverAlarmConsumerTests.ProbeOnce_WithNonZeroInterval_ThrottlesSecondProbeWithinInterval`: builds `FailoverSettings(threshold: 1, probeIntervalSeconds: 3600, stableProbes: 5)`, forces failover to Subtag, makes the primary healthy, then calls `ProbeOnce()` twice back-to-back. The first probe re-polls the primary (`primary.Polls == 1`); the second falls inside the 3600s interval and is throttled, so `primary.Polls` is unchanged and `Mode` stays `Subtag`. `stableProbes: 5` keeps a single clean probe from failing back, so the throttled `ProbeOnce` path stays in scope. A 1-hour interval makes the two back-to-back calls reliably fall inside the window without any timing flakiness.
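For comparison, the injectable-clock alternative mentioned in the recommendation could look roughly like the sketch below. `ProbeThrottleSketch` and its members are hypothetical names used only for illustration; this is not the production `FailoverAlarmConsumer.ProbeOnce` code, and it assumes the throttle skips a probe while less than the configured interval has elapsed since the last one.

```csharp
using System;

// Hypothetical stand-in for the probe throttle; not the production type.
public sealed class ProbeThrottleSketch
{
    private readonly TimeSpan _interval;
    private readonly Func<DateTime> _utcNow; // injected clock (assumption)
    private DateTime? _lastProbeAtUtc;

    public ProbeThrottleSketch(int probeIntervalSeconds, Func<DateTime> utcNow)
    {
        // An interval of 0 disables the throttle, as in the existing tests.
        _interval = TimeSpan.FromSeconds(Math.Max(0, probeIntervalSeconds));
        _utcNow = utcNow ?? throw new ArgumentNullException(nameof(utcNow));
    }

    // Returns true when a probe may run now and records the probe time;
    // returns false while the previous probe is still inside the interval.
    public bool TryBeginProbe()
    {
        DateTime now = _utcNow();
        if (_lastProbeAtUtc.HasValue && now - _lastProbeAtUtc.Value < _interval)
        {
            return false;
        }

        _lastProbeAtUtc = now;
        return true;
    }
}
```

With a clock delegate like this, a test can advance time explicitly and pin the exact boundary (one tick before the interval is throttled, the interval itself is not) without relying on a very long interval.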
### Worker.Tests-032
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/FailoverAlarmConsumerTests.cs` |
| Status | Resolved |
**Description:** Two resilience branches of `FailoverAlarmConsumer` are uncovered by the new tests. (1) `RunPrimary` catches `Exception ex when (ex is not OutOfMemoryException)` (`FailoverAlarmConsumer.cs:295`) — the OOM-safe catch path the task brief explicitly called out. No test throws `OutOfMemoryException` from the primary to verify it *propagates* (rather than being swallowed and counted toward the failover threshold like every other exception); the `FlakyPrimary` fake throws only `COMException`. A regression that broadened the filter to swallow OOM would convert a fatal allocation failure into a silent failover. (2) The `FailoverSettings` constructor clamps `threshold < 1 → 1` and `stableProbes < 1 → 1` (`FailoverSettings.cs:38-40`); no test passes a sub-1 value to confirm the clamp, so a misconfigured `ConsecutiveFailureThreshold = 0` from the gateway could change failover semantics undetected.
**Recommendation:** Add a `FlakyPrimary`-style fake (or a flag on the existing one) that throws `OutOfMemoryException` from `PollOnce`, and assert `sut.PollOnce()` rethrows it via `Assert.Throws<OutOfMemoryException>` and that no `ProviderModeChanged` fired. Add a small `FailoverSettings` fact (or `[Theory]`) asserting `new FailoverSettings(0, 0, 0).Threshold == 1` and `.StableProbes == 1` to pin the clamp.
**Resolution:** 2026-06-15 — Added a `ThrowOutOfMemoryOnPoll` flag to the existing `FlakyPrimary` fake (its `PollOnce` throws `OutOfMemoryException` when set, checked before the `COMException` branch). Regression test `FailoverAlarmConsumerTests.RunPrimary_WhenPrimaryThrowsOutOfMemory_PropagatesAndDoesNotFailOver` drives `PollOnce` through the primary, asserts `Assert.Throws<OutOfMemoryException>`, and asserts no `ProviderModeChanged` fired and `Mode` stays `Alarmmgr` — pinning that the `when (ex is not OutOfMemoryException)` filter lets OOM propagate rather than swallowing it and counting it toward the failover threshold. The clamp is pinned by `FailoverSettings_ClampsSubMinimumValues` (a `[Theory]`): `(0,0,0)→(1,0,1)`, `(-5,-5,-5)→(1,0,1)`, and a pass-through `(3,7,2)→(3,7,2)` to confirm in-range values are not altered.
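The two behaviours pinned by this finding can be illustrated in isolation. The sketch below is a minimal approximation under stated assumptions: `ClampedSettingsSketch` and `GuardedPollSketch` are hypothetical names, the clamp minimums `(1, 0, 1)` are taken from the theory cases quoted above, and the catch filter mirrors the `when (ex is not OutOfMemoryException)` form described for `RunPrimary`.

```csharp
using System;

// Hypothetical mirror of the clamp behaviour described for FailoverSettings.
public sealed class ClampedSettingsSketch
{
    public ClampedSettingsSketch(int threshold, int probeIntervalSeconds, int stableProbes)
    {
        Threshold = Math.Max(1, threshold);                       // at least one failure
        ProbeIntervalSeconds = Math.Max(0, probeIntervalSeconds); // 0 means no throttle
        StableProbes = Math.Max(1, stableProbes);                 // at least one clean probe
    }

    public int Threshold { get; }
    public int ProbeIntervalSeconds { get; }
    public int StableProbes { get; }
}

public static class GuardedPollSketch
{
    // Runs the poll; counts non-fatal failures, lets OutOfMemoryException propagate.
    public static bool TryPoll(Action poll, ref int consecutiveFailures)
    {
        try
        {
            poll();
            consecutiveFailures = 0;
            return true;
        }
        catch (Exception ex) when (ex is not OutOfMemoryException)
        {
            consecutiveFailures++;
            return false;
        }
    }
}
```

Because the filter is evaluated before the stack unwinds, an `OutOfMemoryException` never enters the handler at all, so it cannot be counted toward a failover threshold by accident.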
### Worker.Tests-033
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/SubtagAlarmStateMachineTests.cs` |
| Status | Resolved |
**Description:** `SubtagAlarmStateMachineTests` covers the core transition matrix and the acked latch well, but two branches of the new state machine are unexercised. (1) The ack-while-inactive path in `SubtagAlarmStateMachine.ApplyAcked` (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs:156-164`): when `.acked` flips true while the alarm is *not* active, the machine must emit nothing and must *not* set `AckedDuringEpisode`; otherwise a stale ack from a prior episode could mis-latch the next raise into a spurious `ACK_RTN`. No test drives an `.acked` change without a preceding active raise. (2) The priority-subtag path (`SubtagRole.Priority` → `state.Priority = CoerceInt(...)`, lines 76-78): `SubtagAlarmConsumerTests.Subscribe_AdvisesAllSubtagsIncludingAckComment` confirms the priority subtag is *advised*, but no test raises a priority value change and asserts it flows into the emitted/snapshot record's `Priority`, so `CoerceInt` and the priority assignment are untested in the state-machine layer.
**Recommendation:** Add (a) `AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch` — apply `.acked=true` with no prior active raise, assert `Apply` returns empty, then raise active and clear and assert the clear emits `UnackRtn` (proving the stale ack did not latch); and (b) `PriorityChange_FlowsIntoEmittedRecord` — apply a priority value then an active raise and assert the emitted record's `Priority` equals the supplied value (and a `CoerceInt` string/garbage case falls back).
**Resolution:** 2026-06-15 — Added both tests to `SubtagAlarmStateMachineTests`. `AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch` applies `.acked=true` with no preceding active raise (asserts `Apply` returns empty), then drives a fresh raise→clear episode and asserts the clear emits `UnackRtn` — proving the stale inactive ack did not latch `AckedDuringEpisode`. `PriorityChange_FlowsIntoEmittedRecord` (the target now includes a `PrioritySubtag`) applies an `int` priority `750` (asserts the priority change emits nothing), raises active and asserts the emitted record's `Priority == 750` (exercising `CoerceInt`'s `int` path and the priority assignment), then applies a non-numeric `"not-a-number"` priority and asserts the snapshot `Priority` is still `750` (the `CoerceInt` string fallback keeps the prior value, not zero).
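The ack-while-inactive guard is easier to reason about in a stripped-down form. The sketch below is an assumption-laden illustration, not the real `SubtagAlarmStateMachine`: the `AckLatchSketch` type and the `TransitionSketch` enum are hypothetical, and the latch is assumed to be cleared only when an episode ends.

```csharp
// Hypothetical transition names, loosely following the ones quoted in this review.
public enum TransitionSketch { None, UnackAlm, AckAlm, AckRtn, UnackRtn }

// Minimal model of the acked latch for a single alarm.
public sealed class AckLatchSketch
{
    private bool _active;
    private bool _ackedDuringEpisode;

    public TransitionSketch ApplyActive(bool active)
    {
        if (active == _active)
        {
            return TransitionSketch.None; // no edge, nothing to emit
        }

        _active = active;
        if (active)
        {
            return TransitionSketch.UnackAlm; // raise starts a new episode
        }

        // Clear: the latch decides which return-to-normal is emitted.
        TransitionSketch result = _ackedDuringEpisode
            ? TransitionSketch.AckRtn
            : TransitionSketch.UnackRtn;
        _ackedDuringEpisode = false;
        return result;
    }

    public TransitionSketch ApplyAcked(bool acked)
    {
        // Ack while inactive: emit nothing and, crucially, do not latch.
        if (!acked || !_active || _ackedDuringEpisode)
        {
            return TransitionSketch.None;
        }

        _ackedDuringEpisode = true;
        return TransitionSketch.AckAlm;
    }
}
```

In this model, removing the `!_active` check would let a stale ack set the latch before the next raise, and the following clear would then report `AckRtn` instead of `UnackRtn`, which is the mis-latch the new test guards against.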
+76 -2
@@ -4,11 +4,38 @@
|---|---|
| Module | `src/ZB.MOM.WW.MxGateway.Worker` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `42b0037` |
| Review date | 2026-06-15 |
| Commit reviewed | `410acc9` |
| Status | Re-reviewed |
| Open findings | 0 |
## 2026-06-15 re-review (commit `410acc9`)
Re-review of the `42b0037..410acc9` diff — the alarm-provider subtag-fallback
feature (`git diff 42b0037..410acc9 -- src/ZB.MOM.WW.MxGateway.Worker/`). New
substantive code: `SubtagAlarmConsumer`, `SubtagAlarmStateMachine`,
`FailoverAlarmConsumer`, `LmxSubtagAlarmSource`, `SyntheticAlarmGuid`,
`AlarmProviderModeChange`, `FailoverSettings`, `ISubtagAlarmSource` /
`SubtagValueChange`, plus the degraded/`source_provider` propagation in
`AlarmDispatcher` / `MxAccessAlarmEventSink` / `MxAccessEventMapper`, the
`ForcedMode`/watch-list routing and STA-COM-factory threading in
`AlarmCommandHandler` / `MxAccessStaSession`, and the `SubscribeAlarmsCommand`
re-plumb in `MxAccessCommandExecutor`. Three new findings: **Worker-026** (High),
**Worker-027** (Medium), **Worker-028** (Low). Worker-001..025 remain closed.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found. Subtag synthesis (`SubtagAlarmStateMachine` raise/ack/clear, `AckedDuringEpisode` latch, segment-boundary name derivation), exact-match ack resolution (`ResolveTargetByName` avoids the prefix false-positive), and `MapTransition`'s `Unspecified→*Alm` raise path are all sound. |
| 2 | mxaccessgw conventions | No issues found. The synthesis is worker-side and every degraded record/event carries `degraded=true` + `source_provider=SUBTAG`, satisfying the explicit opt-in non-parity exception to the "never synthesize events" rule. The gateway never instantiates COM. net48 constraint respected — `AlarmProviderModeChange`/`FailoverSettings` are plain classes with get-only ctor-assigned props (no init/positional records); no `WriteRecord`-style init usage introduced. |
| 3 | Concurrency & thread safety | Issue found: Worker-026 (an exception in the failover switch path — `SwitchToStandby`'s priming snapshot or either switch's `ProviderModeChanged` handler — escapes the state machine after `active` has already flipped, killing the STA alarm-poll loop with no mode-changed event). STA affinity itself is sound: `LmxSubtagAlarmSource` owns its own apartment-bound `LMXProxyServerClass`, all consumer calls are STA-confined via `AlarmCommandHandler`'s affinity guard, and `Dispose` UnAdvises before tearing handles down so a late pump callback cannot re-enter. |
| 4 | Error handling & resilience | Issue found: Worker-027 (`SyntheticAlarmGuid` uses `MD5.Create()`, which throws on a net48 FIPS-policy host — breaking every subtag transition stamp and snapshot, and feeding Worker-026's poll-loop-kill path). `FailoverSettings` clamps tunables to safe minimums; `LmxSubtagAlarmSource` teardown is best-effort/idempotent. |
| 5 | Security | No issues found. No secret/credential logging on the alarm path; ack comments are operator-supplied alarm metadata, not secrets. Synthetic GUID is non-cryptographic by design and not a security control. |
| 6 | Performance & resource management | No issues found. `LmxSubtagAlarmSource` releases its COM object via `FinalReleaseComObject` and tracks advised-vs-added handles so `Dispose` only UnAdvises what it advised. The standby is armed once and gated-by-active rather than churning subscribe/unsubscribe per switch. |
| 7 | Design-document adherence | No issues found. Implementation matches `docs/plans/2026-06-13-alarm-subtag-fallback-design.md` (auto-failover/failback, ack-comment-write ack, worker-side synthesis, additive proto fields). The probe re-polls the still-subscribed primary (single-subscribe constraint) as the design's "Superseded" notes describe. |
| 8 | Code organization & conventions | Issue found: Worker-028 (the dup-subtag-address guard in `SubtagAlarmStateMachine.Bind` does not cover duplicate `AlarmFullReference` entries, which silently overwrite in `targetsByReference`/`_statesByReference`). One-public-type-per-file is otherwise respected for the new files. |
| 9 | Testing coverage | No standalone finding. New unit suites exist for each major component (`SubtagAlarmConsumerTests`, `SubtagAlarmStateMachineTests`, `FailoverAlarmConsumerTests`, `LmxSubtagAlarmSourceTests`, `SyntheticAlarmGuidTests`), matching the design's test matrix. The switch-path exception fragility (Worker-026) and the dup-reference case (Worker-028) are untested edge cases noted in those findings. |
| 10 | Documentation & comments | No issues found. The new types carry accurate XML docs; the net48-constraint rationale is documented inline on `FailoverSettings`/`AlarmProviderModeChange`; the "why PollOnce only, no re-Subscribe" and probe-throttle behaviour are documented on `FailoverAlarmConsumer.ProbeOnce`. |
## 2026-05-24 re-review (commit `42b0037`)
**Re-review: no new findings.** `git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker` returns empty — the Worker module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the `d692232` pass. The header is bumped to track the latest reviewed commit; Worker-001..025 remain closed.
@@ -464,3 +491,50 @@ _runtimeSession = _runtimeSessionFactory()
Match the pattern `AlarmCommandHandler.Subscribe` already uses for `consumerFactory()` (`AlarmCommandHandler.cs:76-77`).
**Resolution:** 2026-05-20 — `WorkerPipeSession.RunAsync` now uses `_runtimeSession = _runtimeSessionFactory() ?? throw new InvalidOperationException("Worker runtime session factory returned null.");`, matching the pattern `AlarmCommandHandler.Subscribe` uses for its `consumerFactory()`. A null factory return now produces a clear diagnostic exception at the call site instead of NRE-ing on the next dereference (and the `finally` block's `_runtimeSession?.Dispose()` silently no-oping on a half-initialized session). Regression test `WorkerPipeSessionTests.RunAsync_WhenRuntimeSessionFactoryReturnsNull_ThrowsDiagnosticException` drives `RunAsync` with `() => null!` and asserts the diagnostic `InvalidOperationException` is thrown with the expected message.
### Worker-026
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/FailoverAlarmConsumer.cs:289-338`, `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/MxAccessStaSession.cs:307-320` |
| Status | Resolved |
**Description:** `FailoverAlarmConsumer.SwitchToStandby` flips `active = Active.Standby` / `mode = Subtag` first, then calls `_ = standby.SnapshotActiveAlarms();` (the priming side-effect), and only then calls `RaiseModeChanged(...)`. If `standby.SnapshotActiveAlarms()` throws, the exception escapes `SwitchToStandby`, escapes the `catch` in `RunPrimary`, and escapes `FailoverAlarmConsumer.PollOnce`/`Subscribe`. The `SubtagAlarmConsumer.SnapshotActiveAlarms` path is not exception-free: it calls `StampSynthetic` → `SyntheticAlarmGuid.ForReference` (which throws on a FIPS host; see Worker-027) and walks live state. The same exposure exists for `RaiseModeChanged` itself: the attached `AlarmCommandHandler.OnProviderModeChanged` handler runs synchronously and calls `eventQueue.Enqueue(...)`, which throws `MxAccessEventQueueOverflowException` at capacity; that also propagates out of both `SwitchToStandby` and `SwitchToPrimary`.
When this happens the consumer has **already** transitioned `active`/`mode` to Standby (or Primary) but the `ProviderModeChanged` event is never emitted — so the gateway never learns the feed went degraded. Worse, because the failover calls run on the worker's STA inside `RunAlarmPollLoopAsync`, the escaping exception lands in that loop's trailing `catch (Exception)` arm (`MxAccessStaSession.cs:307-320`), which records a single fault and **permanently stops the alarm poll loop**. The standby is then never pumped or probed again — i.e. a transient primary COM fault that should have produced a clean degraded-mode handoff instead produces a total, undetected alarm outage for the session, defeating the entire purpose of the fallback feature. There is no safe operator workaround short of restarting the session.
**Recommendation:** Make the switch atomic and exception-isolated: raise `ProviderModeChanged` (and perform the priming snapshot) inside their own `try`/`catch` so a snapshot or handler failure cannot abort the switch or unwind into the poll loop. Order the state flip so the mode-changed notification is guaranteed to fire even if priming fails (e.g. flip state, raise mode-changed in a guarded block, then attempt the priming snapshot in a separate guarded block whose failure is logged/faulted but non-fatal). Add a regression test where the standby's `SnapshotActiveAlarms` throws on the first call after failover, asserting (a) `ProviderModeChanged` still fires and (b) `PollOnce` does not rethrow.
**Resolution:** 2026-06-15: Reordered and exception-isolated the failover switch in `FailoverAlarmConsumer`. `SwitchToStandby` now flips `active`/`mode`, then raises `ProviderModeChanged` FIRST (so the gateway always learns the feed went degraded), then primes the standby snapshot via a new `TryPrimeStandbySnapshot()` whose failure is swallowed (`catch (Exception ex) when (ex is not OutOfMemoryException)`), so a priming failure can no longer abort the switch or unwind into the poll loop. `RaiseModeChanged` itself now wraps `ProviderModeChanged?.Invoke` in a `try`/`catch` with the same `when (ex is not OutOfMemoryException)` filter, so a subscriber handler exception (e.g. `AlarmCommandHandler.OnProviderModeChanged`'s `eventQueue.Enqueue` overflowing) cannot escape `SwitchToStandby`/`SwitchToPrimary` into `RunAlarmPollLoopAsync`'s trailing catch and permanently stop alarm polling. `OutOfMemoryException` is deliberately allowed to propagate. The `MxAccessStaSession` poll-loop arm is unchanged; the fix prevents the escape rather than catching it there. Regression tests in `FailoverAlarmConsumerTests`: `Failover_WhenStandbyPrimingSnapshotThrows_StillRaisesModeChangeAndDoesNotRethrow` (standby `SnapshotActiveAlarms` throws on the priming call → `ProviderModeChanged` still fires, `Mode` is Subtag, `Subscribe`/`PollOnce` do not rethrow) and `Failover_WhenModeChangedHandlerThrows_SwitchStillTakesEffectAndDoesNotRethrow` (a throwing `ProviderModeChanged` subscriber → switch still takes effect, no rethrow).
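The ordering described in this resolution (flip state, notify, then prime on a best-effort basis) can be sketched as follows. `FailoverSwitchSketch`, its string-valued `Mode`, and the `Action`-based priming hook are hypothetical simplifications chosen for brevity; the production consumer uses its own types and event arguments.

```csharp
#nullable enable
using System;

// Hypothetical, simplified model of the exception-isolated switch.
public sealed class FailoverSwitchSketch
{
    private readonly Action _primeStandbySnapshot;

    public FailoverSwitchSketch(Action primeStandbySnapshot)
    {
        _primeStandbySnapshot = primeStandbySnapshot
            ?? throw new ArgumentNullException(nameof(primeStandbySnapshot));
    }

    public event EventHandler<string>? ModeChanged;

    public string Mode { get; private set; } = "Alarmmgr";

    public void SwitchToStandby()
    {
        Mode = "Subtag";          // 1. flip state
        RaiseModeChanged();       // 2. notify first, guarded
        TryPrimeStandbySnapshot(); // 3. best-effort priming, guarded
    }

    private void RaiseModeChanged()
    {
        try
        {
            ModeChanged?.Invoke(this, Mode);
        }
        catch (Exception ex) when (ex is not OutOfMemoryException)
        {
            // A failing subscriber must not abort the switch.
        }
    }

    private void TryPrimeStandbySnapshot()
    {
        try
        {
            _primeStandbySnapshot();
        }
        catch (Exception ex) when (ex is not OutOfMemoryException)
        {
            // A priming failure is non-fatal; the switch has already taken effect.
        }
    }
}
```

A real implementation would presumably log or record the contained failures rather than discard them silently; the empty handlers here only keep the sketch short.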
### Worker-027
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SyntheticAlarmGuid.cs:38-40` |
| Status | Resolved |
**Description:** `SyntheticAlarmGuid.ForReference` derives the deterministic alarm GUID via `using MD5 md5 = MD5.Create();`. The worker targets .NET Framework 4.8, where `MD5.Create()` returns `MD5CryptoServiceProvider`. When the host has the Windows FIPS-compliance policy enabled (`Enabled=1` under `HKLM\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy`), the non-validated `MD5CryptoServiceProvider` constructor throws `InvalidOperationException` ("This implementation is not part of the Windows Platform FIPS validated cryptographic algorithms."). `SyntheticAlarmGuid.ForReference` is on the hot path of the subtag fallback: `SubtagAlarmConsumer.StampSynthetic` calls it for **every** synthesized transition and **every** snapshot record. On a FIPS host the subtag fallback therefore throws on first use; combined with Worker-026 that exception kills the STA alarm-poll loop, so the fallback is not merely degraded but completely non-functional exactly when it is needed (after the primary alarmmgr provider has failed). The comment already notes MD5 is "never for security" — the issue is availability under FIPS policy, not cryptographic strength. The regulated deployment hosts (Zimmer) are a plausible FIPS environment.
**Recommendation:** Replace `MD5.Create()` with a FIPS-agnostic non-cryptographic 128-bit hash that does not route through the crypto FIPS gate — e.g. compute the 16 GUID bytes from a stable hash that does not use `System.Security.Cryptography` (a fixed FNV-1a / xxHash-style derivation over the UTF-8 bytes), or use `SHA256` truncated to 16 bytes via the managed `SHA256Managed`/`IncrementalHash` only if confirmed FIPS-safe on net48 (it is not guaranteed — prefer the non-crypto route). The mapping only needs determinism and collision resistance for distinct references, not cryptographic properties. Add a test that exercises `ForReference` without depending on a crypto provider.
**Resolution:** 2026-06-15 — Replaced the `MD5.Create()` derivation in `SyntheticAlarmGuid.ForReference` with a pure-managed FNV-1a hash: two independent 64-bit FNV-1a passes over the UTF-8 bytes (the high pass mixes the byte index into its accumulator to decorrelate the halves) fill the low/high 64 bits of the 128-bit GUID, and the input length is folded in so the empty string is non-degenerate (never `Guid.Empty`). The `using System.Security.Cryptography;` import is gone, so no FIPS-gated `MD5CryptoServiceProvider` is ever constructed — the subtag fallback no longer throws on a FIPS-policy host. The derivation stays deterministic and distinct-per-reference. The existing `SyntheticAlarmGuidTests` (`SameReference_SameGuid`, `DifferentReference_DifferentGuid`, `Reference_ProducesNonEmptyGuid`) pin only those properties — not a specific GUID literal — so they continue to pass unchanged; no test needed a value update. Added regression tests `SyntheticAlarmGuidTests.EmptyReference_ProducesNonEmptyGuid` (length-fold guard against a degenerate all-zero result) and `ForReference_UnderFipsEnforcement_DoesNotThrowAndStaysDeterministic` (sets the managed `UseLegacyFipsThrow` AppContext switch and asserts the derivation still succeeds deterministically; a regression reintroducing a FIPS-gated provider would throw here).
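A derivation in the spirit of this resolution might look like the sketch below. It is not the actual `SyntheticAlarmGuid.ForReference` implementation: the class name, the exact way the byte index is mixed into the high pass, and the way the length is folded in are assumptions made for illustration. Only the 64-bit FNV-1a offset basis and prime are standard constants.

```csharp
using System;
using System.Text;

// Hypothetical FNV-1a based GUID derivation; avoids System.Security.Cryptography.
public static class Fnv1aGuidSketch
{
    private const ulong OffsetBasis = 14695981039346656037UL; // FNV-1a 64-bit offset basis
    private const ulong Prime = 1099511628211UL;              // FNV-1a 64-bit prime

    public static Guid ForReference(string reference)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(reference ?? string.Empty);
        ulong low = OffsetBasis;
        ulong high = OffsetBasis;

        unchecked
        {
            for (int i = 0; i < bytes.Length; i++)
            {
                // Plain FNV-1a step for the low half.
                low = (low ^ bytes[i]) * Prime;
                // The high half also mixes in the byte index (illustrative choice).
                high = (high ^ bytes[i] ^ ((ulong)i << 8)) * Prime;
            }

            // Fold the input length into both halves (illustrative choice).
            low = (low ^ (ulong)bytes.Length) * Prime;
            high = (high ^ ~(ulong)bytes.Length) * Prime;
        }

        byte[] guidBytes = new byte[16];
        BitConverter.GetBytes(low).CopyTo(guidBytes, 0);
        BitConverter.GetBytes(high).CopyTo(guidBytes, 8);
        return new Guid(guidBytes);
    }
}
```

Since the FNV prime is odd, each multiply step is a bijection modulo 2^64, so the low half can never collapse to zero for the empty string. That keeps the result away from `Guid.Empty`, the same property the `EmptyReference_ProducesNonEmptyGuid` test pins for the real code.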
### Worker-028
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmStateMachine.cs:43-52`, `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/SubtagAlarmConsumer.cs:70-75` |
| Status | Resolved |
**Description:** `SubtagAlarmStateMachine.Bind` throws `ArgumentException` on a duplicate subtag **item address** (the documented dup-address guard), but neither the state machine nor `SubtagAlarmConsumer` guards against a duplicate `AlarmFullReference` in the watch list. When two `AlarmSubtagTarget` entries share an `AlarmFullReference` but use different subtag addresses, `_statesByReference[target.AlarmFullReference] = state` and `targetsByReference[reference] = target` each silently overwrite the earlier entry, while the earlier target's subtag addresses are still bound to an orphaned `AlarmState`. The orphaned state is mutated by incoming value changes but is invisible to `SnapshotActive` (which iterates only the surviving `_statesByReference.Values`) and to ack resolution (which uses the surviving `targetsByReference`). The result is silently inconsistent synthesized state for that reference. This is a watch-list configuration error (the gateway resolves the watch list), so impact is limited, but the asymmetry — addresses are guarded, references are not — is surprising and silent.
**Recommendation:** Add a duplicate-`AlarmFullReference` guard symmetric with the dup-address guard: throw a descriptive `ArgumentException` from the `SubtagAlarmStateMachine` (or `SubtagAlarmConsumer`) constructor when two watch-list entries share a reference, so a misconfigured watch list fails fast at subscribe time rather than producing silently inconsistent state. Cover it with a unit test.
**Resolution:** 2026-06-15 — Added a duplicate-`AlarmFullReference` guard in the `SubtagAlarmStateMachine` constructor symmetric with the existing dup-address guard in `Bind`: before adding each target's `_statesByReference` entry it checks `ContainsKey` (the dictionary is `OrdinalIgnoreCase`, matching the consumer's `targetsByReference` lookup) and throws a descriptive `ArgumentException` ("Duplicate alarm full reference '{reference}' is bound to more than one alarm target."). Because `SubtagAlarmConsumer` constructs the state machine before populating its own `targetsByReference`, this guard fires before the consumer's silent overwrite too, covering both dictionaries from one canonical check. Regression test `SubtagAlarmStateMachineTests.DuplicateAlarmFullReference_Throws` (two targets sharing a reference but using distinct active subtags → `ArgumentException`).
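The fail-fast guard amounts to a `ContainsKey` check against a case-insensitive dictionary before each add. The sketch below shows the idea under stated assumptions: `ReferenceGuardSketch.IndexByReference` is a hypothetical helper, and the generic value type stands in for the real per-alarm state.

```csharp
using System;
using System.Collections.Generic;

public static class ReferenceGuardSketch
{
    // Builds a reference-keyed index, rejecting duplicate references up front.
    public static Dictionary<string, T> IndexByReference<T>(
        IEnumerable<KeyValuePair<string, T>> targets)
    {
        if (targets == null)
        {
            throw new ArgumentNullException(nameof(targets));
        }

        // OrdinalIgnoreCase so "Tank.HiHi" and "tank.hihi" count as the same reference.
        var byReference = new Dictionary<string, T>(StringComparer.OrdinalIgnoreCase);
        foreach (KeyValuePair<string, T> entry in targets)
        {
            if (byReference.ContainsKey(entry.Key))
            {
                throw new ArgumentException(
                    $"Duplicate alarm full reference '{entry.Key}' is bound to more than one alarm target.");
            }

            byReference.Add(entry.Key, entry.Value);
        }

        return byReference;
    }
}
```

Using the indexer (`byReference[key] = value`) instead of the explicit check is what produces the silent overwrite described in the finding; the guard simply turns that overwrite into an error at construction time.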
@@ -27,6 +27,15 @@ public sealed class FailoverAlarmConsumerTests
        public event EventHandler<MxAlarmTransitionEvent>? AlarmTransitionEmitted;
        public bool ThrowOnPoll = true;
        /// <summary>
        /// When set, <see cref="PollOnce"/> throws
        /// <see cref="OutOfMemoryException"/> instead of a
        /// <see cref="System.Runtime.InteropServices.COMException"/>, to
        /// exercise the OOM-safe exception filter (Worker.Tests-032).
        /// </summary>
        public bool ThrowOutOfMemoryOnPoll;
        public int Polls;
        /// <summary>
@@ -48,6 +57,11 @@ public sealed class FailoverAlarmConsumerTests
        public void PollOnce()
        {
            Polls++;
            if (ThrowOutOfMemoryOnPoll)
            {
                throw new OutOfMemoryException("simulated allocation failure");
            }
            if (ThrowOnPoll)
            {
                throw new System.Runtime.InteropServices.COMException("boom", unchecked((int)0x80004005));
@@ -75,6 +89,15 @@ public sealed class FailoverAlarmConsumerTests
        public bool Subscribed;
        /// <summary>
        /// When set, <see cref="SnapshotActiveAlarms"/> throws, modeling a
        /// priming-snapshot failure during failover (Worker-026).
        /// </summary>
        public bool ThrowOnSnapshot;
        /// <summary>Number of <see cref="SnapshotActiveAlarms"/> calls.</summary>
        public int SnapshotCalls;
        public void Subscribe(string s) => Subscribed = true;
        public void PollOnce() { }
@@ -83,7 +106,16 @@ public sealed class FailoverAlarmConsumerTests
        public int AcknowledgeByName(string n, string p, string gr, string c, string a, string b, string d, string e) => 22;
        public IReadOnlyList<MxAlarmSnapshotRecord> SnapshotActiveAlarms() => Array.Empty<MxAlarmSnapshotRecord>();
        public IReadOnlyList<MxAlarmSnapshotRecord> SnapshotActiveAlarms()
        {
            SnapshotCalls++;
            if (ThrowOnSnapshot)
            {
                throw new InvalidOperationException("priming snapshot failed");
            }
            return Array.Empty<MxAlarmSnapshotRecord>();
        }
        public void Dispose() { }
@@ -291,4 +323,151 @@ public sealed class FailoverAlarmConsumerTests
sut.ProbeOnce(); // clean 3 → failback
Assert.Equal(AlarmProviderMode.Alarmmgr, sut.Mode);
}
/// <summary>
/// Worker-026 regression: when the standby's priming
/// <c>SnapshotActiveAlarms</c> throws during failover, the switch must
/// still (a) fire <c>ProviderModeChanged</c> so the gateway learns the
/// feed went degraded, (b) leave <see cref="FailoverAlarmConsumer.Mode"/>
/// in Subtag, and (c) not rethrow out of <c>PollOnce</c> (which on the
/// real STA would land in the poll loop's trailing catch and permanently
/// stop alarm delivery).
/// </summary>
[Fact]
public void Failover_WhenStandbyPrimingSnapshotThrows_StillRaisesModeChangeAndDoesNotRethrow()
{
FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = true };
StubStandby standby = new StubStandby { ThrowOnSnapshot = true };
FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 0, stableProbes: 1);
using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings);
List<AlarmProviderModeChange> changes = new List<AlarmProviderModeChange>();
sut.ProviderModeChanged += (_, e) => changes.Add(e);
// threshold=1 → the Subscribe failure triggers the switch, which primes
// the standby snapshot (throwing). The exception must be contained.
Exception? escaped = Record.Exception(() => sut.Subscribe(@"\\HOST\Galaxy!Area"));
Assert.Null(escaped);
Assert.Single(changes);
Assert.Equal(AlarmProviderMode.Subtag, changes[0].Mode);
Assert.Equal(AlarmProviderMode.Subtag, sut.Mode);
Assert.True(standby.SnapshotCalls >= 1); // priming was attempted
// A subsequent degraded PollOnce (standby.PollOnce + ProbeOnce) must also
// not rethrow the snapshot failure.
Exception? pollEscaped = Record.Exception(() => sut.PollOnce());
Assert.Null(pollEscaped);
}
/// <summary>
/// Worker-026 regression: when a <c>ProviderModeChanged</c> subscriber's
/// handler throws (modeling the AlarmCommandHandler's event-queue enqueue
/// overflowing at capacity), the switch must still take effect and the
/// exception must not escape the switch path into the poll loop.
/// </summary>
[Fact]
public void Failover_WhenModeChangedHandlerThrows_SwitchStillTakesEffectAndDoesNotRethrow()
{
FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = true };
StubStandby standby = new StubStandby();
FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 0, stableProbes: 1);
using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings);
int handlerInvocations = 0;
sut.ProviderModeChanged += (_, _) =>
{
handlerInvocations++;
throw new InvalidOperationException("subscriber handler blew up");
};
Exception? escaped = Record.Exception(() => sut.Subscribe(@"\\HOST\Galaxy!Area"));
Assert.Null(escaped);
Assert.Equal(1, handlerInvocations); // the event still fired
Assert.Equal(AlarmProviderMode.Subtag, sut.Mode); // the switch still took effect
}
/// <summary>
/// Worker.Tests-031 regression: with a non-zero
/// <see cref="FailoverSettings.ProbeIntervalSeconds"/>, two back-to-back
/// <c>ProbeOnce</c> calls must throttle — the second falls inside the
/// interval and must NOT re-poll the primary. Two consecutive calls
/// reliably fall inside any interval of one second or more, so this needs
/// no injected clock.
/// </summary>
[Fact]
public void ProbeOnce_WithNonZeroInterval_ThrottlesSecondProbeWithinInterval()
{
FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = true };
StubStandby standby = new StubStandby();
// stableProbes high enough that a single clean probe cannot fail back,
// so Mode stays Subtag and ProbeOnce remains the throttled path.
FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 3600, stableProbes: 5);
using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings);
sut.Subscribe(@"\\HOST\Galaxy!Area"); // threshold=1 → switch to Subtag
Assert.Equal(AlarmProviderMode.Subtag, sut.Mode);
primary.ThrowOnPoll = false; // primary healthy so a probe would poll cleanly
sut.ProbeOnce(); // first probe runs: re-polls the primary
int pollsAfterFirstProbe = primary.Polls;
Assert.Equal(1, pollsAfterFirstProbe);
sut.ProbeOnce(); // within the 3600s interval → throttled, must NOT re-poll
Assert.Equal(pollsAfterFirstProbe, primary.Polls);
Assert.Equal(AlarmProviderMode.Subtag, sut.Mode);
}
/// <summary>
/// Worker.Tests-032 regression: <c>RunPrimary</c>'s
/// <c>when (ex is not OutOfMemoryException)</c> filter must let an
/// <see cref="OutOfMemoryException"/> propagate rather than swallowing it
/// and counting it toward the failover threshold. No mode change may
/// fire, because a fatal allocation failure is not a clean degraded handoff.
/// </summary>
[Fact]
public void RunPrimary_WhenPrimaryThrowsOutOfMemory_PropagatesAndDoesNotFailOver()
{
FlakyPrimary primary = new FlakyPrimary { ThrowOnPoll = false, ThrowOutOfMemoryOnPoll = true };
StubStandby standby = new StubStandby();
FailoverSettings settings = new FailoverSettings(threshold: 1, probeIntervalSeconds: 0, stableProbes: 1);
using FailoverAlarmConsumer sut = new FailoverAlarmConsumer(primary, standby, settings);
bool modeChanged = false;
sut.ProviderModeChanged += (_, _) => modeChanged = true;
sut.Subscribe(@"\\HOST\Galaxy!Area"); // Subscribe path does not poll; no throw here
Assert.Throws<OutOfMemoryException>(() => sut.PollOnce());
Assert.False(modeChanged);
Assert.Equal(AlarmProviderMode.Alarmmgr, sut.Mode);
}
/// <summary>
/// Worker.Tests-032 regression: <see cref="FailoverSettings"/> clamps
/// sub-1 <c>threshold</c> and <c>stableProbes</c> (and sub-0
/// <c>probeIntervalSeconds</c>) to their safe minimums so a misconfigured
/// bind cannot change failover semantics.
/// </summary>
[Theory]
[InlineData(0, 0, 0, 1, 0, 1)]
[InlineData(-5, -5, -5, 1, 0, 1)]
[InlineData(3, 7, 2, 3, 7, 2)]
public void FailoverSettings_ClampsSubMinimumValues(
int threshold,
int probeInterval,
int stableProbes,
int expectedThreshold,
int expectedProbeInterval,
int expectedStableProbes)
{
FailoverSettings settings = new FailoverSettings(threshold, probeInterval, stableProbes);
Assert.Equal(expectedThreshold, settings.Threshold);
Assert.Equal(expectedProbeInterval, settings.ProbeIntervalSeconds);
Assert.Equal(expectedStableProbes, settings.StableProbes);
}
}
@@ -109,6 +109,31 @@ public sealed class SubtagAlarmStateMachineTests
Assert.Throws<ArgumentException>(() => new SubtagAlarmStateMachine(new[] { first, second }));
}
/// <summary>
/// Worker-028 regression: two watch-list entries sharing an
/// <see cref="AlarmSubtagTarget.AlarmFullReference"/> (but using distinct
/// subtag addresses) must throw at construction, symmetric with the
/// duplicate-address guard, rather than silently overwriting the earlier
/// reference's state and orphaning its bound addresses.
/// </summary>
[Fact]
public void DuplicateAlarmFullReference_Throws()
{
var first = new AlarmSubtagTarget
{
AlarmFullReference = "Galaxy!Area.Tank01.Level.HiHi",
SourceObjectReference = "Tank01",
ActiveSubtag = "Tank01.Level.HiHi.active",
};
var second = new AlarmSubtagTarget
{
AlarmFullReference = "Galaxy!Area.Tank01.Level.HiHi",
SourceObjectReference = "Tank01",
ActiveSubtag = "Other.active",
};
Assert.Throws<ArgumentException>(() => new SubtagAlarmStateMachine(new[] { first, second }));
}
[Fact]
public void AckedTrueWhileActive_EmitsAck()
{
@@ -162,4 +187,64 @@ public sealed class SubtagAlarmStateMachineTests
var events = sm.Apply("Some.Other.Tag.active", true, DateTime.UtcNow);
Assert.Empty(events);
}
/// <summary>
/// Worker.Tests-033 regression: an ack arriving while the alarm is NOT
/// active must emit nothing and must NOT latch
/// <c>AckedDuringEpisode</c> — otherwise a stale ack from a prior episode
/// would mis-latch the next raise into a spurious ACK_RTN on clear. The
/// subsequent raise/clear must therefore still emit UNACK_RTN.
/// </summary>
[Fact]
public void AckedTrueWhileInactive_EmitsNothingAndDoesNotLatch()
{
var sm = new SubtagAlarmStateMachine(new[] { Target() });
var ts = new DateTime(2026, 6, 13, 9, 0, 0, DateTimeKind.Utc);
// Ack with no preceding active raise: must be a no-op.
var ackEvents = sm.Apply("Tank01.Level.HiHi.acked", true, ts);
Assert.Empty(ackEvents);
// A fresh episode: raise then clear. Because the earlier ack must not
// have latched AckedDuringEpisode, the clear must be UNACK_RTN.
sm.Apply("Tank01.Level.HiHi.active", true, ts.AddSeconds(5));
var clearEvents = sm.Apply("Tank01.Level.HiHi.active", false, ts.AddSeconds(10));
var clear = Assert.Single(clearEvents);
Assert.Equal(MxAlarmStateKind.UnackRtn, clear.Record.State);
}
/// <summary>
/// Worker.Tests-033 regression: a priority-subtag value change must flow
/// through <c>CoerceInt</c> into the emitted record's
/// <see cref="MxAlarmSnapshotRecord.Priority"/>. A non-numeric value must
/// leave the prior priority unchanged (the CoerceInt fallback path).
/// </summary>
[Fact]
public void PriorityChange_FlowsIntoEmittedRecord()
{
var target = new AlarmSubtagTarget
{
AlarmFullReference = "Galaxy!Area.Tank01.Level.HiHi",
SourceObjectReference = "Tank01",
ActiveSubtag = "Tank01.Level.HiHi.active",
AckedSubtag = "Tank01.Level.HiHi.acked",
PrioritySubtag = "Tank01.Level.HiHi.priority",
};
var sm = new SubtagAlarmStateMachine(new[] { target });
var ts = new DateTime(2026, 6, 13, 9, 0, 0, DateTimeKind.Utc);
// A priority change alone emits nothing but records the priority.
var priorityEvents = sm.Apply("Tank01.Level.HiHi.priority", 750, ts);
Assert.Empty(priorityEvents);
// Raise: the emitted record carries the recorded priority.
var raiseEvents = sm.Apply("Tank01.Level.HiHi.active", true, ts.AddSeconds(1));
var raise = Assert.Single(raiseEvents);
Assert.Equal(750, raise.Record.Priority);
// A non-numeric priority must fall back to the existing value, not zero.
sm.Apply("Tank01.Level.HiHi.priority", "not-a-number", ts.AddSeconds(2));
var snap = Assert.Single(sm.SnapshotActive());
Assert.Equal(750, snap.Priority);
}
}
@@ -24,4 +24,43 @@ public sealed class SyntheticAlarmGuidTests
[Fact]
public void Reference_ProducesNonEmptyGuid() =>
Assert.NotEqual(Guid.Empty, SyntheticAlarmGuid.ForReference("A.B.C"));
/// <summary>
/// Verifies the empty string still derives a non-empty GUID. The length
/// fold in the derivation prevents a degenerate all-zero (Guid.Empty)
/// result, which would collide with the unset-record default downstream.
/// </summary>
[Fact]
public void EmptyReference_ProducesNonEmptyGuid() =>
Assert.NotEqual(Guid.Empty, SyntheticAlarmGuid.ForReference(string.Empty));
/// <summary>
/// Worker-027 regression: <see cref="SyntheticAlarmGuid.ForReference"/>
/// must derive its GUID without routing through
/// <see cref="System.Security.Cryptography"/>, because on net48
/// <c>MD5.Create()</c> throws under the Windows FIPS-compliance policy.
/// This test enables the per-AppContext FIPS-enforcement switch (which the
/// managed crypto factories honour) and asserts the derivation still
/// succeeds deterministically — a regression that reintroduced a FIPS-gated
/// provider would throw here instead of returning a stable GUID.
/// </summary>
[Fact]
public void ForReference_UnderFipsEnforcement_DoesNotThrowAndStaysDeterministic()
{
const string switchName = "Switch.System.Security.Cryptography.UseLegacyFipsThrow";
bool original = AppContext.TryGetSwitch(switchName, out bool value) && value;
AppContext.SetSwitch(switchName, true);
try
{
Guid first = SyntheticAlarmGuid.ForReference("Galaxy!Area.Tank01.Level.HiHi");
Guid second = SyntheticAlarmGuid.ForReference("Galaxy!Area.Tank01.Level.HiHi");
Assert.NotEqual(Guid.Empty, first);
Assert.Equal(first, second);
}
finally
{
AppContext.SetSwitch(switchName, original);
}
}
}
@@ -313,12 +313,20 @@ public sealed class FailoverAlarmConsumer : IMxAccessAlarmConsumer
consecutiveFailures = 0;
cleanProbes = 0;
// Emit the mode-changed notification FIRST and in a guarded block, so
// the gateway always learns the feed went degraded even if the priming
// snapshot below throws. A handler exception here must never escape the
// switch — escaping would (a) leave `active` flipped with no
// notification and (b) unwind into RunAlarmPollLoopAsync's trailing
// catch, which permanently stops alarm polling (Worker-026).
RaiseModeChanged(AlarmProviderMode.Subtag, reason, hresult);
// Warm the standby snapshot for the gateway hand-off. The gateway
// reconciles state from this snapshot, so the return value is not
// consumed here — the call exists for its priming side effect.
_ = standby.SnapshotActiveAlarms();
RaiseModeChanged(AlarmProviderMode.Subtag, reason, hresult);
// consumed here — the call exists for its priming side effect. A
// failure to prime is non-fatal: the switch has already completed and
// been announced, and the standby's live transitions will still flow.
TryPrimeStandbySnapshot();
}
private void SwitchToPrimary(string reason, int hresult)
@@ -327,14 +335,49 @@ public sealed class FailoverAlarmConsumer : IMxAccessAlarmConsumer
mode = AlarmProviderMode.Alarmmgr;
consecutiveFailures = 0;
cleanProbes = 0;
// Guarded so a ProviderModeChanged handler exception cannot escape into
// the STA poll loop and kill alarm delivery (Worker-026).
RaiseModeChanged(AlarmProviderMode.Alarmmgr, reason, hresult);
}
/// <summary>
/// Primes the standby snapshot for the gateway hand-off, swallowing any
/// failure. The switch has already completed and the mode change has
/// already been announced before this runs, so a priming failure must
/// not abort the switch or unwind into the poll loop.
/// </summary>
private void TryPrimeStandbySnapshot()
{
try
{
_ = standby.SnapshotActiveAlarms();
}
catch (Exception ex) when (ex is not OutOfMemoryException)
{
// Non-fatal: the standby is active and its live transitions still
// flow; the gateway will reconcile from subsequent records. Do not
// let a transient snapshot failure escape and stop the poll loop.
}
}
private void RaiseModeChanged(AlarmProviderMode newMode, string reason, int hresult)
{
ProviderModeChanged?.Invoke(
this,
new AlarmProviderModeChange(newMode, reason, hresult, DateTime.UtcNow));
try
{
ProviderModeChanged?.Invoke(
this,
new AlarmProviderModeChange(newMode, reason, hresult, DateTime.UtcNow));
}
catch (Exception ex) when (ex is not OutOfMemoryException)
{
// A subscriber's OnProviderModeChanged handler threw (e.g. the
// AlarmCommandHandler's eventQueue.Enqueue hitting capacity). The
// switch itself has already taken effect; swallow so the failure
// cannot unwind into RunAlarmPollLoopAsync and permanently stop
// alarm polling (Worker-026). The event-queue overflow it most
// likely signals is already surfaced as a fault on the IPC path.
}
}
private void OnChildTransition(object? sender, MxAlarmTransitionEvent e)
@@ -42,8 +42,22 @@ public sealed class SubtagAlarmStateMachine
foreach (AlarmSubtagTarget target in targets)
{
// Guard duplicate references symmetrically with the dup-address guard
// in Bind: two watch-list entries that share an AlarmFullReference but
// differ in subtag addresses would otherwise silently overwrite the
// earlier _statesByReference entry while its addresses stay bound to an
// orphaned (and therefore invisible) AlarmState, producing silently
// inconsistent synthesized state. Fail fast at subscribe time instead.
string reference = target.AlarmFullReference ?? string.Empty;
if (_statesByReference.ContainsKey(reference))
{
throw new ArgumentException(
$"Duplicate alarm full reference '{reference}' is bound to more than one alarm target.",
nameof(targets));
}
var state = new AlarmState(target);
_statesByReference[target.AlarmFullReference] = state;
_statesByReference[reference] = state;
Bind(target.ActiveSubtag, state, SubtagRole.Active);
Bind(target.AckedSubtag, state, SubtagRole.Acked);
@@ -1,5 +1,4 @@
using System;
using System.Security.Cryptography;
using System.Text;
namespace ZB.MOM.WW.MxGateway.Worker.MxAccess;
@@ -11,8 +10,29 @@ namespace ZB.MOM.WW.MxGateway.Worker.MxAccess;
/// repeated transitions for the same alarm reference correlate downstream
/// (acknowledge, snapshot, OPC UA mapping) without an alarmmgr-supplied GUID.
/// </summary>
/// <remarks>
/// The 128-bit value is computed with a fixed FNV-1a hash over the UTF-8
/// bytes of the reference, deliberately <strong>not</strong> via
/// <c>System.Security.Cryptography</c>. On .NET Framework 4.8
/// <c>MD5.Create()</c> returns the non-validated
/// <c>MD5CryptoServiceProvider</c>, whose constructor throws under the
/// Windows FIPS-compliance policy ("not part of the Windows Platform FIPS
/// validated cryptographic algorithms"). Because this derivation needs only
/// determinism and distinctness — never cryptographic strength — a plain
/// non-crypto hash avoids the FIPS gate entirely, so the subtag fallback
/// keeps working on regulated (FIPS-enabled) hosts exactly when it is needed.
/// </remarks>
internal static class SyntheticAlarmGuid
{
// 64-bit FNV-1a offset basis and prime (the standard published values).
private const ulong FnvOffsetBasis = 14695981039346656037UL;
private const ulong FnvPrime = 1099511628211UL;
// A second independent seed for the high 8 bytes so the full 128-bit value
// is well-distributed across distinct references rather than two correlated
// halves of the same single-pass hash.
private const ulong FnvSecondSeed = 1469598103934665603UL;
/// <summary>
/// Produces a stable <see cref="Guid"/> for the given alarm reference.
/// The same reference always maps to the same GUID; distinct references
@@ -32,11 +52,39 @@ internal static class SyntheticAlarmGuid
byte[] bytes = Encoding.UTF8.GetBytes(reference);
// MD5 is used purely for a stable, non-cryptographic identity mapping
// (reference -> 16-byte GUID), never for security. Its 128-bit output
// fits a GUID exactly, which is why it is preferred here.
using MD5 md5 = MD5.Create();
byte[] hash = md5.ComputeHash(bytes);
return new Guid(hash);
// Two FNV-1a accumulators, updated in a single loop over the bytes, fill
// the low and high 64 bits of the 128-bit value. The high accumulator starts
// from its own seed and mixes each byte's index into the input, so
// single-character differences and re-orderings still diverge in both
// halves, avoiding the correlated-half collisions that reusing one hash for
// both halves would risk.
ulong low = FnvOffsetBasis;
ulong high = FnvSecondSeed;
for (int i = 0; i < bytes.Length; i++)
{
byte b = bytes[i];
low ^= b;
low *= FnvPrime;
high ^= unchecked(b + (ulong)i);
high *= FnvPrime;
}
// Fold the length in so the empty string and other short inputs are not
// degenerate (an all-zero / Guid.Empty result is undesirable downstream).
low ^= (ulong)bytes.Length;
low *= FnvPrime;
byte[] guidBytes = new byte[16];
WriteUInt64(guidBytes, 0, low);
WriteUInt64(guidBytes, 8, high);
return new Guid(guidBytes);
}
private static void WriteUInt64(byte[] buffer, int offset, ulong value)
{
for (int i = 0; i < 8; i++)
{
buffer[offset + i] = (byte)(value >> (i * 8));
}
}
}