From c6f17557f686f63e1544cad5972bfacff3fdc2ff Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Mon, 15 Jun 2026 01:04:46 -0400
Subject: [PATCH] docs: forced-subtag mode fix plan

---
 .../2026-06-15-forced-subtag-mode-fix.md | 97 +++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 docs/plans/2026-06-15-forced-subtag-mode-fix.md

diff --git a/docs/plans/2026-06-15-forced-subtag-mode-fix.md b/docs/plans/2026-06-15-forced-subtag-mode-fix.md
new file mode 100644
index 0000000..654b1a0
--- /dev/null
+++ b/docs/plans/2026-06-15-forced-subtag-mode-fix.md
@@ -0,0 +1,97 @@
+# ForceSubtag Mode Fix Implementation Plan
+
+> Fixes the two defects surfaced by the D3 live validation (2026-06-15): forced-subtag
+> doesn't actually run subtag (#1), and the gateway never reflects a forced provider mode
+> into the gauge/feed (#2).
+
+**Goal:** Make `MxGateway:Alarms:Fallback:Mode=ForceSubtag` actually serve degraded subtag
+alarms AND have the gateway advertise `provider_mode=2` / degraded badge.
+
+**Evidence:** Live ForceSubtag run returned alarmmgr-sourced active alarms (May raise
+timestamps, `degraded=false`) and `provider_mode` stuck at 1, despite ForceSubtag binding
+(proven by invalid-value crash) and the deployed worker containing the ForcedMode routing
+(`3f5e5fc` ∈ `5976770`, worker dated 2026-06-14).
+
+---
+
+## Defect #2 (CONFIRMED code defect) — gateway never reflects forced mode
+
+**Root cause:** `GatewayAlarmMonitor.RunMonitorAsync` hard-baselines `_providerMode=Alarmmgr`
+and sets the gauge to 1, ignoring `_options.Fallback.Mode`. `_providerMode` only advances on a
+worker `OnAlarmProviderModeChanged` event, which is raised ONLY by `FailoverAlarmConsumer`
+(Auto mode). Forced-subtag builds `SubtagAlarmConsumer` directly → no event → gauge/feed stay
+Alarmmgr forever.
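+
+The gap in the root cause above can be illustrated with a small stand-alone sketch of the
+seeding that Task 2 describes. It is not the gateway code: the three enums and the
+`ProviderModeSeed` class are hypothetical stand-ins, and only the gauge values (1 for
+alarmmgr, 2 for subtag) and the mode names come from this plan.
+
+```csharp
+using System;
+
+// Hypothetical stand-ins for the gateway's real types (assumed names).
+// Only the gauge values 1 = alarmmgr and 2 = subtag are taken from the plan.
+enum FallbackMode { Auto, ForceAlarmManager, ForceSubtag }
+enum ForcedMode { Unspecified, Alarmmgr, Subtag }
+enum ProviderMode { Alarmmgr = 1, Subtag = 2 }
+
+static class ProviderModeSeed
+{
+    // Config mode -> forced mode sent to the worker (Auto forces nothing).
+    public static ForcedMode MapForcedMode(FallbackMode mode) => mode switch
+    {
+        FallbackMode.ForceSubtag => ForcedMode.Subtag,
+        FallbackMode.ForceAlarmManager => ForcedMode.Alarmmgr,
+        _ => ForcedMode.Unspecified,
+    };
+
+    // Forced mode -> initial provider mode. Unspecified (Auto) starts on the
+    // alarmmgr primary, so only a forced Subtag seeds the degraded state.
+    public static ProviderMode InitialMode(ForcedMode forced) =>
+        forced == ForcedMode.Subtag ? ProviderMode.Subtag : ProviderMode.Alarmmgr;
+
+    static void Main()
+    {
+        foreach (FallbackMode mode in Enum.GetValues(typeof(FallbackMode)))
+        {
+            ProviderMode initial = InitialMode(MapForcedMode(mode));
+            bool degraded = initial == ProviderMode.Subtag;
+            Console.WriteLine($"{mode}: gauge={(int)initial} degraded={degraded}");
+        }
+    }
+}
+```
+
+In this sketch only `ForceSubtag` seeds gauge 2 with the degraded flag set; `Auto` and
+`ForceAlarmManager` both start at gauge 1. Seeding is a plain assignment, so it gives no
+reason to count a provider switch.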
+
+### Task 2: Seed provider mode from configured forced mode (gateway, net10)
+**Classification:** small · **Parallelizable with:** none (precedes the diagnostic build)
+**Files:**
+- Modify: `src/ZB.MOM.WW.MxGateway.Server/Alarms/GatewayAlarmMonitor.cs` (`RunMonitorAsync`, ~lines 160-172; add a permanent observability log in `SubscribeAlarmsAsync` ~line 257)
+- Test: `src/ZB.MOM.WW.MxGateway.Tests/Alarms/GatewayAlarmMonitorProviderModeTests.cs`
+
+**Change:** in `RunMonitorAsync`, compute `initialMode = MapForcedMode(_options.Fallback.Mode)`
+mapped as `Subtag→Subtag`, `Alarmmgr→Alarmmgr`, `Unspecified→Alarmmgr` (Auto starts on the
+alarmmgr primary). Set `_providerMode/_providerDegraded(=Subtag)/_providerReason/_providerSince`
+and `_metrics.SetAlarmProviderMode(ModeToInt(initialMode))` — using the existing no-switch gauge
+seam so `provider_switches` does NOT increment. Add a log in `SubscribeAlarmsAsync`:
+`"Alarm subscribe forcedMode={ForcedMode} (configMode={ConfigMode}) watchList={Count}"`.
+
+**Tests (fake-worker, no MXAccess):** with `Fallback:Mode=ForceSubtag` assert (a) first
+`StreamAlarms` message is `ProviderStatus{Mode=Subtag,Degraded=true}`; (b) gauge==2; (c)
+`provider_switches`==0. Add `ForceAlarmManager`→gauge 1 and `Auto`→gauge 1 baseline cases.
+
+**Verify:** `dotnet build src/ZB.MOM.WW.MxGateway.Server`; `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter FullyQualifiedName~GatewayAlarmMonitorProviderMode`.
+
+---
+
+## Defect #1 (runtime bug — needs diagnosis) — ForceSubtag runs alarmmgr
+
+The HEAD source path (gateway `MapForcedMode` → `SubscribeAlarmsCommand.ForcedMode=Subtag` →
+IPC → worker `AlarmCommandHandler.BuildConsumer` → `SubtagAlarmConsumer`) is statically correct,
+yet the runtime ran alarmmgr. A runtime diagnostic must locate where `forcedMode` becomes
+`Unspecified`.
+
+### Task 1: Add worker BuildConsumer observability log (worker, net48 x86)
+**Classification:** small (net48 — no init-only props) · **Parallelizable with:** Task 2
+**Files:**
+- Modify: `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/AlarmCommandHandler.cs` (`BuildConsumer`, ~line 206)
+
+**Change:** log at entry of `BuildConsumer`:
+`"BuildConsumer forcedMode={ForcedMode} watchList={Count}"`. This is permanent observability.
+
+**Verify (windev only):** `dotnet build src/ZB.MOM.WW.MxGateway.Worker -p:Platform=x86`.
+
+### Task 3: Diagnostic run — capture the real forcedMode
+**Classification:** standard (ops/diagnostic on windev) · **Parallelizable with:** none
+**Steps:** rebuild worker (x86) + server from the branch, stand up the D3-style temp ForceSubtag
+instance (alt ports 5122/5132, isolated `.D3` worker name, WMI-detached, Http2, Development env
+for file logs), trigger the always-on monitor, and read the two new log lines:
+- Gateway `Alarm subscribe forcedMode=...` — what the gateway SENDS.
+- Worker `BuildConsumer forcedMode=...` — what the worker RECEIVES.
+
+Decision matrix:
+- Gateway logs `Subtag`, worker logs `Unspecified` → IPC/serialization drops the enum → fix the
+  send/translation path (likely a worker-proto vs gateway-proto `SubscribeAlarmsCommand` mismatch
+  in the named-pipe envelope).
+- Worker logs `Subtag` but alarmmgr data appears → bug in `BuildStandby`/`SubtagAlarmConsumer`/
+  `AlarmDispatcher` snapshot path.
+- Gateway logs `Unspecified` despite config ForceSubtag → gateway config/options read.
+
+### Task 4: Fix #1 per Task 3 diagnosis
+**Classification:** standard/high-risk (depends on where the defect is) · the exact change is
+determined by Task 3. Add a regression test at the identified layer (worker unit test for
+BuildConsumer→SubtagAlarmConsumer, or a contract/IPC round-trip test if the enum is dropped).
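+
+The Task 3 decision matrix can also be written down as a small triage helper, which may be
+handy when reading the two log lines side by side. This is a stand-alone sketch under
+assumptions: `ForcedMode`, `ForcedModeTriage` and `Diagnose` are hypothetical names, the
+configured mode is taken to be ForceSubtag, and the returned strings only paraphrase the
+matrix rows.
+
+```csharp
+using System;
+
+// Hypothetical enum mirroring the forced-mode values named in this plan.
+enum ForcedMode { Unspecified, Alarmmgr, Subtag }
+
+static class ForcedModeTriage
+{
+    // gatewaySent:      value from the gateway "Alarm subscribe forcedMode=..." line
+    // workerReceived:   value from the worker "BuildConsumer forcedMode=..." line
+    // alarmmgrDataSeen: whether alarmmgr-sourced alarms were still observed
+    public static string Diagnose(
+        ForcedMode gatewaySent, ForcedMode workerReceived, bool alarmmgrDataSeen)
+    {
+        // The gateway is upstream, so a wrong value here is checked first.
+        if (gatewaySent == ForcedMode.Unspecified)
+            return "gateway config/options read";
+
+        if (gatewaySent == ForcedMode.Subtag && workerReceived == ForcedMode.Unspecified)
+            return "IPC/serialization drops the enum";
+
+        if (workerReceived == ForcedMode.Subtag && alarmmgrDataSeen)
+            return "BuildStandby/SubtagAlarmConsumer/AlarmDispatcher snapshot path";
+
+        return "not covered by the decision matrix";
+    }
+
+    static void Main()
+    {
+        // Example: gateway sent Subtag, worker saw Unspecified.
+        Console.WriteLine(
+            Diagnose(ForcedMode.Subtag, ForcedMode.Unspecified, alarmmgrDataSeen: true));
+    }
+}
+```
+
+The fall-through case is kept explicit so that a log combination outside the three rows is
+reported as such rather than mapped onto the nearest row.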
+
+---
+
+## Final: build, test, redeploy, re-validate
+- Build gateway (macOS) + worker (windev x86); run gateway + worker test suites.
+- Redeploy windev Server (and Worker if Task 4 changed it) per `project_deploy_mechanics`,
+  preserving appsettings.
+- Re-validate live with the temp ForceSubtag instance: active alarms `degraded=true` /
+  `source_provider=SUBTAG` with recent timestamps, `provider_mode 2`. Tear down temp instance;
+  production untouched.
+
+## Execution note
+Branch off `main`. #2 is the clean confirmed fix; #1 is diagnose-then-fix. Net48 worker
+constraints apply (no init-only props/positional records). Do NOT increment `provider_switches`
+on an initial forced-mode set.