docs: forced-subtag mode fix plan

Joseph Doherty
2026-06-15 01:04:46 -04:00
parent bbbef4d098
commit c6f17557f6
@@ -0,0 +1,97 @@
# ForceSubtag Mode Fix Implementation Plan
> Fixes the two defects surfaced by the D3 live validation (2026-06-15): forced-subtag
> doesn't actually run subtag (#1), and the gateway never reflects a forced provider mode
> into the gauge/feed (#2).

**Goal:** Make `MxGateway:Alarms:Fallback:Mode=ForceSubtag` actually serve degraded subtag
alarms AND have the gateway advertise `provider_mode=2` / degraded badge.

**Evidence:** The live ForceSubtag run returned alarmmgr-sourced active alarms (raise
timestamps from May, `degraded=false`) and `provider_mode` stayed at 1, even though the
ForceSubtag setting was bound (proven by the crash on an invalid value) and the deployed worker
contained the ForcedMode routing (`3f5e5fc`, `5976770`; worker dated 2026-06-14).

---
## Defect #2 (CONFIRMED code defect) — gateway never reflects forced mode
**Root cause:** `GatewayAlarmMonitor.RunMonitorAsync` hard-baselines `_providerMode=Alarmmgr`
and sets the gauge to 1, ignoring `_options.Fallback.Mode`. `_providerMode` only advances on a
worker `OnAlarmProviderModeChanged` event, which is raised ONLY by `FailoverAlarmConsumer`
(Auto mode). Forced-subtag builds `SubtagAlarmConsumer` directly → no event → gauge/feed stay
Alarmmgr forever.
### Task 2: Seed provider mode from configured forced mode (gateway, net10)
**Classification:** small · **Parallelizable with:** none (precedes the diagnostic build)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Alarms/GatewayAlarmMonitor.cs` (`RunMonitorAsync`, ~lines 160-172; add a permanent observability log in `SubscribeAlarmsAsync` ~line 257)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Alarms/GatewayAlarmMonitorProviderModeTests.cs`
**Change:** in `RunMonitorAsync`, compute `initialMode = MapForcedMode(_options.Fallback.Mode)`
mapped as `Subtag→Subtag`, `Alarmmgr→Alarmmgr`, `Unspecified→Alarmmgr` (Auto starts on the
alarmmgr primary). Set `_providerMode`, `_providerDegraded` (true only when `initialMode` is
Subtag), `_providerReason`, and `_providerSince`, then call
`_metrics.SetAlarmProviderMode(ModeToInt(initialMode))`. This goes through the existing no-switch
gauge seam, so `provider_switches` does NOT increment. Add a log in `SubscribeAlarmsAsync`:
`"Alarm subscribe forcedMode={ForcedMode} (configMode={ConfigMode}) watchList={Count}"`.
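
The seeding rule above can be sketched roughly as follows. This is a minimal illustration, not the actual `GatewayAlarmMonitor` code: the three enums, `FakeMetrics`, `RecordProviderSwitch`, `InitialMode`, and `Seed` are hypothetical stand-ins, and only the mapping rules, the gauge values (1 = alarmmgr, 2 = subtag), and the names `MapForcedMode`, `ModeToInt`, and `SetAlarmProviderMode` come from this plan.

```csharp
using System;

// Hypothetical stand-ins for the real gateway types (assumptions, not the actual code).
public enum FallbackMode { Auto, ForceAlarmManager, ForceSubtag } // configured Fallback:Mode
public enum ForcedMode { Unspecified, Alarmmgr, Subtag }          // value sent to the worker
public enum ProviderMode { Alarmmgr = 1, Subtag = 2 }             // numeric value = gauge value

public sealed class FakeMetrics
{
    public int ProviderModeGauge { get; private set; }
    public int ProviderSwitches { get; private set; }

    // No-switch seam: updates the gauge only.
    public void SetAlarmProviderMode(int mode) => ProviderModeGauge = mode;

    // Hypothetical switch path for real runtime transitions; also bumps the counter.
    public void RecordProviderSwitch(int mode)
    {
        ProviderModeGauge = mode;
        ProviderSwitches++;
    }
}

public static class ProviderModeSeed
{
    // Config value -> forced mode requested from the worker.
    public static ForcedMode MapForcedMode(FallbackMode configured) => configured switch
    {
        FallbackMode.ForceSubtag => ForcedMode.Subtag,
        FallbackMode.ForceAlarmManager => ForcedMode.Alarmmgr,
        _ => ForcedMode.Unspecified,
    };

    // Subtag -> Subtag; Alarmmgr and Unspecified (Auto) -> Alarmmgr.
    public static ProviderMode InitialMode(ForcedMode forced) =>
        forced == ForcedMode.Subtag ? ProviderMode.Subtag : ProviderMode.Alarmmgr;

    public static int ModeToInt(ProviderMode mode) => (int)mode;

    // Seeds the gauge via the no-switch seam and returns the initial provider status.
    public static (ProviderMode Mode, bool Degraded) Seed(FallbackMode configured, FakeMetrics metrics)
    {
        var mode = InitialMode(MapForcedMode(configured));
        metrics.SetAlarmProviderMode(ModeToInt(mode));
        return (mode, mode == ProviderMode.Subtag);
    }
}
```

Keeping the initial set on a path separate from the switch path is what lets the gauge read 2 under ForceSubtag while `provider_switches` stays at 0.
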
**Tests (fake-worker, no MXAccess):** with `Fallback:Mode=ForceSubtag` assert (a) first
`StreamAlarms` message is `ProviderStatus{Mode=Subtag,Degraded=true}`; (b) gauge==2; (c)
`provider_switches`==0. Add `ForceAlarmManager`→gauge 1 and `Auto`→gauge 1 baseline cases.
**Verify:** `dotnet build src/ZB.MOM.WW.MxGateway.Server`; `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter FullyQualifiedName~GatewayAlarmMonitorProviderMode`.

---
## Defect #1 (runtime bug — needs diagnosis) — ForceSubtag runs alarmmgr
The HEAD source path (gateway `MapForcedMode` → `SubscribeAlarmsCommand.ForcedMode=Subtag`
IPC → worker `AlarmCommandHandler.BuildConsumer` → `SubtagAlarmConsumer`) is statically correct,
yet the runtime ran alarmmgr. A runtime diagnostic must locate where `forcedMode` becomes
`Unspecified`.
### Task 1: Add worker BuildConsumer observability log (worker, net48 x86)
**Classification:** small (net48 — no init-only props) · **Parallelizable with:** Task 2
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/AlarmCommandHandler.cs` (`BuildConsumer`, ~line 206)
**Change:** log at entry of `BuildConsumer`:
`"BuildConsumer forcedMode={ForcedMode} watchList={Count}"`. This is permanent observability.
**Verify (windev only):** `dotnet build src/ZB.MOM.WW.MxGateway.Worker -p:Platform=x86`.
### Task 3: Diagnostic run — capture the real forcedMode
**Classification:** standard (ops/diagnostic on windev) · **Parallelizable with:** none
**Steps:** rebuild worker (x86) + server from the branch, stand up the D3-style temp ForceSubtag
instance (alt ports 5122/5132, isolated `.D3` worker name, WMI-detached, Http2, Development env
for file logs), trigger the always-on monitor, and read the two new log lines:
- Gateway `Alarm subscribe forcedMode=...` — what the gateway SENDS.
- Worker `BuildConsumer forcedMode=...` — what the worker RECEIVES.

Decision matrix:
- Gateway logs `Subtag`, worker logs `Unspecified` → IPC/serialization drops the enum → fix the
send/translation path (likely a worker-proto vs gateway-proto `SubscribeAlarmsCommand` mismatch
in the named-pipe envelope).
- Worker logs `Subtag` but alarmmgr data appears → bug in `BuildStandby`/`SubtagAlarmConsumer`/
`AlarmDispatcher` snapshot path.
- Gateway logs `Unspecified` despite config ForceSubtag → gateway config/options read.
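
The decision matrix above can be read as a small classification function. The sketch below is only an aid for reasoning about the two log lines: `ForcedModeDiagnosis` and `Classify` are hypothetical names, the inputs are the `forcedMode` values copied by hand from the logs, and the final "not reproduced" branch is an added assumption for the case where nothing looks wrong.

```csharp
using System;

// Hypothetical helper (not part of the codebase) mirroring the decision matrix.
public static class ForcedModeDiagnosis
{
    // gatewayForcedMode / workerForcedMode: the forcedMode values read from the two log lines.
    // alarmmgrDataSeen: whether alarmmgr-sourced alarms still appeared during the run.
    public static string Classify(string gatewayForcedMode, string workerForcedMode, bool alarmmgrDataSeen)
    {
        bool gatewaySubtag = string.Equals(gatewayForcedMode, "Subtag", StringComparison.Ordinal);
        bool workerSubtag = string.Equals(workerForcedMode, "Subtag", StringComparison.Ordinal);

        // Gateway never sent Subtag despite the config: look at the config/options read.
        if (!gatewaySubtag) return "gateway config/options read";

        // Gateway sent Subtag but the worker saw something else: the enum is lost in transit.
        if (!workerSubtag) return "IPC/serialization path";

        // Both sides agree on Subtag yet alarmmgr data shows up: the worker consumer path is at fault.
        if (alarmmgrDataSeen) return "worker BuildStandby/SubtagAlarmConsumer/AlarmDispatcher path";

        // Assumption: nothing in the matrix matched, so the defect did not reproduce in this run.
        return "not reproduced";
    }
}
```

For example, a run where the gateway logs `Subtag` and the worker logs `Unspecified` classifies as the IPC/serialization case, which is the first row of the matrix.
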
### Task 4: Fix #1 per Task 3 diagnosis
**Classification:** standard/high-risk (depends on where the defect is) · the exact change is
determined by Task 3. Add a regression test at the identified layer (worker unit test for
BuildConsumer→SubtagAlarmConsumer, or a contract/IPC round-trip test if the enum is dropped).

---
## Final: build, test, redeploy, re-validate
- Build gateway (macOS) + worker (windev x86); run gateway + worker test suites.
- Redeploy windev Server (and Worker if Task 4 changed it) per `project_deploy_mechanics`,
preserving appsettings.
- Re-validate live with the temp ForceSubtag instance: active alarms `degraded=true` /
`source_provider=SUBTAG` with recent timestamps, `provider_mode 2`. Tear down temp instance;
production untouched.
## Execution note
Branch off `main`. #2 is the clean confirmed fix; #1 is diagnose-then-fix. Net48 worker
constraints apply (no init-only props/positional records). Do NOT increment `provider_switches`
on an initial forced-mode set.