docs: forced-subtag mode fix plan
# ForceSubtag Mode Fix Implementation Plan

> Fixes the two defects surfaced by the D3 live validation (2026-06-15): forced-subtag mode
> doesn't actually run the subtag provider (#1), and the gateway never reflects a forced
> provider mode in the gauge/feed (#2).

**Goal:** Make `MxGateway:Alarms:Fallback:Mode=ForceSubtag` actually serve degraded subtag
alarms AND have the gateway advertise `provider_mode=2` / degraded badge.

**Evidence:** The live ForceSubtag run returned alarmmgr-sourced active alarms (raise
timestamps dated in May, `degraded=false`) with `provider_mode` stuck at 1. This happened even
though the `ForceSubtag` value binds from configuration (proven by the crash on an invalid
value) and the deployed worker contains the ForcedMode routing
(`3f5e5fc` ∈ `5976770`, worker dated 2026-06-14).

---
## Defect #2 (CONFIRMED code defect) — gateway never reflects forced mode

**Root cause:** `GatewayAlarmMonitor.RunMonitorAsync` hard-baselines `_providerMode=Alarmmgr`
and sets the gauge to 1, ignoring `_options.Fallback.Mode`. `_providerMode` only advances on a
worker `OnAlarmProviderModeChanged` event, which is raised ONLY by `FailoverAlarmConsumer`
(Auto mode). Forced-subtag builds `SubtagAlarmConsumer` directly → no event → gauge/feed stay
Alarmmgr forever.

### Task 2: Seed provider mode from configured forced mode (gateway, net10)
**Classification:** small · **Parallelizable with:** Task 1 (both precede the diagnostic build)
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Server/Alarms/GatewayAlarmMonitor.cs` (`RunMonitorAsync`, ~lines 160-172; add a permanent observability log in `SubscribeAlarmsAsync` ~line 257)
- Test: `src/ZB.MOM.WW.MxGateway.Tests/Alarms/GatewayAlarmMonitorProviderModeTests.cs`

**Change:** in `RunMonitorAsync`, compute `initialMode = MapForcedMode(_options.Fallback.Mode)`
mapped as `Subtag→Subtag`, `Alarmmgr→Alarmmgr`, `Unspecified→Alarmmgr` (Auto starts on the
alarmmgr primary). Set `_providerMode`, `_providerDegraded` (true only when `initialMode` is
`Subtag`), `_providerReason`, and `_providerSince`, then call
`_metrics.SetAlarmProviderMode(ModeToInt(initialMode))`, using the existing no-switch gauge
seam so `provider_switches` does NOT increment. Add a log in `SubscribeAlarmsAsync`:
`"Alarm subscribe forcedMode={ForcedMode} (configMode={ConfigMode}) watchList={Count}"`.
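The seeding rule above can be modelled in a few lines. This is only a language-neutral Python sketch, not the gateway's C# code: `MonitorState`, `initial_mode`, and `seed` are hypothetical names, and the value 0 for `UNSPECIFIED` is an assumption. The gauge values 1 and 2 and the three config values come from this plan.

```python
from dataclasses import dataclass
from enum import IntEnum


class ProviderMode(IntEnum):
    UNSPECIFIED = 0  # assumed numbering; the plan does not state it
    ALARMMGR = 1     # gauge value 1 per the plan
    SUBTAG = 2       # gauge value 2 per the plan


# Configured Fallback:Mode value -> forced mode (stand-in for MapForcedMode).
FORCED_MODE = {
    "ForceSubtag": ProviderMode.SUBTAG,
    "ForceAlarmManager": ProviderMode.ALARMMGR,
    "Auto": ProviderMode.UNSPECIFIED,
}


@dataclass
class MonitorState:
    """Hypothetical stand-in for the monitor's provider fields and metrics."""
    mode: ProviderMode = ProviderMode.ALARMMGR
    degraded: bool = False
    gauge: int = 1
    switches: int = 0


def initial_mode(config_mode: str) -> ProviderMode:
    """Unspecified (Auto) starts on the alarmmgr primary; forced modes pass through."""
    forced = FORCED_MODE[config_mode]  # raises KeyError on an invalid config value
    return ProviderMode.ALARMMGR if forced is ProviderMode.UNSPECIFIED else forced


def seed(state: MonitorState, config_mode: str) -> MonitorState:
    """Seed the baseline from config without counting it as a provider switch."""
    mode = initial_mode(config_mode)
    state.mode = mode
    state.degraded = mode is ProviderMode.SUBTAG
    state.gauge = int(mode)
    # state.switches is deliberately left untouched: an initial set is not a switch.
    return state
```

Seeding with `"ForceSubtag"` leaves the model at gauge 2, degraded, with zero switches, which mirrors the three assertions planned for the fake-worker test.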

**Tests (fake-worker, no MXAccess):** with `Fallback:Mode=ForceSubtag` assert (a) first
`StreamAlarms` message is `ProviderStatus{Mode=Subtag,Degraded=true}`; (b) gauge==2; (c)
`provider_switches`==0. Add `ForceAlarmManager`→gauge 1 and `Auto`→gauge 1 baseline cases.

**Verify:** `dotnet build src/ZB.MOM.WW.MxGateway.Server`; `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter FullyQualifiedName~GatewayAlarmMonitorProviderMode`.

---
## Defect #1 (runtime bug — needs diagnosis) — ForceSubtag runs alarmmgr

The HEAD source path (gateway `MapForcedMode` → `SubscribeAlarmsCommand.ForcedMode=Subtag` →
IPC → worker `AlarmCommandHandler.BuildConsumer` → `SubtagAlarmConsumer`) is statically correct,
yet the runtime ran alarmmgr. A runtime diagnostic must locate where `forcedMode` becomes
`Unspecified`, or show that it arrives intact and the fault lies downstream of `BuildConsumer`.
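To see why a lost `forcedMode` would produce exactly the observed symptom, the routing can be modelled as below. This is a hedged Python sketch of one hypothesis, not a diagnosis and not the worker's code: the enum numbering, the dictionary "envelope", `decode_forced_mode`, and the name `AlarmmgrAlarmConsumer` are all assumptions for illustration. Only `SubtagAlarmConsumer` and `FailoverAlarmConsumer` are named in this plan.

```python
from enum import IntEnum


class ForcedMode(IntEnum):
    UNSPECIFIED = 0  # assumed default value
    ALARMMGR = 1
    SUBTAG = 2


def decode_forced_mode(envelope: dict) -> ForcedMode:
    """Model a receiver that falls back to the default when the field is absent."""
    return ForcedMode(envelope.get("forced_mode", 0))


def build_consumer(forced_mode: ForcedMode) -> str:
    """Pick the consumer the way the plan describes the worker's routing."""
    if forced_mode is ForcedMode.SUBTAG:
        return "SubtagAlarmConsumer"
    if forced_mode is ForcedMode.ALARMMGR:
        return "AlarmmgrAlarmConsumer"  # hypothetical name, not from the plan
    # Unspecified means Auto, which starts on the alarmmgr primary.
    return "FailoverAlarmConsumer"


# Field survives the hop: subtag is served as configured.
intact = build_consumer(decode_forced_mode({"forced_mode": 2}))

# Field is lost in transit: the worker silently runs Auto, i.e. alarmmgr data.
dropped = build_consumer(decode_forced_mode({}))
```

In this model a dropped field raises no error anywhere, which is why the plan relies on logging the value on both sides of the IPC boundary.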
### Task 1: Add worker BuildConsumer observability log (worker, net48 x86)
**Classification:** small (net48: no init-only props) · **Parallelizable with:** Task 2
**Files:**
- Modify: `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/AlarmCommandHandler.cs` (`BuildConsumer`, ~line 206)

**Change:** log at entry of `BuildConsumer`:
`"BuildConsumer forcedMode={ForcedMode} watchList={Count}"`. This is permanent observability.

**Verify (windev only):** `dotnet build src/ZB.MOM.WW.MxGateway.Worker -p:Platform=x86`.

### Task 3: Diagnostic run — capture the real forcedMode
**Classification:** standard (ops/diagnostic on windev) · **Parallelizable with:** none
**Steps:** rebuild worker (x86) + server from the branch, stand up the D3-style temp ForceSubtag
instance (alt ports 5122/5132, isolated `.D3` worker name, WMI-detached, Http2, Development env
for file logs), trigger the always-on monitor, and read the two new log lines:
- Gateway `Alarm subscribe forcedMode=...`: what the gateway SENDS.
- Worker `BuildConsumer forcedMode=...`: what the worker RECEIVES.

Decision matrix:
- Gateway logs `Subtag`, worker logs `Unspecified` → IPC/serialization drops the enum → fix the
  send/translation path (likely a worker-proto vs gateway-proto `SubscribeAlarmsCommand` mismatch
  in the named-pipe envelope).
- Worker logs `Subtag` but alarmmgr data appears → bug in `BuildStandby`/`SubtagAlarmConsumer`/
  `AlarmDispatcher` snapshot path.
- Gateway logs `Unspecified` despite config ForceSubtag → bug in the gateway config/options read.
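The decision matrix above can be applied mechanically to the two captured log lines. The Python sketch below assumes the log templates render as plain `forcedMode=<Value>` text, which depends on the logger configuration and is not confirmed by this plan. `forced_mode_from` and `diagnose` are hypothetical helper names, and the sample lines are made up for illustration.

```python
import re
from typing import Optional

# Matches the forcedMode value in either of the two new log lines.
FORCED_MODE_RE = re.compile(r"forcedMode=(\w+)")


def forced_mode_from(line: str) -> Optional[str]:
    """Extract the forcedMode value from a rendered log line, if present."""
    match = FORCED_MODE_RE.search(line)
    return match.group(1) if match else None


def diagnose(gateway_line: str, worker_line: str, alarmmgr_data_seen: bool) -> str:
    """Map what the gateway sent and the worker received onto the matrix."""
    sent = forced_mode_from(gateway_line)
    received = forced_mode_from(worker_line)
    if sent == "Unspecified":
        return "gateway config/options read"
    if sent == "Subtag" and received == "Unspecified":
        return "IPC/serialization drops the enum"
    if received == "Subtag" and alarmmgr_data_seen:
        return "BuildStandby/SubtagAlarmConsumer/AlarmDispatcher snapshot path"
    return "inconclusive"


# Illustrative lines only; real output may carry timestamps and log levels.
gateway = "Alarm subscribe forcedMode=Subtag (configMode=ForceSubtag) watchList=3"
worker = "BuildConsumer forcedMode=Unspecified watchList=3"
verdict = diagnose(gateway, worker, alarmmgr_data_seen=True)
```

With these sample lines the verdict points at the IPC path. An `inconclusive` result means the captured values fit none of the three rows and the run should be repeated or the logs rechecked.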
### Task 4: Fix #1 per Task 3 diagnosis
**Classification:** standard/high-risk (depends on where the defect is) · the exact change is
determined by Task 3. Add a regression test at the identified layer (worker unit test for
`BuildConsumer`→`SubtagAlarmConsumer`, or a contract/IPC round-trip test if the enum is dropped).

---
## Final: build, test, redeploy, re-validate
- Build gateway (macOS) + worker (windev x86); run gateway + worker test suites.
- Redeploy windev Server (and Worker if Task 4 changed it) per `project_deploy_mechanics`,
  preserving appsettings.
- Re-validate live with the temp ForceSubtag instance: active alarms `degraded=true` /
  `source_provider=SUBTAG` with recent timestamps, `provider_mode=2`. Tear down the temp
  instance; leave production untouched.

## Execution note
Branch off `main`. #2 is the clean confirmed fix; #1 is diagnose-then-fix. Net48 worker
constraints apply (no init-only props/positional records). Do NOT increment `provider_switches`
on an initial forced-mode set.