From 371ce53409794a4aea990d0b71cfc22a676f72bb Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Sun, 14 Jun 2026 23:46:21 -0400 Subject: [PATCH] docs: add deferred follow-ups plan (D1-D5) --- docs/plans/2026-06-14-deferred-followups.md | 234 ++++++++++++++++++++ 1 file changed, 234 insertions(+) create mode 100644 docs/plans/2026-06-14-deferred-followups.md diff --git a/docs/plans/2026-06-14-deferred-followups.md b/docs/plans/2026-06-14-deferred-followups.md new file mode 100644 index 0000000..17545f4 --- /dev/null +++ b/docs/plans/2026-06-14-deferred-followups.md @@ -0,0 +1,234 @@ +# Deferred Follow-ups Implementation Plan + +**Date:** 2026-06-14 +**Status:** Plan only — NOT yet executed. Saved for review. +**Context:** After the alarm-subtag-fallback cleanup (merged `5976770`) and its redeploy to +windev (10.100.0.48), five items remain deferred. This plan handles all five. They are +independent — execute in any order, or cherry-pick. Items D1–D2 are code (branch off `main`); +D3 is a dev-rig validation; D4–D5 are ops on the deployed hosts (no code). + +Source of the deferred list: the post-deploy review on 2026-06-14. See also memory +`project_deploy_mechanics`, `project_wonder_deployment`, `project_alarm_subtag_fallback`, +`project_rig_alarms_object_driven`. + +--- + +## D1 — Surface `AlarmProviderSwitchCount` on the dashboard metric list + +**Classification:** small · **Est:** ~10 min · **Where:** Server (net10), build+test on macOS +**Why deferred:** the dashboard reads provider *state* from the feed's `ProviderStatus` badge, +not the metrics snapshot, so the snapshot field (added in B1) and the OTEL counter were enough. +This makes the cumulative switch count visible in the dashboard's metric table too. + +**Files:** +- Modify: `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSnapshotService.cs` (`CreateMetricSummaries`, ~line 178–198) +- Test: `src/ZB.MOM.WW.MxGateway.Tests/Dashboard/DashboardSnapshotServiceTests.cs` (or the existing snapshot test file) + +**Steps:** +1. In `CreateMetricSummaries`, add to the `metrics` list (near the other counters): + `new("mxgateway.alarms.provider_switches", snapshot.AlarmProviderSwitchCount),` + Also consider adding the current mode as a gauge row if useful: + the snapshot does not carry the mode int (only the feed does), so DO NOT invent one — + only add the switch count, which the snapshot already exposes. +2. Add/extend a unit test asserting the summary list contains a + `mxgateway.alarms.provider_switches` row equal to the snapshot's `AlarmProviderSwitchCount` + after calling `metrics.AlarmProviderSwitched(...)`. +3. Verify: `dotnet build src/ZB.MOM.WW.MxGateway.Server` (macOS) and + `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter FullyQualifiedName~DashboardSnapshot`. +4. Commit. + +**Rollback:** revert the one-line list addition. + +--- + +## D2 — Reproduce and fix the `AmbiguousMatchException` on `GET /` + +**Classification:** standard · **Est:** ~30–45 min (repro + fix) · **Where:** Server (net10) +**Why deferred:** observed only in *old* (06/05, Development) logs: +`AmbiguousMatchException: The request matched multiple endpoints. Matches: / (/) / (/)`. +Unauthenticated `GET /` safely 302s to `/login`; the ambiguity would only throw for an +*authenticated* request that reaches routing. Pre-existing and unrelated to the alarm work, +but it would surface as a dashboard-home 500. Confirm it still reproduces before fixing. + +**Root-cause candidates (the two endpoints both matching `/`):** +- `DashboardHome.razor` (`@page "/"`) mapped by `MapRazorComponents()` + (`Dashboard/DashboardEndpointRouteBuilderExtensions.cs:92`). +- `MapStaticAssets(...)` (`GatewayApplication.cs:190`) — in .NET 8–10 the static-assets + endpoint can register a fingerprinted root that collides with the home page. + +**Step 0 — Reproduce (do this first; if it does NOT repro, downgrade to a documentation note):** +1. Log into the dashboard as `multi-role`/`password` (Administrator) at + `http://10.100.0.48:5130/login`. +2. Navigate to `http://10.100.0.48:5130/` (the "Dashboard" home). +3. If it renders the Overview page → not a live bug on this build; record that and stop + (the dev error-page throw was a Development-only artifact). If it 500s with + `AmbiguousMatchException` → proceed to fix. + +**Fix (only if it reproduces):** +- Files: `src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs` (endpoint mapping order/region), + `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardEndpointRouteBuilderExtensions.cs`, + `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/DashboardHome.razor`. +- Candidate fixes, in order of preference: + 1. Inspect the actual two matched endpoint display names at runtime (the exception lists + them) — temporarily enable detailed errors or read the stderr `Matches:` block — to learn + exactly which two collide. + 2. If `MapStaticAssets` owns the second `/`: it must run, but confirm it isn't double-mapping + the home route; if it is, move `MapStaticAssets` ordering or scope it so it doesn't claim + the literal `/` content path. (Do NOT remove `MapStaticAssets` — Blazor needs it.) + 3. If two component pages both declare `@page "/"`: give the non-home one a distinct route. + 4. Last resort: give `DashboardHome` an explicit route (e.g. `@page "/overview"`) and add a + trivial `MapGet("/", () => Results.Redirect("/overview"))` with `.RequireAuthorization(...)`. +- Add a test that exercises the root route once resolved (a WebApplicationFactory-based + integration test asserting authenticated `GET /` returns 200, if the dashboard test harness + supports auth; otherwise a routing-uniqueness assertion). +- Verify: `dotnet build src/ZB.MOM.WW.MxGateway.Server`; re-run the authenticated `GET /` repro. + +**Rollback:** revert the routing change; the home page returns to its prior (ambiguous) state. + +**Note:** this fix ships only via a Server redeploy (D-deploy applies, see D5/windev redeploy +procedure in `project_deploy_mechanics`). Decide whether to bundle it with the next redeploy. + +--- + +## D3 — Validate the failover path end-to-end on the dev rig (ForceSubtag) + +**Classification:** standard (validation, no production code) · **Est:** ~30–45 min · **Where:** dev rig / windev +**Why deferred:** `provider_mode 1` (healthy alarmmgr) is verified in production, but the actual +alarmmgr→subtag **failover**, the **degraded badge**, the synthesized subtag alarms, and the +`provider_switches{from,to,reason}` counter (with the new bounded `failover`/`failback`/`unknown` +tag) have never fired live — a healthy system never switches, and the rig's alarms can't be made +to fail COM on demand. Use the explicit `ForceSubtag` config mode to exercise the degraded path +deterministically without an alarmmgr fault. + +**Approach:** run a *temporary* gateway instance (NOT the production service) in `ForceSubtag` +mode against the dev Galaxy, drive it with the operator/IDE alarm toggle, and confirm the +degraded surface. Do this on windev in a throwaway run so production stays on `Auto`. + +**Steps:** +1. On windev, from a build worktree at `main`, run the Server locally (or a second instance on + alt ports) with `MxGateway:Alarms:Fallback:Mode=ForceSubtag` (env + `MxGateway__Alarms__Fallback__Mode=ForceSubtag`), pointed at the dev Galaxy + (`\\DESKTOP-6JL3KKO\Galaxy!DEV`). Use distinct Kestrel ports to avoid clashing with the + production service on 5120/5130. +2. Subscribe an alarm client (or open the dashboard alarms page for that instance) and confirm: + - The provider badge shows **"Subtag monitoring (degraded)"** (amber `bg-warning`). + - `curl .../metrics` shows `mxgateway_alarms_provider_mode 2`. + - Active alarms appear with `degraded=true` and a synthetic (MD5-derived) GUID, with the + reference shape `Galaxy!..` (e.g. `Galaxy!TestArea.TestMachine_001.TestAlarm001`). +3. Drive a transition: have the operator/IDE toggle a `TestMachine_NNN.TestAlarmNNN` true→false + (external MXAccess writes are ignored — see `project_rig_alarms_object_driven`); confirm a + synthesized Raise then Clear, and an `AckMsg` write via AcknowledgeByName returns 0. +4. (Optional, to exercise the switch counter) run in `Auto` and induce a primary fault if a safe + way exists; otherwise document that the counter is unit-tested only and `ForceSubtag` covers + the degraded surface. Note: `ForceSubtag` may not increment `provider_switches` (no runtime + switch) — that counter's live exercise remains the one gap; record it explicitly rather than + claiming coverage. +5. Tear down the temporary instance. Production service is untouched (stays `Auto`). +6. Record results in `project_alarm_subtag_fallback` memory (degraded badge + synthesized + subtag alarms now live-validated; switch-counter still unit-test-only if not exercised). + +**Rollback:** none — temporary instance only; nothing in production changes. + +--- + +## D4 — Prune stale deploy backups on windev + +**Classification:** trivial (ops) · **Est:** ~10 min · **Where:** windev (10.100.0.48), no code +**Why deferred:** backups accumulate on every deploy; harmless but cluttering +`C:\publish\mxaccessgw\`. + +**Current backups observed (2026-06-14):** +- `Server.bak-20260526T143341`, `Server.bak-20260529T090320`, `Server.bak529.20260604T122616`, + `Server.bak-theme030-20260605-053056`, `Server.bak-theme031-20260605-083410`, + `Server.bak-20260614-prefallback` (today's), `Worker.bak-20260614-prefallback` (today's). +- Inside `Server\`: `ZB.MOM.WW.MxGateway.Server.dll.bak-20260604-loginfix`, + `appsettings.json.bak-20260604-{glauth35,ldapkeys}`. + +**Steps:** +1. **Keep** `Server.bak-20260614-prefallback` and `Worker.bak-20260614-prefallback` — the + immediate rollback for the current deploy. Keep until the new build has soaked (e.g. a few + days / one business cycle). +2. Delete the clearly-superseded older backups: the `Server.bak-2026052*`, `Server.bak529.*`, + and `Server.bak-theme03*` dirs. Confirm each is a Server dir (not something live) before + `rmdir /s /q`. +3. Optionally remove the in-`Server` `.bak-*` sidecar files (`*.dll.bak-*`, + `appsettings.json.bak-*`) — but FIRST confirm the live `appsettings.json` is correct (it is, + post-deploy) so the `.bak` ldap/glauth copies aren't the only record of a needed value. +4. Verify the service is unaffected: `Get-Service MxAccessGw` Running, `/health/ready` 200. + +**Rollback:** none needed (deletions only; the current-deploy backup is retained). If unsure +about any single dir, skip it. + +--- + +## D5 — Redeploy the wonder host to bring it to parity + +**Classification:** high-risk (separate production-ish box, fiddly access, divergent build) · **Est:** ~1–2 h · **Where:** wonder-app-vd03.zmr.zimmer.com (10.220.157.247) +**Why deferred:** wonder still runs the pre-feature build — it lacks the entire alarm-fallback +feature and the cleanup. Bring it to `5976770` if parity is wanted. This box is materially +different from windev (read `project_wonder_deployment` fully before starting). + +**Key differences from windev (do NOT reuse the windev recipe blindly):** +- Access: no usable `ssh host "cmd"` exec. Port 22 OpenSSH (PowerShell) was observed CLOSED + 2026-05-26; the reliable path is **servecli on port 2222** (cmd.exe PTY + SFTP only) using the + base64-`-EncodedCommand` pattern in `project_wonder_deployment` / `~/Desktop/servecli/instructions.md`. + Re-test port 22 first; if up, prefer it. +- Build shape: wonder runs a **self-contained single-file win-x64** Server (~118 MB), renamed to + `MxGateway.Server.exe` for the NSSM path `E:\ApiInstall\MxGateway\Server\MxGateway.Server.exe`. + So publish must be `-r win-x64 --self-contained -p:PublishSingleFile=true` (NOT the + framework-dependent windev recipe), then rename the exe to `MxGateway.Server.exe`. +- Static assets quirk: ship BOTH `ZB.MOM.WW.MxGateway.Server.staticwebassets.endpoints.json` and + `MxGateway.Server.staticwebassets.endpoints.json` next to the exe (the ZB-named one is used). +- Config: wonder's `appsettings.json` carries HTTP-switched Kestrel (`http://0.0.0.0:5130`), + `Dashboard.GroupToRole` = `{ SCADA-Admins/Designers/Deploy-All → Admin }`, + `Dashboard.RequireHttpsCookie:false`, LDAP base `dc=scadalink,dc=local`. **Preserve wonder's + appsettings.json** exactly like windev (do not ship the repo default). Note: wonder's + GroupToRole values are `"Admin"` — VERIFY the current build's validator accepts `"Admin"` (the + windev crash showed it requires `'Administrator'`/`'Viewer'`). **If the validator rejects + `"Admin"`, wonder's appsettings GroupToRole values must be updated to `Administrator` BEFORE/with + the deploy or the new build will crash-loop.** This is the single biggest risk — resolve it first. +- Worker: wonder's x86 worker is the original; the alarm-fallback adds new subscribe fields. To + use subtag fallback on wonder, the **x86 worker must also be redeployed** (publish x86, rename + if needed). If only the Server is updated, confirm worker IPC protobuf compat (contracts changed + for alarms — a stale worker may not understand the extended SubscribeAlarms). Safer to deploy + both Server and Worker together. + +**Steps (high level — expand against the runbook before executing):** +1. Confirm parity is actually wanted on wonder, and whether the dashboard (disabled there) and + the alarm monitor are even in scope for that box. If alarms/dashboard are off on wonder, this + may reduce to "Server binary parity only." +2. Resolve the **GroupToRole value compatibility** question (Admin vs Administrator) — inspect + the current build's `GatewayOptionsValidator` and decide whether to patch wonder's appsettings. +3. Build self-contained Server (`-r win-x64 --self-contained -p:PublishSingleFile=true`) and x86 + Worker from `main` (`5976770`). +4. Transfer via SFTP (servecli) to a staging dir on `E:\ApiInstall\MxGateway\`. +5. Stop `MxAccessGw`; back up `Server` → `Server.bak.` and `Worker` → `Worker.bak.`; + swap in the new build; **restore wonder's `appsettings.json`** (+ patched GroupToRole if + needed); ensure both staticwebassets manifests are present; rename Server exe to + `MxGateway.Server.exe`. +6. Start `MxAccessGw` (no dependents). Verify stderr line count does not grow (no crash loop), + `curl -k https://...:5130/health/live` (or HTTP per current config) Healthy, gRPC 5120 up. +7. Verify the new metric is present if the metrics endpoint is exposed there. + +**Rollback (documented in memory):** stop service, +`Move-Item Server Server.failed; Move-Item Server.bak. Server`, restore appsettings from the +backup, start. + +--- + +## Suggested order + +1. **D2 repro** (5 min — just log in and hit `/`): decides whether D2 is a real fix or a no-op. +2. **D1** (small code) + **D2 fix if needed** — bundle into one Server branch; they ship together + on the next Server redeploy. +3. **D4** (prune windev backups) — quick ops, independent. +4. **D3** (ForceSubtag validation) — exercises the degraded surface; do before/after the D1+D2 + redeploy so you also confirm the new Server build is healthy. +5. **D5** (wonder) — largest and riskiest; do last, only if parity is wanted, and resolve the + GroupToRole-value compatibility question first. + +## Execution note + +This plan is intentionally NOT executed. When ready, execute on a branch off `main` +(`feat/deferred-followups` or per-item branches) — do not commit to `main` directly. D1/D2 need a +Server redeploy to take effect; D4/D5 are host operations; D3 touches nothing permanent.